Indexing PDF files
The library on the corner, we used to go to, wants to expand its collection and become available for the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that they can share it with online users. With all the samples provided by the supplier comes a problem—how to extract data for the search box from more than 900,000 PDF files. Solr can do it with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.
How to do it...
To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:
First, let's edit our Solr instance,
solrconfig.xml
, and add the following configuration:<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler"> <lst name="defaults"> <str name="fmap.content">text</str> <str name="lowernames">true</str...