Indexing PDF and Word documents
We'll create a new schema to hold the metadata of our indexed files. Apache Tika will extract that metadata from each file we pass to it. The schema.xml
configuration we'll use looks like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="tika-example" version="1.5">
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="author" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="attr_*" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\\n])" replacement=""/>
      <tokenizer class...