Working with rich documents
In the last section, we saw that Apache Solr has built-in handlers for the CSV, JSON, and XML formats. In an organization's content management system, however, data often resides in documents of other formats, such as PDF, DOC, PPT, and XLS. The biggest challenge with these types is that they are all semi-structured. Apache Solr can handle many of these formats directly and extract information from them, thanks to Apache Tika. Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself.
Note
The framework for extracting content from different data sources in Apache Solr is also called Solr CEL (Content Extraction Library), solr-cell, or, more commonly, Solr Cell.
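To make the idea concrete, here is a minimal sketch of how a client might hand a PDF to Solr Cell for extraction and indexing. It assumes a Solr instance at http://localhost:8983/solr, a core named mycore with the extracting request handler enabled at /update/extract, and a unique key field named id; none of these come from the text above, and the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(doc_bytes, doc_id,
                          base_url="http://localhost:8983/solr",  # assumed host
                          core="mycore"):                          # assumed core name
    """Build (but do not send) a POST request for Solr's extracting handler."""
    params = urlencode({
        "literal.id": doc_id,  # supplies the unique key for the indexed document
        "commit": "true",      # commit right away so the document is searchable
    })
    url = f"{base_url}/{core}/update/extract?{params}"
    return Request(
        url,
        data=doc_bytes,  # raw file content; Tika parses it on the server side
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def post_document(path, doc_id):
    """Read a file from disk, send it to Solr, and return the HTTP status."""
    with open(path, "rb") as fh:
        request = build_extract_request(fh.read(), doc_id)
    with urlopen(request) as response:
        return response.status
```

Calling post_document("sample.pdf", "doc1") against a running Solr would upload the file, and a status of 200 would indicate that Solr accepted it. The request is built separately from the network call so that the URL and parameters can be inspected without a server.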
Understanding Apache Tika
Apache Tika is a SAX-based parsing toolkit for extracting metadata and text from different types of documents. Apache Tika uses the...