Scaling Apache Solr

You're reading from Scaling Apache Solr: Optimize your searches using high-performance enterprise search repositories with Apache Solr

Product type Paperback
Published in Jul 2014
ISBN-13 9781783981748
Length 298 pages
Edition 1st Edition
Author (1):
Hrishikesh Vijay Karambelkar

Table of Contents (13 chapters)

Preface
1. Understanding Apache Solr
2. Getting Started with Apache Solr
3. Analyzing Data with Apache Solr
4. Designing Enterprise Search
5. Integrating Apache Solr
6. Distributed Search Using Apache Solr
7. Scaling Solr through Sharding, Fault Tolerance, and Integration
8. Scaling Solr through High Performance
9. Solr and Cloud Computing
10. Scaling Solr Capabilities with Big Data
A. Sample Configuration for Apache Solr
Index

Working with rich documents

In the last section, we saw that Apache Solr has built-in handlers for the CSV, JSON, and XML formats. In an organization's content management system, however, data often resides in documents of other formats, such as PDF, DOC, PPT, and XLS. The biggest challenge with these types is that they are all semi-structured. Apache Solr can nevertheless handle many of these formats directly and extract information from them, thanks to Apache Tika. Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself.

Note

The framework for extracting content from different data sources in Apache Solr is also called Solr CEL (Content Extraction Library), solr-cell, or, more commonly, Solr Cell.
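
To make the idea of sending a rich document to Solr Cell more concrete, here is a minimal sketch in Python. It assumes a Solr core reachable at a URL such as http://localhost:8983/solr/collection1 with the extracting request handler registered at /update/extract; the core URL, the document ID, and the helper names build_extract_url and index_rich_document are illustrative assumptions, not taken from the book.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_core_url, doc_id, commit=True):
    """Build the request URL for the extracting request handler (Solr Cell).

    solr_core_url is an assumed base URL of a Solr core, for example
    "http://localhost:8983/solr/collection1".
    """
    # literal.id supplies the unique key for the document being extracted.
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit right away so the document becomes searchable.
        params["commit"] = "true"
    return solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_rich_document(solr_core_url, doc_id, path,
                        content_type="application/pdf"):
    """POST the raw bytes of a file (PDF by default) to Solr for extraction."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_core_url, doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": content_type},
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, a call such as index_rich_document("http://localhost:8983/solr/collection1", "doc1", "report.pdf") would hand the file to Tika through Solr and index the extracted text; the file name here is hypothetical.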

Understanding Apache Tika

Apache Tika is a SAX-based parser for extracting metadata from different types of documents. Apache Tika uses the...
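
As a rough illustration of what "SAX-based" means, the following Python sketch parses a small XHTML snippet with an event-driven handler and collects metadata and text along the way. This is only an analogy built on Python's standard xml.sax module, not Tika's own API; the sample markup and the names TextCollector and extract are assumptions made for this example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects <meta> name/content pairs and element text from SAX events."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.chunks = []
        self._buffer = []

    def startElement(self, name, attrs):
        # A <meta name="..." content="..."/> element carries one metadata pair.
        if name == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")

    def characters(self, content):
        # The parser may deliver text in several pieces, so buffer it.
        self._buffer.append(content)

    def endElement(self, name):
        # Flush buffered text when an element closes.
        text = "".join(self._buffer).strip()
        if text:
            self.chunks.append(text)
        self._buffer = []


# Hypothetical XHTML, similar in spirit to what a document parser might emit.
SAMPLE = (b'<html><head><meta name="author" content="Jane Doe"/>'
          b'<title>Report</title></head>'
          b'<body><p>Quarterly results</p></body></html>')


def extract(xhtml_bytes):
    """Return (metadata dict, extracted text) for a well-formed XHTML document."""
    handler = TextCollector()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler.metadata, " ".join(handler.chunks)
```

Calling extract(SAMPLE) returns the metadata dictionary and the joined text without ever building the whole document tree in memory, which is the main appeal of the event-driven style for large files.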
