Using Apache Tika for content analysis and extraction
Apache Tika can detect and extract metadata and text from more than a thousand different file types, such as .doc, .docx, .ppt, .pdf, and .xls. This breadth of format support makes it useful for search engines, indexing, content analysis, translation, and similar tasks. It can be downloaded from https://tika.apache.org/download.html. This section explores how Tika can be used to extract text from different formats. We will use only two files, Testdocument.docx and TestDocument.pdf.
Using Tika is very straightforward, as shown in the following code:
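import java.io.File;
import org.apache.tika.Tika;

// detect and parseToString throw checked exceptions (IOException, TikaException),
// so the enclosing method must declare or handle them.
File file = new File("TestDocument.pdf");
Tika tika = new Tika();
String filetype = tika.detect(file);            // MIME type of the file
System.out.println(filetype);
System.out.println(tika.parseToString(file));   // extracted plain text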
Simply create an instance of Tika and use the detect and parseToString methods to get the following output:
application/pdf
Jump to navigation Jump to search Welcome to Wikipedia...
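The first line of that output, application/pdf, is a MIME type. One of the signals a detector can use to arrive at such a type is the file's leading "magic" bytes. The following is a minimal sketch of that idea using only the Java standard library. It is not Tika's implementation: the class MagicByteSniffer and its sniff method are hypothetical names introduced for illustration, and the sketch recognizes only two signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a hypothetical helper, not part of Apache Tika.
class MagicByteSniffer {

    // PDF files begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP archives begin with "PK" followed by bytes 3 and 4; .docx files are ZIP containers.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // Guesses a MIME type from the leading bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            // Telling a .docx from a plain ZIP would require looking inside the archive.
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // True when data is at least as long as prefix and starts with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] zipHead = {'P', 'K', 3, 4, 20, 0};
        byte[] other = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead));  // application/pdf
        System.out.println(sniff(zipHead));  // application/zip
        System.out.println(sniff(other));    // application/octet-stream
    }
}
```

The sketch works on in-memory byte arrays so that it runs without any input files. Applied to a real file, the same method would be given the first few bytes read from disk.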