Extracting text from a Word document
Word documents hold a large amount of text. In this recipe, we will illustrate how to obtain this text using the Apache POI API. We will reuse the Word document created in the Extracting text from a PDF document recipe.
Getting ready
To prepare for this recipe, we need to do the following:
- Create a new Maven project.
- Add the following dependency to the project's POM file:
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>4.0.1</version>
</dependency>
- Use the Word document created in the Extracting text from a PDF document recipe. Save it in the root directory of this project...
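Before moving on, it can help to confirm that the document really is where the code will look for it. The following is a minimal sketch that uses only the Java standard library and is not part of the original recipe. The class name WordDocumentCheck, the method looksLikeDocx, and the file name TestDocument.docx are hypothetical. It assumes the document was saved in the .docx format, which is a ZIP archive in which Word normally stores the main document part at word/document.xml, and that the program runs with the project root as its working directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipFile;

public class WordDocumentCheck {

    // Location where Word normally stores the main part of a .docx package.
    private static final String MAIN_PART = "word/document.xml";

    /**
     * Returns true if the path points to a readable ZIP archive
     * that contains a word/document.xml entry.
     */
    public static boolean looksLikeDocx(Path path) {
        if (!Files.isRegularFile(path)) {
            return false; // missing file or a directory
        }
        try (ZipFile zip = new ZipFile(path.toFile())) {
            return zip.getEntry(MAIN_PART) != null;
        } catch (IOException e) {
            // Not a ZIP archive (for example, a legacy .doc file) or unreadable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name; pass the real one as the first argument.
        Path document = Paths.get(args.length > 0 ? args[0] : "TestDocument.docx");
        String verdict = looksLikeDocx(document)
                ? "found, looks like a .docx file"
                : "missing or not a .docx file";
        System.out.println(document.toAbsolutePath() + " -> " + verdict);
    }
}
```

Running the class from the project root prints the absolute path that was checked. This makes it easy to spot a working directory that differs from the one expected.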