Analyzing PDF reports in a folder with the tm package
Text analytics is a way to perform quantitative analysis on qualitative information stored in text. In this recipe, we will create a corpus of documents from PDF files and run descriptive analytics on it to find the most frequent terms.
This is a particularly useful recipe for professionals who work with PDF reports.
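To give an idea of what "looking for the most frequent terms" means in practice, here is a minimal sketch using the tm package. It assumes tm is installed, and it uses three short made-up sentences (the `docs` vector is a hypothetical stand-in) in place of text extracted from PDF files, so it illustrates the general idea rather than the exact steps of this recipe.

```r
library(tm)

# Hypothetical stand-in texts; in the recipe the text comes from PDF files.
docs <- c(
  "The report covers revenue and costs.",
  "Revenue grew while costs stayed flat.",
  "Costs and revenue are reviewed in every report."
)

# Build an in-memory corpus, one document per element of the vector.
corpus <- VCorpus(VectorSource(docs))

# Basic cleaning: lowercase, strip punctuation, drop English stop words.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count how often each term occurs across all documents.
tdm <- TermDocumentMatrix(corpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Show the most frequent terms.
head(term_freq, 3)
```

The term-document matrix has one row per term and one column per document, so summing across rows gives the total count of each term in the whole corpus.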
We will explore the full text of the Divine Comedy, the medieval Italian masterpiece by Dante Alighieri. You can find out more on Wikipedia at https://en.wikipedia.org/wiki/Divine_Comedy.
Dante Alighieri is shown holding a copy of the Divine Comedy, next to the entrance to Hell, the seven terraces of Mount Purgatory and the city of Florence, with the spheres of Heaven above, in Michelino's fresco.
Getting ready
In this recipe, we will use the pdftotext utility to read text from PDF files.
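Before going further, it can help to check whether pdftotext is already available on your machine. The following is a small sketch using base R's `Sys.which()`, which looks up an executable on the system PATH. The variable name `pdftotext_path` is just an illustrative choice, and the check assumes the utility is installed under the name `pdftotext`.

```r
# Look up the pdftotext executable on the system PATH.
# Sys.which() returns an empty string when the program is not found.
pdftotext_path <- Sys.which("pdftotext")

if (nzchar(pdftotext_path)) {
  message("pdftotext found at: ", pdftotext_path)
} else {
  message("pdftotext not found on the PATH; install it before continuing.")
}
```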
You can download pdftotext...