Extracting PDF content
In Chapter 1, New Missions – New Tools, we installed PDF Miner 3K to parse PDF files. It's time to see how this tool works. Here's the link to the documentation for this package: http://www.unixuser.org/~euske/python/pdfminer/index.html. This link is not obvious from the PyPI page, or from the BitBucket site that contains the software. An agent who scans the docs/index.html
will see this reference.
In order to see how we use this package, visit http://www.unixuser.org/~euske/python/pdfminer/programming.html. This has an important diagram that shows how the various classes interact to represent the complex internal details of a PDF document. For some helpful insight, visit http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files/.
A PDF document is a sequence of physical pages. Each page has boxes of text (in addition to images and line graphics). Each textbox contains lines of text and each line contains the individual characters...