Extracting metadata from PDF documents
Document metadata is information stored within a file that describes the file itself. It can include the software used to create the document, the name of the author or organization, and the date and time the file was created or modified.
Each application stores metadata differently, and the amount of metadata stored in a document almost always depends on the software used to create it.
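To make the idea of metadata stored inside the file concrete, the following minimal sketch scans raw bytes for entries of the kind found in a PDF document information dictionary (keys such as /Author, /Producer, and /CreationDate). It uses only the standard library and assumes uncompressed literal string values without nested parentheses, so it is an illustration rather than a real PDF parser. The `scan_info_entries` function and the sample bytes are hypothetical and invented for this example.

```python
import re

# Matches simple "/Key (literal string)" entries such as "/Author (Alice)".
# Assumption: values are plain literal strings with no nested parentheses,
# escapes, hex strings, or compression. Real PDFs often violate this.
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Creator|Producer|CreationDate|ModDate)\s*\(([^()]*)\)"
)


def scan_info_entries(data: bytes) -> dict:
    """Return the Info-style key/value pairs found in raw PDF bytes."""
    found = {}
    for key, value in INFO_ENTRY.findall(data):
        found[key.decode("ascii")] = value.decode("latin-1")
    return found


# Hypothetical fragment of a PDF file, invented for illustration.
sample = (
    b"1 0 obj\n"
    b"<< /Author (Alice) /Producer (ExampleWriter 1.0) "
    b"/CreationDate (D:20200101120000Z) >>\n"
    b"endobj\n"
)

print(scan_info_entries(sample))
```

Because producers differ in which keys they write and how they encode them, a dedicated library is the more reliable choice for real documents.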
In this section, we will review how to extract metadata from PDF documents with the PyPDF2 module. The module can be installed directly with pip, since it is published on the Python Package Index (PyPI):
$ pip3 install PyPDF2
The latest version of this module is listed at https://pypi.org/project/PyPDF2. Once it is installed, we can list the names that the module exports:
>>> import PyPDF2
>>> dir(PyPDF2)
['PageRange', 'PdfFileMerger', 'PdfFileReader...
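As a minimal sketch of how the `PdfFileReader` class from the listing above could be used to read metadata, the example below opens a PDF and returns its document information and page count. It assumes the legacy PyPDF2 API (`PdfFileReader`, `getDocumentInfo()`, `getNumPages()`); newer releases rename these to `PdfReader`, `.metadata`, and `len(reader.pages)`, so the calls may need adjusting for the installed version. The function names `normalize_info` and `extract_pdf_metadata` are hypothetical helpers introduced for illustration.

```python
def normalize_info(info):
    """Turn a mapping such as {'/Author': 'Alice'} into {'Author': 'Alice'}."""
    if not info:
        # PDFs without an Info dictionary yield None or an empty mapping.
        return {}
    # Indexing with info[key] lets PyPDF2 resolve indirect objects if present.
    return {str(key).lstrip("/"): str(info[key]) for key in info}


def extract_pdf_metadata(path):
    """Return (metadata dict, page count) for the PDF at the given path."""
    # Imported here so normalize_info stays usable without PyPDF2 installed.
    # Assumption: a legacy PyPDF2 release that still provides PdfFileReader.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        # Normalize while the file is still open, since values are read lazily.
        metadata = normalize_info(reader.getDocumentInfo())
        pages = reader.getNumPages()
    return metadata, pages


# Example call (the file name is a placeholder):
# metadata, pages = extract_pdf_metadata("example.pdf")
# for key, value in metadata.items():
#     print(f"{key}: {value}")
```

Stripping the leading slash from each key is only a presentation choice; the underlying entries keep the names used in the PDF itself, such as /Author and /Producer.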