Extracting metadata with PyPDF2
We will start with PyPDF2, which can be installed with pip using the following command:
$ pip install PyPDF2
This module offers us the ability to extract document information using the PdfFileReader class and the getDocumentInfo() method, which returns a dictionary with the data of the document.
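As a minimal sketch of how that dictionary might be consumed, the snippet below formats a getDocumentInfo()-style mapping into printable lines. The helper names format_document_info and print_pdf_metadata are hypothetical and not part of PyPDF2, and the sketch assumes a PyPDF2 release earlier than 3.0, where the PdfFileReader and getDocumentInfo() names used in this section are available.

```python
def format_document_info(info):
    """Turn a getDocumentInfo()-style mapping into 'Key: value' lines.

    Hypothetical helper for illustration; it accepts any dict-like object.
    """
    lines = []
    for key, value in info.items():
        # PDF info keys carry a leading slash, for example '/Author'
        lines.append(f"{key.lstrip('/')}: {value}")
    return lines


def print_pdf_metadata(pdf_path):
    """Print the document information of the PDF at pdf_path."""
    # Imported here so the formatting helper works without PyPDF2 installed.
    # Assumes PyPDF2 < 3.0, matching the class names used in this section.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, 'rb') as pdf_file:
        info = PdfFileReader(pdf_file).getDocumentInfo()
        # getDocumentInfo() can return None when the PDF has no info dictionary
        for line in format_document_info(info or {}):
            print(line)
```

Calling print_pdf_metadata('pdf/XMPSpecificationPart3.pdf') would then print one line per metadata entry, assuming the file exists at that path.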
We could start by extracting the number of pages using the getNumPages() method from the PdfFileReader class. We could also use the output of the pdfinfo command to obtain this information. You can find the following code in the get_num_pages_pdf.py file in the pypdf2 folder:
from PyPDF2 import PdfFileReader

# Open the PDF in binary mode; the with block closes the file afterwards
with open('pdf/XMPSpecificationPart3.pdf', 'rb') as pdf_file:
    pdf = PdfFileReader(pdf_file)
    print(pdf.getNumPages())
# Alternative: read the page count from the output of the pdfinfo command
from subprocess import check_output

def get_num_pages(pdf_path):
    output = check_output(["pdfinfo", pdf_path]).decode()
    pages_line = [line for line in output.splitlines() if "Pages:" in line]...
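The listing above is cut off before the page count is extracted from the matching line. As a hedged sketch of one way that parsing step could look, the hypothetical helper below (parse_page_count is not from the original file) takes the text printed by pdfinfo and returns the value of its "Pages:" field. It assumes pdfinfo prints one "Key: value" pair per line, and the sample text in the usage note is made up for illustration.

```python
def parse_page_count(pdfinfo_output):
    """Return the page count found in pdfinfo's text output, or None.

    Hypothetical helper; assumes one 'Key:   value' pair per line.
    """
    for line in pdfinfo_output.splitlines():
        if line.startswith("Pages:"):
            # Keep the text after the first colon and drop the padding
            return int(line.split(":", 1)[1].strip())
    # No 'Pages:' line was present in the output
    return None
```

For example, parse_page_count("Title:  Example\nPages:  12\n") returns 12. In practice the argument would be the decoded output of check_output(["pdfinfo", pdf_path]).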