Extracting metadata with PyMuPDF
Another way to extract text from PDF documents is the PyMuPDF module (https://github.com/pymupdf/PyMuPDF), which is available in the PyPI repository. You can install it with the following command:
$ pip install PyMuPDF
Viewing document information and extracting text from a PDF document works much as it does with PyPDF2. The module to import is called fitz. The document object it returns provides a load_page() method for loading a specific page, and the resulting page object provides a get_text() method for extracting that page's text. The following script obtains the text for a specific page number. You can find the following code in the extractTextFromPDF_fitz.py file in the pymupdf folder:
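import fitz  # PyMuPDF is imported under the name fitz
pdf_document = "pdf/XMPSpecificationPart3.pdf"
# Open the PDF and report how many pages it contains
doc = fitz.open(pdf_document)
print("number of pages: %i" % doc.page_count)
# load_page() expects a zero-based page number
page_number = input("Enter page number:")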
page = doc.load_page(int...