Getting text data from a document
We need to add a few more features to our class definition so that we can extract meaningful, aggregated blocks of text: a set of layout rules, and a text aggregator that applies those rules to the raw page to build the blocks.
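To make the idea of "layout rules plus an aggregator" concrete, here is a toy sketch that does not use pdfminer at all. It shows how a pair of numeric rules can turn a bag of positioned characters into lines of text. The names Glyph, LayoutRules, aggregate, line_tolerance, and word_gap are all hypothetical, invented for this illustration; pdfminer's real analysis is considerably more involved and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One character with the left edge, right edge, and baseline of its box."""
    text: str
    x0: float
    x1: float
    y: float


@dataclass
class LayoutRules:
    """Hypothetical layout rules (illustrative only, not pdfminer's parameters)."""
    line_tolerance: float = 2.0  # largest baseline difference within one line
    word_gap: float = 3.0        # horizontal gap that is treated as a space


def aggregate(glyphs: List[Glyph], rules: LayoutRules) -> List[str]:
    """Group raw glyphs into lines of text, top of the page first."""
    lines: List[List[Glyph]] = []
    # PDF coordinates grow upward, so a larger y is higher on the page.
    for glyph in sorted(glyphs, key=lambda g: (-g.y, g.x0)):
        if lines and abs(lines[-1][0].y - glyph.y) <= rules.line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x0)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # A wide enough gap between neighbouring boxes implies a space.
            if cur.x0 - prev.x1 > rules.word_gap:
                text += " "
            text += cur.text
        result.append(text)
    return result


# Glyphs arrive in no particular order, much like raw page content.
sample = [
    Glyph("k", 5, 10, 686),
    Glyph("y", 13, 18, 700),
    Glyph("H", 0, 6, 700),
    Glyph("o", 0, 5, 686),
    Glyph("o", 18, 23, 700),
    Glyph("i", 6, 8, 700),
]

print(aggregate(sample, LayoutRules()))  # ['Hi yo', 'ok']
```

Changing the rules changes the result: with word_gap=10, the same glyphs produce 'Hiyo' on the first line, because the gap between "i" and "y" no longer counts as a space. Tuning layout parameters in a real aggregator has the same kind of effect on the blocks of text we get back.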
We'll override the init_device() method to create a more sophisticated device. Here's the next subclass, built on the foundation of the Miner_Page and Miner classes:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams

class Miner_Layout(Miner_Page):
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
    def init_device(self, resource_manager, **params):
        """Return a PDFPageAggregator as a device."""
        self.layout_params = LAParams(**params)
        return PDFPageAggregator(resource_manager, laparams=self.layout_params)
    def page_iter(self):
        """Yields a LTPage...