Parsing and processing Word and PDF documents
Microsoft Office documents are everywhere, especially Word and Excel documents. PDF documents are also widely used to share reports and information. In fact, certain fields, such as finance and public service, are drowning in PDF documents.
Reading text from Word documents
Let's first look at reading text from Word documents. We will assume the role of a data scientist working with a non-profit organization that is trying to reduce gun violence in schools. We have a few Microsoft Word documents from the US Department of Education's Gun-Free Schools Act reports (these are stored as .docx files in the GitHub repository for the book under https://github.com/PacktPublishing/Practical-Data-Science-with-Python/tree/main/Chapter6/data/gfsr_docs/docx). As our first step, we want to extract the text from the Word files and look at the most common words and word pairs.
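To make the goal concrete, here is a minimal sketch of that first step using only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. This is not necessarily the approach taken in the rest of the chapter; the function names read_docx_text and count_words_and_pairs are hypothetical helpers introduced only for this illustration, and the simple regular-expression tokenizer is an assumption rather than a recommended way to split text into words.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# XML namespace used by WordprocessingML elements inside a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads word/document.xml directly, so headers,
    footers, and footnotes (stored in other parts of the archive) are ignored.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def count_words_and_pairs(text):
    """Count single words and adjacent word pairs in a string.

    Assumes a crude tokenizer: lowercase runs of letters and apostrophes.
    Pairs are formed across paragraph boundaries as well.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pairs = zip(words, words[1:])
    return Counter(words), Counter(pairs)
```

With one of the report files downloaded locally (the path is whatever you saved it as), calling read_docx_text on it and passing the result to count_words_and_pairs returns two Counter objects, and their most_common method lists the most frequent words and word pairs. Common words such as "the" and "of" will dominate the counts unless they are filtered out first.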
There aren't a whole...