Reading Word documents
Word documents (.docx) are another common kind of document that stores mainly text. They are typically generated with Microsoft Office, but other tools also produce compatible files. They are probably the most common format for sharing files that need to remain editable, and they are also widely used for distributing finished documents.
We'll see in this recipe how to extract text information from a Word document.
Getting ready
We'll use the python-docx module to read and process Word documents:
$ echo "python-docx==0.8.10" >> requirements.txt
$ pip install -r requirements.txt
We have prepared a test file called document-1.docx, available in the Chapter04/documents directory on GitHub, which we'll use in this recipe. Note that this document follows the same Lorem Ipsum pattern described for the test document in the Reading PDF files recipe.
How to do it...
- Import python-docx:
>>> import docx