Data Wrangling Documents and Spreadsheets
Now that we have some basic Python and data skills under our belts, let's look at how to work with two common types of data you will see in the wild: documents and spreadsheets. Many organizations use Microsoft Office, and their Word and Excel files hold huge amounts of data. There are also loads of PDF documents containing valuable information. If our data lies in a pile of Word, Excel, and PDF files, then extracting it becomes a necessary part of doing data science. Once we have loaded data from these files, it's also useful to have a few basic analysis techniques at the ready. We'll learn data extraction techniques, as well as basic analysis techniques for the text from documents and the data from Excel spreadsheets. Specifically, we'll learn the Python tools and techniques for:
- Loading Word and PDF documents using the python-docx and PyPDF2...