Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas
This chapter continues our work on importing data from a variety of sources and on the initial checks we should run on the data once it is imported. Over the last 25 years, data analysts have increasingly needed to work with data in non-tabular, semi-structured forms, and sometimes they create and persist data in those forms themselves. In this chapter we work with JSON, a common alternative to traditional tabular datasets, but the general concepts extend to XML and to NoSQL data stores such as MongoDB. We also go over common issues that occur when scraping data from websites.
In this chapter, we will work through the following recipes:
- Importing simple JSON data
- Importing more complicated JSON data from an API
- Importing data from web pages
- Persisting JSON data
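Before turning to the recipes, it may help to see what "non-tabular, semi-structured" means in practice. The following is a minimal sketch, not taken from any of the recipes: the records, field names, and values are made up for illustration. It assumes only that pandas is installed and uses pd.json_normalize to flatten nested JSON objects into columns.

```python
import json
import pandas as pd

# Hypothetical semi-structured records (invented for illustration):
# each record holds a nested object ("name") and a list ("tags").
raw = """
[
  {"id": 1, "name": {"first": "Ana", "last": "Lopez"}, "tags": ["a", "b"]},
  {"id": 2, "name": {"first": "Ben", "last": "Okafor"}, "tags": []}
]
"""
records = json.loads(raw)

# json_normalize flattens nested objects into dotted column names
# (name.first, name.last). List-valued fields such as "tags" are not
# expanded; each cell keeps the whole Python list.
df = pd.json_normalize(records)

print(sorted(df.columns))
print(df.shape)
```

Note that the list-valued field stays packed inside a single column. Deciding how to handle values like that is one of the data cleaning issues this kind of data tends to raise.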