Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data
This chapter continues our work on importing data from a variety of sources and on the initial checks we should run after importing it. Over the last 25 years, data analysts have increasingly needed to work with data in non-tabular, semi-structured forms. Sometimes they even create and persist data in those forms. In this chapter, we will work with JSON, a common alternative to traditional tabular datasets; the general concepts extend to XML and to NoSQL data stores such as MongoDB. We will also go over common issues that occur when scraping data from websites.
Data analysts have also found that the volume of data to be analyzed has grown even faster than machine processing power, at least for the computing resources that are available locally. Working with big data sometimes requires us to rely on technology like Apache Spark, which...