Emphasizing the importance of exploratory data analysis (EDA)
Data quality problems cost US businesses more than $3 trillion a year (https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year). In the previous chapter, we examined the capabilities of Delta, such as ACID transactions and schema evolution, which help ensure a high degree of data integrity as data is being processed. But what about the characteristics and condition of the raw data itself? If it is riddled with holes and gaps, then using it to build a model will produce suboptimal, if not inaccurate, insights. Understanding the quality and reliability of the working datasets is an important step and should not be skipped.
EDA refers to the process of statistically analyzing source data to understand its structure, content, and interrelationships, and to identify the true potential it holds for data projects. This is where profiling the data becomes important, as it produces critical insights...
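As a starting point, a profiling pass can be done directly in PySpark. The minimal sketch below assumes a Delta table at a hypothetical path (`/tmp/delta/raw_events`); substitute your own dataset. It surveys the shape of the data, prints summary statistics for numeric columns, and counts nulls per column to surface the kinds of holes and gaps described above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("eda-profiling").getOrCreate()

# Hypothetical Delta table path; replace with your own source data.
df = spark.read.format("delta").load("/tmp/delta/raw_events")

# High-level shape of the dataset.
print(f"rows={df.count()}, columns={len(df.columns)}")

# Summary statistics (count, mean, stddev, min, max) for numeric columns.
df.describe().show()

# Nulls per column: count() ignores nulls, so counting a value emitted
# only when the column is null yields the null count for that column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts.show()
```

A quick pass like this will not replace a full profiling tool, but it is often enough to decide whether a dataset is complete and consistent enough to take into modeling, or whether remediation is needed first.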