Assessing dataset integrity
Dataset integrity refers to the quality and consistency of the data within a dataset: the assurance that the data is accurate, complete, reliable, and free from errors or inconsistencies. Understanding your data’s integrity tells you how usable the data is and how much cleansing or transformation it will need. A dataset with poor integrity can lead to incorrect analysis, inaccurate reports, and misinformed business decisions. In this section, we will discuss techniques and considerations for assessing dataset integrity in BigQuery.
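As a rough illustration of what a first integrity check can look like, the sketch below counts missing values and repeated identifiers in a small table. It is a minimal sketch under stated assumptions, not a prescribed method: the orders table, its columns, and the sample rows are hypothetical, and Python's built-in sqlite3 module stands in for BigQuery so the example runs locally. The aggregate functions used (COUNT(*), COUNT(column), and COUNT(DISTINCT column)) are standard SQL, so a similar query should also work against a BigQuery table once the table reference is adjusted.

```python
import sqlite3

# Hypothetical sample table. sqlite3 is only a local stand-in for BigQuery
# so that this sketch can run without a cloud project or credentials.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_email TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "a@example.com", 20.0),
        (2, None, 35.5),             # missing email
        (2, None, 35.5),             # repeated order_id and missing email
        (3, "c@example.com", None),  # missing amount
    ],
)

# COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the number of
# missing values. COUNT(DISTINCT column) exposes repeated identifiers.
PROFILE_SQL = """
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) - COUNT(customer_email) AS missing_emails,
  COUNT(*) - COUNT(amount) AS missing_amounts,
  COUNT(*) - COUNT(DISTINCT order_id) AS repeated_order_ids
FROM orders
"""

total_rows, missing_emails, missing_amounts, repeated_order_ids = conn.execute(
    PROFILE_SQL
).fetchone()

print(f"rows: {total_rows}")
print(f"missing emails: {missing_emails}")
print(f"missing amounts: {missing_amounts}")
print(f"repeated order ids: {repeated_order_ids}")
```

Counts like these do not prove that the data is accurate, but they give a quick sense of how complete and consistent a table is before any deeper analysis.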
The shape of the dataset
Understanding your dataset’s shape helps you form a baseline expectation for the quality of results you will receive from queries. Consider Figure 9.3. Your dataset may be taller than it is wide, meaning it has many rows and few columns. This may present a situation where you want to join...