Chapter 4. Data Cleaning
Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.
We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is of high quality. Data quality involves:
- Validity: Ensuring that the data possesses the correct form or structure
- Accuracy: The values within the data are truly representative of the dataset
- Completeness: There are no missing elements
- Consistency: Changes to data are in sync
- Uniformity: The same units of measurement are used
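Some of these dimensions can be checked mechanically. The following is a minimal Python sketch, not a definitive implementation, that tests a small set of records for validity, completeness, and duplicate entries. The records, the field names (id, email, height_cm), the helper function names, and the simplified email pattern are all assumptions made up for illustration.

```python
import re

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "email": "ann@example.com", "height_cm": 170.0},
    {"id": 2, "email": "not-an-email", "height_cm": 165.0},
    {"id": 3, "email": "bob@example.com", "height_cm": None},
    {"id": 1, "email": "ann@example.com", "height_cm": 170.0},  # exact duplicate
]

# A deliberately simple pattern (an assumption); real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def invalid_emails(rows):
    """Validity: ids of rows whose email does not have the expected form."""
    return [r["id"] for r in rows if not EMAIL_PATTERN.match(r["email"])]


def incomplete_rows(rows):
    """Completeness: ids of rows with at least one missing (None) value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def duplicate_rows(rows):
    """Duplicates: ids of rows that exactly repeat an earlier row."""
    seen = set()
    dupes = []
    for r in rows:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key in seen:
            dupes.append(r["id"])
        else:
            seen.add(key)
    return dupes


print("Invalid emails:", invalid_emails(records))
print("Incomplete rows:", incomplete_rows(records))
print("Duplicate rows:", duplicate_rows(records))
```

Each check only reports the offending rows. Whether to repair, drop, or flag them depends on the analysis the data is meant to support.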
There are several techniques and tools used to...