Chapter 2. Data Cleaning
Clean data is essential to good data analysis. Poor data quality is a primary cause of problems in business intelligence analysis. Data cleaning is the process of transforming raw data into usable data. Cleaning data, checking its quality, and standardizing data types account for the majority of an analytic project's schedule.
Anthony Goldbloom, the CEO of Kaggle, said, "Eighty percent of data science is cleaning data and the other twenty percent is complaining about cleaning data" (personal communication, February 14, 2016).
This chapter covers four key topics using some of the newer packages available within the R environment:
- Summarizing your data for inspection
- Finding and fixing flawed data
- Converting inputs to data types suitable for analysis
- Adapting string variables to a standard
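The four topics above can be previewed in a few lines of base R. This is a minimal sketch under stated assumptions, not the chapter's own method: the data frame `raw`, its columns `region` and `sales`, and the "n/a" placeholder are hypothetical, and base R functions stand in for the newer packages the chapter refers to.

```r
# Hypothetical toy data; column names and values are illustrative only.
raw <- data.frame(
  region = c(" north", "South ", "NORTH", "south"),
  sales  = c("100", "250", "n/a", "75"),
  stringsAsFactors = FALSE
)

# 1. Summarize the data for inspection.
str(raw)
summary(raw)

# 2. Find and fix flawed data: treat the "n/a" placeholder as missing.
raw$sales[raw$sales == "n/a"] <- NA

# 3. Convert inputs to a data type suitable for analysis.
raw$sales <- as.numeric(raw$sales)

# 4. Adapt string variables to a standard: trim whitespace, use lower case.
raw$region <- tolower(trimws(raw$region))

clean <- raw
```

Marking the placeholder as `NA` before the conversion keeps `as.numeric()` from emitting a coercion warning, so any warning that does appear points to a flaw that has not yet been handled.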
Business analysts spend a lot of time cleaning data before moving to the analysis phase. Data cleaning does not have to be a dreaded task. This chapter provides business...