Understanding data cleaning and data wrangling
If you analyze data or train models on datasets, a large chunk of your time will be spent on data cleaning. Data cleaning is the act of resolving inconsistencies and impurities in your data so that it can be reliably analyzed or used to train models. This is critical because most data you work with will require some degree of cleaning before it is usable.
Where unclean data comes from
Data sources are rarely the way you wish them to be. Data commonly has artifacts such as missing values for required fields, typos, or inconsistencies.
For example, a data source may have a Country field that could have many different values that all refer to the same country. To help illustrate this, Figure 4.1 contains a set of football players all born in the United States, but each one has a different value for their country_of_birth:
Figure 4.1: Different US values – note the United...
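One common way to resolve this kind of inconsistency is to map every known variant onto a single canonical value. The following is a minimal sketch in Python using only the standard library. The raw values, the alias table, and the normalize_country helper are all hypothetical and chosen for illustration; they are not the values shown in Figure 4.1.

```python
# Hypothetical raw values for a country_of_birth field.
# These are illustrative only and are not taken from Figure 4.1.
RAW_VALUES = ["USA", "U.S.", "United States", " united states of america ", "US"]

# Assumed alias table: normalized spellings that should all mean the same country.
US_ALIASES = {"usa", "us", "united states", "united states of america"}


def normalize_country(value: str) -> str:
    """Return a canonical name for known aliases, otherwise the trimmed input."""
    trimmed = value.strip()
    # Drop periods, ignore case, and collapse repeated whitespace before the lookup.
    key = " ".join(trimmed.replace(".", "").lower().split())
    if key in US_ALIASES:
        return "United States"
    # Unknown values pass through unchanged so they can be reviewed later.
    return trimmed


cleaned = [normalize_country(v) for v in RAW_VALUES]
print(cleaned)
```

Values that do not match an alias are deliberately left alone rather than guessed at, so a later review step can decide how to handle them. In a real dataset the alias table would have to be built by inspecting the distinct values that actually occur in the column.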