Handling duplicate, missing, or invalid data
So far, we have discussed changes to the way the data is represented that had no ramifications for its content. However, we haven't yet covered a very important part of data cleaning: how to deal with data that appears to be duplicated, invalid, or missing. This is separated from the rest of the data cleaning discussion because it is an example where we will do some initial data cleaning, then reshape the data, and finally look to handle these potential issues; it is also a rather hefty topic.
We will be working in the 5-handling_data_issues.ipynb notebook and using the dirty_data.csv file. Let's start by importing pandas and reading in the data:
>>> import pandas as pd
>>> df = pd.read_csv('data/dirty_data.csv')
The dirty_data.csv file contains wide-format data from the weather API that has been altered to introduce many common data issues that we will encounter in the wild. It contains the following...
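Before handling any of these issues, a quick first step is simply to measure them. The sketch below uses a small toy DataFrame as a stand-in for dirty_data.csv (its actual contents aren't reproduced here), and counts missing values per column and fully duplicated rows with pandas' `isna()` and `duplicated()` methods:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dirty_data.csv with a repeated row, a missing
# temperature, and a suspicious extreme value (hypothetical data)
df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03'],
    'temp': [21.0, 21.0, np.nan, -40.0],
})

# Missing values per column
print(df.isna().sum())

# Number of rows that are exact duplicates of an earlier row
print(df.duplicated().sum())
```

Here `duplicated()` flags the second copy of the repeated row (the first occurrence is kept by default), and `isna().sum()` reveals the single missing temperature; we will see how to deal with both kinds of issues shortly.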