Summary
This chapter was rather long, but that makes sense: as we've covered a few times, data scientists can spend anywhere between 25% and 75% (sometimes upwards of 90%) of their time cleaning and preparing data. The pandas package, which is built on top of NumPy, is the main package for loading and cleaning data in Python, so it's important to have a basic grasp of how to use it for data preparation and cleaning. We've seen the core of pandas from beginning to end:
- Loading data
- Examining data with EDA
- Cleaning and preparing data for further analysis
- Saving data to disk
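As a quick recap, the four steps above can be sketched in a few lines. This is a minimal illustration and not the chapter's own example: the sample data, column names, and the choice to fill missing values with the median are all assumptions made for the sketch, and an in-memory buffer stands in for real files on disk.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
raw_csv = io.StringIO(
    "name,age,city\n"
    "Ann,34,Denver\n"
    "Bob,,Austin\n"
    "Ann,34,Denver\n"
)

# 1. Load the data.
df = pd.read_csv(raw_csv)

# 2. Examine it with some basic EDA.
df.info()
print(df.describe())

# 3. Clean and prepare: drop duplicate rows, then fill missing ages
#    with the median (one of several reasonable strategies).
clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Save the result (to an in-memory buffer here; pass a file path
#    to to_csv to write to disk instead).
out = io.StringIO()
clean.to_csv(out, index=False)
```

With a real dataset, you would pass file paths to `pd.read_csv` and `to_csv`, and the cleaning step would depend on what the EDA reveals.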
We also took a look at NumPy, but keep in mind that most of the NumPy functionality we need day to day can be used directly from pandas. It's usually only when you need more advanced math that you have to turn to NumPy itself.
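To make the relationship concrete, here is a small sketch with made-up values. It shows common statistics called as pandas methods, a NumPy function applied directly to a pandas Series, and a conversion to a NumPy array for a linear algebra routine that pandas does not provide as a method.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([1.0, 4.0, 9.0])

# Common statistics are available as pandas methods.
print(s.mean(), s.std())

# NumPy functions accept a Series directly and return a Series.
roots = np.sqrt(s)

# For more advanced math, convert to a NumPy array first.
arr = s.to_numpy()
norm = np.linalg.norm(arr)  # Euclidean length of the vector
```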
In our next chapter, we'll take our EDA and visualization skills to a whole new level.