Chapter 3. The Data Science Pipeline
So far, we have explored how to load data into Python and process it into a dataset in the form of a two-dimensional NumPy array of numeric values. We are now ready to immerse ourselves in data science, extracting meaning from data and building potential data products. This chapter and the next one, on machine learning, are the most challenging parts of the entire book.
In this chapter, you will learn how to:
- Briefly explore data and create new features
- Reduce the dimensionality of data
- Spot and treat outliers
- Choose the scoring or loss metrics best suited to your project
- Apply the scientific method and effectively test the performance of your machine learning hypotheses
- Select the best feature set
- Optimize your learning parameters