Summary
In this chapter, we learned a little about the history of data wrangling and became familiar with its definition. Data wrangling, also called data munging, covers every task performed to transform or enhance data and make it ready for analysis and modeling.
We also discussed why it is important to wrangle data before modeling it. A model is a simplified representation of reality, and an algorithm is like a student who needs to understand that reality to give us the best answer about the subject matter. If we teach this student with bad data, we cannot expect to receive a good answer. A model is only as good as its input data.
Later in the chapter, we reviewed the benefits of data wrangling and saw how improving the quality of our data can lead to faster results and better outcomes.
In the final sections, we reviewed the basic steps of data wrangling and learned more about three of the most commonly used frameworks for Data Science: KDD, SEMMA, and CRISP-DM. I recommend reading more about them to gain a holistic view of the life cycle of a Data Science project.
It is important to note that all three frameworks emphasize selecting a representative dataset or subset of data. A nice example is given by Aurélien Géron (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition (2019): 32-33). Suppose you want to build an app that takes pictures of flowers and then recognizes and classifies them. You could go to the internet and download thousands of pictures; however, they will probably not be representative of the kind of pictures that your model will receive from the app users. As a result, the model could underperform. This example illustrates the garbage in, garbage out idea: if you don’t explore and understand your data thoroughly, you won’t know whether it is good enough for modeling.
The frameworks can lead the way, like a map, as we explore, understand, and wrangle the data to make it ready for modeling, reducing the risk of a frustrating outcome.
In the next chapter, let’s get our hands on R and start coding.