What is data wrangling?
Data wrangling is the process of modifying, cleaning, organizing, and transforming data from one given state to another, with the objective of making it more appropriate for use in analytics and data science.
This concept is also referred to as data munging, and both terms describe the act of changing, manipulating, transforming, and enriching your dataset.
I bet you’ve already performed data wrangling. It is a common task for all of us. Since our primary school years, we have been taught how to create a table and make counts to organize people’s opinions in a dataset. If you are familiar with MS Excel or similar tools, remember all the times you have sorted, filtered, or added columns to a table, not to mention all of those lookups that you may have performed. All of that is part of the data-wrangling process. Every task performed to somehow improve the data and make it more suitable for analysis can be considered data wrangling.
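The spreadsheet operations mentioned above (sorting, filtering, adding columns, and lookups) can be sketched in a few lines of plain Python. This is only a minimal illustration: the rows, the column names (`name`, `city_code`, `score`), the thresholds, and the lookup table are all made up for the example.

```python
# Hypothetical survey data: one row per respondent (invented for illustration).
rows = [
    {"name": "Cara", "city_code": "NYC", "score": 7},
    {"name": "Abel", "city_code": "SFO", "score": 9},
    {"name": "Bea", "city_code": "NYC", "score": 4},
]

# Lookup table, similar in spirit to the range a spreadsheet lookup would search.
city_names = {"NYC": "New York", "SFO": "San Francisco"}

# Sort: order the rows alphabetically by name.
sorted_rows = sorted(rows, key=lambda r: r["name"])

# Filter: keep only rows with a score of 5 or more (arbitrary threshold).
filtered_rows = [r for r in sorted_rows if r["score"] >= 5]

# Add columns: a looked-up city name and a derived flag.
wrangled = [
    {
        **r,
        "city": city_names.get(r["city_code"], "unknown"),
        "high_score": r["score"] >= 8,
    }
    for r in filtered_rows
]

for r in wrangled:
    print(r)
```

Each step takes a table and returns a slightly more useful one, which is the same thing that happens when you sort, filter, or add a column in a spreadsheet.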
As a data scientist, you will constantly be handed different kinds of data, with the mission of turning each dataset into insights that will, consequently, form the basis for business decisions. Unlike a few years ago, when the majority of data arrived in a structured form such as text or tables, data can now come in many other forms, including unstructured formats such as video, audio, or even a combination of those. It follows that, most of the time, data will not arrive ready to use and will require some effort to get it into a usable state, sometimes more than others.
Figure 1.1 – Data before and after wrangling
Figure 1.1 is a visual representation of data wrangling. On the left-hand side, we see three kinds of data points mixed together; after sorting and tabulating, the data is much easier to analyze.
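The idea behind the figure, sorting mixed data points and tabulating them, can be sketched as follows. The shape names and their counts are assumptions chosen for illustration; they are not taken from the figure itself.

```python
from collections import Counter

# Hypothetical unordered mix of three kinds of data points.
points = ["square", "circle", "triangle", "circle", "square", "circle"]

# Tabulate: count how many times each kind occurs.
counts = Counter(points)

# Sort: order the table alphabetically by kind.
table = sorted(counts.items())

# Print one aligned row per kind.
for kind, n in table:
    print(f"{kind:<10}{n}")
```

The raw list and the final table carry the same information, but the table makes the distribution visible at a glance.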
A wrangled dataset is easier to understand and to work with, which paves the way for better analysis and modeling, as we shall see in the next section, where we will learn why data wrangling is important to a data science project.