It takes a lot of time and effort to deliver data in a format that is ready for its end use. Let's use an example of an online gaming site that wants to post the high score for each of its games every month. In order to make this data available, the site's developers would need to set up a database to store all of the scores. In addition, they would need a system to retrieve the top scores from that database every month and display them to the end users.
For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.
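To make the gaming site example concrete, here is a minimal sketch of such a system using Python's built-in `sqlite3` module. The table layout, column names, sample rows, and the `monthly_high_scores` function are all assumptions made up for illustration; a real site would have its own schema and storage.

```python
import sqlite3

# Hypothetical schema: one row per recorded score, kept in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (game TEXT, player TEXT, score INTEGER, played_on TEXT)"
)

# Made-up sample rows; dates are ISO strings (YYYY-MM-DD).
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [
        ("asteroids", "ana", 4200, "2024-05-03"),
        ("asteroids", "ben", 5100, "2024-05-17"),
        ("asteroids", "cam", 9900, "2024-04-28"),  # previous month, ignored below
        ("tetris", "ana", 880, "2024-05-09"),
        ("tetris", "dee", 1210, "2024-05-21"),
    ],
)


def monthly_high_scores(conn, month):
    """Return {game: highest score} for a month given as 'YYYY-MM'."""
    rows = conn.execute(
        "SELECT game, MAX(score) FROM scores "
        "WHERE substr(played_on, 1, 7) = ? "
        "GROUP BY game ORDER BY game",
        (month,),
    ).fetchall()
    return dict(rows)


high_scores = monthly_high_scores(conn, "2024-05")
print(high_scores)
```

Because the question being asked (the top score per game per month) is fixed in advance, it can be answered with a single prepared query. That is what makes this use case general enough to build a dedicated system around.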
Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes with the time of day.
A short side note on terminology:
Data science as an all-encompassing term can be a bit elusive. Because the field is so new, the definition of a data scientist can change depending on who you ask. To be more general, this book uses the term data programmer to refer to anyone who will find data wrangling useful in their work.
Drawing insight from data requires that all of the necessary information is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons why raw data falls short:
- There may be extra steps involved in getting the data
- The information needed may be spread across multiple sources
- Datasets may be too large to work with in their original format
- There may be far more fields or information in a particular dataset than needed
- Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
- Datasets may be structured or formatted in a way that is not compatible with a particular application
As a result, it is often the responsibility of the data programmer to perform the following functions:
- Discover and gather the data that is needed (getting data)
- Merge data from different sources if necessary (merging data)
- Fix flaws in the data entries (cleaning data)
- Extract the necessary data and put it in the proper structure (shaping data)
- Store it in the proper format for further use (storing data)
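The five functions listed above can be sketched as a tiny pipeline. This is only an illustration using Python's standard library: the CSV contents, field names, and function names are invented for the example, and the inline strings stand in for files or downloads that a real project would read.

```python
import csv
import io
import json

# Hypothetical raw data from two sources (stand-ins for files or web downloads).
SCORES_CSV = """player_id,score,hour
1,4200,9
2,,14
3,5100,21
3,5100,21
"""
PLAYERS_CSV = """player_id,name
1,Ana
2,Ben
3,Cam
"""


def get_data(text):
    """Getting data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_data(scores, players):
    """Merging data: attach each player's name to their score rows."""
    names = {p["player_id"]: p["name"] for p in players}
    return [{**row, "name": names.get(row["player_id"])} for row in scores]


def clean_data(rows):
    """Cleaning data: drop rows with a missing score and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if row["score"] and key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


def shape_data(rows):
    """Shaping data: keep only the needed fields and convert types."""
    return [
        {"name": r["name"], "score": int(r["score"]), "hour": int(r["hour"])}
        for r in rows
    ]


def store_data(rows):
    """Storing data: serialize to JSON (a string here; a file in practice)."""
    return json.dumps(rows, indent=2)


raw_scores = get_data(SCORES_CSV)
raw_players = get_data(PLAYERS_CSV)
result = shape_data(clean_data(merge_data(raw_scores, raw_players)))
stored = store_data(result)
print(stored)
```

Each function here is deliberately trivial. In practice, any one of these steps can grow into a substantial task of its own.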
This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:
- Getting data
- Cleaning data
- Merging and shaping data
- Storing data
Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling.