Data Wrangling with R

You're reading from Data Wrangling with R: Load, explore, transform and visualize data for modeling with tidyverse libraries

Product type Paperback
Published in Feb 2023
Publisher Packt
ISBN-13 9781803235400
Length 384 pages
Edition 1st Edition
Author: Gustavo Santos

Why data wrangling?

Now you know what data wrangling means, and I am sure you share my view that this is a tremendously important subject; otherwise, I don’t think you would be reading this book.

In statistics and data science, there is a frequently repeated phrase: garbage in, garbage out. This popular saying captures why wrangling data matters: our analysis, or even our model, will only be as good as the data that we present to it. You could also use the weakest-link-in-the-chain analogy: if your data is weak, the rest of the analysis could easily be broken by questions and arguments.

Let me give you a simple but precise example to illustrate my point. If we receive a dataset like the one in Figure 1.2, everything looks right at first glance. There are city names and temperatures, laid out in a format commonly used to present data. However, for data science, this data may not be ready for use just yet.

Figure 1.2 – Temperatures for cities

Notice that all the month columns refer to the same variable, which is Temperature. With a dataset presented as in Figure 1.2, we would have trouble plotting even simple graphics in R, as well as using the dataset for modeling.

In this case, a simple transformation of the table from wide to long format would be enough to complete the data-wrangling task.
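
As a minimal sketch of that wide-to-long step, assuming the tidyr package is installed, pivot_longer() can do the reshaping. The city names and temperature values below are invented for illustration and are not the ones shown in Figure 1.2; only the column layout (City plus one column per month) follows the text.

```r
library(tidyr)

# Hypothetical wide table shaped like Figure 1.2 (all values are invented).
temps_wide <- data.frame(
  City = c("City A", "City B"),
  Jan = c(2, 25),
  Feb = c(4, 26),
  Mar = c(9, 24),
  Apr = c(14, 22),
  May = c(19, 19),
  Jun = c(23, 17)
)

# Wide to long: each month column becomes a (Month, Temperature) pair,
# so every row holds one observation for one city in one month.
temps_long <- pivot_longer(
  temps_wide,
  cols = Jan:Jun,
  names_to = "Month",
  values_to = "Temperature"
)

names(temps_long)  # City, Month, Temperature
```

The result has three columns, City, Month, and Temperature, with one row per city and month.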

Figure 1.3 – Dataset ready for use

At first glance, Figure 1.2 might appear to be the better-looking option. And, in fact, it is for human eyes. The presentation of the dataset in Figure 1.2 makes it much easier for us to compare values and draw conclusions. However, we must not forget that we are dealing with computers, and machines don’t process data the same way humans do. To a computer, Figure 1.2 has seven variables: City, Jan, Feb, Mar, Apr, May, and Jun, while Figure 1.3 has only three: City, Month, and Temperature.

Now comes the fun part; let’s compare how a computer would receive both sets of data. A command to plot the temperature timeline by city for Figure 1.2 would be as follows: Computer, take a city and the temperatures during the months of Jan, Feb, Mar, Apr, May, and Jun in that city. Then consider each of the names of the months as a point on the x axis and the temperature associated as a point on the y axis. Plot a line for the temperature throughout the months for each of the cities.

Figure 1.3 is much clearer to the computer. It does not need to separate anything. The dataset is ready, so look at how the command would be given: Computer, for each city, plot the month on the x axis and the temperature on the y axis.
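
To make that command concrete, here is one possible way to express it with ggplot2, assuming the package is installed. The data frame is a hypothetical stand-in for Figure 1.3 with invented values, and converting Month to a factor is an extra step added here so that the months appear in calendar order rather than alphabetically.

```r
library(ggplot2)

# Hypothetical long table shaped like Figure 1.3 (all values are invented).
temps_long <- data.frame(
  City = rep(c("City A", "City B"), each = 6),
  Month = rep(month.abb[1:6], times = 2),
  Temperature = c(2, 4, 9, 14, 19, 23, 25, 26, 24, 22, 19, 17)
)

# Keep calendar order on the x axis; character months would sort alphabetically.
temps_long$Month <- factor(temps_long$Month, levels = month.abb[1:6])

# "For each city, plot the month on the x axis and the temperature on the y axis."
p <- ggplot(
  temps_long,
  aes(x = Month, y = Temperature, group = City, color = City)
) +
  geom_line()

print(p)
```

The whole instruction fits in a single aes() mapping because each variable lives in its own column.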

Much simpler, don’t you agree? That is the importance of data wrangling for data science.

Benefits

Performing good data wrangling will improve the overall quality of the entire analysis process. Here are the benefits:

  • Structured data: Your data will be organized and easily understandable by other data scientists.
  • Faster results: If the data is already in a usable state, creating plots or using it as input to an algorithm will certainly be faster.
  • Better data flow: To be able to use the data for modeling or for a dashboard, it needs to be properly formatted and cleaned. Good data wrangling enables the data to flow on to the next steps of the process, making data pipelines and automation possible.
  • Aggregation: As we saw in the example in the previous section, the data must be in a suitable format for the computer to understand. Well-wrangled datasets can be aggregated quickly for insight extraction.
  • Data quality: Data wrangling is about transforming the data to a ready state. During this process, you will clean, aggregate, filter, and sort it accordingly, visualize the data, assess its quality, deal with outliers, and identify faulty or incomplete data.
  • Data enriching: During wrangling, you might be able to enrich the data by creating new variables out of the original ones or joining other datasets to make your data more complete.
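
Two of these benefits, aggregation and data enriching, can be sketched in a few lines with dplyr, assuming the package is installed. Everything in the example is hypothetical: the temperature values are invented, they are assumed to be in degrees Celsius, and city_info is a made-up lookup table used only to show a join.

```r
library(dplyr)

# Hypothetical long table shaped like Figure 1.3 (all values are invented).
temps_long <- data.frame(
  City = rep(c("City A", "City B"), each = 6),
  Month = rep(month.abb[1:6], times = 2),
  Temperature = c(2, 4, 9, 14, 19, 23, 25, 26, 24, 22, 19, 17)
)

# Hypothetical second dataset to join against.
city_info <- data.frame(
  City = c("City A", "City B"),
  Hemisphere = c("North", "South")
)

# Aggregation: mean temperature per city.
city_means <- temps_long %>%
  group_by(City) %>%
  summarise(MeanTemp = mean(Temperature))

# Enriching: derive a new variable (assuming Celsius input) and
# add columns from another dataset.
temps_enriched <- temps_long %>%
  mutate(TempF = Temperature * 9 / 5 + 32) %>%
  left_join(city_info, by = "City")
```

With the data in long format, each step is a single verb applied to a whole column, which is what makes the aggregation and enrichment quick.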

Every project, whether related to data science or not, can benefit from data wrangling. As we just listed, it brings many benefits to the analysis and ultimately improves the quality of the deliverables. But to get the best from it, there are steps to follow.
