You're reading from Data Wrangling with R: Load, explore, transform and visualize data for modeling with tidyverse libraries

Product type Paperback

Published in Feb 2023

Publisher Packt

ISBN-13 9781803235400

Length 384 pages

Edition 1st Edition

Tools

Power BI

Concepts

Data Mining

Author (1):

Gustavo Santos

Table of Contents (21) Chapters

Preface

1. Part 1: Load and Explore Data

2. Chapter 1: Fundamentals of Data Wrangling

3. Chapter 2: Loading and Exploring Datasets

4. Chapter 3: Basic Data Visualization

5. Part 2: Data Wrangling

6. Chapter 4: Working with Strings

7. Chapter 5: Working with Numbers

8. Chapter 6: Working with Date and Time Objects

9. Chapter 7: Transformations with Base R

10. Chapter 8: Transformations with Tidyverse Libraries

11. Chapter 9: Exploratory Data Analysis

12. Part 3: Data Visualization

13. Chapter 10: Introduction to ggplot2

14. Chapter 11: Enhanced Visualizations with ggplot2

15. Chapter 12: Other Data Visualization Options

16. Part 4: Modeling

17. Chapter 13: Building a Model with R

18. Chapter 14: Build an Application with Shiny in R

19. Conclusion

20. Other Books You May Enjoy

Summary

In this chapter, we learned a little about the history of data wrangling and became familiar with its definition. Data wrangling, also called data munging, is any task performed to transform or enhance data and make it ready for analysis and modeling.

We also discussed why it is important to wrangle data before modeling it. A model is a simplified representation of reality, and an algorithm is like a student that needs to understand that reality to give us the best answer about the subject matter. If we teach this student with bad data, we cannot expect to receive a good answer. A model is only as good as its input data.

Later in the chapter, we reviewed the benefits of data wrangling, showing that it can improve the quality of our data, leading to faster results and better outcomes.

In the final sections, we reviewed the basic steps of data wrangling and learned more about three of the most commonly used frameworks for Data Science: KDD, SEMMA, and CRISP-DM. I recommend reading more about them to gain a holistic view of the life cycle of a Data Science project.

Note that all three frameworks emphasize selecting a representative dataset or subset of data. A nice example is given by Aurélien Géron (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition (2019), pp. 32-33). Suppose you want to build an app that takes pictures of flowers and then recognizes and classifies them. You could go to the internet and download thousands of pictures; however, they will probably not be representative of the kind of pictures that your model will receive from the app users. As a result, the model could underperform. This example illustrates the garbage in, garbage out idea: if you don’t explore and understand your data thoroughly, you won’t know whether it is good enough for modeling.

The frameworks can lead the way, like a map, to explore, understand, and wrangle the data and make it ready for modeling, reducing the risk of a frustrating outcome.

In the next chapter, let’s get our hands on R and start coding.
