Loading, Tidying, and Cleaning Data in the tidyverse
Cleaning data is a crucial step in the data science process. It involves identifying and correcting errors, inconsistencies, and missing values, and formatting and structuring the data so that it is easy to work with. Clean data can then be used effectively for analysis, modeling, and visualization. The R tidyverse is a collection of packages designed for data science and includes tools for data manipulation, visualization, and modeling. The dplyr and tidyr packages are two of the most widely used tidyverse packages for data cleaning. dplyr provides a set of functions for efficiently manipulating large datasets, such as filtering, grouping, and summarizing data. tidyr is specifically designed for tidying (restructuring) data: it provides functions for reshaping data, such as gathering and spreading columns, and helps give the data a consistent structure, which in turn makes analysis and visualization easier. Together, these packages provide powerful tools for cleaning and manipulating data, which helps make R a popular choice among data scientists.
In this chapter, we will look at tools and techniques for preparing data with the tidyverse packages. You will learn how to deal with different formats and quickly interconvert them, merge different datasets, and summarize them. You will also learn how to bring data into your work from outside sources that do not come as convenient files.
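To make the division of labor between dplyr and tidyr concrete, here is a minimal sketch rather than a definitive workflow. The table `wide` and its columns (`sample`, `t1`, `t2`) are made up for illustration. The reshaping step uses `pivot_longer()`, the current tidyr counterpart of the gathering operation mentioned above, and the `%>%` pipe is the one re-exported by dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per sample, one column per time point
wide <- tibble(
  sample = c("a", "b", "c"),
  t1 = c(1.2, 3.4, 2.2),
  t2 = c(1.9, NA, 2.8)
)

# tidyr: reshape so that each row holds a single observation
long <- wide %>%
  pivot_longer(cols = c(t1, t2), names_to = "time", values_to = "value")

# dplyr: drop missing values, then group and summarize
summary_tbl <- long %>%
  filter(!is.na(value)) %>%
  group_by(time) %>%
  summarise(mean_value = mean(value), n = n())

print(summary_tbl)
```

The reshaping and the summarizing are kept as separate steps on purpose: once the table is in a tidy, one-observation-per-row form, the same dplyr verbs apply regardless of how many time-point columns the original table had.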
In this chapter, we will cover the following recipes:
- Loading data from files with readr
- Tidying a wide format table into a tidy table with tidyr
- Tidying a long format table into a tidy table with tidyr
- Combining tables using join functions
- Reformatting and extracting existing data into new columns using stringr
- Computing new data columns from existing ones and applying arbitrary functions using mutate()
- Using dplyr to summarize data in large tables
- Using datapasta to create R objects from cut-and-paste data