Managing data with R
One of the challenges faced while working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources. Although we will cover data preparation, data cleaning, and data management in depth by working on real-world machine learning tasks in later chapters, this section highlights the basic functionality for getting data in and out of R.
Saving, loading, and removing R data structures
When you’ve spent a lot of time getting a data frame into the desired form, you shouldn’t need to recreate your work each time you restart your R session.
To save data structures to a file that can be reloaded later or transferred to another system, the save() function can be used to write one or more R data structures to the location specified by the file parameter. R data files have an .RData or .rda extension.
Suppose you had three objects named x, y, and z that you would like to save to a permanent file. These might be vectors, factors, lists, data frames, or any other R object. To save them to a file named mydata.RData, use the following command:
> save(x, y, z, file = "mydata.RData")
The load() command can recreate any data structures that have been saved to an .RData file. To load the mydata.RData file created in the preceding code, simply type:
> load("mydata.RData")
This will recreate the x, y, and z data structures in your R environment.
Be careful what you are loading! All data structures stored in the file you are importing with the load() command will be added to your workspace, even if they overwrite something else you are working on.
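To see why this warning matters, consider a hypothetical session in which the saved file happens to contain an object with the same name as one already in the workspace:

```r
# suppose the workspace already contains an object named x
x <- "important results"

# if mydata.RData also contains an object named x, loading the file
# silently replaces the existing x with the saved version -- there
# is no warning or prompt
load("mydata.RData")

# x now holds whatever was stored in the file,
# not "important results"
```

Because load() gives no indication of what it overwrote, it is worth checking the workspace with ls() before loading files of uncertain provenance.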
Alternatively, the saveRDS() function can be used to save a single R object to a file. Although it is much like the save() function, a key distinction is that the corresponding readRDS() function allows the object to be loaded under a different name from the original object. For this reason, saveRDS() may be safer to use when transferring R objects across projects, because it reduces the risk of accidentally overwriting existing objects in the R environment.
The saveRDS() function is especially helpful for saving machine learning model objects. Because some machine learning algorithms take a long time to train, saving the resulting model to an .rds file can help avoid a lengthy re-training process when a project is resumed. For example, to save a model object named my_model to a file named my_model.rds, use the following syntax:
> saveRDS(my_model, file = "my_model.rds")
To load the model, use the readRDS() function and assign the result to an object name as follows:
> my_model <- readRDS("my_model.rds")
After you’ve been working in an R session for some time, you may have accumulated unused data structures. In RStudio, these objects are visible in the Environment tab of the interface, but it is also possible to access them programmatically using the listing function ls(), which returns a vector of all data structures currently in memory. For example, if you’ve been following along with the code in this chapter, the ls() function returns the following:
> ls()
[1] "blood" "fever" "flu_status" "gender"
[5] "m" "pt_data" "subject_name" "subject1"
[9] "symptoms" "temperature"
R automatically clears all data structures from memory upon quitting the session, but for large objects, you may want to free up the memory sooner. The remove function rm() can be used for this purpose. For example, to eliminate the m and subject1 objects, simply type:
> rm(m, subject1)
The rm() function can also be supplied with a character vector of object names to remove. Combined with the ls() function, this can be used to clear the entire R session:
> rm(list = ls())
Be very careful when executing the preceding code, as you will not be prompted before your objects are removed!
If you need to wrap up your R session in a hurry, the save.image() command will write your entire session to a file simply called .RData. By default, when quitting R or RStudio, you will be asked if you would like to create this file. R will look for this file the next time you start R, and if it exists, your session will be recreated just as you had left it.
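As a minimal sketch of this workflow, the end of one session and the start of the next might look like the following; the load() call is only needed if R does not restore the session automatically:

```r
# write every object in the current workspace to a file named
# .RData in the working directory
save.image()

# ... quit R, then later start a new session ...

# if the session was not restored automatically, the workspace
# image can be loaded manually
load(".RData")
```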
Importing and saving datasets from CSV files
It is common for public datasets to be stored in text files. Text files can be read on virtually any computer or operating system, which makes the format nearly universal. They can also be exported and imported from and to programs such as Microsoft Excel, providing a quick and easy way to work with spreadsheet data.
A tabular (as in “table”) data file is structured in matrix form, such that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the data columns. This is called a header line.
Perhaps the most common tabular text file format is the comma-separated values (CSV) file, which, as the name suggests, uses the comma as a delimiter. CSV files can be imported to and exported from many common applications. A CSV file representing the medical dataset constructed previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:
> pt_data <- read.csv("pt_data.csv")
This will read the CSV file into a data frame named pt_data. If your dataset resides outside the R working directory, the full path to the CSV file (for example, "/path/to/mydata.csv") can be used when calling the read.csv() function.
By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. If a CSV file does not have a header, specify the option header = FALSE as shown in the following command, and R will assign generic feature names by numbering the columns sequentially as V1, V2, and so on:
> pt_data <- read.csv("pt_data.csv", header = FALSE)
As an important historical note, in versions of R prior to 4.0, the read.csv() function automatically converted all character type columns into factors due to a stringsAsFactors parameter that was set to TRUE by default. This feature was occasionally helpful, especially on the smaller and simpler datasets used in the earlier years of R. However, as datasets have become larger and more complex, this feature began to cause more problems than it solved. Now, starting with version 4.0, R sets stringsAsFactors = FALSE by default. If you are certain that every character column in a CSV file is truly a factor, it is possible to convert them all at once using the following syntax:
> pt_data <- read.csv("pt_data.csv", stringsAsFactors = TRUE)
We will set stringsAsFactors = TRUE occasionally throughout the book, when working with datasets in which all character columns are truly factors.
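When only some character columns are truly categorical, an alternative (standard base R, though not shown in the preceding examples) is to leave stringsAsFactors at its default and convert the relevant columns individually with factor(). Using the patient data columns shown earlier:

```r
# read the file with character columns left as-is (the R 4.0+ default)
pt_data <- read.csv("pt_data.csv")

# convert only the columns known to be categorical
pt_data$gender     <- factor(pt_data$gender)
pt_data$blood_type <- factor(pt_data$blood_type)

# subject_name remains a character column, as it should --
# patient names are identifiers, not categories
```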
Getting results data out of R can be almost as important as getting it in! To save a data frame to a CSV file, use the write.csv() function. For a data frame named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)
This will write a CSV file with the name pt_data.csv to the R working folder. The row.names parameter overrides R’s default setting, which is to output row names in the CSV file. Generally, this output is unnecessary and simply inflates the size of the resulting file.
For more sophisticated control over reading in files, note that read.csv() is a special case of the read.table() function, which can read tabular data in many different forms, including other delimited formats such as tab-separated values (TSV) and vertical bar (|) delimited files. For more detailed information on the read.table() family of functions, refer to the R help page using the ?read.table command.
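As a brief illustration, assuming hypothetical files pt_data.tsv and pt_data.txt containing the same patient data in tab-separated and bar-delimited form, the read.table() family handles these other delimiters as follows:

```r
# tab-separated values: read.delim() is a read.table() wrapper
# preconfigured with sep = "\t" and header = TRUE
pt_data_tsv <- read.delim("pt_data.tsv")

# vertical bar delimited: specify the delimiter explicitly
pt_data_bar <- read.table("pt_data.txt", sep = "|", header = TRUE)
```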
Importing common dataset formats using RStudio
For more complex importation scenarios, the RStudio Desktop software offers a simple interface that guides you through the process of writing R code to load the data into your project. Although it has always been relatively easy to load plaintext data formats like CSV, importing other common analytical data formats like Microsoft Excel (.xls and .xlsx), SAS (.sas7bdat and .xpt), SPSS (.sav and .por), and Stata (.dta) was once a tedious and time-consuming process, requiring knowledge of specific tricks and tools across multiple R packages. Now, the functionality is available via the Import Dataset command near the upper right of the RStudio interface, as shown in Figure 2.1:
Figure 2.1: RStudio’s “Import Dataset” feature provides options to load data from a variety of common formats
Depending on the data format selected, you may be prompted to install R packages that are required for the functionality in question. Behind the scenes, these packages will translate the data format so that it can be used in R. You will then be presented with a dialog box allowing you to choose the options for the data import process and see a live preview of how the data will appear in R as these changes are made.
The following screenshot illustrates the process of importing a Microsoft Excel version of the used cars dataset using the readxl package (https://readxl.tidyverse.org), but the process is similar for any of the dataset formats:
Figure 2.2: The data import dialog provides a “Code Preview” that can be copy-and-pasted into your R code file
The Code Preview in the bottom-right of this dialog provides the R code to perform the importation with the specified options. Selecting the Import button will immediately execute the code; however, a better practice is to copy and paste the code into your R source code file, so that you can re-import the dataset in future sessions.
The read_excel() function RStudio uses to load Excel data creates an R object called a “tibble” rather than a data frame. The differences are so subtle that you may not even notice! However, tibbles are an important R innovation enabling new ways to work with data frames. The tibble and its functionality are discussed in Chapter 12, Advanced Data Preparation.
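The code generated by the import dialog typically looks something like the following sketch; the file name usedcars.xlsx is a hypothetical stand-in for wherever the Excel file resides on your system:

```r
# the readxl package can be installed once with
# install.packages("readxl")
library(readxl)

# read the first worksheet of the Excel file into a tibble
usedcars <- read_excel("usedcars.xlsx")

# if preferred, a tibble can be converted to an ordinary data frame
usedcars_df <- as.data.frame(usedcars)
```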
The RStudio interface has made it easier than ever to work with data in a variety of formats, but more advanced functionality exists for working with large datasets. In particular, if you have data residing in database platforms like Microsoft SQL, MySQL, PostgreSQL, and others, it is possible to connect R to such databases to pull the data into R, or even utilize the database hardware itself to perform big data computations prior to bringing the results into R. Chapter 15, Making Use of Big Data, introduces these techniques and provides instructions for connecting to common databases using RStudio.