Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Analysis with R, Second Edition

You're reading from   Data Analysis with R, Second Edition A comprehensive guide to manipulating, analyzing, and visualizing data in R

Arrow left icon
Product type Paperback
Published in Mar 2018
Publisher Packt
ISBN-13 9781788393720
Length 570 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Tony Fischetti Tony Fischetti
Author Profile Icon Tony Fischetti
Tony Fischetti
Arrow right icon
View More author details
Toc

Table of Contents (19) Chapters Close

Preface 1. RefresheR 2. The Shape of Data FREE CHAPTER 3. Describing Relationships 4. Probability 5. Using Data To Reason About The World 6. Testing Hypotheses 7. Bayesian Methods 8. The Bootstrap 9. Predicting Continuous Variables 10. Predicting Categorical Variables 11. Predicting Changes with Time 12. Sources of Data 13. Dealing with Missing Data 14. Dealing with Messy Data 15. Dealing with Large Data 16. Working with Popular R Packages 17. Reproducibility and Best Practices 18. Other Books You May Enjoy

Loading data into R

Thus far, we've only been entering data directly into the interactive R console. For any dataset of non-trivial size, this is, obviously, an intractable solution. Fortunately for us, R has a robust suite of functions to read data directly from external files.

Go ahead and create a file on your hard disk called favorites.txt that looks like this:

flavor,number 
pistachio,6 
mint chocolate chip,7 
vanilla,5
chocolate,10 
strawberry,2 
neopolitan,4 

This data represents the number of students in a class that prefer a particular flavor of soy ice cream. We can read the file into a variable called favs as follows:

 > favs <- read.table("favorites.txt", sep=",", header=TRUE) 

If you get an error that there is no such file or directory, give R the full path name to your dataset or, alternatively, run the following command:

 > favs <- read.table(file.choose(), sep=",", header=TRUE) 

The preceding command brings up an open file dialog to let you navigate to the file that you've just created.

The sep="," argument tells R that each data element in a row is separated by a comma. Other common data formats have values separated by tabs and pipes ("|"). The value of sep should then be "\t" and "|", respectively.

The header=TRUE argument tells R that the first row of the file should be interpreted as the names of the columns. Remember, you can enter ?read.table at the console to learn more about these options.

Reading from files in this comma-separated values format (usually with the .csv file extension) is so common that R has a more specific function just for it. The preceding data import expression can be best written simply as follows:

 > favs <- read.csv("favorites.txt") 

Now, we have all the data in the file held in a variable of the data.frame class. A data frame can be thought of as a rectangular array of data that you might see in a spreadsheet application. In this way, a data frame can also be thought of as a matrix; indeed, we can use matrix-style indexing to extract elements from it. A data frame differs from a matrix, though, in that a data frame may have columns of differing types. For example, whereas a matrix would only allow one of these types, the dataset that we just loaded contains character data in its first column and numeric data in its second column.

Let's check out what we have using the head() command, which will show us the first few lines of a data frame:

  > head(favs) 
                 flavor number 
  1           pistachio      6 
  2 mint chocolate chip      7 
  3             vanilla      5 
  4           chocolate     10 
  5          strawberry      2 
  6          neopolitan      4 
 
  > class(favs) 
  [1] "data.frame" 
  > class(favs$flavor) 
  [1] "factor" 
  > class(favs$number) 
  [1] "numeric" 

I lied, okay! So what?! Technically, flavor is a factor data type, not a character type.

We haven't seen factors yet, but the idea behind them is really simple. Essentially, factors are codings for categorical variables, which are variables that take on one of a finite number of categories--think {"high", "medium", and "low"} or {"control", "experimental"}.

Though factors are extremely useful in statistical modeling in R, the fact that R, by default, automatically interprets a column from the data read from disk as a type factor if it contains characters is something that trips up novices and seasoned useRs alike. Due to this, we will primarily prevent this behavior manually by adding the stringsAsFactors optional keyword argument to the read.* commands:

  > favs <- read.csv("favorites.txt", stringsAsFactors=FALSE) 
  > class(favs$flavor) 
  [1] "character" 

Much better, for now! If you'd like to make this behavior the new default, read the ?options manual page. We can always convert to factors later on if we need to!

If you haven't noticed already, I've snuck a new operator on you--$, the extract operator. This is the most popular way to extract attributes (or columns) from a data frame. You can also use double square brackets ([[ and ]]) to do this.

These are both in addition to the canonical matrix indexing option. The following three statements are thus, in this context, functionally identical:

  > favs$flavor 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"          
  > favs[["flavor"]] 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"          
  > favs[,1] 
  [1] "pistachio"           "mint chocolate chip" "vanilla"             
  [4] "chocolate"           "strawberry"          "neopolitan"   
Notice how R has now printed another number in square brackets--besides [1]--along with our output. This is to show us that chocolate is the fourth element of the vector that was returned from the extraction.

You can use the names() function to get a list of the columns available in a data frame. You can even reassign names using the same:

  > names(favs) 
  [1] "flavor" "number" 
  > names(favs)[1] <- "flav" 
  > names(favs) 
  [1] "flav"   "number" 

Lastly, we can get a compact display of the structure of a data frame using the str() function on it:

  > str(favs) 
  'data.frame': 6 obs. of  2 variables: 
   $ flav  : chr  "pistachio" "mint chocolate chip" "vanilla"   
"chocolate" ...
$ number: num 6 7 5 10 2 4

Actually, you can use this function on any R structure--the property of functions that change their behavior based on the type of input is called polymorphism.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime