You're reading from Data Analysis with R

Product type Book

Published in Dec 2015

ISBN-13 9781785288142

Pages 388 pages

Edition 1st Edition

Vectors

Vectors are the most basic data structures in R, and they are ubiquitous indeed. In fact, even the single values that we've been working with thus far were actually vectors of length 1. That's why the interactive R console has been printing [1] along with all of our output.

A vector is essentially an ordered collection of values of the same atomic data type. Vectors can be arbitrarily large (with some limitations), or they can hold just a single value.

The canonical way of building vectors manually is by using the c() function (which stands for combine).

  > our.vect <- c(8, 6, 7, 5, 3, 0, 9)
  > our.vect
  [1] 8 6 7 5 3 0 9

In the preceding example, we created a numeric vector of length 7 (namely, Jenny's telephone number).

Note that if we tried to put character data types into this vector as follows:

  > another.vect <- c("8", 6, 7, "-", 3, "0", 9)
  > another.vect
  [1] "8" "6" "7" "-" "3" "0" "9"

R would convert all the items in the vector (called elements) into character data types to satisfy the condition that all elements of a vector must be of the same type. A similar thing happens when you try to use logical values in a vector with numbers; the logical values would be converted into 1 and 0 (for TRUE and FALSE, respectively). These logicals will turn into "TRUE" and "FALSE" (note the quotation marks) when used in a vector that contains characters.
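The coercion rules described above can be checked directly with class(). The following is a small sketch; the variable names num.mix and chr.mix are ours and are used only for illustration.

```r
# logical + numeric: the logicals become 1 and 0
num.mix <- c(TRUE, FALSE, 3)
num.mix          # 1 0 3
class(num.mix)   # "numeric"

# logical + character: the logicals become the strings "TRUE" and "FALSE"
chr.mix <- c(TRUE, "a", FALSE)
chr.mix          # "TRUE" "a" "FALSE"
class(chr.mix)   # "character"
```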

Subsetting

It is very common to want to extract one or more elements from a vector. For this, we use a technique called indexing or subsetting. After the vector, we put an integer in square brackets ([]), which are called the subscript operator. This instructs R to return the element at that index. The indices (plural of index, in case you were wondering!) for vectors in R start at 1 and stop at the length of the vector.

  > our.vect[1]                  # to get the first value
  [1] 8
  > # the function length() returns the length of a vector
  > length(our.vect)
  [1] 7
  > our.vect[length(our.vect)]   # get the last element of a vector
  [1] 9

Note that in the preceding code, we used a function in the subscript operator. In cases like these, R evaluates the expression in the subscript operator, and uses the number it returns as the index to extract.

If we get greedy and try to extract an element at an index that doesn't exist, R will respond with NA, meaning not available. We will see this special value crop up from time to time throughout this text.

  > our.vect[10]
  [1] NA

One of the most powerful ideas in R is that you can use vectors to subset other vectors:

  > # extract the first, third, fifth, and
  > # seventh element from our vector
  > our.vect[c(1, 3, 5, 7)]
  [1] 8 7 3 9

The ability to use vectors to index other vectors may not seem like much now, but its usefulness will become clear soon.
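As a small taste of that usefulness, an index vector does not have to be in order, and it may repeat positions. The sketch below uses a fresh copy of the digits under the hypothetical name digits so that it can be run on its own.

```r
digits <- c(8, 6, 7, 5, 3, 0, 9)

# indices may be repeated and may appear in any order
picked <- digits[c(7, 1, 1)]
picked      # 9 8 8

# indexing with a descending sequence reverses the vector
reversed <- digits[length(digits):1]
reversed    # 9 0 3 5 7 6 8
```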

Another way to create vectors is by using sequences.

  > other.vector <- 1:10
  > other.vector
   [1]  1  2  3  4  5  6  7  8  9 10
  > another.vector <- seq(50, 30, by=-2)
  > another.vector
   [1] 50 48 46 44 42 40 38 36 34 32 30

Above, the 1:10 expression creates a vector of the integers from 1 to 10. 10:1 would have created the same 10-element vector, but in reverse. The seq() function is more general in that it allows sequences to be made using steps (among many other things).
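As one example of those other things, seq() can be asked for a fixed number of evenly spaced values through its length.out argument, and the related base function seq_along() builds the index sequence for an existing vector. This is only a brief sketch; the names even.steps and letters.vect are ours.

```r
# five evenly spaced values from 0 to 1
even.steps <- seq(0, 1, length.out = 5)
even.steps            # 0.00 0.25 0.50 0.75 1.00

# seq_along() returns 1, 2, ..., length(x)
letters.vect <- c("a", "b", "c")
idx <- seq_along(letters.vect)
idx                   # 1 2 3
```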

Combining our knowledge of sequences with the idea of using vectors to subset other vectors, we can get the first five digits of Jenny's number as follows:

  > our.vect[1:5]
  [1] 8 6 7 5 3

Vectorized functions

Part of what makes R so powerful is that many of R's functions take vectors as arguments. These vectorized functions are usually extremely fast and efficient. We've already seen one such function, length(), but there are many, many others.

  > # takes the mean of a vector
  > mean(our.vect)
  [1] 5.428571
  > sd(our.vect)    # standard deviation
  [1] 3.101459
  > min(our.vect)
  [1] 0
  > max(1:10)
  [1] 10
  > sum(c(1, 2, 3))
  [1] 6

In practical settings, such as when reading data from files, it is common to have NA values in vectors:

  > messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)
  > messy.vector
  [1]  8  6 NA  7  5 NA  3  0  9
  > length(messy.vector)
  [1] 9

Some vectorized functions do not skip NA values by default; if the input contains even one NA, they return NA. In these cases, an extra keyword argument (na.rm=TRUE) must be supplied along with the first argument to the function.

  > mean(messy.vector)
  [1] NA
  > mean(messy.vector, na.rm=TRUE)
  [1] 5.428571
  > sum(messy.vector, na.rm=FALSE)
  [1] NA
  > sum(messy.vector, na.rm=TRUE)
  [1] 38
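The same na.rm argument is accepted by several other base summary functions, such as max() and sd(). The sketch below assumes a fresh copy of the messy vector under the hypothetical name messy.

```r
messy <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)

max(messy)                  # NA
max(messy, na.rm = TRUE)    # 9

# with the NAs removed, sd() matches the result for the seven clean digits
sd.clean <- sd(messy, na.rm = TRUE)
sd.clean                    # 3.101459
```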

As mentioned previously, vectors can be constructed from logical values too.

  > log.vector <- c(TRUE, TRUE, FALSE)
  > log.vector
  [1]  TRUE  TRUE FALSE

Since logical values can be coerced into behaving like numerics, as we saw earlier, if we try to sum a logical vector as follows:

  > sum(log.vector)
  [1] 2

we will, essentially, get a count of the number of TRUE values in that vector.

There are many functions in R that operate on vectors and return logical vectors. is.na() is one such function. It returns a logical vector of the same length as the vector supplied as an argument, with a TRUE in the position of every NA value. Remember our messy vector (from just a minute ago)?

  > messy.vector
  [1]  8  6 NA  7  5 NA  3  0  9
  > is.na(messy.vector)
  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
  > #  8     6      NA   7     5      NA   3       0    9

Putting together these pieces of information, we can get a count of the number of NA values in a vector as follows:

  > sum(is.na(messy.vector))
  [1] 2

When you use comparison operators (such as >) on vectors, they also return logical vectors of the same length as the vector being operated on.

  > our.vect > 5
  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

If we wanted to—and we do—count the number of digits in Jenny's phone number that are greater than five, we would do so in the following manner:

  > sum(our.vect > 5)
  [1] 4
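Because TRUE counts as 1 and FALSE as 0, the same trick extends from counts to proportions: taking the mean() of a logical vector gives the fraction of TRUE values. The base functions any() and all() answer the related yes/no questions. A short sketch, using a fresh copy of the digits under the hypothetical name digits:

```r
digits <- c(8, 6, 7, 5, 3, 0, 9)

# proportion of digits greater than five (4 out of 7)
prop.big <- mean(digits > 5)
prop.big               # 0.5714286

any(digits > 8)        # TRUE: at least one digit is greater than 8
all(digits >= 0)       # TRUE: every digit is non-negative
```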

Advanced subsetting

Did I mention that we can use vectors to subset other vectors? When we subset vectors using logical vectors of the same length, only the elements corresponding to the TRUE values are extracted. Hopefully, sparks are starting to go off in your head. If we want to extract only the legitimate non-NA digits from Jenny's number, we can do it as follows:

  > messy.vector[!is.na(messy.vector)]
  [1] 8 6 7 5 3 0 9

This is a very critical trait of R, so let's take our time understanding it; this idiom will come up again and again throughout this book.

The logical vector that yields TRUE when an NA value occurs in messy.vector (from is.na()) is then negated (the whole thing) by the negation operator !. The resultant vector is TRUE whenever the corresponding value in messy.vector is not NA. When this logical vector is used to subset the original messy vector, it only extracts the non-NA values from it.

Similarly, we can show all the digits in Jenny's phone number that are greater than five as follows:

  > our.vect[our.vect > 5]
  [1] 8 6 7 9
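Logical vectors can also be combined element by element with & (and) and | (or) before being used as a subscript, and the base function which() converts a logical vector into the positions of its TRUE values. The following is a sketch under the assumption that we start from the original, uncorrected digits; the name digits is ours.

```r
digits <- c(8, 6, 7, 5, 3, 0, 9)

# digits greater than five AND less than nine
mid.digits <- digits[digits > 5 & digits < 9]
mid.digits             # 8 6 7

# positions (not values) of the digits greater than five
big.positions <- which(digits > 5)
big.positions          # 1 2 3 7
```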

Thus far, we've only been displaying elements that have been extracted from a vector. However, just as we've been assigning and re-assigning variables, we can assign values to various indices of a vector, and change the vector as a result. For example, if Jenny tells us that we have the first digit of her phone number wrong (it's really 9), we can reassign just that element without modifying the others.

  > our.vect
  [1] 8 6 7 5 3 0 9
  > our.vect[1] <- 9
  > our.vect
  [1] 9 6 7 5 3 0 9

Sometimes, it may be required to replace all the NA values in a vector with the value 0. To do that with our messy vector, we can execute the following command:

  > messy.vector[is.na(messy.vector)] <- 0
  > messy.vector
  [1] 8 6 0 7 5 0 3 0 9

Elegant though the preceding solution is, modifying a vector in place is usually discouraged in favor of creating a copy of the original vector and modifying the copy. One technique for doing this is to use the ifelse() function.

Not to be confused with the if/else control construct, ifelse() is a function that takes 3 arguments: a test that returns a logical/Boolean value, a value to use if the element passes the test, and one to return if the element fails the test.

The preceding in-place modification solution could be re-implemented with ifelse as follows:

  > ifelse(is.na(messy.vector), 0, messy.vector)
  [1] 8 6 0 7 5 0 3 0 9
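Note that messy.vector had already been modified in place by the earlier command, so it no longer contained any NA values by the time ifelse() was called on it. To see the difference between the two approaches, the sketch below starts again from a fresh vector (under the hypothetical name messy) and stores the result in a new variable, leaving the original untouched.

```r
messy <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)

# build a cleaned copy; messy itself is not changed
cleaned <- ifelse(is.na(messy), 0, messy)
cleaned               # 8 6 0 7 5 0 3 0 9

# the original still contains its two NA values
sum(is.na(messy))     # 2
```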

Recycling

The last important property of vectors and vector operations in R is that they can be recycled. To understand what I mean, examine the following expression:

  > our.vect + 3
  [1] 12  9 10  8  6  3 12

This expression adds three to each digit in Jenny's phone number. Although it may look that way, R is not performing this operation between a vector and a single value. Remember when I said that single values are actually vectors of length 1? What is really happening here is that R is told to perform element-wise addition on a vector of length 7 and a vector of length 1. Since element-wise addition is not defined for vectors of differing lengths, R recycles the shorter vector until it reaches the same length as the longer vector. Once both vectors are the same size, R performs the addition element by element and returns the result.

  > our.vect + 3
  [1] 12  9 10  8  6  3 12

is tantamount to…

  > our.vect + c(3, 3, 3, 3, 3, 3, 3)
  [1] 12  9 10  8  6  3 12

If we wanted to extract every other digit from Jenny's phone number, we could do so in the following manner:

  > our.vect[c(TRUE, FALSE)]
  [1] 9 7 3 9

This works because the vector c(TRUE, FALSE) is repeated until it is of length 7, making it equivalent to the following:

  > our.vect[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)]
  [1] 9 7 3 9

One common snag related to vector recycling that R users (useRs, if I may) encounter is that, during arithmetic operations on vectors of different lengths, R will warn you if the shorter vector cannot be repeated a whole number of times to reach the length of the longer vector. This is not a problem when doing vector arithmetic with single values, since a vector of length 1 can be repeated a whole number of times to match any length. It would pose a problem, though, if we were looking to add three to every other element in Jenny's phone number.

  > our.vect + c(3, 0)
  [1] 12  6 10  5  6  0 12
  Warning message:
  In our.vect + c(3, 0) :
    longer object length is not a multiple of shorter object length

You will likely learn to love these warnings, as they have stopped many useRs from making grave errors.
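For contrast, no warning is raised when the longer length is a whole multiple of the shorter one. The sketch below also uses rep() with its length.out argument to spell out what the recycled vector looks like in the seven-digit case; the names six, even.sum and recycled are ours.

```r
# length 6 is a multiple of length 2, so recycling is silent
six <- c(1, 2, 3, 4, 5, 6)
even.sum <- six + c(10, 20)
even.sum              # 11 22 13 24 15 26

# what c(3, 0) looks like once stretched to length 7:
# the last repetition is cut short, which is what triggers the warning
recycled <- rep(c(3, 0), length.out = 7)
recycled              # 3 0 3 0 3 0 3
```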

Before we move on to the next section, an important thing to note is that in a lot of other programming languages, many of the things that we did would have been implemented using for loops and other control structures. Although there is certainly a place for loops and such in R, oftentimes a more sophisticated solution exists in using just vector/matrix operations. In addition to elegance and brevity, the solution that exploits vectorization and recycling is often many, many times more efficient.
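To make the comparison concrete, here is a rough sketch of adding three to every digit with an explicit for loop, next to the vectorized version used earlier. The variable names are ours, and the loop is written the way one might in other languages rather than as a recommendation.

```r
digits <- c(9, 6, 7, 5, 3, 0, 9)

# loop version: allocate the result, then fill it one element at a time
looped <- numeric(length(digits))
for (i in seq_along(digits)) {
  looped[i] <- digits[i] + 3
}

# vectorized version: one expression, with 3 recycled to length 7
vectorized <- digits + 3

identical(looped, vectorized)   # TRUE
```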