R is a vector-oriented programming language: most of its objects are organized as vectors or matrices. While most of us associate vectors and matrices with linear algebra or other fields of mathematics, R treats them as flexible data structures that can hold both numeric and non-numeric values. This makes working with data simpler, especially when the data mixes several classes. The matrix structure is the generic basis for many tabular data types in R.
Among those, the most common types are as follows (the package that provides each one is shown in parentheses):
- matrix (base): This is the basic matrix format, which is indexed by the numeric positions of rows and columns. This format is strict about the data class: it isn't possible to combine multiple classes in the same table. For example, a matrix cannot hold both numeric values and strings; mixed input is coerced to a single common class.
- data.frame (base): This is one of the most popular tabular formats in R. It is a more flexible version of the matrix format. It includes additional attributes, which support the combination of multiple classes in the same table and different indexing methods.
- tibble (tibble): This is part of the tidyverse family of packages (packages designed by RStudio for data science applications). It is another tabular format and an improved version of the base data.frame class, with improvements related to printing and subsetting.
- ts (stats) and mts (stats): These are R's built-in classes for time series data, where ts is designed for a single time series and mts (multiple time series) supports multiple time series. Chapter 3, The Time Series Object, focuses on the time series object and its applications.
- zoo (zoo) and xts (xts): Both are dedicated data structures for time series data and are based on the matrix format with a timestamp index. Chapter 4, Working with zoo and xts Objects, provides an in-depth introduction to the zoo and xts objects.
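The differences between the base formats in the list above can be seen in a few lines. The following is a minimal sketch that uses only base R and the stats package; the object names (m, df, single, multi) and the values are made up for illustration and are not part of any dataset used in this chapter:

```r
# A matrix holds a single class: mixing numbers and strings
# coerces every element to character.
m <- matrix(c(1, 2, "a", "b"), nrow = 2)
typeof(m)  # "character"

# A data frame stores each column separately, so columns
# can have different classes.
df <- data.frame(id = c(1, 2), label = c("a", "b"),
                 stringsAsFactors = FALSE)
sapply(df, class)

# ts: a single regularly spaced series; here 24 illustrative
# monthly values starting in January 2020.
single <- ts(1:24, start = c(2020, 1), frequency = 12)
inherits(single, "ts")  # TRUE

# Passing a two-column matrix to ts() yields a multiple
# time series (mts) object.
multi <- ts(cbind(a = 1:24, b = 24:1),
            start = c(2020, 1), frequency = 12)
inherits(multi, "mts")  # TRUE
```

The tibble, zoo, and xts formats are left out of this sketch because they require their own packages to be installed.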
If you have never used R before, the first data structure that you will meet will probably be the data frame. Therefore, this section focuses on the basic techniques that you can use for querying and exploring data frames (which, similarly, can be applied to the other data structures). We will use the famous iris dataset as an example.
Let's load the iris dataset from the datasets package:
# Loading dataset from datasets package
data("iris", package = "datasets")
As we did previously, let's review the object structure using the str function:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
As you can see from the output of the str function, the iris data frame has 150 observations and 5 variables. The first four variables are numeric, while the fifth is a categorical variable (a factor). This mix of numeric and categorical variables is not possible in the standard matrix format. The summary function offers a different view of the table by providing summary statistics for the data frame's variables:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
As you can see from the preceding output, for each numeric variable the function calculates the mean, median, minimum, maximum, and first and third quartiles. For the Species factor, it instead reports the number of observations in each level.
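Beyond str and summary, a handful of base R operations cover most basic querying of a data frame. The following is a minimal sketch using the iris dataset; the variable names (sepal_length, setosa, iris_matrix) are chosen here for illustration only:

```r
data("iris", package = "datasets")

# Dimensions (rows, columns) and the first few rows
dim(iris)
head(iris, 3)

# Select a single column by name; the result is a numeric vector
sepal_length <- iris$Sepal.Length

# Recompute some of the statistics that summary() reports
mean(sepal_length)
quantile(sepal_length, probs = c(0.25, 0.5, 0.75))

# Filter rows with a logical condition and keep two columns
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]
nrow(setosa)  # 50

# Count the observations in each level of the factor
table(iris$Species)

# Converting the mixed data frame to a matrix forces a single class:
# every value, including the numeric ones, becomes a character string
iris_matrix <- as.matrix(iris)
typeof(iris_matrix)  # "character"
```

The last step illustrates why the data frame, rather than the matrix, is the natural container for a table that mixes numeric and categorical variables.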