There are several ways to query a data frame. These include built-in functions and the row and column indices of the data frame. For example, let's assume that we want to get the first five observations of the second variable (Sepal.Width). We will take a look at four different ways that we can do this:
- We can use the row and column indices of the data frame inside square brackets, where the value before the comma is the row index and the value after the comma is the column index:
iris[1:5, 2]
## [1] 3.5 3.0 3.2 3.1 3.6
- We can select a specific variable in the data frame with the $ operator and then apply the relevant row index. Unlike the previous method, which can return multiple rows and columns, this method is limited to a single variable:
iris$Sepal.Width[1:5]
## [1] 3.5 3.0 3.2 3.1 3.6
- Similar to the first approach, we can use the row index and column names of the data frame with square brackets:
iris[1:5, "Sepal.Width"]
## [1] 3.5 3.0 3.2 3.1 3.6
- We can use a function that returns the index of the rows or columns. In the following example, the which function returns the position of the Sepal.Width column by comparing the column names of the data frame with the name we are looking for:
iris[1:5, which(colnames(iris) == "Sepal.Width")]
## [1] 3.5 3.0 3.2 3.1 3.6
When working with R, you can almost always be sure that there is more than one way to do a specific task. We used four methods, all of which returned the same result. Square brackets are the standard way to index vectors, matrices, and data frames in R, and the number of index arguments matches the number of dimensions of the object. In all of these examples besides the second one, the object is the data frame, and therefore there are two dimensions (the row and column indices). In the second example, we first select the variable (or column) we want to use, which is a vector, and therefore there is only one dimension, that is, the row index. In the third method, we used the variable name instead of its position, and in the fourth method, we used a built-in function that returns the position of the variable. Using a name or a function to identify the variable is useful when the column name is known, but its position is dynamic (or unknown).
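As a quick check of the points above, the following sketch confirms that the four queries return the same vector and illustrates why a name-based lookup helps when column positions shift. The reordered data frame iris_reordered and the other object names are hypothetical and introduced only for this illustration:

```r
# The four queries from the examples above
by_index  <- iris[1:5, 2]
by_dollar <- iris$Sepal.Width[1:5]
by_name   <- iris[1:5, "Sepal.Width"]
by_which  <- iris[1:5, which(colnames(iris) == "Sepal.Width")]

# Compare each alternative with the position-based query
identical(by_index, by_dollar)
identical(by_index, by_name)
identical(by_index, by_which)

# Hypothetical scenario: the column order changes, so position 2
# no longer points to Sepal.Width
iris_reordered <- iris[, c("Species", "Petal.Width", "Sepal.Width",
                           "Sepal.Length", "Petal.Length")]

# The position-based query now returns Petal.Width values
position_based <- iris_reordered[1:5, 2]

# The name-based lookup still returns Sepal.Width values
name_based <- iris_reordered[1:5, which(colnames(iris_reordered) == "Sepal.Width")]
identical(name_based, by_index)
```

Each identical call should return TRUE, while position_based holds values from a different column, which is the risk of hard-coding a column position.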
Now, let's assume that we are interested in identifying the key attributes of setosa, one of the three species of the Iris flower in the dataset. First, we have to subset the data frame and use only the observations of setosa. Here are three simple methods to extract the setosa values (of course, there are more methods):
- We can use the subset function, where the first argument is the data that we wish to subset and the second argument is the condition we want to apply:
Setosa_df1 <- subset(x = iris, subset = Species == "setosa")
Let's review the first rows of the output with the head function:
head(Setosa_df1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
- Similarly, you can use the filter function from the dplyr package.
- Alternatively, you can use the index method we introduced previously, with the which function returning the row indices where the species is equal to setosa. Since we want all of the columns, we will leave the column index empty:
Setosa_df2 <- iris[which(iris$Species == "setosa"), ]
Again, let's review the first rows of the output with the head function:
head(Setosa_df2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
You can see that the results from the subset and index methods are identical:
identical(Setosa_df1, Setosa_df2)
## [1] TRUE
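The filter approach mentioned in the list above has no accompanying code, so here is a minimal sketch. It assumes that the function in question is filter from the dplyr package (the filter function in base R's stats package is a time series filter and does not subset rows) and that dplyr is installed. The object names Setosa_df3 and Setosa_base are introduced here for illustration only:

```r
# Assumption: the dplyr package is installed; the example is skipped otherwise
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Keep only the rows where Species equals "setosa"
  Setosa_df3 <- dplyr::filter(iris, Species == "setosa")

  # Base R index approach, used here as a reference
  Setosa_base <- iris[which(iris$Species == "setosa"), ]

  # Both objects should hold the same observations
  print(nrow(Setosa_df3) == nrow(Setosa_base))
  print(all(Setosa_df3$Species == "setosa"))
  print(all(Setosa_df3$Sepal.Width == Setosa_base$Sepal.Width))
}
```

The comparison is done on row counts and column values rather than with identical, since the two approaches may store attributes such as row names differently.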
Using the subset data frame, we can get summary statistics for the setosa species using the summary function:
summary(Setosa_df1)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##                 
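The zero counts for versicolor and virginica appear because subsetting a data frame keeps all of the original factor levels of the Species column. If that is not wanted, one option is the droplevels function from base R, sketched below. The object name Setosa_clean is an assumption made for this illustration:

```r
# Subset as before; Species still carries all three factor levels
Setosa_df1 <- subset(x = iris, subset = Species == "setosa")
levels(Setosa_df1$Species)

# droplevels removes factor levels that no longer occur in the data
Setosa_clean <- droplevels(Setosa_df1)
levels(Setosa_clean$Species)

# The summary of the Species column now lists only setosa
summary(Setosa_clean$Species)
```

Whether to drop unused levels depends on the downstream use; keeping them can be helpful when the subset is later compared or combined with the other species.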
The summary function has broader applications beyond the summary statistics of a data.frame object and can be used to summarize statistical models and other types of objects.
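As a brief illustration of that broader use, the sketch below applies summary to a numeric vector and to a linear model. The regression of Sepal.Length on Sepal.Width is an arbitrary choice made for this example and is not an analysis from the text; the object names are likewise hypothetical:

```r
# Setosa observations only
setosa <- iris[which(iris$Species == "setosa"), ]

# summary on a numeric vector returns the minimum, quartiles,
# mean, and maximum
vec_summary <- summary(setosa$Sepal.Length)
vec_summary

# summary on a fitted linear model returns the coefficient table,
# residual statistics, and goodness-of-fit measures
fit <- lm(Sepal.Length ~ Sepal.Width, data = setosa)
fit_summary <- summary(fit)
fit_summary

# The two results are objects of different classes, because summary
# is a generic function that dispatches on the class of its input
class(vec_summary)
class(fit_summary)
```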