R provides a wide range of statistical functions, allowing users to obtain the summary statistics of data, generate frequency and contingency tables, produce correlations, and conduct statistical inferences. This recipe covers basic statistics that can be applied to a dataset.
Applying basic statistics
Getting ready
Ensure you have completed the previous recipes by installing R on your operating system.
How to do it...
Perform the following steps to apply statistics to a dataset:
- Load the iris data into an R session:
> data(iris)
- Observe the format of the data:
> class(iris)
[1] "data.frame"
- The iris dataset is a data frame containing four numeric attributes (sepal length, sepal width, petal length, and petal width) plus the Species factor. For numeric values, you can compute descriptive statistics with mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four numeric attributes in the dataset:
> mean(iris$Sepal.Length)
Output:
[1] 5.843333
> sd(iris$Sepal.Length)
Output:
[1] 0.8280661
> var(iris$Sepal.Length)
Output:
[1] 0.6856935
> min(iris$Sepal.Length)
Output:
[1] 4.3
> max(iris$Sepal.Length)
Output:
[1] 7.9
> median(iris$Sepal.Length)
Output:
[1] 5.8
> range(iris$Sepal.Length)
Output:
[1] 4.3 7.9
> quantile(iris$Sepal.Length)
Output:
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
- The preceding example demonstrates how to apply descriptive statistics to a single variable. To obtain summary statistics for every numeric attribute of the data frame, you can use sapply. For example, the following call computes the mean of the first four attributes of iris; setting na.rm to TRUE tells mean to skip missing (NA) values:
> sapply(iris[1:4], mean, na.rm=TRUE)
Output:
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333
- As an alternative to using sapply to apply descriptive statistics to given attributes, R offers the summary function, which reports several descriptive statistics at once. In the following example, summary provides the min, first quartile (25th percentile), median, mean, third quartile (75th percentile), and max of every numeric attribute in iris, along with the number of observations in each level of Species:
> summary(iris)
Output:
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
- The preceding examples describe each variable on its own. R also provides the cor function for investigating the relationship between variables. The following example generates a 4x4 matrix by computing the correlation of each attribute pair within iris:
> cor(iris[,1:4])
Output:
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
- R also provides a function to compute the covariance of each attribute pair within the iris dataset:
> cov(iris[,1:4])
Output:
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
- Statistical tests are used to assess the significance of results; here we demonstrate how to use a t-test to determine whether two samples differ. In this example, we run t.test on the petal width of irises from the setosa and versicolor species. If the p-value is less than 0.05, we can reject the null hypothesis of equal means at the 5 percent significance level and conclude that the mean petal width differs significantly between setosa and versicolor:
> t.test(iris$Petal.Width[iris$Species=="setosa"],
+ iris$Petal.Width[iris$Species=="versicolor"])
Output:

	Welch Two Sample t-test

data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
t = -34.0803, df = 74.755, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.143133 -1.016867
sample estimates:
mean of x mean of y
    0.246     1.326
- Alternatively, you can perform a correlation test between the sepal length and the sepal width of an iris, and then retrieve the correlation coefficient of the two variables. The stronger the positive correlation, the closer the value is to 1. The stronger the negative correlation, the closer the value is to -1:
> cor.test(iris$Sepal.Length, iris$Sepal.Width)
Output:

	Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor
-0.1175698
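The printed test reports are convenient to read, but the individual numbers are often needed in later code. The following is a minimal sketch of one way to do that, assuming the results are stored in variables; the names tt, ct, and alpha are illustrative and do not come from the recipe. Both t.test and cor.test return a list of class "htest" whose components can be accessed with $:

```r
# Store the test results instead of only printing them.
# tt, ct and alpha are illustrative names, not part of the recipe.
tt <- t.test(iris$Petal.Width[iris$Species == "setosa"],
             iris$Petal.Width[iris$Species == "versicolor"])
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

# Individual components of the "htest" result lists.
tt$p.value    # p-value of the Welch two-sample t-test
tt$conf.int   # 95 percent confidence interval for the difference in means
tt$estimate   # the two sample means
ct$estimate   # Pearson correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation

# Compare each p-value with a chosen significance level.
alpha <- 0.05
tt$p.value < alpha   # TRUE: petal width differs between the two species
ct$p.value < alpha   # FALSE: p-value is 0.1519 for sepal length vs. width
```

Working with the stored object rather than the printed report makes it straightforward to reuse the p-value or the confidence interval in a script.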
How it works...
R has built-in statistical functions that enable the user to compute descriptive statistics on a single variable. The recipe first introduces how to apply mean, sd, var, min, max, median, range, and quantile to a single variable. To apply a statistic to all four numeric variables, one can use the sapply function. To examine the relationships between multiple variables, one can compute their correlation and covariance. Finally, the recipe shows how to determine whether two given samples differ by performing a statistical test.
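As a rough illustration of two of the points above, the sketch below applies several statistics to every numeric column in one sapply call and checks how covariance and correlation relate to each other. The helper describe and the variable names stats and cov_mat are hypothetical names introduced for this example; cov2cor is a function in the stats package that rescales a covariance matrix into a correlation matrix.

```r
# Hypothetical helper: several statistics for one numeric vector.
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
}

# sapply calls describe() on each numeric column and binds the results
# into a matrix with one row per statistic and one column per attribute.
stats <- sapply(iris[1:4], describe)
round(stats, 3)

# Correlation is covariance rescaled by the standard deviations:
# cor(x, y) = cov(x, y) / (sd(x) * sd(y)).
cov_mat <- cov(iris[, 1:4])
# Should be TRUE, since the two matrices agree up to floating-point tolerance.
all.equal(cov2cor(cov_mat), cor(iris[, 1:4]))
```

Because correlation is scale-free, it is usually easier to compare across attribute pairs than raw covariance, which depends on the units of each attribute.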
There's more...
If you need to compute summary statistics for different groups within the data, you can use the aggregate function or the reshape package:
- Use aggregate to calculate the mean of each iris attribute, grouped by species:
> aggregate(x=iris[,1:4],by=list(iris$Species),FUN=mean)
- Use the melt and cast functions of the reshape package to calculate the mean of each iris attribute for the setosa and versicolor species, with an additional grand-total row:
> library(reshape)
> iris.melt <- melt(iris,id='Species')
> cast(Species~variable,data=iris.melt,mean,
+ subset=Species %in% c('setosa','versicolor'),
+ margins='grand_row')
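For information on aggregate, refer to the help document by using ?aggregate. For the reshape package, use help(package="reshape"), or ?melt and ?cast once the package is loaded; because base R also has a function named reshape, ?reshape may not show the package's documentation.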
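If installing an extra package is not an option, grouped means can also be sketched with base R alone. The example below is one possible approach rather than the recipe's own method; it uses the formula interface of aggregate and the tapply function, and the variable names by_species and pw_means are illustrative.

```r
# Formula interface of aggregate: "." stands for every column other than
# Species, so this computes the mean of all four numeric attributes per species.
by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
by_species

# tapply handles one variable at a time and returns a vector named by group.
pw_means <- tapply(iris$Petal.Width, iris$Species, mean)
pw_means[c("setosa", "versicolor")]   # 0.246 and 1.326, matching the t-test output
```

The formula form keeps the grouping column named Species in the result, whereas passing by=list(iris$Species) labels it Group.1.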