Packt+ | Advance your knowledge in tech

You're reading from Machine Learning with R Cookbook, Second Edition Analyze data and build predictive models

Product type Paperback

Published in Oct 2017

Publisher Packt

ISBN-13 9781787284395

Length 572 pages

Edition 2nd Edition

Languages

Tools

Hadoop

Concepts

Machine Learning

Authors (2):

Ashish Bhatia

Yu-Wei, Chiu (David Chiu)

View More author details

Perform the following steps to apply statistics to a dataset:

Load the iris data into an R session:

        > data(iris)

Observe the format of the data:

        > class(iris)
        [1] "data.frame"

The iris dataset is a DataFrame containing four numeric attributes: petal length, petal width, sepal width, and sepal length. For numeric values, you can perform descriptive statistics, such as mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four attributes in the dataset:

        > mean(iris$Sepal.Length)
        Output:
        [1] 5.843333
        > sd(iris$Sepal.Length)
        Output:
        [1] 0.8280661
        > var(iris$Sepal.Length)
        Output:
        [1] 0.6856935
        > min(iris$Sepal.Length)
        Output:
        [1] 4.3
        > max(iris$Sepal.Length)
        Output:
        [1] 7.9
        > median(iris$Sepal.Length)
        Output:
        [1] 5.8
        > range(iris$Sepal.Length)
        Output:
        [1] 4.3 7.9
        > quantile(iris$Sepal.Length)
        Output:
        0%  25%  50%  75% 100% 
        4.3  5.1  5.8  6.4  7.9

The preceding example demonstrates how to apply descriptive statistics to a single variable. In order to obtain summary statistics on every numeric attribute of the DataFrame, one may use sapply. For example, to apply the mean on the first four attributes in the iris DataFrame, ignore the na value by setting na.rm as TRUE:

        > sapply(iris[1:4], mean, na.rm=TRUE)
        Output:
        Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          5.843333     3.057333     3.758000     1.199333

As an alternative to using sapply to apply descriptive statistics on given attributes, R offers the summary function that provides a full range of descriptive statistics. In the following example, the summary function provides the mean, median, 25th and 75th quartiles, min, and max of every iris dataset numeric attribute:

        > summary(iris)
        Output:
        Sepal.Length  Sepal.Width   Petal.Length   Petal.Width  Species  
        Min.  4.300 Min.   :2.000 Min.   :1.000 Min.   :0.100 setosa    :50  
        1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
        versicolor:50  
        Median :5.800 Median :3.000 Median :4.350 Median :1.300 
        virginica :50  
        Mean :5.843 Mean   :3.057 Mean   :3.758 Mean   :1.199                  
        3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800                  
        Max. :7.900 Max.   :4.400 Max.   :6.900 Max.   :2.500

The preceding example shows how to output the descriptive statistics of a single variable. R also provides the correlation for users to investigate the relationship between variables. The following example generates a 4x4 matrix by computing the correlation of each attribute pair within the iris:

        > cor(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
        Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
        Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
        Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

R also provides a function to compute the covariance of each attribute pair within the iris dataset:

        > cov(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
        Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
        Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
        Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

Statistical tests are performed to access the significance of the results; here we demonstrate how to use a t-test to determine the statistical differences between two samples. In this example, we perform a t.test on the petal width an of an iris in either the setosa or versicolor species. If we obtain a p-value less than 0.5, we can be certain that the petal width between the setosa and versicolor will vary significantly:

        > t.test(iris$Petal.Width[iris$Species=="setosa"], 
        +        iris$Petal.Width[iris$Species=="versicolor"])
        Output:
        
        Welch Two Sample t-test
        
        data:  iris$Petal.Width[iris$Species == "setosa"] and 
        iris$Petal.Width[iris$Species == "versicolor"]
        t = -34.0803, df = 74.755, p-value < 2.2e-16
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
       -1.143133 -1.016867
        sample estimates:
        mean of x mean of y 
        0.246     1.326

Alternatively, you can perform a correlation test on the sepal length to the sepal width of an iris, and then retrieve a correlation score between the two variables. The stronger the positive correlation, the closer the value is to 1. The stronger the negative correlation, the closer the value is to -1:

        > cor.test(iris$Sepal.Length, iris$Sepal.Width)
        Output:
        Pearson's product-moment correlation
        data:  iris$Sepal.Length and iris$Sepal.Width
        t = -1.4403, df = 148, p-value = 0.1519
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
       -0.27269325  0.04351158
        sample estimates:
            cor 
       -0.1175698

> library(reshape) > iris.melt <- melt(iris,id='Species') > cast(Species~variable,data=iris.melt,mean, subset=Species %in% c('setosa','versicolor'), margins='grand_row')