Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Machine Learning with R Cookbook, Second Edition

You're reading from   Machine Learning with R Cookbook, Second Edition Analyze data and build predictive models

Arrow left icon
Product type Paperback
Published in Oct 2017
Publisher Packt
ISBN-13 9781787284395
Length 572 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Ashish Bhatia Ashish Bhatia
Author Profile Icon Ashish Bhatia
Ashish Bhatia
Yu-Wei, Chiu (David Chiu) Yu-Wei, Chiu (David Chiu)
Author Profile Icon Yu-Wei, Chiu (David Chiu)
Yu-Wei, Chiu (David Chiu)
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Practical Machine Learning with R FREE CHAPTER 2. Data Exploration with Air Quality Datasets 3. Analyzing Time Series Data 4. R and Statistics 5. Understanding Regression Analysis 6. Survival Analysis 7. Classification 1 - Tree, Lazy, and Probabilistic 8. Classification 2 - Neural Network and SVM 9. Model Evaluation 10. Ensemble Learning 11. Clustering 12. Association Analysis and Sequence Mining 13. Dimension Reduction 14. Big Data Analysis (R and Hadoop)

Applying basic statistics

R provides a wide range of statistical functions, allowing users to obtain the summary statistics of data, generate frequency and contingency tables, produce correlations, and conduct statistical inferences. This recipe covers basic statistics that can be applied to a dataset.

Getting ready

Ensure you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to apply statistics to a dataset:

  1. Load the iris data into an R session:
        > data(iris)
  1. Observe the format of the data:
        > class(iris)
        [1] "data.frame"  
  1. The iris dataset is a DataFrame containing four numeric attributes: petal length, petal width, sepal width, and sepal length. For numeric values, you can perform descriptive statistics, such as mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four attributes in the dataset:
        > mean(iris$Sepal.Length)
        Output:
        [1] 5.843333
        > sd(iris$Sepal.Length)
        Output:
        [1] 0.8280661
        > var(iris$Sepal.Length)
        Output:
        [1] 0.6856935
        > min(iris$Sepal.Length)
        Output:
        [1] 4.3
        > max(iris$Sepal.Length)
        Output:
        [1] 7.9
        > median(iris$Sepal.Length)
        Output:
        [1] 5.8
        > range(iris$Sepal.Length)
        Output:
        [1] 4.3 7.9
        > quantile(iris$Sepal.Length)
        Output:
        0%  25%  50%  75% 100% 
        4.3  5.1  5.8  6.4  7.9
  1. The preceding example demonstrates how to apply descriptive statistics to a single variable. In order to obtain summary statistics on every numeric attribute of the DataFrame, one may use sapply. For example, to apply the mean on the first four attributes in the iris DataFrame, ignore the na value by setting na.rm as TRUE:
        > sapply(iris[1:4], mean, na.rm=TRUE)
        Output:
        Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          5.843333     3.057333     3.758000     1.199333 
    
  
  1. As an alternative to using sapply to apply descriptive statistics on given attributes, R offers the summary function that provides a full range of descriptive statistics. In the following example, the summary function provides the mean, median, 25th and 75th quartiles, min, and max of every iris dataset numeric attribute:
        > summary(iris)
        Output:
        Sepal.Length  Sepal.Width   Petal.Length   Petal.Width  Species  
        Min.  4.300 Min.   :2.000 Min.   :1.000 Min.   :0.100 setosa    :50  
        1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
versicolor:50 Median :5.800 Median :3.000 Median :4.350 Median :1.300
virginica :50 Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
  1. The preceding example shows how to output the descriptive statistics of a single variable. R also provides the correlation for users to investigate the relationship between variables. The following example generates a 4x4 matrix by computing the correlation of each attribute pair within the iris:
        > cor(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
        Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
        Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
        Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
  1. R also provides a function to compute the covariance of each attribute pair within the iris dataset:
        > cov(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
        Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
        Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
        Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
  1. Statistical tests are performed to access the significance of the results; here we demonstrate how to use a t-test to determine the statistical differences between two samples. In this example, we perform a t.test on the petal width an of an iris in either the setosa or versicolor species. If we obtain a p-value less than 0.5, we can be certain that the petal width between the setosa and versicolor will vary significantly:
        > t.test(iris$Petal.Width[iris$Species=="setosa"], 
        +        iris$Petal.Width[iris$Species=="versicolor"])
        Output:
        
Welch Two Sample t-test
data: iris$Petal.Width[iris$Species == "setosa"] and
iris$Petal.Width[iris$Species == "versicolor"] t = -34.0803, df = 74.755, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.143133 -1.016867 sample estimates: mean of x mean of y 0.246 1.326
  1. Alternatively, you can perform a correlation test on the sepal length to the sepal width of an iris, and then retrieve a correlation score between the two variables. The stronger the positive correlation, the closer the value is to 1. The stronger the negative correlation, the closer the value is to -1:
        > cor.test(iris$Sepal.Length, iris$Sepal.Width)
        Output:
        Pearson's product-moment correlation
        data:  iris$Sepal.Length and iris$Sepal.Width
        t = -1.4403, df = 148, p-value = 0.1519
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
       -0.27269325  0.04351158
        sample estimates:
            cor 
       -0.1175698   

How it works...

R has a built-in statistics function, which enables the user to perform descriptive statistics on a single variable. The recipe first introduces how to apply mean, sd, var, min, max, median, range, and quantile on a single variable. Moreover, in order to apply the statistics on all four numeric variables, one can use the sapply function. In order to determine the relationships between multiple variables, one can conduct correlation and covariance. Finally, the recipe shows how to determine the statistical differences of two given samples by performing a statistical test.

There's more...

If you need to compute an aggregated summary of statistics against data in different groups, you can use the aggregate and reshape functions to compute the summary statistics of data subsets:

  1. Use aggregate to calculate the mean of each iris attribute group by the species:
        > aggregate(x=iris[,1:4],by=list(iris$Species),FUN=mean)  
  1. Use reshape to calculate the mean of each iris attribute group by the species:
        >  library(reshape)
        >  iris.melt <- melt(iris,id='Species')
        >  cast(Species~variable,data=iris.melt,mean,
             subset=Species %in% c('setosa','versicolor'),
             margins='grand_row')

For information on reshape and aggregate, refer to the help documents by using ?reshape or ?aggregate.

You have been reading a chapter from
Machine Learning with R Cookbook, Second Edition - Second Edition
Published in: Oct 2017
Publisher: Packt
ISBN-13: 9781787284395
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime