You're reading from Applied Unsupervised Learning with R Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type Paperback

Published in Mar 2019

ISBN-13 9781789956399

Length 320 pages

Edition 1st Edition

Concepts

Machine Learning

Authors (2):

Bradford Tuckfield

Alok Malik

Table of Contents (9) Chapters

Applied Unsupervised Learning with R

Preface

1. Introduction to Clustering Methods

2. Advanced Clustering Methods

3. Probability Distributions

4. Dimension Reduction

5. Data Comparison Methods

6. Anomaly Detection

Appendix

Chapter 3: Probability Distributions

Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset

Solution:

Load the Iris dataset into the df variable:
```
df<-iris
```
Select rows corresponding to the setosa species only:
```
df=df[df$Species=='setosa',]
```
Import the kdensity library:
```
library(kdensity)
```
Calculate and plot the KDE from the kdensity function for sepal length:
```
dist <- kdensity(df$Sepal.Length)
plot(dist)
```
The output is as follows:
Figure 3.36: Plot of the KDE for sepal length
This distribution is closest to the normal distribution, which we studied in the previous section. Here, the mean and median are both around 5.
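The claim about the mean and median can be checked directly. The following is a minimal sketch using only base R; the variable names `setosa`, `sl_mean`, and `sl_median` are illustrative and do not appear in the original solution.

```r
# Setosa rows of the built-in iris data, as selected in the steps above
setosa <- iris[iris$Species == "setosa", ]

# Centre of the sepal length distribution
sl_mean   <- mean(setosa$Sepal.Length)
sl_median <- median(setosa$Sepal.Length)

# For a roughly symmetric, bell-shaped sample these two should be close
print(c(mean = sl_mean, median = sl_median))
```

A mean and median that nearly coincide are consistent with a symmetric distribution such as the normal, although they do not prove normality on their own.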
Calculate and plot the KDE from the kdensity function for sepal width:
```
dist <- kdensity(df$Sepal.Width)
plot(dist)
```
The output is as follows:
Figure 3.37: Plot of the KDE for sepal width

This distribution is also closest to the normal distribution. We can formalize this similarity with a Kolmogorov-Smirnov test.
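Before moving to a formal test, the visual judgment can be made more explicit by drawing a normal curve with the sample mean and standard deviation on top of the kernel density estimate. This sketch swaps in base R's `density()` for `kdensity()` so that it runs without extra packages; the `max_gap` summary is an informal illustration, not a statistic used in the book.

```r
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Width

# Kernel density estimate with base R (used here in place of kdensity)
kde <- density(x)

# Normal density with the sample mean and sd, evaluated on the same grid
fitted_normal <- dnorm(kde$x, mean = mean(x), sd = sd(x))

# Overlay the two curves
plot(kde, main = "Sepal width (setosa): KDE vs fitted normal")
lines(kde$x, fitted_normal, col = "red", lty = 2)
legend("topright", legend = c("KDE", "Normal"),
       col = c("black", "red"), lty = c(1, 2))

# Largest vertical gap between the curves, as a rough summary of the fit
max_gap <- max(abs(kde$y - fitted_normal))
print(max_gap)
```

If the two curves nearly coincide, the normal distribution is a plausible candidate, and the Kolmogorov-Smirnov test can then quantify the agreement.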

Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution

Solution:

Load the Iris dataset into the df variable:
```
df<-iris
```
Keep rows with the setosa species only:
```
df=df[df$Species=='setosa',]
```
Calculate the mean and standard deviation of the sepal length column of df:
```
sdev<-sd(df$Sepal.Length)
mn<-mean(df$Sepal.Length)
```
Generate a sample of 100 values from a normal distribution with the same mean and standard deviation as the sepal length column:
```
xnorm<-rnorm(100,mean=mn,sd=sdev)
```
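Because `rnorm()` draws random numbers, each run produces a different `xnorm`, and the plots and test results that follow will vary slightly from the ones printed here. A minimal sketch of how to make the draw repeatable with `set.seed()` is shown below; the seed value 42 is arbitrary and is not part of the original solution.

```r
setosa <- iris[iris$Species == "setosa", ]
mn   <- mean(setosa$Sepal.Length)
sdev <- sd(setosa$Sepal.Length)

set.seed(42)  # arbitrary seed, an assumption of this sketch
xnorm_a <- rnorm(100, mean = mn, sd = sdev)

set.seed(42)  # resetting the same seed reproduces the same draws
xnorm_b <- rnorm(100, mean = mn, sd = sdev)

# Check that both samples are exactly the same
print(identical(xnorm_a, xnorm_b))
```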
Plot the CDF of both xnorm and the sepal length column:
```
plot(ecdf(xnorm),col='blue')
plot(ecdf(df$Sepal.Length),add=TRUE,pch = 4,col='red')
```
The output is as follows:
Figure 3.38: The CDF of xnorm and sepal length
The two empirical CDFs lie very close to each other. In the next step, let's test whether the sepal length sample is consistent with the normal distribution.
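It may help to see what `ecdf()` actually returns: a step function that, for any value, gives the fraction of observations less than or equal to it. The sketch below evaluates that function at a few points; the name `F_len` is illustrative only.

```r
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Length

# ecdf() returns a function, not a vector
F_len <- ecdf(x)

# Fraction of setosa flowers with sepal length <= 5.0
at_five <- F_len(5.0)

# The same quantity computed by direct counting
by_count <- mean(x <= 5.0)

print(c(ecdf = at_five, count = by_count))

# The ECDF is 0 below the smallest value and 1 at the largest value
print(c(below_min = F_len(min(x) - 0.1), at_max = F_len(max(x))))
```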
Perform the Kolmogorov-Smirnov test on the two samples, as follows:
```
ks.test(xnorm,df$Sepal.Length)
```
The output is as follows:
```
    Two-sample Kolmogorov-Smirnov test
data: xnorm and df$Sepal.Length
D = 0.14, p-value = 0.5307
alternative hypothesis: two-sided
```
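Here, the p-value (0.5307) is well above the conventional 0.05 significance level and the D statistic (0.14) is small, so we cannot reject the hypothesis that the two samples come from the same distribution. In other words, the distribution of sepal length is closely approximated by the normal distribution. Because xnorm is drawn at random, the exact values will differ from run to run.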
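The D value reported by the two-sample test is the largest vertical distance between the two empirical CDFs. The following sketch recomputes it by hand and compares it with the value from `ks.test()`. The seed and the names `pooled`, `d_manual`, and `d_ks` are assumptions of this example, so the resulting D will not match the 0.14 printed above.

```r
set.seed(1)  # arbitrary seed, not from the original solution
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Length
xnorm <- rnorm(100, mean = mean(x), sd = sd(x))

# Evaluate both ECDFs at every observed value from either sample
pooled <- sort(c(xnorm, x))
d_manual <- max(abs(ecdf(xnorm)(pooled) - ecdf(x)(pooled)))

# D as reported by ks.test (a warning about ties in x may be raised, so it is suppressed)
d_ks <- unname(suppressWarnings(ks.test(xnorm, x))$statistic)

print(c(manual = d_manual, ks.test = d_ks))
```

A small D means the two step functions never drift far apart, which is what the CDF plot shows visually.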

Repeat the same steps for the sepal width column of df:

```
sdev<-sd(df$Sepal.Width)
mn<-mean(df$Sepal.Width)
xnorm<-rnorm(100,mean=mn,sd=sdev)
plot(ecdf(xnorm),col='blue')
plot(ecdf(df$Sepal.Width),add=TRUE,pch = 4,col='red')
```

The output is as follows:

Figure 3.39: CDF of xnorm and sepal width

Perform the Kolmogorov-Smirnov test as follows:

```
ks.test(xnorm,df$Sepal.Width)
```

The output is as follows:

```
    Two-sample Kolmogorov-Smirnov test
data: xnorm and df$Sepal.Width
D = 0.12, p-value = 0.7232
alternative hypothesis: two-sided
```

Here too, the p-value is high (0.7232) and the D statistic is low (0.12), so the sample distribution of sepal width is also closely approximated by the normal distribution.
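Since the same steps are repeated for each column, they can be wrapped in a small function. The helper below, `ks_against_normal`, is hypothetical and not from the book; it is a sketch that simulates a normal sample with the column's mean and standard deviation and runs the two-sample test against it. The seed is arbitrary.

```r
# Hypothetical helper: compare one numeric vector with a simulated normal sample
ks_against_normal <- function(values, n_sim = 100) {
  sim <- rnorm(n_sim, mean = mean(values), sd = sd(values))
  # Measurements rounded to 0.1 contain ties, which can trigger a warning
  suppressWarnings(ks.test(sim, values))
}

set.seed(123)  # arbitrary seed for repeatable results
setosa <- iris[iris$Species == "setosa", ]

# Apply the helper to both sepal columns
results <- lapply(setosa[, c("Sepal.Length", "Sepal.Width")], ks_against_normal)

# Collect the p-values into a named vector
p_values <- sapply(results, function(res) res$p.value)
print(p_values)
```

The same call works for the petal columns by adding their names to the column selection.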
