Applied Unsupervised Learning with R

You're reading from Applied Unsupervised Learning with R: Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type Paperback
Published in Mar 2019
ISBN-13 9781789956399
Length 320 pages
Edition 1st Edition
Authors (2): Bradford Tuckfield, Alok Malik

Chapter 3: Probability Distributions


Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset

Solution:

  1. Load the Iris dataset into the df variable:

    df<-iris
  2. Select rows corresponding to the setosa species only:

    df=df[df$Species=='setosa',]
  3. Import the kdensity library:

    library(kdensity)
  4. Calculate and plot the KDE from the kdensity function for sepal length:

    dist <- kdensity(df$Sepal.Length)
    plot(dist)

    The output is as follows:

    Figure 3.36: Plot of the KDE for sepal length

    This distribution is closest to the normal distribution, which we studied in the previous section. Here, the mean and median are both around 5.

  5. Calculate and plot the KDE from the kdensity function for sepal width:

    dist <- kdensity(df$Sepal.Width)
    plot(dist)

    The output is as follows:

    Figure 3.37: Plot of the KDE for sepal width

    This distribution is also closest to the normal distribution. We can formalize this similarity with a Kolmogorov-Smirnov test.
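One way to check the visual impression from the two KDE plots is to overlay a normal density that has the same mean and standard deviation as the sample. The following is a minimal sketch, not the book's solution: it uses base R's `density()` instead of `kdensity` so that it runs without extra packages, and the helper name `overlay_normal` is hypothetical.

```r
# Setosa rows only, as in the steps above
df <- iris[iris$Species == 'setosa', ]

# Hypothetical helper: plot a KDE of x and overlay the normal density
# that shares x's sample mean and standard deviation.
overlay_normal <- function(x, label) {
  kde <- density(x)                       # base R KDE, Gaussian kernel by default
  mn <- mean(x)
  sdev <- sd(x)
  norm_y <- dnorm(kde$x, mean = mn, sd = sdev)
  plot(kde, main = paste('KDE vs. normal:', label),
       ylim = range(c(kde$y, norm_y)))
  lines(kde$x, norm_y, col = 'red', lty = 2)  # dashed red line: fitted normal
  # Return the summary statistics invisibly for further inspection
  invisible(list(mean = mn, median = median(x), sd = sdev))
}

stats_length <- overlay_normal(df$Sepal.Length, 'sepal length')
stats_width  <- overlay_normal(df$Sepal.Width, 'sepal width')

# A mean close to the median is one informal sign of symmetry
print(c(stats_length$mean, stats_length$median))
print(c(stats_width$mean, stats_width$median))
```

If the dashed normal curve tracks the KDE closely and the mean and median nearly coincide, the normal distribution is a plausible candidate. This is only a visual check and does not replace a formal test.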

Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution

Solution:

  1. Load the Iris dataset into the df variable:

    df<-iris
  2. Keep rows with the setosa species only:

    df=df[df$Species=='setosa',]
  3. Calculate the mean and standard deviation of the sepal length column of df:

    sdev<-sd(df$Sepal.Length)
    mn<-mean(df$Sepal.Length)
  4. Generate a new distribution with the standard deviation and mean of the sepal length column:

    xnorm<-rnorm(100,mean=mn,sd=sdev)
  5. Plot the CDF of both xnorm and the sepal length column:

    plot(ecdf(xnorm),col='blue')
    plot(ecdf(df$Sepal.Length),add=TRUE,pch = 4,col='red')

    The output is as follows:

    Figure 3.38: The CDF of xnorm and sepal length

    The two empirical CDFs lie very close to each other. The next step tests whether the sepal length sample is consistent with the normal distribution.

  6. Perform the Kolmogorov-Smirnov test on the two samples, as follows:

    ks.test(xnorm,df$Sepal.Length)

    The output is as follows:

        Two-sample Kolmogorov-Smirnov test
    data: xnorm and df$Sepal.Length
    D = 0.14, p-value = 0.5307
    alternative hypothesis: two-sided

    Here, the p-value is high (0.5307) and the D value is low (0.14), so we cannot reject the hypothesis that both samples come from the same distribution. In other words, the distribution of sepal length is closely approximated by the normal distribution.

  7. Repeat the same steps for the sepal width column of df:

    sdev<-sd(df$Sepal.Width)
    mn<-mean(df$Sepal.Width)
    xnorm<-rnorm(100,mean=mn,sd=sdev)
    plot(ecdf(xnorm),col='blue')
    plot(ecdf(df$Sepal.Width),add=TRUE,pch = 4,col='red')

    The output is as follows:

    Figure 3.39: CDF of xnorm and sepal width

  8. Perform the Kolmogorov-Smirnov test as follows:

    ks.test(xnorm,df$Sepal.Width)

    The output is as follows:

        Two-sample Kolmogorov-Smirnov test
    
    data: xnorm and df$Sepal.Width
    D = 0.12, p-value = 0.7232
    alternative hypothesis: two-sided

Here, too, the p-value is high (0.7232) and the D value is low (0.12), so the sample distribution of sepal width is also closely approximated by the normal distribution.
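Because `xnorm` is drawn at random with `rnorm`, the D and p-values in the steps above will differ from run to run. Two possible variations are sketched below under stated assumptions: a one-sample test that compares the data directly with the theoretical normal CDF (`pnorm`), and a seeded version of the two-sample test. The helper name `ks_normal` and the seed value are hypothetical choices for illustration, not part of the book's solution.

```r
df <- iris[iris$Species == 'setosa', ]

# Hypothetical helper: one-sample KS test of x against a normal
# distribution with x's own sample mean and standard deviation.
ks_normal <- function(x) {
  # The measurements are rounded to 0.1 cm, so the data contain ties;
  # ks.test() may warn about them, and the warning is suppressed here.
  suppressWarnings(ks.test(x, 'pnorm', mean = mean(x), sd = sd(x)))
}

res_length <- ks_normal(df$Sepal.Length)
res_width  <- ks_normal(df$Sepal.Width)
print(res_length)
print(res_width)

# Reproducible version of the two-sample comparison: fixing the seed
# makes rnorm() return the same sample on every run.
set.seed(42)  # arbitrary seed, chosen only for illustration
xnorm <- rnorm(100, mean = mean(df$Sepal.Width), sd = sd(df$Sepal.Width))
res_two <- suppressWarnings(ks.test(xnorm, df$Sepal.Width))
print(res_two$statistic)
print(res_two$p.value)
```

Note that when the mean and standard deviation are estimated from the same sample being tested, the one-sample p-value is only approximate and tends to be conservative, so it is best read as a rough guide rather than an exact result.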
