You're reading from Applied Unsupervised Learning with R Uncover hidden relationships and patterns with k-means clustering, hierarchical clustering, and PCA

Product type Paperback

Published in Mar 2019

ISBN-13 9781789956399

Length 320 pages

Edition 1st Edition

Concepts

Machine Learning

Authors (2):

Bradford Tuckfield

Alok Malik

Introduction to Market Segmentation

Market segmentation is the practice of dividing customers into segments based on shared characteristics. Customer segmentation is used for the following:

Increasing customer conversion and retention
Developing new products for a particular segment by identifying it and its needs
Improving brand communication with a particular segment
Identifying gaps in marketing strategy and making new marketing strategies to increase sales

Exercise 4: Exploring the Wholesale Customer Dataset

In this exercise, we will have a look at the data in the wholesale customer dataset.

Note

For all the exercises and activities that import an external CSV or image file, go to RStudio -> Session -> Set Working Directory -> To Source File Location. The console then shows the path that has been set.
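
If you prefer to check the working directory from the console rather than the menu, a minimal sketch follows. The folder in the commented `setwd()` call is a placeholder, not a path from the book.

```r
# Print the directory that relative paths such as
# "wholesale_customers_data.csv" are resolved against.
wd <- getwd()
print(wd)

# To change it manually, uncomment and point at your own folder (placeholder path):
# setwd("path/to/your/project")

# Check whether the CSV is visible from the current working directory.
csv_found <- file.exists("wholesale_customers_data.csv")
print(csv_found)
```

If `csv_found` is `FALSE`, the file is not in the working directory, and a `read.csv()` call that uses only the file name will fail.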

To download the CSV file, go to https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Exercise04/wholesale_customers_data.csv. Click on wholesale_customers_data.csv.
Note
This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Exercise04/wholesale_customers_data.csv.
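Save it to your R working directory (the one you set at the start of this exercise), because read.csv() resolves a bare file name against that directory. Now, to load it in R, use the following command: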
```
# Read the CSV from the working directory into a data frame named ws
ws <- read.csv("wholesale_customers_data.csv")
```
Now we can look at the columns and the first few rows of this dataset by using the following function in R:
```
head(ws)
```
The output is as follows:
Figure 1.19: Columns of the wholesale customer dataset

The output shows the first six rows of the dataset. Each row records one customer's annual spending, in monetary units, by product category.
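
Beyond `head()`, a few base R calls give a fuller picture of the data. The following is a sketch, not part of the original exercise. It loads the real file if it is present and otherwise falls back to a tiny made-up data frame that borrows the column names of the UCI dataset. The fallback values are invented for illustration only.

```r
# Load the real file if it is in the working directory; otherwise use a small
# made-up data frame (illustrative values only, NOT real data) with the
# column names of the UCI wholesale customers dataset.
if (file.exists("wholesale_customers_data.csv")) {
  ws <- read.csv("wholesale_customers_data.csv")
} else {
  ws <- data.frame(
    Channel          = c(1, 2, 1, 2),
    Region           = c(3, 3, 1, 2),
    Fresh            = c(1200, 300, 5000, 800),
    Milk             = c(400, 900, 150, 2200),
    Grocery          = c(700, 2500, 300, 4100),
    Frozen           = c(90, 1100, 2600, 200),
    Detergents_Paper = c(60, 800, 20, 1500),
    Delicassen       = c(110, 45, 300, 75)
  )
}

print(dim(ws))      # number of rows and columns
str(ws)             # column names and types
print(summary(ws))  # per-column minimum, quartiles, mean, and maximum

# Count missing values in each column.
na_counts <- colSums(is.na(ws))
print(na_counts)
```

The `summary()` output is a quick way to see how widely spending varies across categories, which matters later because k-means is sensitive to the scale of each column.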

Activity 2: Customer Segmentation with k-means

For this activity, we're going to use the wholesale customer dataset from the UCI Machine Learning Repository. It's available at: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv. We're going to use clustering to identify market segments, that is, groups of customers who spend on different types of goods. Try k-means clustering for values of k from 2 to 6.

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

These steps will help you complete the activity:

1. Read the data downloaded from the UCI Machine Learning Repository into a variable. The data can be found at: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.
2. Select only two columns, Grocery and Frozen, for easy visualization of clusters.
3. As in Step 2 of Exercise 4, Exploring the Wholesale Customer Dataset, change the value for the number of clusters to 2 and generate the cluster centers.
4. Plot the graph as in Step 4 in Exercise 4, Exploring the Wholesale Customer Dataset.
5. Save the graph you generate.
6. Repeat Steps 3, 4, and 5, changing the value for the number of clusters to 3, 4, 5, and 6.
7. Decide which value for the number of clusters best classifies the dataset.
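
The steps above can be sketched with base R's `kmeans()` function. This is one possible approach under stated assumptions, not the book's reference solution. It uses `kmeans()` from the `stats` package rather than whichever function the earlier exercises use, and it simulates stand-in data when the CSV is absent. The output folder and file names are hypothetical. Comparing the total within-cluster sum of squares across k is one common heuristic for the final step, not necessarily the criterion the authors intend.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Step 1: read the data, or simulate stand-in data if the CSV is absent
# (simulated values are for illustration only, NOT the real dataset).
if (file.exists("wholesale_customers_data.csv")) {
  ws <- read.csv("wholesale_customers_data.csv")
} else {
  ws <- data.frame(
    Grocery = c(rnorm(50, 3000, 500), rnorm(50, 12000, 1500)),
    Frozen  = c(rnorm(50, 1000, 200), rnorm(50, 4000, 600))
  )
}

# Step 2: keep only the two columns used for plotting.
spend <- ws[, c("Grocery", "Frozen")]

# Hypothetical output folder; set to getwd() to keep the plots with your script.
out_dir <- tempdir()

# Steps 3 to 6: cluster for k = 2..6, plot each result, and save it as a PDF.
ks <- 2:6
wss <- numeric(length(ks))
for (i in seq_along(ks)) {
  k <- ks[i]
  fit <- kmeans(spend, centers = k, nstart = 10)  # 10 random restarts per k
  wss[i] <- fit$tot.withinss

  pdf(file.path(out_dir, sprintf("kmeans_k%d.pdf", k)))  # hypothetical file name
  plot(spend, col = fit$cluster, pch = 19, main = paste("k =", k))
  points(fit$centers, pch = 8, cex = 2)  # mark the cluster centers
  dev.off()
}

# Step 7: compare the total within-cluster sum of squares across k.
result <- data.frame(k = ks, tot_withinss = wss)
print(result)
```

A value of k after which `tot_withinss` stops dropping sharply is a reasonable candidate, but the saved plots should also be inspected to judge whether the segments are meaningful.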