Chapter 9: Hotspot Analysis
Activity 21: Estimating Density in One Dimension
Solution:
Open a new notebook and import all the necessary libraries.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

seaborn.set()
Sample 1,000 data points from the standard normal distribution. Add 3.5 to each of the last 625 values of the sample (that is, indices 375 through 999). To do this, create a random state with a seed of 100 using numpy.random.RandomState to guarantee reproducible sampled values, and then generate the data points with the randn(1000) call:
rand = numpy.random.RandomState(100)
vals = rand.randn(1000)  # standard normal
vals[375:] += 3.5
Plot the 1,000-point sample data as a histogram and add a scatterplot below it:
fig, ax = plt.subplots(figsize=(14, 10))
ax.hist(vals, bins=50, density=True, label='Sampled Values')
ax.plot(
    vals,
    -0.005 - 0.01 * numpy.random.random(len(vals)),
    '+k',
    label='Individual Points'
)
ax.legend(loc='upper right')
The output is as follows:
Define a grid of bandwidth values. Then, define and fit a grid search cross-validation algorithm:
bandwidths = 10 ** numpy.linspace(-1, 1, 100)

grid = sklearn.model_selection.GridSearchCV(
    estimator=sklearn.neighbors.KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": bandwidths},
    cv=10
)
grid.fit(vals[:, None])  # reshape the sample into a (1000, 1) column for scikit-learn
Extract the optimal bandwidth value:
best_bandwidth = grid.best_params_["bandwidth"]
print(
    "Best Bandwidth Value: {}"
    .format(best_bandwidth)
)
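Note that grid.best_estimator_, which is used in the next step, has already been refit on the full sample with this bandwidth. If you ever need a standalone model, a minimal equivalent sketch would be:

kde = sklearn.neighbors.KernelDensity(
    kernel="gaussian", bandwidth=best_bandwidth
)
kde.fit(vals[:, None])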
Replot the histogram from Step 3 and overlay the estimated density:
fig, ax = plt.subplots(figsize=(14, 10))
ax.hist(vals, bins=50, density=True, alpha=0.75, label='Sampled Values')

x_vec = numpy.linspace(-4, 8, 10000)[:, numpy.newaxis]
# score_samples returns the log density, so exponentiate to recover the density
density = numpy.exp(grid.best_estimator_.score_samples(x_vec))

ax.plot(
    x_vec[:, 0], density, '-',
    linewidth=4, label='Kernel = Gaussian'
)
ax.legend(loc='upper right')
The output is as follows:
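The fitted model can also be evaluated at individual points rather than over a whole grid. For example, a quick sketch (not part of the original steps) that returns the estimated density near the sample's two modes, 0 and 3.5:

point_density = numpy.exp(
    grid.best_estimator_.score_samples(numpy.array([[0.0], [3.5]]))
)
print(point_density)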
Activity 22: Analyzing Crime in London
Solution:
Load the crime data. Using the path where you saved the downloaded directory, create a list of the year-month tags, load the individual files iteratively with the read_csv function, and then concatenate these files together:
base_path = (
    "~/Documents/packt/unsupervised-learning-python/"
    "lesson-9-hotspot-models/metro-jul18-dec18/"
    "{yr_mon}/{yr_mon}-metropolitan-street.csv"
)
print(base_path)

yearmon_list = [
    "2018-0" + str(i) if i <= 9 else "2018-" + str(i)
    for i in range(7, 13)
]
print(yearmon_list)

data_yearmon_list = []
for idx, i in enumerate(yearmon_list):
    df = pandas.read_csv(
        base_path.format(yr_mon=i),
        header=0
    )
    data_yearmon_list.append(df)

    if idx == 0:
        print("Month: {}".format(i))
        print("Dimensions: {}".format(df.shape))
        print("Head:\n{}\n".format(df.head(2)))

london = pandas.concat(data_yearmon_list)
The output is as follows:
This printed information is just for the first of the loaded files, which is the crime data from the Metropolitan Police Service for July 2018. This one file has nearly 100,000 entries. You will notice that there is a great deal of interesting information in this dataset, but we will focus on Longitude, Latitude, Month, and Crime type.
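As an aside, if you prefer not to build the year-month tags by hand, a sketch using glob to discover the monthly files would look like this (assuming the same directory layout as the path above; the wildcard pattern is only illustrative):

import glob
import os

pattern = os.path.expanduser(
    "~/Documents/packt/unsupervised-learning-python/"
    "lesson-9-hotspot-models/metro-jul18-dec18/"
    "*/*-metropolitan-street.csv"
)
csv_paths = sorted(glob.glob(pattern))
london = pandas.concat(
    [pandas.read_csv(p, header=0) for p in csv_paths],
    ignore_index=True
)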
Print diagnostics of the complete, concatenated dataset (all six months):
print(
    "Dimensions - Full Data:\n{}\n"
    .format(london.shape)
)
print(
    "Unique Months - Full Data:\n{}\n"
    .format(london["Month"].unique())
)
print(
    "Number of Unique Crime Types - Full Data:\n{}\n"
    .format(london["Crime type"].nunique())
)
print(
    "Unique Crime Types - Full Data:\n{}\n"
    .format(london["Crime type"].unique())
)
print(
    "Count Occurrences Of Each Unique Crime Type - Full Data:\n{}\n"
    .format(london["Crime type"].value_counts())
)
The output is as follows:
Subset the DataFrame down to four variables (Longitude, Latitude, Month, and Crime type):
london_subset = london[["Month", "Longitude", "Latitude", "Crime type"]]
london_subset.head(5)
The output is as follows:
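Depending on the download, some crime records may have missing Longitude or Latitude values. An optional, minimal sketch that drops such rows before plotting (assuming you want to exclude incomplete records):

london_subset = london_subset.dropna(subset=["Longitude", "Latitude"])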
Using the jointplot function from seaborn, fit and visualize three kernel density estimation models for bicycle theft in July, September, and December 2018:
crime_bicycle_jul = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-07")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_jul, kind="kde"
)
The output is as follows:
crime_bicycle_sept = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-09")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_sept, kind="kde"
)
The output is as follows:
crime_bicycle_dec = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-12")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_dec, kind="kde"
)
The output is as follows:
From month to month, the density of bicycle thefts stays quite constant. There are slight differences between the densities, which is to be expected, given that each estimated density is built from a single month of data. Given these results, police or criminologists should be confident in predicting where future bicycle thefts are most likely to occur.
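Since the same filter-and-plot pattern is repeated for every crime type and month in the remaining steps, you could optionally wrap it in a small helper. The function name plot_crime_kde below is only illustrative:

def plot_crime_kde(data, crime_type, month):
    # Filter to a single crime type and month, then draw the KDE jointplot
    subset = data[
        (data["Crime type"] == crime_type)
        & (data["Month"] == month)
    ]
    return seaborn.jointplot(
        x="Longitude", y="Latitude", data=subset, kind="kde"
    )

plot_crime_kde(london_subset, "Bicycle theft", "2018-09")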
Repeat Step 4; this time, use shoplifting crimes for the months of August, October, and November 2018:
crime_shoplift_aug = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-08")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_aug, kind="kde"
)
The output is as follows:
crime_shoplift_oct = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-10")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_oct, kind="kde"
)
The output is as follows:
crime_shoplift_nov = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-11")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_nov, kind="kde"
)
The output is as follows:
Like the bicycle theft results, the shoplifting densities are quite stable across the months. The density from August 2018 looks different from the other two months; however, if you look at the longitude and latitude values, you will notice that the density itself is very similar, and that it is the plot axes that have shifted and been rescaled. The likely reason is a handful of outlying points that forced a much larger plotting region.
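If you want to confirm this, a quick check (not part of the original steps) is to inspect the coordinate ranges of the August subset; a few extreme minimum or maximum values would explain the enlarged plotting region:

print(crime_shoplift_aug[["Longitude", "Latitude"]].describe())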
Repeat Step 5; this time, use burglary crimes for the months of July, October, and December 2018:
crime_burglary_jul = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-07")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_jul, kind="kde"
)
The output is as follows:
crime_burglary_oct = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-10")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_oct, kind="kde"
)
The output is as follows:
crime_burglary_dec = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-12")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_dec, kind="kde"
)
The output is as follows:
Once again, we can see that the distributions are quite similar across the months. The only difference is that the densities seem to widen or spread from July to December. As always, the noise and the limited information in each one-month sample cause small shifts in the estimated densities.
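One quick, optional way to check this apparent spread is to compare the coordinate standard deviations of each monthly subset, which gives a rough proxy for how dispersed the incidents are:

for month, df in [
    ("July", crime_burglary_jul),
    ("October", crime_burglary_oct),
    ("December", crime_burglary_dec),
]:
    print(month)
    print(df[["Longitude", "Latitude"]].std(), "\n")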