Chapter 9: Hotspot Analysis
Activity 21: Estimating Density in One Dimension
Solution:
Open a new notebook and import all the necessary libraries.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

seaborn.set()
Sample 1,000 data points from the standard normal distribution. Add 3.5 to each of the last 625 values of the sample (that is, indices 375 through 999). To do this, create a random state with a seed of 100 using numpy.random.RandomState to guarantee reproducible sampled values, and then generate the data points with the randn(1000) call:
rand = numpy.random.RandomState(100)
vals = rand.randn(1000)  # standard normal
vals[375:] += 3.5
Plot the 1,000-point sample data as a histogram and add a scatterplot below it:
fig, ax = plt.subplots(figsize=(14, 10))
ax.hist(vals, bins=50, density=True, label='Sampled Values')
ax.plot(
    vals,
    -0.005 - 0.01 * numpy.random.random(len(vals)),
    '+k',
    label='Individual Points'
)
ax.legend(loc='upper right')
The output is as follows:
Define a grid of bandwidth values. Then, define and fit a grid search cross-validation algorithm:
bandwidths = 10 ** numpy.linspace(-1, 1, 100)

grid = sklearn.model_selection.GridSearchCV(
    estimator=sklearn.neighbors.KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": bandwidths},
    cv=10
)
grid.fit(vals[:, None])  # reshape the sample into a (1000, 1) column for scikit-learn
Extract the optimal bandwidth value:
best_bandwidth = grid.best_params_["bandwidth"]
print(
    "Best Bandwidth Value: {}"
    .format(best_bandwidth)
)
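Note that grid.best_estimator_, which is used in the next step, has already been refit on the full sample with this bandwidth. If you ever need a standalone model, a minimal equivalent sketch would be:

kde = sklearn.neighbors.KernelDensity(
    kernel="gaussian", bandwidth=best_bandwidth
)
kde.fit(vals[:, None])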
Replot the histogram from Step 3 and overlay the estimated density:
fig, ax = plt.subplots(figsize=(14, 10))
ax.hist(vals, bins=50, density=True, alpha=0.75, label='Sampled Values')

x_vec = numpy.linspace(-4, 8, 10000)[:, numpy.newaxis]
# score_samples returns the log density, so exponentiate to recover the density
density = numpy.exp(grid.best_estimator_.score_samples(x_vec))

ax.plot(
    x_vec[:, 0], density, '-',
    linewidth=4, label='Kernel = Gaussian'
)
ax.legend(loc='upper right')
The output is as follows:
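The fitted model can also be evaluated at individual points rather than over a whole grid. For example, a quick sketch (not part of the original steps) that returns the estimated density near the sample's two modes, 0 and 3.5:

point_density = numpy.exp(
    grid.best_estimator_.score_samples(numpy.array([[0.0], [3.5]]))
)
print(point_density)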
Activity 22: Analyzing Crime in London
Solution:
Load the crime data. Using the path where you saved the downloaded directory, create a list of the year-month tags, load the individual files iteratively with the read_csv function, and then concatenate these files together:
base_path = (
    "~/Documents/packt/unsupervised-learning-python/"
    "lesson-9-hotspot-models/metro-jul18-dec18/"
    "{yr_mon}/{yr_mon}-metropolitan-street.csv"
)
print(base_path)

yearmon_list = [
    "2018-0" + str(i) if i <= 9 else "2018-" + str(i)
    for i in range(7, 13)
]
print(yearmon_list)

data_yearmon_list = []
for idx, i in enumerate(yearmon_list):
    df = pandas.read_csv(
        base_path.format(yr_mon=i),
        header=0
    )
    data_yearmon_list.append(df)

    if idx == 0:
        print("Month: {}".format(i))
        print("Dimensions: {}".format(df.shape))
        print("Head:\n{}\n".format(df.head(2)))

london = pandas.concat(data_yearmon_list)
The output is as follows:
This printed information is just for the first of the loaded files, which is the crime data from the Metropolitan Police Service for July 2018. This one file has nearly 100,000 entries. You will notice that there is a great deal of interesting information in this dataset, but we will focus on Longitude, Latitude, Month, and Crime type.
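As an aside, if you prefer not to build the year-month tags by hand, a sketch using glob to discover the monthly files would look like this (assuming the same directory layout as the path above; the wildcard pattern is only illustrative):

import glob
import os

pattern = os.path.expanduser(
    "~/Documents/packt/unsupervised-learning-python/"
    "lesson-9-hotspot-models/metro-jul18-dec18/"
    "*/*-metropolitan-street.csv"
)
csv_paths = sorted(glob.glob(pattern))
london = pandas.concat(
    [pandas.read_csv(p, header=0) for p in csv_paths],
    ignore_index=True
)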
Print diagnostics of the complete, concatenated dataset (all six months):
print(
    "Dimensions - Full Data:\n{}\n"
    .format(london.shape)
)
print(
    "Unique Months - Full Data:\n{}\n"
    .format(london["Month"].unique())
)
print(
    "Number of Unique Crime Types - Full Data:\n{}\n"
    .format(london["Crime type"].nunique())
)
print(
    "Unique Crime Types - Full Data:\n{}\n"
    .format(london["Crime type"].unique())
)
print(
    "Count Occurrences Of Each Unique Crime Type - Full Data:\n{}\n"
    .format(london["Crime type"].value_counts())
)
The output is as follows:
Subset the DataFrame down to four variables (Longitude, Latitude, Month, and Crime type):
london_subset = london[["Month", "Longitude", "Latitude", "Crime type"]]
london_subset.head(5)
The output is as follows:
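Depending on the download, some crime records may have missing Longitude or Latitude values. An optional, minimal sketch that drops such rows before plotting (assuming you want to exclude incomplete records):

london_subset = london_subset.dropna(subset=["Longitude", "Latitude"])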
Using the jointplot function from seaborn, fit and visualize three kernel density estimation models for bicycle theft in July, September, and December 2018:
crime_bicycle_jul = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-07")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_jul, kind="kde"
)
The output is as follows:
crime_bicycle_sept = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-09")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_sept, kind="kde"
)
The output is as follows:
crime_bicycle_dec = london_subset[
    (london_subset["Crime type"] == "Bicycle theft")
    & (london_subset["Month"] == "2018-12")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_bicycle_dec, kind="kde"
)
The output is as follows:
From month to month, the density of bicycle thefts stays quite constant. There are slight differences between the densities, which is to be expected, given that each estimated density is built from a single month of data. Given these results, police or criminologists should be confident in predicting where future bicycle thefts are most likely to occur.
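Since the same filter-and-plot pattern is repeated for every crime type and month in the remaining steps, you could optionally wrap it in a small helper. The function name plot_crime_kde below is only illustrative:

def plot_crime_kde(data, crime_type, month):
    # Filter to a single crime type and month, then draw the KDE jointplot
    subset = data[
        (data["Crime type"] == crime_type)
        & (data["Month"] == month)
    ]
    return seaborn.jointplot(
        x="Longitude", y="Latitude", data=subset, kind="kde"
    )

plot_crime_kde(london_subset, "Bicycle theft", "2018-09")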
Repeat Step 4; this time, use shoplifting crimes for the months of August, October, and November 2018:
crime_shoplift_aug = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-08")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_aug, kind="kde"
)
The output is as follows:
crime_shoplift_oct = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-10")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_oct, kind="kde"
)
The output is as follows:
crime_shoplift_nov = london_subset[
    (london_subset["Crime type"] == "Shoplifting")
    & (london_subset["Month"] == "2018-11")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_shoplift_nov, kind="kde"
)
The output is as follows:
Like the bicycle theft results, the shoplifting densities are quite stable across the months. The density from August 2018 looks different from the other two months; however, if you look at the longitude and latitude values, you will notice that the density itself is very similar, and that it is the plot axes that have shifted and been rescaled. The likely reason is a handful of outlying points that forced a much larger plotting region.
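If you want to confirm this, a quick check (not part of the original steps) is to inspect the coordinate ranges of the August subset; a few extreme minimum or maximum values would explain the enlarged plotting region:

print(crime_shoplift_aug[["Longitude", "Latitude"]].describe())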
Repeat Step 5; this time, use burglary crimes for the months of July, October, and December 2018:
crime_burglary_jul = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-07")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_jul, kind="kde"
)
The output is as follows:
crime_burglary_oct = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-10")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_oct, kind="kde"
)
The output is as follows:
crime_burglary_dec = london_subset[
    (london_subset["Crime type"] == "Burglary")
    & (london_subset["Month"] == "2018-12")
]

seaborn.jointplot(
    x="Longitude", y="Latitude", data=crime_burglary_dec, kind="kde"
)
The output is as follows:
Once again, we can see that the distributions are quite similar across the months. The only difference is that the densities seem to widen or spread from July to December. As always, the noise and the limited information in each one-month sample cause small shifts in the estimated densities.
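One quick, optional way to check this apparent spread is to compare the coordinate standard deviations of each monthly subset, which gives a rough proxy for how dispersed the incidents are:

for month, df in [
    ("July", crime_burglary_jul),
    ("October", crime_burglary_oct),
    ("December", crime_burglary_dec),
]:
    print(month)
    print(df[["Longitude", "Latitude"]].std(), "\n")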