Exploring types of anomalies
Before choosing appropriate algorithms, a fundamental understanding of what constitutes an anomaly is essential to enhance explainability. Anomalies manifest in many shapes and sizes, including objects, vectors, events, patterns, and observations. They can exist in static entities or temporal contexts. Here is a comparison of different types of anomalies:
- A point anomaly is an individual data point that falls outside the range of normal values in a dataset. For example, an unusually expensive credit card purchase is a point anomaly.
- A collective anomaly occurs only when a group of related data records or a sequence of observations, taken together, differs significantly from the rest of the dataset. A spike of errors from multiple systems is a collective anomaly that might indicate problems with downstream e-commerce systems.
- A contextual anomaly is abnormal only when viewed against contextual attributes such as day and time. An example of a temporal contextual anomaly is a sudden increase in online orders outside of expected peak shopping hours.
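To make the three definitions concrete, here is a minimal sketch on synthetic data. The purchase amounts, hourly baselines, error counts, and every threshold are hypothetical values chosen for illustration; they are not taken from the book's datasets, and the simple rules used here are only one of many ways to flag each anomaly type.

```python
import numpy as np

# Point anomaly: one purchase far from the rest (synthetic amounts).
purchases = np.array([25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 2500.0])
median = np.median(purchases)
mad = np.median(np.abs(purchases - median))  # median absolute deviation
point_flags = np.abs(purchases - median) > 5 * mad  # hypothetical cutoff

# Contextual anomaly: the same order volume is normal at 8 p.m.
# but abnormal at 3 a.m. Baselines are hypothetical (mean, std) per hour.
expected = {3: (5.0, 2.0), 20: (120.0, 15.0)}

def is_contextual_anomaly(hour, orders, z_threshold=3.0):
    mean, std = expected[hour]
    return abs(orders - mean) / std > z_threshold

# Collective anomaly: a sustained run of errors, detected by summing
# counts over a sliding 5-minute window (synthetic per-minute counts).
errors = np.array([0, 1, 0, 2, 1, 0, 4, 5, 4, 5, 4, 0, 1])
window_sums = np.convolve(errors, np.ones(5), mode='valid')
collective_flags = window_sums > 15  # hypothetical cutoff

print(point_flags)
print(is_contextual_anomaly(3, 120), is_contextual_anomaly(20, 120))
print(np.flatnonzero(collective_flags))  # start index of each flagged window
```

Only the last purchase is flagged as a point anomaly, 120 orders is flagged at 3 a.m. but not at 8 p.m., and the error run is caught by the windows that cover it rather than by any single reading.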
An anomaly has either one attribute (univariate) or several attributes (multivariate), which can be numerical, binary, continuous, or categorical. These attributes describe the characteristics, features, and dimensions of an anomaly. Figure 1.1 shows examples of common anomaly types:
![Figure 1.1 – Types of anomalies](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_01.jpg)
Figure 1.1 – Types of anomalies
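The difference between univariate and multivariate anomalies can be sketched with two correlated synthetic attributes. In the example below, the candidate point looks unremarkable when each attribute is checked on its own, yet it is far from the data once the relationship between the attributes is considered. The Mahalanobis distance used here is one common choice for this comparison, not a method prescribed by this chapter, and all data is generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated attributes (synthetic data).
x = rng.normal(0.0, 1.0, size=500)
y = x + rng.normal(0.0, 0.1, size=500)
data = np.column_stack([x, y])

# A candidate that breaks the correlation: high x, low y.
candidate = np.array([1.5, -1.5])

# Univariate view: z-score of each attribute separately.
mean = data.mean(axis=0)
std = data.std(axis=0)
univariate_z = np.abs((candidate - mean) / std)

# Multivariate view: Mahalanobis distance uses the covariance
# between attributes, so it sees the broken correlation.
cov = np.cov(data, rowvar=False)
diff = candidate - mean
mahalanobis = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

print('Per-attribute z-scores:', univariate_z)
print('Mahalanobis distance:', mahalanobis)
```

Each per-attribute z-score stays well below a typical cutoff of 3, while the Mahalanobis distance is an order of magnitude larger, which is why multivariate data often calls for detectors that consider attributes jointly.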
Defining an anomaly is not a straightforward task because boundaries between normal and abnormal behaviors can be domain-specific and subject to risk tolerance levels defined by the business, organization, and industry. For example, an irregular heart rhythm from electrocardiogram (ECG) time series data may signal cardiovascular disease risk, whereas stock price fluctuations might be considered normal based on market demand. Thus, there is no universal definition of an anomaly and no one-size-fits-all solution for anomaly detection.
Let’s look at a point anomaly example using PyOD (https://github.com/yzhao062/pyod) and a diabetes dataset from Kaggle (https://www.kaggle.com/datasets/mathchi/diabetes-data-set). PyOD is an open source Python library that provides over 40 outlier detection algorithms, covering everything from outlier ensembles to neural network-based methods on multivariate data.
Sample Jupyter notebooks and requirements files for package dependencies discussed in this chapter are available at https://github.com/PacktPublishing/Deep-Learning-and-XAI-Techniques-for-Anomaly-Detection/tree/main/Chapter1.
You can experiment with this example on Amazon SageMaker Studio Lab, https://aws.amazon.com/sagemaker/studio-lab/, a free ML development environment that provides up to 12 hours of CPU or 4 hours of GPU per user session and 15 GiB storage at no cost. Alternatively, you can try this on your preferred Integrated Development Environment (IDE). A sample notebook, chapter1_pyod_point_anomaly.ipynb, can be found in the book's GitHub repo. Let’s get started:
- First, install the required packages using the provided requirements file:
import sys
!{sys.executable} -m pip install -r requirements.txt
- Import essential libraries.
%matplotlib inline
import pandas as pd
import numpy as np
import warnings
from pyod.models.knn import KNN
from platform import python_version
warnings.filterwarnings('ignore')
print(f'Python version: {python_version()}')
- Load and preview the dataset, as shown in Figure 1.2:
df = pd.read_csv('diabetes.csv')
df.head()
![Figure 1.2 – Preview dataset](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_02.jpg)
Figure 1.2 – Preview dataset
- The dataset contains the following columns:
  - Pregnancies: Number of times pregnant
  - Glucose: Plasma glucose concentration in an oral glucose tolerance test
  - BloodPressure: Diastolic blood pressure (mm Hg)
  - SkinThickness: Triceps skin fold thickness (mm)
  - Insulin: 2-hour serum insulin (mu U/ml)
  - BMI: Body mass index (weight in kg/(height in m)^2)
  - DiabetesPedigreeFunction: Diabetes pedigree function
  - Age: Age (years)
  - Outcome: Class variable (0 is not diabetic and 1 is diabetic)
- Figure 1.3 shows the descriptive statistics about the dataset:
df.describe()
![Figure 1.3 – Descriptive statistics](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_03.jpg)
Figure 1.3 – Descriptive statistics
- We will focus on identifying point anomalies using the Glucose and Insulin features. Assign these two columns to variables:
X = df['Glucose']
Y = df['Insulin']
- Figure 1.4 is a scatter plot that shows the original data distribution using the following code:
import matplotlib.pyplot as plt
plt.scatter(X, Y)
plt.xlabel('Glucose')
plt.ylabel('Insulin')  # Y holds the Insulin column
plt.show()
![Figure 1.4 – Original data distribution](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_04.jpg)
Figure 1.4 – Original data distribution
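- Next, load a K-nearest neighbors (KNN) model from PyOD. Before predicting outliers, we must reshape each column into the two-dimensional (n_samples, 1) array that KNN expects. Note that the model is fitted on the Insulin values only; Glucose is used as the x axis when plotting: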
from pyod.models.knn import KNN
Y = Y.values.reshape(-1, 1)
X = X.values.reshape(-1, 1)
clf = KNN()
clf.fit(Y)
outliers = clf.predict(Y)
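To build some intuition for what the detector measures, the following sketch computes a KNN-style outlier score by hand with NumPy. By default, PyOD's KNN scores each point by its distance to the k-th nearest neighbor; the function name kth_neighbor_distance, the sample values, and the choice of k=3 are hypothetical and used only for illustration. This is a simplified sketch, not PyOD's implementation.

```python
import numpy as np

def kth_neighbor_distance(values, k=5):
    """Score each 1-D value by the distance to its k-th nearest neighbor."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    distances = np.abs(values - values.T)   # pairwise absolute distances
    np.fill_diagonal(distances, np.inf)     # a point is not its own neighbor
    return np.sort(distances, axis=1)[:, k - 1]

# Synthetic insulin-like readings with one extreme value at the end.
insulin_like = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 600])
scores = kth_neighbor_distance(insulin_like, k=3)

print(scores)
print('Most anomalous index:', int(np.argmax(scores)))
```

Points inside the dense cluster have a close third neighbor and therefore a small score, while the isolated reading of 600 has to travel a long way to find three neighbors and receives by far the largest score.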
- List the identified outliers. You should see the output as shown in Figure 1.5:
anomaly = np.where(outliers==1)
anomaly
![Figure 1.5 – Outliers detected by KNN](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_05.jpg)
Figure 1.5 – Outliers detected by KNN
Figure 1.6 shows a preview of the identified outliers:
![Figure 1.6 – Preview outliers](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_06.jpg)
Figure 1.6 – Preview outliers
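The code that produces this preview is not listed in this section. One possible way to build such a table, assuming the predicted labels are aligned row by row with the DataFrame, is to use the positions returned by np.where to select rows. The small DataFrame and label array below are made-up stand-ins for df and outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the dataset and the predicted labels.
df_demo = pd.DataFrame({
    'Glucose': [100, 120, 95, 180, 110],
    'Insulin': [80, 90, 85, 700, 95],
})
outliers_demo = np.array([0, 0, 0, 1, 0])

# np.where returns a tuple of arrays; the first element holds row positions.
anomaly_rows = np.where(outliers_demo == 1)[0]

# Select the flagged rows by position to preview them.
preview = df_demo.iloc[anomaly_rows]
print(preview)
```

With the real data, the same pattern would be applied to df and outliers from the previous steps.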
- Visualize the distribution of outliers and inliers, as shown in Figure 1.7:
Y_outliers = Y[np.where(outliers==1)]
X_outliers = X[np.where(outliers==1)]
Y_inliers = Y[np.where(outliers==0)]
X_inliers = X[np.where(outliers==0)]
plt.scatter(X_outliers, Y_outliers, edgecolor='black', color='red', label='Outliers')
plt.scatter(X_inliers, Y_inliers, edgecolor='black', color='cyan', label='Inliers')
plt.legend()
plt.ylabel('Insulin')  # Y holds the Insulin column
plt.xlabel('Glucose')
plt.savefig('outliers_distribution.png', bbox_inches='tight')
plt.show()
![Figure 1.7 – Outliers versus inliers](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_07.jpg)
Figure 1.7 – Outliers versus inliers
- PyOD computes anomaly scores using decision_function for the trained model. The larger the anomaly score, the higher the probability that the instance is an outlier:
anomaly_score = clf.decision_function(Y)
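Scores only become labels once a threshold is applied. PyOD detectors take a contamination parameter (0.1 by default), which is the expected proportion of outliers and is used to derive the threshold from the training scores. The sketch below reproduces that idea with NumPy on synthetic scores; the exponential distribution and the variable names are assumptions for illustration, and this is not PyOD's own code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic anomaly scores standing in for a detector's output.
demo_scores = rng.exponential(scale=10.0, size=200)

# Expected share of outliers; 0.1 matches PyOD's default contamination.
contamination = 0.1

# Scores above the (1 - contamination) percentile are labeled as outliers.
threshold = np.percentile(demo_scores, 100 * (1 - contamination))
labels = (demo_scores > threshold).astype(int)

print('Threshold:', threshold)
print('Flagged instances:', int(labels.sum()))
```

With 200 scores and a contamination of 0.1, the top 20 scores are flagged. Choosing a smaller contamination value raises the threshold and flags fewer instances.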
- Visualize the calculated anomaly score distribution with a histogram:
n_bins = 5
min_outlier_anomaly_score = np.floor(np.min(anomaly_score[np.where(outliers==1)])*10)/10
plt.figure(figsize=(6, 4))
values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white')
plt.axvline(min_outlier_anomaly_score, c='r')
plt.bar_label(bars, fontsize=12)
plt.margins(x=0.01, y=0.1)
plt.xlabel('Anomaly Score')
plt.ylabel('Number of Instances')
plt.savefig('outliers_min.png', bbox_inches='tight')
plt.show()
In Figure 1.8, the red vertical line indicates the minimum anomaly score to flag an instance as an outlier:
![Figure 1.8 – Anomaly score distribution](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_08.jpg)
Figure 1.8 – Anomaly score distribution
- We can change the anomaly score threshold. Increasing the threshold should reduce the number of instances flagged as outliers. In this case, we only have one outlier after raising the anomaly score threshold to 250, as shown in Figure 1.9:
raw_outliers = np.where(anomaly_score >= 250)
raw_outliers
![Figure 1.9 – Outlier with a higher anomaly score](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_09.jpg)
Figure 1.9 – Outlier with a higher anomaly score
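The effect of the threshold is easy to check by sweeping several cutoffs over the same scores. The scores and cutoffs below are hypothetical values chosen for illustration; with the real model, anomaly_score from the earlier step would take the place of demo_scores.

```python
import numpy as np

# Hypothetical anomaly scores standing in for clf.decision_function output.
demo_scores = np.array([2.0, 5.0, 8.0, 12.0, 40.0, 55.0, 120.0, 260.0])

# Count how many instances each cutoff flags.
flagged_counts = {}
for cutoff in (50, 100, 250):
    flagged = np.where(demo_scores >= cutoff)[0]
    flagged_counts[cutoff] = len(flagged)
    print(f'cutoff={cutoff}: {len(flagged)} flagged at positions {flagged.tolist()}')
```

Each higher cutoff flags a subset of the instances flagged by the lower one, so the choice of threshold is a trade-off between catching more anomalies and raising fewer false alarms.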
- Figure 1.10 shows the same anomaly score distribution with a different threshold of 50:
n_bins = 5
min_anomaly_score = 50
values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white', color='green')
plt.axvline(min_anomaly_score, c='r')
plt.bar_label(bars, fontsize=12)
plt.margins(x=0.01, y=0.1)
plt.xlabel('Anomaly Score')
plt.ylabel('Number of Instances')
plt.savefig('outliers_modified.png', bbox_inches='tight')
plt.show()
![Figure 1.10 – Modified anomaly threshold](https://static.packt-cdn.com/products/9781804617755/graphics/image/B18948_1_10.jpg)
Figure 1.10 – Modified anomaly threshold
You have completed a walk-through of point anomaly detection using a KNN model. Feel free to explore other outlier detection algorithms provided by PyOD. With a foundational knowledge of anomaly types, you are ready to explore various real-world use cases for anomaly detection in the following section.