You're reading from Deep Learning and XAI Techniques for Anomaly Detection

Product type Book

Published in Jan 2023

Publisher Packt

ISBN-13 9781804617755

Pages 218 pages

Edition 1st Edition

Languages

Concepts

Data Science

Author (1):

Cher Simon

Table of Contents (15) Chapters

Preface

1. Part 1 – Introduction to Explainable Deep Learning Anomaly Detection

2. Chapter 1: Understanding Deep Learning Anomaly Detection

3. Chapter 2: Understanding Explainable AI

4. Part 2 – Building an Explainable Deep Learning Anomaly Detector

5. Chapter 3: Natural Language Processing Anomaly Explainability

6. Chapter 4: Time Series Anomaly Explainability

7. Chapter 5: Computer Vision Anomaly Explainability

8. Part 3 – Evaluating an Explainable Deep Learning Anomaly Detector

9. Chapter 6: Differentiating Intrinsic and Post Hoc Explainability

10. Chapter 7: Backpropagation versus Perturbation Explainability

11. Chapter 8: Model-Agnostic versus Model-Specific Explainability

12. Chapter 9: Explainability Evaluation Schemes

13. Index

14. Other Books You May Enjoy

Exploring types of anomalies

Before choosing appropriate algorithms, a fundamental understanding of what constitutes an anomaly is essential to enhance explainability. Anomalies manifest in many shapes and sizes, including objects, vectors, events, patterns, and observations. They can exist in static entities or temporal contexts. Here is a comparison of different types of anomalies:

- A point anomaly is an individual data point that falls outside the range of normal values in a dataset. For example, an unusually expensive credit card purchase is a point anomaly.
- A collective anomaly occurs when a group of related data records or a sequence of observations, taken together, differs significantly from the rest of the dataset. A spike of errors from multiple systems is a collective anomaly that might indicate problems with downstream e-commerce systems.
- A contextual anomaly is an observation that appears abnormal only when viewed against contextual attributes such as day and time. An example of a temporal contextual anomaly is a sudden increase in online orders outside of expected peak shopping hours.
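
The three anomaly types can be made concrete with a minimal sketch on made-up numbers. The data, the helper name `z_score`, the three-standard-deviation rule, and the run-length rule are all assumptions chosen for illustration; they are not part of the book's example.

```python
from statistics import mean, stdev

def z_score(value, reference):
    """Number of standard deviations between value and the reference mean."""
    return (value - mean(reference)) / stdev(reference)

# Point anomaly: one purchase far outside a cardholder's usual spending.
usual_purchases = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.6]
big_purchase = 2500.0
is_point_anomaly = abs(z_score(big_purchase, usual_purchases)) > 3

# Contextual anomaly: 120 orders is ordinary at 8 p.m. but unusual at 3 a.m.
orders_at_8pm = [110, 125, 118, 130, 122, 115]
orders_at_3am = [4, 6, 3, 5, 7, 4]
orders_now = 120
normal_in_peak_context = abs(z_score(orders_now, orders_at_8pm)) <= 3
anomalous_off_peak = abs(z_score(orders_now, orders_at_3am)) > 3

# Collective anomaly: a single elevated error count might be ignored,
# but a sustained run of elevated counts across consecutive minutes is not.
errors_per_minute = [1, 0, 2, 1, 6, 7, 6, 8, 7, 1, 0, 1]
elevated = [count >= 5 for count in errors_per_minute]
longest_run = run = 0
for flag in elevated:
    run = run + 1 if flag else 0
    longest_run = max(longest_run, run)
is_collective_anomaly = longest_run >= 4

print(is_point_anomaly, normal_in_peak_context, anomalous_off_peak, is_collective_anomaly)
```

The same value of 120 orders is judged differently depending on the hour it is compared against, which is what separates a contextual anomaly from a point anomaly.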

An anomaly has one attribute (univariate) or multiple attributes (multivariate), which can be numerical, binary, continuous, or categorical. These attributes describe the characteristics, features, and dimensions of an anomaly. Figure 1.1 shows examples of common anomaly types:

Figure 1.1 – Types of anomalies
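
To illustrate why the univariate versus multivariate distinction matters, here is a small sketch in which an observation looks ordinary on each attribute separately but is unusual as a combination. The data is made up, and the Mahalanobis distance is one common multivariate measure chosen here as an assumption; the chapter itself does not prescribe it.

```python
import numpy as np

# Reference observations with two strongly correlated attributes (made-up data).
x = np.arange(1.0, 11.0)
noise = np.where(np.arange(10) % 2 == 0, 0.5, -0.5)
reference = np.column_stack([x, x + noise])

# Candidate observation: each value is within the observed range of its attribute.
candidate = np.array([3.0, 8.0])

# Univariate view: check each attribute on its own.
mean = reference.mean(axis=0)
std = reference.std(axis=0, ddof=1)
per_attribute_z = np.abs((candidate - mean) / std)

# Multivariate view: the Mahalanobis distance accounts for the correlation.
cov = np.cov(reference, rowvar=False)
diff = candidate - mean
mahalanobis = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

print(per_attribute_z.round(2), round(mahalanobis, 2))
```

Both per-attribute z-scores stay below 1, while the joint distance is large, because the candidate breaks the relationship between the two attributes.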

Defining an anomaly is not a straightforward task because boundaries between normal and abnormal behaviors can be domain-specific and subject to risk tolerance levels defined by the business, organization, and industry. For example, an irregular heart rhythm from electrocardiogram (ECG) time series data may signal cardiovascular disease risk, whereas stock price fluctuations might be considered normal based on market demand. Thus, there is no universal definition of an anomaly and no one-size-fits-all solution for anomaly detection.

Let’s look at a point anomaly example using PyOD and a diabetes dataset from Kaggle, https://www.kaggle.com/datasets/mathchi/diabetes-data-set. PyOD, https://github.com/yzhao062/pyod, is an open source Python library that provides over 40 outlier detection algorithms, covering everything from outlier ensembles to neural network-based methods on multivariate data.

Sample Jupyter notebooks and requirements files for package dependencies discussed in this chapter are available at https://github.com/PacktPublishing/Deep-Learning-and-XAI-Techniques-for-Anomaly-Detection/tree/main/Chapter1.

You can experiment with this example on Amazon SageMaker Studio Lab, https://aws.amazon.com/sagemaker/studio-lab/, a free ML development environment that provides up to 12 hours of CPU or 4 hours of GPU per user session and 15 GiB storage at no cost. Alternatively, you can try this on your preferred Integrated Development Environment (IDE). A sample notebook, chapter1_pyod_point_anomaly.ipynb, can be found in the book's GitHub repo. Let’s get started:

First, install the required packages using the provided requirements file:
```
import sys

# Install into the environment of the running notebook kernel
!{sys.executable} -m pip install -r requirements.txt
```
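
After installation, it can help to confirm that the packages are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library; the module list is an assumption based on the imports used in this walk-through, and `find_missing` is a hypothetical helper name.

```python
import importlib.util

# Module names the notebook imports later in this walk-through (assumed list).
required_modules = ["pandas", "numpy", "matplotlib", "pyod"]

def find_missing(modules):
    """Return the modules that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(required_modules)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```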

Import essential libraries:
```
%matplotlib inline
import pandas as pd
import numpy as np
import warnings
from pyod.models.knn import KNN
from platform import python_version

warnings.filterwarnings('ignore')
print(f'Python version: {python_version()}')
```

Load and preview the dataset, as shown in Figure 1.2:
```
df = pd.read_csv('diabetes.csv')
df.head()
```

Figure 1.2 – Preview dataset

The dataset contains the following columns:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 is not diabetic and 1 is diabetic)
Figure 1.3 shows the descriptive statistics about the dataset:
```
df.describe()
```

Figure 1.3 – Descriptive statistics

We will focus on identifying point anomalies using the Glucose and Insulin features. Assign the model feature and target columns to variables:
```
X = df['Glucose']
Y = df['Insulin']
```

Figure 1.4 is a scatter plot that shows the original data distribution using the following code:

```
import matplotlib.pyplot as plt

plt.scatter(X, Y)
plt.xlabel('Glucose')
plt.ylabel('Insulin')  # Y holds the Insulin column
plt.show()
```

Figure 1.4 – Original data distribution

Next, load a K-nearest neighbors (KNN) model from PyOD. Before predicting outliers, we must reshape the columns into the two-dimensional input format that KNN expects:
```
from pyod.models.knn import KNN

# Reshape each column from shape (n,) to (n, 1)
Y = Y.values.reshape(-1, 1)
X = X.values.reshape(-1, 1)

# Fit the detector on the Insulin values and label each instance (1 = outlier, 0 = inlier)
clf = KNN()
clf.fit(Y)
outliers = clf.predict(Y)
```
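
To build intuition for the reshape step and for what a KNN-based detector measures, here is a brute-force sketch on made-up readings. It scores each instance by the distance to its k-th nearest neighbor, which is the general idea behind KNN outlier detection. The function name `kth_neighbor_distance`, the toy data, and the choice of `k=5` are assumptions for illustration; PyOD's own implementation uses optimized neighbor search and offers other aggregation options, so its scores may differ in detail.

```python
import numpy as np

def kth_neighbor_distance(values, k=5):
    """Score each row by the distance to its k-th nearest other row."""
    # Pairwise Euclidean distances between all rows, shape (n, n).
    diffs = values[:, None, :] - values[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the k-th nearest other point.
    return np.sort(distances, axis=1)[:, k]

# Toy 1-D feature: nine clustered readings and one extreme reading.
readings = np.array([80, 85, 90, 92, 95, 100, 105, 110, 115, 600], dtype=float)

# Estimators in the scikit-learn style expect a 2-D array (n_samples, n_features),
# so a single column is reshaped from (10,) to (10, 1).
column = readings.reshape(-1, 1)

scores = kth_neighbor_distance(column, k=5)
most_anomalous = int(np.argmax(scores))
print(column.shape, most_anomalous)
```

The extreme reading sits far from every other point, so even its fifth-nearest neighbor is hundreds of units away, while clustered readings have small scores.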
List the identified outliers. You should see the output as shown in Figure 1.5:
```
anomaly = np.where(outliers==1)
anomaly
```

Figure 1.5 – Outliers detected by KNN
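
The shape of the `np.where` result can be surprising at first: with a single condition argument it returns a tuple containing one index array per dimension. The sketch below uses made-up labels to show this and a flat alternative, `np.flatnonzero`.

```python
import numpy as np

# Made-up labels in the same 0/1 convention as the predictions above.
labels = np.array([0, 1, 0, 0, 1, 1, 0])

where_result = np.where(labels == 1)       # tuple holding one index array
outlier_idx = np.flatnonzero(labels == 1)  # the same indices as a plain array

print(where_result)
print(outlier_idx.tolist())
```

An index array like `outlier_idx` can also be passed to a pandas `DataFrame.iloc[...]` to preview the matching rows.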

Figure 1.6 shows a preview of the identified outliers:

Figure 1.6 – Preview outliers

Visualize the outliers and inliers distribution, as shown in Figure 1.7:

```
Y_outliers = Y[np.where(outliers==1)]
X_outliers = X[np.where(outliers==1)]
Y_inliers = Y[np.where(outliers==0)]
X_inliers = X[np.where(outliers==0)]

plt.scatter(X_outliers, Y_outliers, edgecolor='black', color='red', label='Outliers')
plt.scatter(X_inliers, Y_inliers, edgecolor='black', color='cyan', label='Inliers')
plt.legend()
plt.ylabel('Insulin')  # Y holds the Insulin column
plt.xlabel('Glucose')
plt.savefig('outliers_distribution.png', bbox_inches='tight')
plt.show()
```

Figure 1.7 – Outliers versus inliers

PyOD computes anomaly scores using decision_function for the trained model. The larger the anomaly score, the higher the probability that the instance is an outlier:
```
anomaly_score = clf.decision_function(Y)
```
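
A detector needs a rule for turning continuous anomaly scores into 0/1 labels. One common rule, sketched below, flags the top fraction of scores, where the fraction is an expected contamination rate. PyOD detectors accept a `contamination` parameter for this purpose; the percentile rule, the helper name `labels_from_scores`, and the scores here are assumptions for illustration, so consult the PyOD documentation for the exact behavior.

```python
import numpy as np

def labels_from_scores(scores, contamination=0.1):
    """Flag scores above the (1 - contamination) percentile as outliers."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    return (scores > threshold).astype(int), threshold

# Made-up anomaly scores: mostly small, two clearly larger.
toy_scores = np.array([1.0, 2.0, 1.5, 2.5, 3.0, 2.0, 1.0, 40.0, 2.2, 55.0,
                       1.8, 2.4, 1.2, 2.8, 3.1, 1.9, 2.1, 1.7, 2.6, 2.3])

toy_labels, toy_threshold = labels_from_scores(toy_scores, contamination=0.1)
print(int(toy_labels.sum()), np.flatnonzero(toy_labels).tolist())
```

With 20 scores and a contamination of 0.1, the rule flags the two largest scores.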

Visualize the calculated anomaly score distribution with a histogram:

```
n_bins = 5
min_outlier_anomaly_score = np.floor(np.min(anomaly_score[np.where(outliers==1)])*10)/10

plt.figure(figsize=(6, 4))
values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white')
plt.axvline(min_outlier_anomaly_score, c='r')
plt.bar_label(bars, fontsize=12)
plt.margins(x=0.01, y=0.1)
plt.xlabel('Anomaly Score')
plt.ylabel('Number of Instances')
plt.savefig('outliers_min.png', bbox_inches='tight')
plt.show()
```

In Figure 1.8, the red vertical line indicates the minimum anomaly score to flag an instance as an outlier:

Figure 1.8 – Anomaly score distribution

We can change the anomaly score threshold. Increasing the threshold reduces the number of instances flagged as outliers. In this case, only one outlier remains after raising the anomaly score threshold to 250, as shown in Figure 1.9:
```
raw_outliers = np.where(anomaly_score >= 250)
raw_outliers
```

Figure 1.9 – Outlier with a higher anomaly score
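
The effect of moving the threshold can be checked by counting flagged instances at several candidate values. The scores and thresholds in this sketch are made up and stand in for the output of `decision_function`.

```python
import numpy as np

# Made-up anomaly scores standing in for the output of decision_function.
toy_scores = np.array([3.0, 12.0, 7.5, 260.0, 48.0, 95.0, 5.0, 130.0, 20.0, 61.0])

# Count how many instances would be flagged at each candidate threshold.
thresholds = (50, 100, 250)
for threshold in thresholds:
    flagged = np.flatnonzero(toy_scores >= threshold)
    print(f"threshold={threshold}: {flagged.size} flagged at indices {flagged.tolist()}")

counts = [int((toy_scores >= t).sum()) for t in thresholds]
```

A higher threshold flags fewer instances, which lowers false alarms but risks missing genuine anomalies, so the choice depends on the risk tolerance of the domain.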

Figure 1.10 shows another outlier distribution with a different threshold:

```
n_bins = 5
min_anomaly_score = 50

values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white', color='green')
plt.axvline(min_anomaly_score, c='r')
plt.bar_label(bars, fontsize=12)
plt.margins(x=0.01, y=0.1)
plt.xlabel('Anomaly Score')
plt.ylabel('Number of Instances')
plt.savefig('outliers_modified.png', bbox_inches='tight')
plt.show()
```

Figure 1.10 – Modified anomaly threshold

You have completed a walk-through of point anomaly detection using a KNN model. Feel free to explore other outlier detection algorithms provided by PyOD. With a foundational knowledge of anomaly types, you are ready to explore various real-world use cases for anomaly detection in the following section.
