Deep Learning and XAI Techniques for Anomaly Detection

Product type Book
Published in Jan 2023
Publisher Packt
ISBN-13 9781804617755
Pages 218 pages
Edition 1st Edition
Author: Cher Simon

Exploring types of anomalies

Before choosing appropriate algorithms, a fundamental understanding of what constitutes an anomaly is essential to enhance explainability. Anomalies manifest in many shapes and sizes, including objects, vectors, events, patterns, and observations. They can exist in static entities or temporal contexts. Here is a comparison of different types of anomalies:

  • A point anomaly is an individual data point that falls outside the boundary of the normal distribution of a dataset. For example, an unusually expensive credit card purchase is a point anomaly.
  • A collective anomaly occurs only when a group of related data records or a sequence of observations, taken together, differs significantly from the rest of the dataset. A spike of errors from multiple systems is a collective anomaly that might indicate problems with downstream e-commerce systems.
  • A contextual anomaly is a data point that looks abnormal only when viewed against contextual attributes such as day and time. An example of a temporal contextual anomaly is a sudden increase in online orders outside of expected peak shopping hours.
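
The three anomaly types above can be illustrated with a small sketch. It uses only the Python standard library and a median-based rule; the helper names (mad, is_far_from), the toy data, and the cutoffs (k=5.0, a window of 4, a limit of 20) are assumptions chosen for illustration, not values from the book:

```python
from statistics import median

def mad(values):
    """Median absolute deviation: a robust measure of spread."""
    center = median(values)
    return median(abs(v - center) for v in values)

def is_far_from(value, reference, k=5.0):
    """True if value lies more than k MADs from the median of reference."""
    return abs(value - median(reference)) > k * mad(reference)

# Point anomaly: one purchase is far from all the others.
purchases = [23, 41, 18, 35, 27, 2500, 30, 22]
point_flags = [is_far_from(p, purchases) for p in purchases]

# Contextual anomaly: 120 orders is normal at 8 PM but unusual at 3 AM.
orders_by_hour = {3: [4, 6, 5, 7, 5], 20: [110, 125, 118, 130, 121]}
contextual_flags = {hour: is_far_from(120, history)
                    for hour, history in orders_by_hour.items()}

# Collective anomaly: no single error count is extreme, but a sustained run is.
errors_per_minute = [1, 0, 2, 1, 6, 7, 6, 8, 1, 0]
window, limit = 4, 20
collective_starts = [i for i in range(len(errors_per_minute) - window + 1)
                     if sum(errors_per_minute[i:i + window]) > limit]

print(point_flags)
print(contextual_flags)
print(collective_starts)
```

The same value (120 orders) is flagged in one context and accepted in another, which is what separates a contextual anomaly from a point anomaly.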

An anomaly has either a single attribute (univariate) or multiple attributes (multivariate), and these attributes can be numerical, binary, continuous, or categorical. They describe the characteristics, features, and dimensions of an anomaly. Figure 1.1 shows examples of common anomaly types:

Figure 1.1 – Types of anomalies

Defining an anomaly is not a straightforward task because boundaries between normal and abnormal behaviors can be domain-specific and subject to risk tolerance levels defined by the business, organization, and industry. For example, an irregular heart rhythm from electrocardiogram (ECG) time series data may signal cardiovascular disease risk, whereas stock price fluctuations might be considered normal based on market demand. Thus, there is no universal definition of an anomaly and no one-size-fits-all solution for anomaly detection.

Let’s look at a point anomaly example using PyOD and a diabetes dataset from Kaggle, https://www.kaggle.com/datasets/mathchi/diabetes-data-set. PyOD, https://github.com/yzhao062/pyod, is an open source Python library that provides over 40 outlier detection algorithms, covering everything from outlier ensembles to neural network-based methods on multivariate data.

Sample Jupyter notebooks and requirements files for package dependencies discussed in this chapter are available at https://github.com/PacktPublishing/Deep-Learning-and-XAI-Techniques-for-Anomaly-Detection/tree/main/Chapter1.

You can experiment with this example on Amazon SageMaker Studio Lab, https://aws.amazon.com/sagemaker/studio-lab/, a free ML development environment that provides up to 12 hours of CPU or 4 hours of GPU per user session and 15 GiB storage at no cost. Alternatively, you can try this on your preferred Integrated Development Environment (IDE). A sample notebook, chapter1_pyod_point_anomaly.ipynb, can be found in the book's GitHub repo. Let’s get started:

  1. First, install the required packages using the provided requirements file:
    import sys
    !{sys.executable} -m pip install -r requirements.txt
  2. Import essential libraries.
    %matplotlib inline
    import pandas as pd
    import numpy as np
    import warnings
    from pyod.models.knn import KNN
    from platform import python_version
    warnings.filterwarnings('ignore')
    print(f'Python version: {python_version()}')
  3. Load and preview the dataset, as shown in Figure 1.2:
    df = pd.read_csv('diabetes.csv')
    df.head()
Figure 1.2 – Preview dataset

  4. The dataset contains the following columns:
    • Pregnancies: Number of times pregnant
    • Glucose: Plasma glucose concentration in an oral glucose tolerance test
    • BloodPressure: Diastolic blood pressure (mm Hg)
    • SkinThickness: Triceps skin fold thickness (mm)
    • Insulin: 2-hour serum insulin (mu U/ml)
    • BMI: Body mass index (weight in kg/(height in m)^2)
    • DiabetesPedigreeFunction: Diabetes pedigree function
    • Age: Age (years)
    • Outcome: Class variable (0 is not diabetic and 1 is diabetic)
  5. Figure 1.3 shows descriptive statistics for the dataset:
    df.describe()
Figure 1.3 – Descriptive statistics

  6. We will focus on identifying point anomalies using the Glucose and Insulin features. Assign the feature and target columns to variables:
    X = df['Glucose']
    Y = df['Insulin']
  7. Plot the original data distribution as a scatter plot, as shown in Figure 1.4:
    import matplotlib.pyplot as plt
    plt.scatter(X, Y)
    plt.xlabel('Glucose')
    plt.ylabel('Insulin')  # Y holds the Insulin column
    plt.show()
Figure 1.4 – Original data distribution

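  8. Next, load a K-nearest neighbors (KNN) model from PyOD. Before predicting outliers, we must reshape the columns into the two-dimensional input format that KNN expects. Note that the model in this example is fit on the Insulin values (Y) only:
    from pyod.models.knn import KNN
    # Reshape each column to 2-D: (n_samples, 1)
    Y = Y.values.reshape(-1, 1)
    X = X.values.reshape(-1, 1)
    clf = KNN()
    # Fit on Insulin only; predict returns 1 for outliers and 0 for inliers
    clf.fit(Y)
    outliers = clf.predict(Y)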
  9. List the identified outliers. You should see the output as shown in Figure 1.5:
    anomaly = np.where(outliers==1)
    anomaly
Figure 1.5 – Outliers detected by KNN
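
To build intuition for what a distance-based KNN detector measures, here is a minimal sketch of the underlying idea: each point is scored by the distance to its k-th nearest neighbor, so isolated points receive large scores. This is not PyOD's implementation; the function name kth_neighbor_scores, the choice of k=2, and the toy values are assumptions for illustration only:

```python
def kth_neighbor_scores(values, k=2):
    """Score each 1-D value by the distance to its k-th nearest neighbor."""
    scores = []
    for i, v in enumerate(values):
        # Distances from this point to every other point, smallest first.
        distances = sorted(abs(v - other)
                           for j, other in enumerate(values) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical insulin-like readings with one isolated value.
insulin_like = [80, 85, 90, 95, 100, 600]
scores = kth_neighbor_scores(insulin_like, k=2)
print(scores)
```

The isolated reading of 600 gets by far the largest score because even its closest neighbors are hundreds of units away.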

Figure 1.6 shows a preview of the identified outliers:

Figure 1.6 – Preview outliers
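
The code that produces this preview is not shown in the text. One way to look up the flagged rows, sketched under the assumption that the row positions returned by np.where are used to index the DataFrame, is shown here with toy stand-ins for df and outliers (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the walk-through's `df` and `outliers` (hypothetical values).
df = pd.DataFrame({'Glucose': [148, 85, 183, 89, 137],
                   'Insulin': [0, 0, 0, 94, 846]})
outliers = np.array([0, 0, 0, 0, 1])

# np.where returns a tuple of index arrays; element 0 holds the row positions.
anomaly_idx = np.where(outliers == 1)[0]
preview = df.iloc[anomaly_idx]
print(preview)
```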

  10. Visualize the distribution of outliers and inliers, as shown in Figure 1.7:
    Y_outliers = Y[np.where(outliers==1)]
    X_outliers = X[np.where(outliers==1)]
    Y_inliers = Y[np.where(outliers==0)]
    X_inliers = X[np.where(outliers==0)]
    plt.scatter(X_outliers, Y_outliers, edgecolor='black',color='red', label= 'Outliers')
    plt.scatter(X_inliers, Y_inliers, edgecolor='black',color='cyan', label= 'Inliers')
    plt.legend()
    plt.ylabel('Insulin')  # Y holds the Insulin column
    plt.xlabel('Glucose')
    plt.savefig('outliers_distribution.png', bbox_inches='tight')
    plt.show()
Figure 1.7 – Outliers versus inliers

  11. PyOD computes raw anomaly scores for the fitted model with decision_function. The larger the anomaly score, the more likely the instance is an outlier:
    anomaly_score = clf.decision_function(Y)
  12. Visualize the calculated anomaly score distribution with a histogram:
    n_bins = 5
    min_outlier_anomaly_score = np.floor(np.min(anomaly_score[np.where(outliers==1)])*10)/10
    plt.figure(figsize=(6, 4))
    values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white')
    plt.axvline(min_outlier_anomaly_score, c='r')
    plt.bar_label(bars, fontsize=12)
    plt.margins(x=0.01, y=0.1)
    plt.xlabel('Anomaly Score')
    plt.ylabel('Number of Instances')
    plt.savefig('outliers_min.png', bbox_inches='tight')
    plt.show()

In Figure 1.8, the red vertical line indicates the minimum anomaly score to flag an instance as an outlier:

Figure 1.8 – Anomaly score distribution

  13. We can change the anomaly score threshold. Raising the threshold should reduce the number of instances flagged as outliers. In this case, only one outlier remains after raising the anomaly score threshold to 250, as shown in Figure 1.9:
    raw_outliers = np.where(anomaly_score >= 250)
    raw_outliers
Figure 1.9 – Outlier with a higher anomaly score
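
The effect of moving the threshold can be sketched in plain Python. The helper names (flag_outliers, threshold_for_fraction) and the scores are hypothetical. The second helper shows one simple way to derive a threshold from a target fraction of outliers; PyOD detectors expose a contamination parameter that plays a similar role, but this sketch is a simplified stand-in and not PyOD's exact rule:

```python
import math

def flag_outliers(scores, threshold):
    """Return 1 for scores at or above the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

def threshold_for_fraction(scores, fraction):
    """Smallest score among the top `fraction` of instances (at least one)."""
    n_flagged = max(1, math.floor(len(scores) * fraction))
    return sorted(scores, reverse=True)[n_flagged - 1]

# Hypothetical anomaly scores, one per instance.
anomaly_scores = [2, 3, 5, 8, 40, 60, 120, 300]

flagged_at_50 = sum(flag_outliers(anomaly_scores, 50))
flagged_at_250 = sum(flag_outliers(anomaly_scores, 250))
top_quarter_threshold = threshold_for_fraction(anomaly_scores, 0.25)

print(flagged_at_50, flagged_at_250, top_quarter_threshold)
```

Picking a threshold is a business decision as much as a statistical one: a higher cutoff yields fewer alerts but risks missing milder anomalies.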

  14. Figure 1.10 shows another outlier distribution with a different threshold:
    n_bins = 5
    min_anomaly_score = 50
    values, bins, bars = plt.hist(anomaly_score, bins=n_bins, edgecolor='white', color='green')
    plt.axvline(min_anomaly_score, c='r')
    plt.bar_label(bars, fontsize=12)
    plt.margins(x=0.01, y=0.1)
    plt.xlabel('Anomaly Score')
    plt.ylabel('Number of Instances')
    plt.savefig('outliers_modified.png', bbox_inches='tight')
    plt.show()
Figure 1.10 – Modified anomaly threshold

You have completed a walk-through of point anomaly detection using a KNN model. Feel free to explore other outlier detection algorithms provided by PyOD. With a foundational knowledge of anomaly types, you are ready to explore various real-world use cases for anomaly detection in the following section.
