Exercise 2 – network intrusion detection
Applying AI to intrusion detection offers benefits analogous to those of AI-based malware detection. In the following exercise, we will attempt to detect malicious traffic using support vector machines (SVMs) to construct an intrusion detection model. SVMs possess several advantages in the realm of intrusion detection systems. They excel in high-dimensional spaces, rendering them suitable for environments where feature spaces are intricate and contain many dimensions. Furthermore, SVMs are more resistant to overfitting than some alternative algorithms, a trait particularly advantageous when labeled data is limited, a common occurrence in intrusion detection scenarios. Additionally, SVMs tolerate irrelevant features, a crucial aspect for intrusion detection, where certain features may hold minimal significance for attack detection. Moreover, they handle imbalanced datasets adeptly, a paramount consideration, as intrusion detection datasets often contain only a limited number of attack instances.
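As a brief illustration of that last point, scikit-learn's SVC exposes a class_weight parameter that re-weights the misclassification penalty per class; setting it to 'balanced' weights classes inversely to their frequency. The following is a minimal sketch of that option, not the configuration used later in this exercise:

from sklearn.svm import SVC

# 'balanced' weights each class inversely to its frequency, so the
# rare attack class is not drowned out by abundant normal traffic
svm = SVC(kernel='rbf', class_weight='balanced')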
The dataset employed in this implementation is the NSL-KDD dataset, an updated iteration of the original KDD Cup 1999 dataset. While the KDD Cup 1999 dataset was extensively used to evaluate intrusion detection systems, it presented shortcomings. NSL-KDD rectifies these issues, offering a more realistic depiction of network traffic and a more challenging evaluation environment for intrusion detection systems. Other suitable datasets include UNSW-NB15, CICIDS2017, Kyoto 2006+, and CSE-CIC-IDS2018. The NSL-KDD dataset has an approximate size of 20 MB.
Similarly to the previous exercise, we will begin by importing the requisite libraries. These include standard libraries such as numpy and pandas for numerical operations and data handling, pickle for model serialization, and scikit-learn for preprocessing, model training, and evaluation. Additionally, in this scenario, we import SVC (support vector classification), scikit-learn's implementation of SVM tailored for classification tasks. This is shown here:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import pickle
The ensuing source code section loads the dataset and performs the preprocessing steps needed to make it compatible with SVMs and to improve performance. Initially, we will define the columns, or features, of the dataset. Since the KDDTrain+.txt dataset lacks explicit column headers, we will specify them manually based on the dataset description, the content, and our domain knowledge. Of paramount importance is the label column used for classification, distinguishing between normal and abnormal instances. Notably, the dataset constitutes a multiclass classification problem, featuring numerous possible values for the label column, such as normal traffic, DoS attacks, and brute-force attacks. To frame it as a binary classification problem, we transform the label column accordingly, categorizing instances as normal or abnormal.
After delineating the columns, we will load the dataset and convert it into a DataFrame using pandas. Subsequently, we will discard columns that are either unusable or could degrade the model's performance. For instance, the num_outbound_cmds column holds a static value and is therefore devoid of informational utility. We will then use the LabelEncoder class to convert categorical variables such as protocol_type, service, and flag into numerical form, thereby enabling compatibility with SVMs.
Lastly, we will convert the label column into the binary values normal and abnormal, and subsequently transform them into the numerical values 0 and 1, respectively, to facilitate SVM usage. Additionally, we will print the count of each label value (in KDDTrain+, approximately 67,000 normal and 59,000 abnormal instances) for further insight into the dataset's distribution:
columns = ['duration', 'protocol_type', 'service', 'flag',
    'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent',
    'hot', 'num_failed_logins', 'logged_in', 'num_compromised',
    'root_shell', 'su_attempted', 'num_root', 'num_file_creations',
    'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count',
    'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
    'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
    'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
    'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'label', 'difficulty_level']

train_data = pd.read_csv('dataset/KDDTrain+.txt',
                         header=None, names=columns)

# Drop the difficulty level and the static num_outbound_cmds column
train_data.drop(['difficulty_level', 'num_outbound_cmds'],
                axis=1, inplace=True)

# Encode the categorical features as integers
le = LabelEncoder()
train_data['protocol_type'] = le.fit_transform(train_data['protocol_type'])
train_data['service'] = le.fit_transform(train_data['service'])
train_data['flag'] = le.fit_transform(train_data['flag'])

# Collapse the multiclass labels into normal/abnormal,
# print the distribution, then map to 0/1
train_data['label'] = train_data['label'].apply(
    lambda x: 'normal' if x == 'normal' else 'abnormal')
print(train_data['label'].value_counts())
train_data['label'] = train_data['label'].apply(
    lambda x: 0 if x == 'normal' else 1)
Then, in the following code, the dataset is divided into features (X) and labels (y). Subsequently, the dataset is partitioned into training and testing subsets using a 70-30 split. We could omit this step and employ a separate test dataset, such as the KDDTest+ dataset, exclusively for testing purposes, while utilizing the KDDTrain+ dataset solely for model training:
X = train_data.drop(['label'], axis=1)
y = train_data['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40)
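Since the classes are not perfectly balanced, it can also be worth passing stratify=y so that both subsets preserve the original label proportions. The following variant is a sketch of that option; it was not used to produce the results shown later:

# Stratified variant: both subsets keep the normal/abnormal ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40, stratify=y)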
Then we will standardize the features using StandardScaler, which matters for SVMs because they are sensitive to the scale of the input features:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Next, we will train the model by initializing an SVM with a radial basis function (RBF) kernel and fitting it on the standardized training data. The choice of kernel holds significant importance in SVMs, as different kernels exhibit varying performance across techniques and datasets. In our scenario, the rbf kernel demonstrated superior performance to the other SVM kernels we tested:
svm_model = SVC(kernel='rbf')
svm_model.fit(X_train, y_train)
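If you want to verify the kernel choice for yourself rather than take it as given, a small grid search over kernels and regularization strengths is one way to do so. The following sketch uses scikit-learn's GridSearchCV; the parameter grid is an illustrative assumption, not the settings used in this exercise:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; on the full training set this search is slow,
# so consider running it on a random subsample first
param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Best CV accuracy:', search.best_score_)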
We will now make predictions on the test set using our trained model. We will print a variety of performance metrics, including accuracy for both the training and testing phases, as well as overall model accuracy. A comprehensive classification report is also generated, encompassing additional metrics such as precision, recall, and F1-score. Finally, we will save our trained model using the pickle library for future use:
y_pred = svm_model.predict(X_test)

print('Training accuracy: ', svm_model.score(X_train, y_train))
print('Testing accuracy: ', svm_model.score(X_test, y_test))
print('SVM Model accuracy: ',
      np.round(accuracy_score(y_test, y_pred), 6))
print('Classification Report:\n',
      classification_report(y_test, y_pred,
                            target_names=['Normal', 'Abnormal']))

# Serialize the trained model for later reuse
pkl_filename = "SVM_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(svm_model, file)
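As an aside, the serialized model can be reloaded in a later session as follows; note that any new data must first pass through the same fitted LabelEncoder and StandardScaler steps used during training:

# Reload the trained model from disk for later use
with open('SVM_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# New samples must be preprocessed exactly as the training data was
y_pred = loaded_model.predict(X_test)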
The following is a sample of our program execution:
Figure 7.2 – Execution of SVM model implementation
As shown in the preceding figure, our performance metrics showcase a remarkable 99% accuracy. However, this result was obtained by splitting the training data, not by validating on a separate test dataset.
Next, we will provide source code for loading a separate, more comprehensive dataset for testing our implementation. It's crucial to note that since the code above already partitions the training dataset into train and test subsets, we should remove that train-test partitioning code, train on all of KDDTrain+, and rely exclusively on this separate test dataset for evaluation. Failure to do so may adversely impact the model's performance.
print('Starting Testing Process using the Test Dataset....')

test_data = pd.read_csv('dataset/KDDTest+.txt',
                        header=None, names=columns)
test_data.drop(['difficulty_level', 'num_outbound_cmds'],
               axis=1, inplace=True)

# Encode the categorical features of the test data
le = LabelEncoder()
test_data['protocol_type'] = le.fit_transform(test_data['protocol_type'])
test_data['service'] = le.fit_transform(test_data['service'])
test_data['flag'] = le.fit_transform(test_data['flag'])

test_data['label'] = test_data['label'].apply(
    lambda x: 0 if x == 'normal' else 1)

X_test = test_data.drop(['label'], axis=1)
y_test = test_data['label']

# Reuse the scaler fitted on the training data
X_test = scaler.transform(X_test)

y_pred = svm_model.predict(X_test)
print('Testing Phase accuracy: ', accuracy_score(y_test, y_pred))
print('Classification Report:\n',
      classification_report(y_test, y_pred,
                            target_names=["Normal", "Abnormal"]))
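Note that the preceding snippet fits a fresh LabelEncoder on the test file. If a categorical value appears in a different order in the test data, or is missing from it, its numeric code may not match the training encoding. A safer pattern, sketched below under the assumption that both files are preprocessed in the same script (train_data_raw and test_data_raw are hypothetical names for the still-unencoded DataFrames), is to fit one encoder per column on the training data and reuse it:

# One encoder per categorical column, fitted on training data only;
# transform() raises an error on categories unseen during training,
# which is itself a useful signal
encoders = {}
for col in ['protocol_type', 'service', 'flag']:
    encoders[col] = LabelEncoder().fit(train_data_raw[col])
    train_data_raw[col] = encoders[col].transform(train_data_raw[col])
    test_data_raw[col] = encoders[col].transform(test_data_raw[col])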
Here is the performance of the implementation using the test dataset:
Figure 7.3 – Execution of SVM model implementation
This highlights the performance of our model in real-world scenarios, showcasing an accuracy of approximately 80%. The observed deviation in performance signals the presence of overfitting, a common occurrence under such circumstances. Notably, feature selection was not conducted on the dataset, allowing certain features to contribute to overfitting. Additionally, the dataset itself is relatively small. Using an RBF kernel, which is susceptible to overfitting on small datasets, further exacerbates this issue.
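To give a concrete idea of what feature selection could look like here, the following minimal sketch uses scikit-learn's SelectKBest with mutual information to keep only the most informative features before retraining; k=20 is an arbitrary illustrative value, not a tuned one:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the k features with the highest mutual information with the label
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

svm_sel = SVC(kernel='rbf')
svm_sel.fit(X_train_sel, y_train)
print('Accuracy with selected features:',
      svm_sel.score(X_test_sel, y_test))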
No hyperparameter tuning was performed on the kernel itself either (for instance, on the C and gamma parameters of the RBF kernel), which compounds the problem. However, in the final implementation, these issues will be addressed. By integrating feature selection alongside a different algorithm, such as recurrent neural networks (RNNs), and refining the model, we can anticipate achieving performance exceeding 95% in real-world environments.
Deep learning techniques, notably neural networks, can discern intricate patterns and relationships in data. They excel in learning from large datasets and are adaptable to structured and unstructured data. Neural networks have consistently demonstrated high performance in the domain of intrusion detection. Convolutional neural networks (CNNs) are particularly effective in image-based intrusion detection. In contrast, RNNs excel in handling sequential data.