Artificial Intelligence for Cybersecurity: Develop AI approaches to solve cybersecurity problems in your organization

Product type: Paperback
Published: Oct 2024
Publisher: Packt
ISBN-13: 9781805124962
Length: 358 pages
Edition: 1st Edition
Authors (4): Bojan Kolosnjaji, Apostolis Zarras, Huang Xiao, Peng Xu

Exercise 2 – network intrusion detection

Applying AI to intrusion detection offers benefits similar to those of AI-based malware detection. In the following exercise, we will attempt to detect malicious traffic using support vector machines (SVMs). SVMs have several advantages for intrusion detection systems. They perform well in high-dimensional feature spaces, which are common in network traffic data. Thanks to margin maximization and regularization, they are also less prone to overfitting than some alternative algorithms, which helps when labeled data is limited, as is often the case in intrusion detection. SVMs are also fairly tolerant of irrelevant features, which matters because some features contribute little to attack detection. Finally, with per-class weighting, they can cope with imbalanced datasets, an important consideration because intrusion detection datasets often contain few attack instances.
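
As a minimal sketch of how class imbalance can be handled with an SVM in scikit-learn, the following example computes the "balanced" class weights and passes the same setting to SVC. The toy labels and features are assumptions made for illustration and are not part of the exercise; the exercise code itself does not use class weighting.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 normal (0) and 10 attack (1) samples.
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * count_per_class),
# so the rare attack class receives the larger weight.
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)

# Toy features (assumption): one column, shifted for the attack class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) + y.reshape(-1, 1) * 1.5

# class_weight='balanced' scales the penalty C per class by these weights.
weighted_svm = SVC(kernel='rbf', class_weight='balanced').fit(X, y)
```

With this setting, misclassifying a rare attack sample costs more during training than misclassifying a normal one, which usually trades some precision for higher recall on the minority class.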

The dataset employed in this implementation is the NSL-KDD dataset, an updated iteration of the original KDD Cup 1999 dataset. While the KDD Cup 1999 dataset was extensively used to evaluate intrusion detection systems, it presented shortcomings. NSL-KDD rectifies these issues, offering a more realistic depiction of network traffic and a more challenging evaluation environment for intrusion detection systems. Other suitable datasets include UNSW-NB15, CICIDS2017, Kyoto 2006+, and CSE-CIC-IDS2018. The NSL-KDD dataset has an approximate size of 20 MB.

Similarly to the previous exercise, we will begin by importing the requisite libraries. These include standard libraries such as numpy and pandas for numerical operations and data handling, pickle for model serialization, and scikit-learn for preprocessing, model training, and evaluation. Additionally, in this scenario, we import support vector classification (SVC), an implementation of SVM tailored for classification tasks in scikit-learn. This is shown here:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import pickle

The next section of code loads the dataset and preprocesses it so that it is compatible with SVMs and yields better performance. First, we define the columns (features) of the dataset. Since the KDDTrain+.txt file has no header row, we specify the column names manually, based on the dataset description, the content, and our domain knowledge. The most important column is label, which is used for classification and distinguishes normal from abnormal instances. The dataset poses a multiclass classification problem: the label column takes many values, such as normal traffic, DoS attacks, and brute-force attacks. To turn this into a binary classification problem, we transform the label column so that every instance is categorized as either normal or abnormal.

After defining the columns, we load the dataset into a pandas DataFrame. We then drop columns that are unusable or that could degrade the model’s performance. For instance, the num_outbound_cmds column holds a constant value and therefore carries no information. We then use the LabelEncoder class to convert the categorical variables protocol_type, service, and flag into numerical form, which SVMs require.

Lastly, we convert the label column into the binary values normal and abnormal and then map them to the numerical values 0 and 1, respectively, for use with the SVM. Before the mapping, we print the number of instances in each class to gain insight into the dataset’s distribution:

# Column names for NSL-KDD, which ships without a header row
columns = ['duration','protocol_type','service','flag',
    'src_bytes','dst_bytes','land','wrong_fragment','urgent',
    'hot','num_failed_logins','logged_in','num_compromised',
    'root_shell','su_attempted','num_root','num_file_creations',
    'num_shells','num_access_files','num_outbound_cmds',
    'is_host_login','is_guest_login','count','srv_count',
    'serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate',
    'same_srv_rate','diff_srv_rate','srv_diff_host_rate',
    'dst_host_count','dst_host_srv_count',
    'dst_host_same_srv_rate','dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate','dst_host_srv_diff_host_rate',
    'dst_host_serror_rate','dst_host_srv_serror_rate',
    'dst_host_rerror_rate','dst_host_srv_rerror_rate',
    'label','difficulty_level']
train_data = pd.read_csv(
    'dataset/KDDTrain+.txt', header=None, names=columns)
# Drop the columns that are not used as features
train_data.drop(
    ['difficulty_level','num_outbound_cmds'],
    axis=1,
    inplace=True)
# Encode the categorical columns as integers; fit_transform refits
# the encoder for each column
le = LabelEncoder()
train_data['protocol_type'] = le.fit_transform(
    train_data['protocol_type'])
train_data['service'] = le.fit_transform(
    train_data['service'])
train_data['flag'] = le.fit_transform(train_data['flag'])
# Collapse all attack types into a single 'abnormal' class
train_data['label'] = train_data['label'].apply(
    lambda x: 'normal' if x == 'normal' else 'abnormal')
print(train_data['label'].value_counts())
# Map the binary labels to 0 (normal) and 1 (abnormal)
train_data['label'] = train_data['label'].apply(
    lambda x: 0 if x == 'normal' else 1)
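
One caveat worth illustrating: LabelEncoder assigns integer codes based on the sorted set of values it was fitted on, so refitting it on a different dataset can assign different codes to the same category. The following sketch, using toy protocol values that are assumptions for illustration, contrasts reusing a fitted encoder with refitting one.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values (assumption): the training data sees three protocols,
# the later data only two of them.
train_protocols = ['tcp', 'udp', 'icmp', 'tcp']
later_protocols = ['udp', 'tcp']

# Fit once on the training values and keep the encoder for reuse.
protocol_encoder = LabelEncoder().fit(train_protocols)
consistent = protocol_encoder.transform(later_protocols)

# Refitting on the later values silently assigns different codes.
refit = LabelEncoder().fit_transform(later_protocols)

print(list(protocol_encoder.classes_))
print(list(consistent), list(refit))
```

In this toy case, the reused encoder keeps udp and tcp at the codes learned during training, whereas the refitted encoder renumbers them from zero. Keeping one fitted encoder per categorical column avoids this mismatch.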

In the following code, the dataset is divided into features (X) and labels (y) and then partitioned into training and testing subsets using a 70-30 split. Alternatively, we can omit the split, train the model on the entire KDDTrain+ dataset, and reserve a separate dataset, such as KDDTest+, exclusively for testing:

X = train_data.drop(['label'], axis=1)
y = train_data['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40)
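
As an optional variation, train_test_split accepts a stratify argument that preserves the class ratio in both subsets, which can matter when attacks are rare. The sketch below uses toy data (an assumption for illustration) rather than NSL-KDD; the exercise code itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 20 of them labeled as attacks.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=40, stratify=y_toy)

# Both subsets keep the 80/20 class ratio of the full dataset.
print(y_tr.mean(), y_te.mean())
```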

Next, we standardize the features with the StandardScaler class. The scaler is fitted on the training subset only and then applied unchanged to the test subset:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
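
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of what standardization does for a single feature. The values are toy numbers chosen for illustration (an assumption), and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

# Toy values (assumption): one feature, e.g. src_bytes, in two subsets.
train_values = [100.0, 200.0, 300.0]
test_values = [200.0, 400.0]

# "fit": learn the mean and population standard deviation
# from the training data only.
mean = fmean(train_values)
std = pstdev(train_values)

# "transform": apply the same training statistics to both subsets.
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training statistics, no information about the test subset leaks into preprocessing.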

Next, we initialize an SVM model with a radial basis function (RBF) kernel and train it on the standardized training data. The choice of kernel matters in SVMs, as different kernels perform differently depending on the dataset. In our scenario, the rbf kernel outperformed the other SVM kernels we tested:

svm_model = SVC(kernel='rbf')
svm_model.fit(X_train, y_train)
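
For intuition, the RBF kernel measures the similarity of two feature vectors as exp(-gamma * squared distance). The following pure-Python sketch illustrates the formula; the vectors and the gamma value are hypothetical, and in recent scikit-learn versions SVC instead defaults to gamma='scale', which derives gamma from the number of features and the variance of the training data.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity exp(-gamma * ||x - z||^2) of two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Toy standardized feature vectors (assumption).
a = [0.0, 1.0]
b = [0.0, 1.0]
c = [3.0, -2.0]

gamma = 0.5  # hypothetical value; larger gamma makes the kernel more local
print(rbf_kernel(a, b, gamma))  # identical points give the maximum similarity
print(rbf_kernel(a, c, gamma))  # distant points approach zero
```

A large gamma lets each training point influence only its immediate neighborhood, which fits the training data closely but raises the risk of overfitting.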

We will now make predictions on the test subset with our trained model and print several performance metrics: the accuracy on the training subset, the accuracy on the test subset, and the overall model accuracy computed with accuracy_score, which matches the test accuracy. A classification report is also generated, covering precision, recall, and F1-score. Finally, we save the trained model with the pickle library for future use:

y_pred = svm_model.predict(X_test)
print('Training accuracy: ', svm_model.score(X_train, y_train))
print('Testing accuracy: ', svm_model.score(X_test, y_test))
print('SVM Model accuracy: ',
    np.round(accuracy_score(y_test,y_pred),6))
print('Classification Report:\n',
    classification_report(
        y_test,y_pred,target_names=['Normal','Abnormal']
    )
)
pkl_filename = "SVM_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(svm_model, file)
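
The code above pickles only the SVM. Since the model expects standardized input, one possible refinement is to store the fitted scaler alongside it. The sketch below uses toy stand-ins for both objects (assumptions for illustration) and serializes to memory; writing to a file works the same way with pickle.dump. Pickle files should only be loaded from trusted sources.

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins (assumption) for the fitted scaler and model.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_scaler = StandardScaler().fit(X_toy)
toy_model = SVC(kernel='rbf').fit(toy_scaler.transform(X_toy), y_toy)

# Store both objects together so new data can be scaled the same way.
bundle_bytes = pickle.dumps({'scaler': toy_scaler, 'model': toy_model})

# Later: restore the bundle and score unseen rows.
bundle = pickle.loads(bundle_bytes)
new_rows = rng.normal(size=(5, 3))
predictions = bundle['model'].predict(
    bundle['scaler'].transform(new_rows))
print(predictions)
```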

The following is a sample of our program execution:

Figure 7.2 – Execution of SVM model implementation

As shown in the preceding figure, the model reaches about 99% accuracy. However, this result was obtained on a held-out portion of the training dataset, not on a separate test dataset.

Next, we provide source code for loading a separate dataset to test our implementation. Note that the earlier code already partitions the training dataset into train and test subsets. When evaluating on the separate test dataset, we should remove the train-test partitioning code, train on the entire KDDTrain+ dataset, and use KDDTest+ exclusively for testing. Otherwise, the model is trained on only 70% of the available training data, which may adversely impact its performance.

print('Starting Testing Process using the Test Dataset....')
test_data = pd.read_csv('dataset/KDDTest+.txt',
    header=None, names=columns)
test_data.drop(['difficulty_level', 'num_outbound_cmds'],
    axis=1, inplace=True)
# Fit each encoder on the raw training values so that categories get
# the same integer codes as during training. transform() raises a
# ValueError if the test set contains a category unseen in training.
train_raw = pd.read_csv('dataset/KDDTrain+.txt',
    header=None, names=columns)
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder().fit(train_raw[col])
    test_data[col] = le.transform(test_data[col])
# Map the labels to 0 (normal) and 1 (abnormal)
test_data['label'] = test_data['label'].apply(
    lambda x: 0 if x == 'normal' else 1)
X_test = test_data.drop(['label'], axis=1)
y_test = test_data['label']
# Reuse the scaler fitted on the training data
X_test = scaler.transform(X_test)
y_pred = svm_model.predict(X_test)
print('Testing Phase accuracy: ', accuracy_score(y_test, y_pred))
print('SVM Model accuracy: ', accuracy_score(y_test, y_pred))
print('Classification Report:\n',
    classification_report(y_test, y_pred,
        target_names=["Normal", "Abnormal"]))
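
To clarify what the classification report contains, the following pure-Python sketch derives accuracy, precision, recall, and F1-score for the abnormal class from toy predictions. The labels are assumptions made for illustration and are not results from the dataset.

```python
# Toy labels (assumption): 1 = abnormal, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # attacks caught
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # attacks missed

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)  # share of alerts that are real attacks
recall = tp / (tp + fn)     # share of attacks that raise an alert
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

In this toy case, the accuracy of 0.7 hides the fact that half of the attacks are missed (recall of 0.5), which is why recall and F1-score deserve attention alongside accuracy in intrusion detection.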

Here is the performance of the implementation using the test dataset:

Figure 7.3 – Execution of SVM model implementation

These results give a more realistic picture of how the model performs on unseen data, with an accuracy of approximately 80%. The gap between the two results points to overfitting, which is common under such circumstances. No feature selection was conducted on the dataset, so some features may have contributed to overfitting. Additionally, the dataset itself is relatively small, and an RBF kernel, which is susceptible to overfitting on small datasets, exacerbates the issue.

Moreover, the kernel's hyperparameters were not tuned, compounding the problem. These issues will be addressed in the final implementation. By combining feature selection with a different algorithm, such as recurrent neural networks (RNNs), and refining the model, we can anticipate exceeding 95% accuracy in real-world environments.
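
The final implementation is not shown in this excerpt. As a hedged sketch of how feature selection and kernel tuning could be combined while staying with an SVM, the following example wraps scaling, univariate feature selection, and SVC in a scikit-learn Pipeline and searches a small grid. The synthetic data and the grid values are assumptions for illustration, not the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not NSL-KDD.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif)),
    ('svm', SVC(kernel='rbf')),
])

# Hypothetical search grid; sensible ranges depend on the dataset.
param_grid = {
    'select__k': [5, 10, 20],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.01, 0.1],
}

# Cross-validation refits the scaler and selector inside each fold,
# so the validation folds do not leak into preprocessing.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1')
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

Whether such tuning narrows the gap between the two accuracy figures on NSL-KDD would need to be verified experimentally.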

Deep learning techniques, notably neural networks, can discern intricate patterns and relationships in data. They excel in learning from large datasets and are adaptable to structured and unstructured data. Neural networks have consistently demonstrated high performance in the domain of intrusion detection. Convolutional neural networks (CNNs) are particularly effective in image-based intrusion detection. In contrast, RNNs excel in handling sequential data.
