Artificial Intelligence for Cybersecurity

Develop AI approaches to solve cybersecurity problems in your organization

Product type: Paperback
Published: Oct 2024
Publisher: Packt
ISBN-13: 9781805124962
Length: 358 pages
Edition: 1st Edition
Authors: Bojan Kolosnjaji, Apostolis Zarras, Huang Xiao, Peng Xu

Exercise 1 – malware detection

In this exercise, we will consider a dataset of executable files and attempt to find the malicious ones among them using the Random Forest algorithm. Before walking through the code section by section, let’s explain why we selected it. Random Forest is an ensemble learning algorithm: it builds many decision trees and combines their predictions to improve accuracy and reduce overfitting. It is widely used in ML-based malware detection because it scales to large datasets, generalizes well, and resists overfitting, and it tends to perform consistently across different datasets. However, the best algorithm for malware detection depends on the dataset characteristics, the task-specific requirements, and the available computational resources. In practice, antivirus solutions often use ensembles that blend several algorithms to improve detection.
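
To make the idea of combining trees concrete, here is a minimal sketch on synthetic data (a stand-in, not the Kaggle dataset; the names X_demo, y_demo, and forest are invented for illustration). It assumes scikit-learn's RandomForestClassifier, which combines its trees by averaging their predicted class probabilities, and checks that averaging the individual trees by hand reproduces the forest's output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; not the malware dataset from the exercise).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is exposed in estimators_; collect every tree's class probabilities.
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the tree axis to combine the individual predictions.
averaged = per_tree.mean(axis=0)

# The hand-made average agrees with the forest's own probabilities.
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print(matches)
```

Averaging over many trees, each trained on a different bootstrap sample, is what smooths out the overfitting of any single tree.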

The dataset employed in the current implementation originates from Kaggle, with a size of approximately 6.40 MB. Despite its modest size, this dataset boasts meticulously selected features, setting it apart from others. In real-world scenarios, a custom dataset akin to the current one, albeit larger, would be developed. For now, let’s dissect our implementation section-wise.

The first step is to import the required libraries. numpy and pandas handle data manipulation, while pickle serializes the trained model so that it can be saved to disk. From scikit-learn, we use train_test_split, RandomForestClassifier, and classification_report for dataset partitioning, model training, and performance assessment, respectively. Here is the code:

import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

We now load the dataset_train.csv file into a pandas DataFrame named data; DataFrames make the data easy to inspect and manipulate. The next two lines print basic information about the dataset: data.info() reports the columns (the features), the number of rows, and the memory usage, while data.head(10) shows the first 10 rows (the info() call is commented out to reduce output). We then create a DataFrame named df with the same content as data, with no alterations. The next step is feature selection, which is part of the preprocessing phase. The last two statements separate the features to be used (X) from the corresponding labels (y). X is obtained by dropping four columns: Name, Machine, and TimeDateStamp, which were judged uninformative for training (for example, because of static values or redundancy), and Malware, which is the label rather than a feature. Most of the feature selection was done before this implementation, since the dataset already contains a curated set of features that contribute to the model’s high performance; hence, only these three feature columns are omitted here. The Malware column is assigned to y. It holds the class of each sample, where 1 denotes a malicious sample and 0 a benign one. Refer to the code block here:

data = pd.read_csv('dataset/dataset_train.csv')
#print(data.info())
print(data.head(10))
df = pd.DataFrame(data)
X = df.drop(['Name', 'Machine', 'TimeDateStamp', 'Malware'],
    axis=1)
y = df['Malware']
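
As a rough illustration of how columns with static values might be spotted before dropping them, the sketch below uses a tiny hand-made DataFrame. The frame, its values, and the SectionCount column are hypothetical; only the Name, Machine, and Malware column names come from the exercise.

```python
import pandas as pd

# Hypothetical miniature frame; the values and the SectionCount column are invented.
toy = pd.DataFrame({
    "Name": ["a.exe", "b.exe", "c.exe", "d.exe"],
    "Machine": [332, 332, 332, 332],   # identical in every row
    "SectionCount": [3, 5, 4, 6],
    "Malware": [0, 1, 0, 1],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Drop the identifier, the label, and the constant columns to form the features.
X_toy = toy.drop(columns=["Name", "Malware"] + constant_cols)
y_toy = toy["Malware"]
print(list(X_toy.columns))
```

A check like this only catches constant columns; redundancy between features would need a separate look, for example at pairwise correlations.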

In the next section, we partition the dataset into training and testing subsets using the train_test_split function. We chose an 80-20 split (test_size=0.2). This ratio can be adjusted, for example to 70-30, 85-15, or even 90-10; in our experiments, the 80-20 split yielded the most favorable results. To make the results reproducible, the random_state parameter is set to 42. Additionally, the stratify parameter preserves the class distribution of the target variable in both the training and testing subsets. The final statement simply prints the number of features used:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f'Number of features used from the dataset is {X_train.shape[1]}')
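
The effect of the stratify parameter can be seen in a small sketch with made-up, imbalanced labels (the arrays below are hypothetical and unrelated to the exercise's dataset). With stratification, the share of positive samples is the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 750 samples of class 1 and 250 of class 0.
y_demo = np.array([1] * 750 + [0] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Fraction of class 1 in each subset; stratification keeps them aligned.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

Without stratify, the two shares would generally differ slightly from run to run, which matters more as the classes become more imbalanced.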

We train the model by initializing a Random Forest classifier called rfc. This classifier comprises 100 trees, specified through the n_estimators parameter, and a designated random seed set via the random_state parameter. Additionally, we incorporate out-of-bag (OOB) scoring and restrict the maximum depth of each tree to 16 using the max_depth parameter. Subsequently, the model undergoes training utilizing the training data:

rfc = RandomForestClassifier(n_estimators=100, random_state=0,
    oob_score = True, max_depth = 16)
rfc.fit(X_train, y_train)
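
The listing enables OOB scoring but does not read the result. As a minimal sketch on synthetic data (X_demo, y_demo, and rfc_demo are illustrative names, not part of the exercise), the estimate is available after fitting through the oob_score_ attribute:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; not the malware dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=0)

rfc_demo = RandomForestClassifier(n_estimators=100, random_state=0,
                                  oob_score=True, max_depth=16)
rfc_demo.fit(X_demo, y_demo)

# oob_score_ is an accuracy estimate: each training sample is scored using
# only the trees whose bootstrap sample did not include it.
oob_accuracy = rfc_demo.oob_score_
print(f"OOB accuracy estimate: {oob_accuracy:.3f}")
```

Because it relies only on samples a tree did not train on, the OOB score gives a validation-style estimate without setting aside extra data.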

The first line in the following code section involves generating predictions based on the test set. As a reminder, we previously partitioned the data into an 80-20 ratio. In the second line, we create a classification report. This report encompasses various metrics, including accuracy, precision, recall, and F1-score, among others, for both Benign and Malware labels as shown here:

y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred,
    target_names=['Benign', 'Malware']))
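
To show where the precision and recall figures in such a report come from, here is a small sketch with hypothetical labels and predictions (1 for malware, 0 for benign). It derives both metrics for the malware class from the confusion matrix and compares them with scikit-learn's own helpers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = malware, 0 = benign).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the samples flagged as malware, how many really are malware.
precision = tp / (tp + fp)
# Recall: of the real malware samples, how many were flagged.
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
```

In malware detection, low recall means missed malicious files, while low precision means benign files flagged by mistake, so both columns of the report deserve attention.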

We now load the test dataset stored in CSV format from the dataset_test.csv file into a DataFrame. It’s important to note that this dataset comprises unlabeled data containing samples without explicit classifications as malicious or benign. Subsequently, following the loading process, we drop any features not utilized. Finally, leveraging the trained Random Forest model, we make predictions to classify the samples as malicious or benign:

test_data = pd.read_csv("dataset/dataset_test.csv")
X_pred = test_data.drop(['Name', 'Machine', 'TimeDateStamp'],
    axis=1)
predictions = rfc.predict(X_pred)

In the following code section, we create a new DataFrame named results. It holds the filenames of the samples alongside the corresponding model predictions, each of which marks the sample as malicious (1) or benign (0). The results are then printed for reference. Finally, the trained Random Forest model is serialized using pickle and stored in a file named RF_model.pkl, so that it can be reloaded later without retraining. The load function from the pickle library reads a model saved this way for subsequent use. Building the results and saving the model are shown in the following code:

results = pd.DataFrame({'File Name': test_data['Name'],
    'Model Prediction': predictions})
print(results)
pkl_filename = "RF_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(rfc, file)
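
Reloading with pickle.load is not part of the listing above, so here is a minimal round-trip sketch. It trains a small stand-in model on synthetic data and writes it to a temporary file (RF_model_demo.pkl is a hypothetical name chosen so the sketch does not touch RF_model.pkl).

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model on synthetic data (hypothetical; not the exercise's rfc).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory under a demo filename.
path = os.path.join(tempfile.mkdtemp(), "RF_model_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Reload the model; note the binary read mode 'rb'.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model reproduces the original model's predictions.
same = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.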

The following is a sample of the program execution:

Figure 7.1 – Execution of the Random Forest implementation


As illustrated here, we’ve achieved exceptional accuracy and precision exceeding 99%, along with commendable results across other metrics. While the size of a larger dataset may introduce some variability in the results, we can anticipate minimal loss or deviation compared to the outcomes observed thus far. It’s worth noting that the dataset utilized in this demonstration is a demo dataset of less than 10 MB. In a practical scenario, we would generate another dataset with identical features but of a larger size. To accomplish this, we would initially gather Windows-executable samples from specific platforms, such as https://virusshare.com/, utilizing the platform’s specific API. After amassing a diverse array of samples, we would employ a Python program to extract pertinent features.
