Exercise 1 – malware detection
In this exercise, we will consider a dataset of executable files and attempt to find the malicious ones among them. We will achieve this by leveraging the Random Forest algorithm. Before delving into the code section by section, let's explain our rationale for selecting Random Forest. Random Forest is an ensemble learning algorithm that constructs numerous decision trees and combines their predictions to improve accuracy and mitigate overfitting. It is known for its robustness in ML applications for malware detection: it handles large datasets well, which is critical in an ML implementation, and its consistent generalization across diverse datasets makes it preferable to many alternative algorithms for this task. However, the optimal algorithm for malware detection ultimately hinges on dataset characteristics, task-specific requirements, and available computational resources. In practice, antivirus solutions often leverage ensemble models comprising a blend of algorithms to bolster detection efficacy.
The dataset employed in the current implementation originates from Kaggle, with a size of approximately 6.40 MB. Despite its modest size, this dataset boasts meticulously selected features, setting it apart from others. In real-world scenarios, a custom dataset akin to the current one, albeit larger, would be developed. For now, let’s dissect our implementation section-wise.
The initial step entails importing essential libraries. numpy and pandas facilitate data manipulation, while pickle handles model serialization, that is, saving the trained model in a designated format. scikit-learn, which provides train_test_split, RandomForestClassifier, and classification_report, is employed for ML tasks such as dataset partitioning, model training, and performance assessment. Here is the code:
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
We now load the dataset_train.csv file, converting it into a pandas DataFrame named data; DataFrames facilitate efficient data handling. In the next two lines, we print essential information about the dataset, such as its columns (features), number of rows, and size (one of the lines is commented out to minimize end user output). We then instantiate a new DataFrame named df with identical content to the preceding one (no alterations made). The next step is feature selection during the preprocessing phase. Finally, the last two lines separate the dataset features to be utilized (X) from the corresponding labels (y). Specifically, we drop the Name, Machine, TimeDateStamp, and Malware columns, as they were deemed uninformative for model training for various reasons (e.g., static values and redundancy). Note that most feature selection was conducted before implementation, since the dataset already contains carefully chosen features that contribute to the model's high performance; hence, only four columns are omitted during this process. The Malware column is designated as the y variable. It simply represents the classification of a sample, with values 0 and 1, where 1 denotes a malicious sample and 0 a benign one. Refer to the code block here:
data = pd.read_csv('dataset/dataset_train.csv')
#print(data.info())
print(data.head(10))

df = pd.DataFrame(data)

X = df.drop(['Name', 'Machine', 'TimeDateStamp', 'Malware'], axis=1)
y = df['Malware']
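As a quick, optional sanity check that is not part of the original listing, the dropped columns can be inspected before discarding them: an identifier such as Name tends to be unique per sample, while a field such as Machine often takes only a handful of values, so neither carries useful discriminative signal.

# Optional check (not in the original code): count distinct values
# in the columns that are about to be dropped.
for col in ['Name', 'Machine', 'TimeDateStamp']:
    print(f'{col}: {data[col].nunique()} unique values')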
In the subsequent section, we partition the dataset into training and testing subsets using the train_test_split function. We opted for an 80-20 split (a test_size of 0.2) between training and testing data. This ratio can be adjusted, from 70-30 to 85-15, or even 90-10; through experimentation, we found that an 80-20 split yielded the most favorable results. To ensure reproducibility of the model results, the random_state parameter is set to 42. Additionally, the stratify parameter preserves the target variable's distribution in both the training and testing datasets. In the subsequent line, we simply print the count of features used:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f'Number of features used from the dataset is {X_train.shape[1]}')
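To see the effect of stratify in practice, the class proportions of the two splits can be compared. This is an optional check, not part of the original listing, and should show nearly identical distributions:

# Optional check: stratified splitting keeps the 0/1 proportions of the
# Malware label nearly identical in the training and testing subsets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))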
We train the model by initializing a Random Forest classifier called rfc. This classifier comprises 100 trees, specified through the n_estimators parameter, and a designated random seed set via the random_state parameter. Additionally, we incorporate out-of-bag (OOB) scoring and restrict the maximum depth of each tree to 16 using the max_depth parameter. Subsequently, the model undergoes training using the training data:
rfc = RandomForestClassifier(n_estimators=100, random_state=0,
                             oob_score=True, max_depth=16)
rfc.fit(X_train, y_train)
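Because oob_score=True was passed to the classifier, an out-of-bag estimate of generalization accuracy becomes available once training completes. The following line is a small optional addition, not part of the original listing, that prints it:

# OOB estimate computed from the samples left out of each tree's bootstrap
print(f'OOB score: {rfc.oob_score_:.4f}')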
The first line in the following code section generates predictions based on the test set. As a reminder, we previously partitioned the data into an 80-20 ratio. In the second line, we create a classification report. This report encompasses various metrics, including accuracy, precision, recall, and F1-score, among others, for both the Benign and Malware labels, as shown here:
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=['Benign', 'Malware']))
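If a more granular view of the errors is desired, the classification report can be complemented with a confusion matrix. This is an optional sketch rather than part of the original implementation:

from sklearn.metrics import confusion_matrix
# Rows correspond to the true labels (Benign, Malware),
# columns to the predicted labels.
print(confusion_matrix(y_test, y_pred))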
We now load the test dataset stored in CSV format from the dataset_test.csv file into a DataFrame. It's important to note that this dataset comprises unlabeled data, that is, samples without explicit classifications as malicious or benign. Following the loading process, we drop any features not utilized. Finally, leveraging the trained Random Forest model, we make predictions to classify the samples as malicious or benign:
test_data = pd.read_csv("dataset/dataset_test.csv")
X_pred = test_data.drop(['Name', 'Machine', 'TimeDateStamp'], axis=1)
predictions = rfc.predict(X_pred)
In the following code section, we establish a new DataFrame labeled results. This DataFrame encompasses the filenames of the samples alongside their corresponding model predictions, which classify each sample as malicious (1) or benign (0). After that, the results are printed for reference. Finally, the trained Random Forest model is serialized using pickle and stored in a file named RF_model.pkl. This enables the model to be reloaded later for reuse without necessitating retraining; the load function from the pickle library loads a model saved in a pickle file for subsequent use. This is shown with the following code:
results = pd.DataFrame({'File Name': test_data['Name'],
                        'Model Prediction': predictions})
print(results)

pkl_filename = "RF_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(rfc, file)
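For completeness, here is a minimal sketch of how the serialized model could later be reloaded with pickle.load and reused, as described above; the variable names are illustrative:

# Reload the saved model and reuse it without retraining (illustrative sketch)
with open("RF_model.pkl", 'rb') as file:
    loaded_rfc = pickle.load(file)
new_predictions = loaded_rfc.predict(X_pred)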
The following is a sample of the program execution:
Figure 7.1 – Execution of the Random Forest implementation
As illustrated here, we've achieved accuracy and precision exceeding 99%, along with strong results across the other metrics. A larger dataset may introduce some variability in the results, but we can anticipate only a minimal drop compared to the outcomes observed so far. It's worth noting that the dataset utilized in this demonstration is a demo dataset of less than 10 MB. In a practical scenario, we would generate another dataset with identical features but of a larger size. To accomplish this, we would initially gather Windows executable samples from specific platforms, such as https://virusshare.com/, utilizing the platform's API. After amassing a diverse array of samples, we would employ a Python program to extract the pertinent features, as sketched below.
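As a rough illustration of that final step, the following sketch uses the third-party pefile library to pull a few PE header fields resembling those in the dataset. This is an assumption on our part, since the chapter does not prescribe a specific extraction tool, and a real extractor would need to cover the full feature set used during training; the folder name and output path are hypothetical.

import os
import pandas as pd
import pefile  # third-party PE parser (pip install pefile); an assumed choice

def extract_features(path):
    # Parse a Windows executable and return a few header-level features
    pe = pefile.PE(path)
    return {
        'Name': os.path.basename(path),
        'Machine': pe.FILE_HEADER.Machine,
        'TimeDateStamp': pe.FILE_HEADER.TimeDateStamp,
        'NumberOfSections': pe.FILE_HEADER.NumberOfSections,
        'SizeOfImage': pe.OPTIONAL_HEADER.SizeOfImage,
        'AddressOfEntryPoint': pe.OPTIONAL_HEADER.AddressOfEntryPoint,
    }

samples_dir = 'samples'  # hypothetical folder of collected executables
rows = [extract_features(os.path.join(samples_dir, f))
        for f in os.listdir(samples_dir)]
pd.DataFrame(rows).to_csv('dataset/custom_dataset.csv', index=False)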