Exercise 1 – malware detection
In this exercise, we will consider a dataset of executable files and attempt to find the malicious ones among them. We will achieve this by leveraging the Random Forest algorithm. Before delving into the code section by section, let's explain our rationale for selecting Random Forest. Random Forest is an ensemble learning algorithm that constructs numerous decision trees and combines their predictions to improve accuracy and mitigate overfitting. It is known for its robustness in ML applications for malware detection: it handles large datasets well, which is critical in an ML implementation, and its consistent generalization across diverse datasets makes it preferable to many alternative algorithms for this task. However, the optimal algorithm for malware detection ultimately hinges on dataset characteristics, task-specific requirements, and available computational resources. In practice, antivirus solutions often leverage ensemble models comprising a blend of algorithms to bolster detection efficacy.
The dataset employed in the current implementation originates from Kaggle, with a size of approximately 6.40 MB. Despite its modest size, this dataset boasts meticulously selected features, setting it apart from others. In real-world scenarios, a custom dataset akin to the current one, albeit larger, would be developed. For now, let’s dissect our implementation section-wise.
The initial step entails importing essential libraries. numpy and pandas facilitate data manipulation, while pickle handles model serialization, that is, saving the trained model in a designated format. scikit-learn, which provides train_test_split, RandomForestClassifier, and classification_report, is employed for ML tasks such as dataset partitioning, model training, and performance assessment. Here is the code:
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
We now load the dataset_train.csv file, converting it into a pandas DataFrame named data; DataFrames facilitate efficient data handling. In the next two lines, we print essential information about the dataset, such as its columns (features), number of rows, and size (one of the lines is commented out to minimize end user output). We then instantiate a new DataFrame named df with identical content to the preceding one (no alterations made). The next step is feature selection during the preprocessing phase. Finally, the last two lines separate the dataset features to be utilized (X) from the corresponding labels (y). Specifically, we drop the Name, Machine, TimeDateStamp, and Malware columns, as they were deemed uninformative for model training for various reasons (e.g., static values and redundancy). Note that most feature selection was conducted before implementation, since the dataset already contains carefully chosen features that contribute to the model's high performance; hence, only four columns are omitted during this process. The Malware column is designated as the y variable. It simply represents the classification of a sample, with values 0 and 1, where 1 denotes a malicious sample and 0 a benign one. Refer to the code block here:
data = pd.read_csv('dataset/dataset_train.csv')
#print(data.info())
print(data.head(10))

df = pd.DataFrame(data)

X = df.drop(['Name', 'Machine', 'TimeDateStamp', 'Malware'], axis=1)
y = df['Malware']
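As a quick, optional sanity check that is not part of the original listing, the dropped columns can be inspected before discarding them: an identifier such as Name tends to be unique per sample, while a field such as Machine often takes only a handful of values, so neither carries useful discriminative signal.

# Optional check (not in the original code): count distinct values
# in the columns that are about to be dropped.
for col in ['Name', 'Machine', 'TimeDateStamp']:
    print(f'{col}: {data[col].nunique()} unique values')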
In the subsequent section, we partition the dataset into training and testing subsets using the train_test_split function. We opted for an 80-20 split (a test_size of 0.2) between training and testing data. This ratio can be adjusted, from 70-30 to 85-15, or even 90-10; through experimentation, we found that an 80-20 split yielded the most favorable results. To ensure reproducibility of the model results, the random_state parameter is set to 42. Additionally, the stratify parameter preserves the target variable's distribution in both the training and testing datasets. In the subsequent line, we simply print the count of features used:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f'Number of features used from the dataset is {X_train.shape[1]}')
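To see the effect of stratify in practice, the class proportions of the two splits can be compared. This is an optional check, not part of the original listing, and should show nearly identical distributions:

# Optional check: stratified splitting keeps the 0/1 proportions of the
# Malware label nearly identical in the training and testing subsets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))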
We train the model by initializing a Random Forest classifier called rfc. This classifier comprises 100 trees, specified through the n_estimators parameter, and a designated random seed set via the random_state parameter. Additionally, we incorporate out-of-bag (OOB) scoring and restrict the maximum depth of each tree to 16 using the max_depth parameter. Subsequently, the model undergoes training using the training data:
rfc = RandomForestClassifier(n_estimators=100, random_state=0,
                             oob_score=True, max_depth=16)
rfc.fit(X_train, y_train)
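Because oob_score=True was passed to the classifier, an out-of-bag estimate of generalization accuracy becomes available once training completes. The following line is a small optional addition, not part of the original listing, that prints it:

# OOB estimate computed from the samples left out of each tree's bootstrap
print(f'OOB score: {rfc.oob_score_:.4f}')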
The first line in the following code section generates predictions based on the test set. As a reminder, we previously partitioned the data into an 80-20 ratio. In the second line, we create a classification report. This report encompasses various metrics, including accuracy, precision, recall, and F1-score, among others, for both the Benign and Malware labels, as shown here:
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=['Benign', 'Malware']))
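If a more granular view of the errors is desired, the classification report can be complemented with a confusion matrix. This is an optional sketch rather than part of the original implementation:

from sklearn.metrics import confusion_matrix
# Rows correspond to the true labels (Benign, Malware),
# columns to the predicted labels.
print(confusion_matrix(y_test, y_pred))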
We now load the test dataset stored in CSV format from the dataset_test.csv file into a DataFrame. It's important to note that this dataset comprises unlabeled data, that is, samples without explicit classifications as malicious or benign. Following the loading process, we drop any features not utilized. Finally, leveraging the trained Random Forest model, we make predictions to classify the samples as malicious or benign:
test_data = pd.read_csv("dataset/dataset_test.csv")
X_pred = test_data.drop(['Name', 'Machine', 'TimeDateStamp'], axis=1)
predictions = rfc.predict(X_pred)
In the following code section, we establish a new DataFrame labeled results. This DataFrame encompasses the filenames of the samples alongside their corresponding model predictions, which classify each sample as malicious (1) or benign (0). After that, the results are printed for reference. Finally, the trained Random Forest model is serialized using pickle and stored in a file named RF_model.pkl. This enables the model to be reloaded later for reuse without necessitating retraining; the load function from the pickle library loads a model saved in a pickle file for subsequent use. This is shown with the following code:
results = pd.DataFrame({'File Name': test_data['Name'],
                        'Model Prediction': predictions})
print(results)

pkl_filename = "RF_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(rfc, file)
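For completeness, here is a minimal sketch of how the serialized model could later be reloaded with pickle.load and reused, as described above; the variable names are illustrative:

# Reload the saved model and reuse it without retraining (illustrative sketch)
with open("RF_model.pkl", 'rb') as file:
    loaded_rfc = pickle.load(file)
new_predictions = loaded_rfc.predict(X_pred)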
The following is a sample of the program execution:
Figure 7.1 – Execution of the Random Forest implementation
As illustrated here, we've achieved accuracy and precision exceeding 99%, along with strong results across the other metrics. A larger dataset may introduce some variability in the results, but we can anticipate only a minimal drop compared to the outcomes observed so far. It's worth noting that the dataset utilized in this demonstration is a demo dataset of less than 10 MB. In a practical scenario, we would generate another dataset with identical features but of a larger size. To accomplish this, we would initially gather Windows executable samples from specific platforms, such as https://virusshare.com/, utilizing the platform's API. After amassing a diverse array of samples, we would employ a Python program to extract the pertinent features, as sketched below.
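As a rough illustration of that final step, the following sketch uses the third-party pefile library to pull a few PE header fields resembling those in the dataset. This is an assumption on our part, since the chapter does not prescribe a specific extraction tool, and a real extractor would need to cover the full feature set used during training; the folder name and output path are hypothetical.

import os
import pandas as pd
import pefile  # third-party PE parser (pip install pefile); an assumed choice

def extract_features(path):
    # Parse a Windows executable and return a few header-level features
    pe = pefile.PE(path)
    return {
        'Name': os.path.basename(path),
        'Machine': pe.FILE_HEADER.Machine,
        'TimeDateStamp': pe.FILE_HEADER.TimeDateStamp,
        'NumberOfSections': pe.FILE_HEADER.NumberOfSections,
        'SizeOfImage': pe.OPTIONAL_HEADER.SizeOfImage,
        'AddressOfEntryPoint': pe.OPTIONAL_HEADER.AddressOfEntryPoint,
    }

samples_dir = 'samples'  # hypothetical folder of collected executables
rows = [extract_features(os.path.join(samples_dir, f))
        for f in os.listdir(samples_dir)]
pd.DataFrame(rows).to_csv('dataset/custom_dataset.csv', index=False)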