Best practices of feature engineering

In the previous chapters, we looked at different artificial intelligence (AI) algorithms, analyzing their application to different scenarios and use cases in a cybersecurity context. Now the time has come to learn how to evaluate these algorithms, starting from the assumption that algorithms are the foundation of data-driven learning models.

We will therefore have to deal with the very nature of the data, which is the basis of the algorithm learning process; this process aims to make generalizations, in the form of predictions, based on the samples received as input during the training phase.

The choice of algorithm will therefore fall on the one that best generalizes beyond the training data, thereby producing the best predictions when facing new data. It is relatively simple to identify an algorithm that fits the training data; the problem becomes more complicated when the algorithm must make correct predictions on data it has never seen before. Indeed, we will see that the tendency to optimize the accuracy of the algorithm's predictions on the training data gives rise to the phenomenon known as overfitting, where predictions become worse when dealing with new test data.

It therefore becomes important to understand how to correctly perform the algorithm training, from the selection of the training dataset up to the correct tuning of the learning parameters characterizing the chosen algorithm.

There are several methods available for performing algorithm training. One uses only the original training dataset, dividing it into two separate subsets, one for training and one for testing, and choosing a suitable percentage of the original dataset to assign to each subset.

Another strategy is based on cross validation, which, as we will see, consists of randomly dividing the training dataset into a number of subsets, training and testing the algorithm on different combinations of them, and averaging the results obtained in order to verify the accuracy of its predictions.
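
As a minimal sketch of both strategies (the use of scikit-learn's train_test_split and cross_val_score helpers, and the choice of LogisticRegression on the iris dataset, are illustrative assumptions rather than this chapter's own setup):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Strategy 1: hold out a percentage of the dataset (here 30%) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))

# Strategy 2: 5-fold cross validation, averaging accuracy over the folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())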

Better algorithms or more data?

To make correct predictions (which, in turn, are nothing more than generalizations starting from sample data), data alone is not enough; it must be combined with algorithms (which, in turn, are nothing more than representations of the data). In practice, however, we often face a dilemma when improving our predictions: should we design a better algorithm, or do we just need to collect more data? The answer to this question has changed over time: when research in the field of AI began, the emphasis was on the quality of the algorithms, since the availability of data was constrained by the cost of storage.

With the reduction in storage costs in recent years, we have witnessed an unprecedented explosion in the availability of data, which has given rise to new analytical techniques based on big data, and the emphasis has consequently shifted to the availability of data. However, as the amount of available data increases, the time required to analyze it increases accordingly, so in choosing between the quality of the algorithms and the amount of training data, we face a trade-off.

In general, practical experience shows us that even a dumb algorithm powered by large amounts of data is able to produce better predictions than a clever algorithm fed with less data.

However, the very nature of the data is often the element that makes the difference.

The very nature of raw data

The emphasis given to the relevance of data often resonates in the motto let the data speak for itself. In reality, data is almost never able to speak for itself, and when it does, it usually deceives us. Raw data is nothing more than fragments of information that behave like pieces of a puzzle for which we do not (yet) know the bigger picture.

To make sense of the raw data, we therefore need models that help us to distinguish the necessary pieces (the signal) from the useless pieces (the noise), in addition to identifying the missing pieces to complete our puzzle.

The models, in the case of AI, take the form of mathematical relations between features, through which we are able to show the different aspects and functions that the data represents, based on the purpose we intend to achieve with our analysis. In order for raw data to be fed into our mathematical models, it must first be treated appropriately, thereby becoming the features of our models. A feature, in fact, is nothing but a numerical representation of the raw data.

Raw data, for example, often does not occur in numerical form. However, representation in numerical form is a necessary prerequisite for data to be processed by algorithms; therefore, we must convert raw data into numerical form before feeding it to them.

Feature engineering to the rescue

Therefore, in implementing our predictive models, we must not limit ourselves to choosing the algorithm(s); we must also define the features required to power them. The correct definition of features is thus an essential step, both for the achievement of our goals and for the efficiency of our predictive model's implementation.

As we have said, features constitute the numerical representations of the raw data. There are obviously different ways to convert raw data into numerical form, and these are distinguished according to the nature of the raw data and the type of algorithm chosen. Different algorithms, in fact, require different features in order to work.

The number of features is also equally important for the predictive performance of our model. Choosing the quality and quantity of features therefore constitutes a preliminary process, known as feature engineering.

Dealing with raw data

A first screening is conducted on the basis of the nature of the numerical values to be associated with our models. We should ask ourselves whether the values we require are only positive, or can also be negative, or just Boolean; whether we can limit ourselves to certain orders of magnitude; whether we can determine in advance the maximum and minimum values that the features can assume; and so on.

We can also artificially create complex features, starting from simple features, in order to increase the explanatory, as well as the predictive, capacity of our models.

Here are some of the most common transformations for converting raw data into model features:

  • Data binarization
  • Data binning
  • Logarithmic data transformation

We will now examine each of these transformations in detail.

Data binarization

One of the most basic transformations, based on counting raw data, is binarization, which consists of assigning the value 1 to all counts greater than 0, and the value 0 in the remaining cases. To understand the usefulness of binarization, consider the development of a predictive model whose goal is to predict user preferences based on video views. We could decide to assess the preferences of individual users simply by counting their respective video views; the problem, however, is that the order of magnitude of the views varies with the habits of the individual users.

Therefore, the absolute number of views, that is, the raw count, does not constitute a reliable measure of the degree of preference accorded to each video. Some users habitually replay the same video without paying particular attention to it, while other users prefer to focus their attention, thereby reducing their number of views.

Moreover, the different orders of magnitude of per-user view counts, varying from tens to hundreds or even thousands of views depending on a user's habits, make some statistical measurements, such as the arithmetic mean, less representative of individual preferences.

Instead of using the raw count of views, we can binarize the counts, associating the value 1 with all videos that obtained a number of views greater than 0, and the value 0 otherwise. The measure obtained in this way is a more efficient and robust indicator of individual preferences.
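
As a minimal sketch of this transformation (the view counts below are made up purely for illustration), binarization takes a single line with NumPy:

import numpy as np

# Hypothetical per-user view counts for five videos
view_counts = np.array([0, 3, 120, 0, 1])

# Assign 1 to every count greater than 0, and 0 otherwise
binarized = (view_counts > 0).astype(int)
print(binarized)  # [0 1 1 0 1]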

Data binning

Managing the different orders of magnitude of counts is a problem that occurs in many situations, and many algorithms behave badly when faced with data exhibiting a wide range of values, such as clustering algorithms that measure similarity on the basis of Euclidean distance.

In a similar way to binarization, it is possible to reduce the dimensional scale of the data by grouping the raw counts into containers called bins of fixed amplitude (fixed-width binning), sorted in ascending order, whose widths scale either linearly or exponentially.
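
As a minimal sketch of both variants (the counts and bin edges below are illustrative assumptions), NumPy's digitize() assigns each count to a linearly spaced bin, while taking the floor of the base-10 logarithm groups counts into exponentially sized bins:

import numpy as np

view_counts = np.array([3, 18, 250, 999, 12000])

# Linear fixed-width bins: [0, 100), [100, 1000), [1000, ...)
bin_edges = np.array([100, 1000])
print(np.digitize(view_counts, bin_edges))          # [0 0 1 1 2]

# Exponentially sized bins: one bin per order of magnitude
print(np.floor(np.log10(view_counts)).astype(int))  # [0 1 2 2 4]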

Logarithmic data transformation

Similarly, it is possible to reduce the magnitude of raw data counts by replacing their absolute values with logarithms.

A distinctive property of the logarithmic transformation is precisely that it reduces the relevance of larger values while amplifying smaller ones, thereby achieving a more uniform distribution of values.

In addition to logarithms, it is possible to use other power functions that stabilize the variance of a data distribution (such as the Box–Cox transformation).
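
As a minimal sketch (the counts are again illustrative), NumPy's log1p() computes log(1 + x), which also handles zero counts gracefully:

import numpy as np

view_counts = np.array([0, 3, 120, 12000])

# Large values are compressed, while small values are spread apart
print(np.log1p(view_counts))  # approximately [0. 1.3863 4.7958 9.3927]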

Data normalization

Also known as feature normalization or feature scaling, data normalization improves the performance of algorithms that can be influenced by the scale of input values.

The following are the most common examples of feature normalization.

Min–max scaling

With the min–max scaling transformation, we make the data fall within a limited range of values: between 0 and 1.

The transformation involves replacing the original values with values calculated using the following formula:

x' = (x - x_min) / (x_max - x_min)

Here, x_min represents the minimum value of the entire distribution, and x_max the maximum value.

Variance scaling

Another very common data normalization method involves subtracting the mean of the distribution from each single value, and then dividing the result by the standard deviation of the distribution.

Following normalization (also known as standardization), the distribution of the recalculated data shows a mean equal to 0 and a variance equal to 1.

The formula for variance scaling is as follows:

x' = (x - μ) / σ

Here, μ represents the mean of the distribution, and σ its standard deviation.

How to manage categorical variables

Raw data can be represented by categorical variables, which take non-numerical values.

A typical example of a categorical variable is nationality. In order to manage categorical variables mathematically, we need some form of transformation of the categories into numerical values, also known as encoding.

The following are the most common methods of categorical encoding.

Ordinal encoding

An intuitive approach to encoding is to assign a single progressive integer value to each category; for example, we might map the categories Linux, Mac, and Windows to the values 0, 1, and 2, respectively.

The advantage, and at the same time the disadvantage, of this encoding method is that the transformed values can be numerically ordered, even when this numerical ordering has no real meaning for the underlying categories.

One-hot encoding

With the one-hot encoding method, a set of bits is assigned to each variable, with each bit representing a distinct category.

Since each sample belongs to exactly one category, exactly one bit in the set is equal to 1, hence the name one-hot. Continuing our example, the categories Linux, Mac, and Windows could be encoded with three bits as 100, 010, and 001, respectively.

Dummy encoding

The one-hot encoding method actually wastes a bit that is not strictly necessary: with n categories, n - 1 bits are enough, since the remaining (reference) category can be represented by setting all bits to 0. The dummy encoding method eliminates this redundant bit; continuing our example, Linux, Mac, and Windows could be encoded with two bits as 00, 10, and 01, respectively.
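
As a minimal sketch (assuming a recent scikit-learn version, 0.21 or later, where OneHotEncoder accepts the drop parameter), a dummy encoding can be obtained by dropping the first (reference) category of each column:

from sklearn import preprocessing

# drop='first' removes the redundant bit of one-hot encoding:
# the first category of each column is represented as all zeros
dummy_enc = preprocessing.OneHotEncoder(drop='first')
os_data = [['Linux'], ['Mac'], ['Windows']]
dummy_enc.fit(os_data)
print(dummy_enc.transform(os_data).toarray())
# [[0. 0.]
#  [1. 0.]
#  [0. 1.]]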

Feature engineering examples with sklearn

Now let's look at some examples of feature engineering implementation using the NumPy library and the preprocessing package of the scikit-learn library.

Min–max scaler

In the following code, we see an example of feature engineering using the MinMaxScaler class of scikit-learn, aimed at scaling features to lie within a given range of values (between a minimum and a maximum), such as 0 and 1:

from sklearn import preprocessing
import numpy as np

raw_data = np.array([
    [ 2., -3., 4.],
    [ 5., 0., 1.],
    [ 4., 0., -2.]])

# Scale each feature (column) to the default [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
scaled_data = min_max_scaler.fit_transform(raw_data)
print(scaled_data)

Standard scaler

The following example shows the StandardScaler class of scikit-learn in action; the fit() method computes the mean and standard deviation of a training set, so that the same transformation can later be applied to both the training data and new data using the transform() method:

from sklearn import preprocessing
import numpy as np

raw_data = np.array([
    [ 2., -3., 4.],
    [ 5., 0., 1.],
    [ 4., 0., -2.]])

# fit() computes the per-column mean and standard deviation
std_scaler = preprocessing.StandardScaler().fit(raw_data)

# transform() standardizes the training data to zero mean and unit variance
print(std_scaler.transform(raw_data))

# The fitted scaler applies the same transformation to new data
test_data = [[-3., 1., 2.]]
print(std_scaler.transform(test_data))

Power transformation

In the following example, we see the PowerTransformer class of scikit-learn in action, applying a Box–Cox transformation to map the data to a more Gaussian-like distribution (here with standardize=False, so the zero-mean, unit-variance normalization of the transformed output is disabled):

from sklearn import preprocessing
import numpy as np

# Box-Cox requires strictly positive input values;
# standardize=False skips the zero-mean, unit-variance normalization
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)

# Sample data drawn from a lognormal distribution (strictly positive)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
print(pt.fit_transform(X_lognormal))

Ordinal encoding with sklearn

In the following example, we see how to encode categorical features into integers using the OrdinalEncoder class of scikit-learn and its transform() method:

from sklearn import preprocessing

ord_enc = preprocessing.OrdinalEncoder()
cat_data = [['Developer', 'Remote Working', 'Windows'],
            ['Sysadmin', 'Onsite Working', 'Linux']]

# fit() learns the categories of each column; transform() maps them to integers
ord_enc.fit(cat_data)
print(ord_enc.transform([['Developer', 'Onsite Working', 'Linux']]))

One-hot encoding with sklearn

The following example shows how to transform categorical features into binary representation, making use of the OneHotEncoder class of scikit-learn:

from sklearn import preprocessing

one_hot_enc = preprocessing.OneHotEncoder()
cat_data = [['Developer', 'Remote Working', 'Windows'],
            ['Sysadmin', 'Onsite Working', 'Linux']]
one_hot_enc.fit(cat_data)

# transform() returns a sparse matrix; toarray() shows the binary representation
print(one_hot_enc.transform([['Developer', 'Onsite Working', 'Linux']]).toarray())
# [[1. 0. 1. 0. 1. 0.]]

Having described feature engineering best practices, we can now move on to evaluating the performance of our models.
