You're reading from Machine Learning for Cybersecurity Cookbook Over 80 recipes on how to implement machine learning algorithms for building security systems using Python

Product type Paperback

Published in Nov 2019

Publisher Packt

ISBN-13 9781789614671

Length 346 pages

Edition 1st Edition

Languages

Python

Tools

Metasploit

Concepts

Cybersecurity

Author (1):

Emmanuel Tsukerman

View More author details

Natural language processing using a hashing vectorizer and tf-idf with scikit-learn

We often find in data science that the objects we wish to analyze are textual. For example, they might be tweets, articles, or network logs. Since our algorithms require numerical inputs, we must find a way to convert such text into numerical features. To this end, we utilize a sequence of techniques.

A token is a unit of text. For example, we may specify that our tokens are words, sentences, or characters. A count vectorizer takes textual input and then outputs a vector consisting of the counts of the textual tokens. A hashing vectorizer is a variation on the count vectorizer that sets out to be faster and more scalable, at the cost of interpretability and hashing collisions. Though it can be useful, just having the counts of the words appearing in a document corpus can be misleading. The reason is that, often, unimportant words, such as the and a (known as stop words) have a high frequency of occurrence, and hence little informative content. For reasons such as this, we often give words different weights to offset this. The main technique for doing so is tf-idf, which stands for Term-Frequency, Inverse-Document-Frequency. The main idea is that we account for the number of times a term occurs, but discount it by the number of documents it occurs in.

In cybersecurity, text data is omnipresent; event logs, conversational transcripts, and lists of function names are just a few examples. Consequently, it is essential to be able to work with such data, something you'll learn in this recipe.

Getting ready

The preparation for this recipe consists of installing the scikit-learn package in pip. The command for this is as follows:

pip install sklearn

In addition, a log file, anonops_short.log, consisting of an excerpt of conversations taking place on the IRC channel, #Anonops, is included in the repository for this chapter.

How to do it…

In the next steps, we will convert a corpus of text data into numerical form, amenable to machine learning algorithms:

First, import a textual dataset:

with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()

Next, count the words in the text using the hash vectorizer and then perform weighting using tf-idf:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs,)
tf_transformer = TfidfTransformer(use_idf=True,).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

The end result is a sparse matrix with each row being a vector representing one of the texts:

X_train_tf

<180830 x 1048576 sparse matrix of type <class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>

print(X_train_tf)

The following is the output:

How it works...

We started by loading in the #Anonops text dataset (step 1). The Anonops IRC channel has been affiliated with the Anonymous hacktivist group. In particular, chat participants have in the past planned and announced their future targets on Anonops. Consequently, a well-engineered ML system would be able to predict cyber attacks by training on such data. In step 2, we instantiated a hashing vectorizer. The hashing vectorizer gave us counts of the 1- and 2-grams in the text, in other words, singleton and consecutive pairs of words (tokens) in the articles. We then applied a tf-idf transformer to give appropriate weights to the counts that the hashing vectorizer gave us. Our final result is a large, sparse matrix representing the occurrences of 1- and 2-grams in the texts, weighted by importance. Finally, we examined the frontend of a sparse matrix representation of our featured data in Scipy.