In this section, we demonstrate a technique for extracting the most frequent N-grams quickly and with little memory. This eases the challenges posed by the immense number of distinct N-grams. The technique is called Hash-Grams, and it relies on hashing the N-grams as they are extracted. N-gram frequencies follow a power law: a small number of N-grams account for most occurrences, so hash collisions have an insignificant impact on the quality of the features thus obtained.
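To make the idea concrete, here is a minimal sketch of hashing N-grams as they are extracted. It is an illustration under stated assumptions, not the recipe's own code: the function names `hash_gram_counts` and `top_k_buckets`, the choice of `zlib.crc32` as the hash, and the table size of 2**16 buckets are all hypothetical choices made for this example.

```python
import heapq
import zlib
from typing import List


def hash_gram_counts(data: bytes, n: int = 4, num_buckets: int = 2 ** 16) -> List[int]:
    """Count byte N-grams in a fixed-size table indexed by hash value.

    Memory use depends only on num_buckets, not on how many distinct
    N-grams appear in the data.
    """
    counts = [0] * num_buckets
    # Slide a window of n bytes over the data, one byte at a time.
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        # crc32 is an assumed hash choice; it is deterministic across runs,
        # unlike Python's built-in hash() for bytes.
        bucket = zlib.crc32(gram) % num_buckets
        counts[bucket] += 1
    return counts


def top_k_buckets(counts: List[int], k: int) -> List[int]:
    """Return the indices of the k buckets with the highest counts."""
    return heapq.nlargest(k, range(len(counts)), key=counts.__getitem__)


if __name__ == "__main__":
    sample = b"ABAB" * 100 + b"xyz"
    table = hash_gram_counts(sample, n=2)
    print(top_k_buckets(table, 3))
```

Because the table is indexed by hash value, two different N-grams can share a bucket. Under the power-law behavior described above, the most frequent N-grams dominate their buckets, so the top buckets can still serve as useful features.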
Extracting N-grams quickly using the hash-gram algorithm
Getting ready
Preparation for this recipe involves installing nltk with pip. The command is as follows:
pip install nltk
In addition, benign and malicious files have been provided for you in the PE Samples Dataset...