The number of distinct N-grams grows exponentially in N. Even for a small fixed N, such as N=3, there are 256x256x256 = 16,777,216 possible byte N-grams. This means that the number of N-gram features is impracticably large. Consequently, we must select a smaller subset of N-grams that will be of most value to our classifiers. In this section, we show three different methods for selecting the most informative N-grams.
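To make the idea concrete, the following is a minimal sketch (not the recipe's own code) of one common selection strategy: counting byte N-grams across a set of samples and keeping only the K most frequent. The in-memory byte strings and the value of K are illustrative assumptions standing in for real PE file contents:

import collections
import nltk

def top_k_ngrams(samples, n=3, k=5):
    """Count byte N-grams over all samples and keep the k most frequent."""
    counts = collections.Counter()
    for data in samples:
        # nltk.ngrams yields successive tuples of n consecutive bytes
        counts.update(nltk.ngrams(data, n))
    return [gram for gram, _ in counts.most_common(k)]

# Toy byte strings standing in for the contents of PE files
samples = [b"\x4d\x5a\x90\x00\x03\x00", b"\x4d\x5a\x50\x00\x02\x00"]
print(top_k_ngrams(samples, n=3, k=5))

Restricting the feature set to the selected N-grams keeps the feature matrix at a manageable size for the classifiers used later in the recipe.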
Selecting the best N-grams
Getting ready
Preparation for this recipe consists of installing the scikit-learn and nltk packages with pip. The command is as follows:
pip install scikit-learn nltk
In addition, benign and malicious files have been provided for you in the PE Samples Dataset folder in the...