In this section, we will see how to featurize PDF files so that they can be used for machine learning. The tool we will use is the PDFiD Python script designed by Didier Stevens (https://blog.didierstevens.com/). Stevens selected a list of 20 features that are commonly found in malicious files, including whether the PDF file contains JavaScript or launches an automatic action. Finding these features in a file is suspicious, so their appearance can be indicative of malicious behavior.
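To make the idea of keyword-based featurization concrete, here is a minimal sketch that counts a few PDF name keywords in the raw bytes of a file and returns the counts as a vector. It is not PDFiD itself: the keyword subset, the function names count_keywords and featurize, and the sample bytes are all assumptions made for illustration, and the sketch ignores details such as obfuscated names and compressed object streams.

```python
import re

# Illustrative subset of PDF name keywords (an assumption for this sketch;
# the real tool tracks roughly 20 features).
KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch"]


def count_keywords(data: bytes) -> dict:
    """Count how often each keyword appears in raw PDF bytes."""
    counts = {}
    for kw in KEYWORDS:
        # The negative lookahead rejects matches followed by a letter or
        # digit, so "/AA" is not counted inside a longer name like "/AAB".
        pattern = re.escape(kw.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts


def featurize(data: bytes) -> list:
    """Return the keyword counts as a fixed-order feature vector."""
    counts = count_keywords(data)
    return [counts[kw] for kw in KEYWORDS]


# Hypothetical, hand-written fragment standing in for a real PDF file.
sample = (
    b"%PDF-1.6\n"
    b"1 0 obj\n<< /OpenAction 2 0 R /JS (app.alert(1)) >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS 3 0 R >>\nendobj\n"
)

print(featurize(sample))  # [2, 1, 1, 0, 0]
```

For a file on disk, the same functions could be applied to the bytes read with open(path, "rb").read(). Keeping the keyword order fixed means every file maps to a vector of the same length, which is what most machine learning libraries expect.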
Essentially, the tool scans through a PDF file and counts the number of occurrences of each of the ~20 features. A run of the tool appears as follows:
PDFiD 0.2.5 PythonBrochure.pdf
PDF Header: %PDF-1.6
obj 1096
endobj 1095
stream 1061
endstream 1061
xref 0
trailer 0
startxref...