Character n-grams
We saw how function words can be used as features to predict the author of a document. Another feature type is character n-grams. An n-gram is a sequence of n tokens, where n is a small integer (for text, generally between 2 and 6). Word n-grams have been used in many studies, usually in relation to the topic of the documents, as in the previous chapter. Character n-grams, however, have proven to be high-quality features for authorship attribution.
Character n-grams are found in text documents by treating each document as a sequence of characters. The n-grams are then extracted from this sequence and a model is trained on them. There are a number of different models for this, but a standard one is very similar to the bag-of-words model we used earlier.
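To make the extraction step concrete, here is a minimal sketch in plain Python. It assumes a sliding window that moves one character at a time and a bag-of-words style count for each n-gram. The function names (char_ngrams, build_vocabulary, to_feature_vector) and the two toy documents are hypothetical, chosen only for illustration, and are not taken from any library.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n):
    """Collect the distinct n-grams seen in the training documents, sorted."""
    return sorted({gram for doc in documents for gram in char_ngrams(doc, n)})


def to_feature_vector(text, vocabulary, n):
    """One count per vocabulary n-gram; n-grams outside the vocabulary are ignored."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


# Toy corpus, assumed purely for illustration.
training_documents = ["blue tree", "the tree"]
vocabulary = build_vocabulary(training_documents, 3)

print(char_ngrams("blue tree", 3))
print(vocabulary)
print(to_feature_vector("the tree", vocabulary, 3))
```

Spaces are kept as ordinary characters here, so n-grams that span a word boundary appear alongside those that fall inside a single word. Whether to keep or normalize whitespace and punctuation is a design choice this sketch does not try to settle.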
For each distinct n-gram in the training corpus, we create a feature for it. An example of an n-gram is <e t>, which is the letter e, space, and then the letter t (the angle brackets are used to denote the start and end of the n-gram...