Much has been written about using NLTK to identify n-grams within text. An n-gram is a contiguous sequence of n words; in practice we care about the ones that are common within a document or corpus (occurring two or more times). A 2-gram is any pair of words that commonly appear together, a 3-gram is a three-word phrase, and so on. We will not look at discovering the n-grams in a document. Instead, we will focus on reconstructing known n-grams from our token streams, since we consider those n-grams more important to a search result than the same two or three words found independently in any order.
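NLTK ships a tokenizer designed for exactly this re-joining step: `MWETokenizer`, which merges known multi-word expressions in an existing token stream. The sketch below is a minimal illustration; the bigram list and sample sentence are made up for this example, but the calls to `MWETokenizer` and `wordpunct_tokenize` are standard NLTK.

```python
from nltk.tokenize import MWETokenizer, wordpunct_tokenize

# Hypothetical list of 2-grams we already know are meaningful in our corpus.
known_bigrams = [('big', 'data'), ('data', 'science'), ('sql', 'server')]

# MWETokenizer re-joins known multi-word expressions in a token stream.
mwe = MWETokenizer(known_bigrams, separator=' ')

tokens = [t.lower() for t in wordpunct_tokenize("Seeking Big Data and Data Science skills")]
print(mwe.tokenize(tokens))
# ['seeking', 'big data', 'and', 'data science', 'skills']
```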
In the domain of parsing job listings, important 2-grams include Computer Science, SQL Server, Data Science, and Big Data. We could also treat C# as a 2-gram of 'C' and '#', which is why we might not want to use the regex parser, or to strip '#' as punctuation, when processing these listings.
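To make the C# point concrete, here is a small sketch (the sample text is invented, the tokenizers are NLTK's): a `\w+`-style regex tokenizer drops the '#' entirely, while a tokenizer that keeps punctuation as separate tokens lets us rebuild 'c#' as a known 2-gram.

```python
from nltk.tokenize import MWETokenizer, RegexpTokenizer, wordpunct_tokenize

text = "Senior C# developer wanted"

# A \w+ regex tokenizer silently discards '#', so C# collapses to a bare 'c'.
print(RegexpTokenizer(r'\w+').tokenize(text.lower()))
# ['senior', 'c', 'developer', 'wanted']

# Keeping '#' as its own token lets us re-join it with the preceding 'c'.
tokens = wordpunct_tokenize(text.lower())   # ['senior', 'c', '#', 'developer', 'wanted']
print(MWETokenizer([('c', '#')], separator='').tokenize(tokens))
# ['senior', 'c#', 'developer', 'wanted']
```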