Analyzing n-grams
An n-gram is a contiguous sequence of n items from a given text. These items can be words, letters, or syllables. N-grams help us extract useful information about the distribution of words, syllables, or letters within a text. Here n is a positive integer giving the number of items in each sequence. The most common n-grams are unigrams, bigrams, and trigrams, where n is 1, 2, and 3 respectively.
Analyzing n-grams involves checking the frequency or distribution of each n-gram within a text. We typically split the text into its n-grams and count how often each one occurs. This helps us identify the most common words, syllables, or phrases in our data.
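The splitting and counting steps described above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the helper names word_ngrams and ngram_counts are hypothetical, tokenization is naive whitespace splitting with punctuation stripped from token edges, and lowercasing before counting is an assumption made so that "The" and "the" are treated as the same word.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` in order (hypothetical helper)."""
    # Naive tokenization: split on whitespace, strip punctuation from token edges.
    tokens = [t.strip(".,!?;:\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    # Slide a window of n tokens across the list; yields nothing if n > len(tokens).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n, lowercase=True):
    """Count how often each n-gram occurs in `text` (hypothetical helper)."""
    if lowercase:
        # Assumption: case is ignored when counting.
        text = text.lower()
    return Counter(word_ngrams(text, n))


sentence = "The boy threw the ball."
unigram_counts = ngram_counts(sentence, 1)
bigram_counts = ngram_counts(sentence, 2)

print(unigram_counts.most_common(1))  # [('the', 2)]
print(bigram_counts)
```

Counter.most_common(k) returns the k most frequent n-grams, which is usually the first thing to inspect. For letter-level n-grams, the same sliding window could be applied to the characters of the string instead of its tokens.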
For example, in the sentence “The boy threw the ball,” the n-grams would be as follows:
- 1-gram (or unigram):
["The", "boy", "threw", "the", "ball"]
- 2-gram (or bigram):
["The boy", "boy threw", "...