Performing context Ngram in Hive
N-grams are contiguous sequences of n words taken from a given text. They are generally used to find how often certain words occur in a sequence, which is useful in tasks such as sentiment analysis. Hive provides built-in support for N-gram calculations through a function. In this recipe, we will take a look at how to use this function in order to analyze text data.
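To make the idea concrete before turning to Hive, here is a minimal plain-Python sketch of what a context N-gram calculation does: given a fixed sequence of words, it counts which words most often come right after it. This is only an illustration of the concept, not how Hive implements it. The helper name `context_ngram_counts` and the sample sentences are made up for this sketch and are not the recipe's dataset.

```python
from collections import Counter


def context_ngram_counts(sentences, context, top_k):
    """Return the top_k words that directly follow `context`.

    `sentences` is a list of token lists and `context` is a list of
    lowercase words. (Hypothetical helper, for illustration only.)
    """
    n = len(context)
    counts = Counter()
    for tokens in sentences:
        # Slide a window over the sentence; whenever the window matches
        # the context, record the token that comes right after it.
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == context:
                counts[tokens[i + n]] += 1
    return counts.most_common(top_k)


# Hypothetical sample text, invented for this sketch.
sample = [
    "chocolate is tasty",
    "dark chocolate is bitter",
    "chocolate is tasty and cheap",
    "i think chocolate is sweet",
]

# Lowercase and split on whitespace so that matching is case-insensitive.
tokenized = [line.lower().split() for line in sample]

top_words = context_ngram_counts(tokenized, ["chocolate", "is"], 3)
print(top_words)
```

In this sample, the word that most often follows "chocolate is" is "tasty", which appears twice. The sketch only handles a single blank at the end of the context and returns exact counts; it is a simplification meant to show the shape of the result.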
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Hive installed on it. Here, I am using Hive 1.2.1.
How to do it...
A context N-gram can be used to find the most frequently used word after a sequence of words in a given text dataset. To do this, let's first create a Hive table and load data into it.
Consider a situation where we have data from Twitter in which people express their sentiments about chocolate. Let's assume that we have the following text data:
Chocolate is good Chocolate is...