Simple encoding methods
As the saying goes, “All great endeavors commence from ground zero,” and NLP’s ground zero is encoding. Many encoding techniques exist for representing words in a way that captures their context or natural language understanding (NLU) meaning. Let’s start with the three simplest: one-hot encoding, bag-of-words (BoW), and bag of n-grams.
One-hot encoding
We can apply one-hot encoding to texts. It is closely related to count vectorizing: we create a vector whose length equals the number of unique words in the entire text, and each position records whether the corresponding word appears (a count vectorizer records how many times it appears instead). At the time of writing this chapter, on a quiet evening, I am listening to the song Never Enough from The Greatest Showman (https://www.imdb.com/title/tt1485796/). Let me use its lyrics as an example:
Here, there are two sentences. Each sentence will be converted to a vector. The length of each vector equals the number of unique words across both sentences (the vocabulary size); each position holds 1 if the corresponding word appears in the sentence and 0 otherwise.
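To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the choice of library is my assumption, and the two sentences below are illustrative stand-ins rather than the actual lyrics). Passing binary=True yields presence/absence (one-hot) vectors; omitting it gives raw word counts instead.

```python
# A minimal sketch of one-hot (binary) vectorization with scikit-learn.
# The sentences are illustrative stand-ins, not the lyrics quoted above.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the sun will rise again",
    "the show must go on tonight",
]

# binary=True marks word presence/absence (one-hot);
# remove it to get plain count vectorizing (raw counts).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the shared vocabulary (unique words)
print(X.toarray())                         # one vector per sentence, vocab-length each
```

Each row of the printed array is one sentence's vector, and its length matches the size of the shared vocabulary, exactly as described above.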