Representing Text – Capturing Semantics
Representing the meaning of words, phrases, and sentences in a form that computers can understand is one of the pillars of NLP. Machine learning algorithms, for example, represent each data point as a fixed-size vector (a list of numbers), and we are faced with the question of how to turn words and sentences into such vectors. Most NLP tasks start by representing the text in some numeric form, and in this chapter, we show several ways to do that.
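To make this concrete, here is a minimal sketch of one such encoding, using scikit-learn's CountVectorizer to turn each sentence into a fixed-size vector of word counts. The toy corpus and variable names are purely illustrative; the chapter covers this and other encodings in detail.

```python
# A minimal sketch: turning sentences into fixed-size count vectors
# with scikit-learn's CountVectorizer (one of several encodings covered later).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)  # sparse matrix: one row per sentence

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(vectors.toarray())                   # each sentence as a fixed-size vector
```

Every sentence, regardless of its length, becomes a vector with one entry per vocabulary word, which is exactly the fixed-size numeric form that downstream machine learning algorithms expect.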
First, we will create a simple classifier, which we will then use to test the effectiveness of each encoding method. We will also learn how to turn phrases such as fried chicken into vectors, that is, how to train a word2vec model for phrases. Finally, we will see how to use vector-based search.
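As a preview of the phrase-training idea, the following is a minimal sketch assuming the gensim library: Phrases merges frequent collocations such as fried chicken into single tokens, which Word2Vec then embeds like any other word, and most_similar acts as a simple form of vector-based search. The toy corpus and parameter values are illustrative only, not the settings used later in the chapter.

```python
# A sketch of training word2vec for phrases with gensim: Phrases detects
# collocations such as "fried chicken" and joins them into single tokens
# ("fried_chicken") before Word2Vec training.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

corpus = [
    ["she", "ordered", "fried", "chicken", "for", "dinner"],
    ["fried", "chicken", "is", "a", "popular", "dish"],
    ["he", "cooked", "fried", "chicken", "and", "rice"],
]

# Detect collocations; these low thresholds suit this toy corpus only.
phraser = Phraser(Phrases(corpus, min_count=1, threshold=1))
phrased = [phraser[sentence] for sentence in corpus]  # "fried chicken" -> "fried_chicken"

# Train word2vec on the phrase-merged corpus.
model = Word2Vec(phrased, vector_size=50, window=3, min_count=1)

print(model.wv["fried_chicken"])                       # the phrase has its own vector
print(model.wv.most_similar("fried_chicken", topn=3))  # nearest neighbors by vector similarity
```

The last line illustrates the core of vector-based search: once text is embedded, finding related items reduces to finding the nearest vectors.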
For a theoretical background on some of the concepts discussed in this chapter, refer to Building Machine Learning Systems...