Extracting features from text
Many machine learning problems use text, which usually represents natural language. Text must be transformed into a vector representation that encodes some aspect of its meaning. In the following sections, we will review variations of two of the most common representations of text used in machine learning: the bag-of-words model and word embeddings.
The bag-of-words model
The most common representation of text is the bag-of-words model. This representation uses a multiset, or bag, that encodes the words that appear in a text; the bag-of-words model does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. Bag-of-words can be thought of as an extension of one-hot encoding: it creates one feature for each word of interest in the text.
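As a minimal sketch of this idea, the following example uses scikit-learn's CountVectorizer to build bag-of-words vectors for a small, made-up two-document corpus (the documents themselves are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical corpus of two short documents
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

# CountVectorizer tokenizes the documents, builds a vocabulary,
# and represents each document as a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per document

Each row of the resulting matrix corresponds to a document, and each column to a word in the vocabulary; the order in which the words appeared in the documents is not preserved, which is exactly the simplification the bag-of-words model makes.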