Representing language numerically with vectors
A common mathematical technique for representing language in preparation for machine learning is to use vectors. Both documents and words can be represented with vectors. We’ll start by discussing document vectors.
Understanding vectors for document representation
We have seen that texts can be represented as sequences of symbols such as words, which is the way that we read them. However, it is usually more convenient for computational NLP purposes to represent text numerically, especially if we are dealing with large quantities of text. Another advantage of numerical representation is that it lets us process text with a much wider range of mathematical techniques.
A common way to represent both documents and words is by using vectors, which are essentially one-dimensional arrays of numbers. Besides words, we can use vectors to represent other linguistic units, such as lemmas or stemmed words...
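To make the idea of a document as a one-dimensional array concrete, here is a minimal sketch in Python. It assumes one simple scheme, word counts over a shared vocabulary, which is only one of several possible ways to build a document vector. The function names build_vocabulary and document_vector, the whitespace tokenization, and the two sample sentences are illustrative assumptions, not something defined in the text.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of the distinct lowercase tokens across all documents."""
    tokens = set()
    for doc in documents:
        # Assumption: splitting on whitespace is a good enough tokenizer here.
        tokens.update(doc.lower().split())
    return sorted(tokens)


def document_vector(document, vocabulary):
    """Represent a document as a list of counts, one position per vocabulary word."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]


# Hypothetical sample documents, chosen only for illustration.
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [document_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for vec in vectors:
    print(vec)
```

Each document becomes an array of the same length, with one position per vocabulary word, so the two documents can be compared position by position even though the original texts differ in length.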