If machine learning models only operate on numerical data, how can we transform our text into a numerical representation? That question is central to Natural Language Processing (NLP). Let's take a brief look at how this is done.
We'll begin with a small corpus of three sentences:
- The new kitten played with the other kittens
- She ate lunch
- She loved her kitten
We'll first convert our corpus into a bag-of-words (BOW) representation, skipping preprocessing for now. This means counting how often each word occurs in each document and arranging those counts in what's called a term-document matrix (in the orientation used here, it is often also called a document-term matrix). Each unique word is assigned to a column, and each document is assigned to a row. The cell at the intersection of the two holds the number of times that word appears in that document:
Sr. no. | the | new | kitten | played | with | other | kittens... |
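To make the counting concrete, here is a minimal sketch in plain Python that builds the matrix for our three sentences. It rests on two assumptions that the text does not spell out: words are lowercased (so that "The" and "the" share the single `the` column shown in the header) and tokens are split on whitespace. The names `corpus`, `tokenized`, `vocabulary`, and `matrix` are illustrative only.

```python
from collections import Counter

corpus = [
    "The new kitten played with the other kittens",
    "She ate lunch",
    "She loved her kitten",
]

# Assumption: lowercase and split on whitespace; no other preprocessing.
tokenized = [doc.lower().split() for doc in corpus]

# One column per unique word, ordered by first appearance in the corpus.
vocabulary = list(dict.fromkeys(word for doc in tokenized for word in doc))

# One row per document; each cell is that word's count in the document.
matrix = []
for doc in tokenized:
    counts = Counter(doc)  # missing words count as 0
    matrix.append([counts[word] for word in vocabulary])

# Print the header followed by one numbered row per document.
print("Sr. no. | " + " | ".join(vocabulary))
for number, row in enumerate(matrix, start=1):
    print(f"{number} | " + " | ".join(str(count) for count in row))
```

Under these assumptions the first row has a 2 in the `the` column, because the first sentence contains the word twice, and a 1 in both `kitten` and `kittens`, since without stemming or lemmatization the two forms are treated as separate words.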