Bag of Words (BoW) is one of the simplest and most popular feature engineering techniques for converting text into a numeric vector. It works in two steps: collecting the vocabulary words, then counting their presence or frequency in each text. It ignores document structure and contextual information, such as word order. Let's use the following three documents to understand BoW:
Document 1: I like pizza.
Document 2: I do not like burgers.
Document 3: Pizza and burgers both are junk food.
Now, we will create the Document Term Matrix (DTM). In this matrix, each row is a document, each column is a word, and each cell value is the frequency of that word in that document.
| | I | like | pizza | do | not | burgers | and | both | are | junk | food |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc-1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc-2 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Doc-3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
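The two steps behind this matrix (collecting the vocabulary, then counting each word per document) can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names `tokenize`, `build_vocabulary`, and `document_term_matrix` are hypothetical, and the sketch assumes that lowercasing the text and keeping only alphabetic runs is an acceptable way to split it into words.

```python
import re
from collections import Counter

documents = [
    "I like pizza.",
    "I do not like burgers.",
    "Pizza and burgers both are junk food.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only,
    # so "Pizza" and "pizza" count as the same word and punctuation is dropped.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    # Step 1: collect each distinct word once, in order of first appearance.
    vocab = []
    seen = set()
    for doc in docs:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

def document_term_matrix(docs, vocab):
    # Step 2: for every document, count how often each vocabulary word occurs.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))  # missing words count as 0
        rows.append([counts[term] for term in vocab])
    return rows

vocabulary = build_vocabulary(documents)
dtm = document_term_matrix(documents, vocabulary)

print(vocabulary)
for name, row in zip(["Doc-1", "Doc-2", "Doc-3"], dtm):
    print(name, row)
```

Because the sketch lowercases everything, the word "I" appears as `i` in the vocabulary; the counts otherwise line up with the three document rows above. Library vectorizers may tokenize differently. For example, scikit-learn's `CountVectorizer` with default settings ignores single-character tokens such as "I", so its vocabulary for these documents would be one word shorter.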
In the preceding example, we generated the DTM using single words as terms; a single-word term is known as a unigram. We...