When we tested that the data structure made sense, we printed the FullText field. We wish to cluster based on the content of the tweet. What matters to us is that content. This can be found in the FullText field of the struct. Later on in the chapter, we will see how we may use the metadata of the tweets, such as location, to help cluster the tweets better.
As mentioned in the previous sections, each individual tweet needs to be represented as a coordinate in some higher-dimensional space. Thus, our goal is to take all the tweets in a timeline and preprocess them in such a way that we can get this output table:
| Tweet ID | twitter | test | right | wrong |
|:--------:|:------:|:----:|:----:|:---:|
| 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |
Each row in the table represents a tweet, indexed by the tweet ID. The columns that follow are words that...