Using TruncatedSVD for LSI with real data
In this section, we will build a model using TruncatedSVD and real data. Let me outline the tasks first. They form a general procedure that you can follow whenever you build an LSI model:
- Loading the data
- Creating TF-IDF
- Using TruncatedSVD to build a model
- Interpreting the outcome
To keep the results easy to inspect, we will use just five documents from the data so that we can print out the words. Once you know how the process works, you can replicate it on the entire dataset.
Loading the data
In the Preface of the book, we said that we would use the sampled AG corpus of news articles throughout the book. Using one dataset helps you focus on the techniques rather than reorient yourself to different data, although there is still some value in exposing yourself to different datasets. The original AG corpus of news articles is a large collection of more than 1 million news articles from more than 2,000 news sources. A smaller collection that sampled...