Key Input Parameters for LSA Topic Modeling
We will be using the gensim library to perform LSA topic modeling. The key input parameters for gensim are corpus, the number of topics, and id2word. Here, the corpus is specified in the form of a list of documents in which each document is a list of tokens. The id2word parameter refers to a dictionary that is used to convert the corpus from a textual representation to a numeric representation such that each word corresponds to a unique number. Let's do an exercise to understand this concept better.
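To make the idea of a word-to-number mapping concrete, here is a minimal sketch in plain Python. It is a hand-rolled illustration, not gensim's implementation: the toy documents and the names token2id, to_bow, and bow_corpus are assumptions made up for this example. In gensim itself, this role is played by corpora.Dictionary and its doc2bow method.

```python
from collections import Counter

# Toy corpus (made-up data): a list of documents, each a list of tokens.
documents = [
    ["cat", "sit", "mat"],
    ["dog", "sit", "log"],
    ["cat", "chase", "dog", "dog"],
]

# Assign each distinct token a unique integer id, in order of first appearance.
token2id = {}
for doc in documents:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Reverse mapping (id -> word), which is the role the id2word parameter plays.
id2word = {idx: token for token, idx in token2id.items()}


def to_bow(doc):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Numeric representation of the whole corpus.
bow_corpus = [to_bow(doc) for doc in documents]

print(id2word)
print(bow_corpus)
```

Each document becomes a list of (id, count) pairs, and id2word lets us translate the ids back into readable words when inspecting topics.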
spaCy is a popular natural language processing library for Python. In our exercises, we will be using spaCy to tokenize the text, lemmatize the tokens, and identify each token's part of speech. We will be using spaCy v2.1.3. Because spaCy provides models for many different languages, after installing it we will need to download the English language model with the following command so that we can load it.
python -m spacy...