Topic modeling refers to the process of identifying hidden patterns in text data. The goal is to uncover the latent thematic structure in a collection of documents.
Identifying patterns in text using topic modeling
How to do it...
- Import the following packages:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords
- Load the input data:
def load_words(in_file):
    element = []
    with open(in_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline, so a final line that
            # lacks one does not lose its last character
            element.append(line.rstrip('\n'))
    return element
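As a quick check of a loader along these lines, the sketch below writes a small throwaway file and reads it back. The file contents are made up for illustration, and this version uses `rstrip('\n')` so that a last line without a trailing newline is returned intact.

```python
import os
import tempfile


def load_words(in_file):
    """Return the lines of in_file as a list, without trailing newlines."""
    element = []
    with open(in_file, 'r') as f:
        for line in f:
            element.append(line.rstrip('\n'))
    return element


# Write a small throwaway file; the last line deliberately has no newline.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("first document\nsecond document\nthird document")
    sample_path = tmp.name

try:
    lines = load_words(sample_path)
finally:
    # Clean up the temporary file whether or not loading succeeded.
    os.remove(sample_path)

print(lines)
```

Each list entry corresponds to one line of the input file, so a file with one document per line yields one string per document.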
- Define a class to preprocess the text:
class Preprocedure(object):
    def __init__(self):
        # Create a regular expression...