Identifying patterns in text using topic modeling
Topic modeling refers to the process of identifying hidden patterns in text data. The goal is to uncover the hidden thematic structure in a collection of documents.
How to do it...
- Import the following packages:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora
from nltk.corpus import stopwords
- Load the input data:
def load_words(in_file):
    # Read the file and return its lines without the trailing newline
    element = []
    with open(in_file, 'r') as f:
        for line in f:
            element.append(line.rstrip('\n'))
    return element
- Define a class to preprocess the text:
class Preprocedure(object):
    def __init__(self):
        # Create a regular expression tokenizer that matches runs of word characters
        self.tokenizer = RegexpTokenizer(r'\w+')
- Obtain the list of English stop words, which will be filtered out of the text:
        self.english_stop_words = stopwords.words('english')
- Create a Snowball stemmer:
        self.snowball_stemmer = SnowballStemmer('english')
- Define a function to perform...
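To get a feel for what the tokenizer and stop word steps do before wiring them into the class, the following is a minimal sketch that uses only the standard library. It assumes that `re.findall` with the `\w+` pattern yields the same tokens as `RegexpTokenizer` would for plain English text. The names `SAMPLE_STOP_WORDS`, `tokenize`, and `remove_stop_words` are hypothetical and used only for illustration; the tiny stop word set is a stand-in for the full NLTK list, and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop word list
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to"}


def tokenize(text):
    # Lowercase the text and extract runs of word characters
    return re.findall(r'\w+', text.lower())


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Keep only the tokens that are not in the stop word set
    return [token for token in tokens if token not in stop_words]


sample = "The goal is to uncover the hidden structure in a collection of documents."
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)

print(tokens)
print(filtered)

# Without the backslash, the pattern matches only the literal letter 'w'
print(re.findall(r'w+', "raw words"))
```

The last line shows why the backslash in the pattern matters: `r'w+'` matches only runs of the letter `w`, so almost all of the text would be discarded.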