Topic models at scale
For the final Spark example, we will perform simple topic modelling on our corpus using MLlib (the Spark machine learning library).
We will use nouns as the features for our documents. First, we import the required classes:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
We will build the vocabulary from the noun word count RDD:
vocabulary = noun_word_count.map(lambda w: w[0]).collect()
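When we vectorize the documents later on, we will need to map each noun back to its position in the vocabulary. A small lookup dictionary makes that cheap (a sketch; the name vocabulary_lookup is our own, not from the original code):

vocabulary_lookup = {word: index for index, word in enumerate(vocabulary)}

For a large vocabulary, broadcasting this dictionary with sc.broadcast would ship it to each executor only once instead of with every task.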
Next, we need to transform the chunks corpus into a list of nouns per document:
from itertools import chain

doc_nouns = (
    chunks
    # keep only the noun phrase (NP) chunks of each document;
    # list() so that len() also works under Python 3
    .map(lambda doc_chunks: list(filter(
        lambda chunk: chunk.part_of_speech == 'NP',
        doc_chunks
    )))
    # drop documents that contain no noun phrases
    .filter(lambda doc_chunks: len(doc_chunks) > 0)
    # flatten each document's chunks into a single list of words
    .map(lambda doc_chunks: list(chain.from_iterable(map(
        lambda chunk: chunk.words,
        doc_chunks
    ))))
    # keep only the words with a noun-like part-of-speech tag
    .map(lambda words: list(filter(
        lambda word: match_noun_like_pos(word.part_of_speech),
        words
    )))
    # the original listing is truncated at this point; dropping documents
    # left without any nouns is a plausible final step
    .filter(lambda words: len(words) > 0)
)
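From here, each document's noun list can be turned into a term-count vector and handed to LDA. The following sketch shows one plausible way to finish the example; the to_term_counts helper, the choice of k=10 topics, and the assumption that each word object exposes its text as word.string are ours, not part of the original listing:

from collections import Counter

def to_term_counts(words):
    # assumption: each word object exposes its text as word.string
    counts = Counter(
        vocabulary_lookup[word.string]
        for word in words
        if word.string in vocabulary_lookup
    )
    # sparse term-count vector over the full vocabulary
    return Vectors.sparse(len(vocabulary), dict(counts))

# MLlib's LDA expects an RDD of [document id, term-count vector] pairs
corpus = doc_nouns \
    .map(to_term_counts) \
    .zipWithIndex() \
    .map(lambda pair: [pair[1], pair[0]])

lda_model = LDA.train(corpus, k=10)

# show the five most heavily weighted nouns for each topic
for term_indices, _weights in lda_model.describeTopics(maxTermsPerTopic=5):
    print([vocabulary[i] for i in term_indices])

Since training iterates over the corpus, caching doc_nouns with .cache() before this step would avoid recomputing the whole chunk pipeline on every pass.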