Chapter 7: Topic Modeling
Activity 15: Loading and Cleaning Twitter Data
Solution:
Import the necessary libraries:
import langdetect
import matplotlib.pyplot
import nltk
import numpy
import pandas
import pyLDAvis
import pyLDAvis.sklearn
import regex
import sklearn
Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:
Note
Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.
path = '<Path>/latimeshealth.txt'
df = pandas.read_csv(path, sep="|", header=None)
df.columns = ["id", "datetime", "tweettext"]
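If you want to double-check the delimiter and header status mentioned in the note above, a quick look at the first raw line of the file helps; this is an optional check, not part of the original solution:

# optional: inspect the first raw line to confirm the "|" delimiter and that no header row is present
with open(path, 'r', encoding='utf-8') as f:
    print(f.readline())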
Run a quick exploratory analysis to ascertain the data size and structure:
def dataframe_quick_look(df, nrows):
    print("SHAPE:\n{shape}\n".format(shape=df.shape))
    print("COLUMN NAMES:\n{names}\n".format(names=df.columns))
    print("HEAD:\n{head}\n".format(head=df.head(nrows)))

dataframe_quick_look(df, nrows=2)
The output is as follows:
Extract the tweet text and convert it to a list object:
raw = df['tweettext'].tolist()
print("HEADLINES:\n{lines}\n".format(lines=raw[:5]))
print("LENGTH:\n{length}\n".format(length=len(raw)))
The output is as follows:
Write a function that performs language detection, tokenizes on whitespace, and replaces screen names and URLs with SCREENNAME and URL, respectively. The function should also remove punctuation, numbers, and the SCREENNAME and URL replacements. Convert everything to lowercase, except SCREENNAME and URL. It should remove all stop words, perform lemmatization, and keep only words with five or more letters:
Note
Screen names start with the @ symbol.
def do_language_identifying(txt):
    try:
        the_language = langdetect.detect(txt)
    except:
        the_language = 'none'
    return the_language

def do_lemmatizing(wrd):
    out = nltk.corpus.wordnet.morphy(wrd)
    return (wrd if out is None else out)

def do_tweet_cleaning(txt):
    # identify language of tweet
    # return None if language is not English
    lg = do_language_identifying(txt)
    if lg != 'en':
        return None
    # split the string on whitespace
    out = txt.split(' ')
    # identify screen names
    # replace with SCREENNAME
    out = ['SCREENNAME' if i.startswith('@') else i for i in out]
    # identify urls
    # replace with URL
    out = ['URL' if bool(regex.search('http[s]?://', i)) else i for i in out]
    # remove all punctuation
    out = [regex.sub('[^\\w\\s]|\n', '', i) for i in out]
    # make all non-keywords lowercase
    keys = ['SCREENNAME', 'URL']
    out = [i.lower() if i not in keys else i for i in out]
    # remove keywords
    out = [i for i in out if i not in keys]
    # remove stopwords
    list_stop_words = nltk.corpus.stopwords.words('english')
    list_stop_words = [regex.sub('[^\\w\\s]', '', i) for i in list_stop_words]
    out = [i for i in out if i not in list_stop_words]
    # lemmatizing
    out = [do_lemmatizing(i) for i in out]
    # keep words 5 or more characters long
    out = [i for i in out if len(i) >= 5]
    return out
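Before applying the function to the full dataset, it can help to run it on a single made-up tweet. The string below is purely illustrative, and the check assumes the NLTK stopwords and wordnet corpora have already been downloaded:

# quick smoke test on a fabricated example tweet (illustrative only)
sample_tweet = "@latimeshealth New research suggests exercise improves sleep http://example.com/abc"
print(do_tweet_cleaning(sample_tweet))
# the screen name and URL should be stripped, stop words and short words dropped,
# and the remaining words lowercased and lemmatized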
Apply the function defined in step 5 to every tweet:
clean = list(map(do_tweet_cleaning, raw))
Remove the elements of the output list that are equal to None:
clean = list(filter(None.__ne__, clean))
print("HEADLINES:\n{lines}\n".format(lines=clean[:5]))
print("LENGTH:\n{length}\n".format(length=len(clean)))
The output is as follows:
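As an optional sanity check (not part of the original solution), the number of non-English tweets dropped during cleaning can be computed by comparing the list lengths:

# optional: count how many tweets were removed as non-English
print("REMOVED:\n{n}\n".format(n=len(raw) - len(clean)))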
Turn the elements of each cleaned tweet back into a single string, concatenating with whitespace:
clean_sentences = [" ".join(i) for i in clean]
print(clean_sentences[0:10])
The first 10 elements of the output list should resemble the following:
Keep the notebook open for future modeling.
Activity 16: Latent Dirichlet Allocation and Health Tweets
Solution:
Specify the number_words, number_docs, and number_features variables:
number_words = 10
number_docs = 10
number_features = 1000
Create a bag-of-words model and assign the feature names to another variable for use later on:
vectorizer1 = sklearn.feature_extraction.text.CountVectorizer(
    analyzer="word",
    max_df=0.95,
    min_df=10,
    max_features=number_features
)
clean_vec1 = vectorizer1.fit_transform(clean_sentences)
print(clean_vec1[0])

feature_names_vec1 = vectorizer1.get_feature_names()
The output is as follows:
(0, 320) 1
Identify the optimal number of topics:
def perplexity_by_ntopic(data, ntopics):
    output_dict = {
        "Number Of Topics": [],
        "Perplexity Score": []
    }
    for t in ntopics:
        lda = sklearn.decomposition.LatentDirichletAllocation(
            n_components=t,
            learning_method="online",
            random_state=0
        )
        lda.fit(data)
        output_dict["Number Of Topics"].append(t)
        output_dict["Perplexity Score"].append(lda.perplexity(data))
    output_df = pandas.DataFrame(output_dict)
    index_min_perplexity = output_df["Perplexity Score"].idxmin()
    output_num_topics = output_df.loc[
        index_min_perplexity,  # index
        "Number Of Topics"  # column
    ]
    return (output_df, output_num_topics)

df_perplexity, optimal_num_topics = perplexity_by_ntopic(
    clean_vec1,
    ntopics=[i for i in range(1, 21) if i % 2 == 0]
)
print(df_perplexity)
The output is as follows:
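Optionally, the perplexity scores can also be plotted against the number of topics so the minimum is easy to spot; this uses the matplotlib.pyplot import from step 1 and is not part of the original solution:

# optional: plot perplexity versus number of topics; the minimum marks the chosen value
matplotlib.pyplot.plot(
    df_perplexity["Number Of Topics"],
    df_perplexity["Perplexity Score"],
    marker="o"
)
matplotlib.pyplot.xlabel("Number Of Topics")
matplotlib.pyplot.ylabel("Perplexity Score")
matplotlib.pyplot.show()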
Fit the LDA model using the optimal number of topics:
lda = sklearn.decomposition.LatentDirichletAllocation(
    n_components=optimal_num_topics,
    learning_method="online",
    random_state=0
)
lda.fit(clean_vec1)
The output is as follows:
Create and print the word-topic table:
def get_topics(mod, vec, names, docs, ndocs, nwords):
    # word to topic matrix
    W = mod.components_
    W_norm = W / W.sum(axis=1)[:, numpy.newaxis]
    # topic to document matrix
    H = mod.transform(vec)
    W_dict = {}
    H_dict = {}
    for tpc_idx, tpc_val in enumerate(W_norm):
        topic = "Topic{}".format(tpc_idx)
        # formatting w
        W_indices = tpc_val.argsort()[::-1][:nwords]
        W_names_values = [
            (round(tpc_val[j], 4), names[j])
            for j in W_indices
        ]
        W_dict[topic] = W_names_values
        # formatting h
        H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
        H_names_values = [
            (round(H[:, tpc_idx][j], 4), docs[j])
            for j in H_indices
        ]
        H_dict[topic] = H_names_values
    W_df = pandas.DataFrame(
        W_dict,
        index=["Word" + str(i) for i in range(nwords)]
    )
    H_df = pandas.DataFrame(
        H_dict,
        index=["Doc" + str(i) for i in range(ndocs)]
    )
    return (W_df, H_df)

W_df, H_df = get_topics(
    mod=lda,
    vec=clean_vec1,
    names=feature_names_vec1,
    docs=raw,
    ndocs=number_docs,
    nwords=number_words
)
print(W_df)
The output is as follows:
Print the document-topic table:
print(H_df)
The output is as follows:
Create a biplot visualization:
lda_plot = pyLDAvis.sklearn.prepare(lda, clean_vec1, vectorizer1, R=10)
pyLDAvis.display(lda_plot)
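If the interactive plot does not render inline in your environment, it can optionally be written to a standalone HTML file and opened in a browser; the filename below is arbitrary:

# optional: save the interactive visualization to an HTML file
pyLDAvis.save_html(lda_plot, 'lda_biplot.html')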
Keep the notebook open for future modeling.
Activity 17: Non-Negative Matrix Factorization
Solution:
Create the appropriate bag-of-words model and output the feature names as another variable:
vectorizer2 = sklearn.feature_extraction.text.TfidfVectorizer(
    analyzer="word",
    max_df=0.5,
    min_df=20,
    max_features=number_features,
    smooth_idf=False
)
clean_vec2 = vectorizer2.fit_transform(clean_sentences)
print(clean_vec2[0])

feature_names_vec2 = vectorizer2.get_feature_names()
Define and fit the NMF algorithm using the optimal number of topics (n_components) found in Activity 16:
nmf = sklearn.decomposition.NMF(
    n_components=optimal_num_topics,
    init="nndsvda",
    solver="mu",
    beta_loss="frobenius",
    random_state=0,
    alpha=0.1,
    l1_ratio=0.5
)
nmf.fit(clean_vec2)
The output is as follows:
Get the topic-document and word-topic result tables. Take a few minutes to explore the word groupings and try to define the abstract topics:
W_df, H_df = get_topics(
    mod=nmf,
    vec=clean_vec2,
    names=feature_names_vec2,
    docs=raw,
    ndocs=number_docs,
    nwords=number_words
)
print(W_df)
Adjust the model parameters and rerun step 3 and step 4.
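As one possible illustration of such an adjustment (the parameter values below are arbitrary choices for experimentation, not recommended settings), the generalized Kullback-Leibler loss could be tried with the "mu" solver and the result tables regenerated:

# illustrative re-run with a different beta_loss; all other settings kept from step 2
nmf_kl = sklearn.decomposition.NMF(
    n_components=optimal_num_topics,
    init="nndsvda",
    solver="mu",
    beta_loss="kullback-leibler",
    random_state=0,
    alpha=0.1,
    l1_ratio=0.5
)
nmf_kl.fit(clean_vec2)
W_df_kl, H_df_kl = get_topics(
    mod=nmf_kl,
    vec=clean_vec2,
    names=feature_names_vec2,
    docs=raw,
    ndocs=number_docs,
    nwords=number_words
)
print(W_df_kl)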