Applied Unsupervised Learning with Python
Chapter 7: Topic Modeling


Activity 15: Loading and Cleaning Twitter Data

Solution:

  1. Import the necessary libraries:

    import langdetect
    import matplotlib.pyplot
    import nltk
    import numpy
    import pandas
    import pyLDAvis
    import pyLDAvis.sklearn
    import regex
    import sklearn
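
    The cleaning function defined later relies on NLTK's stop word list and the WordNet corpus. If these have not already been downloaded in your environment (an assumption about your local NLTK setup), a one-time download along these lines makes them available:

    # one-time download of the NLTK data used in this activity
    nltk.download('stopwords')
    nltk.download('wordnet')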
  2. Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:

    Note

    Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.

    path = '<Path>/latimeshealth.txt'
    df = pandas.read_csv(path, sep="|", header=None)
    df.columns = ["id", "datetime", "tweettext"]
  3. Run a quick exploratory analysis to ascertain the data size and structure:

    def dataframe_quick_look(df, nrows):
        print("SHAPE:\n{shape}\n".format(shape=df.shape))
        print("COLUMN NAMES:\n{names}\n".format(names=df.columns))
        print("HEAD:\n{head}\n".format(head=df.head(nrows)))

    dataframe_quick_look(df, nrows=2)

    The output is as follows:

    Figure 7.54: Shape, column names, and head of data

  4. Extract the tweet text and convert it to a list object:

    raw = df['tweettext'].tolist()
    print("HEADLINES:\n{lines}\n".format(lines=raw[:5]))
    print("LENGTH:\n{length}\n".format(length=len(raw)))

    The output is as follows:

    Figure 7.55: Headlines and their length

  5. Write a function that performs language detection, tokenizes on whitespace, and replaces screen names and URLs with SCREENNAME and URL, respectively. It should also strip punctuation, convert everything except the SCREENNAME and URL placeholders to lowercase, drop those placeholders, remove all stop words, perform lemmatization, and keep only words with five or more characters:

    Note

    Screen names start with the @ symbol.

    def do_language_identifying(txt):
        # return 'none' if language detection fails
        try:
            the_language = langdetect.detect(txt)
        except Exception:
            the_language = 'none'
        return the_language

    def do_lemmatizing(wrd):
        # fall back to the original word if WordNet has no base form
        out = nltk.corpus.wordnet.morphy(wrd)
        return (wrd if out is None else out)

    def do_tweet_cleaning(txt):
        # identify language of tweet
        # return None if language is not English
        lg = do_language_identifying(txt)
        if lg != 'en':
            return None
        # split the string on whitespace
        out = txt.split(' ')
        # identify screen names and replace with SCREENNAME
        out = ['SCREENNAME' if i.startswith('@') else i for i in out]
        # identify urls and replace with URL
        out = ['URL' if bool(regex.search('http[s]?://', i)) else i for i in out]
        # remove all punctuation
        out = [regex.sub('[^\\w\\s]|\n', '', i) for i in out]
        # make all non-keywords lowercase
        keys = ['SCREENNAME', 'URL']
        out = [i.lower() if i not in keys else i for i in out]
        # remove keywords
        out = [i for i in out if i not in keys]
        # remove stopwords
        list_stop_words = nltk.corpus.stopwords.words('english')
        list_stop_words = [regex.sub('[^\\w\\s]', '', i) for i in list_stop_words]
        out = [i for i in out if i not in list_stop_words]
        # lemmatizing
        out = [do_lemmatizing(i) for i in out]
        # keep words 5 or more characters long
        out = [i for i in out if len(i) >= 5]
        return out
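
    As a quick sanity check, the function can be called on a single made-up tweet (the text below is a hypothetical example, not a row from the dataset):

    # hypothetical tweet used only to exercise the cleaning steps
    sample = "@latimeshealth New study says walking daily improves health http://example.com"
    print(do_tweet_cleaning(sample))

    The printed list should contain only lowercase, lemmatized words of five or more characters, roughly ['study', 'daily', 'improve', 'health'], with the screen name and URL stripped out (exact tokens depend on the installed NLTK data).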
  6. Apply the function defined in step 5 to every tweet:

    clean = list(map(do_tweet_cleaning, raw))
  7. Remove the elements of the output list that are equal to None:

    clean = list(filter(None.__ne__, clean))
    print("HEADLINES:\n{lines}\n".format(lines=clean[:5]))
    print("LENGTH:\n{length}\n".format(length=len(clean)))

    The output is as follows:

    Figure 7.56: Headline and length after removing None
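
    The filter(None.__ne__, ...) idiom simply drops the None entries; an equivalent, arguably more readable form is a list comprehension:

    # equivalent to list(filter(None.__ne__, clean))
    clean = [tweet for tweet in clean if tweet is not None]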

  8. Turn the elements of each tweet back into a string. Concatenate using white space:

    clean_sentences = [" ".join(i) for i in clean]
    print(clean_sentences[0:10])

    The first 10 elements of the output list should resemble the following:

    Figure 7.57: Tweets cleaned for modeling

  9. Keep the notebook open for future modeling.

Activity 16: Latent Dirichlet Allocation and Health Tweets

Solution:

  1. Specify the number_words, number_docs, and number_features variables:

    number_words = 10
    number_docs = 10
    number_features = 1000
  2. Create a bag-of-words model and assign the feature names to another variable for use later on:

    vectorizer1 = sklearn.feature_extraction.text.CountVectorizer(
        analyzer="word",
        max_df=0.95, 
        min_df=10, 
        max_features=number_features
    )
    clean_vec1 = vectorizer1.fit_transform(clean_sentences)
    print(clean_vec1[0])
    
    feature_names_vec1 = vectorizer1.get_feature_names()

    The output is as follows:

    (0, 320)    1
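
    Note that get_feature_names() was removed in recent scikit-learn releases in favor of get_feature_names_out(). If the code above raises an AttributeError, a version-tolerant fallback (a sketch, not part of the original solution) is:

    # newer scikit-learn: get_feature_names_out(); older releases: get_feature_names()
    try:
        feature_names_vec1 = vectorizer1.get_feature_names_out()
    except AttributeError:
        feature_names_vec1 = vectorizer1.get_feature_names()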
  3. Identify the optimal number of topics:

    def perplexity_by_ntopic(data, ntopics):
        output_dict = {
            "Number Of Topics": [], 
            "Perplexity Score": []
        }
        for t in ntopics:
            lda = sklearn.decomposition.LatentDirichletAllocation(
                n_components=t,
                learning_method="online",
                random_state=0
            )
            lda.fit(data)
            output_dict["Number Of Topics"].append(t)
            output_dict["Perplexity Score"].append(lda.perplexity(data))
        output_df = pandas.DataFrame(output_dict)
        index_min_perplexity = output_df["Perplexity Score"].idxmin()
        output_num_topics = output_df.loc[
            index_min_perplexity,  # index
            "Number Of Topics"  # column
        ]
        return (output_df, output_num_topics)
    df_perplexity, optimal_num_topics = perplexity_by_ntopic(
        clean_vec1, 
        ntopics=[i for i in range(1, 21) if i % 2 == 0]
    )
    print(df_perplexity)

    The output is as follows:

    Figure 7.58: Number of topics versus perplexity score data frame
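
    Optionally, plotting perplexity against the number of topics makes the minimum easier to spot (this uses the matplotlib.pyplot import from step 1 of Activity 15 and is not part of the original solution):

    # optional: visualize perplexity versus number of topics
    matplotlib.pyplot.plot(
        df_perplexity["Number Of Topics"],
        df_perplexity["Perplexity Score"],
        marker="o"
    )
    matplotlib.pyplot.xlabel("Number Of Topics")
    matplotlib.pyplot.ylabel("Perplexity Score")
    matplotlib.pyplot.show()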

  4. Fit the LDA model using the optimal number of topics:

    lda = sklearn.decomposition.LatentDirichletAllocation(
        n_components=optimal_num_topics,
        learning_method="online",
        random_state=0
    )
    lda.fit(clean_vec1)

    The output is as follows:

    Figure 7.59: LDA model

  5. Create and print the word-topic table:

    def get_topics(mod, vec, names, docs, ndocs, nwords):
        # word to topic matrix
        W = mod.components_
        W_norm = W / W.sum(axis=1)[:, numpy.newaxis]
        # topic to document matrix
        H = mod.transform(vec)
        W_dict = {}
        H_dict = {}
        for tpc_idx, tpc_val in enumerate(W_norm):
            topic = "Topic{}".format(tpc_idx)
            # formatting w
            W_indices = tpc_val.argsort()[::-1][:nwords]
            W_names_values = [
                (round(tpc_val[j], 4), names[j]) 
                for j in W_indices
            ]
            W_dict[topic] = W_names_values
            # formatting h
            H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
            H_names_values = [
                (round(H[:, tpc_idx][j], 4), docs[j]) 
                for j in H_indices
            ]
            H_dict[topic] = H_names_values
        W_df = pandas.DataFrame(
            W_dict, 
            index=["Word" + str(i) for i in range(nwords)]
        )
        H_df = pandas.DataFrame(
            H_dict,
            index=["Doc" + str(i) for i in range(ndocs)]
        )
        return (W_df, H_df)
    
    W_df, H_df = get_topics(
        mod=lda,
        vec=clean_vec1,
        names=feature_names_vec1,
        docs=raw,
        ndocs=number_docs, 
        nwords=number_words
    )
    print(W_df)

    The output is as follows:

    Figure 7.60: Word-topic table for the health tweet data

  6. Print the document-topic table:

    print(H_df)

    The output is as follows:

    Figure 7.61: Document topic table

  7. Create a biplot visualization:

    lda_plot = pyLDAvis.sklearn.prepare(lda, clean_vec1, vectorizer1, R=10)
    pyLDAvis.display(lda_plot)

    Figure 7.62: A histogram and biplot for the LDA model trained on health tweets
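
    If the interactive widget does not render inline in your environment, the same visualization can be written to a standalone HTML file (the filename here is arbitrary):

    # save the interactive visualization to disk instead of displaying inline
    pyLDAvis.save_html(lda_plot, 'lda_biplot.html')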

  8. Keep the notebook open for future modeling.

Activity 17: Non-Negative Matrix Factorization

Solution:

  1. Create the appropriate bag-of-words (TF-IDF) model and save the feature names to another variable:

    vectorizer2 = sklearn.feature_extraction.text.TfidfVectorizer(
        analyzer="word",
        max_df=0.5, 
        min_df=20, 
        max_features=number_features,
        smooth_idf=False
    )
    clean_vec2 = vectorizer2.fit_transform(clean_sentences)
    print(clean_vec2[0])
    
    feature_names_vec2 = vectorizer2.get_feature_names()
  2. Define and fit the NMF algorithm using the optimal number of topics (n_components) found in Activity 16:

    nmf = sklearn.decomposition.NMF(
        n_components=optimal_num_topics,
        init="nndsvda",
        solver="mu",
        beta_loss="frobenius",
        random_state=0, 
        alpha=0.1, 
        l1_ratio=0.5
    )
    nmf.fit(clean_vec2)

    The output is as follows:

    Figure 7.63: Defining the NMF model
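
    In scikit-learn 1.2 and later, the single alpha argument of NMF was replaced by alpha_W and alpha_H. If the call above raises a TypeError on a newer release, a roughly equivalent sketch (keeping the other settings unchanged) is:

    # newer scikit-learn: alpha was split into alpha_W and alpha_H
    nmf = sklearn.decomposition.NMF(
        n_components=optimal_num_topics,
        init="nndsvda",
        solver="mu",
        beta_loss="frobenius",
        random_state=0,
        alpha_W=0.1,
        alpha_H="same",
        l1_ratio=0.5
    )
    nmf.fit(clean_vec2)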

  3. Get the topic-document and word-topic result tables. Take a few minutes to explore the word groupings and try to define the abstract topics:

    W_df, H_df = get_topics(
        mod=nmf,
        vec=clean_vec2,
        names=feature_names_vec2,
        docs=raw,
        ndocs=number_docs, 
        nwords=number_words
    )
    
    print(W_df)

    Figure 7.64: The word-topic table with probabilities
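
    The topic-document table can be inspected in the same way as in the LDA activity:

    print(H_df)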

  4. Adjust the model parameters and rerun steps 2 and 3 to see how the topics change.
