Applied Unsupervised Learning with Python

Discover hidden patterns and relationships in unstructured data with Python

Product type: Paperback
Published: May 2019
ISBN-13: 9781789952292
Length: 482 pages
Edition: 1st Edition
Authors (3): Benjamin Johnston, Christopher Kruger, Aaron Jones

Chapter 7: Topic Modeling


Activity 15: Loading and Cleaning Twitter Data

Solution:

  1. Import the necessary libraries:

    import langdetect
    import matplotlib.pyplot
    import nltk
    import numpy
    import pandas
    import pyLDAvis
    import pyLDAvis.sklearn
    import regex
    import sklearn
    import sklearn.decomposition
    import sklearn.feature_extraction.text
  2. Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:

    Note

    Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.

    path = '<Path>/latimeshealth.txt'
    df = pandas.read_csv(path, sep="|", header=None)
    df.columns = ["id", "datetime", "tweettext"]
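
The delimiter can be checked before loading by counting candidate separators on the first few lines. The following is a minimal sketch under assumptions: `guess_delimiter` is a hypothetical helper, and the sample lines are invented stand-ins with three fields, not real rows from latimeshealth.txt.

```python
def guess_delimiter(lines, candidates=(",", "\t", "|", ";")):
    """Return the candidate that splits every line into the same number of fields."""
    best = None
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        # A usable delimiter occurs the same non-zero number of times on each line.
        if len(counts) == 1:
            n = counts.pop()
            if n > 0 and (best is None or n > best[1]):
                best = (cand, n)
    return None if best is None else best[0]


# Hypothetical sample lines (id, datetime, tweet text); the real file may differ.
sample = [
    "1|Mon Jan 01 10:00:00 +0000 2015|Flu season, explained http://example.com/a",
    "2|Mon Jan 01 11:00:00 +0000 2015|Five tips for better sleep http://example.com/b",
]

delimiter = guess_delimiter(sample)
print(repr(delimiter))
```

A tweet that itself contains the separator would break the equal-count assumption, so the result is best treated as a hint and confirmed with df.shape after loading.
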
  3. Run a quick exploratory analysis to ascertain the data size and structure:

    def dataframe_quick_look(df, nrows):
        print("SHAPE:\n{shape}\n".format(shape=df.shape))
        print("COLUMN NAMES:\n{names}\n".format(names=df.columns))
        print("HEAD:\n{head}\n".format(head=df.head(nrows)))

    dataframe_quick_look(df, nrows=2)

    The output is as follows:

    Figure 7.54: Shape, column names, and head of data

  4. Extract the tweet text and convert it to a list object:

    raw = df['tweettext'].tolist()
    print("HEADLINES:\n{lines}\n".format(lines=raw[:5]))
    print("LENGTH:\n{length}\n".format(length=len(raw)))

    The output is as follows:

    Figure 7.55: Headlines and their length

  5. Write a function that detects the language of each tweet and discards non-English tweets, tokenizes on whitespace, and replaces screen names and URLs with SCREENNAME and URL, respectively. The function should then remove punctuation, numbers, and the SCREENNAME and URL replacements, and convert everything except SCREENNAME and URL to lowercase. Finally, it should remove all stop words, perform lemmatization, and keep only words with five or more letters:

    Note

    Screen names start with the @ symbol.

    def do_language_identifying(txt):
        # langdetect raises an exception on text it cannot classify
        try:
            the_language = langdetect.detect(txt)
        except Exception:
            the_language = 'none'
        return the_language

    def do_lemmatizing(wrd):
        # fall back to the original word when WordNet has no base form
        out = nltk.corpus.wordnet.morphy(wrd)
        return (wrd if out is None else out)

    def do_tweet_cleaning(txt):
        # identify language of tweet
        # return None if language is not English
        lg = do_language_identifying(txt)
        if lg != 'en':
            return None
        # split the string on whitespace
        out = txt.split(' ')
        # identify screen names
        # replace with SCREENNAME
        out = ['SCREENNAME' if i.startswith('@') else i for i in out]
        # identify urls
        # replace with URL
        out = ['URL' if bool(regex.search('http[s]?://', i)) else i for i in out]
        # remove all punctuation
        out = [regex.sub('[^\\w\\s]|\n', '', i) for i in out]
        # make all non-keywords lowercase
        keys = ['SCREENNAME', 'URL']
        out = [i.lower() if i not in keys else i for i in out]
        # remove keywords
        out = [i for i in out if i not in keys]
        # remove stopwords
        list_stop_words = nltk.corpus.stopwords.words('english')
        list_stop_words = [regex.sub('[^\\w\\s]', '', i) for i in list_stop_words]
        out = [i for i in out if i not in list_stop_words]
        # lemmatizing
        out = [do_lemmatizing(i) for i in out]
        # keep words 5 or more characters long
        out = [i for i in out if len(i) >= 5]
        return out
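
To see what the token-level stages do to a single tweet, the sketch below replays them with the standard library only. It is an illustration under assumptions, not the activity's function: it uses `re` in place of `regex`, a tiny hypothetical stop-word set in place of the nltk list, and it skips language detection and lemmatization. The example tweet is invented.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the activity uses nltk's English list.
STOP_WORDS = {"the", "a", "for", "of", "and", "to", "in", "is"}


def clean_tweet_sketch(txt, min_len=5):
    """Tokenize, mask screen names and URLs, strip punctuation, filter tokens."""
    out = txt.split(" ")
    out = ["SCREENNAME" if t.startswith("@") else t for t in out]
    out = ["URL" if re.search(r"https?://", t) else t for t in out]
    out = [re.sub(r"[^\w\s]|\n", "", t) for t in out]
    keys = {"SCREENNAME", "URL"}
    # Drop the placeholders and lowercase everything else.
    out = [t.lower() for t in out if t not in keys]
    out = [t for t in out if t not in STOP_WORDS]
    return [t for t in out if len(t) >= min_len]


example = "@latimes Doctors warn: sugary drinks linked to diabetes! http://example.com/x"
tokens = clean_tweet_sketch(example)
print(tokens)
```

Masking screen names and URLs before stripping punctuation matters: once the @ and :// characters are removed, those tokens can no longer be told apart from ordinary words.
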
  6. Apply the function defined in step 5 to every tweet:

    clean = list(map(do_tweet_cleaning, raw))
  7. Remove the elements of the output list that are equal to None:

    clean = [i for i in clean if i is not None]
    print("HEADLINES:\n{lines}\n".format(lines=clean[:5]))
    print("LENGTH:\n{length}\n".format(length=len(clean)))

    The output is as follows:

    Figure 7.56: Headline and length after removing None

  8. Turn the elements of each tweet back into a string. Concatenate using white space:

    clean_sentences = [" ".join(i) for i in clean]
    print(clean_sentences[0:10])

    The first 10 elements of the output list should resemble the following:

    Figure 7.57: Tweets cleaned for modeling

  9. Keep the notebook open for future modeling.

Activity 16: Latent Dirichlet Allocation and Health Tweets

Solution:

  1. Specify the number_words, number_docs, and number_features variables:

    number_words = 10
    number_docs = 10
    number_features = 1000
  2. Create a bag-of-words model and assign the feature names to another variable for use later on:

    vectorizer1 = sklearn.feature_extraction.text.CountVectorizer(
        analyzer="word",
        max_df=0.95, 
        min_df=10, 
        max_features=number_features
    )
    clean_vec1 = vectorizer1.fit_transform(clean_sentences)
    print(clean_vec1[0])
    
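    # get_feature_names() was removed in scikit-learn 1.2;
    # on newer versions, call get_feature_names_out() instead
    feature_names_vec1 = vectorizer1.get_feature_names()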

    The output is as follows:

    (0, 320)    1
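
The printed pair (0, 320) with value 1 is a sparse entry: document 0 contains the term in column 320 once. As a rough illustration of how the min_df and max_df filters shape the vocabulary, here is a standard-library sketch. It is not scikit-learn's implementation; `bag_of_words` is a hypothetical helper, it tokenizes on whitespace only, and the documents are invented. As in CountVectorizer, an integer min_df is read as a document count and a float max_df as a proportion of documents.

```python
from collections import Counter


def bag_of_words(docs, min_df=1, max_df=1.0):
    """Return (vocabulary, rows), where each row maps column index -> term count."""
    n_docs = len(docs)
    tokenized = [d.split() for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(
        t for t, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        counts = Counter(t for t in toks if t in index)
        rows.append({index[t]: c for t, c in counts.items()})
    return vocab, rows


# Invented documents standing in for clean_sentences.
docs = [
    "health study health",
    "study finds cancer risk",
    "cancer study warns",
    "heart health tips",
]
vocab, rows = bag_of_words(docs, min_df=2, max_df=0.5)
print(vocab)
print(rows)
```

Here "study" is dropped because it appears in three of four documents, above the max_df proportion, while terms seen in only one document fall below min_df.
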
  3. Identify the optimal number of topics:

    def perplexity_by_ntopic(data, ntopics):
        output_dict = {
            "Number Of Topics": [], 
            "Perplexity Score": []
        }
        for t in ntopics:
            lda = sklearn.decomposition.LatentDirichletAllocation(
                n_components=t,
                learning_method="online",
                random_state=0
            )
            lda.fit(data)
            output_dict["Number Of Topics"].append(t)
            output_dict["Perplexity Score"].append(lda.perplexity(data))
        output_df = pandas.DataFrame(output_dict)
        index_min_perplexity = output_df["Perplexity Score"].idxmin()
        output_num_topics = output_df.loc[
            index_min_perplexity,  # index
            "Number Of Topics"  # column
        ]
        return (output_df, output_num_topics)
    df_perplexity, optimal_num_topics = perplexity_by_ntopic(
        clean_vec1, 
        ntopics=[i for i in range(1, 21) if i % 2 == 0]
    )
    print(df_perplexity)

    The output is as follows:

    Figure 7.58: Number of topics versus perplexity score data frame
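
Perplexity is the exponential of the negative log-likelihood per word, so lower values mean the model is less surprised by the data. The sketch below shows that definition on explicit token probabilities and then picks the lowest score from a table. It rests on assumptions: scikit-learn estimates the likelihood with an approximate bound rather than exact probabilities, and the scores in `hypothetical_scores` are invented, not results from the health tweets.

```python
import math


def perplexity(token_probs):
    """exp of the negative mean log-probability of the observed tokens."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


def pick_lowest(scores):
    """Return the topic count whose perplexity score is smallest."""
    return min(scores, key=scores.get)


# A model that spreads probability evenly over 4 words has perplexity 4.
uniform = perplexity([0.25] * 8)

# Invented scores keyed by number of topics, for illustration only.
hypothetical_scores = {2: 410.0, 4: 385.5, 6: 392.1}
best = pick_lowest(hypothetical_scores)
print(uniform, best)
```
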

  4. Fit the LDA model using the optimal number of topics:

    lda = sklearn.decomposition.LatentDirichletAllocation(
        n_components=optimal_num_topics,
        learning_method="online",
        random_state=0
    )
    lda.fit(clean_vec1)

    The output is as follows:

    Figure 7.59: LDA model

  5. Create and print the word-topic table:

    def get_topics(mod, vec, names, docs, ndocs, nwords):
        # word to topic matrix
        W = mod.components_
        W_norm = W / W.sum(axis=1)[:, numpy.newaxis]
        # topic to document matrix
        H = mod.transform(vec)
        W_dict = {}
        H_dict = {}
        for tpc_idx, tpc_val in enumerate(W_norm):
            topic = "Topic{}".format(tpc_idx)
            # formatting w
            W_indices = tpc_val.argsort()[::-1][:nwords]
            W_names_values = [
                (round(tpc_val[j], 4), names[j]) 
                for j in W_indices
            ]
            W_dict[topic] = W_names_values
            # formatting h
            H_indices = H[:, tpc_idx].argsort()[::-1][:ndocs]
            H_names_values = [
                (round(H[:, tpc_idx][j], 4), docs[j]) 
                for j in H_indices
            ]
            H_dict[topic] = H_names_values
        W_df = pandas.DataFrame(
            W_dict, 
            index=["Word" + str(i) for i in range(nwords)]
        )
        H_df = pandas.DataFrame(
            H_dict,
            index=["Doc" + str(i) for i in range(ndocs)]
        )
        return (W_df, H_df)
    
    W_df, H_df = get_topics(
        mod=lda,
        vec=clean_vec1,
        names=feature_names_vec1,
        docs=clean_sentences,  # row-aligned with clean_vec1; raw still holds the filtered-out tweets
        ndocs=number_docs, 
        nwords=number_words
    )
    print(W_df)

    The output is as follows:

    Figure 7.60: Word-topic table for the health tweet data
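
The word-topic half of get_topics normalizes each row of components_ so that it sums to 1, then keeps the highest-weighted words. A standard-library sketch of that step follows, using plain lists instead of NumPy arrays. `top_words` is a hypothetical helper, and the weights and vocabulary are invented.

```python
def top_words(components, names, nwords):
    """Normalize each topic row to sum to 1 and return its top (weight, word) pairs."""
    table = {}
    for idx, row in enumerate(components):
        total = sum(row)
        weights = [v / total for v in row]
        # Column indices ordered from highest to lowest weight.
        order = sorted(
            range(len(weights)), key=lambda j: weights[j], reverse=True
        )[:nwords]
        table["Topic{}".format(idx)] = [
            (round(weights[j], 4), names[j]) for j in order
        ]
    return table


# Invented topic-word weights for a three-word vocabulary.
components = [[6.0, 3.0, 1.0], [2.0, 1.0, 7.0]]
names = ["cancer", "study", "heart"]
table = top_words(components, names, nwords=2)
print(table)
```

Because each row is normalized separately, the values are comparable within a topic but not directly across topics.
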

  6. Print the document-topic table:

    print(H_df)

    The output is as follows:

    Figure 7.61: Document topic table

  7. Create a biplot visualization:

    lda_plot = pyLDAvis.sklearn.prepare(lda, clean_vec1, vectorizer1, R=10)
    pyLDAvis.display(lda_plot)

    Figure 7.62: A histogram and biplot for the LDA model trained on health tweets

  8. Keep the notebook open for future modeling.

Activity 17: Non-Negative Matrix Factorization

Solution:

  1. Create the appropriate bag-of-words model and output the feature names as another variable:

    vectorizer2 = sklearn.feature_extraction.text.TfidfVectorizer(
        analyzer="word",
        max_df=0.5, 
        min_df=20, 
        max_features=number_features,
        smooth_idf=False
    )
    clean_vec2 = vectorizer2.fit_transform(clean_sentences)
    print(clean_vec2[0])
    
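    # get_feature_names() was removed in scikit-learn 1.2;
    # on newer versions, call get_feature_names_out() instead
    feature_names_vec2 = vectorizer2.get_feature_names()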
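
With smooth_idf=False, TfidfVectorizer weights a term by its count times ln(n / df) + 1, where n is the number of documents and df the number of documents containing the term, and then scales each row to unit Euclidean length by default. The sketch below reproduces that arithmetic with the standard library. It is an illustration only: `tfidf_rows` is a hypothetical helper, it tokenizes on whitespace, applies no min_df or max_df filtering, and assumes every document has at least one token.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Unsmoothed inverse document frequency.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for toks in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Invented two-document corpus.
tfidf = tfidf_rows(["cancer study", "cancer risk"])
print(tfidf[0])
```

The term shared by both documents gets the smaller weight in each row, which is the effect that separates TF-IDF from the raw counts used for LDA.
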
  2. Define and fit the NMF algorithm using the number of topics (n_components) value from Activity 16:

    nmf = sklearn.decomposition.NMF(
        n_components=optimal_num_topics,
        init="nndsvda",
        solver="mu",
        beta_loss="frobenius",
        random_state=0, 
        alpha=0.1,  # scikit-learn >= 1.2 replaces alpha with alpha_W and alpha_H
        l1_ratio=0.5
    )
    nmf.fit(clean_vec2)

    The output is as follows:

    Figure 7.63: Defining the NMF model
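
The solver="mu" and beta_loss="frobenius" settings select multiplicative updates that reduce the squared reconstruction error while keeping both factors non-negative. The sketch below shows the classic form of those updates in pure Python. It is a simplified illustration and not scikit-learn's implementation: it uses random initialization in place of nndsvda, omits the regularization controlled by alpha and l1_ratio, and the matrix `V` is invented.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf_mu(v, k, n_iter=200, seed=0, eps=1e-10):
    """Factor v (n x m) into w (n x k) and h (k x m) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    approx = matmul(w, h)
    return sum(
        (v[i][j] - approx[i][j]) ** 2
        for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


# Invented document-term matrix with two visible blocks of co-occurring terms.
V = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
]
W, H = nmf_mu(V, k=2)
print(frobenius_error(V, W, H))
```

Because every update multiplies a non-negative entry by a non-negative ratio, the factors never change sign, which is what lets the rows of H be read as topic weights over words.
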

  3. Get the topic-document and word-topic result tables. Take a few minutes to explore the word groupings and try to define the abstract topics:

    W_df, H_df = get_topics(
        mod=nmf,
        vec=clean_vec2,
        names=feature_names_vec2,
        docs=clean_sentences,  # row-aligned with clean_vec2; raw still holds the filtered-out tweets
        ndocs=number_docs, 
        nwords=number_words
    )
    
    print(W_df)

    Figure 7.64: The word-topic table with probabilities

  4. Adjust the model parameters and rerun step 2 and step 3.
