Data Science with Python

Chapter 7: Processing Human Language

Activity 19: Predicting Sentiments of Movie Reviews

Solution:

  1. Read the IMDB movie review dataset using pandas in Python:

    import pandas as pd

    data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')

  2. Convert the reviews to lowercase to reduce the number of unique words:

    data.text = data.text.str.lower()

    Note

    Keep in mind that "Hello" and "hello" are not the same to a computer.

  3. Clean the reviews using RegEx with the clean_str function:

    import re

    def clean_str(string):
        string = re.sub(r"https?\://\S+", '', string)    # remove URLs
        string = re.sub(r'\<a href', ' ', string)        # remove HTML anchor tags
        string = re.sub(r'&amp;', '', string)            # remove encoded ampersands
        string = re.sub(r'<br />', ' ', string)          # remove HTML line breaks
        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)    # strip punctuation and special characters
        string = re.sub(r'\d', '', string)               # remove digits
        string = re.sub(r"can\'t", "cannot", string)     # expand common contractions
        string = re.sub(r"it\'s", "it is", string)
        return string

    data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))

  4. Next, remove stop words and other frequently occurring but uninformative words (such as "movie" and "film") from the reviews.

    Note

    To see how we found these words, refer to Exercise 51.

  5. Tokenize the reviews and filter out the stop words as you go; the resulting tokens will be used to create the word embedding in the next step:

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english') + ['movie', 'film', 'time']

    stop_words = set(stop_words)

    # For each review: split it into sentences, tokenize each sentence into words, and keep only the non-stop words.
    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)

  6. Create the word embedding of the reviews with the tokens created in the previous step. Here, we will use gensim's Word2Vec to create these embedding vectors:

    from gensim.models import Word2Vec

    model = Word2Vec(

            data['SentimentText'].apply(lambda x: x[0]),

            iter=10,

            size=16,

            window=5,

            min_count=5,

            workers=10)

    model.wv.save_word2vec_format('movie_embedding.txt', binary=False)
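    As a quick sanity check (not part of the original solution), you can query the trained embedding for nearest neighbours. Assuming a common word such as 'good' survived the min_count=5 cutoff, it should sit close to other sentiment-bearing words:

    # Hypothetical check: the exact neighbours depend on the data and random initialization.
    print(model.wv.most_similar('good', topn=5))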

  7. Combine the tokens to get a string and then drop any review that does not have anything in it after stop word removal:

    import numpy as np

    def combine_text(text):
        # Join the tokens of the first sentence back into a single string; reviews that
        # are empty after stop word removal become NaN so that they can be dropped.
        try:
            return ' '.join(text[0])
        except:
            return np.nan

     

    data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))

    data = data.dropna(how='any')

  8. Tokenize the reviews using the Keras Tokenizer and convert them into numbers:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)

    tokenizer.fit_on_texts(list(data['SentimentText']))

    sequences = tokenizer.texts_to_sequences(data['SentimentText'])

    word_index = tokenizer.word_index
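    As a small illustration (separate from the activity data), the tokenizer assigns each word an integer index based on its frequency and then rewrites every text as a list of those indices:

    toy_tok = Tokenizer(num_words=10)
    toy_tok.fit_on_texts(["a great movie", "a terrible movie"])
    # The exact indices depend on the word frequencies in the fitted texts.
    print(toy_tok.texts_to_sequences(["a great movie"]))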

  9. Finally, pad the reviews so that each one is 100 tokens long. Reviews longer than 100 tokens are truncated and shorter ones are padded with 0s (with the default settings, both happen at the start of the sequence):

    from keras.preprocessing.sequence import pad_sequences

    reviews = pad_sequences(sequences, maxlen=100)
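    A quick illustration (with made-up token IDs) of the default behaviour of pad_sequences:

    toy = [[3, 7], [1, 2, 3, 4, 5, 6]]
    print(pad_sequences(toy, maxlen=4))
    # [[0 0 3 7]     <- shorter sequences are left-padded with 0s
    #  [3 4 5 6]]    <- longer sequences keep only their last maxlen tokens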

  10. Load the created embedding to get the embedding matrix using the load_embedding function discussed in the Text Processing section:

    import numpy as np

     

    def load_embedding(filename, word_index, num_words, embedding_dim):
        # Read the saved embedding file: each line holds a word followed by its vector components.
        embeddings_index = {}
        file = open(filename, encoding="utf-8")
        for line in file:
            values = line.split()
            word = values[0]
            coef = np.asarray(values[1:])
            embeddings_index[word] = coef
        file.close()

        # Build the matrix row by row; words without a pretrained vector keep a row of zeros.
        embedding_matrix = np.zeros((num_words, embedding_dim))
        for word, pos in word_index.items():
            if pos >= num_words:
                continue
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[pos] = embedding_vector
        return embedding_matrix

     

    embedding_matrix = load_embedding('movie_embedding.txt', word_index, len(word_index), 16)

  11. Convert the labels into one-hot vectors using pandas' get_dummies function and split the dataset into training and testing sets with an 80:20 split:

    from sklearn.model_selection import train_test_split

    labels = pd.get_dummies(data.Sentiment)

    X_train, X_test, y_train, y_test = train_test_split(reviews,labels, test_size=0.2, random_state=9)
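    For reference (not part of the original solution), get_dummies creates one indicator column per class, which is the one-hot format expected by the two-unit softmax output layer defined below:

    # Hypothetical labels: 0 = negative, 1 = positive. Each row is marked in its class column
    # (shown as 0/1 or True/False depending on your pandas version).
    print(pd.get_dummies(pd.Series([0, 1, 1])))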

  12. Create the neural network model starting with the input and embedding layers. This layer converts the input words into their embedding vectors:

    from keras.layers import Input, Dense, Dropout, BatchNormalization, Embedding, Flatten

    from keras.models import Model

    inp = Input((100,))

    embedding_layer = Embedding(len(word_index),

                        16,

                        weights=[embedding_matrix],

                        input_length=100,

                        trainable=False)(inp)

  13. Create the rest of the fully connected neural network using Keras:

    model = Flatten()(embedding_layer)

    model = BatchNormalization()(model)

    model = Dropout(0.10)(model)

    model = Dense(units=1024, activation='relu')(model)

    model = Dense(units=256, activation='relu')(model)

    model = Dropout(0.5)(model)

    predictions = Dense(units=2, activation='softmax')(model)

    model = Model(inputs = inp, outputs = predictions)

  14. Compile and train the model for 10 epochs. You can modify the model and the hyperparameters to try and get a better accuracy:

    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])

    model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)

  15. Calculate the accuracy of the model on the test set to see how well it performs on previously unseen data:

    from sklearn.metrics import accuracy_score

    preds = model.predict(X_test)

    accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1))

    The accuracy of the model is:

    Figure 7.39: Model accuracy
  16. Plot the confusion matrix of the model to get a proper sense of its predictions:

    y_actual = pd.Series(np.argmax(y_test.values, axis=1), name='Actual')

    y_pred = pd.Series(np.argmax(preds, axis=1), name='Predicted')

    pd.crosstab(y_actual, y_pred, margins=True)

    Check the following output:

    Figure 7.40: Confusion matrix of the model (0 = negative sentiment, 1 = positive sentiment)
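    From the crosstab, you can also read off per-class performance directly; a minimal sketch, assuming the y_actual and y_pred series defined above:

    # Per-class recall: correct predictions for a class divided by the number of actual examples of that class.
    cm = pd.crosstab(y_actual, y_pred)
    print(np.diag(cm.values) / cm.sum(axis=1).values)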
  17. Check the performance of the model by seeing the sentiment predictions on random reviews using the following code:

    review_num = 111

    print("Review: \n"+tokenizer.sequences_to_texts([X_test[review_num]])[0])

    sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"

    print("\nPredicted sentiment = "+ sentiment)

    sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"

    print("\nActual sentiment = "+ sentiment)

    You should see output similar to the following:

Figure 7.41: A review from the IMDB dataset

Activity 20: Predicting Sentiments from Tweets

Solution:

  1. Read the tweet dataset using pandas and rename the columns with those given in the following code:

    import pandas as pd

    data = pd.read_csv('tweet-data.csv', encoding='latin-1', header=None)

    data.columns = ['sentiment', 'id', 'date', 'q', 'user', 'text']

  2. Drop the following columns as we won't be using them. You can analyze and use them if you want when trying to improve the accuracy:

    data = data.drop(['id', 'date', 'q', 'user'], axis=1)

  3. We perform this activity only on a subset (400,000 tweets) of the data to save time. If you want, you can work on the whole dataset:

    data = data.sample(400000).reset_index(drop=True)

  4. Convert the tweets to lowercase to reduce the number of unique words. Keep in mind that "Hello" and "hello" are not the same to a computer:

    data.text = data.text.str.lower()

  5. Clean the tweets using the clean_str function:

    import re

    def clean_str(string):
        string = re.sub(r"https?\://\S+", '', string)    # remove URLs
        string = re.sub(r"@\w*\s", '', string)           # remove @mentions
        string = re.sub(r'\<a href', ' ', string)        # remove HTML anchor tags
        string = re.sub(r'&amp;', '', string)            # remove encoded ampersands
        string = re.sub(r'<br />', ' ', string)          # remove HTML line breaks
        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)    # strip punctuation and special characters
        string = re.sub(r'\d', '', string)               # remove digits
        return string

     

    data.text = data.text.apply(lambda x: clean_str(str(x)))

  6. Remove all the stop words from the tweets, as was done in the Text Preprocessing section:

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english')

    stop_words = set(stop_words)

    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['text'] = data['text'].apply(remove_stop_words)

     

    import numpy as np

    def combine_text(text):
        # Join the tokens of the first sentence back into a single string; tweets that
        # are empty after stop word removal become NaN so that they can be dropped.
        try:
            return ' '.join(text[0])
        except:
            return np.nan

     

    data.text = data.text.apply(lambda x: combine_text(x))

     

    data = data.dropna(how='any')

  7. Tokenize the tweets and convert them to numbers using the Keras Tokenizer:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)

    tokenizer.fit_on_texts(list(data['text']))

    sequences = tokenizer.texts_to_sequences(data['text'])

    word_index = tokenizer.word_index

  8. Finally, pad the tweets so that each one is 50 tokens long. Tweets longer than 50 tokens are truncated and shorter ones are padded with 0s (with the default settings, both happen at the start of the sequence):

    from keras.preprocessing.sequence import pad_sequences

    tweets = pad_sequences(sequences, maxlen=50)

  9. Create the embedding matrix from the downloaded GloVe embedding file using the load_embedding function:

    import numpy as np

    def load_embedding(filename, word_index , num_words, embedding_dim):

        embeddings_index = {}

        file = open(filename, encoding="utf-8")

        for line in file:

            values = line.split()

            word = values[0]

            coef = np.asarray(values[1:])

            embeddings_index[word] = coef

        file.close()

        

        embedding_matrix = np.zeros((num_words, embedding_dim))

        for word, pos in word_index.items():

            if pos >= num_words:

                continue

            embedding_vector = embeddings_index.get(word)

            if embedding_vector is not None:

                embedding_matrix[pos] = embedding_vector

        return embedding_matrix

     

    embedding_matrix = load_embedding('../../embedding/glove.twitter.27B.50d.txt', word_index, len(word_index), 50)

  10. Split the dataset into training and testing sets with an 80:20 split. You can experiment with different splits:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(tweets, pd.get_dummies(data.sentiment), test_size=0.2, random_state=9)

  11. Create the LSTM model that will predict the sentiment. You can modify this to create your own neural network:

    from keras.models import Sequential

    from keras.layers import Dense, Dropout, BatchNormalization, Embedding, Flatten, LSTM

    embedding_layer = Embedding(len(word_index),

                               50,

                               weights=[embedding_matrix],

                               input_length=50,

                                trainable=False)

    model = Sequential()

    model.add(embedding_layer)

    model.add(Dropout(0.5))

    model.add(LSTM(100, dropout=0.2))

    model.add(Dense(2, activation='softmax'))

     

    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])
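    Before training, it can help to confirm the layer shapes and check that the frozen embedding layer contributes no trainable parameters; model.summary() prints this overview (the exact counts depend on your vocabulary size):

    model.summary()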

  12. Train the model. Here, we train it only for 10 epochs. You can increase the number of epochs to try and get a better accuracy:

    model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)

  13. Check how well the model is performing by predicting the sentiment of a few tweets in the test set:

    preds = model.predict(X_test)

    review_num = 1

    print("Tweet: \n"+tokenizer.sequences_to_texts([X_test[review_num]])[0])

    sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"

    print("\nPredicted sentiment = "+ sentiment)

    sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"

    print("\nActual sentiment = "+ sentiment)

    The output is as follows:

Figure 7.42: Positive (left) and negative (right) tweets and their predictions