Using Vectorized Operations to Analyze Data Fast
The core building blocks in all programmers' toolboxes are looping and conditionals – usually materialized as a `for` loop or an `if` statement, respectively. Almost any programming problem in its most fundamental form can be broken down into a series of conditional operations (only do something if a specific condition is met) and a series of iterative operations (carry on doing the same thing until a condition is met).
In machine learning, vectors, matrices, and tensors become the basic building blocks, taking over from arrays and linked lists. When we are manipulating and analyzing matrices, we often want to apply a single operation or function to the entire matrix.
Programmers coming from a traditional computer science background will often use a `for` loop or a `while` loop to do this kind of analysis or manipulation, but these loops are inefficient for the job.
Instead, it is important to become comfortable with vectorized operations. Nearly all modern processors can efficiently modify matrices and vectors in parallel by applying the same operation to every element simultaneously.
Similarly, many software packages are optimized for exactly this use case: applying the same operator to many rows of a matrix.
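As a minimal illustration of this idea (a sketch using `NumPy`, which is not required for the exercise below), a single expression can operate on every element of a matrix at once, with no explicit loop over rows or columns:

```python
import numpy as np

# A small matrix of illustrative headline lengths
lengths = np.array([[42, 60, 72],
                    [49, 66, 51]])

# One vectorized expression is applied to every element at once;
# the library handles the iteration in optimized, compiled code
doubled = lengths * 2
print(doubled)  # [[ 84 120 144]
                #  [ 98 132 102]]
```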
But if you are used to writing `for` loops, it can be difficult to get out of the habit. So, we will compare the `for` loop with the vectorized operation to help you understand the reason to avoid using a `for` loop. In the next exercise, we'll use our headlines dataset again and do some basic analysis. We'll do each piece of analysis twice: first using a `for` loop, and then again using a vectorized operation. You'll see the speed differences even on this relatively small dataset, but these differences will be even more important on the larger datasets that we previously discussed.

While some languages have great support for vectorized operations out of the box, Python relies mainly on third-party libraries to take advantage of these. We'll be using `pandas` in the upcoming exercise.
Exercise 1.02: Applying Vectorized Operations to Entire Matrices
In this exercise, we'll use the `pandas` library to load the same clickbait dataset and carry out some descriptive analysis. We'll do each piece of analysis twice to see the efficiency gains of using vectorized operations compared to `for` loops.
Perform the following steps to complete the exercise:
- Create a new directory, `Exercise01.02`, in the `Chapter01` directory to store the files for this exercise.
- Open your Terminal (macOS or Linux) or Command Prompt (Windows), navigate to the `Chapter01` directory, and type `jupyter notebook`.
- In the Jupyter notebook, click the `Exercise01.02` directory and create a new notebook file with a Python 3 kernel.
- Import the `pandas` library and use it to read the dataset file into a DataFrame, as shown in the following code:

```python
import pandas as pd
df = pd.read_csv("../Datasets/clickbait-headlines.tsv", \
                 sep="\t", names=["Headline", "Label"])
df
```
You should get the following output:
We import the `pandas` library and then use the `read_csv()` function to read the file into a DataFrame called `df`. We pass the `sep` argument to indicate that the file uses tab (`\t`) characters as separators and then pass in the column names as the `names` argument. The output is summarized to show only the first few entries and the last few, followed by a description of how many rows and columns there are in the entire DataFrame.
- Calculate the length of each headline and print out the first 10 lengths using a `for` loop, along with the total performance timing, as shown in the following code:

```python
%%time
lengths = []
for i, row in df.iterrows():
    lengths.append(len(row[0]))
print(lengths[:10])
```
You should get the following output:
```
[42, 60, 72, 49, 66, 51, 51, 58, 57, 76]
CPU times: user 1.82 s, sys: 50.8 ms, total: 1.87 s
Wall time: 1.95 s
```
We declare an empty array to store the lengths, then loop through each row in our DataFrame using the `iterrows()` method. We append the length of the first item of each row (the headline) to our array, and finally, print out the first 10 results.
- Now re-calculate the length of each row, but this time using vectorized operations, as shown in the following code:

```python
%%time
lengths = df['Headline'].apply(len)
print(lengths[:10])
```
You should get the following output:
```
0    42
1    60
2    72
3    49
4    66
5    51
6    51
7    58
8    57
9    76
Name: Headline, dtype: int64
CPU times: user 6.31 ms, sys: 1.7 ms, total: 8.01 ms
Wall time: 7.76 ms
```
We use the `apply()` function to apply `len` to every row in our DataFrame, without a `for` loop. Then we print the results to verify they are the same as when we used the `for` loop. From the output, we can see the results are the same, but this time it took only a few milliseconds instead of nearly 2 seconds to carry out all of these calculations. Now, let's try a different calculation.
- This time, find the average length of all clickbait headlines and compare this average to the length of normal headlines, as shown in the following code:

```python
%%time
from statistics import mean

normal_lengths = []
clickbait_lengths = []
for i, row in df.iterrows():
    if row[1] == 1:  # clickbait
        clickbait_lengths.append(len(row[0]))
    else:
        normal_lengths.append(len(row[0]))
print("Mean normal length is {}"\
      .format(mean(normal_lengths)))
print("Mean clickbait length is {}"\
      .format(mean(clickbait_lengths)))
```
Note

The `#` symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.

You should get the following output:
```
Mean normal length is 52.0322
Mean clickbait length is 55.6876
CPU times: user 1.91 s, sys: 40.7 ms, total: 1.95 s
Wall time: 2.03 s
```
We import the `mean` function from the `statistics` library. This time, we set up two empty arrays, one for the lengths of normal headlines and one for the lengths of clickbait headlines. We use the `iterrows()` function again to check every row and calculate the length, but this time store the result in one of our two arrays, based on whether the headline is clickbait or not. We then take the average of each array and print it out.
- Now recalculate this output using vectorized operations, as shown in the following code:

```python
%%time
print(df[df["Label"] == 0]['Headline'].apply(len).mean())
print(df[df["Label"] == 1]['Headline'].apply(len).mean())
```
You should get the following output:
```
52.0322
55.6876
CPU times: user 10.5 ms, sys: 3.14 ms, total: 13.7 ms
Wall time: 14 ms
```
In each line, we look at only a subset of the DataFrame: first when the label is `0`, and second when it is `1`. We again apply the `len` function to each row that matches the condition and then take the average of the entire result. We confirm that the output is the same as before, but the overall time is in milliseconds in this case.
- As a final test, calculate how often the word `"you"` appears in each kind of headline, as shown in the following code:

```python
%%time
from statistics import mean

normal_yous = 0
clickbait_yous = 0
for i, row in df.iterrows():
    num_yous = row[0].lower().count("you")
    if row[1] == 1:  # clickbait
        clickbait_yous += num_yous
    else:
        normal_yous += num_yous
print("Total 'you's in normal headlines {}".format(normal_yous))
print("Total 'you's in clickbait headlines {}".format(clickbait_yous))
```
You should get the following output:
```
Total 'you's in normal headlines 43
Total 'you's in clickbait headlines 2527
CPU times: user 1.48 s, sys: 8.84 ms, total: 1.49 s
Wall time: 1.53 s
```
We define two variables, `normal_yous` and `clickbait_yous`, to count the total occurrences of the word `you` in each class of headline. We loop through the entire dataset again using a `for` loop and the `iterrows()` function. For each row, we use the `count()` function to count how often the word `you` appears and then add this count to the relevant total. Finally, we print out both results, seeing that `you` appears very often in clickbait headlines, but hardly at all in non-clickbait headlines.
- Rerun the same analysis without using a `for` loop and compare the time, as shown in the following code:

```python
%%time
print(df[df["Label"] == 0]['Headline']\
      .apply(lambda x: x.lower().count("you")).sum())
print(df[df["Label"] == 1]['Headline']\
      .apply(lambda x: x.lower().count("you")).sum())
```
You should get the following output:
```
43
2527
CPU times: user 20.8 ms, sys: 1.32 ms, total: 22.1 ms
Wall time: 27.9 ms
```
We break the dataset into two subsets and apply the same operation to each. This time, our function is a bit more complicated than the `len` function we used before, so we define an anonymous function inline using `lambda`. We lowercase each headline and count how often `"you"` appears, and then sum the results. We notice that the performance time, in this case, is again in milliseconds.

Note
To access the source code for this specific section, please refer to https://packt.live/2OmyEE2.
In this exercise, the main takeaway is that vectorized operations can be many times faster than `for` loops. We also learned some interesting things about clickbait characteristics, though. For example, the word `"you"` appears very often in clickbait headlines (2,527 times), but hardly ever in normal headlines (43 times). Clickbait headlines are also, on average, slightly longer than non-clickbait headlines.
Let's implement the concepts learned so far in the next activity.
Activity 1.01: Creating a Text Classifier for Movie Reviews
In this activity, we will create another text classifier. Instead of training a machine learning model to discriminate between clickbait headlines and normal headlines, we will train a similar classifier to discriminate between positive and negative movie reviews.
The objectives of our activity are as follows:
- Vectorize the text of IMDb movie reviews and label these as positive or negative.
- Train an SVM classifier to predict whether a movie review is positive or negative.
- Check how accurate our classifier is on a held-out test set.
- Evaluate our classifier on out-of-context data.
Note
We will be using some randomizers in this activity. It is helpful to set the global random seeds to ensure that the results you see are the same as in the examples. `sklearn` uses the `NumPy` random seed, and we will also use the `shuffle` function from the built-in `random` library. You can ensure you see the same results by adding the following code:

```python
import random
import numpy as np
random.seed(1337)
np.random.seed(1337)
```
We'll use the `aclImdb` dataset of 100,000 movie reviews from the Internet Movie Database (IMDb) – 50,000 each for training and testing. Each dataset has 25,000 positive reviews and 25,000 negative ones, so this is a larger dataset than our headlines one. The dataset can be found in our GitHub repository at the following location: https://packt.live/2C72sBN

You need to download the `aclImdb` folder from the GitHub repository.
Dataset Citation: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
In Exercise 1.01, Training a Machine Learning Model to Identify Clickbait Headlines, we had one file, with each line representing a different data item. Now we have a file for each data item, so keep in mind that we'll need to restructure some of our training code accordingly.
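To make that restructuring concrete, here is a minimal, hypothetical sketch of reading a directory of one-file-per-item reviews into a list of (text, label) tuples; the directory path is illustrative, and the activity below asks you to write your own version of this helper:

```python
import os

def read_dataset(path, label):
    # Read every file in `path` and pair its full text with the given label
    data = []
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), encoding="utf-8") as f:
            data.append((f.read(), label))
    return data

# Illustrative usage (adjust the path to wherever you stored the dataset):
# train_pos = read_dataset("aclImdb/train/pos", "pos")
```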
Note
The code and the resulting output for this activity have been loaded in a Jupyter notebook that can be found at https://packt.live/3iWYZGH.
Perform the following steps to complete the activity:
- Import the `os` library and the `random` library, and define where our training and test data is stored using four variables: one for `training_positive`, one for `training_negative`, one for `test_positive`, and one for `test_negative`, each pointing at the respective dataset subdirectory.
- Define a `read_dataset` function that takes a path to a dataset and a label (either `pos` or `neg`), reads the contents of each file in the given directory, and adds these contents to a data structure that is a list of tuples. Each tuple contains both the text of the file and the label, `pos` or `neg`. An example is shown as follows. The actual data should be read from disk instead of being defined in code:

```python
contents_labels = [('this is the text from one of the files', 'pos'),
                   ('this is another text', 'pos')]
```

- Use the `read_dataset` function to read each dataset into its own variable. You should have four variables in total: `train_pos`, `train_neg`, `test_pos`, and `test_neg`, each of which is a list of tuples containing the respective text and labels.
- Combine the `train_pos` and `train_neg` datasets. Do the same for the `test_pos` and `test_neg` datasets.
- Use the `random.shuffle` function to shuffle the train and test datasets separately. This gives us datasets where the training data is mixed up, instead of feeding all the positive and then all the negative examples to the classifier in order.
- Split each of the train and test datasets back into `data` and `labels` respectively. You should have four variables again, called `train_data`, `y_train`, `test_data`, and `y_test`, where the `y` prefix indicates that the respective array contains labels.
- Import `TfidfVectorizer` from `sklearn`, initialize an instance of it, fit the vectorizer on the training data, and vectorize both the training and testing data into the `X_train` and `X_test` variables respectively. Time how long this takes and print out the shape of the training vectors at the end.
- Again timing the execution, import `LinearSVC` from `sklearn` and initialize an instance of it. Fit the SVM on the training data and training labels, and then generate predictions on the test data (`X_test`).
- Import `accuracy_score` and `classification_report` from `sklearn` and calculate the results of your predictions. You should get the following output:
- See how your classifier performs on data on different topics. Create two restaurant reviews as follows:

```python
good_review = "The restaurant was really great! "\
              "I ate wonderful food and had a very good time"
bad_review = "The restaurant was awful. "\
             "The staff were rude and "\
             "the food was horrible. "\
             "I hated it"
```

- Now vectorize each using the same vectorizer and generate predictions for whether each one is negative or positive. Did your classifier guess correctly? (A rough sketch of these final steps follows this list.)
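If you get stuck on the final steps, the following is one possible rough sketch, not the official solution, of how steps 7 to 10 could fit together. It assumes the `train_data`, `y_train`, `test_data`, `y_test`, `good_review`, and `bad_review` variables from the earlier steps already exist, and the names `vectorizer` and `svm` are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Fit the vectorizer on the training text only, then reuse its vocabulary
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)

# Train a linear SVM and evaluate it on the held-out test set
svm = LinearSVC()
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

# Out-of-context check: vectorize the restaurant reviews with the same
# vectorizer and see which label the classifier predicts for each
print(svm.predict(vectorizer.transform([good_review, bad_review])))
```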
Now that we've built two machine learning models and gained some hands-on experience with vectorized operations, it's time to recap.
Note
The solution for this activity can be found via this link.