Learning Data Mining with Python

You're reading from Learning Data Mining with Python: Use Python to manipulate data and build predictive models

Product type Paperback
Published in Apr 2017
Publisher Packt
ISBN-13 9781787126787
Length 358 pages
Edition 2nd Edition

What is classification?

Classification is one of the largest uses of data mining, both in practical use and in research. As before, we have a set of samples that represents objects or things we are interested in classifying. We also have a new array, the class values. These class values give us a categorization of the samples. Some examples are as follows:

  • Determining the species of a plant by looking at its measurements. The class value here would be: Which species is this?
  • Determining if an image contains a dog. The class would be: Is there a dog in this image?
  • Determining if a patient has cancer, based on the results of a specific test. The class would be: Does this patient have cancer?

While many of the previous examples are binary (yes/no) questions, they do not have to be, as in the case of plant species classification in this section.

The goal of classification applications is to train a model on a set of samples with known classes and then apply that model to new unseen samples with unknown classes. For example, I want to train a spam classifier on my past emails, which I have labeled as spam or not spam. I then want to use that classifier to determine whether my next email is spam, without needing to classify it myself.

Loading and preparing the dataset

The dataset we are going to use for this example is the famous Iris dataset of plant classification. In this dataset, we have 150 plant samples and four measurements of each: sepal length, sepal width, petal length, and petal width (all in centimeters). This dataset (first used in 1936!) is one of the classic datasets for data mining. There are three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The aim is to determine which type of plant a sample is, by examining its measurements.

The scikit-learn library contains this dataset built-in, making the loading of the dataset straightforward:

from sklearn.datasets import load_iris 
dataset = load_iris()
X = dataset.data
y = dataset.target

You can also print(dataset.DESCR) to see an outline of the dataset, including some details about the features.

The features in this dataset are continuous values, meaning they can take any value within a range. Measurements are a good example of this type of feature, where a measurement can take the value of 1, 1.2, 1.25, and so on. Another aspect of continuous features is that feature values that are close to each other indicate similarity. A plant with a sepal length of 1.2 cm is like a plant with a sepal length of 1.25 cm.

In contrast are categorical features. These features, while often represented as numbers, cannot be compared in the same way. In the Iris dataset, the class values are an example of a categorical feature. The class 0 represents Iris Setosa, class 1 represents Iris Versicolour, and class 2 represents Iris Virginica. The numbering doesn't mean that Iris Setosa is more similar to Iris Versicolour than it is to Iris Virginica, even though the class values 0 and 1 are numerically closer than 0 and 2. The numbers here represent categories. All we can say is whether categories are the same or different.
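
A small sketch may make the distinction concrete. The measurement values and the helper name same_category below are made up for illustration: for a continuous feature a numeric difference is meaningful, whereas for class codes only equality is.

```python
# Continuous feature: numeric distance is meaningful.
sepal_lengths_cm = [5.1, 5.2, 7.0]  # illustrative values, not taken from the dataset
diff_close = abs(sepal_lengths_cm[0] - sepal_lengths_cm[1])
diff_far = abs(sepal_lengths_cm[0] - sepal_lengths_cm[2])
print(diff_close < diff_far)  # the first two plants are more alike on this feature

# Categorical feature: the integer codes are only labels.
class_names = {0: "Iris Setosa", 1: "Iris Versicolour", 2: "Iris Virginica"}

def same_category(a, b):
    """The only valid comparison between two class codes is equality."""
    return a == b

print(same_category(0, 0))
print(same_category(0, 1))
# abs(0 - 1) < abs(0 - 2) holds arithmetically, but it says nothing
# about which species are more alike.
```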

There are other types of features too, which we will cover in later chapters. These include pixel intensity, word frequency and n-gram analysis.

While the features in this dataset are continuous, the algorithm we will use in this example requires categorical features. Turning a continuous feature into a categorical feature is a process called discretization.

A simple discretization algorithm is to choose some threshold: any value below the threshold is given the value 0, and any value at or above it is given the value 1. For our threshold, we will compute the mean (average) value for that feature. To start with, we compute the mean for each feature:

attribute_means = X.mean(axis=0)

The result from this code will be an array of length 4, which is the number of features we have. The first value is the mean of the values for the first feature and so on. Next, we use this to transform our dataset from one with continuous features to one with discrete categorical features:

import numpy as np
n_samples, n_features = X.shape
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')

We will use this new X_d dataset (for X discretized) for our training and testing, rather than the original dataset (X).
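
To see what this thresholding does, here is a minimal worked example on a hypothetical 4-by-2 array (the names X_toy, toy_means, and X_toy_d are invented for this sketch and are not the Iris data). Note that a value exactly equal to the mean is mapped to 1, because the comparison is >=.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features (not the Iris measurements).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [6.0, 40.0]])

# One mean per column: 3.0 for the first feature, 25.0 for the second.
toy_means = X_toy.mean(axis=0)

# Broadcasting compares every row against the per-feature means.
X_toy_d = np.array(X_toy >= toy_means, dtype='int')

print(toy_means)
print(X_toy_d)
```

The first two rows fall below both means and become [0, 0]; the last two rows are at or above both means and become [1, 1].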

Implementing the OneR algorithm

OneR is a simple algorithm that predicts the class of a sample by finding the most frequent class for the sample's feature value. OneR is shorthand for One Rule, indicating we only use a single rule for this classification by choosing the feature with the best performance. While some of the later algorithms are significantly more complex, this simple algorithm has been shown to have good performance in some real-world datasets.

The algorithm starts by iterating over every value of every feature. For that value, count the number of samples from each class that have that feature value. Record the most frequent class for the feature value, and the error of that prediction.

For example, if a feature has two values, 0 and 1, we first check all samples that have the value 0. For that value, we may have 20 in Class A, 60 in Class B, and a further 20 in Class C. The most frequent class for this value is B, and there are 40 instances that have different classes. The prediction for this feature value is B with an error of 40, as there are 40 samples that have a different class from the prediction. We then do the same procedure for the value 1 for this feature, and then for all other feature value combinations.
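
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch using the counts from the paragraph above; it relies on collections.Counter rather than the defaultdict used later in this section.

```python
from collections import Counter

# Class counts among the samples whose feature value is 0 (numbers from the text).
class_counts = Counter({"A": 20, "B": 60, "C": 20})

# The prediction is the class with the highest count.
most_frequent_class, top_count = class_counts.most_common(1)[0]

# The error is every sample that does not belong to that class.
error = sum(class_counts.values()) - top_count

print(most_frequent_class, error)  # B 40
```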

Once these combinations are computed, we compute the error for each feature by summing up the errors for all values for that feature. The feature with the lowest total error is chosen as the One Rule and then used to classify other instances.
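
As a sketch of this selection step, suppose the per-value errors have already been computed. The numbers and the names errors_per_value and total_errors below are invented for illustration.

```python
# Hypothetical per-value errors for three features; the numbers are invented.
errors_per_value = {
    0: {0: 40, 1: 25},   # feature 0: error for value 0 and for value 1
    1: {0: 10, 1: 30},
    2: {0: 35, 1: 35},
}

# Total error of a feature = sum of the errors over all of its values.
total_errors = {feature: sum(value_errors.values())
                for feature, value_errors in errors_per_value.items()}

# The One Rule uses the feature with the lowest total error.
best_feature = min(total_errors, key=total_errors.get)

print(total_errors)   # {0: 65, 1: 40, 2: 70}
print(best_feature)   # 1
```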

In code, we will first create a function that computes the class prediction and error for a specific feature value. We have two necessary imports, defaultdict and itemgetter, that we used in earlier code:

from collections import defaultdict 
from operator import itemgetter

Next, we create the function definition, which needs the dataset, classes, the index of the feature we are interested in, and the value we are computing. It loops over each sample, and counts the number of times each feature value corresponds to a specific class. We then choose the most frequent class for the current feature/value pair:

def train_feature_value(X, y_true, feature, value):
    # Count how often each class occurs among samples with this feature value
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each
    # class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Get the most frequent class by sorting (highest count first) and
    # choosing the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1),
                                 reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that have the feature value
    # *and* do not belong to the most frequent class
    error = sum(class_count
                for class_value, class_count in class_counts.items()
                if class_value != most_frequent_class)
    return most_frequent_class, error

As a final step, we also compute the error of this rule. In the OneR algorithm, any sample with this feature value would be predicted as being the most frequent class. Therefore, we compute the error by summing up the counts for the other classes (not the most frequent). These represent training samples that result in error or an incorrect classification.

With this function, we can now compute the error for an entire feature by looping over all the values for that feature, summing the errors, and recording the predicted classes for each value.

The function needs the dataset, classes, and feature index we are interested in. It then iterates through the different values the feature takes and, for each one, finds the most frequent class to predict. Together, these predictions form the rule for this feature in OneR:

def train(X, y_true, feature):
    # Check that the feature index is valid
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(
            X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error

Let's have a look at this function in a little more detail.

After some initial tests, we then find all the unique values that the given feature takes. The indexing in the next line looks at the whole column for the given feature and returns it as an array. We then use the set function to find only the unique values:

    values = set(X[:, feature])

Next, we create our dictionary that will store the predictors. This dictionary will have feature values as the keys and classification as the value. An entry with key 1.5 and value 2 would mean that, when the feature has a value set to 1.5, classify it as belonging to class 2. We also create a list storing the errors for each feature value:

predictors = {} 
errors = []
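
To illustrate how such a dictionary is read at prediction time, here is a small sketch with a hypothetical predictors dictionary. The lookup helper and its default argument are assumptions of this sketch, added to show what happens for a feature value that never appeared in training; the code in this section does not handle that case.

```python
# A hypothetical predictors dictionary: feature value -> predicted class.
predictors = {0: 0, 1: 2}

def lookup(predictors, value, default=None):
    """Return the class for a feature value, or `default` if the value
    never appeared in training (the fallback is an assumption of this sketch)."""
    return predictors.get(value, default)

print(lookup(predictors, 1))       # 2
print(lookup(predictors, 5))       # None
print(lookup(predictors, 5, 0))    # 0
```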

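As the main section of this function, we iterate over all the unique values for this feature and use our previously defined train_feature_value function to find the most frequent class and the error for a given feature value. We store the results as outlined earlier:

for current_value in values:
    most_frequent_class, error = train_feature_value(
        X, y_true, feature, current_value)
    predictors[current_value] = most_frequent_class
    errors.append(error)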

Finally, we compute the total errors of this rule and return the predictors along with this value:

total_error = sum(errors)
return predictors, total_error
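
The steps of this section can also be exercised end to end on a toy dataset. The following is a compact sketch in plain Python, not the chapter's implementation: the function name one_r_train and the data are hypothetical, and it uses Counter in place of the sorting approach shown earlier.

```python
from collections import Counter, defaultdict

def one_r_train(X, y):
    """Return (best_feature, predictors, total_error) for a list-of-lists X."""
    best = None
    for feature in range(len(X[0])):
        # Count classes separately for each value of this feature.
        counts = defaultdict(Counter)
        for sample, label in zip(X, y):
            counts[sample[feature]][label] += 1
        predictors = {}
        total_error = 0
        for value, class_counts in counts.items():
            most_frequent_class, top = class_counts.most_common(1)[0]
            predictors[value] = most_frequent_class
            # Samples with this value that belong to any other class are errors.
            total_error += sum(class_counts.values()) - top
        # Keep the feature with the lowest total error seen so far.
        if best is None or total_error < best[2]:
            best = (feature, predictors, total_error)
    return best

# Hypothetical discretized data: feature 1 separates the classes perfectly.
X_toy = [[0, 0], [1, 0], [0, 0], [1, 1], [0, 1], [1, 1]]
y_toy = [0, 0, 0, 1, 1, 1]

feature, predictors, total_error = one_r_train(X_toy, y_toy)
print(feature, predictors, total_error)  # 1 {0: 0, 1: 1} 0
```

Feature 0 makes two mistakes on this data (one for each of its values), while feature 1 makes none, so feature 1 becomes the rule.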

Testing the algorithm

When we evaluated the affinity analysis algorithm of the earlier section, our aim was to explore the current dataset. With this classification, our problem is different. We want to build a model that will allow us to classify previously unseen samples by comparing them to what we know about the problem.

For this reason, we split our machine-learning workflow into two stages: training and testing. In training, we take a portion of the dataset and create our model. In testing, we apply that model and evaluate how effectively it worked on the dataset. As our goal is to create a model that can classify previously unseen samples, we cannot use our testing data for training the model. If we do, we run the risk of overfitting.

Overfitting is the problem of creating a model that classifies our training dataset very well but performs poorly on new samples. The solution is quite simple: never use training data to test your algorithm. This simple rule has some complex variants, which we will cover in later chapters; but, for now, we can evaluate our OneR implementation by simply splitting our dataset into two smaller datasets: a training one and a testing one. This workflow is given in this section.
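
An extreme case shows why testing on training data is misleading. The sketch below uses invented data and a deliberately poor "model" (the names memorizer and accuracy are hypothetical) that memorizes its training samples and guesses class 0 for anything else.

```python
# Hypothetical data: the label is 1 when the measurement is at least 5.0.
train_samples = [4.1, 4.8, 5.2, 6.0]
train_labels = [0, 0, 1, 1]
test_samples = [4.5, 5.5, 5.9, 4.9]
test_labels = [0, 1, 1, 0]

# A "model" that memorizes the training set and guesses 0 for anything else.
memory = dict(zip(train_samples, train_labels))

def memorizer(x):
    return memory.get(x, 0)

def accuracy(predict, samples, labels):
    """Fraction of samples for which the prediction matches the label."""
    hits = sum(predict(x) == label for x, label in zip(samples, labels))
    return hits / len(labels)

train_accuracy = accuracy(memorizer, train_samples, train_labels)
test_accuracy = accuracy(memorizer, test_samples, test_labels)
print(train_accuracy, test_accuracy)  # 1.0 0.5
```

Evaluated on its own training data, the memorizer looks perfect; only the held-out samples reveal that it has learned nothing general.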

The scikit-learn library contains a function to split data into training and testing components:

from sklearn.model_selection import train_test_split

This function will split the dataset into two sub-datasets, per a given ratio (which by default uses 25 percent of the dataset for testing). It does this randomly, which improves the confidence that the algorithm will perform as expected in real world environments (where we expect data to come in from a random distribution):

Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y,
                                                      random_state=14)

We now have two smaller datasets: Xd_train contains our data for training and Xd_test contains our data for testing. y_train and y_test give the corresponding class values for these datasets.

We also specify a random_state. Setting the random state will give the same split every time the same value is entered. It will look random, but the algorithm used is deterministic, and the output will be consistent. For this book, I recommend setting the random state to the same value that I do, as it will give you the same results that I get, allowing you to verify your results. To get truly random results that change every time you run it, set random_state to None.
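
The idea of a seeded, repeatable split can be sketched with the standard library alone. The function simple_split below is a hypothetical stand-in written for this illustration; it is not how scikit-learn implements train_test_split, and it will not reproduce the same split.

```python
import random

def simple_split(samples, test_fraction=0.25, seed=None):
    """Shuffle indices with a seeded generator, then cut off a test portion.
    This is an illustrative stand-in, not scikit-learn's implementation."""
    rng = random.Random(seed)  # seed=None gives a different shuffle on each run
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [samples[i] for i in train_idx], [samples[i] for i in test_idx]

data = list(range(20))
train_a, test_a = simple_split(data, seed=14)
train_b, test_b = simple_split(data, seed=14)
print(test_a == test_b)            # True: same seed, same split
print(len(train_a), len(test_a))   # 15 5
```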

Next, we compute the predictors for all the features for our dataset. Remember to only use the training data for this process. We iterate over all the features in the dataset and use our previously defined functions to train the predictors and compute the errors:

all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictors, total_error = train(Xd_train,
                                    y_train,
                                    feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error

Next, we find the best feature to use as our One Rule, by finding the feature with the lowest error:

best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]

We then create our model by storing the predictors for the best feature:

model = {'feature': best_feature,
         'predictor': all_predictors[best_feature]}

Our model is a dictionary that tells us which feature to use for our One Rule and the predictions that are made based on the values it has. Given this model, we can predict the class of a previously unseen sample by finding the value of the specific feature and using the appropriate predictor. The following code does this for a given sample:

variable = model['feature'] 
predictor = model['predictor']
prediction = predictor[int(sample[variable])]

Often we want to predict several new samples at one time, which we can do using the following function. It uses the same code as above, but iterates over all the samples in a dataset, obtaining the prediction for each sample:

def predict(X_test, model):
    variable = model['feature']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])]
                            for sample in X_test])
    return y_predicted

For our testing dataset, we get the predictions by calling the following function:

y_predicted = predict(Xd_test, model)

We can then compute the accuracy of this by comparing it to the known classes:

accuracy = np.mean(y_predicted == y_test) * 100 
print("The test accuracy is {:.1f}%".format(accuracy))

This algorithm gives an accuracy of 65.8 percent, which is not bad for a single rule!
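
One way to put such a figure in context is to compare it with a trivial baseline that always predicts a single class. The sketch below computes that baseline over the whole Iris dataset, which has 50 samples in each of its three classes; the accuracy helper is written for this illustration, and the baseline on a particular test split may differ slightly.

```python
from collections import Counter

def accuracy(y_pred, y_true):
    """Fraction of predictions that match the known classes."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# The Iris dataset has 50 samples in each of its three classes.
y_all = [0] * 50 + [1] * 50 + [2] * 50

# Baseline: always predict the most frequent class (ties broken arbitrarily).
majority_class = Counter(y_all).most_common(1)[0][0]
baseline = accuracy([majority_class] * len(y_all), y_all)

print("Baseline accuracy: {:.1f}%".format(baseline * 100))  # Baseline accuracy: 33.3%
```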
