Spam filtering
Our first problem is a modern version of the canonical binary classification problem: spam filtering. In our version, however, we will classify spam and ham SMS messages rather than e-mail. We will extract tf-idf features from the messages using the techniques we learned in previous chapters, and classify the messages using logistic regression. We will use the SMS Spam Collection Data Set from the UCI Machine Learning Repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the dataset and calculate some basic summary statistics using pandas:
# In[1]:
import pandas as pd

# The file is tab-separated (label, then message) and has no header row
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
print(df.head())

# Out[1]:
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham...
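
One way to compute basic summary statistics for a file in this layout is to count the labels in column 0. The following is a minimal sketch rather than the chapter's own code: it assumes the same tab-separated label and message format, and it reads a few hypothetical toy rows from memory (they are not taken from the dataset) so that it runs without the downloaded file. The names `sample`, `label_counts`, and `spam_fraction` are illustrative.

```python
import io

import pandas as pd

# Hypothetical toy rows in the same "label<TAB>message" layout as the
# dataset file. These messages are invented for illustration only.
sample = (
    "ham\tSee you at lunch\n"
    "spam\tWin a free prize now\n"
    "ham\tRunning late, sorry\n"
    "ham\tCall me later\n"
)

# With header=None, pandas names the columns 0 (label) and 1 (message).
df = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None)

# Number of messages per class.
label_counts = df[0].value_counts()

# Share of messages labelled spam: the mean of a boolean mask.
spam_fraction = (df[0] == 'spam').mean()

print(label_counts)
print('Fraction of spam:', spam_fraction)
```

To apply the same calls to the real data, replace the `io.StringIO(sample)` argument with the path to the downloaded file. Checking the class balance first is useful because a heavily skewed label distribution affects how accuracy figures should be read later.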