The name Naïve Bayes comes from the basic assumption in the model that the probability of a particular feature Xi is independent of any other feature Xj given the class label CK. This implies the following:
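$$P(X_i \mid C_K, X_j) = P(X_i \mid C_K), \quad i \neq j$$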
Using this assumption and Bayes' rule, one can show that the probability of class CK, given the features {X1,X2,X3,...,Xn}, is given by:
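$$P(C_K \mid X_1, X_2, X_3, \ldots, X_n) = \frac{P(C_K)\,\prod_{i=1}^{n} P(X_i \mid C_K)}{P(X_1, X_2, X_3, \ldots, X_n)}$$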
Here, P(X1,X2,X3,...,Xn) is the normalization term obtained by summing the numerator over all values of K. It is also called the Bayesian evidence or the partition function Z. The classifier selects as the target class the label that maximizes the posterior class probability P(CK |{X1,X2,X3,...,Xn}):
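$$\hat{C} = \underset{C_K}{\arg\max}\; P(C_K \mid X_1, X_2, \ldots, X_n) = \underset{C_K}{\arg\max}\; P(C_K)\prod_{i=1}^{n} P(X_i \mid C_K)$$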
The Naïve Bayes classifier is a common baseline for document classification. One reason for this is that the underlying assumption, that each feature (word or m-gram) is independent of the others given the class label, holds reasonably well for text. Another reason is that the Naïve Bayes classifier scales well when there is a large number of documents.
There are two common implementations of Naïve Bayes. In Bernoulli Naïve Bayes, the features are binary variables that encode whether a feature (m-gram) is present or absent in a document. In multinomial Naïve Bayes, the features are the frequencies of m-grams in a document. To avoid zero probabilities for m-grams that do not appear in a class, Laplace smoothing is applied by adding 1 to each count. Let's look at multinomial Naïve Bayes in some detail.
Let ni be the number of times the feature Xi occurred in the class CK in the training data. Then, the likelihood function of observing a feature vector X = {X1,X2,X3,...,Xn}, given a class label CK, is given by:
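$$P(X \mid C_K) = \frac{\left(\sum_{i=1}^{n} X_i\right)!}{\prod_{i=1}^{n} X_i!}\;\prod_{i=1}^{n} p_{ki}^{X_i}$$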
Here, $p_{ki}$ is the probability of observing the feature $X_i$ in the class $C_K$.
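With the Laplace smoothing mentioned earlier, $p_{ki}$ is typically estimated from the training counts as $p_{ki} = \frac{n_i + 1}{\sum_{j=1}^{n} n_j + n}$, where the sum in the denominator runs over all features observed in the class $C_K$.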
Using Bayes' rule, the posterior probability of the class CK, given a feature vector X, is given by:
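$$P(C_K \mid X) = \frac{P(C_K)\,P(X \mid C_K)}{Z} \;\propto\; P(C_K)\prod_{i=1}^{n} p_{ki}^{X_i}$$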
Taking the logarithm on both sides and ignoring the constant term Z, we get the following:
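$$\log P(C_K \mid X) = \log P(C_K) + \sum_{i=1}^{n} X_i \log p_{ki} + \text{constant}$$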
So, by taking the logarithm of the posterior distribution, we have converted the problem into a linear model with $\log(p_{ki})$ as the coefficients, and these coefficients can be estimated easily from the training data. Generally, instead of raw term frequencies, one uses TF-IDF (term frequency multiplied by inverse document frequency), with the document length normalized, to improve the performance of the model.
The R package e1071 (Miscellaneous Functions of the Department of Statistics, TU Wien) contains an R implementation of Naïve Bayes. For this article, we will use the SMS Spam Collection dataset from the UCI Machine Learning repository (reference 1 in the References section of this article). The dataset consists of 425 SMS spam messages collected from the UK forum Grumbletext, where consumers can submit spam SMS messages. The dataset also contains 3,375 normal (ham) SMS messages from the NUS SMS Corpus maintained by the National University of Singapore.
The dataset can be downloaded from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). Let's say that we have saved it as the file SMSSpamCollection.txt in the working directory of R (in practice, you may need to open it in Excel and save it as a tab-delimited file for R to read it properly).
Then, the command to read the file into an R data frame would be the following:
>spamdata <- read.table("SMSSpamCollection.txt", sep = "\t", quote = "", stringsAsFactors = FALSE)
We will first separate the dependent variable y and independent variables x and split the dataset into training and testing sets in the ratio 80:20, using the following R commands:
>samp <- sample.int(nrow(spamdata), as.integer(nrow(spamdata) * 0.2), replace = FALSE)
>spamTest <- spamdata[samp, ]
>spamTrain <- spamdata[-samp, ]
>ytrain <- as.factor(spamTrain[, 1])
>ytest <- as.factor(spamTest[, 1])
>xtrain <- as.vector(spamTrain[, 2])
>xtest <- as.vector(spamTest[, 2])
Since we are dealing with text documents, we need to do some standard preprocessing before we can use the data for any machine learning models. We can use the tm package in R for this purpose. In the next section, we will describe this in some detail.
The tm package has methods for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices. Data can be imported into the tm package either from a directory, a vector with each component a document, or a data frame. The fundamental data structure in tm is an abstract collection of text documents called Corpus. It has two implementations; one is where data is stored in memory and is called VCorpus (volatile corpus) and the second is where data is stored in the hard disk and is called PCorpus (permanent corpus).
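For illustration, a minimal sketch of these import routes is shown below; the path, object names, and data frame columns here are hypothetical, and the exact column names expected by DataframeSource can vary between tm versions:
>#from a directory of plain-text files (hypothetical path)
>corpus1 <- VCorpus(DirSource("path/to/textfiles/"))
>#from a character vector, one document per element
>corpus2 <- VCorpus(VectorSource(c("first document", "second document")))
>#from a data frame with doc_id and text columns (tm >= 0.7 convention)
>docs <- data.frame(doc_id = c("d1", "d2"),
+                   text = c("first document", "second document"))
>corpus3 <- VCorpus(DataframeSource(docs))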
We can create a corpus of our SMS spam dataset by using the following R commands; prior to this, you need to install the tm package and SnowballC package by using the install.packages("packagename") command in R:
>library(tm)
>library(SnowballC)
>xtrain <- VCorpus(VectorSource(xtrain))
First, we need to do some standard text preprocessing, such as removing extra whitespace, punctuation, and numbers, changing all words to lowercase, removing stop words, and stemming the words. This can be achieved by using the following functions in the tm package:
>#remove extra white space
>xtrain <- tm_map(xtrain, stripWhitespace)
>#remove punctuation
>xtrain <- tm_map(xtrain, removePunctuation)
>#remove numbers
>xtrain <- tm_map(xtrain, removeNumbers)
>#change to lowercase
>xtrain <- tm_map(xtrain, content_transformer(tolower))
>#remove stop words
>xtrain <- tm_map(xtrain, removeWords, stopwords("english"))
>#stem the document
>xtrain <- tm_map(xtrain, stemDocument)
Finally, the data is transformed into a form that can be consumed by machine learning models. This is the so-called document-term matrix form, where each document (an SMS in this case) is a row, the terms appearing across all documents are the columns, and each cell entry records how many times that term occurs in that document:
>#creating the document-term matrix
>xtrain <- as.data.frame(as.matrix(DocumentTermMatrix(xtrain)))
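As mentioned earlier, TF-IDF weighting with document length normalization often improves performance over raw counts. A minimal, optional sketch using tm's weightTfIdf is shown below; it would be run in place of the count-based conversion above, while xtrain is still a corpus, and is not used in the rest of this example:
>#optional alternative to the previous step: TF-IDF weighted matrix
>#(weightTfIdf normalizes for document length by default)
>xtrain <- as.data.frame(as.matrix(
+    DocumentTermMatrix(xtrain, control = list(weighting = weightTfIdf))))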
The same set of preprocessing steps is applied to the xtest data as well, as shown in the sketch below. The reason we converted y to factors and xtrain to a data frame is to match the input format expected by the Naïve Bayes classifier in the e1071 package.
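A minimal sketch of that same pipeline applied to xtest might look as follows; the dictionary control option is our addition and simply restricts the test document-term matrix to the training vocabulary:
>xtest <- VCorpus(VectorSource(xtest))
>xtest <- tm_map(xtest, stripWhitespace)
>xtest <- tm_map(xtest, removePunctuation)
>xtest <- tm_map(xtest, removeNumbers)
>xtest <- tm_map(xtest, content_transformer(tolower))
>xtest <- tm_map(xtest, removeWords, stopwords("english"))
>xtest <- tm_map(xtest, stemDocument)
>xtest <- as.data.frame(as.matrix(DocumentTermMatrix(xtest,
+    control = list(dictionary = colnames(xtrain)))))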
You need to first install the e1071 package from CRAN. The naiveBayes() function can be used to train the Naïve Bayes model. The function can be called using two methods. The following is the first method:
>naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
Here, formula specifies the class to be predicted as a function of the independent variables, in the following form:
>class ~ x1 + x2 + ...
Also, data stands for either a data frame of categorical and/or numerical variables or a contingency table.
If we have the class labels as a vector y and dependent variables as a data frame x, then we can use the second method of calling the function, as follows:
>naiveBayes(x, y, laplace = 0, ...)
We will use the second method of calling in our example. Once we have a trained model, which is an R object of class naiveBayes, we can predict the classes of new instances as follows:
>predict(object, newdata, type = c("class", "raw"), threshold = 0.001, eps = 0, ...)
So, we can train the Naïve Bayes model on our training dataset and score on the test dataset by using the following commands:
>#Training the Naive Bayes model
>nbmodel <- naiveBayes(xtrain, ytrain, laplace = 3)
>#Prediction using the trained model
>ypred.nb <- predict(nbmodel, xtest, type = "class", threshold = 0.075)
>#Converting classes to 0 and 1 for plotting the ROC curve
>fconvert <- function(x){
   if(x == "spam"){ y <- 1 }
   else { y <- 0 }
   y
 }
>ytest1 <- sapply(ytest, fconvert, simplify = "array")
>ypred1 <- sapply(ypred.nb, fconvert, simplify = "array")
>#the roc() function is from the pROC package
>library(pROC)
>roc(ytest1, ypred1, plot = TRUE)
The ROC curve for this model and dataset, generated using the pROC package from CRAN, is shown in the resulting plot. We can also compute the confusion matrix:
>#Confusion matrix
>confmat <- table(ytest, ypred.nb)
>confmat
       ypred.nb
ytest   ham spam
  ham   143  139
  spam    9   35
From the ROC curve and the confusion matrix, one can choose the best threshold for the classifier and compute precision and recall metrics. Note that the example shown here is for illustration purposes only; the model needs to be tuned further to improve its accuracy.
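For instance, precision and recall for the spam class can be read directly off the confusion matrix; the following small sketch (using the confmat object computed above) shows how:
>#precision and recall for the "spam" class
>precision.spam <- confmat["spam", "spam"] / sum(confmat[, "spam"])
>recall.spam <- confmat["spam", "spam"] / sum(confmat["spam", ])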
We can also print some of the most frequent words (model features) occurring in the two classes, along with the class-conditional probabilities generated by the model. This gives a more intuitive feel for the model. The following R code does this job:
>tab <- nbmodel$tables
>fham <- function(x){
   y <- x[1, 1]
   y
 }
>hamvec <- sapply(tab, fham, simplify = "array")
>hamvec <- sort(hamvec, decreasing = TRUE)
>fspam <- function(x){
   y <- x[2, 1]
   y
 }
>spamvec <- sapply(tab, fspam, simplify = "array")
>spamvec <- sort(spamvec, decreasing = TRUE)
>prb <- cbind(spamvec, hamvec)
>print.table(prb)
The output table is as follows:
word     Prob(word|spam)   Prob(word|ham)
call     0.6994            0.4084
free     0.4294            0.3996
now      0.3865            0.3120
repli    0.2761            0.3094
text     0.2638            0.2840
spam     0.2270            0.2726
txt      0.2270            0.2594
get      0.2209            0.2182
stop     0.2086            0.2025
The table shows, for example, that given a document is spam, the probability of the word call appearing in it is 0.6994, whereas the probability of the same word appearing in a normal document is only 0.4084.
In this article, we learned about Naïve Bayes, a basic and popular classification method based on the Bayesian approach, and implemented it in R.