Text-classifying techniques
Classification takes a specific document and determines which of several predefined groups, or categories, it belongs to. There are two basic techniques for classifying text:
- Rule-based classification
- Supervised machine learning
Rule-based classification applies expert-crafted rules built from combinations of words and other attributes. Such rules can be very effective, but creating them is a time-consuming process.
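To make the idea of hand-written rules concrete, here is a minimal sketch in Python. The categories, the keyword patterns, and the name `classify_by_rules` are all hypothetical and chosen only for illustration; a real rule set would be written by a domain expert and would usually draw on more than keywords.

```python
import re

# Hypothetical hand-written rules: each category maps to keyword patterns.
# A real rule set would be far larger and crafted by a domain expert.
RULES = {
    "sports": [r"\bgoal\b", r"\bmatch\b", r"\bleague\b"],
    "finance": [r"\bstock\b", r"\bmarket\b", r"\binterest rate\b"],
}


def classify_by_rules(text, rules=RULES, default="unknown"):
    """Return the category whose patterns match the text most often."""
    lowered = text.lower()
    # Score each category by the total number of pattern matches.
    scores = {
        label: sum(len(re.findall(pattern, lowered)) for pattern in patterns)
        for label, patterns in rules.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no rule fires at all.
    return best if scores[best] > 0 else default


print(classify_by_rules("The stock market fell today"))
```

Even in this toy form, the cost noted above is visible: every new category or change in vocabulary means someone has to write and maintain more patterns by hand.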
Supervised machine learning (SML) uses a collection of annotated training documents to create a model, which is normally called the classifier. There are many different machine learning techniques, including Naive Bayes, support vector machines (SVMs), and k-nearest neighbors.
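Purely as an illustration of the train-then-classify workflow, the following sketch builds a tiny multinomial Naive Bayes classifier with add-one smoothing using only the Python standard library. The training documents, the labels, and the function names (`tokenize`, `train`, `classify`) are assumptions made for this example; in practice one would rely on an established library and a much larger annotated corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated training documents as (text, label) pairs.
TRAINING_DOCS = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(labeled_docs):
    """Count documents per label and words per label; this is the model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Start from the log prior of the label.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DOCS)
print(classify("buy cheap money", model))
```

The annotated documents play the role that the expert's rules play in the rule-based approach: the effort shifts from writing rules to labeling examples.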
The inner workings of these approaches are beyond our scope here, but interested readers will find many sources that cover these and other techniques in depth.