Building a text classifier
The goal of text classification is to assign text documents to predefined classes. This is a widely used analysis technique in NLP. We will use a technique based on a statistic called tf-idf, which stands for term frequency-inverse document frequency. This statistic measures how important a word is to a document within a collection of documents: it grows with how often the word appears in that document and shrinks with how many documents contain the word. The tf-idf values for a document form the feature vector that we use to categorize it. You can learn more about it at http://www.tfidf.com.
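To make the statistic concrete, here is a minimal sketch that computes tf-idf by hand, assuming the textbook form tf * log(N / df). The toy corpus and the tf_idf helper are made up for illustration; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration; each document is a list of tokens.
docs = [
    "the rocket reached orbit".split(),
    "the pitcher threw the ball".split(),
    "the rocket engine failed".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency is the term's share of the document's tokens.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

A word such as "the", which appears in every document, gets a weight of zero, while a word that appears in only one document gets the highest weight in that document. This is what makes the resulting vector useful for telling documents apart.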
How to do it…
Create a new Python file and import the following function:
from sklearn.datasets import fetch_20newsgroups
Let's select a list of categories and give them readable names using a dictionary mapping. The keys are category names from the 20 newsgroups dataset, which the function we just imported fetches, and the values are the labels we want to display:
category_map = {
    'misc.forsale': 'Sales',
    'rec.motorcycles': 'Motorcycles',
    'rec.sport.baseball': 'Baseball',
    'sci.crypt': 'Cryptography',
    'sci.space': 'Space',
}
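As a quick illustration of why the mapping is handy, the following sketch translates a numeric class label into a readable name. The target_names list and the readable_label helper are hypothetical stand-ins: we assume the loaded dataset exposes its category names as a list indexed by label, and we fake that list here so the snippet runs without downloading anything.

```python
category_map = {
    'misc.forsale': 'Sales',
    'rec.motorcycles': 'Motorcycles',
    'rec.sport.baseball': 'Baseball',
    'sci.crypt': 'Cryptography',
    'sci.space': 'Space',
}

# Hypothetical stand-in for the list of category names that a loaded
# dataset would provide, where position i holds the name of label i.
target_names = sorted(category_map)

def readable_label(label_index):
    """Map a numeric class label to its display name (illustrative helper)."""
    return category_map[target_names[label_index]]

for index in range(len(target_names)):
    print(index, readable_label(index))
```

Keeping the display names in one dictionary means the rest of the code can work with the dataset's own category names and only convert to friendly labels when printing results.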
Load the...