Text processing and tf-idf weighting are examples of feature extraction techniques designed to both reduce the dimensionality of, and extract some structure from, raw text data. We can see the impact of applying these processing techniques by comparing the performance of a model trained on raw text data with one trained on processed and tf-idf weighted text data.
Evaluating the impact of text processing
Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
In this example, we apply the hashing term frequency transformation directly to the raw text tokens obtained by splitting the document text on whitespace, with no further processing and no tf-idf weighting. We then train a model on this data and evaluate its performance on the test set, as we did for the model trained with...
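To make the raw-feature baseline concrete, the following is a minimal, library-free sketch of whitespace tokenization followed by a hashing term frequency transformation. It is an illustration only, not the implementation used in this chapter: the function names, the choice of CRC32 as the hash function, the sample sentence, and the vector size of 1024 are all assumptions made for the example.

```python
import zlib
from collections import Counter


def whitespace_tokenize(text):
    # Raw tokens: split on runs of whitespace only.
    # No lowercasing, punctuation stripping, or stop word removal.
    return text.split()


def hashing_tf(tokens, num_features=1024):
    # Hashing trick: each token is mapped to an index in [0, num_features)
    # by a hash function, and the term frequency is accumulated at that index.
    # CRC32 is an assumption here, chosen because it is stable across runs.
    counts = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    # Sparse representation: {feature index: term frequency}
    return dict(counts)


# Hypothetical sample document
doc = "The quick brown fox jumps over the lazy dog The end"
raw_tokens = whitespace_tokenize(doc)
tf_vector = hashing_tf(raw_tokens, num_features=1024)
```

Because the tokens are not normalized, "The" and "the" are hashed as different terms. This is the kind of redundancy that the text processing steps remove, which is why the comparison with the processed, tf-idf weighted features is informative.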