Document classification
This chapter looks at text classification using Keras. The dataset we will use is included in the Keras library. As in previous chapters, we will first use traditional machine learning techniques to create a benchmark before applying a deep learning algorithm. This lets us see how deep learning models perform against other techniques on the same data.
The Reuters dataset
We will use the Reuters dataset, which can be accessed through a function in the Keras library. This dataset has 11,228 records with 46 categories. To see more information about this dataset, run the following code:
library(keras)
?dataset_reuters
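The figures quoted above can be checked by loading the data directly. The following is a minimal sketch, assuming the keras R package is installed with a working backend and that the data can be downloaded on first use. The variable names (reuters, n_train, and so on) are illustrative and are not taken from the chapter's scripts.

```r
library(keras)

# Download (on first use) and load the Reuters data with default arguments.
# The result is a list with train and test elements, each holding x and y.
reuters <- dataset_reuters()

# Count the records across the train and test splits.
n_train <- length(reuters$train$x)
n_test  <- length(reuters$test$x)
n_total <- n_train + n_test

# Count the distinct category labels across both splits.
n_classes <- length(unique(c(reuters$train$y, reuters$test$y)))

cat("Records:", n_total, "\n")
cat("Categories:", n_classes, "\n")

# Peek at the start of the first training record.
# Each record is a vector of integer word indices rather than text.
print(head(reuters$train$x[[1]], 10))
```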
Although the Reuters dataset can be accessed from Keras, it is not in a format that can be used by other machine learning algorithms. Instead of the actual words, the text data is a list of word indices. We will write a short script (Chapter7/create_reuters_data.R) that downloads the data and the lookup index file and creates a data frame of the...