Working with bigger data – online algorithms and out-of-core learning
If you executed the code examples in the previous section, you may have noticed that constructing the feature vectors for the dataset of 50,000 movie reviews during grid search can be computationally quite expensive. In many real-world applications, it is common to work with even larger datasets that exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset.
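To make the idea concrete, here is a minimal sketch of two helper functions that read reviews from disk one at a time and assemble them into mini-batches. The file name movie_data.csv and its two-column text/label layout are assumptions made purely for illustration:

import csv

def stream_docs(path):
    """Yield one (text, label) pair at a time from a CSV file on disk."""
    with open(path, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for text, label in reader:  # assumes two columns: review text, label
            yield text, int(label)

def get_minibatch(doc_stream, size):
    """Collect up to `size` documents from the stream into one mini-batch."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass  # end of file reached; return whatever was collected
    return docs, y

Because stream_docs is a generator, only the current mini-batch ever resides in memory, no matter how large the file on disk grows.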
Back in Chapter 2, Training Simple Machine Learning Algorithms for Classification, we introduced the concept of stochastic gradient descent, which is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will make use of the partial_fit method of scikit-learn's SGDClassifier to stream the documents directly from our local drive and train a logistic regression model on small mini-batches of documents.
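Putting the pieces together, a minimal out-of-core training loop could look like the following sketch. Note that CountVectorizer and TfidfVectorizer must hold the full vocabulary or document-frequency statistics in memory, so the sketch uses the stateless HashingVectorizer instead. The batch size of 1,000, the 45,000/5,000 train/test split, and loss='log_loss' (named 'log' in older scikit-learn releases) are choices made for illustration:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: hashes tokens into a fixed-size feature space,
# so no vocabulary needs to be kept in memory or fitted beforehand
vect = HashingVectorizer(decode_error='ignore', n_features=2**21)

# loss='log_loss' makes SGDClassifier behave like logistic regression
clf = SGDClassifier(loss='log_loss', random_state=1)

doc_stream = stream_docs(path='movie_data.csv')  # hypothetical file from above
classes = [0, 1]  # all class labels must be declared up front for partial_fit

for _ in range(45):  # 45 mini-batches of 1,000 documents = 45,000 reviews
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)

# hold out the remaining 5,000 documents for evaluation
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Since HashingVectorizer maps tokens into a fixed 2**21-dimensional space via a hash function, no fitting step is required; occasional hash collisions are the price paid for constant memory usage.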