Data is the most critical part of every machine learning pipeline: the model learns from it, and its quantity and quality largely determine how well any machine learning application performs.
Feeding data to a Keras model has so far seemed straightforward: we fetch the dataset as NumPy arrays, create the batches, and feed them to the model to train it using mini-batch gradient descent.
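To make the comparison concrete, here is a minimal sketch of that in-memory approach; the choice of dataset (fashion_mnist) and the tiny model are illustrative assumptions, not prescribed by the text. The whole dataset sits in memory as NumPy arrays, and model.fit slices it into mini-batches.

```python
import tensorflow as tf

# Fetch the dataset as NumPy arrays: everything is loaded into memory at once.
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = (x_train / 255.0).astype("float32")

# A small illustrative classifier (an assumption, not the book's model).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# model.fit slices the in-memory arrays into mini-batches of 32 elements
# and runs mini-batch gradient descent over them.
model.fit(x_train, y_train, batch_size=32, epochs=1)
```

This works well for small datasets that fit comfortably in memory, which is exactly the assumption that breaks down in the cases discussed next.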
However, this way of feeding the input is, in fact, highly inefficient and error-prone, for the following reasons:
- The complete dataset can weigh several thousand gigabytes: no standard computer, or even a deep learning workstation, has enough memory to load such huge datasets into memory.
- Manually creating the input batches means managing the slicing indexes ourselves, which is an easy place for errors to creep in.
- Doing data augmentation, applying random perturbations to each input...