In data science and machine learning, operations such as filtering data, shuffling samples, and stacking samples into batches are extremely common. The tf.data API offers simple solutions to most of these operations (refer to the documentation at https://www.tensorflow.org/api_docs/python/tf/data/Dataset). For example, some of the most frequently used dataset methods are as follows (a short usage sketch combining them appears after this list):
- .batch(batch_size, ...), which returns a new dataset with the data samples batched accordingly (tf.data.experimental.unbatch() does the opposite). Note that if .map() is called after .batch(), the mapping function will receive batched data as input.
- .repeat(count=None), which repeats the data count times (infinitely if count = None).
- .shuffle(buffer_size, seed, ...), which shuffles elements after filling a buffer accordingly (for instance, if buffer_size = 10, the method will virtually divide the dataset into subsets of 10 elements and randomly permute the elements within each of them).
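
As a minimal sketch of how these methods chain together (assuming TensorFlow 2.x with eager execution, and using a toy integer dataset purely for illustration):

```python
import tensorflow as tf

# Toy dataset of the integers 0..9 (illustrative only):
dataset = tf.data.Dataset.range(10)

# Shuffle with a buffer covering the whole dataset, repeat it twice,
# then stack the samples into batches of 4:
dataset = dataset.shuffle(buffer_size=10, seed=42)
dataset = dataset.repeat(count=2)
dataset = dataset.batch(batch_size=4)

# Iterating yields batched tensors, e.g. [3 7 0 9], [5 1 8 2], ...
for batch in dataset:
    print(batch.numpy())
```

Note that the order of the calls matters: because .batch() is applied last here, any .map() appended afterward would operate on whole batches rather than on individual samples.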