So far, we have applied our denoising autoencoder to the MNIST dataset, which is a fairly simple dataset. Let's now look at a more complicated dataset that better represents the challenges of denoising real-world documents.
The dataset that we will be using is provided for free by the University of California, Irvine (UCI). For more information on the dataset, you can visit UCI's website at https://archive.ics.uci.edu/ml/datasets/NoisyOffice.
The dataset can be found in the accompanying GitHub repository for this book. For more information on downloading the code and dataset for this chapter from the GitHub repository, please refer to the Technical requirements section earlier in the chapter.
The dataset consists of 216 different noisy images. The noisy images are scanned office documents that are tainted by coffee stains...