You should start by downloading the training data from the following links:
- http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz
- http://yaroslavvb.com/upload/notMNIST/notMNIST_large.tar.gz
Later we will download these archives programmatically, but we should start with a manual download just to peek at the data and the structure of the archive. This will be important when we write the pipeline, because we need to understand that structure before we can manipulate the data.
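To give a rough idea of what the programmatic route might look like, here is a minimal sketch using only the Python standard library. The helper names `download` and `summarize_archive` are hypothetical and introduced only for illustration; this is not the pipeline built later, just one way to fetch an archive and count its files per folder without extracting anything.

```python
import collections
import posixpath
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the list above.
SMALL_URL = "http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz"


def download(url, dest_dir="."):
    """Download url into dest_dir, skipping the transfer if the file exists.

    Hypothetical helper: the local filename is simply the last URL segment.
    """
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def summarize_archive(archive_path):
    """Return {folder: number_of_regular_files} for a tar archive.

    The archive is only read, never extracted, so this is a cheap way
    to peek at its layout. Compression is auto-detected via "r:*".
    """
    counts = collections.Counter()
    with tarfile.open(str(archive_path), "r:*") as tar:
        for member in tar:
            if member.isfile():
                # Member names inside a tar archive use forward slashes.
                counts[posixpath.dirname(member.name)] += 1
    return dict(counts)
```

Calling `summarize_archive(download(SMALL_URL))` would then download the small set into the current directory (if it is not already there) and return the per-folder file counts.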
The small set is ideal for peeking. You can do this via the following command line, or just use a browser to download the file and an unarchiver to extract it (I suggest getting familiar with the command line, as all of this will need to be automated):
cd ~/workdir
wget http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz
tar xvf notMNIST_small.tar.gz
The preceding command line will...