Getting the complete set of oven-ready fastai datasets
In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without worrying about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets), and you could conveniently examine elements of the dataset.
In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.
How to do it…
In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
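The exact cells are in the notebook; a minimal equivalent sketch of those first cells looks like this (your notebook's setup cell may differ slightly):

```python
# load the fastai vision stack, which also brings in untar_data, URLs, and Path
from fastai.vision.all import *

# download (if not already cached) and unpack the curated MNIST dataset,
# returning the local path of the unpacked data
path = untar_data(URLs.MNIST)
path
```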
- Consider the argument to untar_data: URLs.MNIST. What is this? Let's try the ?? shortcut to examine the source code for a URLs object:
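In a notebook cell, that looks like this:

```python
# IPython's ?? shortcut displays the source code of the URLs class in the notebook
??URLs
```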
- By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:

```python
MNIST = f'{S3_IMAGE}mnist_png.tgz'
```
- Working backward through the source code for the URLs class, we can get the whole URL for MNIST:

```python
S3_IMAGE = f'{S3}imageclas/'
S3       = 'https://s3.amazonaws.com/fast-ai-'
```
- Putting it all together, we get the URL for URLs.MNIST: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
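You can verify this composition for yourself in a cell; this is just the same f-string assembly written out explicitly:

```python
# reproduce the URL assembly from the URLs class definitions
S3 = 'https://s3.amazonaws.com/fast-ai-'
S3_IMAGE = f'{S3}imageclas/'
MNIST = f'{S3_IMAGE}mnist_png.tgz'
print(MNIST)   # https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
```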
- You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
```
mnist_png
├── testing
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── training
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9
```
- In the untarred directory structure, each of the testing and training directories contains subdirectories for each digit. These digit directories contain image files for that digit. This means that the label of the dataset – the value that we want the model to predict – is encoded in the directory that the image file resides in.
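For example, here is a small sketch of how the label can be read straight from a file's parent directory (the file path shown is hypothetical):

```python
from pathlib import Path

# hypothetical path to one image inside the untarred dataset
img_path = Path('mnist_png/training/3/12345.png')

# the label is simply the name of the folder the image lives in
label = img_path.parent.name   # '3'
```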
- Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
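A minimal sketch of that call (the cache location shown in the output will depend on your environment):

```python
path = untar_data(URLs.MNIST)   # returns the Path of the unpacked dataset
path.ls()                       # list its top-level contents
# e.g. (#2) [Path('.../mnist_png/testing'), Path('.../mnist_png/training')]
```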
- This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
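For example (again, the path prefix in the output will vary):

```python
(path/'training').ls()
# e.g. (#10) [Path('.../training/0'), Path('.../training/1'), ...]
```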
- Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
- First, let's look at the other datasets listed in the output of ??URLs by group, starting with the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
```python
ADULT_SAMPLE        = f'{URL}adult_sample.tgz'
BIWI_SAMPLE         = f'{URL}biwi_sample.tgz'
CIFAR               = f'{URL}cifar10.tgz'
COCO_SAMPLE         = f'{S3_COCO}coco_sample.tgz'
COCO_TINY           = f'{S3_COCO}coco_tiny.tgz'
HUMAN_NUMBERS       = f'{URL}human_numbers.tgz'
IMDB                = f'{S3_NLP}imdb.tgz'
IMDB_SAMPLE         = f'{URL}imdb_sample.tgz'
ML_SAMPLE           = f'{URL}movie_lens_sample.tgz'
ML_100k             = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
MNIST_SAMPLE        = f'{URL}mnist_sample.tgz'
MNIST_TINY          = f'{URL}mnist_tiny.tgz'
MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
PLANET_SAMPLE       = f'{URL}planet_sample.tgz'
PLANET_TINY         = f'{URL}planet_tiny.tgz'
IMAGENETTE          = f'{S3_IMAGE}imagenette2.tgz'
IMAGENETTE_160      = f'{S3_IMAGE}imagenette2-160.tgz'
IMAGENETTE_320      = f'{S3_IMAGE}imagenette2-320.tgz'
IMAGEWOOF           = f'{S3_IMAGE}imagewoof2.tgz'
IMAGEWOOF_160       = f'{S3_IMAGE}imagewoof2-160.tgz'
IMAGEWOOF_320       = f'{S3_IMAGE}imagewoof2-320.tgz'
IMAGEWANG           = f'{S3_IMAGE}imagewang.tgz'
IMAGEWANG_160       = f'{S3_IMAGE}imagewang-160.tgz'
IMAGEWANG_320       = f'{S3_IMAGE}imagewang-320.tgz'
```
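Any of these constants can be passed to untar_data in exactly the same way as URLs.MNIST; here is a quick sketch for one of the sample datasets (what the listing shows will depend on the dataset you pick):

```python
path = untar_data(URLs.IMDB_SAMPLE)   # download and unpack a small curated text dataset
path.ls()                             # inspect what the sample actually contains
```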
- Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes some that aren't directly associated with any of the four main application areas supported by fastai; the audio datasets, for example, fall outside those areas:
```python
# image classification datasets
CALTECH_101  = f'{S3_IMAGE}caltech_101.tgz'
CARS         = f'{S3_IMAGE}stanford-cars.tgz'
CIFAR_100    = f'{S3_IMAGE}cifar100.tgz'
CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
FLOWERS      = f'{S3_IMAGE}oxford-102-flowers.tgz'
FOOD         = f'{S3_IMAGE}food-101.tgz'
MNIST        = f'{S3_IMAGE}mnist_png.tgz'
PETS         = f'{S3_IMAGE}oxford-iiit-pet.tgz'

# NLP datasets
AG_NEWS                 = f'{S3_NLP}ag_news_csv.tgz'
AMAZON_REVIEWS          = f'{S3_NLP}amazon_review_full_csv.tgz'
AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
DBPEDIA                 = f'{S3_NLP}dbpedia_csv.tgz'
MT_ENG_FRA              = f'{S3_NLP}giga-fren.tgz'
SOGOU_NEWS              = f'{S3_NLP}sogou_news_csv.tgz'
WIKITEXT                = f'{S3_NLP}wikitext-103.tgz'
WIKITEXT_TINY           = f'{S3_NLP}wikitext-2.tgz'
YAHOO_ANSWERS           = f'{S3_NLP}yahoo_answers_csv.tgz'
YELP_REVIEWS            = f'{S3_NLP}yelp_review_full_csv.tgz'
YELP_REVIEWS_POLARITY   = f'{S3_NLP}yelp_review_polarity_csv.tgz'

# Image localization datasets
BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
CAMVID         = f'{S3_IMAGELOC}camvid.tgz'
CAMVID_TINY    = f'{URL}camvid_tiny.tgz'
LSUN_BEDROOMS  = f'{S3_IMAGE}bedroom.tgz'
PASCAL_2007    = f'{S3_IMAGELOC}pascal_2007.tgz'
PASCAL_2012    = f'{S3_IMAGELOC}pascal_2012.tgz'

# Audio classification datasets
MACAQUES    = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'

# Medical Imaging datasets
SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
```
- Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

  a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

  b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:
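A sketch of that call (repeating what we did earlier; the output paths will reflect your own data directory):

```python
path = untar_data(URLs.MNIST)
(path/'training').ls()   # the ten digit folders under 'training'
```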
  c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.

  d) For example, let's say untar_data is run with URLs.PETS as the argument:

```python
path = untar_data(URLs.PETS)
```
  e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:

```
oxford-iiit-pet
├── annotations
│   ├── trimaps
│   └── xmls
└── images
```
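You can confirm the same structure from within the notebook (the path prefix in the output depends on where your datasets are stored):

```python
path = untar_data(URLs.PETS)
path.ls()   # e.g. (#2) [Path('.../annotations'), Path('.../images')]
```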
- If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
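In a cell, that is simply:

```python
# run in a notebook cell: displays the source of the ls() method added to Path
??Path.ls
```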
- To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:
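For example, in a notebook where the fastai imports for this chapter have already been run (so that doc() is in scope):

```python
# prints the signature plus links to the source and documentation for Path.ls
doc(Path.ls)
```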
You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
How it works…
As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs
class. When you call untar_data
with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data
in a Gradient instance). The object you get back from untar_data
allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.
There's more…
You might be asking yourself why we went to the trouble of examining the source code for the URLs
class to get details about the curated datasets. After all, these datasets are documented in https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data
calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai: inconsistent documentation. Sometimes the documentation is complete, but sometimes it lacks details, so you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and stick to the documentation for the current version of fastai: https://docs.fast.ai/.