Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Deep Learning with fastai Cookbook
Deep Learning with fastai Cookbook

Deep Learning with fastai Cookbook: Leverage the easy-to-use fastai framework to unlock the power of deep learning

eBook
€29.99
Paperback
€36.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

Deep Learning with fastai Cookbook

Chapter 2: Exploring and Cleaning Up Data with fastai

In the previous chapter, we got started with the fastai framework by setting up its coding environment, working through a concrete application example (MNIST), and investigating two frameworks with different relationships to fastai: PyTorch and Keras. In this chapter, we are going to dive deeper into an important aspect of fastai: ingesting, exploring, and cleaning up data. In particular, we are going to explore a selection of the datasets that are curated by fastai.

By the end of this chapter, you will be able to describe the complete set of curated datasets that fastai supports, use the facilities of fastai to examine these datasets, and clean up a dataset to eliminate missing and non-numeric values.

Here are the recipes that will be covered in this chapter:

  • Getting the complete set of oven-ready fastai datasets
  • Examining tabular datasets with fastai
  • Examining text datasets with fastai
  • Examining image datasets with fastai
  • Cleaning up raw datasets with fastai

Technical requirements

Ensure that you have completed the setup sections in Chapter 1, Getting Started with fastai, and that you have a working Gradient instance or Colab setup. Ensure that you have cloned the repository for this book (https://github.com/PacktPublishing/Deep-Learning-with-fastai-Cookbook) and have access to the ch2 folder. This folder contains the code samples that will be described in this chapter.

Getting the complete set of oven-ready fastai datasets

In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.

In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.

How to do it…

In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
    Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset

    Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset

  2. Consider the argument to untar_data: URLs.MINST. What is this? Let's try the ??  shortcut to examine the source code for a URLs object:
    Figure 2.2 – Source for URLs

    Figure 2.2 – Source for URLs

  3. By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:
    MNIST           = f'{S3_IMAGE}mnist_png.tgz'
  4. Working backward through the source code for the URLs class, we can get the whole URL for MNIST:
    S3_IMAGE     = f'{S3}imageclas/'
    S3  = 'https://s3.amazonaws.com/fast-ai-'
  5. Putting it all together, we get the URL for URLs.MNIST:
    https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
  6. You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
    mnist_png
    ├── testing
    │   ├── 0
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── 5
    │   ├── 6
    │   ├── 7
    │   ├── 8
    │   └── 9
    └── training
         ├── 0
         ├── 1
         ├── 2
         ├── 3
         ├── 4
         ├── 5
         ├── 6
         ├── 7
         ├── 8
         └── 9
  7. In the untarred directory structure, each of the testing and training directories contain subdirectories for each digit. These digit directories contain image files for that digit. This means that the label of the dataset – the value that we want the model to predict – is encoded in the directory that the image file resides in.
  8. Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
    Figure 2.3 – Using path.ls() to get the dataset's directory structure

    Figure 2.3 – Using path.ls() to get the dataset's directory structure

  9. This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
    Figure 2.4 – The structure of the training subdirectory

    Figure 2.4 – The structure of the training subdirectory

  10. Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
  11. First, let's look at the other datasets listed in the output of ??URLs by group. First, let's look at the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
         ADULT_SAMPLE           = f'{URL}adult_sample.tgz'
         BIWI_SAMPLE            = f'{URL}biwi_sample.tgz'
         CIFAR                     = f'{URL}cifar10.tgz'
         COCO_SAMPLE            = f'{S3_COCO}coco_sample.tgz'
         COCO_TINY               = f'{S3_COCO}coco_tiny.tgz'
         HUMAN_NUMBERS         = f'{URL}human_numbers.tgz'
         IMDB                       = f'{S3_NLP}imdb.tgz'
         IMDB_SAMPLE            = f'{URL}imdb_sample.tgz'
         ML_SAMPLE               = f'{URL}movie_lens_sample.tgz'
         ML_100k                  = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
         MNIST_SAMPLE           = f'{URL}mnist_sample.tgz'
         MNIST_TINY              = f'{URL}mnist_tiny.tgz'
         MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
         PLANET_SAMPLE         = f'{URL}planet_sample.tgz'
         PLANET_TINY            = f'{URL}planet_tiny.tgz'
         IMAGENETTE              = f'{S3_IMAGE}imagenette2.tgz'
         IMAGENETTE_160        = f'{S3_IMAGE}imagenette2-160.tgz'
         IMAGENETTE_320        = f'{S3_IMAGE}imagenette2-320.tgz'
         IMAGEWOOF               = f'{S3_IMAGE}imagewoof2.tgz'
         IMAGEWOOF_160         = f'{S3_IMAGE}imagewoof2-160.tgz'
         IMAGEWOOF_320         = f'{S3_IMAGE}imagewoof2-320.tgz'
         IMAGEWANG               = f'{S3_IMAGE}imagewang.tgz'
         IMAGEWANG_160         = f'{S3_IMAGE}imagewang-160.tgz'
         IMAGEWANG_320         = f'{S3_IMAGE}imagewang-320.tgz'
  12. Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes datasets that aren't directly associated with any of the four main application areas supported by fastai. The audio datasets, for example, apply to a use case outside the four main application areas:
         # image classification datasets
         CALTECH_101  = f'{S3_IMAGE}caltech_101.tgz'
         CARS            = f'{S3_IMAGE}stanford-cars.tgz'
         CIFAR_100     = f'{S3_IMAGE}cifar100.tgz'
         CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
         FLOWERS        = f'{S3_IMAGE}oxford-102-flowers.tgz'
         FOOD            = f'{S3_IMAGE}food-101.tgz'
         MNIST           = f'{S3_IMAGE}mnist_png.tgz'
         PETS            = f'{S3_IMAGE}oxford-iiit-pet.tgz'
         # NLP datasets
         AG_NEWS                        = f'{S3_NLP}ag_news_csv.tgz'
         AMAZON_REVIEWS              = f'{S3_NLP}amazon_review_full_csv.tgz'
         AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
         DBPEDIA                        = f'{S3_NLP}dbpedia_csv.tgz'
         MT_ENG_FRA                    = f'{S3_NLP}giga-fren.tgz'
         SOGOU_NEWS                    = f'{S3_NLP}sogou_news_csv.tgz'
         WIKITEXT                       = f'{S3_NLP}wikitext-103.tgz'
         WIKITEXT_TINY               = f'{S3_NLP}wikitext-2.tgz'
         YAHOO_ANSWERS               = f'{S3_NLP}yahoo_answers_csv.tgz'
         YELP_REVIEWS                 = f'{S3_NLP}yelp_review_full_csv.tgz'
         YELP_REVIEWS_POLARITY   = f'{S3_NLP}yelp_review_polarity_csv.tgz'
         # Image localization datasets
         BIWI_HEAD_POSE      = f"{S3_IMAGELOC}biwi_head_pose.tgz"
         CAMVID                  = f'{S3_IMAGELOC}camvid.tgz'
         CAMVID_TINY           = f'{URL}camvid_tiny.tgz'
         LSUN_BEDROOMS        = f'{S3_IMAGE}bedroom.tgz'
         PASCAL_2007           = f'{S3_IMAGELOC}pascal_2007.tgz'
         PASCAL_2012           = f'{S3_IMAGELOC}pascal_2012.tgz'
         # Audio classification datasets
         MACAQUES               = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
         ZEBRA_FINCH           = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'
         # Medical Imaging datasets
         SIIM_SMALL            = f'{S3_IMAGELOC}siim_small.tgz'
  13. Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

    a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

    b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:

    Figure 2.5 – Structure of the training subdirectory

    Figure 2.5 – Structure of the training subdirectory

    c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.

    d) For example, let's say untar_data is run with URLs.PETS as the argument:

    path = untar_data(URLs.PETS)

    e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:

    oxford-iiit-pet
    ├── annotations
    │   ├── trimaps
    │   └── xmls
    └── images
  14. If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
    Figure 2.6 – Source for Path.ls()

    Figure 2.6 – Source for Path.ls()

  15. To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:
Figure 2.7 – Output of doc(Path.ls)

Figure 2.7 – Output of doc(Path.ls)

You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.

How it works…

As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data in a Gradient instance). The object you get back from untar_data allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.

There's more…

You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented in https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai – inconsistent documentation. Sometimes, the documentation is complete, but sometimes, it is lacking details, so you will need to look at the source code directly to figure out what's going on, like we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for some details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and keep to the documentation for the current version of fastai: https://docs.fast.ai/.

Examining tabular datasets with fastai

In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.

Dataset citation

Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).

How to do it…

In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.ADULT_SAMPLE)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.8 – Output of path.ls()

    Figure 2.8 – Output of path.ls()

  4. The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
    df = pd.read_csv(path/'adult.csv')
  5. Run the head() command to get a sample of records from the beginning of the dataset:
    Figure 2.9 – Sample of records from the beginning of the dataset

    Figure 2.9 – Sample of records from the beginning of the dataset

  6. Run the following command to get the number of records (rows) and fields (columns) in the dataset:
    df.shape
  7. Run the following command to get the number of unique values in each column of the dataset. Can you tell from the output which columns are categorical?
    df.nunique()
  8. Run the following command to get the count of missing values in each column of the dataset. Which columns have missing values?
    df.isnull().sum()
  9. Run the following command to display some sample records from the subset of the dataset for people whose age is less than or equal to 40:
    df_young = df[df.age <= 40]
    df_young.head()

Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.

How it works…

The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:

Figure 2.10 – Main datasets from the source for URLs

Figure 2.10 – Main datasets from the source for URLs

How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I just had to ingest it and try it out to confirm it could be ingested into a pandas DataFrame.

There's more…

What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:

  • Tabular:

    a) ADULT_SAMPLE

  • NLP (text):

    a) HUMAN_NUMBERS  

    b) IMDB

    c) IMDB_SAMPLE      

  • Collaborative filtering:

    a) ML_SAMPLE               

    b) ML_100k                              

  • Image data:  

    a) All of the other datasets listed in URLs under main datasets.

Examining text datasets with fastai

In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.

Dataset citation

Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).

How to do it…

In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.WIKITEXT_TINY)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.11 – Output of path.ls()

    Figure 2.11 – Output of path.ls()

  4. There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
    df_train = pd.read_csv(path/'train.csv')
  5. When you use head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header, so the first row gets misinterpreted as a header. As shown in the following screenshot, the first row of output is in bold, which indicates that the first row is being interpreted as a header, even though it contains a regular data row:
    Figure 2.12 – First record in df_train

    Figure 2.12 – First record in df_train

  6. To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
    df_train = pd.read_csv(path/'train.csv',header=None)
  7. Check head() again to confirm that the problem has been resolved:
    Figure 2.13 – Revising the first record in df_train

    Figure 2.13 – Revising the first record in df_train

  8. Ingest test.csv into a DataFrame using the header=None parameter:
    df_test = pd.read_csv(path/'test.csv',header=None)
  9. We want to tokenize the dataset and transform it into a list of words. Since we want a common set of tokens for the entire dataset, we will begin by combining the test and train DataFrames:
    df_combined = pd.concat([df_train,df_test])
  10. Confirm the shape of the train, test, and combined dataframes – the number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
    print("df_train: ",df_train.shape)
    print("df_test: ",df_test.shape)
    print("df_combined: ",df_combined.shape)
  11. Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize using its position rather than its name:
    df_tok, count = tokenize_df(df_combined,[df_combined.columns[0]])
  12. Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
    Figure 2.14 – The first few records of df_tok

    Figure 2.14 – The first few records of df_tok

  13. Check the count for a few sample words to ensure they are roughly what you expected. Pick a very common word, a moderately common word, and a rare word:
    print("very common word (count['the']):", count['the'])
    print("moderately common word (count['prepared']):", count['prepared'])
    print("rare word (count['gaga']):", count['gaga'])

Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.

How it works…

The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs

Examining image datasets with fastai

In the past two sections, we examined tabular and text datasets and got a taste of the facilities that fastai provides for accessing and exploring these datasets. In this section, we are going to look at image data. We are going to look at two datasets: the FLOWERS image classification dataset and the BIWI_HEAD_POSE image localization dataset.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_image_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the FLOWERS dataset featured in this section.

Dataset citation

Maria-Elena Nilsback, Andrew Zisserman. (2008). Automated flower classification over a large number of classes (https://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf).

I am grateful for the opportunity to use the BIWI_HEAD_POSE dataset featured in this section.

Dataset citation

Gabriele Fanelli, Thibaut Weise, Juergen Gall, Luc Van Gool. (2011). Real Time Head Pose Estimation from Consumer Depth Cameras (https://link.springer.com/chapter/10.1007/978-3-642-23123-0_11). Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg https://doi.org/10.1007/978-3-642-23123-0_11.

How to do it…

In this section, you will be running through the examining_image_datasets.ipynb notebook to examine the FLOWERS and BIWI_HEAD_POSE datasets.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the FLOWERS dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.FLOWERS)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.16 – Output of path.ls()

    Figure 2.16 – Output of path.ls()

  4. Look at the contents of the valid.txt file. This indicates that train.txt, valid.txt, and test.txt contain lists of the image files that belong to each of these datasets:
    Figure 2.17 – The first few records of valid.txt

    Figure 2.17 – The first few records of valid.txt

  5. Examine the jgp subdirectory:
    (path/'jpg').ls()
  6. Take a look at one of the image files. Note that the get_image_files() function doesn't need to be pointed to a particular subdirectory – it recursively collects all the image files in a directory and its subdirectories:
    img_files = get_image_files(path)
    img = PILImage.create(img_files[100])
    img
  7. You should have noticed that the image displayed in the previous step was the native size of the image, which makes it rather big for the notebook. To get the image at a more appropriate size, apply the to_thumb function with the image dimension specified as an argument. Note that you might see a different image when you run this cell:
    Figure 2.18 – Applying to_thumb to an image

    Figure 2.18 – Applying to_thumb to an image

  8. Now, ingest the BIWI_HEAD_POSE dataset:
    path = untar_data(URLs.BIWI_HEAD_POSE)
  9. Examine the path for this dataset:
    path.ls()
  10. Examine the 05 subdirectory:
    (path/"05").ls()
  11. Examine one of the images. Note that you may see a different image:
    Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset

    Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset

  12. In addition to the image files, this dataset also includes text files that encode the pose depicted in the image. Ingest one of these text files into a pandas DataFrame and display it:
Figure 2.20 – The first few records of one of the position text files

Figure 2.20 – The first few records of one of the position text files

In this section, you learned how to ingest two different kinds of image datasets, explore their directory structure, and examine images from the datasets.

How it works…

You used the same untar_data() function to ingest the curated tabular, text, and image datasets, and the same ls() function to examine the directory structures for all the different kinds of datasets. On top of these common facilities, fastai provides additional convenience functions for examining image data: get_image_files() to collect all the image files in a directory tree starting at a given directory, and to_thumb() to render the image at a size that is suitable for a notebook.

There's more…

In addition to image classification datasets (where the goal of the trained model is to predict the category of what's displayed in the image) and image localization datasets (where the goal is to predict the location in the image of a given feature), the fastai curated datasets also include image segmentation datasets where the goal is to identify the subsets of an image that contain a particular object, including the CAMVID and CAMVID_TINY datasets.

Cleaning up raw datasets with fastai

Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.

How to do it…

In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Recall the Examining tabular datasets with fastai section of this chapter. When you checked to see which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have missing values. We are going to identify the columns in ADULT_SAMPLE that have missing values, and use the facilities of fastai to apply transformations to the dataset that deal with the missing values in those columns, and then replace those categorical values with numeric identifiers.
  3. First, let's ingest the ADULT_SAMPLE curated dataset again:
    path = untar_data(URLs.ADULT_SAMPLE)
  4. Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
    df = pd.read_csv(path/'adult.csv')
    df.isnull().sum()
  5. To deal with these missing values (and prepare categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:

    a) procs is the list of transformations that will be applied to TabularPandas. Here, we will specify that we want missing values to be filled (FillMissing) and that we will replace values in categorical columns with numeric identifiers (Categorify).

    b) dep_var specifies which column is the dependent variable; that is, the target that we want to ultimately predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.

    c) cont and cat are lists of the columns in the dataset. They are continuous and categorical, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical values contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() (https://docs.fast.ai/tabular.core.html#cont_cat_split) function to automatically identify the continuous and categorical columns:

    procs = [FillMissing,Categorify]
    dep_var = 'salary'
    cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
  6. Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
    df_no_missing = TabularPandas(df, procs, cat, cont, y_names = dep_var)
  7. Apply the show API to df_no_missing to display samples of its contents. Note that the values in the categorical columns are maintained when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry – we'll see that result in the next step:
    Figure 2.21 – The first few records of df_no_missing

    Figure 2.21 – The first few records of df_no_missing

  8. Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns:
    Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns

    Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns

  9. Finally, let's confirm that the missing values were handled correctly. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:
Figure 2.23 – Missing values in df_no_missing

Figure 2.23 – Missing values in df_no_missing

By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.

How it works…

In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class provides a lot of value by making it easy to execute common steps to prepare a tabular dataset (including replacing missing values and dealing with categorical columns). The cont_cat_split() function automatically identifies continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easy and less error prone than it would be if you had to hand code all the functions required to accomplish these dataset cleanup steps.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Discover how to apply state-of-the-art deep learning techniques to real-world problems
  • Build and train neural networks using the power and flexibility of the fastai framework
  • Use deep learning to tackle problems such as image classification and text classification

Description

fastai is an easy-to-use deep learning framework built on top of PyTorch that lets you rapidly create complete deep learning solutions with as few as 10 lines of code. Both predominant low-level deep learning frameworks, TensorFlow and PyTorch, require a lot of code, even for straightforward applications. In contrast, fastai handles the messy details for you and lets you focus on applying deep learning to actually solve problems. The book begins by summarizing the value of fastai and showing you how to create a simple 'hello world' deep learning application with fastai. You'll then learn how to use fastai for all four application areas that the framework explicitly supports: tabular data, text data (NLP), recommender systems, and vision data. As you advance, you'll work through a series of practical examples that illustrate how to create real-world applications of each type. Next, you'll learn how to deploy fastai models, including creating a simple web application that predicts what object is depicted in an image. The book wraps up with an overview of the advanced features of fastai. By the end of this fastai book, you'll be able to create your own deep learning applications using fastai. You'll also have learned how to use fastai to prepare raw datasets, explore datasets, train deep learning models, and deploy trained models.

Who is this book for?

This book is for data scientists, machine learning developers, and deep learning enthusiasts looking to explore the fastai framework using a recipe-based approach. Working knowledge of the Python programming language and machine learning basics is strongly recommended to get the most out of this deep learning book.

What you will learn

  • Prepare real-world raw datasets to train fastai deep learning models
  • Train fastai deep learning models using text and tabular data
  • Create recommender systems with fastai
  • Find out how to assess whether fastai is a good fit for a given problem
  • Deploy fastai deep learning models in web applications
  • Train fastai deep learning models for image classification
Estimated delivery fee Deliver to Hungary

Premium delivery 7 - 10 business days

€25.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Sep 24, 2021
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781800208100
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Estimated delivery fee Deliver to Hungary

Premium delivery 7 - 10 business days

€25.95
(Includes tracking information)

Product Details

Publication date : Sep 24, 2021
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781800208100
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 123.97
Mastering PyTorch
€44.99
Machine Learning for Time-Series with Python
€41.99
Deep Learning with fastai Cookbook
€36.99
Total 123.97 Stars icon

Table of Contents

9 Chapters
Chapter 1: Getting Started with fastai Chevron down icon Chevron up icon
Chapter 2: Exploring and Cleaning Up Data with fastai Chevron down icon Chevron up icon
Chapter 3: Training Models with Tabular Data Chevron down icon Chevron up icon
Chapter 4: Training Models with Text Data Chevron down icon Chevron up icon
Chapter 5: Training Recommender Systems Chevron down icon Chevron up icon
Chapter 6: Training Models with Visual Data Chevron down icon Chevron up icon
Chapter 7: Deployment and Model Maintenance Chevron down icon Chevron up icon
Chapter 8: Extended fastai and Deployment Features Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(15 Ratings)
5 star 60%
4 star 33.3%
3 star 6.7%
2 star 0%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by




Richa Sethi Oct 29, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
As the name of the book describes, this book is actually a cookbook where it starts from the very basic ingredients (setting up the environment) needed for fastai, and eventually builds up the pace by getting deep into data cleaning and exploration. It then delves into training tabular data, text data and recommender systems, and finally gets into model deployment and maintenance. Great for beginners as well as individuals like me who have used fastai for certain applications before. This book should allow anyone apply fastai to create end-to-end deep learning models to make predictions on a wide variety of datasets.
Amazon Verified review Amazon
Daniel Armstrong Oct 30, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is filled with great examples of how you can apply fast.ai to a wide range of deep learning projects. I may be a little biased because I am a huge fan of fast.ai, but I really enjoyed this book. I am glad to have it as part of my library. It not only covers the major deep learning topics like CV, NLP, and Tabular data, but is also covers other topics like using callback, memory management, and model deployment. The only part that I was a little disappointed about was the section on object detection, the author did a great job covering the subject, but the fast.ai library doesn't cover an easy way to do object detection out of the box. The good new is their is a library called Ice Vision that is built and maintained by a group of former Fast.ai student/alumni, which is built on top of fast.ai that does a great job making object detection possible and as easy as most things you can do in fast.ai.
Amazon Verified review Amazon
Guangping zhang Sep 24, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Finally I read a book for FastAI, "Deep Learning with fastai Cookbook".Fist this book discusses how to setting up a fastai environment in Google Colab, then introduce theapplications: tables, text, recommender systems, and images.After discussing data cleaning, which is a necessary step using fastai, this book discuss the four typesof application of Fastai chapter by chapter, all the application share same training and predict api, fit_one_cycleand predict, for transfer learning, you can use fine_tune() replace fit_one_cycle.Several steps for model web deployment and advanced deployment, including export(), load_learner() and web_flask_deploy and/or web_flask_deploy_image_mode()were discussed using two chapter in this book.This book uses clean logic and easy understanding language to introduce fastai to readers, I suggest you read this book if you plan to use fastai.
Amazon Verified review Amazon
Erfan Chowdhury Nov 30, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is perfect for people who wants to utilize the highly optimized Fastai library. Jeremy Howard , the creator of Fastai, has a free of charge online series where he demonstrates and teaches how to use the library. He even published a book "Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD" to follow along with the course. However , many topics were just brushed over in these videos and the book and the task was left for the user to go and tinker around.This is time consuming unfortunately. But here is the good part! This is the book that basically spoon feeds the Fastai library to you. This book contains in-depth explanations to topics that were not deeply covered in the series or the book. All the chapters in the book comes with their own datasets to test out what you have learnt. Starting from cleaning the data to solving all different kinds of structured and unstructured problems that you might face in the field of deep learning, this book teaches you how to effectively use the Fastai toolkit. This book also demonstrates how you can deploy the models that you trained. The "Test Your Knowledge" section at the end of every chapter is like a brain teaser that strengthens your conceptualization of that particular chapter.The book is well written and well organized. I thoroughly enjoyed the book. Highly Recommended!
Amazon Verified review Amazon
Carlo Nov 29, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book provides an excellent starting point to anyone interested in building deep learning applications.It effectively shows how to use fastai, a powerful and intuitive deep learning framework.The book provides detailed instructions and tutorials to quickly dive into deep learning examples. A very nice aspect is the step-by-step guidance in setting up the right working environment to run the examples.It is a perfect book for practitioners and machine learning beginners that want to get started as quickly as possible. However, it doesn't provide much theoretical background on machine learning and how models work, so should definitely be accompanied by other resources for practitioners with no background on machine learning who want to understand the inner workings of ML models.Overall it's a useful book especially recommended to anyone who wants to use the fast.ai framework to build powerful deep learning models quickly.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela