Data Augmentation Made Easy
Data augmentation is essential for developing a successful deep learning (DL) project. However, data scientists and developers often overlook this crucial step. It is no secret that you will spend the majority of your project time gathering, cleaning, and augmenting the dataset in a real-world DL project. Thus, learning how to expand the dataset without purchasing new data is essential. This book covers standard and advanced techniques for extending image, text, audio, and tabular datasets. Furthermore, you will learn about data biases and learn how to code on Jupyter Python Notebooks.
Chapter 1 will introduce various data augmentation concepts, set up the coding environment, and create the foundation class. Later chapters will explain various techniques in detail, including Python coding. The effective use of data augmentation has proven to be the deciding factor between success and failure in machine learning (ML). Many real-world ML projects stay in the conceptual phase because of insufficient data for training the ML model. Data augmentation is a cost-effective technique that can increase the size of the dataset, lower the training error rate, and produce a more accurate prediction and forecast.
Fun fact
The car gasoline analogy is helpful for students who first learn about data augmentation and artificial intelligence (AI). You can think of data for the AI engine as the gasoline and data augmentation as the additive, such as the Chevron Techron fuel cleaner, that makes your car engine run faster, smoother, and further without extra petrol.
In this chapter, we’ll define the data augmentation role and the limitations of extending data without changing its integrity. We’ll briefly discuss the different types of input data, such as image, text, audio, and tabular data, and the challenges in supplementing it. Finally, we’ll set up the system requirements and the programming style in the accompanying Python notebook.
I designed this book to be a hands-on journey. It will be most effective to read a chapter, run the code, re-read the part of the chapter that confused you, and jump back to hacking the code until you firmly understand the concept or technique that was presented.
You are encouraged to change or add new code to the Python notebook. The primary purpose of this book is interactive learning. So, if something goes wrong, download a fresh copy from the book's GitHub. The surest method to learn is to make mistakes and create something new.
Data augmentation is an iterative process. There is no fixed recipe. In other words, depending on the dataset, you select augmented functions and jiggle the parameters. A subject domain expert may provide insight into how much distortion is acceptable. By the end of this chapter, you will know the general rules for data augmentation, what type of input data can be augmented, the programming style, and how to set up a Python Notebook online or offline.
In particular, this chapter covers the following primary topics:
- Data augmentation role
- Data input types
- Python Notebook
- Programming styles
Let’s start with the data augmentation role.