Overview of deep learning
ML is the ingredient that makes our tiny devices capable of making intelligent decisions. These software algorithms rely heavily on good data to learn patterns or actions from experience. As we commonly say, data is everything for ML because it is what makes or breaks an application.
This book will refer to deep learning (DL) as a specific class of ML that can perform complex prediction tasks directly on raw images, text, or sound. These algorithms have state-of-the-art accuracy and can be better and faster than humans in solving some data analysis problems.
A complete discussion of DL architectures and algorithms is beyond the scope of this book. However, this section will summarize some essential points relevant to understanding the following chapters.
Deep neural networks
A deep neural network consists of several stacked layers aimed at learning patterns.
Each layer contains several neurons, the fundamental computing elements of artificial neural networks (ANNs), inspired by the human brain.
A neuron produces a single output through a linear transformation, defined as the weighted sum of the inputs plus a constant value called bias, as shown in the following diagram:
Figure 1.3: A neuron representation
The coefficients of this weighted sum are called weights.
Weights and bias are obtained through an iterative training process that makes the neuron capable of learning complex patterns. However, with only linear transformations, neurons can solve nothing more than simple linear problems. Therefore, non-linear functions, called activations, generally follow the neuron’s output to help the network learn complex patterns.
An example of a widely adopted activation function is the rectified linear unit (ReLU), which returns the maximum value between the input value and 0:
float relu(float input) {
  // Return max(input, 0)
  return input > 0.0f ? input : 0.0f;
}
Its computational simplicity makes it preferable to other non-linear functions, such as the hyperbolic tangent or the logistic sigmoid, which require more computational resources.
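Putting the pieces together, the following is a minimal C sketch of a single neuron: the weighted sum of the inputs plus the bias, followed by the ReLU activation defined above. The function name and signature are illustrative and not taken from any specific library:

#include <stddef.h>

// Illustrative sketch of a single neuron's computation
float neuron_forward(const float *inputs, const float *weights,
                     size_t num_inputs, float bias) {
  // Linear transformation: weighted sum of the inputs plus the bias
  float output = bias;
  for (size_t i = 0; i < num_inputs; ++i) {
    output += weights[i] * inputs[i];
  }
  // Non-linear activation (ReLU)
  return relu(output);
}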
In the following subsection, we will see how the neurons are connected to solve complex visual recognition tasks.
Convolutional neural networks
Convolutional neural networks (CNNs) are specialized deep neural networks predominantly applied to visual recognition tasks.
We can consider CNNs as a regularized evolution of classic neural networks built entirely from dense layers, also known as fully connected layers.
As we can see in the following diagram, a characteristic of fully connected networks is connecting every neuron to all the output neurons of the previous layer:
Figure 1.5: A fully connected network
Unfortunately, this method of connecting neurons does not work well for training a model for image classification.
For instance, if we considered an RGB image of size 320x240 (width x height), we would need 230,400 (320*240*3) weights for just one neuron. Since our models will undoubtedly need several layers of neurons to solve complex problems, the model will likely overfit, given the unmanageable number of trainable parameters. Overfitting means that the model learns to predict the training data well but struggles to generalize to data not used during the training process (unseen data).
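As a quick sanity check, the following snippet reproduces this arithmetic for a single dense layer fed with that image; the layer size of 64 neurons is a hypothetical value chosen only for illustration:

#include <stdio.h>

int main(void) {
  const int width = 320, height = 240, channels = 3;
  const int inputs = width * height * channels;     // 230,400 inputs
  const int neurons = 64;                           // hypothetical layer size

  // Each neuron needs one weight per input, plus one bias per neuron
  const int params = neurons * inputs + neurons;
  printf("Weights per neuron: %d\n", inputs);       // 230,400
  printf("Parameters in the layer: %d\n", params);  // 14,745,664
  return 0;
}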
In the past, data scientists adopted manual feature engineering techniques to extract a reduced set of good features from images. However, the approach suffered from being difficult, time-consuming, and domain-specific.
With the rise of CNNs, visual recognition tasks saw improvement thanks to convolution layers, which make feature extraction part of the learning problem.
Based on the assumption that we are dealing with images and inspired by biological processes in the animal visual cortex, the convolution layer borrows the widely adopted convolution operator from image processing to create a set of learnable features.
The convolution operator is performed similarly to other image processing routines: a window (the filter, or kernel) slides over the entire input image and, at each position, we compute the dot product between its weights and the underlying pixels, as shown in Figure 1.6:
Figure 1.6: Convolution operator
This approach brings two significant benefits:
- It extracts the relevant features automatically without human intervention.
- It reduces the number of input signals per neuron considerably.
For instance, applying a 3x3 filter on the preceding RGB image would only require 27 weights (3*3*3).
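To make the operation concrete, here is a minimal C sketch that computes a single output pixel of a 3x3 convolution on an interleaved RGB image; the 27 multiply-accumulate operations correspond to the 27 weights mentioned above. Function and parameter names are illustrative, and border handling and padding are left out for brevity:

// Compute one output pixel of a 3x3 convolution on an interleaved RGB image
// (illustrative sketch; no padding, image borders not handled)
float conv3x3_rgb_at(const float *image, int width,
                     const float *kernel,  // 27 weights: 3x3x3
                     int out_x, int out_y) {
  float acc = 0.0f;
  for (int ky = 0; ky < 3; ++ky) {
    for (int kx = 0; kx < 3; ++kx) {
      for (int c = 0; c < 3; ++c) {
        // Dot product between the kernel weights and the underlying pixels
        int px = out_x + kx;
        int py = out_y + ky;
        acc += kernel[(ky * 3 + kx) * 3 + c] *
               image[(py * width + px) * 3 + c];
      }
    }
  }
  return acc;
}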
Like fully connected layers, convolution layers need several kernels to learn as many features as possible. Therefore, the convolution layer’s output generally produces a set of images (feature maps), commonly kept in a multidimensional memory object called a tensor, as shown in the following illustration:
Figure 1.7: Representation of a 3D tensor
Traditional CNNs for visual recognition tasks usually include the fully connected layers at the network’s end to carry out the prediction stage. Since the output of the convolution layers is a set of images, we generally adopt subsampling strategies to reduce the information propagated through the network and the risk of overfitting when feeding the fully connected layers.
Typically, there are two ways to perform subsampling:
- Skipping the convolution operator for some input pixels (that is, striding the filter). As a result, the output of the convolution layer will have smaller spatial dimensions than the input.
- Adopting subsampling functions such as pooling layers.
The following figure shows a generic CNN architecture, where the pooling layer reduces the spatial dimensionality, and the fully connected layer performs the classification stage:
Figure 1.8: Traditional CNN with a pooling layer to reduce the spatial dimensionality
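As a reference for the pooling layer in Figure 1.8, the following is a minimal C sketch of 2x2 max pooling, a commonly used pooling function that halves the spatial dimensions of a single-channel feature map. The function name is illustrative, and the sketch assumes even input dimensions:

// 2x2 max pooling with stride 2 on a single-channel feature map
// (illustrative sketch; assumes even input width and height)
void max_pool_2x2(const float *input, int in_w, int in_h, float *output) {
  for (int y = 0; y < in_h / 2; ++y) {
    for (int x = 0; x < in_w / 2; ++x) {
      // Take the maximum value of each 2x2 block
      float m = input[(2 * y) * in_w + (2 * x)];
      for (int dy = 0; dy < 2; ++dy) {
        for (int dx = 0; dx < 2; ++dx) {
          float v = input[(2 * y + dy) * in_w + (2 * x + dx)];
          if (v > m) m = v;
        }
      }
      output[y * (in_w / 2) + x] = m;
    }
  }
}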
When developing DL networks for tinyML, one of the most crucial factors is the model’s size, defined as the number of trainable weights. Due to the limited physical memory of our platforms, the model needs to be compact to fit the target device. However, memory constraints are not the only challenge we may face. For instance, while trained models typically use floating-point arithmetic, the CPUs on our platforms may lack hardware acceleration for it.
Thus, to overcome these limitations, quantization becomes an indispensable technique.
Model quantization
Quantization is the process of performing neural network computations in lower bit precision. The technique widely adopted for microcontrollers applies quantization after training (post-training quantization) and converts the 32-bit floating-point weights to 8-bit integer values. This technique brings a 4x reduction in model size and a significant latency improvement with little or no accuracy drop.
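To illustrate the idea, the following C sketch maps a floating-point weight to an 8-bit integer using the common affine (scale and zero-point) scheme, and back again. The helper names and the scheme shown here are illustrative and not taken from any specific framework:

#include <math.h>
#include <stdint.h>

// Map a 32-bit float to int8 with an affine scheme (illustrative sketch):
//   quantized = round(value / scale) + zero_point
int8_t quantize_int8(float value, float scale, int zero_point) {
  int q = (int)lroundf(value / scale) + zero_point;
  // Saturate to the int8 range
  if (q > 127) q = 127;
  if (q < -128) q = -128;
  return (int8_t)q;
}

// Recover an approximation of the original value (dequantization)
float dequantize_int8(int8_t q, float scale, int zero_point) {
  return scale * (float)(q - zero_point);
}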
Other techniques like pruning (setting weights to zero) or clustering (grouping weights into clusters) can help reduce the model size. However, in this book, we will limit the scope to the quantization technique because it is sufficient to showcase the model deployment on microcontrollers.
If you are interested in learning more about pruning and clustering, you can refer to the following practical blog post, which shows the benefit of these two techniques on the model size: https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/pruning-clustering-arm-ethos-u-npu.
As we know, ML is the component that brings intelligence to our applications. Nevertheless, to ensure the longevity of battery-powered applications, it is essential to use low-power devices. So far, we have mentioned power and energy in general terms, but let’s see what they mean in practice in the following section.