Delving into deep neural networks
So far, we’ve only scratched the surface of ANNs; there are so many variants that cataloging them all would be beyond the scope of this book. The main problem with ANNs was that, toward the end of the last century, they hit a brick wall and progress slowed to a relatively sluggish pace.
This changed soon after the turn of the century due to three main reasons:
- The amount of information available online in digital form provided fertile ground for data-hungry algorithms such as ANNs. Just consider that up to 90% of the world’s data was created in the past two years, and it keeps growing at an impressive rate of roughly 2.5 quintillion bytes per day.
- The advancement in video game development. Back then, game developers were looking for ways to improve graphics. To do so, they needed a more powerful, dedicated processor, which is how the Graphics Processing Unit (GPU) was born. The GPU is designed as a parallel processor capable of performing mathematical operations not just on one data item but on many at once. Coincidentally, this was precisely the kind of processing needed to train large and complex ANNs. Because of this, AI professionals started using GPUs, which proved to be a match made in heaven, since it sped up the development of more complex models. Later, companies such as Google released their own specialized processors designed specifically for AI applications, such as the Tensor Processing Unit (TPU). During this period, we’ve also seen a democratization of processing power thanks to the widespread availability of cloud computing. The combination of these factors led to the successful creation of more complex ANNs.
- The development and use of new ANN architectures, normally referred to as Deep Learning (DL) approaches. Many of the underlying ideas are not entirely new; some have been around for the past 70 years. However, their success was limited due to the issues mentioned earlier. DL is a class of algorithms that can be used to tackle a myriad of problems in very diverse domains.
In the coming sections, we will explore the various architectures used in DL. The word deep refers to the many hidden layers found in these types of networks, from which they derive their effectiveness in handling complex tasks. However, before delving into these architectures, it is important to note that these networks learn using several different approaches. The four main categories are as follows:
- Supervised learning occurs when the algorithm is provided with examples (normally input-output pairs), and the ANN learns to model that function.
- Semi-supervised learning uses a small set of labeled examples for initial learning. It then categorizes unseen examples and adds those labeled with high confidence to the training set. The process is repeated until all the items are eventually labeled.
- Unsupervised learning doesn’t use any examples but tries to identify underlying patterns in the data to learn possible classifications.
- Self-supervised learning automatically derives labels from the data itself and then uses that information to tackle the whole dataset.
Now that we have had a look at how DL algorithms learn, let’s have a look at the most popular ones used in various applications.
Convolutional neural networks
A convolutional neural network (CNN) is a kind of network inspired by the biology of the visual cortex. If we look at how vision works, we find that various receptors specialize in recognizing different features such as colors, shapes, and so on. CNNs work on similar principles and are therefore very well suited to computer vision applications. The network comprises various layers that perform feature extraction followed by classification, as seen in Figure 3.13. The input image is first divided into small regions that feed into the convolutional layer.
![Figure 3.13: A CNN](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_13.jpg)
Figure 3.13: A CNN
These are used to extract features from the image. Of course, not all the features are equally important, so the Rectified Linear Unit (ReLU) is used as the activation function; it helps keep the gradient from vanishing and allows the network to be trained efficiently. The following step is pooling, which allows the CNN to focus on the most relevant patterns. It acts as a sort of lens that reduces the dimensionality of the features: it tells us that a feature was present, but not exactly where. By doing so, small variations in the input will not affect the final result, while memory usage is reduced and speed is improved (especially for large images). These convolution, activation, and pooling stages are typically repeated several times until, finally, the output is passed through a fully connected (FC) multilayer perceptron at the end. This is where the classification happens. Such networks are normally trained using backpropagation and have been widely successful in applications related to image processing, video recognition, and several Natural Language Processing (NLP) tasks.
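To make the flow of convolution, ReLU, pooling, and the fully connected classifier concrete, here is a minimal sketch. It assumes PyTorch (which this chapter does not cover), and the layer sizes are purely illustrative:

```python
# A minimal CNN sketch (assumes PyTorch): convolution -> ReLU -> pooling blocks
# followed by a fully connected classifier.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # extract low-level features
            nn.ReLU(),                                     # keep only positive activations
            nn.MaxPool2d(2),                               # reduce spatial dimensionality
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # extract higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # FC layer where classification happens

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: classify a batch of eight 28x28 grayscale images
logits = SimpleCNN()(torch.randn(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])
```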
Long short-term memory networks
A long short-term memory (LSTM) network is a special kind of RNN that doesn’t suffer from the vanishing gradient problem mentioned earlier. The basic architecture departs from the traditional neuron-based networks we’ve seen and uses a memory cell concept. The cell uses the memory component (as seen in Figure 3.14) to remember what’s important for a short or longer period. It also contains three gates – input, forget, and output – which are used to control the flow of information within the network:
- The input gate regulates the flow of new information into memory
- The forget gate allows the network to dispose of existing information, thus making space for new data
- The output gate controls the usage of the data stored within each cell
LSTMs are extensively used in NLP applications since they are extremely good at processing sequences like the ones found in natural languages. There are several variants of LSTMs, one of which is the Gated Recurrent Unit (GRU), which simplifies the LSTM model by removing the output gate. This makes the network use less memory and improves its performance; however, it tends to be less accurate than an LSTM when dealing with longer sequences.
![Figure 3.14: An LSTM architecture showing the memory mechanism, which takes information from prior inputs](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_14.jpg)
Figure 3.14: An LSTM architecture showing the memory mechanism, which takes information from prior inputs
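To make the role of the three gates concrete, here is a hand-written LSTM cell sketch. It assumes PyTorch, the sizes are illustrative, and a real application would use an optimized built-in implementation instead:

```python
# A minimal LSTM cell sketch (assumes PyTorch) showing the three gates
# described above acting on the memory (cell) state.
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map produces all four gate pre-activations at once
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h, c):
        z = self.gates(torch.cat([x, h], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i = torch.sigmoid(i)      # input gate: how much new information enters memory
        f = torch.sigmoid(f)      # forget gate: how much existing memory is discarded
        o = torch.sigmoid(o)      # output gate: how much of the memory is exposed
        g = torch.tanh(g)         # candidate values to write into memory
        c = f * c + i * g         # update the memory cell
        h = o * torch.tanh(c)     # produce the new hidden state
        return h, c

cell = LSTMCellSketch(input_size=8, hidden_size=16)
h = c = torch.zeros(1, 16)
for x in torch.randn(5, 1, 8):    # process a sequence of five time steps
    h, c = cell(x, h, c)
```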
Autoencoders
An autoencoder (Figure 3.15) is a network of three layers: input, hidden, and output. However, unlike other networks, it takes its input, encodes it in the hidden layer, and reconstructs it in the output layer.
![Figure 3.15: Autoencoder architecture](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_15.jpg)
Figure 3.15: Autoencoder architecture
The hidden layer typically consists of fewer nodes than the input layer, and an encoding/decoding function is used to process the data. Because of this, the error is not calculated against a separate target, as in most other networks, but as the difference between the input and the reconstructed output. The weights are then adjusted to reduce this error further. Also, since autoencoders constantly encode and decode the same input, there is no need to compare the outputs with any additional labeled data as in traditional techniques, which makes the network self-supervised. Such networks have been successfully used for image compression, denoising, and feature extraction.
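As a concrete illustration, here is a minimal autoencoder sketch, assuming PyTorch and illustrative sizes. Note that the loss compares the reconstruction with the input itself rather than with a separate label:

```python
# A minimal autoencoder sketch (assumes PyTorch): the hidden layer has fewer units
# than the input, and the loss measures how well the input is reconstructed.
import torch
import torch.nn as nn

input_dim, hidden_dim = 784, 32
autoencoder = nn.Sequential(
    nn.Linear(input_dim, hidden_dim), nn.ReLU(),    # encoder: compress the input
    nn.Linear(hidden_dim, input_dim), nn.Sigmoid()  # decoder: reconstruct it
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.rand(64, input_dim)                       # a batch of flattened images
reconstruction = autoencoder(x)
loss = nn.functional.mse_loss(reconstruction, x)    # error is measured against the input, not a label
loss.backward()
optimizer.step()
```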
Deep belief networks
A deep belief network (DBN) is a typical multilayer network with an input layer, several hidden layers, and an output layer. The main difference between this and typical multilayered architectures lies in how such a network is trained. Rather than performing a forward pass followed by backpropagation, each pair of adjacent layers (RBM1, RBM2, and RBM3 in the figure) is treated as an individual restricted Boltzmann machine (RBM). Because of this, we can describe the DBN as a stack of RBMs, as can be seen in Figure 3.16.
![Figure 3.16: DBN architecture](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_16.jpg)
Figure 3.16: DBN architecture
The input layer gets a feed from the raw sensory input, and as the information passes through the hidden layers, it generates a different level of abstraction. The output layer then simply takes care of the final classification. The training is also divided into two phases: the first is the unsupervised pretraining, and the second is the supervised fine-tuning.
In the first phase, the second layer in every RBM is trained to reconstruct the first layer. So, the first hidden layer has to reconstruct the input layer. The second hidden layer must reconstruct the first hidden layer, and the output layer is trained to reconstruct the second hidden layer. Once all the layers have been pretrained, the second phase begins, whereby the output nodes are linked to labels to give them meaning. Once this is completed and all the weights are set, backpropagation (or any other training function) is used to finalize the training phase. Such networks have been used in sentiment analysis applications and personalization.
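As a rough illustration of the two phases, here is a highly simplified sketch, assuming PyTorch. The pretraining below uses a single step of contrastive divergence per update and omits biases and sampling for brevity; all sizes are illustrative:

```python
# A simplified sketch of greedy, layer-by-layer RBM pretraining (phase 1),
# after which the stacked weights would be fine-tuned with backpropagation (phase 2).
import torch

def pretrain_rbm(data, n_hidden, epochs=5, lr=0.01):
    """Unsupervised pretraining of one RBM layer with a single contrastive divergence step."""
    n_visible = data.shape[1]
    W = torch.randn(n_visible, n_hidden) * 0.01
    for _ in range(epochs):
        v0 = data
        h0 = torch.sigmoid(v0 @ W)                      # hidden activations from the data
        v1 = torch.sigmoid(h0 @ W.T)                    # reconstruct the visible layer
        h1 = torch.sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)   # contrastive divergence weight update
    return W, torch.sigmoid(data @ W)                   # weights and the layer's representation

# Phase 1: each layer learns to reconstruct the layer below it
x = torch.rand(256, 784)                                # stand-in for raw sensory input
W1, h1 = pretrain_rbm(x, 256)
W2, h2 = pretrain_rbm(h1, 64)

# Phase 2: supervised fine-tuning (e.g., backpropagation) would start from W1 and W2
```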
Generative networks
As the name implies, generative networks use DL methods such as CNNs (discussed earlier) to generate new information. The task comprises automatically discovering and learning patterns within the input data, which are then used to generate new, plausible examples based on the original dataset.
The most popular ones are called generative adversarial networks (GANs), and they use a supervised learning approach. Rather than using one model, they have two: a generator and a discriminator. The role of the generator is to create new example data points based on the training dataset. These are then fed to the discriminator, which decides whether they exhibit the characteristics of the data found in the training set. If they don’t, they are simply discarded, and a new example is created. This process continues until the discriminator can no longer reliably distinguish generated data from real data (it is fooled roughly 50% of the time), thus indicating that the generator has learned to produce examples similar to the training data. Such an approach has been successfully used to generate new texts, paraphrase, and create artistic masterpieces.
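A compact training-loop sketch of the generator/discriminator interplay follows. It assumes PyTorch, and the networks, data, and hyperparameters are purely illustrative:

```python
# A minimal GAN training-loop sketch (assumes PyTorch): the generator maps random noise
# to fake samples, the discriminator learns to tell real from fake, and the generator
# is trained to fool the discriminator.
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, data_dim)                  # stand-in for a real training batch
    fake = G(torch.randn(32, noise_dim))

    # Train the discriminator: real samples -> 1, generated samples -> 0
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make the discriminator output 1 for generated samples
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```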
Another important model gaining traction in recent years is the diffusion model. This model starts with a training set and gradually destroys it by injecting Gaussian noise, which is statistical noise whose probability density follows a normal distribution. The learner is then trained to recover the data by removing the noise and restoring the correct values. After training, the diffusion model can generate a new image starting from pure noise. This model has been extensively used in the creation of photo-realistic computer-generated images, as shown in Figure 3.17. It also has the benefit of not requiring any adversarial training (as in GANs) while being scalable and parallelizable.
![Figure 3.17: Images generated using the Imagen diffusion model (source: imagen.research.google)](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_17.jpg)
Figure 3.17: Images generated using the Imagen diffusion model (source: imagen.research.google)
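As a toy illustration of the forward (noising) process and the denoising objective, here is a sketch assuming PyTorch; the noise schedule, data, and denoiser are illustrative and far simpler than those used in real image models:

```python
# A toy diffusion sketch (assumes PyTorch): the forward process gradually corrupts data
# with Gaussian noise, and a network is trained to predict that noise so it can be
# removed step by step at generation time.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)             # noise schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)      # cumulative signal retention per time step

denoiser = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x0 = torch.randn(32, 64)                          # stand-in for clean training data
t = torch.randint(0, T, (32,))                    # a random noise level for each sample
noise = torch.randn_like(x0)
# Forward process: mix the clean data with Gaussian noise according to the schedule
xt = alphas_bar[t].sqrt().unsqueeze(1) * x0 + (1 - alphas_bar[t]).sqrt().unsqueeze(1) * noise

# The learner is trained to recover the injected noise (and hence the clean data)
pred = denoiser(torch.cat([xt, t.unsqueeze(1).float() / T], dim=1))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
optimizer.step()
```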
Transformers
A transformer is a machine learning model designed to process sequential data. However, it differs from traditional networks as it does not analyze the data sequentially but utilizes an attention mechanism that gives a higher weight to objects (such as words or images) based on their context.
Self-attention is a vital component of transformer architectures since it identifies the most essential sections of a sequence. To do so, it does the following (a minimal code sketch follows the list):
- Encodes every object into a set of numbers (normally represented as a vector).
- For every object, it calculates the product of its vector with the vectors of all the other objects in the sequence.
- The results are normalized, used to weight the original vectors, and added together to produce the object’s new representation.
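Here is the minimal numeric sketch referred to above, assuming PyTorch. Real transformers use separate learned query, key, and value projections rather than the raw vectors:

```python
# A bare-bones self-attention sketch (assumes PyTorch).
import torch

vectors = torch.randn(5, 8)                          # 5 objects (e.g., words), each encoded as an 8-dim vector
scores = vectors @ vectors.T                         # product of every vector with every other vector
weights = torch.softmax(scores / 8 ** 0.5, dim=-1)   # normalize the results into attention weights
output = weights @ vectors                           # weighted sum: related objects contribute more
print(output.shape)                                  # torch.Size([5, 8])
```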
The effect of this process is that objects that have a semantic relationship are given more importance than others. As can be seen in Figure 3.18, the architecture uses a sequence-to-sequence model with a separate encoder and decoder.
![Figure 3.18: Typical transformer architecture](https://static.packt-cdn.com/products/9781804617625/graphics/image/B19292_03_18.jpg)
Figure 3.18: Typical transformer architecture
The encoder consists of blocks comprising the self-attention mechanism and a feed-forward network. The decoder has three blocks: the self-attention mechanism, an encoder-decoder attention component, and a feed-forward network. The main difference between transformers and RNNs is that a transformer is not time-dependent; objects are given importance based on the attention mechanism rather than their position in time. Since their inception, transformers have gained massive popularity since their results approach human-level performance in various tasks, such as the following:
- Text classification: Assigning a label or category to a text, such as sentiment analysis, spam detection, and topic identification
- Information extraction: Extracting structured information from unstructured text, such as named entity recognition, relation extraction, and event extraction
- Question answering (QA): Generating a natural language answer to a natural language question, such as reading comprehension, factoid QA, and open-domain QA
- Summarization: Producing a concise summary of a longer text or multiple texts, such as news articles, scientific papers, and reviews
- Translation: Converting a text from one language to another language, such as English to French, Chinese to English, and Hindi to Urdu
- Text generation: Creating new text based on some input or context, such as dialogue generation, story generation, and caption generation
Today, we live in the age of large language models with transformers such as GPT, BERT, T5, and many others integrated into several AI applications.
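As a quick usage sketch, assuming the Hugging Face transformers library (which this chapter does not cover), a pretrained model can be applied to one of the tasks above in a few lines; the printed output shown in the comment is only illustrative:

```python
# Using a pretrained transformer for text classification (assumes the Hugging Face
# transformers library is installed).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a small pretrained model
print(classifier("Transformers have changed natural language processing."))
# Illustrative output: [{'label': 'POSITIVE', 'score': 0.99}]
```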
Comparing the networks
Let’s now look at the pros, cons, and applications of the networks that we’ve discussed so far:
- Artificial Neural Networks: Networks inspired by the structure and function of biological neurons:
- Pros: Can learn complex patterns and relationships, and perform tasks such as classification, regression, and prediction
- Cons: Can suffer from overfitting; may require a large amount of data and computing resources to train effectively
- Applications: Image and speech recognition, natural language processing, and recommender systems
- Recurrent Neural Networks: Networks that can process sequences of inputs by passing information from one time step to the next:
- Pros: Can model temporal dependencies and long-term dependencies; can handle variable-length inputs
- Cons: Can suffer from vanishing and exploding gradients; can be computationally expensive to train
- Applications: Natural language processing, speech recognition, and time series prediction
- Competitive Networks: Networks where neurons compete with each other for activation:
- Pros: Can learn to classify inputs without explicit supervision; can identify prototypes or representatives in the data
- Cons: Limited to unsupervised learning; may require hand-crafted features
- Applications: Clustering and feature extraction
- Hopfield Networks: Networks that can store and retrieve memories as stable patterns of activation:
- Pros: Can store a large number of memories; can tolerate noise and incomplete patterns
- Cons: Limited to associative memory and pattern completion tasks; may converge to incorrect states or spurious memories
- Applications: Pattern completion and optimization problems
- Convolutional Neural Networks: Networks that can learn hierarchical representations of inputs, particularly images:
- Pros: Can learn spatial and translation-invariant features; can perform well with limited data
- Cons: Can be computationally expensive to train; may require a large number of parameters
- Applications: Image and video recognition, object detection, and segmentation
- Long Short-Term Memory Networks: Recurrent neural networks that can better handle long-term dependencies by selectively remembering or forgetting information:
- Pros: Can model long-term dependencies and variable-length sequences; can handle noisy and incomplete inputs
- Cons: Can be computationally expensive to train; may suffer from vanishing gradients
- Applications: Natural language processing, speech recognition, and time series prediction
- Autoencoders: Networks that can learn compressed representations of inputs by encoding and decoding them:
- Pros: Can learn meaningful representations of data; can perform data compression and denoising
- Cons: Limited to unsupervised learning; may suffer from overfitting
- Applications: Data compression, denoising, and feature extraction
- Deep Belief Networks: Networks composed of multiple layers of restricted Boltzmann machines that can learn hierarchical representations of data:
- Pros: Can learn complex and abstract features; can perform unsupervised pretraining followed by supervised fine-tuning
- Cons: Can be computationally expensive to train; limited to classification and generation tasks
- Applications: Image and speech recognition and generative modeling
- Generative Networks: Networks that can generate new samples of data similar to the training data:
- Pros: Can generate realistic and diverse samples; can be conditioned on specific inputs
- Cons: Can be computationally expensive to train; may suffer from mode collapse
- Applications: Image and speech synthesis, data augmentation, and anomaly detection
- AI Transformers: Networks that use self-attention mechanisms to process sequential data:
- Pros: Can process sequential data of varying lengths; can capture long-range dependencies and achieve state-of-the-art performance in language modeling
- Cons: Can be computationally expensive to train; may require large amounts of data for optimal performance
- Applications: NLP, text generation, machine translation, speech recognition, and image captioning
In this section, we explored some of the most popular ANN architectures beyond the basic single-layer and multilayer perceptrons. Each architecture has its strengths and weaknesses, making it suitable for specific tasks. However, as we’ll see in the next chapter, the rise of data has allowed researchers to train increasingly complex models that can handle ever larger and more complex datasets. Big data has been a game-changer for machine learning, enabling researchers to develop new algorithms and techniques to tackle challenging problems. In the following chapter, we will delve deeper into the rise of data and its impact on the development of deep learning.