Deep learning refers to training large neural networks. Let's first discuss some basic use cases of neural networks and why deep learning is creating such a furore, even though these neural networks have been around for decades.
The following are examples of supervised learning applications of neural networks:
| Input (x) | Output (y) | Application domain | Suggested neural network approach |
| --- | --- | --- | --- |
| House features | Price of the house | Real estate | Standard neural network with a rectified linear unit in the output layer |
| Ad and user info | Click on ad? Yes (1) or No (0) | Online advertising | Standard neural network with binary classification |
| Image | Classifying from 100 different objects, that is, (1, 2, ..., 100) | Photo tagging | Convolutional neural network (since the input is an image, that is, spatial data) |
| Audio | Text transcript | Speech recognition | Recurrent neural network (since both input and output are sequential data) |
| English | Chinese | Machine translation | Recurrent neural network (since the input is sequential data) |
| Image, radar information | Position of other cars | Autonomous driving | Customized hybrid/complex neural network |
We will go into the details of the previously mentioned neural networks in the coming sections of this chapter, but first we must understand that the type of neural network used depends on the objective of the problem statement.
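To make the "standard neural network with binary classification" row concrete, here is a minimal sketch of such a model for the online-advertising case. The library choice (Keras), the number of input features, and the layer sizes are illustrative assumptions, not prescriptions from the table:

```python
# A minimal sketch of a standard (fully-connected) neural network for
# binary classification, e.g., predicting ad clicks (Yes = 1 / No = 0).
# Keras, the feature count, and the layer sizes are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 20  # hypothetical number of ad/user input features

model = Sequential([
    Dense(16, activation='relu', input_shape=(n_features,)),  # hidden layer
    Dense(1, activation='sigmoid'),  # sigmoid output gives P(click)
])

# Binary cross-entropy is the standard loss for Yes/No targets
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10) would then train on labeled pairs
```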
Supervised learning is an approach in machine learning where an agent is trained using pairs of input features and their corresponding output/target values (also called labels).
Traditional machine learning algorithms work very well on structured data, where most of the input features are well defined. This is not the case with unstructured data, such as audio, images, and text, where the data consists of signals, pixels, and letters, respectively. It is harder for computers to make sense of unstructured data than structured data. Neural networks' ability to make predictions from this unstructured data is the key reason behind their popularity and the economic value they generate.
First, it is scale, that is, the scale of data, computational power, and new algorithms, that is driving the progress in deep learning. The internet has been around for over four decades, and an enormous digital footprint has been accumulating and growing over that period. In parallel, research and technological development have expanded the storage and processing capacity of computational systems. Owing to these powerful computational systems and massive amounts of data, we are now able to validate the discoveries made in the field of artificial intelligence over the past three decades.
Now, what do we need to implement deep learning?
First, we need a large amount of data.
Second, we need to train a reasonably large neural network.
So, why not train a large neural network on small amounts of data?
Think back to your data structures lessons, where the utility of a structure lies in handling a particular type of value efficiently. For example, you would not store a scalar value in a variable of a tensor data type. Similarly, large neural networks develop distinct representations and learn meaningful patterns only when given a high volume of data, as shown in the following graph:
Please refer to the preceding graphical representation of data volume versus performance for different machine learning algorithms to draw the following inferences:
- The performance of traditional machine learning algorithms plateaus after a point, as they are not able to absorb distinct representations once the data volume crosses a threshold.
- Check the bottom-left part of the graph, near the origin. This is the region where the relative ordering of the algorithms is not well defined. Because the data size is small, the internal representations are not very distinct, and as a result, the performance metrics of all the algorithms coincide. At this level, performance depends largely on better feature engineering. But these hand-engineered features fail as the data size increases. That is where deep neural networks come in, as they are able to capture better representations from large amounts of data.
Therefore, we can conclude that one should not force a deep learning architecture onto every dataset encountered. The volume and variety of the data indicate which algorithm to apply; sometimes small data works better with traditional machine learning algorithms than with deep neural networks.
Deep learning problem statements and algorithms can be further segregated into four different segments based on their area of research and application:
- General deep learning: Densely connected layers or fully connected networks
- Sequence models: Recurrent neural networks, Long Short-Term Memory networks, Gated Recurrent Units, and so on
- Spatial data models (images, for example): Convolutional neural networks, Generative Adversarial Networks
- Others: Unsupervised learning, reinforcement learning, sparse encoding, and so on
Presently, the industry is mostly driven by the first three segments, but the future of Artificial Intelligence rests on advancements in the fourth. Walking down the journey of advancements in machine learning, we can see that until now these learning models produced real-valued outputs, for example, a sentiment score for movie reviews or a class label for image classification. Now, other types of outputs are being generated as well, for example, image captioning (input: image, output: text), machine translation (input: text, output: text), and speech recognition (input: audio, output: text).
Human-level performance is commonly used as a benchmark in deep learning. Human-level accuracy becomes constant after some time, converging to the highest achievable point. This point is called the Optimal Error Rate (also known as the Bayes Error Rate, that is, the lowest possible error rate for any classifier of a random outcome).
The reason behind this is that a lot of problems have a theoretical limit on performance owing to the noise in the data. Therefore, comparing against human-level accuracy is a good approach to improving your models through error analysis. This is done by comparing human-level error, training set error, and validation set error to estimate bias and variance effects, that is, the underfitting and overfitting conditions.
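As a quick illustration of how the three error values are read together, consider the following sketch; the error numbers are hypothetical and the thresholds are rules of thumb, not fixed rules:

```python
# A hypothetical sketch of error analysis using human-level error as a
# proxy for the Bayes error rate. The example numbers are illustrative.
human_error = 0.01       # ~1%: proxy for the optimal (Bayes) error rate
training_error = 0.08    # 8% error on the training set
validation_error = 0.10  # 10% error on the validation set

avoidable_bias = training_error - human_error   # 0.07 -> underfitting signal
variance = validation_error - training_error    # 0.02 -> overfitting signal

if avoidable_bias > variance:
    print("High bias: underfitting; try a bigger model or longer training.")
else:
    print("High variance: overfitting; try more data or regularization.")
```

Here the gap to human-level error (7%) dominates the train/validation gap (2%), so the model is primarily underfitting.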
The scale of data, the type of algorithm, and the performance metrics are a set of considerations that help us benchmark the level of improvement across different machine learning algorithms, thereby governing the crucial decision of whether to invest in deep learning or go with traditional machine learning approaches.
A basic perceptron with some input features (three, here in the following diagram) looks as follows:
The preceding diagram shows the basic shape of a neural network, with input in the first layer and output in the next. Let's try to interpret it a bit. Here:
- X1, X2, and X3 are the input feature variables, that is, the dimension of the input here is 3 (considering there is no bias variable).
- W1, W2, and W3 are the corresponding weights associated with the feature variables. When we talk about training a neural network, we mean learning these weights. Thus, they form the parameters of our small neural network.
- The function in the output layer is an activation function applied over the aggregation of the information received from the previous layer. This function creates a representation state that corresponds to the actual output. The series of processes from the input layer to the output layer resulting in a predicted output is called forward propagation.
- The error between the output of the activation function and the actual output is minimized through multiple iterations.
- Minimization of the error happens only if we change the values of the weights (going from the output layer toward the input layer) in the direction that minimizes our error function. This process is termed backpropagation, as we are moving in the opposite direction (see the sketch after this list).
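The following is a minimal sketch of these steps for the three-input perceptron; the sigmoid activation, squared error loss, toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# A minimal sketch of forward propagation and backpropagation for the
# three-input perceptron described above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2, 0.1])  # input features X1, X2, X3
y = 1.0                         # actual output (target)
w = np.zeros(3)                 # weights W1, W2, W3: the trainable parameters
b = 0.0                         # bias term
lr = 0.1                        # learning rate

for _ in range(1000):
    # Forward propagation: aggregate the inputs, then apply the activation
    z = np.dot(w, x) + b
    y_hat = sigmoid(z)

    # Error between the predicted output and the actual output
    error = y_hat - y

    # Backpropagation: gradient of the squared error w.r.t. each weight,
    # moving from the output layer back toward the inputs
    grad_z = error * y_hat * (1 - y_hat)  # chain rule through the sigmoid
    w -= lr * grad_z * x
    b -= lr * grad_z

print(w, b, sigmoid(np.dot(w, x) + b))  # the prediction approaches the target
```

Each pass through the loop performs one forward propagation and one weight update, so the error shrinks over the iterations exactly as described in the list above.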
Now, keeping these basics in mind, let's demystify neural networks further by looking at logistic regression as a neural network and trying to create a neural network with one hidden layer.