Machine learning, the spark for the third AI boom, is very useful and powerful as a data mining method; however, even with machine learning, the way towards achieving AI still appeared to be closed. Finding features remained a human's role, and this was the big wall preventing machine learning from reaching AI. It looked as though the third AI boom would come to an end as well. Surprisingly, however, the boom never ended; on the contrary, a new wave rose. What triggered this wave was deep learning.
With the advent of deep learning, at least in the fields of image recognition and speech recognition, a machine became able to work out by itself, from the input data, what should be treated as a feature value, rather than being told by a human. A machine that could previously handle a symbol only as notation became able to acquire concepts.
Deep learning first appeared quite a while ago, back in 2006, when Professor Hinton at the University of Toronto in Canada and others published a paper (https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf). In this paper, a method called deep belief nets (DBN) was presented, which is an extension of neural networks, a machine learning method. DBN was tested using the MNIST database, the standard database for comparing the precision and accuracy of image recognition methods. This database contains 70,000 images of handwritten digits from 0 to 9, each 28 x 28 pixels (60,000 for training and 10,000 for testing).
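To get a concrete feel for this data, here is a minimal sketch that loads MNIST and checks the split described above. It assumes the Keras API, which bundles a copy of MNIST, purely for convenience; this is not the setup used in the original paper.

```python
# A minimal sketch, assuming TensorFlow/Keras is installed; used here only
# to illustrate the shape of the MNIST data, not the original 2006 setup.
from tensorflow.keras.datasets import mnist

# Each image is a 28 x 28 array of grayscale pixel values, labelled with
# the digit (0-9) it represents.
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print(train_images.shape)  # (60000, 28, 28) -> 60,000 training images
print(test_images.shape)   # (10000, 28, 28) -> 10,000 test images
print(train_labels[:10])   # first ten labels, each a digit from 0 to 9
```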
They constructed a prediction model from the training data and measured its accuracy by whether the machine could correctly answer which digit from 0 to 9 was written in each test case. Although the paper reported considerably higher precision than conventional methods, it didn't attract much attention at the time, perhaps because it was compared only against other general machine learning methods.
Then, a while later in 2012, the whole AI research world was shocked by one method. At the world competition for image recognition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a method using deep learning called SuperVision (strictly speaking, that's the name of the team), developed by Professor Hinton and others from the University of Toronto, won the competition. It far surpassed the other competitors, with formidable precision. In this competition, the task was for a machine to automatically classify whether an image showed a cat, a dog, a bird, a car, a boat, and so on. Ten million images were provided as training data and 150,000 images were used for the test. In this test, each method competes to return the lowest error rate (that is, the highest accuracy rate).
Let's look at the following table that shows the result of the competition:
You can see that the difference in the error rate between SuperVision and the second-placed ISI is more than 10 percent, while from second place downwards it's a competition within 0.1 percent. Now you can see how greatly SuperVision outshone the others in precision. Moreover, surprisingly, this was the first time SuperVision had entered ILSVRC; in other words, image recognition was not their usual field. Until SuperVision (deep learning) appeared, the standard approach in image recognition was machine learning, and, as mentioned earlier, the feature values needed for machine learning had to be designed by humans. Researchers designed features based on human intuition and experience and fine-tuned parameters over and over again, which, in the end, improved precision by only around 0.1 percent. Before deep learning arrived, the main contest in both research and competition was who could invent the best feature engineering. So researchers must have been astonished when deep learning suddenly showed up out of the blue.
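As an aside on the metric itself, the error rate in the table is simply the fraction of test images that a method labels incorrectly (and accuracy is one minus that). A toy sketch with made-up labels, not actual ILSVRC data:

```python
# A toy sketch of the error rate metric described in the text; the labels
# here are invented for illustration, not actual ILSVRC data.
def error_rate(predictions, true_labels):
    """Fraction of examples where the predicted label is wrong."""
    wrong = sum(1 for p, t in zip(predictions, true_labels) if p != t)
    return wrong / len(true_labels)

predictions = ["cat", "dog", "bird", "car", "boat"]
true_labels = ["cat", "dog", "cat", "car", "boat"]

err = error_rate(predictions, true_labels)
print(f"error rate: {err:.2f}, accuracy: {1 - err:.2f}")  # 0.20 / 0.80
```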
There was one other major event that spread deep learning across the world. It happened in 2012, the same year the world was shocked by SuperVision at ILSVRC, when Google announced that a machine had learned to automatically detect a cat using YouTube videos as training data, with a deep learning algorithm that Google proposed. The details of this algorithm are explained at http://googleblog.blogspot.com/2012/06/using-large-scale-brain-simulations-for.html. The algorithm extracted 10 million images from YouTube videos and used them as input data. Remember that in machine learning a human has to extract feature values from the images and process the data; in deep learning, by contrast, the original images can be used as inputs as they are. This shows that the machine itself comes to find features automatically from the training data. In this research, a machine learned the concept of a cat. (Only the cat story became famous, but the research was also done with images of humans and went just as well; the machine learned what a human is!) The following image, introduced in the research, illustrates what deep learning considers the characteristics of a cat to be, after being trained on still frames from unlabeled YouTube videos:
These two big events showed the world what deep learning could do and triggered the boom that is still accelerating today.
Following the development of the method that can recognize a cat, Google conducted another experiment in which a machine draws a picture using deep learning. This method is called Inceptionism (http://googleresearch.blogspot.ch/2015/06/inceptionism-going-deeper-into-neural.html). As written in the article, in this method the network is asked:
"Whatever you see there, I want more of it!". This creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird. This in turn will make the network recognize the bird even more strongly on the next pass and so forth, until a highly detailed bird appears, seemingly out of nowhere.
While neural networks in machine learning are usually used to detect patterns in order to identify an image, Inceptionism does the opposite. As you can see from the following examples of Inceptionism, the resulting paintings look odd, like scenes from a nightmare:
Or rather, they could look artistic. The tool that lets anyone try Inceptionism is publicly available on GitHub under the name Deep Dream (https://github.com/google/deepdream). Example implementations are available on that page, and you can try them if you can write Python code.
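The official Deep Dream code is built on the Caffe framework and a large pretrained network, but the feedback loop quoted above ("whatever you see there, I want more of it") can be sketched in a self-contained way: repeatedly nudge the input in whatever direction makes the network's current activations stronger. The following toy sketch uses a single random layer as a stand-in for a trained network, purely to show the mechanics; it is not the actual Deep Dream implementation.

```python
# A toy, self-contained sketch of the "whatever you see, I want more of it"
# feedback loop (activation maximization). The real Deep Dream code uses a
# large pretrained convolutional network in Caffe; here a tiny random layer
# stands in for it, purely to show the mechanics.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))      # stand-in for one layer of a trained network
image = rng.random(64)                 # stand-in for the input image (flattened)

def activations(x):
    return np.maximum(0.0, W @ x)      # ReLU activations of the layer

for step in range(100):
    a = activations(image)
    # Gradient of sum(activations) with respect to the input: push the image
    # in the direction that makes whatever the layer already "sees" stronger.
    grad = W.T @ (a > 0).astype(float)
    image += 0.01 * grad

print(activations(image).sum())        # the layer's response keeps growing
```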
Well, nothing seems to stop deep learning from gaining momentum, but there are still questions, such as: what exactly is innovative about deep learning? What special ingredient dramatically increased its precision? Surprisingly, there isn't actually much that is different in the algorithms. As mentioned briefly, deep learning is an application of neural networks, a machine learning algorithm that imitates the structure of the human brain; nevertheless, a couple of devices adopted in it changed everything. The representatives are pretraining and dropout (together with an activation function). These are also keywords for the implementation, so please remember them.
To begin with, what does the deep in deep learning indicate? As you probably know, the human brain is a circuit structure, and that structure is really complicated: an intricate circuit piled up in many layers. On the other hand, when the neural network algorithm first appeared, its structure was quite simple. It was a simplified model of the human brain and the network had only a few layers, so the patterns it could recognize were extremely limited. Naturally, everyone wondered: can't we just stack up networks like the human brain and make the structure more complex? Of course, this approach had already been tried. Unfortunately, the precision turned out to be lower when layers were simply piled up than when they were not, and various issues arose that didn't occur with a simple network. Why was this? Well, in a human brain, a signal runs through different parts of the circuit depending on what you see. Based on the patterns that differ according to which part of the circuit is stimulated, you can distinguish various things.
To reproduce this mechanism, the neural network algorithm represents the links of the network with numerical weights. This is a great way to do it, but a problem soon occurs. If a network is simple, the weights are properly allocated from the training data and the network can recognize and classify patterns well. However, once a network gets complicated, the links become too dense and it is difficult to create meaningful differences between the weights. In short, the network cannot separate the patterns properly. A neural network also builds a proper model by feeding the errors that occur during training back to the whole network. Again, if the network is simple, the feedback can be reflected properly, but if the network has many layers, a problem occurs in which the error fades away before it reaches the whole network; just imagine the error being stretched out and diluted.
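To see what "stretched out and diluted" means in numbers, here is a tiny illustrative sketch: an error signal at the output is multiplied, layer by layer, by a typical weight and by the slope of a saturating (sigmoid) unit, which is a simplified stand-in for how feedback is propagated back through a deep network. All values are made up for illustration.

```python
# A toy illustration of the error signal shrinking as it is fed back through
# many layers (the "diluted" error); all values are illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_layers = 10
error_signal = 1.0          # the error observed at the output layer
pre_activation = 2.0        # a typical pre-activation value in each layer

for layer in range(n_layers, 0, -1):
    weight = rng.normal(0.0, 0.5)                              # a typical connection weight
    local_slope = sigmoid(pre_activation) * (1 - sigmoid(pre_activation))
    error_signal *= weight * local_slope                        # chain rule: error shrinks per layer
    print(f"layer {layer}: error signal magnitude = {abs(error_signal):.2e}")
```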
The hope that things would go well if the network were just built with a complicated structure ended in disappointing failure. The concept of the algorithm itself was splendid, but it couldn't be called a good algorithm by any standard; that was the world's understanding. Deep learning succeeded in making a network multi-layered, that is, making a network "deep", and the key to this success was to make each layer learn in stages. The previous algorithm treated the whole multi-layered network as one gigantic neural network and made it learn as one, which caused the problems mentioned earlier.
Hence, deep learning took the approach of making each layer learn in advance. This is literally known as pretraining. In pretraining, learning starts from the lowest layer and proceeds in order; the data learned by a lower layer is then treated as input data for the next layer. In this way, the machine becomes able to learn low-level features in the lower layers and gradually learn higher-level features step by step. For example, when learning what a cat is, the first layer might capture an outline, the next layer the shape of the eyes and nose, the next layer the overall picture of a face, the next layer the details of the face, and so on. It can be said that humans take the same learning steps, grasping the whole picture first and looking at the detailed features later. Because each layer learns in stages, the feedback for learning errors can also be handled properly within each layer, which leads to an improvement in precision. There are also specific techniques for how each layer learns, but these will be introduced later on.
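A minimal sketch of this layer-by-layer idea follows, assuming plain numpy. Each layer here is trained as a tiny linear autoencoder on its own input, and its output then becomes the input of the next layer; the real DBN pretraining uses restricted Boltzmann machines, so treat this only as an illustration of the "learn in stages, lower layers first" principle.

```python
# A simplified sketch of greedy layer-wise pretraining: each layer learns on
# its own (here as a tiny tied-weight linear autoencoder), and its output
# becomes the input for the next layer. This is a stand-in for the RBMs used
# in a real DBN, meant only to illustrate the staged learning.
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(data, n_hidden, epochs=200, lr=0.01):
    """Train one layer to reconstruct its input, return (weights, hidden codes)."""
    n_visible = data.shape[1]
    W = rng.standard_normal((n_visible, n_hidden)) * 0.1
    for _ in range(epochs):
        hidden = data @ W                  # encode
        recon = hidden @ W.T               # decode with tied weights
        error = recon - data               # reconstruction error for this layer only
        grad = (error.T @ hidden + data.T @ (error @ W)) / len(data)
        W -= lr * grad
    return W, data @ W

# Toy "raw input" data: 200 samples with 20 features.
layer_input = rng.random((200, 20))

layer_sizes = [12, 8, 4]                   # progressively higher-level features
weights = []
for size in layer_sizes:
    W, layer_input = pretrain_layer(layer_input, size)
    weights.append(W)
    print(f"pretrained layer with {size} hidden units; output shape {layer_input.shape}")
```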
We have also mentioned that the network became too dense. The method that prevents this density problem is called dropout. Networks with dropout learn by randomly cutting some of the links between units during training, which physically makes the network sparse. Because the links to cut are chosen at random, a different network is formed at each learning step. At first glance you might doubt that this would work, but it contributes greatly to improving precision and, as a result, increases the robustness of the network. The circuit of the human brain also has parts that react or stay silent depending on what it sees, and dropout seems to imitate this mechanism successfully. By embedding dropout in the algorithm, the network weights could be adjusted properly.
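A minimal sketch of the dropout mechanism itself, assuming plain numpy; the random mask and the rescaling of the surviving units are the essence, and everything else here is a toy setup.

```python
# A toy sketch of dropout: during training, each unit's output is randomly
# zeroed with probability p, so every pass effectively uses a different,
# sparser network. The surviving activations are rescaled ("inverted
# dropout") so the expected output stays the same at test time.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                    # use the full network at test time
    mask = rng.random(activations.shape) >= p # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

hidden = rng.random(10)                        # stand-in for one layer's activations
print(dropout(hidden, p=0.5))                  # roughly half the units are silenced
print(dropout(hidden, training=False))         # unchanged when not training
```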
Deep learning has seen great success in various fields; however, of course, it has a drawback too. As the name "deep learning" suggests, the learning in this method is very deep, meaning that the steps needed to complete the learning take a long time and the amount of calculation involved tends to be enormous. In fact, the previously mentioned Google experiment in recognizing cats took three days to process on 1,000 computers. Conversely, although the idea of deep learning itself could have been conceived with past techniques, it couldn't have been implemented; the method could only appear once machines with large-scale processing capacity and massive amounts of data became readily available.
As we keep saying, deep learning is just the first step towards a machine obtaining human-like knowledge. Nobody knows what kind of innovation will happen in the future, yet we can predict to what extent a computer's performance will improve, and Moore's law is used for this prediction. The performance of the integrated circuits that drive a computer's progress is indicated by the number of transistors they carry, and Moore's law says that this number doubles roughly every one and a half years. In fact, the number of transistors in computer CPUs has been increasing in line with Moore's law. Compared to the world's first microprocessor, the Intel® 4004 processor, which had about 1 x 10³ (one thousand) transistors, the recent 2015 version, the 5th Generation Intel® Core™ Processor, has about 1 x 10⁹ (one billion)! If this pace of improvement continues, the number of transistors will exceed ten billion, which is more than the number of cells in the human cerebrum.
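This doubling is simple exponential arithmetic: after t years the count is the starting count multiplied by 2^(t/1.5). The quick sketch below just replays that calculation with the round numbers quoted above.

```python
# A back-of-the-envelope sketch of the Moore's law arithmetic quoted in the
# text: the transistor count doubles roughly every 1.5 years.
import math

def projected_count(start_count, years, doubling_period=1.5):
    return start_count * 2 ** (years / doubling_period)

def years_to_reach(start_count, target_count, doubling_period=1.5):
    return doubling_period * math.log2(target_count / start_count)

# Starting from the one billion transistors quoted for 2015:
print(f"{projected_count(1e9, 10):.1e}")          # count after another 10 years at this pace
print(f"{years_to_reach(1e9, 1e10):.1f} years")   # time until the count exceeds ten billion
```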
Based on Moore's law, it is said that further in the future, in 2045, we will reach a critical point called the Technological Singularity, beyond which humans will no longer be able to forecast the progress of technology. By that time, a machine is expected to be able to produce self-recursive intelligence. In other words, in about 30 years, AI will be ready. What will the world be like then…
The number of transistors in Intel processors has been increasing steadily, following Moore's law.
The world-famous physicist Professor Stephen Hawking said in an interview with the BBC (http://www.bbc.com/news/technology-30290540):
"The development of full artificial intelligence could spell the end of the human race."
Will deep learning become black magic? Indeed, the progress of technology has sometimes caused tragedy. Achieving AI is still far in the future, yet we should be careful when working on deep learning.