Since the breakthrough in neural network in 2012, when a deep CNN model called AlexNet won the annual ImageNet visual recognition challenge by dramatically reducing the error rate, many researchers in computer vision and natural language processing have started to take advantage of the power of deep learning models. Modern deep-learning-based object detections are all based on CNN and built on top of pre-trained models such as AlexNet, Google Inception, or another popular net VGG. These CNNs typically have trained millions of parameters and can convert an input image to a set of features that can be further used for tasks such as image classification, which we covered in the previous chapter, and object detection, among other computer-vision-related tasks.
In 2014, a state-of-the-art object detector that retrained AlexNet with a labeled...