Hands-On Vision and Behavior for Self-Driving Cars: Explore visual perception, lane detection, and object classification with Python 3 and OpenCV 4


Chapter 1: OpenCV Basics and Camera Calibration

This chapter is an introduction to OpenCV and how to use it in the initial phase of a self-driving car pipeline, to ingest a video stream and prepare it for the next phases. We will discuss the characteristics of a camera from the point of view of a self-driving car and how to improve the quality of what we get out of it. We will also study how to manipulate videos, and we will try one of the most famous features of OpenCV, object detection, which we will use to detect pedestrians.

With this chapter, you will build a solid foundation on how to use OpenCV and NumPy, which will be very useful later.

In this chapter, we will cover the following topics:

  • OpenCV and NumPy basics
  • Reading, manipulating, and saving images
  • Reading, manipulating, and saving videos
  • Manipulating images
  • How to detect pedestrians with HOG
  • Characteristics of a camera
  • How to perform the camera calibration

Technical requirements

For the instructions and code in this chapter, you need the following:

  • Python 3.7
  • The opencv-python module
  • The NumPy module

The code for the chapter can be found here:

https://github.com/PacktPublishing/Hands-On-Vision-and-Behavior-for-Self-Driving-Cars/tree/master/Chapter1

The Code in Action videos for this chapter can be found here:

https://bit.ly/2TdfsL7

Introduction to OpenCV and NumPy

OpenCV is a computer vision and machine learning library that has been developed for more than 20 years and provides an impressive number of functionalities. Despite some inconsistencies in the API, its simplicity and the remarkable number of algorithms implemented make it an extremely popular library and an excellent choice for many situations.

OpenCV is written in C++, but there are bindings for Python, Java, and Android.

In this book, we will focus on OpenCV for Python, with all the code tested using OpenCV 4.2.

OpenCV in Python is provided by opencv-python, which can be installed using the following command:

pip install opencv-python
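To confirm that the installation works, you can import the module and print its version; the exact number you see depends on what pip installed:

import cv2

print(cv2.__version__)   # for example, 4.2.0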

OpenCV can take advantage of hardware acceleration, but to get the best performance, you might need to build it from the source code, with different flags than the default, to optimize it for your target hardware.

OpenCV and NumPy

The Python bindings use NumPy, which increases the flexibility and makes it compatible with many other libraries. As an OpenCV image is a NumPy array, you can use normal NumPy operations to get information about the image. A good understanding of NumPy can improve the performance and reduce the length of your code.

Let's dive right in with some quick examples of what you can do with NumPy in OpenCV.

Image size

The size of the image can be retrieved using the shape attribute:

print("Image size: ", image.shape)

For a 50x50 grayscale image, image.shape would be the tuple (50, 50), while for an RGB image of the same size, the result would be (50, 50, 3).

False friends

In NumPy, the size attribute is the total number of elements in the array; for a 50x50 grayscale image, it would be 2,500, while for the same image in RGB, it would be 7,500. It's the shape attribute that contains the dimensions of the image – (50, 50) and (50, 50, 3), respectively.
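As a quick illustration of the difference, the following short snippet prints both attributes for two freshly created images:

import numpy as np

gray = np.zeros((50, 50), dtype=np.uint8)
rgb = np.zeros((50, 50, 3), dtype=np.uint8)

print(gray.shape, gray.size)   # (50, 50) 2500
print(rgb.shape, rgb.size)     # (50, 50, 3) 7500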

Grayscale images

Grayscale images are represented by a two-dimensional NumPy array. The first index selects the rows (the y coordinate) and the second index selects the columns (the x coordinate). The origin is the top-left corner of the image, with y growing downward and x growing to the right.

It is possible to create a black image using np.zeros(), which initializes all the pixels to 0:

black = np.zeros([100,100],dtype=np.uint8)  # Creates a black image

The previous code creates a grayscale image with size (100, 100), composed of 10,000 unsigned bytes (dtype=np.uint8).

To create an image with pixels with a different value than 0, you can use the full() method:

white = np.full([50, 50], 255, dtype=np.uint8)

To change the color of all the pixels at once, it's possible to use the [:] notation:

img[:] = 64        # Change the pixels color to dark gray

To affect only some rows, it is enough to provide a range of rows in the first index:

img[10:20] = 192   # Paints 10 rows with light gray

The previous code changes the color of rows 10-20, including row 10, but excluding row 20.

The same mechanism works for columns; you just need to specify the range in the second index. To instruct NumPy to include a full index, we use the [:] notation that we already encountered:

img[:, 10:20] = 64 # Paints 10 columns with dark gray

You can also combine operations on rows and columns, selecting a rectangular area:

img[90:100, 90:100] = 0  # Paints a 10x10 area with black

It is, of course, possible to operate on a single pixel, as you would do on a normal array:

img[50, 50] = 0  # Paints one pixel with black

It is possible to use NumPy to select a part of an image, also called the Region Of Interest (ROI). For example, the following code copies a 10x10 ROI from the position (90, 90) to the position (80, 80):

roi = img[90:100, 90:100]
img[80:90, 80:90] = roi 

The following is the result of the previous operations:

Figure 1.1 – Some manipulation of images using NumPy slicing

To make a copy of an image, you can simply use the copy() method:

image2 = image.copy()

RGB images

RGB images differ from grayscale because they are three-dimensional, with the third index representing the three channels. Please note that OpenCV stores the images in BGR format, not RGB, so channel 0 is blue, channel 1 is green, and channel 2 is red.

Important note

OpenCV stores the images as BGR, not RGB. In the rest of the book, when talking about RGB images, it will only mean that it is a 24-bit color image, but the internal representation will usually be BGR.

To create an RGB image, we need to provide three sizes:

rgb = np.zeros([100, 100, 3],dtype=np.uint8)  

If you were to run the same code previously used on the grayscale image on this new RGB image (omitting the third index), you would get the same result. This is because NumPy would apply the same value to all three channels, which results in a shade of gray.

To select a color, it is enough to provide the third index:

rgb[:, :, 2] = 255       # Makes the image red

In NumPy, it is also possible to select rows, columns, or channels that are not contiguous. You can do this by simply providing a tuple with the required indexes. To make the image magenta, you need to set the blue and red channels to 255, which can be achieved with the following code:

rgb[:, :, (0, 2)] = 255  # Makes the image magenta

You can convert an RGB image into grayscale using cvtColor():

gray = cv2.cvtColor(original, cv2.COLOR_BGR2GRAY)

Working with image files

OpenCV provides a very simple way to load images, using imread():

import cv2
image = cv2.imread('test.jpg')

To show the image, you can use imshow(), which accepts two parameters:

  • The name to write on the caption of the window that will show the image
  • The image to be shown

Unfortunately, its behavior is counterintuitive, as it will not show an image unless it is followed by a call to waitKey():

cv2.imshow("Image", image)cv2.waitKey(0)

The call to waitKey() after imshow() will have two effects:

  • It will actually allow OpenCV to show the image provided to imshow().
  • It will wait for the specified number of milliseconds, or until a key is pressed; if the number of milliseconds passed is <= 0, it will wait indefinitely.

An image can be saved on disk using the imwrite() method, which accepts three parameters:

  • The name of the file
  • The image
  • An optional format-dependent parameter:
cv2.imwrite("out.jpg", image)

Sometimes, it can be very useful to combine multiple pictures by putting them next to each other. Some examples in this book will use this feature extensively to compare images.

OpenCV provides two methods for this purpose: hconcat() to concatenate the pictures horizontally and vconcat() to concatenate them vertically, both accepting as a parameter a list of images. Take the following example:

black = np.zeros([50, 50], dtype=np.uint8)
white = np.full([50, 50], 255, dtype=np.uint8)
cv2.imwrite("horizontal.jpg", cv2.hconcat([white, black]))
cv2.imwrite("vertical.jpg", cv2.vconcat([white, black]))

Here's the result:

Figure 1.2 – Horizontal concatenation with hconcat() and vertical concatenation with vconcat()

We could use these two methods to create a chequerboard pattern:

row1 = cv2.hconcat([white, black])
row2 = cv2.hconcat([black, white])
cv2.imwrite("chess.jpg", cv2.vconcat([row1, row2]))

You will see the following chequerboard:

Figure 1.3 – A chequerboard pattern created using hconcat() in combination with vconcat()

After having worked with images, it's time we work with videos.

Working with video files

Using videos in OpenCV is very simple; in fact, every frame is an image and can be manipulated with the methods that we have already analyzed.

To open a video in OpenCV, you need to create a VideoCapture object:

cap = cv2.VideoCapture("video.mp4")

After that, you can call read(), typically in a loop, to retrieve a single frame. The method returns a tuple with two values:

  • A Boolean value that is false when the video is finished
  • The next frame:
ret, frame = cap.read()

To save a video, there is the VideoWriter object; its constructor accepts four parameters:

  • The filename
  • A FOURCC (four-character code) identifying the video codec
  • The number of frames per second
  • The resolution

Take the following example:

mp4 = cv2.VideoWriter_fourcc(*'MP4V')
writer = cv2.VideoWriter('video-out.mp4', mp4, 15, (640, 480))

Once VideoWriter has been created, the write() method can be used to add a frame to the video file:

writer.write(image)

When you have finished using the VideoCapture and VideoWriter objects, you should call their release method:

cap.release()
writer.release()
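The following sketch ties these pieces together: it reads video.mp4 frame by frame and writes an identical copy, taking the frame rate and resolution from the input instead of hardcoding them. The file names are example values:

import cv2

cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 15            # Fall back to 15 FPS if unknown
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

fourcc = cv2.VideoWriter_fourcc(*'MP4V')
writer = cv2.VideoWriter("video-copy.mp4", fourcc, fps, (width, height))

while True:
    ret, frame = cap.read()
    if not ret:          # False when the video is finished
        break
    writer.write(frame)  # Any per-frame processing would go here

cap.release()
writer.release()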

Working with webcams

Webcams are handled similarly to a video in OpenCV; you just need to provide a different parameter to VideoCapture, which is the 0-based index identifying the webcam:

cap = cv2.VideoCapture(0)

The previous code opens the first webcam; if you need to use a different one, you can specify a different index.
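A minimal sketch to display the live feed until the q key is pressed could look like this:

import cv2

cap = cv2.VideoCapture(0)                   # First webcam

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow("Webcam", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):   # Wait 1 ms; quit on 'q'
        break

cap.release()
cv2.destroyAllWindows()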

Now, let's try manipulating some images.

Manipulating images

As part of a computer vision pipeline for a self-driving car, with or without deep learning, you might need to process the video stream to make other algorithms work better as part of a preprocessing step.

This section will provide you with a solid foundation to preprocess any video stream.

Flipping an image

OpenCV provides the flip() method to flip an image, and it accepts two parameters:

  • The image
  • A number that can be 1 (horizontal flip), 0 (vertical flip), or -1 (both horizontal and vertical flip)

Let's see a sample code:

flipH = cv2.flip(img, 1)
flipV = cv2.flip(img, 0)
flip = cv2.flip(img, -1)

This will produce the following result:

Figure 1.4 – Original image, horizontally flipped, vertically flipped, and both

As you can see, the first image is our original, followed by the same image flipped horizontally, flipped vertically, and flipped both horizontally and vertically.

Blurring an image

Sometimes, an image can be too noisy, possibly because of some processing steps that you have done. OpenCV provides several methods to blur an image, which can help in these situations. Most likely, you will have to take into consideration not only the quality of the blur but also the speed of execution.

The simplest method is blur(), which applies a low-pass filter to the image and requires at least two parameters:

  • The image
  • The kernel size (a bigger kernel means more blur):
blurred = cv2.blur(image, (15, 15))

Another option is to use GaussianBlur(), which offers more control and requires at least three parameters:

  • The image
  • The kernel size
  • sigmaX, which is the standard deviation on X

It is recommended to specify both sigmaX and sigmaY (the standard deviation on Y, the fourth parameter):

gaussian = cv2.GaussianBlur(image, (15, 15), sigmaX=15, sigmaY=15)

An interesting blurring method is medianBlur(), which computes the median and therefore only emits pixel values that are actually present in the image (which is not necessarily the case with the previous methods). It is effective at reducing "salt and pepper" noise and has two mandatory parameters:

  • The image
  • The kernel size (an odd integer greater than 1):
median = cv2.medianBlur(image, 15)

There is also a more complex filter, bilateralFilter(), which is effective at removing noise while keeping the edge sharp. It is the slowest of the filters, and it requires at least four parameters:

  • The image
  • The diameter of each pixel neighborhood
  • sigmaColor: Filters sigma in the color space, affecting how much the different colors are mixed together, inside the pixel neighborhood
  • sigmaSpace: Filters sigma in the coordinate space, affecting how distant pixels affect each other, if their colors are closer than sigmaColor:
bilateral = cv2.bilateralFilter(image, 15, 50, 50)

Choosing the best filter will probably require some experiments. You might also need to consider the speed. To give you some ballpark estimations based on my tests, and considering that the performance is dependent on the parameters supplied, note the following:

  • blur() is the fastest.
  • GaussianBlur() is similar, but it can be 2x slower than blur().
  • medianBlur() can easily be 20x slower than blur().
  • bilateralFilter() is the slowest and can be 45x slower than blur().

Here are the resultant images:

Figure 1.5 – Original, blur(), GaussianBlur(), medianBlur(), and bilateralFilter(), with the parameters used in the code samples
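If you want to get a feel for these ratios on your own hardware, a rough timing sketch such as the following can help; the absolute numbers depend heavily on the image size, the parameters, and your CPU, so treat them as indicative only:

import timeit
import numpy as np
import cv2

image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # Random test image

filters = [
    ("blur", lambda: cv2.blur(image, (15, 15))),
    ("GaussianBlur", lambda: cv2.GaussianBlur(image, (15, 15), sigmaX=15, sigmaY=15)),
    ("medianBlur", lambda: cv2.medianBlur(image, 15)),
    ("bilateralFilter", lambda: cv2.bilateralFilter(image, 15, 50, 50)),
]

for name, fn in filters:
    print(name, round(timeit.timeit(fn, number=20), 3), "seconds for 20 runs")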

Changing contrast, brightness, and gamma

A very useful function is convertScaleAbs(), which executes several operations on all the values of the array:

  • It multiplies them by the scaling parameter, alpha.
  • It adds the delta parameter, beta, to them.
  • It takes the absolute value of the result.
  • If the result is above 255, it is set to 255.
  • The result is converted into an unsigned 8-bit int.

The function accepts four parameters:

  • The source image
  • The destination (optional)
  • The alpha parameter used for the scaling
  • The beta delta parameter

convertScaleAbs() can be used to affect the contrast, as an alpha scaling factor above 1 increases the contrast (amplifying the color difference between pixels), while a scaling factor below one reduces it (decreasing the color difference between pixels):

cv2.convertScaleAbs(image, more_contrast, 2, 0)
cv2.convertScaleAbs(image, less_contrast, 0.5, 0)

It can also be used to affect the brightness, as the beta delta factor can be used to increase the value of all the pixels (increasing the brightness) or to reduce them (decreasing the brightness):

cv2.convertScaleAbs(image, more_brightness, 1, 64)
cv2.convertScaleAbs(image, less_brightness, 1, -64)

Let's see the resulting images:

Figure 1.6 – Original, more contrast (2x), less contrast (0.5x), more brightness (+64), and less brightness (-64)

A more sophisticated method to change the brightness is to apply gamma correction. This can be done with a simple calculation using NumPy. A gamma value above 1 will increase the brightness, and a gamma value below 1 will reduce it:

Gamma = 1.5
g_1_5 = np.array(255 * (image / 255) ** (1 / Gamma), dtype='uint8')
Gamma = 0.7
g_0_7 = np.array(255 * (image / 255) ** (1 / Gamma), dtype='uint8')

The following images will be produced:

Figure 1.7 – Original, higher gamma (1.5), and lower gamma (0.7)

You can see the effect of different gamma values in the middle and right images.
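The same correction can also be expressed as a 256-entry lookup table applied with cv2.LUT(), which avoids recomputing the power for every pixel. Here is a small sketch of that alternative; the helper name adjust_gamma and the file name are just examples:

import numpy as np
import cv2

def adjust_gamma(image, gamma):
    # Precompute the output value for each of the 256 possible inputs
    table = np.array([255 * (i / 255) ** (1 / gamma) for i in range(256)],
                     dtype=np.uint8)
    return cv2.LUT(image, table)

image = cv2.imread("test.jpg")
brighter = adjust_gamma(image, 1.5)
darker = adjust_gamma(image, 0.7)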

Drawing rectangles and text

When working on object detection tasks, it is a common need to highlight an area to see what has been detected. OpenCV provides the rectangle() function, accepting at least the following parameters:

  • The image
  • The upper-left corner of the rectangle
  • The lower-right corner of the rectangle
  • The color to use
  • (Optional) The thickness:
cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 255), 2)

To write some text in the image, you can use the putText() method, accepting at least six parameters:

  • The image
  • The text to print
  • The coordinates of the bottom-left corner
  • The font face
  • The scale factor, to change the size
  • The color:
cv2.putText(image, 'Text', (x, y), cv2.FONT_HERSHEY_PLAIN, 2, clr)

Pedestrian detection using HOG

The Histogram of Oriented Gradients (HOG) is an object detection technique implemented by OpenCV. In simple cases, it can be used to see whether there is a certain object present in the image, where it is, and how big it is.

OpenCV includes a detector trained for pedestrians, and you are going to use it. It might not be enough for a real-life situation, but it is useful to learn how to use it. You could also train another one with more images to see whether it performs better. Later in the book, you will see how to use deep learning to detect not only pedestrians but also cars and traffic lights.

Sliding window

The HOG pedestrian detector in OpenCV is trained with a model that is 64x128 pixels (the size used by the default people detector), and therefore it is not able to detect objects smaller than that (or, better, it could, but the box will be 64x128).

At the core of the HOG detector, there is a mechanism able to tell whether a given 64x128 image is a pedestrian. As this is not terribly useful on its own, OpenCV implements a sliding window mechanism, where the detector is applied many times, on slightly different positions; the "image window" under consideration slides a bit. Once it has analyzed the whole image, the image window is increased in size (scaled) and the detector is applied again, to be able to detect bigger objects. Therefore, the detector is applied hundreds or even thousands of times for each image, which can be slow.

Using HOG with OpenCV

First, you need to initialize the detector and specify that you want to use the detector for pedestrians:

hog = cv2.HOGDescriptor()
det = cv2.HOGDescriptor_getDefaultPeopleDetector()
hog.setSVMDetector(det)

Then, it is just a matter of calling detectMultiScale():

(boxes, weights) = hog.detectMultiScale(image, winStride=(1, 1), padding=(0, 0), scale=1.05)

The parameters that we used require some explanation, and they are as follows:

  • The image
  • winStride, the window stride, which specifies how much the sliding window moves every time
  • padding, which can add some padding pixels at the border of the image (useful to detect pedestrians close to the border)
  • scale, which specifies how much to enlarge the image window at each iteration

You should consider that decreasing winStride can improve the accuracy (as more positions are considered), but it has a big impact on performance. For example, a stride of (4, 4) can be up to 16 times faster than a stride of (1, 1), though in practice, the performance difference is a bit less, maybe 10 times.

In general, decreasing the scale also improves the precision and decreases the performance, though the impact is not dramatic.

Improving the precision means detecting more pedestrians, but this can also increase the false positives. detectMultiScale() has a couple of advanced parameters that can be used for this:

  • hitThreshold, which changes the distance required from the Support Vector Machine (SVM) plane. A higher threshold means that the detector is more confident in the result.
  • finalThreshold, which is related to the number of detections in the same area.

Tuning these parameters requires some experiments, but in general, a higher hitThreshold value (typically in the range 0–1.0) should reduce the false positives.

A higher finalThreshold value (such as 10) will also reduce the false positives.

We will use detectMultiScale() on an image with pedestrians generated by Carla:

Figure 1.8 – HOG detection, winStride=(1, 2), scale=1.05, padding=(0, 0). Left: hitThreshold = 0, finalThreshold = 1; Center: hitThreshold = 0, finalThreshold = 3; Right: hitThreshold = 0.2, finalThreshold = 1

As you can see, we have pedestrians being detected in the image. Using a low hit threshold and a low final threshold can result in false positives, as in the left image. Your goal is to find the right balance, detecting the pedestrians but without having too many false positives.
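A minimal end-to-end sketch that ties this section together – loading an image, running the detector, and drawing a labeled box around each detection with rectangle() and putText() – could look like the following; the file name and parameter values are example choices, not the only reasonable ones:

import cv2

image = cv2.imread("pedestrians.jpg")

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

boxes, weights = hog.detectMultiScale(image, winStride=(4, 4),
                                      padding=(8, 8), scale=1.05)

for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 255), 2)
    cv2.putText(image, 'person', (x, y - 5), cv2.FONT_HERSHEY_PLAIN, 1, (255, 255, 255))

cv2.imshow("Pedestrians", image)
cv2.waitKey(0)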

Introduction to the camera

Cameras are probably the most ubiquitous sensors in our modern world. They are used in everyday life in our mobile phones, laptops, surveillance systems, and, of course, photography. They provide rich, high-resolution imagery containing extensive information about the environment, including spatial, color, and temporal information.

It is no surprise that they are heavily used in self-driving technologies. One reason why the camera is so popular is that it mirrors the functionality of the human eye. For this reason, we are very comfortable using cameras, as we connect on a deep level with their functionality, limitations, and strengths.

In this section, you will learn about the following:

  • Camera terminology
  • The components of a camera
  • Strengths and weaknesses
  • Choosing the right camera for self-driving

Let's discuss each in detail.

Camera terminology

Before you learn about the components of a camera and its strengths and weaknesses, you need to know some basic terminology. These terms will be important when evaluating and ultimately choosing your camera for your self-driving application.

Field of View (FoV)

This is the vertical and horizontal angular portion of the environment (scene) that is visible to the sensor. In self-driving cars, you typically want to balance the FoV with the resolution of the sensor to ensure you see as much of the environment as possible with the least number of cameras. There is a trade-off related to FoV: a larger FoV usually means more lens distortion, which you will need to compensate for in your camera calibration (see the section on camera calibration):

Figure 1.9 – Field of View, credit: https://www.researchgate.net/figure/Illustration-of-camera-lenss-field-of-view-FOV_fig4_335011596

Resolution

This is the total number of pixels in the horizontal and vertical directions on the sensor. This parameter is often discussed using the term megapixels (MP). For example, a 5 MP camera, such as the FLIR Blackfly, has a sensor with 2448 × 2048 pixels, which equates to 5,013,504 pixels.

Higher resolutions allow you to use a lens with a wider FoV but still provide the detail needed for running your computer vision algorithms. This means you can use fewer cameras to cover the environment and thereby lower the cost.

The Blackfly, in all its different flavors, is a common camera used in self-driving vehicles thanks to its cost, small form, reliability, robustness, and ease of integration:

Figure 1.10 – Pixel resolution

Focal length

This is the length from the lens optical center to the sensor. The focal length is best thought of as the zoom of the camera. A longer focal length means you will be zoomed in closer to objects in the environment. In your self-driving car, you may choose different focal lengths based on what you need to see in the environment. For example, you might choose a relatively long focal length of 100 mm to ensure enough resolution for your classifier algorithm to detect a traffic signal at a distance far enough to allow the car to react with smooth and safe stopping:

Figure 1.11 – Focal length, credit: https://photographylife.com/what-is-focal-length-in-photography

Aperture and f-stop

This is the opening through which light passes to shine on the sensor. The unit that is commonly used to describe the size of the opening is the f-stop, which refers to the ratio of the focal length over the aperture size. For example, a lens with a 50 mm focal length and an aperture diameter of 35 mm will equate to an f-stop of f/1.4. The following figure illustrates different aperture diameters and their f-stop values on a 50 mm focal length lens. Aperture size is very important in your self-driving car as it is directly correlated with the Depth of Field (DoF). Large apertures also allow the camera to be tolerant of obscurants (for example, bugs) that may be on the lens. Larger apertures allow light to pass around the bug and still make it to the sensor:

Figure 1.12 – Aperture, credit: https://en.wikipedia.org/wiki/Aperture#/media/File:Lenses_with_different_apertures.jpg

Depth of field (DoF)

This is the distance range in the environment that will be in focus. This is directly correlated to the size of the aperture. Generally, in self-driving cars, you will want a deep DoF so that everything in the FoV is in focus for your computer vision algorithms. The problem is that deep DoF is achieved with a small aperture, which means less light impacting the sensor. So, you will need to balance DoF with dynamic range and ISO to ensure you see everything you need to in your environment.

The following figure depicts the relationship between DoF and aperture:

Figure 1.13 – DoF versus aperture, credit: https://thumbs.dreamstime.com/z/aperture-infographic-explaining-depth-field-corresponding-values-their-effect-blur-light-75823732.jpg

Dynamic range

This is a property of the sensor that indicates its contrast ratio or the ratio of the brightest over the darkest subjects that it can resolve. This may be referred to using the unit dB (for example, 78 dB) or contrast ratio (for example, 2,000,000/1).

Self-driving cars need to operate both during the day and at night. This means that the sensor needs to be sensitive enough to provide useful detail in dark conditions while not oversaturating when driving in bright sunlight. Another reason for High Dynamic Range (HDR) is driving when the sun is low on the horizon. I am sure you have experienced this while driving to work in the morning: the sun is right in your face, and you can barely see the environment in front of you because it is saturating your eyes. HDR means that the sensor will be able to see the environment even in the face of direct sunlight. The following figure illustrates these conditions:

Figure 1.14 – Example HDR, credit: https://petapixel.com/2011/05/02/use-iso-numbers-that-are-multiples-of-160-when-shooting-dslr-video/

Your dream dynamic range

If you could make a wish and have whatever dynamic range you wanted in your sensor, what would it be?

International Organization for Standardization (ISO) sensitivity

This is the sensitivity of the pixels to incoming photons.

Wait a minute, you say, do you have your acronym mixed up? It looks like it, but the International Organization for Standardization decided to standardize even their acronym since it would be different in every language otherwise. Thanks, ISO!

The standardized ISO values can range from 100 to upward of 10,000. Lower ISO values correspond to a lower sensitivity of the sensor. Now you may ask, "why wouldn't I want the highest sensitivity?" Well, sensitivity comes at a cost...NOISE. The higher the ISO, the more noise you will see in your images. This added noise may cause trouble for your computer vision algorithms when trying to classify objects. In the following figure, you can see the effect of higher ISO values on noise in an image. These images are all taken with the lens cap on (fully dark). As you increase the ISO value, random noise starts to creep in:

Figure 1.15 – Example ISO values and noise in a dark room

Frame rate (FPS)

This is the rate at which the sensor can obtain consecutive images, usually expressed in Hz or Frames Per Second (FPS). Generally speaking, you want the fastest frame rate possible so that fast-moving objects are not blurry in your scene. The main trade-off is latency – the time from a real event happening until your computer vision algorithm detects it: a higher frame rate produces more data to process, and if your pipeline cannot keep up, the latency grows. In the following figure, you can see the effect of frame rate on motion blur.

Blur is not the only reason for choosing a higher frame rate. Depending on the speed of your vehicle, you will need a frame rate that will allow the vehicle to react if an object suddenly appears in its FoV. If your frame rate is too slow, by the time the vehicle sees something, it may be too late to react:

Figure 1.16 – 120 Hz versus 60 Hz frame rate, credit: https://gadgetstouse.com/blog/2020/03/18/difference-between-60hz-90hz-120hz-displays/
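To make the reaction-time argument concrete, here is a back-of-the-envelope calculation of how far a vehicle travels between consecutive frames; the speed and frame rates are illustrative numbers only:

# Distance traveled between frames at a given speed and frame rate
speed_kmh = 100                         # example vehicle speed
for fps in (10, 30, 60):
    meters_per_frame = (speed_kmh / 3.6) / fps
    print(f"{fps} FPS -> {meters_per_frame:.2f} m between frames")
# At 100 km/h and 10 FPS, the car moves almost 2.8 m between frames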

Lens flare

These are the artifacts of light from an object that impact pixels on the sensor that do not correlate with the position of the object in the environment. You have likely experienced this driving at night when you see oncoming headlights. That starry effect is due to light scattered in the lens of your eye (or camera), due to imperfections, leading some of the photons to impact "pixels" that do not correlate with where the photons came from – that is, the headlights. The following figure shows what that effect looks like. You can see that the starburst makes it very difficult to see the actual object, the car!

Figure 1.17 – Lens flare from oncoming headlights, credit: https://s.blogcdn.com/cars.aol.co.uk/media/2011/02/headlights-450-a-g.jpg

Lens distortion

This is the difference between the rectilinear (real) scene and what your camera image shows. If you have ever seen action camera footage, you have probably recognized the "fish-eye" lens effect. The following figure shows an extreme example of the distortion from a wide-angle lens. You will learn to correct this distortion with OpenCV:

Figure 1.18 – Lens distortion, credit: https://www.slacker.xyz/post/what-lens-should-i-get

The components of a camera

Like the eye, a camera is made up of a light-sensitive array, an aperture, and a lens.

Light sensitive array – CMOS sensor (the camera's retina)

The light-sensitive array, in most consumer cameras, is called a CMOS active-pixel sensor (or just a sensor). Its basic function is to convert incident photons into an electrical current that can be digitized based on the color wavelength of the photon.

The aperture (the camera's iris)

The aperture or iris of a camera is the opening through which light can pass on its way to the sensor. This can be variable or fixed depending on the type of camera you are using. The aperture is used to control parameters such as depth of field and the amount of light hitting the sensor.

The lens (the camera's lens)

The lens or optics are the components of the camera that focus the light from the environment onto the sensor. The lens primarily determines the FoV of the camera through its focal length. In self-driving applications, the FoV is very important since it determines how much of the environment the car can see with a single camera. The optics of a camera are often some of the most expensive parts and have a large impact on image quality and lens flare.

Considerations for choosing a camera

Now that you have learned all the basics of what a camera is and the relevant terminology, it is time to learn how to choose a camera for your self-driving application. The following is a list of the primary factors that you will need to balance when choosing a camera:

  • Resolution
  • FoV
  • Dynamic range
  • Cost
  • Size
  • Ingress protection (IP rating)

    The perfect camera

    If you could design the ideal camera, what would it be?

My perfect self-driving camera would be able to see in all directions (spherical FoV, 360º HFoV x 360º VFoV). It would have infinite resolution and dynamic range, so you could digitally resolve objects at any distance in any lighting condition. It would be the size of a grain of rice, completely water- and dustproof, and would cost $5! Obviously, this is not possible. So, we must make some careful trade-offs for what we need.

The best place to start is with your budget for cameras. This will give you an idea of what models and specifications to look for.

Next, consider what you need to see for your application:

  • Do you need to be able to see a child from 200 m away while traveling at 100 km/h?
  • What coverage around the vehicle do you need, and can you tolerate any blind spots on the side of the vehicle?
  • Do you need to see at night and during the day?

Lastly, consider how much room you have to integrate these cameras. You probably don't want your vehicle to look like this:

Figure 1.19 – Camera art, credit: https://www.flickr.com/photos/laughingsquid/1645856255/

This may be very overwhelming, but it is important when thinking about how to design your computer vision system. A good camera to start with that is very popular is the FLIR Blackfly S series. They strike an excellent balance of resolution, FPS, and cost. Next, pair it with a lens that meets your FoV needs. There are some helpful FoV calculators available on the internet, such as the one from http://www.bobatkins.com/photography/technical/field_of_view.html.
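If you prefer to do the arithmetic yourself, the horizontal FoV of a simple pinhole-model camera follows directly from the sensor width and the focal length; the sensor width below is a rough example value for a small machine-vision sensor, not the specification of any particular camera:

import math

def horizontal_fov(sensor_width_mm, focal_length_mm):
    # Pinhole-camera relation: FoV = 2 * atan(sensor_width / (2 * focal_length))
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Example: a sensor roughly 7.2 mm wide behind a 6 mm lens
print(round(horizontal_fov(7.2, 6.0), 1), "degrees")   # about 62 degrees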

Strengths and weaknesses of cameras

Now, no sensor is perfect, and even your beloved camera will have its pros and cons. Let's go over some of them now.

Let's look at the strengths first:

  • High-resolution: Relative to other sensor types, such as radar, lidar, and sonar, cameras have an excellent resolution for picking out objects in your scene. You can easily find cameras with 5 MP resolution quite cheaply.
  • Texture, color, and contrast information: Cameras provide very rich information about the environment that other sensor types just can't. This is because of a variety of wavelengths that cameras sense.
  • Cost: Cameras are one of the cheapest sensors you can find, especially for the quality of data they provide.
  • Size: CMOS technology and modern ASICs have made cameras incredibly small, many less than 30 mm cubed.
  • Range: This is really thanks to the high resolution and passive nature of the sensor.

Next, here are the weaknesses:

  • A large amount of data to process for object detection: With high resolution comes a lot of data. Such is the price we pay for such accurate and detailed imagery.
  • Passive: A camera requires an external illumination source, such as the sun, headlights, and so on.
  • Obscurants (such as bugs, raindrops, heavy fog, dust, or snow): A camera is not particularly good at seeing through heavy rain, fog, dust, or snow. Radars are typically better suited for this.
  • Lack of native depth/velocity information: A camera image alone doesn't give you any information about an object's speed or distance.

    Photogrammetry is helping to bolster this weakness, but it costs valuable processing resources (GPU, CPU, latency, and so on) and is less accurate than a radar or lidar sensor, which produces this information natively.

Now that you have a good understanding of how a camera works, as well as its basic parts and terminology, it's time to get your hands dirty and start calibrating a camera with OpenCV.

Camera calibration with OpenCV

In this section, you will learn how to take objects with a known pattern and use them to correct lens distortion using OpenCV.

Remember the lens distortion we talked about in the previous section? You need to correct this to ensure you accurately locate where objects are relative to your vehicle. It does you no good to see an object if you don't know whether it is in front of you or next to you. Even good lenses can distort the image, and this is particularly true for wide-angle lenses. Luckily, OpenCV provides a mechanism to detect this distortion and correct it!

The idea is to take pictures of a chessboard, so OpenCV can use this high-contrast pattern to detect the position of the points and compute the distortion based on the difference between the expected image and the recorded one.

You need to provide several pictures at different orientations. It might take some experimentation to find a good set of pictures, but 10 to 20 images should be enough. If you use a printed chessboard, take care to keep the paper as flat as possible so as not to compromise the measurements:

Figure 1.20 – Some examples of pictures that can be used for calibration

As you can see, the central image clearly shows some barrel distortion.

Distortion detection

OpenCV tries to map a series of three-dimensional points to the two-dimensional coordinates of the camera. OpenCV will then use this information to correct the distortion.

The first thing to do is to initialize some structures:

image_points = []   # 2D points
object_points = []  # 3D points
coords = np.zeros((1, nX * nY, 3), np.float32)
coords[0, :, :2] = np.mgrid[0:nY, 0:nX].T.reshape(-1, 2)

Please note nX and nY, which are the number of points to find on the chessboard along the x and y axes, respectively. In practice, this is the number of squares minus 1.

Then, we need to call findChessboardCorners():

found, corners = cv2.findChessboardCorners(image, (nY, nX), None)

found is true if OpenCV found the points, and corners will contain the points found.

In our code, we will assume that the image has been converted into grayscale, but you can calibrate using an RGB picture as well.

OpenCV can draw the corners it found on the image, which is a nice way to check that the algorithm is working properly:

out = cv2.drawChessboardCorners(image, (nY, nX), corners, True)
object_points.append(coords)   # Save 3D points
image_points.append(corners)   # Save corresponding 2D points

Let's see the resulting image:

Figure 1.21 – Corners of the calibration image found by OpenCV

Calibration

After finding the corners in several images, we are finally ready to generate the calibration data using calibrateCamera():

ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(object_points, image_points, shape[::-1], None, None)

Now, we are ready to correct our images, using undistort():

dst = cv2.undistort(image, mtx, dist, None, mtx)

Let's see the result:

Figure 1.22 – Original image and calibrated image

We can see that the second image has less barrel distortion, but it is not great. We probably need more and better calibration samples.

But we can also try to get more precision out of the same calibration images by refining the detected corners to sub-pixel precision. This can be done by calling cornerSubPix() after findChessboardCorners():

corners = cv2.cornerSubPix(image, corners, (11, 11), (-1, -1), (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))

The following is the resulting image:

Figure 1.23 – Image calibrated with sub-pixel precision

As the complete code is a bit long, I recommend checking out the full source code on GitHub.
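For reference, here is a condensed sketch of the whole flow described above, assuming a folder of chessboard photos in calib/*.jpg and a board with nX x nY inner corners; adjust the path and the corner counts to match your own setup:

import glob
import cv2
import numpy as np

nX, nY = 6, 9                                    # Inner corners (squares minus 1)
coords = np.zeros((1, nX * nY, 3), np.float32)
coords[0, :, :2] = np.mgrid[0:nY, 0:nX].T.reshape(-1, 2)

object_points, image_points = [], []             # 3D and 2D points
shape = None

for path in glob.glob("calib/*.jpg"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    shape = gray.shape
    found, corners = cv2.findChessboardCorners(gray, (nY, nX), None)
    if not found:
        continue                                 # Skip images where the board was not detected
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1),
                               (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
    object_points.append(coords)
    image_points.append(corners)

ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, shape[::-1], None, None)

undistorted = cv2.undistort(cv2.imread("calib/test.jpg"), mtx, dist, None, mtx)
cv2.imwrite("undistorted.jpg", undistorted)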

Summary

Well, you have had a great start to your computer vision journey toward making a real self-driving car.

You learned about a very useful toolset called OpenCV with bindings for Python and NumPy. With these tools, you are now able to create and import images using methods such as imread(), imshow(), hconcat(), and vconcat(). You learned how to import and create video files, as well as how to capture video from a webcam, with methods such as VideoCapture() and VideoWriter(). Watch out Spielberg, there is a new movie-maker in town!

It was wonderful to be able to import images, but how do you start manipulating them to help your computer vision algorithms learn what features matter? You learned how to do this through methods such as flip(), blur(), GaussianBlur(), medianBlur(), bilateralFilter(), and convertScaleAbs(). Then, you learned how to annotate images for human consumption with methods such as rectangle() and putText().

Then came the real magic, where you learned how to take the images and do your first piece of real computer vision using HOG to detect pedestrians. You learned how to apply a sliding window to scan the detector over an image in various sized windows using the detectMultiScale() method, with parameters such as winStride, padding, scale, hitThreshold, and finalThreshold.

You had a lot of fun with all the new tools you learned for working with images. But there was something missing. How do I get these images on my self-driving car? To answer this, you learned about the camera and its basic terminology, such as resolution, FoV, focal length, aperture, DoF, dynamic range, ISO, frame rate, lens flare, and finally, lens distortion. Then, you learned the basic components that comprise a camera, namely the lens, aperture, and light-sensitive arrays. With these basics, you moved on to some considerations for choosing a camera for your application by learning about the strengths and weaknesses of a camera.

Armed with this knowledge, you boldly began to remove one of these weaknesses, lens distortion, with the tools you learned in OpenCV for distortion correction. You used methods such as findChessboardCorners(), calibrateCamera(), undistort(), and cornerSubPix() for this.

Wow, you are really on your way to being able to perceive the world in your self-driving application. You should take a moment and be proud of what you have learned so far. Maybe you can celebrate with a selfie and apply some of what you learned!

In the next chapter, you are going to learn about some of the basic signal types and protocols you are likely to encounter when trying to integrate sensors in your self-driving application.

Questions

  1. Can OpenCV take advantage of hardware acceleration?
  2. What's the best blurring method if CPU power is not a problem?
  3. Which detector can be used to find pedestrians in an image?
  4. How can you read the video stream from a webcam?
  5. What is the trade-off between aperture and depth of field?
  6. When do you need a high ISO?
  7. Is it worth computing sub-pixel precision for camera calibration?

Key benefits

  • Explore the building blocks of the visual perception system in self-driving cars
  • Identify objects and lanes to define the boundary of driving surfaces using open-source tools like OpenCV and Python
  • Improve the object detection and classification capabilities of systems with the help of neural networks

Description

The visual perception capabilities of a self-driving car are powered by computer vision. The work relating to self-driving cars can be broadly classified into three components - robotics, computer vision, and machine learning. This book provides existing computer vision engineers and developers with the unique opportunity to be associated with this booming field. You will learn about computer vision, deep learning, and depth perception applied to driverless cars. The book provides a structured and thorough introduction, as making a real self-driving car is a huge cross-functional effort. As you progress, you will cover relevant cases with working code, before going on to understand how to use OpenCV, TensorFlow and Keras to analyze video streaming from car cameras. Later, you will learn how to interpret and make the most of lidars (light detection and ranging) to identify obstacles and localize your position. You’ll even be able to tackle core challenges in self-driving cars such as finding lanes, detecting pedestrian and crossing lights, performing semantic segmentation, and writing a PID controller. By the end of this book, you’ll be equipped with the skills you need to write code for a self-driving car running in a driverless car simulator, and be able to tackle various challenges faced by autonomous car engineers.

Who is this book for?

This book is for software engineers who are interested in learning about technologies that drive the autonomous car revolution. Although basic knowledge of computer vision and Python programming is required, prior knowledge of advanced deep learning and how to use sensors (lidar) is not needed.

What you will learn

  • Understand how to perform camera calibration
  • Become well-versed with how lane detection works in self-driving cars using OpenCV
  • Explore behavioral cloning by self-driving in a video-game simulator
  • Get to grips with using lidars
  • Discover how to configure the controls for autonomous vehicles
  • Use object detection and semantic segmentation to locate lanes, cars, and pedestrians
  • Write a PID controller to control a self-driving car running in a simulator

Product Details

Publication date: Oct 23, 2020
Length: 374 pages
Edition: 1st
Language: English
ISBN-13: 9781800201934



Table of Contents

Section 1: OpenCV and Sensors and Signals
Chapter 1: OpenCV Basics and Camera Calibration
Chapter 2: Understanding and Working with Signals
Chapter 3: Lane Detection
Section 2: Improving How the Self-Driving Car Works with Deep Learning and Neural Networks
Chapter 4: Deep Learning with Neural Networks
Chapter 5: Deep Learning Workflow
Chapter 6: Improving Your Neural Network
Chapter 7: Detecting Pedestrians and Traffic Lights
Chapter 8: Behavioral Cloning
Chapter 9: Semantic Segmentation
Section 3: Mapping and Controls
Chapter 10: Steering, Throttle, and Brake Control
Chapter 11: Mapping Our Environments
Assessments
Other Books You May Enjoy

