Hardware overview
The Kinect device is a horizontal bar composed of multiple sensors connected to a base with a motorized pivot.
The following image provides a schematic representation of the main Kinect hardware components. Looking at the sensor from the front, we can identify the infrared (IR) projector (1), the depth camera (2), and the RGB camera (3). The tilt motor (4), the three-axis accelerometer (5), and an array of four microphones (6) are housed inside the plastic case.
The device is connected to a PC through a USB 2.0 cable. It needs an external power supply in order to work because USB ports don't provide enough power.
Now let's jump into the main features of its components.
The IR projector
The IR projector is the device that Kinect uses to project the IR rays from which the depth data is computed. The IR projector, which from the outside looks like a common camera, is a laser emitter that constantly projects a pattern of structured IR dots at a wavelength of around 830 nm (patent US20100118123, Prime Sense Ltd.). This light beam is invisible to human eyes (which typically respond to wavelengths from about 390 nm to 750 nm), except for a bright red dot in the center of the emitter.
The pattern is composed of 3 x 3 subpatterns of 211 x 165 dots each (for a total of 633 x 495 dots). In each subpattern, one spot is much brighter than all the others.
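As a quick check of the figures above, the following minimal sketch recomputes the size of the full pattern from the subpattern dimensions. The constant and function names are ours, chosen for illustration only.

```python
# Dimensions quoted in the text: a 3 x 3 grid of subpatterns,
# each made of 211 x 165 dots.
GRID = (3, 3)
SUBPATTERN = (211, 165)


def pattern_size(grid=GRID, sub=SUBPATTERN):
    """Return (columns, rows, total dots) of the full projected pattern."""
    cols = grid[0] * sub[0]
    rows = grid[1] * sub[1]
    return cols, rows, cols * rows


print(pattern_size())  # (633, 495, 313335)
```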
As the dotted light (spot) hits an object, the pattern becomes distorted, and this distortion is analyzed by the depth camera in order to estimate the distance between the sensor and the object itself.
Note
In the previous image, we tested the IR projector against the room's wall. Note that a clear view of the infrared pattern can be obtained only by using an external IR camera (the left-hand side of the previous image). If we take the same picture with the internal RGB camera, the pattern looks distorted even though the beam is not hitting any object (the right-hand side of the previous image).
Depth camera
The depth camera is a traditional monochrome CMOS (complementary metal-oxide-semiconductor) camera fitted with an IR-pass filter, which blocks visible light. The depth camera is the device that Kinect uses for capturing the depth data.
The depth camera is the sensor that returns the 3D coordinates (x, y, z) of the scene as a stream. The sensor captures the structured light emitted by the IR projector and the light reflected from the objects inside the scene. All this data is converted into a stream of frames. Every single frame is processed by the PrimeSense chip, which produces an output stream of frames. The output resolution is up to 640 x 480 pixels. Each pixel, based on 11 bits, can represent 2048 levels of depth.
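The relationship between bits per pixel and depth levels can be checked with a few lines of arithmetic. This is only a sketch: the names are illustrative, and the byte count is the information content of a frame, not a statement about how the device packs data on the wire.

```python
DEPTH_BITS = 11          # bits per depth pixel, as quoted in the text
WIDTH, HEIGHT = 640, 480  # maximum output resolution


def depth_levels(bits=DEPTH_BITS):
    """Number of distinct values an unsigned field of `bits` bits can hold."""
    return 2 ** bits


def frame_payload_bytes(width=WIDTH, height=HEIGHT, bits=DEPTH_BITS):
    """Information content of one frame in bytes (not the transfer format)."""
    return width * height * bits // 8


print(depth_levels())         # 2048
print(frame_payload_bytes())  # 422400
```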
The following table lists the distance ranges:
Mode | Physical limits | Practical limits |
---|---|---|
Near | 0.4 to 3 m (1.3 to 9.8 ft) | 0.8 to 2.5 m (2.6 to 8.2 ft) |
Normal | 0.8 to 4 m (2.6 to 13.1 ft) | 1.2 to 3.5 m (4 to 11.5 ft) |
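The distance ranges listed above can be encoded directly in an application, for instance to warn users who stand too close or too far away. The following is a minimal sketch; the dictionary layout and the function name `classify_distance` are hypothetical and are not part of the SDK.

```python
# Ranges in meters, copied from the table above.
RANGES_M = {
    "near": {"physical": (0.4, 3.0), "practical": (0.8, 2.5)},
    "normal": {"physical": (0.8, 4.0), "practical": (1.2, 3.5)},
}


def classify_distance(distance_m, mode="normal"):
    """Return 'practical', 'physical', or 'out_of_range' for a distance."""
    limits = RANGES_M[mode]
    low, high = limits["practical"]
    if low <= distance_m <= high:
        return "practical"
    low, high = limits["physical"]
    if low <= distance_m <= high:
        return "physical"
    return "out_of_range"


print(classify_distance(2.0))          # practical
print(classify_distance(0.5, "near"))  # physical
```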
Note
The sensor doesn't work correctly in an environment affected by sunlight, reflective surfaces, or interference from light of a similar wavelength (around 830 nm).
The following figure is composed of two frames extracted from the depth image stream. The one on the left represents a scene without any interference. The one on the right shows how interference can reduce the quality of the scene: in this frame, we introduced an infrared source that overlaps the Kinect's infrared pattern.
The RGB camera
The RGB camera is similar to a common color webcam but, unlike a common webcam, it has no IR-cut filter. Therefore, in the RGB camera the IR light reaches the CMOS sensor. The camera supports a resolution of up to 1280 x 960 pixels at 12 images per second. We can reach a frame rate of 30 images per second at a resolution of 640 x 480 with 8 bits per channel, producing a Bayer filter output with a RGGBD pattern. This camera is also able to perform color flicker avoidance, color saturation operations, and automatic white balancing. This data is used to obtain the details of people and objects inside the scene.
The following monochromatic figure shows the infrared frame captured by the RGB camera:
Note
To obtain high-quality IR images we need dim lighting, whereas to obtain high-quality color images we need external light sources. It is therefore important to balance these two factors to make the best use of the Kinect sensors.
Tilt motor and three-axis accelerometer
The Kinect cameras have a horizontal field of view of 57.5 degrees and a vertical field of view of 43.5 degrees. It is possible to increase the interaction space by adjusting the vertical tilt of the sensor by +27 and -27 degrees. The tilt motor can shift the Kinect head's angle upwards or downwards.
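To get a feel for what these angles mean in a room, we can compute the width and height of the area seen at a given distance with basic trigonometry. This is a geometric sketch only, assuming an ideal pinhole-style field of view; the function names are ours.

```python
import math

H_FOV_DEG = 57.5   # horizontal field of view, from the text
V_FOV_DEG = 43.5   # vertical field of view, from the text
TILT_DEG = 27.0    # maximum tilt up or down, from the text


def coverage(distance_m, fov_deg):
    """Extent (in meters) covered at `distance_m` for a field of view."""
    return 2 * distance_m * math.tan(math.radians(fov_deg) / 2)


def vertical_reach_deg(v_fov=V_FOV_DEG, tilt=TILT_DEG):
    """Total vertical angle reachable by sweeping the motor end to end."""
    return v_fov + 2 * tilt


width = coverage(2.0, H_FOV_DEG)    # roughly 2.2 m at a distance of 2 m
height = coverage(2.0, V_FOV_DEG)   # roughly 1.6 m at a distance of 2 m
print(round(width, 2), round(height, 2), vertical_reach_deg())
```

Sweeping the tilt motor does not widen a single frame; it only changes where the 43.5-degree window points, which is why the total reach is computed separately.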
The Kinect also contains a three-axis accelerometer configured for a 2g range (g is the acceleration due to gravity) with a 1 to 3 degree accuracy. By reading the accelerometer data, it is possible to know the orientation of the device with respect to gravity.
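The orientation with respect to gravity can be derived as the angle between the measured acceleration vector and the vector we expect when the device is level. The sketch below assumes readings expressed in g and assumes that a level sensor at rest reads (0, -1, 0); both the axis convention and the function name are assumptions for illustration, not taken from the SDK.

```python
import math


def tilt_from_gravity(x, y, z, level=(0.0, -1.0, 0.0)):
    """Angle in degrees between a reading and the assumed level reading."""
    norm = math.sqrt(x * x + y * y + z * z)
    level_norm = math.sqrt(sum(c * c for c in level))
    if norm == 0 or level_norm == 0:
        raise ValueError("acceleration vectors must be non-zero")
    cos_angle = (x * level[0] + y * level[1] + z * level[2]) / (norm * level_norm)
    # Guard against rounding pushing the cosine slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# A device pitched by 27 degrees sees gravity split between two axes.
reading = (0.0, -math.cos(math.radians(27)), math.sin(math.radians(27)))
print(round(tilt_from_gravity(*reading), 3))
```

The result is unsigned: it tells us how far the device is from level, not in which direction it leans.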
The following figure shows how the field of view angle can be changed when the motor is tilted:
Microphone array
The microphone array consists of four microphones that are arranged in a linear pattern in the bottom part of the device, with a 24-bit Analog to Digital Converter (ADC). The captured audio is encoded using Pulse Code Modulation (PCM) with a sampling rate of 16 kHz and a 16-bit depth. The main advantages of this multi-microphone configuration are enhanced Noise Suppression, Acoustic Echo Cancellation (AEC), and the capability to determine the location and the direction of an audio source through a beam-forming technique.
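From the sampling rate and bit depth we can estimate the size of the PCM data an application has to buffer. This is a back-of-the-envelope sketch; the number of channels actually delivered to the application is an assumption passed in as a parameter.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz, from the text
BITS_PER_SAMPLE = 16     # 16-bit depth, from the text


def pcm_bytes_per_second(channels=1):
    """Uncompressed PCM data rate in bytes per second."""
    return SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * channels


print(pcm_bytes_per_second())   # 32000
print(pcm_bytes_per_second(4))  # 128000
```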
Software architecture
In this section we review the software architecture of the SDK. The SDK is a composite set of software libraries and tools that help us use the Kinect-based natural input. The Kinect senses and reacts to real-world events such as audio and visual tracking. The Kinect and its software libraries interact with our application via the NUI libraries, as detailed in the following figure:
The following software architecture diagram shows the structural elements and interfaces that make up the Kinect for Windows SDK 1.6, together with the way those elements collaborate:
The following list provides the details for the information shown in the preceding figure:
Kinect sensor: The hardware components as detailed in the previous paragraph, and the USB hub through which the Kinect sensor is connected to the computer.
Kinect drivers: The Windows drivers for the Kinect, which are installed as part of the SDK setup process. The Kinect drivers are located in the %Windows%\System32\DriverStore\FileRepository directory and include the following files:
kinectaudio.inf_arch_uniqueGUID
kinectaudioarray.inf_arch_uniqueGUID
kinectcamera.inf_arch_uniqueGUID
kinectdevice.inf_arch_uniqueGUID
kinectsecurity.inf_arch_uniqueGUID
These files expose each of the Kinect's capabilities. The Kinect drivers support the following features:
The Kinect microphone array as a kernel-mode audio device that you can access through the standard audio APIs in Windows
Audio and video streaming controls for streaming audio and video (color, depth, and skeleton)
Device enumeration functions that enable an application to use more than one Kinect
Audio and video components defined by NUI APIs for skeleton tracking, audio, and color and depth imaging. You can review the NUI API header files in the %ProgramFiles%\Microsoft SDKs\Kinect\v1.6 folder as follows:
NuiApi.h: This aggregates all the NUI API headers
NuiImageCamera.h: This defines the APIs for the NUI image and camera services
NuiSensor.h: This contains the definitions for interfaces such as the audiobeam, the audioarray, and the accelerometer
NuiSkeleton.h: This defines the APIs for the NUI skeleton
DirectX Media Object (DMO) for microphone array beam-forming and audio source localization. The format of the data used in input and output by a stream in a DirectX DMO is defined by the Microsoft.Kinect.DMO_MEDIA_TYPE and the Microsoft.Kinect.DMO_OUTPUT_DATA_BUFFER structs. The default facade Microsoft.Kinect.DmoAudioWrapper creates a DMO object using a registered COM server, and calls the native DirectX DMO layer directly.
Windows 7 standard APIs: The audio, speech, and media APIs in Windows 7, as described in the Windows 7 SDK and the Microsoft Speech SDK (Microsoft.Speech, System.Media, and so on). These APIs are also available to desktop applications in Windows 8.
Video stream
The stream of color image data is handled by the Microsoft.Kinect.ColorImageFrame class. A single frame is then composed of color image data. This data is available in different resolutions and formats. You may use only one resolution and one format at a time.
The following table lists all the available resolutions and formats managed by the Microsoft.Kinect.ColorImageFormat enumeration:
Color image format | Resolution | FPS | Data |
---|---|---|---|
 | 640 x 480 | 30 | Pixel format is gray16 |
 | 1280 x 960 | 12 | Bayer data |
 | 640 x 480 | 30 | Bayer data |
 | 640 x 480 | 15 | Raw YUV |
 | 1280 x 960 | 12 | RGB (X8R8G8B8) |
 | 640 x 480 | 15 | Raw YUV |
 | N/A | N/A | N/A |
Note
When we use the InfraredResolution640x480Fps30 format, two bytes in the byte array returned for each frame make up one single pixel value. The bytes are in little-endian order, so for the first pixel, the first byte is the least significant byte (with the least significant 6 bits of this byte always set to zero), and the second byte is the most significant byte.
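The byte layout described above can be decoded as in the following minimal sketch. It works on any bytes-like buffer and does not call the SDK; the function names are ours.

```python
def decode_infrared(frame_bytes):
    """Turn a little-endian byte buffer into a list of 16-bit pixel values."""
    if len(frame_bytes) % 2:
        raise ValueError("infrared buffer must hold two bytes per pixel")
    pixels = []
    for i in range(0, len(frame_bytes), 2):
        low = frame_bytes[i]        # least significant byte comes first
        high = frame_bytes[i + 1]   # most significant byte comes second
        pixels.append(low | (high << 8))
    return pixels


def significant_bits(pixel):
    """Drop the 6 low bits that the text says are always zero."""
    return pixel >> 6


raw = bytes([0x40, 0x01, 0xC0, 0xFF])
values = decode_infrared(raw)
print(values)                                 # [320, 65472]
print([significant_bits(v) for v in values])  # [5, 1023]
```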
The X8R8G8B8 format is a 32-bit RGB pixel format, in which 8 bits are reserved for each color.
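The following sketch shows how the three color channels can be extracted from one 32-bit X8R8G8B8 value. The bit positions follow from the format name (X in the top byte, then R, G, and B); reading the four bytes as a little-endian integer is an assumption that holds on typical PC hardware.

```python
def pixel_from_bytes(four_bytes):
    """Assemble a 32-bit pixel from 4 bytes stored in little-endian order."""
    if len(four_bytes) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(four_bytes, "little")


def unpack_x8r8g8b8(pixel):
    """Return (red, green, blue); the unused X byte is ignored."""
    red = (pixel >> 16) & 0xFF
    green = (pixel >> 8) & 0xFF
    blue = pixel & 0xFF
    return red, green, blue


pixel = pixel_from_bytes(bytes([0x40, 0x80, 0xFF, 0x00]))
print(unpack_x8r8g8b8(pixel))  # (255, 128, 64)
```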
Raw YUV is a 16-bit pixel format. While using this format, we can notice the video data has a constant bit rate, because each frame is exactly the same size in bytes.
In case we need to increase the quality of the default conversion done by the SDK from Bayer to RGB, we can utilize the Bayer data provided by the Kinect and apply a customized conversion optimized for our central processing units (CPUs) or graphics processing units (GPUs).
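To illustrate what a customized conversion might look like, here is a deliberately simple demosaicing sketch. It is not the SDK's algorithm: it assumes each 2 x 2 tile is laid out as R G on the first row and G B on the second, and it collapses every tile into one RGB pixel, halving the resolution. A real conversion would interpolate at full resolution and would typically run on vectorized CPU or GPU code.

```python
def demosaic_half(bayer, width, height):
    """Convert a flat row-major Bayer buffer into rows of (r, g, b) tuples.

    Assumes an R G / G B tile layout and even dimensions. The output has
    half the width and half the height of the input.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    if len(bayer) != width * height:
        raise ValueError("buffer size does not match the dimensions")
    rows = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            red = bayer[y * width + x]
            green_a = bayer[y * width + x + 1]
            green_b = bayer[(y + 1) * width + x]
            blue = bayer[(y + 1) * width + x + 1]
            # Average the two green samples of the tile.
            row.append((red, (green_a + green_b) // 2, blue))
        rows.append(row)
    return rows


print(demosaic_half([10, 20, 30, 40], 2, 2))  # [[(10, 25, 40)]]
```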
Note
Due to the limited transfer rate of USB 2.0, in order to handle 30 FPS, the images captured by the sensor are compressed and converted into RGB format. The conversion takes place before the image is processed by the Kinect runtime. This affects the quality of the images themselves.
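A rough calculation shows why bandwidth matters here. The sketch below compares the raw data rate of a few stream configurations against the nominal 480 Mbit/s signalling rate of USB 2.0; real throughput is lower than the nominal rate, and the bus is shared with the other streams. The helper name is ours.

```python
USB2_NOMINAL_MBIT_S = 480  # nominal USB 2.0 high-speed signalling rate


def stream_mbit_per_second(width, height, bits_per_pixel, fps):
    """Uncompressed data rate of a video stream in megabits per second."""
    return width * height * bits_per_pixel * fps / 1_000_000


rgb_vga = stream_mbit_per_second(640, 480, 32, 30)     # 32-bit RGB
bayer_vga = stream_mbit_per_second(640, 480, 8, 30)    # 8-bit Bayer
rgb_large = stream_mbit_per_second(1280, 960, 32, 12)  # 32-bit RGB

for name, rate in [("RGB 640x480@30", rgb_vga),
                   ("Bayer 640x480@30", bayer_vga),
                   ("RGB 1280x960@12", rgb_large)]:
    share = rate / USB2_NOMINAL_MBIT_S
    print(f"{name}: {rate:.1f} Mbit/s ({share:.0%} of nominal USB 2.0)")
```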
In the SDK 1.6 we can customize the camera settings to optimize and adapt the color camera to our environment (when we need to work in a low light or a brightly lit scenario, adapt contrast, and so on). In managed code, the Microsoft.Kinect.ColorCameraSettings class exposes all the settings we want to adjust and customize.
Note
In native code we have to use the Microsoft.Kinect.Interop.INuiColorCameraSettings interface instead.
In order to improve the external camera calibration we can use the IR stream to test the pattern observed from both the RGB and IR camera. This enables us to have a more accurate mapping of coordinates from one camera space to another.
Depth stream
The data provided by the depth stream is useful in motion control computing for tracking a person's motion as well as identifying background objects to ignore.
The depth stream is a stream of data in which each pixel of every frame contains the distance (in millimeters) from the camera to the nearest object.
The depth data stream (Microsoft.Kinect.DepthImageStream) exposes two distinct types of data through the Microsoft.Kinect.DepthImageFrame class:
Depth data calculated in millimeters (exposed by the Microsoft.Kinect.DepthImagePixel struct).
Player segmentation data. This data is exposed by the Microsoft.Kinect.DepthImagePixel.PlayerIndex property, identifying the unique player detected in the scene.
The following table defines the characteristics of the depth image frame:
Depth image format | Resolution | Frame rate |
---|---|---|
 | 640 x 480 | 30 FPS |
 | 320 x 240 | 30 FPS |
 | 80 x 60 | 30 FPS |
 | N/A | N/A |
The Kinect runtime processes depth data to identify up to six human figures in a segmentation map. The segmentation map is a bitmap of Microsoft.Kinect.DepthImagePixel values, where the PlayerIndex property identifies the closest person to the camera in the field of view. In order to obtain player segmentation data, we need to enable skeletal stream tracking.
Microsoft.Kinect.DepthImagePixel was introduced in the SDK 1.6 and defines what is called the "Extended Depth Data", or full depth information: each pixel is represented by a 16-bit depth and a 16-bit player index.
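To make the 16-bit depth plus 16-bit player index layout concrete, the sketch below unpacks a raw buffer of such pixels and builds a simple mask for one player. The field order (player index first, then depth) and the little-endian signed encoding are assumptions made for illustration; check the SDK headers before relying on them.

```python
import struct

# Assumed layout of one pixel: 16-bit player index, then 16-bit depth (mm).
PIXEL = struct.Struct("<hh")


def unpack_pixels(buffer):
    """Return a list of (player_index, depth_mm) tuples from a raw buffer."""
    if len(buffer) % PIXEL.size:
        raise ValueError("buffer must hold 4 bytes per pixel")
    return [PIXEL.unpack_from(buffer, offset)
            for offset in range(0, len(buffer), PIXEL.size)]


def player_mask(pixels, player_index):
    """True for every pixel that belongs to the given player."""
    return [index == player_index for index, _depth in pixels]


raw = PIXEL.pack(1, 1200) + PIXEL.pack(0, 3000)
pixels = unpack_pixels(raw)
print(pixels)                  # [(1, 1200), (0, 3000)]
print(player_mask(pixels, 1))  # [True, False]
```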
Note
Note that the sensor is not capable of capturing infrared streams and color streams simultaneously. However, you can capture infrared and depth streams simultaneously.
Audio stream
Thanks to the microphone array, the Kinect provides an audio stream that we can control and manage in our application for audio tracking, voice recognition, high-quality audio capturing, and other interesting scenarios.
By default, Kinect tracks the loudest audio input. That said, we can also steer the microphone array programmatically (towards a given location, following a tracked skeleton, and so on).
DirectX Media Object (DMO) is the building block used by Kinect for processing audio streams.
Note
In native scenarios, in addition to the DirectX Media Object (DMO), we can also use the Windows Audio Session API (WASAPI).
In managed applications, the Microsoft.Kinect.KinectAudioSource class (exposed by the KinectSensor.AudioSource property) is the key software architecture component for the audio stream. Through Microsoft.Kinect.INativeAudioWrapper, it wraps the DirectX Media Object (DMO), which is a common Windows component for a single-channel microphone.
The KinectAudioSource class does not merely wrap the DMO; it also introduces additional abilities such as:
The _MIC_ARRAY_MODE as an additional microphone mode to support the Kinect microphone array.
Beam-forming and source localization.
The _AEC_SYSTEM_MODE Acoustic Echo Cancellation (AEC). The SDK supports mono sound cancellation only.
Note
In order to increase the quality of the sound, audio inputs coming from the sensor get up to 20 dB of suppression. The microphone array allows an optional additional 6 dB of ambient noise removal for audio coming from behind the sensor.
The audio input has a range of +/- 50 degrees (as visualized in the preceding figure) in front of the sensor. We can point the audio direction programmatically in 10 degree increments in order to focus our attention on a given user or to avoid noise sources.
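When we compute a desired direction ourselves (for example, from a tracked user's position), it has to be mapped onto an angle the array can actually use. The following sketch clamps a requested angle to the +/- 50 degree range and snaps it to a 10 degree step. The function name and the rounding rule (halves round up) are our own choices, not SDK behavior.

```python
import math


def snap_beam_angle(angle_deg, limit=50, step=10):
    """Clamp an angle to [-limit, limit] and snap it to the nearest step."""
    clamped = max(-limit, min(limit, angle_deg))
    # floor(x + 0.5) rounds to the nearest step, with halves rounding up.
    return math.floor(clamped / step + 0.5) * step


print(snap_beam_angle(23))   # 20
print(snap_beam_angle(27))   # 30
print(snap_beam_angle(-72))  # -50
```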
Skeleton
In addition to the data provided by the depth stream, we can use the data provided by skeleton tracking to enhance the motion control capabilities of our applications with regard to recognizing people and following their actions.
We define the skeleton as a set of positioned key points. A detailed skeleton contains 20 points in normal mode and 10 points in seated mode, as shown in the following figure. Every single point of the skeleton highlights a joint of the human body.
Thanks to the depth (IR) camera, Kinect can recognize up to six people in the field of view. Of these, up to two can be tracked in detail.
The stream of skeleton data is maintained by the Microsoft.Kinect.SkeletonStream class and the Microsoft.Kinect.SkeletonFrame class. The skeleton data is exposed for each single point in the 3D space by the Microsoft.Kinect.SkeletonPoint struct. In any single frame handled by the skeleton stream we can manage up to six skeletons using an array of the Microsoft.Kinect.Skeleton class.
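The following sketch models the data described above in a simplified form: a frame holds up to six skeletons, at most two of which are tracked in detail, and every joint is a point in 3D space. The Python classes only mirror the names of the SDK types for readability; their fields, the joint names used in the example, and the assumption that coordinates are in meters are all illustrative.

```python
import math
from dataclasses import dataclass, field

MAX_SKELETONS = 6  # people recognized per frame, from the text
MAX_TRACKED = 2    # people tracked in detail, from the text


@dataclass(frozen=True)
class SkeletonPoint:
    x: float
    y: float
    z: float


@dataclass
class Skeleton:
    tracked: bool
    joints: dict = field(default_factory=dict)  # joint name -> SkeletonPoint


def distance(a, b):
    """Euclidean distance between two skeleton points."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


def tracked_skeletons(frame):
    """Return the skeletons of a frame that are tracked in detail."""
    if len(frame) > MAX_SKELETONS:
        raise ValueError("a frame holds at most six skeletons")
    return [s for s in frame if s.tracked][:MAX_TRACKED]


user = Skeleton(True, {"head": SkeletonPoint(0.0, 0.5, 2.0),
                       "hand": SkeletonPoint(0.3, 0.1, 2.0)})
frame = [user, Skeleton(False), Skeleton(False)]
print(len(tracked_skeletons(frame)))
print(distance(user.joints["head"], user.joints["hand"]))
```

Distances between joints, such as hand to head, are a common building block for simple gesture checks.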