1. Introduction
In recent years, RGB-D sensors have attracted great attention in 3D modeling due to their low cost. Two major concepts, time of flight (ToF) and structured light (SL), are used in RGB-D sensors. Many devices released on the market are based on these concepts; for example, the Kinect v1, the Structure Sensor [1], and the ASUS Xtion Pro Live [2] are based on the SL concept, while the Kinect v2 [3] is based on the ToF concept [4]. An RGB-D sensor consists of three different sensors: an RGB camera, an infrared (IR) camera, and an IR projector. In the Kinect and the ASUS Xtion, all three sensors are mounted in a fixed frame, whereas the Structure Sensor combines only the two IR sensors and is designed to be attached to a portable device that provides the RGB camera.
Although RGB-D sensors were originally designed for gaming purposes such as remote controlling, they have recently made important contributions to surveying and navigation applications, such as building information modeling, indoor navigation, and indoor 3D modeling [5,6]. While the accuracy required for gaming applications is not high, extending the use of RGB-D sensors to surveying-type applications requires accurate calibration of the device's geometric parameters (i.e., camera focal lengths and baselines between cameras) and modeling of the sensor errors (i.e., lens distortions and systematic depth errors) to produce high-quality spatial data and 3D models (e.g., cm-level precision).
The calibration of consumer-grade depth sensors has been widely investigated since the release of the first-generation Kinect in 2010. Various calibration methods, particularly for the depth sensor, have been studied by different research groups [6,7,8,9,10,11]. These calibration methods can be divided into four main categories.
The first category assumes that the three sensors (RGB camera, IR camera, and IR projector) behave similarly and that the pinhole camera model applies to all of them. This approach is typified by [8], which uses the disparity data delivered by the IR camera and projector together with the RGB image in a photogrammetric bundle adjustment to calibrate the internal and external parameters of the three sensors. Conventional distortion models [12,13] are applied to each sensor separately to compensate for its distortion effects. As it is difficult to obtain raw measurements from the IR projector, the author computed such data from the disparity and the approximate baseline between the IR camera and projector. The main limitation of this category is the dependency between the observations from the IR camera and projector and their calibration parameters.
The second category combines the disparity image produced by the depth sensor, the image produced by the RGB camera, and an empirical mathematical model to eliminate the distortions of the IR camera [9]. The distortion model is based on the error behavior of the RGB-D sensor as a whole. Unlike the first category, which is valid for any RGB-D sensor, this one is restricted to Kinect v1. Other limitations include a lack of automation and rigor: the user has to select points in the depth image manually, and the homography matrix is computed using only four corresponding points.
The third category deals with refining the in-factory calibration procedure, in which the manufacturer parameters, including the baseline between the IR camera and projector as well as the standard depth, are precisely calibrated along with the RGB camera. Zhang's method [14] is of this type; the authors used maximum likelihood estimation to calibrate the color camera's internal parameters and the manufacturer parameters of the depth sensor without any distortion effect. The main limitation of this method is that the distortion parameters for both the IR camera and projector are neither estimated nor compensated.
Finally, the fourth category deals mainly with the depth calibration of the RGB-D sensor [15,16]. This method adopts a mathematical error model derived from the observation equation of the depth measurements. It uses the depth output of the sensor together with an adequate experimental setup that provides the true depth of each pixel to model each pixel's depth error, and it is valid for both SL and ToF RGB-D sensors. Although the error model was derived precisely only for the middle part of the depth image, an error model for the whole depth image can be obtained and applied. The error model demonstrated a significant improvement when scanning small objects.
Based on these calibration algorithms, different calibration methods have been implemented and tested. Methods include the use of 1D [17], 2D [11], and 3D [18] calibration objects that work with the depth images directly; calibration of the manufacturer parameters of the IR camera and projector [9,19]; and photogrammetric bundle adjustments used to model the systematic errors of the IR sensors [8,20]. To enhance the depth precision, additional depth error models have been added to the calibration procedure [7,8,21,22,23]. All of these error models compensate only for the distortion effect of the IR projector and camera. Other research works have obtained the relative calibration between an RGB camera and an IR camera by accessing the IR camera directly [24,25,26]. This can achieve relatively high-accuracy calibration parameters for the baseline between the IR and RGB cameras, but the remaining limitation is that the distortion parameters of the IR camera alone cannot represent the full distortion effect of the depth sensor.
Two main issues faced by current RGB-D calibration procedures relate to the depth sensor. The first is the inclusion of the IR projector distortion along with the IR camera in the calibration procedure; the second is the correction of systematic errors resulting from the inaccuracy of the in-factory calibration. This study addressed these issues using a two-step calibration procedure to calibrate all of the geometric parameters of RGB-D sensors. The first step was the joint calibration of the RGB and IR cameras, achieved by adopting the procedure discussed in [27] to compute the external baseline between the cameras and the distortion parameters of the RGB camera. The second step focused on the depth sensor calibration. First, the in-factory calibration parameters were updated to eliminate the systematic error resulting from the baseline between the IR camera and projector. Second, a combined distortion model was used to compensate for the distortion and systematic errors resulting from both the IR camera and projector. The experimental design and results are discussed in comparison with the conventional calibration method, and concluding remarks are made.
2. A Distortion Calibration Model for Depth Sensor
Numerous RGB-D sensors released on the market are based on the structured light concept and consist of an IR camera and an IR projector. In addition to the IR sensors, an optional RGB camera acquires the color information of the observed scene. The IR projector emits a predefined pattern and the IR camera receives it [28]. The depth image is obtained by triangulation based on the baseline between the IR camera and projector.
Figure 1 shows the main elements of two sensors that use the SL concept: Kinect v1 and the Structure Sensor. The main difference between the two sensors is the baseline between the IR camera and projector: the longer the sensor's baseline, the longer the working distance that can be achieved. The working range of Kinect v1 is 0.80 m to 4.0 m, while that of the Structure Sensor is 0.35 m to 3.5 m.
The principle of depth computation for RGB-D sensors is shown in Figure 2, where the two IR sensors are integrated to produce a pixel-shift distance called the disparity. The IR projector pattern on a predefined plane ($Z_0$), used in the in-factory calibration [14], is stored in the sensor firmware. While capturing a real feature ($Q_i$), both the standard projector pattern location ($x_s$) and the real projector pattern location captured by the IR camera ($x_r$) are identified. The difference between the two locations is called the disparity ($d$). Using the disparity value and the predefined configuration information, including the focal length ($f$) of the IR sensors, the baseline between the IR projector and camera ($w$), and the standard depth of the projected pattern ($Z_0$) [5,29], we can compute the depth of the feature point ($Q_i$).
Using triangle similarity, the relationship between the standard pattern location ($x_s$), the real projector pattern location ($x_r$) in the IR camera space, and the displacement ($D$) of the feature point in object space can be written as:

$$\frac{D}{w} = \frac{Z_0 - Z_i}{Z_0} \qquad (1)$$

$$\frac{d}{f} = \frac{D}{Z_i} \qquad (2)$$

Applying the disparity $d = x_s - x_r$, the fundamental equation for computing the depth value of a feature point ($Q_i$) can be written as follows:

$$Z_i = \frac{Z_0}{1 + \dfrac{Z_0}{f\,w}\,d} \qquad (3)$$

As noted previously, the disparity value is measured by the firmware of the RGB-D sensor. However, the value output by the sensor is the normalized disparity ($d_n$), which ranges from 0 to 2047. The sensor uses two linear factors ($m$) and ($n$) to normalize the measured disparity ($d$); the normalized disparity is stated as $d = m\,d_n + n$. By substituting ($d$) in Equation (3) and combining all of the constants into assigned factors $a$ and $b$, Equation (3) becomes:

$$Z_i = \frac{1}{a + b\,d_n} \qquad (4)$$

where $a$ and $b$ are constants:

$$a = \frac{1}{Z_0} + \frac{n}{f\,w}, \qquad b = \frac{m}{f\,w} \qquad (5)$$

The final coordinates $X_i$, $Y_i$, and $Z_i$ of the acquired feature ($Q_i$) are computed from the pixel coordinates ($x_i$, $y_i$) and the principal point ($c_x$, $c_y$) as:

$$X_i = \frac{Z_i}{f}\,(x_i - c_x), \qquad Y_i = \frac{Z_i}{f}\,(y_i - c_y) \qquad (6)$$

The formula presented in Equation (4) with the factors $a$ and $b$ is called the manufacturer's equation or the mapping function, which produces the depth information from the normalized disparity. The $a$ and $b$ coefficients are a function of the design parameters of the RGB-D sensor: the standard plane distance, the baseline, the focal length, and the linear parameters that convert the measured disparity to the normalized disparity.
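To make the mapping function concrete, the following minimal Python sketch evaluates Equation (4). The coefficient values are illustrative only (of the order commonly reported for Kinect-v1-class sensors) and are not calibration results from this study:

```python
# Minimal numeric illustration of the mapping function Z = 1/(a + b*d_n).
# The coefficients are illustrative Kinect-v1-order values, not results
# from this study.
a = 3.33          # absorbs 1/Z0 and n/(f*w), in 1/m
b = -0.00307      # absorbs m/(f*w), in 1/m per normalized-disparity unit

def depth_from_disparity(d_n):
    """Depth in meters from the normalized disparity d_n (0..2047)."""
    return 1.0 / (a + b * d_n)

for d_n in (400, 600, 800, 1000):
    print(f"d_n = {d_n:4d}  ->  Z = {depth_from_disparity(d_n):.3f} m")
```

Note how the same disparity step corresponds to a much larger depth step at long range; this is the source of the quadratic growth of the depth error discussed next.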
The baseline between the IR camera and IR projector has a great influence on the depth precision, which can be seen by applying covariance error propagation to estimate the variance of the depth produced by an SL RGB-D sensor. Using Equation (3) to estimate the depth variance gives $\sigma_Z = \dfrac{Z_i^2}{f\,w}\,\sigma_d$, where $\sigma_Z$ and $\sigma_d$ are the precision of the depth and the disparity, respectively. For a given SL RGB-D sensor, if the baseline ($w$) between the IR camera and projector were doubled, the precision of the depth would improve by 50%, assuming all other variables remain constant.
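This relation follows directly from differentiating Equation (3) with respect to the disparity:

$$\frac{\partial Z_i}{\partial d} = -\frac{Z_0^2}{f\,w}\left(1 + \frac{Z_0}{f\,w}\,d\right)^{-2} = -\frac{Z_i^2}{f\,w} \;\;\Longrightarrow\;\; \sigma_Z = \frac{Z_i^2}{f\,w}\,\sigma_d$$

so the depth uncertainty grows quadratically with range and shrinks linearly with the baseline and the focal length.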
The disparity value can be computed from three constants, $f$, $w$, and $Z_0$, and two measured quantities, $x_s$ and $x_r$. The measured quantities are affected by the distortion of the IR camera and projector, respectively, so the systematic error can be assumed to be a function of the distortion parameters. The general expression that combines the effect of the systematic error and the distortion effect can be written as follows:

$$d^{*} = d + \delta d \qquad (7)$$

where $d^{*}$ is the true disparity, $d$ is the measured disparity, and $\delta d$ represents the error resulting from the effect of lens distortion and systematic error of the IR sensors.
The disparity error can be expressed as a function of the distortion effects of both the IR camera and projector, for both tangential and radial distortions:

$$\delta d = \delta x_t^{c} + \delta x_t^{p} + \delta x_r^{c} + \delta x_r^{p} \qquad (8)$$

where $\delta x_t^{c}$ and $\delta x_t^{p}$ are the tangential distortion effects of the IR camera and projector, respectively, and $\delta x_r^{c}$ and $\delta x_r^{p}$ are the radial distortion effects of the IR camera and projector, respectively.
The radial distortion quantifies the lens quality and is caused by the bending of the ray linking the object, image, and focal points. The Brown model [30] with two factors ($K_1$ and $K_2$) is applied to compensate for the radial distortion. The tangential distortion model describes the distortion resulting from the inaccurate location of the lens with respect to the focal point, and its effect is compensated for using another two factors ($P_1$ and $P_2$) [12,31]:

$$\delta x = \bar{x}\,(K_1 r^2 + K_2 r^4) + P_1\,(r^2 + 2\bar{x}^2) + 2 P_2\,\bar{x}\,\bar{y} \qquad (9)$$

$$\delta y = \bar{y}\,(K_1 r^2 + K_2 r^4) + 2 P_1\,\bar{x}\,\bar{y} + P_2\,(r^2 + 2\bar{y}^2)$$

where $r^2 = \bar{x}^2 + \bar{y}^2$, $P_1$ and $P_2$ are the factors representing the tangential distortion model, $K_1$ and $K_2$ are the factors representing the radial distortion model, and $\bar{x}$ and $\bar{y}$ are the distortion-free coordinates of the image point.
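As a concrete illustration of the Brown model in Equation (9), the distortion offsets can be computed in a few lines; this is a sketch with arbitrary demonstration coefficients, not values calibrated in this study:

```python
def brown_distortion(x, y, K1, K2, P1, P2):
    """Distortion offsets (dx, dy) of Equation (9), Brown convention.

    x and y are distortion-free image coordinates centred on the
    principal point (e.g., in normalized units).
    """
    r2 = x**2 + y**2
    radial = K1 * r2 + K2 * r2**2
    dx = x * radial + P1 * (r2 + 2 * x**2) + 2 * P2 * x * y
    dy = y * radial + 2 * P1 * x * y + P2 * (r2 + 2 * y**2)
    return dx, dy

# Arbitrary demonstration coefficients:
dx, dy = brown_distortion(x=0.2, y=-0.1, K1=-0.05, K2=0.002, P1=1e-3, P2=-5e-4)
print(f"dx = {dx:.6f}, dy = {dy:.6f}")
```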
As the disparity value is computed from the horizontal shift from the projected pattern to the standard one, the relation between the measured point coordinate and the true point coordinate can be expressed as follows:

$$x = \bar{x} + \delta x \qquad (10)$$

where $x$ is the measured x coordinate for the sensor and $\bar{x}$ is the true x coordinate for the sensor.
Inserting Equation (9) into Equation (8), the full disparity error model considering both the IR camera and projector distortions can be expressed in Equation (11) with eight parameters:

$$\begin{aligned} \delta d ={} & \left[P_1^{c}\,(r_c^2 + 2x_c^2) + 2P_2^{c}\,x_c y_c\right] + \left[P_1^{p}\,(r_p^2 + 2x_p^2) + 2P_2^{p}\,x_p y_p\right] \\ & + x_c\,(K_1^{c} r_c^2 + K_2^{c} r_c^4) + x_p\,(K_1^{p} r_p^2 + K_2^{p} r_p^4) \end{aligned} \qquad (11)$$

where the superscript $p$ refers to the IR projector and $c$ refers to the IR camera.
To further simplify Equation (11), we used two parameters ($W_1$ and $W_2$) to describe the tangential distortion by correcting the relative orientation between the IR camera and projector lenses, and applied another two parameters ($W_3$ and $W_4$) to describe the combined radial distortion of both the IR camera and the projector, which can be considered as one lens combining the overlaid IR camera and projector lenses. As the relative orientation between the camera and projector is rigidly fixed and pre-calibrated (mapping function calibration), the y axes of the IR camera and IR projector can be assumed to be identical. The third and fourth terms of Equation (11) represent the radial distortion in the x direction resulting from the IR camera and projector. Because the projector data are unknown, we used the gross combined radial distortion, known as Seidel aberrations [12], together with the IR camera's pixel location to assign the x distortion effect. This gives the two constraints shown in Equation (12):

$$y_p = y_c = y, \qquad x_p \approx x_c = x \qquad (12)$$
Equation (11) can then be simplified as:

$$\delta d = W_1\,(r^2 + 2x^2) + 2W_2\,x y + W_3\,x r^2 + W_4\,x r^4 \qquad (13)$$

Equation (13) is the fundamental equation that describes the depth sensor distortion with only four parameters. As $d^{*} = d + \delta d$ (Equation (7)) and $Z_i = 1/(a + b\,d_n)$ (Equation (4)), we have:

$$\frac{1}{Z_i^{*}} = a + b\,(d_n + \delta d) \qquad (14)$$

Finally, the full distortion model for the RGB-D sensor can be given as:

$$\frac{1}{Z_i^{*}} = \frac{1}{Z_i} + b\left(W_1\,(r^2 + 2x^2) + 2W_2\,x y + W_3\,x r^2 + W_4\,x r^4\right) \qquad (15)$$
In Equation (15), the distortion model uses four parameters that account for both the radial and tangential distortions of both the IR camera and projector lenses.
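To show how the four-parameter model would be used in practice, the sketch below applies Equation (15) to a measured depth map. The W coefficients and the mapping slope b are placeholders, and the pixel coordinates are centred on the principal point as in the derivation above:

```python
import numpy as np

def correct_depth(Z, W, b, cx, cy):
    """Apply the four-parameter disparity distortion model (Equation (15)).

    Z  : (H, W) measured depth map in meters.
    W  : (W1, W2, W3, W4) distortion parameters (placeholders here).
    b  : slope of the mapping function in Equation (4).
    cx, cy : principal point in pixels.
    """
    W1, W2, W3, W4 = W
    h, w = Z.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs - cx                      # principal-point-centred coordinates
    y = ys - cy
    r2 = x**2 + y**2
    delta_d = W1 * (r2 + 2 * x**2) + 2 * W2 * x * y + W3 * x * r2 + W4 * x * r2**2
    return 1.0 / (1.0 / Z + b * delta_d)

# Synthetic example: a flat 2 m scene with placeholder coefficients.
Z = np.full((480, 640), 2.0)
Zc = correct_depth(Z, W=(1e-9, 1e-9, 1e-12, 1e-18), b=-0.003, cx=319.5, cy=239.5)
print(Zc[240, 320], Zc[0, 0])        # centre vs. corner after correction
```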
5. Calibration Results
The phase 1 data for each sensor were processed to compute the calibrated baseline between the RGB and IR cameras, while the phase 2 data were processed to calibrate the depth sensor.
Table 2 and Table 3 present the phase 1 results for sensors 1 and 2. The output data are the internal parameters of the RGB and IR cameras, represented as the camera focal lengths (Fx, Fy) in pixels, the principal point (Cx, Cy) in pixels, and a five-element distortion vector (K1, K2, P1, P2, and K3), where the K terms absorb the radial distortion effect and the P terms eliminate the tangential distortion effect. The IR-RGB camera baseline is expressed in six parameters: three translations (dx, dy, and dz) in mm and three rotation Euler angles (Rx, Ry, and Rz) in radians.
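For reference, the (Fx, Fy, Cx, Cy) and (K1, K2, P1, P2, K3) parameterization in Tables 2 and 3 appears to match the common camera-matrix/distortion-vector ordering used by standard undistortion routines, so calibrated values of this form can be applied directly. A minimal sketch with made-up example values (not the values from the tables):

```python
import numpy as np
import cv2

# Made-up example values in the same form as Tables 2 and 3.
Fx, Fy, Cx, Cy = 575.8, 575.8, 319.5, 239.5          # pixels
K1, K2, P1, P2, K3 = -0.08, 0.05, 1e-4, -2e-4, 0.0   # distortion vector

camera_matrix = np.array([[Fx, 0, Cx],
                          [0, Fy, Cy],
                          [0,  0,  1]], dtype=np.float64)
dist_coeffs = np.array([K1, K2, P1, P2, K3], dtype=np.float64)

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # stand-in for an RGB frame
undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
```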
Table 4 shows the default parameters used by the firmware of the Structure Sensor, stated as the internal parameters for the depth sensor and color camera. The focal length and principal point of both sensors are the same, while the distortion parameters of both the RGB and IR cameras are set to zero.
After calibrating the baseline between the RGB and IR cameras and the internal parameters of the RGB camera, the phase 2 data were processed to calibrate the depth sensor. Two steps, the mapping function calibration and the distortion model estimation, were conducted to deliver high-precision depth measurements from the sensor.
Table 5 shows the results of the mapping function calibration; a and b are the mapping parameters defined in Equation (4).
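Because Equation (4) makes the inverse depth linear in the normalized disparity, a and b can be estimated by an ordinary least-squares line fit of 1/Z against d_n. Below is a minimal sketch of this fitting step, assuming paired reference depths (e.g., from a total station survey) and recorded disparities; the synthetic coefficients are placeholders:

```python
import numpy as np

def fit_mapping_function(d_n, Z_ref):
    """Least-squares estimate of (a, b) in 1/Z = a + b*d_n (Equation (4))."""
    A = np.column_stack([np.ones_like(d_n), d_n])
    (a, b), *_ = np.linalg.lstsq(A, 1.0 / Z_ref, rcond=None)
    return a, b

# Synthetic self-check: recover known placeholder coefficients from noisy data.
rng = np.random.default_rng(0)
d_n = rng.uniform(400.0, 1000.0, size=200)
Z = 1.0 / (3.33 - 0.00307 * d_n) + rng.normal(0.0, 1e-3, size=200)
a, b = fit_mapping_function(d_n, Z)
print(f"a = {a:.4f}, b = {b:.6f}")   # close to 3.33 and -0.00307
```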
After computing the calibrated mapping function, we modified the measured depth and disparity information to correct the systematic error resulting from the mapping function error, then continued to compute the distortion model parameters.
Figure 5 shows the distortion model parameter sets for both Structure Sensors and conveys the main conclusion of the calibration procedure: although the distortion parameters (W1, W2, W3, and W4) change with the measured distance, beyond 2.50 m each distortion model parameter converges to a constant value. This means that, to obtain the full set of calibration parameters, it is sufficient to collect depth data with corresponding disparity images up to 2.50 m.
6. Accuracy Assessment of the Calibration Models
To examine our calibration methodology as well as the effect of the distortion model on the depth accuracy, we captured a new dataset for each sensor and applied the calibration results. To examine the IR-RGB baseline calibration, two images (depth and RGB) were collected and aligned using the calibrated parameters.
Figure 6 shows the effect of the calibration parameters.
To examine the performance of the depth calibration parameters, including the calibration of the mapping function and the distortion model, we collected several depth images of a planar surface and applied our calibration parameters. The examination criterion was based on the same procedure illustrated in [8,9]: compared with the reference planes surveyed by the total station, the RMSE of the fitted plane surface was used to describe the measured depth precision.
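The plane-based criterion can be reproduced as follows: fit a best-fit plane to the point cloud of the captured planar surface and report the RMSE of the orthogonal residuals. A minimal sketch on synthetic data:

```python
import numpy as np

def plane_fit_rmse(points):
    """RMSE of orthogonal residuals to the best-fit plane of an (N, 3) cloud."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    residuals = (points - centroid) @ vt[-1]
    return np.sqrt(np.mean(residuals**2))

# Synthetic wall at Z = 2 m with 5 mm Gaussian depth noise:
rng = np.random.default_rng(1)
pts = np.column_stack([rng.uniform(-1.0, 1.0, 5000),
                       rng.uniform(-1.0, 1.0, 5000),
                       2.0 + rng.normal(0.0, 0.005, 5000)])
print(f"plane RMSE = {plane_fit_rmse(pts) * 1000:.2f} mm")
```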
Figure 7 and Figure 8 show the depth precision performance evaluation for sensors 1 and 2, respectively. The left side presents the full-range performance, and the right side zooms in to display the near range.
In addition to the planar surface assessment, two perpendicular planar surfaces (part of a wall and the ceiling) were captured using one of the calibrated sensors; the average distance between the sensor and the observed planes was 2.00 m (minimum 1.20 m, maximum 3.00 m). The data were processed using the default parameters provided by the manufacturer and processed again after applying our distortion model.
Figure 9 shows the point cloud produced by the RGB-D sensor before and after applying our distortion model. It is clearly seen that the warp in the wall was removed after calibration, and the distortion at the corners of the depth image was significantly compensated. A comparison between the computed angles before and after applying our distortion model is shown in Table 6. Using different thresholds for RANSAC to extract the planes, the recovered average angle using our method is 89.897° ± 0.37° compared with 90.812° ± 7.17° using the default depth.
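The angle check in Table 6 reduces to measuring the angle between the normals of the two extracted planes (however they are segmented, e.g., by RANSAC). A minimal sketch of that final step, with illustrative normals:

```python
import numpy as np

def angle_between_planes(n1, n2):
    """Angle in degrees between two plane normals, as oriented."""
    c = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

# Illustrative normals of a wall and a slightly tilted ceiling:
n_wall = np.array([0.0, 0.0, 1.0])
n_ceiling = np.array([0.0, 1.0, 0.02])
print(f"{angle_between_planes(n_wall, n_ceiling):.3f} deg")   # ~88.85 deg
```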
Comparing our results with those given in [9], our calibration method achieved nearly similar accuracy in the near range and a significant improvement in accuracy in the far range. Moreover, our calibration method is simpler than the method given in [9], and its error model is based on a mathematical model of the lens distortion effect.