**1. Introduction**

The idea and use of 3D imaging date back to the 19th century, and laser scanning to the 1960s [1], but only recently have these technologies become capable of revolutionising the interaction between robots, the environment and humans. Advances in computational power, sensor precision and affordability have made this possible [2–4].

The recent development of RGB-D cameras has provided visual sensor devices capable of generating pixel-wise depth information together with a colour image. The technology behind these cameras has been constantly improving, with developers working to reduce noise and increase precision, e.g., the Microsoft Azure Kinect and the Intel RealSense L515 and D455 [3].

The depth information from these devices can be used to generate a three-dimensional projection of the captured object. The well-known pinhole camera model [5] is key to understanding how this reprojection works and how it is affected by noise in the depth data.

The model describes the transformation from the 3D world to the 2D image plane, as shown in Figure 1 and Equation (1) [6]. It can also be used to calculate the inverse path for reprojecting from 2D to 3D.

Equation (1) has two matrices: one for the intrinsic parameters and another for the extrinsic parameters. The first contains the camera's internal parameters, which are constant for each camera. The second describes where the camera is in the world, i.e., the pose of the camera in relation to an origin coordinate system.

**Figure 1.** Pinhole camera's projective geometry.

The direction vector of the ray from the camera projection centre can be found using these parameters, but the length of the vector cannot: this information is lost in the conversion from 3D to 2D. However, when using an RGB-D camera, this information is preserved as a depth value that determines where the point lies in the world. The set of points reprojected from these data is called a point cloud.

Other parameters that are important for reprojection are the distortion coefficients, which are used to correct the radial and tangential distortions of the lens [7]. In this work, the FRAMOS Industrial Depth Camera D415e, built with Intel® RealSense™ technology, was used. According to the manufacturer, the Intel® RealSense module has no lens distortion [8,9].

$$
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic}} \underbrace{\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}}_{\text{extrinsic}} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{1}
$$

where (*u*, *v*) are the pixel coordinates and *s* is the projective scale factor.
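To make the mapping concrete, Equation (1) can be applied directly. The sketch below projects a world point to pixel coordinates; the intrinsic values and camera pose are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical intrinsic parameters (focal lengths and optical centre, in pixels)
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsic parameters [R | t]: identity rotation, camera shifted 0.5 m along z
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.5]])])

# A 3D point in the world frame, in homogeneous coordinates
P_world = np.array([0.1, -0.2, 1.0, 1.0])

# Equation (1): s * [u, v, 1]^T = K [R | t] P
p = K @ Rt @ P_world
u, v = p[0] / p[2], p[1] / p[2]  # divide by the scale factor s to recover pixels
print(u, v)
```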

The intrinsic parameters of a camera are normally represented as a 3 × 3 matrix, as shown in Equation (2).

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

where *fx* and *fy* are the focal lengths in the *x* and *y* directions, respectively, and *cx* and *cy* are the coordinates of the optical centre of the image plane, shown as a solid red line in Figure 1.

As illustrated in Figure 1, the focal length is the distance from the camera lens to the image plane. Since pixels are not necessarily square, the focal length is divided by the pixel size in *x* and in *y*, yielding *fx* and *fy*, respectively, expressed in pixels.
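For instance, assuming a hypothetical lens with focal length *f* = 1.93 mm and square 3 µm pixels:

$$f_x = f_y = \frac{1.93\ \text{mm}}{3\ \text{µm}} \approx 643\ \text{pixels}$$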

The extrinsic parameters are typically represented by a homogeneous transformation matrix, shown in Equation (3) and explained in detail by Briot and Khalil (2015) [10]. This matrix contains the rotation matrix $R_{3 \times 3}$ and the translation vector $T_{3 \times 1}$, representing the camera's transformation in relation to the origin of the reference coordinate system in the desired scene.

$$D = \begin{bmatrix} R_{3 \times 3} & T_{3 \times 1} \\ 0_{1 \times 3} & 1 \end{bmatrix} \tag{3}$$
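As a minimal sketch, the matrix in Equation (3) can be assembled from a rotation matrix and a translation vector; the values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def homogeneous_transform(R, T):
    """Assemble the 4x4 matrix of Equation (3) from R (3x3) and T (3,)."""
    D = np.eye(4)
    D[:3, :3] = R
    D[:3, 3] = T
    return D

# Example: 90-degree rotation about the z-axis, translation of 1 m along x
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
D = homogeneous_transform(R, np.array([1.0, 0.0, 0.0]))
```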

The distortion coefficients are the parameters that describe the radial and tangential distortions, represented as *kn* and *pn*, respectively. The most notable distortion model is the Brown–Conrady model [11].
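For illustration, a sketch of the Brown–Conrady model applied to normalised image coordinates is given below; the coefficient values (*k1*, *k2*, *k3*, *p1*, *p2*) come from calibration and are not specified here.

```python
def brown_conrady(x, y, k1, k2, k3, p1, p2):
    """Apply radial (k) and tangential (p) distortion to normalised coordinates."""
    r2 = x * x + y * y                                  # squared radial distance
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3    # radial term
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d
```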

The term calibration normally refers to methods of estimating the intrinsic parameters, distortion coefficients and extrinsic parameters.

Quaternions are another way to express rotations in three-dimensional space. They have a compact representation and several mathematical advantages [12,13]. They are commonly used in the robotics industry because they are numerically stable and avoid the gimbal lock phenomenon, in which two axes align and rotation about one dimension is lost.

The robotic arm used in this work, the ABB IRB 4600, uses quaternions to express the orientation of its tool centre point (TCP). Equation (4) [14] shows the conversion from quaternions to the Euler angles used in the homogeneous transformation matrix employed in this work.

$$\begin{bmatrix} \phi \\ \theta \\ \psi \end{bmatrix} = \begin{bmatrix} \operatorname{atan2}(2(q_0q_1 + q_2q_3),\ 1 - 2(q_1^2 + q_2^2)) \\ \operatorname{asin}(2(q_0q_2 - q_3q_1)) \\ \operatorname{atan2}(2(q_0q_3 + q_1q_2),\ 1 - 2(q_2^2 + q_3^2)) \end{bmatrix} \tag{4}$$
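A direct transcription of Equation (4) in Python, assuming the scalar-first quaternion ordering used in the equation (*q0* is the scalar part):

```python
import numpy as np

def quaternion_to_euler(q0, q1, q2, q3):
    """Equation (4): convert a unit quaternion (scalar-first) to roll, pitch, yaw."""
    phi = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    theta = np.arcsin(2.0 * (q0 * q2 - q3 * q1))
    psi = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    return phi, theta, psi
```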

#### *1.1. Point Cloud*

With no distortion, reprojecting points into 3D using the intrinsic parameters of the pinhole model shown in Equation (1) is carried out according to Equations (5)–(7). For an RGB-D device, *Z* is the depth value extracted from the depth frame, *x* is the column index and *y* is the row index. In these equations, a point is a 3D structure with (*x*, *y*, *z*) data representing a point in the camera's coordinate frame.

$$\text{point}.z = Z \tag{5}$$

$$\text{point}.x = ((x - c_x)/f_x) \cdot Z \tag{6}$$

$$\text{point}.y = ((y - c_y)/f_y) \cdot Z \tag{7}$$
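Applied to every pixel of a depth frame, Equations (5)–(7) vectorise naturally; the sketch below uses hypothetical intrinsic values and a synthetic depth frame.

```python
import numpy as np

def deproject(depth, fx, fy, cx, cy):
    """Reproject a depth frame (H x W, metres) to an N x 3 point cloud, Eqs. (5)-(7)."""
    h, w = depth.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))  # column and row indices
    Z = depth                                       # Equation (5)
    X = (x - cx) / fx * Z                           # Equation (6)
    Y = (y - cy) / fy * Z                           # Equation (7)
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Example with a flat 1 m depth frame and hypothetical intrinsics
cloud = deproject(np.full((480, 640), 1.0), fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```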

The origin of the coordinate system for each point cloud is one sensor of the camera, in this case, the left infrared sensor. To reconstruct the scene from multiple cameras or perspectives, each camera's position in the scene, i.e., in the global coordinate system, must be known.

The global coordinate system's origin is chosen according to the task. It can be the optical centre of one of the camera's sensors, for example. In this work, the chosen origin was the base coordinate system of an ABB IRB 4600 robot [15].
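Given a camera's pose expressed as the matrix *D* of Equation (3) relative to the chosen origin, each point cloud can be brought into the global coordinate system; a minimal sketch:

```python
import numpy as np

def to_global(cloud, D):
    """Transform an N x 3 point cloud from the camera frame to the global frame."""
    homogeneous = np.hstack([cloud, np.ones((cloud.shape[0], 1))])  # N x 4
    return (D @ homogeneous.T).T[:, :3]
```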
