*1.2. Related Work*

The problem of scene reconstruction has been well studied in different scenarios. To register two or more point clouds together, the six-degrees-of-freedom (DoF) transformation between them has to be found. This transformation can be calculated from the point clouds themselves, either by selecting the matching points manually or by using algorithms to find possible matching points. Alternatively, it can be derived from a known position of the sensors.

Chen and Medioni [16] and Besl and McKay [17] proposed one of the most widely used algorithms for the registration of 3D shapes, the iterative closest point (ICP) algorithm. It tries to find the best match between the point clouds by finding the closest points from one point cloud to the other, then determines the transformation matrix that minimises the distance between the points, and finally iterates until it converges. Because it relies on the data association of points between point clouds, a good overlap is necessary for convergence.

Wang and Solomon [18] proposed a replacement for ICP called the deep closest point (DCP) method, which uses a three-step approach: first embedding the point clouds into a common space, then capturing co-contextual information in an attention-based module, and finally using singular value decomposition to find the transformation matrix.

Aoki et al. [19] used the PointNet [20] method with the Lucas and Kanade (LK) [21] algorithm to solve the registration problem.

Other methods include that of Stoyanov et al. [22], who used a three-dimensional normal distributions transform representation of the distance between the models, followed by a Newton optimisation, and that of Zhan et al. [23], who proposed an algorithm based on normal vectors and particle swarm optimisation. These methods all rely on a sufficient overlap between the point clouds to solve the problem. Moreover, they tend to be slow.

Estimating the position of the sensor provides a more reliable transformation for the point clouds. Initially, work on camera calibration focused on finding the intrinsic parameters of a single camera. Zhang [24] was the first to propose a solution to this challenge using a chessboard pattern, i.e., a planar target, through least-squares approximation.

With the intrinsic parameters, it would in theory be possible to calculate the pose of the camera from four coplanar points that are not collinear, but Schweighofer and Pinz [25], discussing pose ambiguities, proved that two local minima exist and proposed an algorithm to solve this problem.

Fiducial markers became popular for camera pose estimation, and several markers have been proposed, including point-based [26], circle-based [27], and square-based [28,29] markers; the latter can determine the pose from their four corners and carry an ID in the middle of the marker.

Garrido-Jurado et al. [28] proposed a system with configurable marker dictionaries, specially designed for camera localization. The authors developed a marker generator, as well as an automatic detection algorithm. These ArUco markers form the basis of the charuco board used in this work.

Other publications have proposed ways to solve the problem of pose estimation using an arbitrary scene with texture [30–32]. These are not relevant to this project, since it was carried out in a contained environment.

In this work, we propose and carry out a novel evaluation method for multiple camera perspectives. We used a charuco board to identify and label the points on which the metrics were based, and we calculated the errors of three different methods for transforming the cameras to the global coordinate system.

#### **2. Materials and Methods**

This investigation was part of two projects called Meatable [33] and RoButcher (https://robutcher.eu, accessed on 15 February 2022). The projects involved researching the design and robotisation of a cell to process pig carcasses, called the Meat Factory Cell, which proposes a significant deviation from existing meat processing practices. The process was briefly described by Alvseike et al. (2018) [34,35]. The projects imposed constraints on the captured data due to the specific configuration of the scene and the application requirements.

Due to the inherent characteristics of the projects, precision was a key aspect of the system; otherwise, the cutting and manipulation of carcasses would be unsatisfactory or ineffective (e.g., the robot could cut too deep into the carcass, or not cut it at all).

For these projects, a bespoke machine called a carcass handling unit (CHU) was used, as shown in Figure 2. Its function was to grip a pig's carcass and present it to the robot, which would perform automated cutting to segment or dissect the parts.

**Figure 2.** Carcass handling unit in the RoButcher laboratory.

The robot needed a complete understanding of the environment in 3D space. However, the camera had a limited angle of view. To overcome this constraint, multiple camera perspectives had to be transformed into a common coordinate system to obtain a complete view of the scene.

The robotic arm had a camera attached to the tool centre point (TCP), as illustrated in Figure 3, to capture the data. It cycled the camera through six positions around the CHU, with two on the right, two on top, and two on the left side. With this configuration, almost 360° of the scene could be captured.

**Figure 3.** FRAMOS D415e Camera and its bespoke support.

The cameras are referred to as left/right/up and front/back. The left/right is defined in relation to the CHU, not to the pig, which can be rotated once grabbed by the CHU. The left, right, front, and back sides of the CHU were defined as shown in Figure 4.

**Figure 4.** Carcass handling unit (CHU) with sides labelled.

The FRAMOS D415e has three image sensors (two infrared (IR) and one RGB) and one IR laser projector. It calculates depth based on the disparity between the two IR sensors. The documentation states that the infrared cameras have no distortion; hence, all the distortion coefficients are 0.0 [9]. Furthermore, the cameras come factory calibrated, with the intrinsic parameters recorded in memory, from which they can be easily read using the appropriate Intel RealSense SDK library.
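As an illustration, reading these factory intrinsics with the `pyrealsense2` Python wrapper might look as follows (a minimal sketch; the stream resolution and format are assumptions):

```python
import pyrealsense2 as rs

# Start a pipeline streaming depth and colour (resolution/format are assumptions).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
profile = pipeline.start(config)

# Read the factory-calibrated intrinsics stored on the device.
intr = profile.get_stream(rs.stream.color).as_video_stream_profile().get_intrinsics()
print(intr.fx, intr.fy, intr.ppx, intr.ppy, intr.coeffs)

pipeline.stop()
```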

The Intel RealSense D415 was evaluated with regard to its accuracy, performance, and precision by Lourenco and Araujo [36]: for an analysis of 100 images measured at 1 m, they reported an RMSE accuracy distance of 0.07927, an average failed-points ratio of 0.5414%, and an average of 0.0667% outliers (±10 cm). The repetitive pattern of the board also degrades the depth calculation, and consequently the point cloud, as illustrated in Figure 5.

**Figure 5.** Charuco board with noise and artefacts due to the repetitive pattern.

The data are recorded in a bag file (http://wiki.ros.org/Bags, accessed on 1 February 2022) using the SDK. The bag file is a container that stores the image and depth frames, the intrinsic parameters of both sensors, and metadata. These data are used to extract the parameters needed to execute the calibration.
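A recorded bag file can then be replayed through the same SDK pipeline; a minimal sketch (the file name is hypothetical):

```python
import pyrealsense2 as rs

# Replay a recorded bag file instead of a live device.
config = rs.config()
rs.config.enable_device_from_file(config, "cam0_take0.bag")  # hypothetical file name

pipeline = rs.pipeline()
pipeline.start(config)

frames = pipeline.wait_for_frames()
depth_frame = frames.get_depth_frame()
color_frame = frames.get_color_frame()

pipeline.stop()
```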

The file-naming convention used in this work comprised the camera position, numbered from 0 to 5, and the sequential take number, with the following relation to the CHU:


#### *2.1. Evaluation Metrics*

To quantify the quality of the reconstruction achieved by each method, the root-mean-square error (RMSE), shown in Equation (8), was used. The error in the RMSE calculation was the Euclidean distance in three-dimensional space $\mathbb{R}^3$, calculated after reprojecting the points in the point cloud, as shown in Equation (9).

$$rmse = \sqrt{\frac{\sum_{i=1}^{n} error_i^2}{n}} \tag{8}$$

$$error_i = \sqrt{(x_o - x_r)^2 + (y_o - y_r)^2 + (z_o - z_r)^2} \tag{9}$$
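For reference, a minimal NumPy sketch of Equations (8) and (9) over labelled point pairs (array names are illustrative):

```python
import numpy as np

def rmse(points_o: np.ndarray, points_r: np.ndarray) -> float:
    # points_o, points_r: (n, 3) arrays whose rows correspond by label.
    error = np.linalg.norm(points_o - points_r, axis=1)  # Equation (9)
    return float(np.sqrt(np.mean(error ** 2)))           # Equation (8)
```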

The depth information received from the camera could have an error of up to 2%, thus increasing the RMSE. However, as this affected all methods similarly, it did not interfere with the final comparison.

To measure the distance between corresponding points, the correspondences must be known, i.e., which point in one point cloud should match which point in the other. Since the point cloud from the RGB-D device did not contain information enabling us to identify which point was which, a method was proposed to label some of the points.

In this work, we used OpenCV [37], as it is a stable and robust library for computer vision. ArUco markers were used as fiducial markers, and a charuco board was used as the pattern board for the pose estimation, as proposed by Garrido-Jurado et al. (2014) [28].

The ArUco module in OpenCV has all the necessary functions for pose estimation using the implemented charuco board [38].
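As an illustrative sketch of this workflow, using the legacy `cv2.aruco` API (opencv-contrib-python before version 4.7; the dictionary choice, intrinsics values, and file name are assumptions, with the board geometry taken from Section 2.3):

```python
import cv2
import numpy as np

# Board geometry per Section 2.3: 14 x 9 squares, 20 mm checkers, 15.56 mm markers.
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_250)
board = cv2.aruco.CharucoBoard_create(14, 9, 0.020, 0.01556, aruco_dict)

camera_matrix = np.array([[910.0, 0.0, 640.0],
                          [0.0, 910.0, 360.0],
                          [0.0, 0.0, 1.0]])  # placeholder; use the device intrinsics
dist_coeffs = np.zeros(5)                    # simplification; see the distortion note above

image = cv2.imread("color_frame.png")        # hypothetical file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect markers, interpolate the chessboard corners, then estimate the board pose.
corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
if ids is not None:
    n, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, camera_matrix, dist_coeffs, None, None)
```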

#### *2.2. Labelling Points*

The method used to identify and label the points in order to enable error calculation was based on the unique ID of each ArUco marker. Each ID and the corners of the marker could be found on the board during the pose estimation process, as shown in Figure 6.

**Figure 6.** Charuco Board showing the identified ArUco markers.

Equation (10) shows the calculation of the label values for points on the charuco board. Each corner receives a label that identifies the specific point in the point cloud.

$$label = (i \ast 4) + j \tag{10}$$

where *i* is the ArUco marker ID, multiplied by 4 because every marker contributes four corner points, and *j* is the index of the corner within the vector in which the marker's corners are stored (0 to 3).
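A minimal sketch of this labelling scheme (the function name is illustrative):

```python
def corner_labels(marker_id: int) -> list[int]:
    # Equation (10): each of the marker's four corners receives a unique label.
    return [marker_id * 4 + j for j in range(4)]

# e.g., marker ID 7 -> labels [28, 29, 30, 31]
```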

#### *2.3. Methodology*

The methodology used to measure the reconstruction error was based on a charuco cuboid with three faces, which was used to identify the points and calculate the RMSE. The chosen charuco boards were 300 mm by 200 mm, with a checker size of 20 mm and a marker size of 15.56 mm, arranged in 14 columns and 9 rows. They were manufactured in aluminium using precision technology by Calib.io (https://calib.io, accessed on 13 January 2022).

A charuco rectangular cuboid was built, as shown in Figure 7, using three boards.

#### *2.4. Pre-Processing RGB-D Data*

The first step was to improve the data by minimising the noise and the artefacts in the point cloud. To perform this task, some post-processing was applied to the captured frames.

In this work, the chosen post-processing methods were: alignment of the colour and depth frames, sharpening of the colour frame, and temporal and spatial filters applied to the frameset. Although the alignment and sharpening were always applied, the results were tested with the temporal and spatial filters both enabled and disabled.

**Figure 7.** Charuco rectangular cuboid.

#### 2.4.1. Alignment of Frames

The FRAMOS D415e camera has its origin reference in the left IR imager, with the RGB imager on the right side, as shown in Figure 8. As these sensors are in different positions, an alignment between the colour image and the depth information was performed.

**Figure 8.** FRAMOS D415e origin reference system.

The alignment was performed in relation to the colour imager, i.e., the depth information was aligned to the colour frame. Since the alignment changes the relation of the pixels in the depth frame to match the colour sensor, the intrinsic matrix used to reproject the point cloud was also derived from the colour imager.
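In the RealSense SDK this alignment is available directly; a minimal `pyrealsense2` sketch (reusing the pipeline started in the earlier sketch):

```python
import pyrealsense2 as rs

# Align the depth frame to the colour sensor, so depth pixels match colour pixels.
align = rs.align(rs.stream.color)

frames = pipeline.wait_for_frames()   # pipeline as started earlier
aligned = align.process(frames)
depth_frame = aligned.get_depth_frame()
color_frame = aligned.get_color_frame()

# The colour intrinsics are then used to reproject the point cloud.
color_intr = color_frame.profile.as_video_stream_profile().get_intrinsics()
```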

#### 2.4.2. Sharpening Filter

Besides the alignment, a sharpening filter was applied to the colour image to improve the ArUco marker identification. The filter kernel used is shown in Equation (11).

$$kernel = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} \tag{11}$$

Figure 9a shows the charuco board and the identified ArUco markers before the sharpening filter was applied; Figure 9b shows the image after filtering, with a significant improvement in the corners and edges. The sharpening filter enhanced the edges between the white and the black pixels of the board, increasing the number of identified ArUco markers.

**Figure 9.** Use of a sharpening filter to improve ArUco recognition. (**a**) No filter applied. (**b**) Filter applied.
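A minimal sketch of applying such a kernel with OpenCV (the kernel values assume the standard 3×3 sharpening kernel of Equation (11), and the file name is hypothetical):

```python
import cv2
import numpy as np

# Standard 3x3 sharpening kernel (assumed; see Equation (11)).
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=np.float32)

color_image = cv2.imread("color_frame.png")        # hypothetical file name
sharpened = cv2.filter2D(color_image, -1, kernel)  # -1 keeps the input depth
```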

#### 2.4.3. Temporal and Spatial Filters

To diminish the noise artefacts in the data, two post-processing techniques were applied to the captured frames: a temporal filter and a spatial filter. Both filters are implemented in the RealSense SDK and have been well explained by Grunnet-Jepsen and Tong [39].

The temporal filter performs a time average of the frames to improve the depth calculation. It has three parameters: alpha, delta, and the persistence filter. The alpha parameter represents the temporal history of the frames, the delta is a threshold to preserve edges, and the persistence filter tries to minimise holes by keeping the last known value for a pixel.

The spatial filter implemented in the SDK is based on the work of Gastal and Oliveira [40]. It preserves edges while smoothing the data. It takes four parameters: alpha and delta, which have the same function as in the temporal filter; the filter magnitude, which defines the number of iterations; and the hole-filling parameter, which reduces artefacts.
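A minimal `pyrealsense2` sketch of applying both filters (the parameter values shown are assumptions; the paper does not list its settings):

```python
import pyrealsense2 as rs

spatial = rs.spatial_filter()    # edge-preserving spatial smoothing
temporal = rs.temporal_filter()  # averaging over previous frames

# Example parameter values (assumptions).
spatial.set_option(rs.option.filter_magnitude, 2)      # number of iterations
spatial.set_option(rs.option.filter_smooth_alpha, 0.5)
spatial.set_option(rs.option.filter_smooth_delta, 20)
temporal.set_option(rs.option.filter_smooth_alpha, 0.4)
temporal.set_option(rs.option.filter_smooth_delta, 20)

# Filters are applied per depth frame (depth_frame from the aligned frameset).
depth_frame = spatial.process(depth_frame)
depth_frame = temporal.process(depth_frame)
```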

#### 2.4.4. The Ground-Truth

After post-processing, the next step in the evaluation of the reconstruction was to generate a cuboid of reference points, in relation to the robot's base, to be used as ground-truth data. This was performed based on the first camera's data. The charuco board was identified and the markers' corners were extracted from the image, as shown in Figure 6. Based on the cuboid geometry, the top reference was generated by rotating −90° around the *x*-axis, followed by a rotation of 180° around the *y*-axis. Then, a translation of 280 mm, 187 mm and −190 mm in the *x*, *y* and *z* directions, respectively, was carried out, as shown in Equation (12). The back side was generated by rotating 180° around the *y*-axis and translating 280 mm and −200 mm in the *x* and *z* directions, respectively, as shown in Equation (13).

$${}_{top}T^{b} = T^{0} \ast {}_{cub}T^{cam0} \ast trans_{a} \ast rot_{x^{-90}} \ast rot_{y^{180}} \ast \left({}_{cub}T^{cam0}\right)^{-1} \tag{12}$$

where ${}_{top}T^{b}$ is the transformation from the top charuco board's origin to the robot's base, $T^{0}$ is the transformation from the TCP to the base, ${}_{cub}T^{cam0}$ is the transformation from the first camera to the charuco board, $trans_{a}$ is the translation, $rot_{x^{-90}}$ is the rotation around the *x*-axis, $rot_{y^{180}}$ is the rotation around the *y*-axis, and $({}_{cub}T^{cam0})^{-1}$ is the inverse transformation from the first camera to the charuco board.

$${}_{back}T^{b} = T^{0} \ast {}_{cub}T^{cam0} \ast trans_{b} \ast rot_{y^{180}} \ast \left({}_{cub}T^{cam0}\right)^{-1} \tag{13}$$

where ${}_{back}T^{b}$ is the transformation from the back charuco board's origin to the robot's base, $trans_{b}$ is the corresponding translation, and the other terms are as explained for Equation (12).
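Under the assumption that all transformations are expressed as 4×4 homogeneous matrices in millimetres, a NumPy sketch of Equations (12) and (13) might look as follows (matrix names are illustrative):

```python
import numpy as np

def rot_x(deg: float) -> np.ndarray:
    """4x4 homogeneous rotation about the x-axis."""
    a = np.radians(deg)
    m = np.eye(4)
    m[1:3, 1:3] = [[np.cos(a), -np.sin(a)],
                   [np.sin(a),  np.cos(a)]]
    return m

def rot_y(deg: float) -> np.ndarray:
    """4x4 homogeneous rotation about the y-axis."""
    a = np.radians(deg)
    m = np.eye(4)
    m[0, 0], m[0, 2] = np.cos(a), np.sin(a)
    m[2, 0], m[2, 2] = -np.sin(a), np.cos(a)
    return m

def trans(x: float, y: float, z: float) -> np.ndarray:
    """4x4 homogeneous translation (units must match the point cloud, here mm)."""
    m = np.eye(4)
    m[:3, 3] = [x, y, z]
    return m

# Placeholders: T_0 (TCP to base) comes from the robot controller and
# T_cub_cam0 (first camera to charuco) from the pose estimation.
T_0 = np.eye(4)
T_cub_cam0 = np.eye(4)

# Equation (12): top board reference.
T_top = T_0 @ T_cub_cam0 @ trans(280, 187, -190) @ rot_x(-90) @ rot_y(180) @ np.linalg.inv(T_cub_cam0)
# Equation (13): back board reference.
T_back = T_0 @ T_cub_cam0 @ trans(280, 0, -200) @ rot_y(180) @ np.linalg.inv(T_cub_cam0)
```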

The final reference cuboid can be seen in Figure 10 with the coordinate system of every side shown as a visual aid.

**Figure 10.** Cuboid with reference points and the coordinate frame of each board.

From the chosen camera positions, it is impossible for every camera to see a single charuco board. To solve this challenge, the board can be rotated whenever it is visible from at least two camera positions, or two (or more) boards with known geometries can be used.

In this work, two reconstructions were performed using the charuco board and one using a robotic arm's position.

#### *2.5. Charuco Cuboid*

When using the charuco cuboid, each side is seen from two camera positions; consequently, the pose estimation is based on the visible board. The origin for the reconstruction was defined as the first camera position. To obtain an optimal reconstruction, the data from camera positions facing a different board must be rotated and translated based on the cuboid geometry, as explained in Section 2.4.4.

Notably, to calibrate all camera views (i.e., in all six possible positions), the charuco cuboid can remain stationary within the environment (i.e., it does not require repositioning for each camera).

#### *2.6. Charuco Double-Sided Angled Board*

Using a double-sided board follows the same principle as the cuboid but is simpler, since the camera sees the same side from four positions and the back of the board from the other two, as shown in Figure 11a–c.

To perform the transformation, the left and upper cameras were transformed by the inverse of the charuco-to-camera transformation (${}_{cha}T^{cam}$), making the charuco board origin the same for these cameras. The right cameras additionally required a 180° rotation around the *y*-axis and a translation of 280 mm and 1.2 mm in the *x* and *z* directions, respectively, according to the board geometry.
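Reusing the `trans` and `rot_y` helpers from the sketch in Section 2.4.4, these per-camera transformations might be composed as follows (${}_{cha}T^{cam}$ is assumed to be a known 4×4 homogeneous matrix):

```python
import numpy as np

T_cha_cam = np.eye(4)  # placeholder; charuco-to-camera transformation from pose estimation

# Left and upper cameras: bring the data into the shared charuco board origin.
T_left_up = np.linalg.inv(T_cha_cam)

# Right cameras: additionally rotate 180 deg about y and translate
# 280 mm in x and 1.2 mm in z, per the board geometry.
T_right = trans(280, 0, 1.2) @ rot_y(180) @ np.linalg.inv(T_cha_cam)
```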

**Figure 11.** Charuco board views from the camera positions. (**a**) Left view from camera 1. (**b**) Top view from camera 3. (**c**) Right view from camera 5.
