*2.1. Depth Map Alignment and Segmentation*

The color images and depth maps are jointly segmented by a procedure similar to that described in Mutto et al. [58] that has two main stages. In Stage 1, each sample is transformed into a six-dimensional vector. In Stage 2, the point set is clustered using the mean shift algorithm [59].

Every sample in the Kinect depth map corresponds to a 3D point, *pi*, *i* = 1, ... , *N*, with *N* the number of points. The joint calibration of the depth and color cameras, as described in [57], allows a reprojection of the depth samples over the corresponding pixels in the color image, so that each point is associated with the 3D spatial coordinates (x, y, and z) of *pi* and its RGB color components. Since geometry and color lie in entirely different spaces, they cannot be compared directly; their components must first be made comparable so that multidimensional vectors suitable for the mean shift clustering algorithm can be extracted. Thus, a conversion is performed so that the color values lie in the CIELAB uniform color space, which represents color in three dimensions: lightness (L) from black (0) to white (100), a value (a) from green (−) to red (+), and a value (b) from blue (−) to yellow (+). This gives the Euclidean distance between color vectors a perceptual significance that can be exploited by the mean shift algorithm.
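As an illustration of this color-space conversion, the following sketch maps an 8-bit sRGB triplet to CIELAB via linear RGB and CIE XYZ, assuming the standard D65 white point (function and variable names are illustrative, not part of the original method):

```python
def rgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB triplet to CIELAB (D65 white point)."""
    # sRGB -> linear RGB (undo the sRGB gamma curve)
    def lin(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    R, G, B = lin(r), lin(g), lin(b)
    # linear RGB -> CIE XYZ (sRGB primaries, D65 illuminant)
    X = 0.4124 * R + 0.3576 * G + 0.1805 * B
    Y = 0.2126 * R + 0.7152 * G + 0.0722 * B
    Z = 0.0193 * R + 0.1192 * G + 0.9505 * B
    # XYZ -> Lab, normalized by the D65 reference white
    def f(t):
        return t ** (1.0 / 3.0) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(X / 0.95047), f(Y / 1.0), f(Z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

In practice a library routine (e.g., an image-processing package's RGB-to-Lab conversion) would be applied per pixel; the sketch only makes the underlying arithmetic explicit.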

Formally, the color information of each scene point in the CIELAB color space, *c*, can be described with the 3D vector:

$$p_i^c = \begin{bmatrix} \mathbf{L}(p_i) \\ \mathbf{a}(p_i) \\ \mathbf{b}(p_i) \end{bmatrix}, \quad i = 1, \dots, N. \tag{1}$$

The geometry, *g*, can be represented simply by the 3D coordinates of each point, thus:

$$p_i^g = \begin{bmatrix} \mathbf{x}(p_i) \\ \mathbf{y}(p_i) \\ \mathbf{z}(p_i) \end{bmatrix}, \quad i = 1, \dots, N. \tag{2}$$

The scene segmentation algorithm needs to be insensitive to the relative scaling of the point-cloud geometry. Moreover, the geometry and color distances must be brought into a consistent framework. For this reason, all the components of $p_i^g$ are normalized with respect to the average of the standard deviations of the point coordinates in the three dimensions, $\sigma_g = (\sigma_x + \sigma_y + \sigma_z)/3$. Normalization produces the vector:

$$
\begin{bmatrix}
\overline{\mathbf{x}}(p_i) \\
\overline{\mathbf{y}}(p_i) \\
\overline{\mathbf{z}}(p_i)
\end{bmatrix} = \frac{3}{\sigma_x + \sigma_y + \sigma_z} \begin{bmatrix}
\mathbf{x}(p_i) \\
\mathbf{y}(p_i) \\
\mathbf{z}(p_i)
\end{bmatrix} = \frac{1}{\sigma_g} \begin{bmatrix}
\mathbf{x}(p_i) \\
\mathbf{y}(p_i) \\
\mathbf{z}(p_i)
\end{bmatrix}.\tag{3}
$$
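A minimal sketch of this normalization, assuming the point cloud is held in an N×3 NumPy array (the function name is illustrative):

```python
import numpy as np

def normalize_geometry(xyz):
    """Normalize 3D coordinates by the average of the per-axis standard
    deviations, sigma_g = (sigma_x + sigma_y + sigma_z) / 3, as in Eq. (3)."""
    sigma_g = xyz.std(axis=0).mean()
    return xyz / sigma_g
```

After normalization the mean of the per-axis standard deviations equals 1, so the result is invariant to the absolute scale of the acquired point cloud.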

To balance the relevance of color and geometry in the merging process, the color information vectors are normalized as well. The average of the standard deviations of the L, a, and b color components, $\sigma_c = (\sigma_L + \sigma_a + \sigma_b)/3$, is computed, producing the final color representation:

$$
\begin{bmatrix}
\overline{\mathbf{L}}(p_i) \\
\overline{\mathbf{a}}(p_i) \\
\overline{\mathbf{b}}(p_i)
\end{bmatrix} = \frac{3}{\sigma_L + \sigma_a + \sigma_b} \begin{bmatrix}
\mathbf{L}(p_i) \\
\mathbf{a}(p_i) \\
\mathbf{b}(p_i)
\end{bmatrix} = \frac{1}{\sigma_c} \begin{bmatrix}
\mathbf{L}(p_i) \\
\mathbf{a}(p_i) \\
\mathbf{b}(p_i)
\end{bmatrix}.\tag{4}
$$

Once the geometry and color information vectors are normalized, they can be combined for a final representation *f*:

$$p_i^f = \begin{bmatrix} \overline{\mathbf{L}}(p_i) \\ \overline{\mathbf{a}}(p_i) \\ \overline{\mathbf{b}}(p_i) \\ \lambda\,\overline{\mathbf{x}}(p_i) \\ \lambda\,\overline{\mathbf{y}}(p_i) \\ \lambda\,\overline{\mathbf{z}}(p_i) \end{bmatrix} \tag{5}$$

with the parameter λ balancing the contributions of color and geometry to the final segmentation: low values of λ emphasize color, while high values emphasize geometry. By adjusting λ, the algorithm can be reduced to a color-based segmentation (λ = 0) or to a geometry (depth)-only segmentation (λ → ∞) (see [58] for a discussion of the effects that this parameter produces and for a method to automatically tune λ to an optimal value).
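Under the same assumptions as above, the normalized color and geometry components can be stacked into the 6D vectors of Eq. (5), with λ weighting the geometry part (a sketch; `build_features` is an illustrative name):

```python
import numpy as np

def build_features(xyz, lab, lam=1.0):
    """Stack normalized CIELAB color and lambda-weighted normalized
    geometry into the 6D vectors p_i^f of Eq. (5)."""
    sigma_c = lab.std(axis=0).mean()  # (sigma_L + sigma_a + sigma_b) / 3
    sigma_g = xyz.std(axis=0).mean()  # (sigma_x + sigma_y + sigma_z) / 3
    return np.hstack([lab / sigma_c, lam * (xyz / sigma_g)])
```

With `lam = 0` the geometry columns vanish and the clustering degenerates to a purely color-based segmentation; as `lam` grows, geometry dominates the Euclidean distances used by mean shift.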

Once the final vectors $p_i^f$ are computed, they can be clustered by the mean shift algorithm [59] to segment the acquired scene. This algorithm offers an excellent trade-off between segmentation accuracy and computational complexity. As a final refinement, regions smaller than a predefined threshold are removed, since they are typically due to noise. Figure 2 shows an example of a segmented image.
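For illustration, a minimal flat-kernel mean shift over such feature vectors might look as follows. This is a from-scratch sketch, not the implementation of [59]: each point is iteratively shifted to the mean of its bandwidth neighborhood, and points whose modes coincide are merged into one cluster. Production code would use an optimized library implementation and add the small-region removal step.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50):
    """Minimal flat-kernel mean shift clustering (O(n^2) per iteration)."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    # Shift each mode to the mean of the original points within `bandwidth`.
    for _ in range(n_iter):
        for i in range(len(modes)):
            d = np.linalg.norm(points - modes[i], axis=1)
            modes[i] = points[d < bandwidth].mean(axis=0)
    # Merge modes that converged to (nearly) the same location.
    labels = -np.ones(len(points), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)
```

The bandwidth plays a role analogous to λ in shaping the segmentation: it sets the scale at which feature-space modes are considered distinct.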

**Figure 2.** Color image (**left**), depth map (**middle**), and segmentation map (**right**).
