### 5.1.1. Preprocessing

Initially, the pipeline requires overlapping images of the object surface as input, together with the camera calibration parameters, i.e., the interior orientation of the camera and the lens distortion coefficients. Afterwards, all images are preprocessed to remove lens distortion, which allows the use of the simplified pinhole camera model.
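
As a minimal sketch of this preprocessing step, the following Python snippet removes lens distortion with OpenCV; the library choice, file names, and calibration values are illustrative assumptions and not part of the pipeline description.

```python
import cv2
import numpy as np

# Example interior orientation and distortion coefficients (placeholder values;
# in practice these come from the prior high-accuracy camera calibration).
K = np.array([[3300.0,    0.0, 1920.0],
              [   0.0, 3300.0, 1374.0],
              [   0.0,    0.0,    1.0]])
dist = np.array([-0.10, 0.05, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3

img = cv2.imread("image_000.png")                # hypothetical file name
undistorted = cv2.undistort(img, K, dist)        # remove lens distortion
cv2.imwrite("image_000_undistorted.png", undistorted)
```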

### 5.1.2. Structure from Motion

An illustration of the single steps of the SfM procedure is provided in Figure 9.

**Figure 9.** Visualization of the structure from motion procedure.

Interest operators, such as the widespread SIFT [25] and SURF [26], identify distinctive points in the images. These points, also called feature points, are basically corners, edges, or blobs, which cause significant changes in pixel intensities and thus differ strongly from their neighborhood. Once the location of a feature point is detected, an image patch around the point is extracted, transformed so that it becomes scale- and rotation-invariant, and stored in a vector called a feature descriptor. In this way, every feature point in an image gets its own fingerprint, which can then be used to distinguish one point from another. We use SURF as the interest operator for feature detection and subsequent extraction, since it is roughly three times faster than SIFT at similar accuracy [37].
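
The following sketch illustrates SURF-based feature detection and descriptor extraction, assuming an OpenCV build with the contrib (non-free) module; parameter values and file names are placeholders and not taken from the paper.

```python
import cv2

# Load one of the (already undistorted) input images in grayscale.
img = cv2.imread("image_000_undistorted.png", cv2.IMREAD_GRAYSCALE)

# SURF is part of the contrib module and may require a non-free OpenCV build.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

# keypoints: locations of the detected feature points;
# descriptors: one 64-dimensional vector ("fingerprint") per feature point.
keypoints, descriptors = surf.detectAndCompute(img, None)
print(f"{len(keypoints)} feature points, descriptor length {descriptors.shape[1]}")
```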

Subsequently, matching of the extracted feature descriptors across distinct images is used to identify pairs of images that view the same part of the scene and to estimate the relative orientation between the image pairs. The relative orientation of an image pair is defined by the orientation, represented by a rotation matrix *R*, and the position, represented by a translation vector *t*, of the second image with respect to the first one. Since the exterior orientation of the camera is approximately pre-determined and essentially corresponds to the case of aerial photogrammetry, we can exploit several key conditions. For example, the regularly spaced, strip-wise arranged images allow simplifications when establishing image pairs. Hence, we simply pick the first two images as the initial pair and consider only consecutive images for setting up further pairs. Provided that the system is configured with a sufficiently small camera movement between images, a high overlap is guaranteed. Thus, it can be assured that all images share common feature points, which prevents the occurrence of multiple models. Another benefit of the known image sequence is that we do not have to deal with a possibly poor initialization of the model, which could otherwise lead to a poor reconstruction due to the accumulation of errors.

The formation of image pairs further runs through a multi-stage filtering procedure in order to obtain matches that are as robust as possible. Given an image *Ii*, we compare it with all previous images *Ii−j* and search for the two nearest matches of each feature point in image *Ii* to those in image *Ii−j* and vice versa. The first filtering step is the distance ratio test proposed in [25]. Accordingly, a match is considered poor if the distance ratio between the first-best and the second-best match exceeds a particular threshold. Next, we eliminate matches based on a symmetry criterion and hence reject those that do not match each other in both directions. The last filtering step is a geometric verification, in which we use RANSAC to estimate the fundamental matrix from the remaining point matches and identify possible outliers. Finally, an image pair is considered valid, with sufficient overlap, if the number of filtered point matches exceeds a pre-defined threshold, e.g., 100 in our case. The procedure for establishing image pairs, including the filtering steps, is shown in Figure 10; it basically represents a zoom-in of the fifth box (Form image pairs) in the diagram of the 3D reconstruction pipeline (Figure 8).

**Figure 10.** Procedure for image pair formation.
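
A possible implementation of this filtering cascade is sketched below in Python with OpenCV; the function and variable names (e.g., `filtered_matches`, the ratio and inlier thresholds) are illustrative assumptions, not the authors' code.

```python
import cv2
import numpy as np

def filtered_matches(desc_i, desc_j, kpts_i, kpts_j, ratio=0.8, min_matches=100):
    """Ratio test -> symmetry test -> RANSAC on the fundamental matrix."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn_ij = matcher.knnMatch(desc_i, desc_j, k=2)   # two nearest matches per point
    knn_ji = matcher.knnMatch(desc_j, desc_i, k=2)   # and in the opposite direction

    def ratio_filter(knn):
        # 1) Distance ratio test: a match is poor (ambiguous) if the best match is
        #    not clearly closer than the second-best one.
        return {p[0].queryIdx: p[0].trainIdx
                for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance}

    good_ij, good_ji = ratio_filter(knn_ij), ratio_filter(knn_ji)

    # 2) Symmetry test: keep only matches that agree in both directions.
    sym = [(qi, ti) for qi, ti in good_ij.items() if good_ji.get(ti) == qi]
    if len(sym) < 8:
        return None

    # 3) Geometric verification: RANSAC-based estimation of the fundamental matrix.
    pts_i = np.float32([kpts_i[qi].pt for qi, _ in sym])
    pts_j = np.float32([kpts_j[ti].pt for _, ti in sym])
    F, mask = cv2.findFundamentalMat(pts_i, pts_j, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return None
    inliers = [m for m, keep in zip(sym, mask.ravel()) if keep]

    # The image pair is valid only if enough filtered matches remain.
    return inliers if len(inliers) >= min_matches else None
```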

Based on the filtered point matches, the fundamental matrix *F* is computed as proposed in [38]; pre- and post-multiplying *F* with the camera intrinsic matrix *K* then yields the essential matrix *E*. Subsequently, *E* is decomposed using singular value decomposition (SVD) into a rotation matrix *R* and a translation vector *t*, which represent the relative orientation of the image pair. Due to the unknown scale between both images, *t* is normalized.
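
A compact sketch of this step, assuming OpenCV as tooling: `recoverPose` internally performs the SVD-based decomposition of *E* and selects the physically valid solution; this is an assumed implementation, not necessarily the one used by the authors.

```python
import cv2

def relative_orientation(F, K, pts_i, pts_j):
    """Essential matrix from F and K, decomposed into relative rotation and translation."""
    E = K.T @ F @ K
    # recoverPose performs the SVD-based decomposition of E and selects, out of the
    # four (R, t) candidates, the one placing the points in front of both cameras.
    # t is returned with unit norm, reflecting the unknown scale.
    _, R, t, _ = cv2.recoverPose(E, pts_i, pts_j, K)
    return R, t
```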

Given the relative orientation of an initial image pair, we first triangulate corresponding points in space in order to determine the 3D coordinates of the associated world points. However, due to noise, the rays will never truly intersect. To solve that problem, we apply linear triangulation [39] for minimizing the algebraic error and thus obtain an optimal solution in the least squares sense.
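
The linear (DLT) triangulation of a single correspondence can be sketched as follows; the projection matrices and image points are assumed inputs.

```python
import numpy as np

def triangulate_linear(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point correspondence.

    P1, P2: 3x4 projection matrices of the two images; x1, x2: matching image points.
    Minimizes the algebraic error of the (non-intersecting) rays in a least-squares sense.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector belonging to the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize to obtain the 3D world point
```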

Consecutive images are successively registered in the local model coordinate system using 2D-to-3D point correspondences as described in [40]. To this end, with the correspondences between 2D feature points of the considered image and 3D world points triangulated from previous image pairs, we solve the perspective-n-point (PnP) problem [41]. As a result, we obtain the exterior orientation of the considered image, i.e., the rotation matrix *R* and translation vector *t*, in the given coordinate system. Finally, the new feature points of the registered image are also triangulated with those of the previous images.
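
A sketch of this registration step using OpenCV's PnP solver; the use of a RANSAC variant here is an assumption, since the text only references the PnP problem [41].

```python
import cv2
import numpy as np

def register_image(world_pts, image_pts, K):
    """Exterior orientation of a new image from 2D-3D correspondences (PnP).

    world_pts: Nx3 world points triangulated from previous image pairs;
    image_pts: Nx2 corresponding feature points in the image to be registered.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(world_pts), np.float32(image_pts), K, None)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    return R, tvec, inliers
```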

The obtained results for the exterior orientation, world points, and feature points are considered approximations and are subsequently optimized by least-squares bundle adjustment. Thus, after each image registration, we apply Levenberg–Marquardt optimization [42,43] to simultaneously refine the exterior orientation of the images, the coordinates of the triangulated world points, and the feature point coordinates. The interior orientation of the camera, however, is kept fixed since it has been calibrated with high accuracy beforehand.
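
The following simplified sketch illustrates the idea of Levenberg–Marquardt bundle adjustment over camera poses and world points using `scipy.optimize.least_squares`; it omits the refinement of feature point coordinates and the sparse structure an efficient implementation would exploit, so it is only a conceptual illustration under those assumptions.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs, K):
    """Residuals between observed feature points and reprojected world points."""
    poses = params[:n_cams * 6].reshape(n_cams, 6)    # per image: rvec (3) + tvec (3)
    points = params[n_cams * 6:].reshape(n_pts, 3)    # world point coordinates
    res = []
    for c, p, uv in zip(cam_idx, pt_idx, obs):
        proj, _ = cv2.projectPoints(points[p].reshape(1, 3),
                                    poses[c, :3], poses[c, 3:], K, None)
        res.append(proj.ravel() - uv)                 # K stays fixed (pre-calibrated)
    return np.concatenate(res)

# x0 stacks the approximate poses and world points delivered by SfM;
# Levenberg-Marquardt refines them after each image registration:
# result = least_squares(reprojection_residuals, x0, method="lm",
#                        args=(n_cams, n_pts, cam_idx, pt_idx, obs, K))
```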

### 5.1.3. Dense Image Matching

In contrast to SfM, which estimates 3D coordinates only from distinctive image points, the objective of DIM is to find a correspondence for preferably *every* pixel. These correspondences are then used for triangulation in order to determine the 3D world coordinates.

Generally, algorithms for stereo matching can be divided into local and global matching methods. Local methods, as the name indicates, consider only local windows for the correspondence search. Specifically, for a certain pixel in the reference image, a window slides over the range of possible correspondences in the target image, restricted by a pre-defined disparity search range *d*. Matching costs, based on some similarity measure, are calculated for each possible correspondence, and the correspondence is finally chosen as the pixel with the minimum cost.

Local matching methods are in general comparatively easy to implement and very computationally and memory efficient. However, disparity selection using a local winner-takes-all (WTA) strategy struggles with ambiguities such as poor or repetitive texture and thus often leads to many mismatches.

Global matching methods, for example, based on Belief Propagation [44] or Graph Cuts [45], on the other hand, set up an energy function that takes all image points into account. An optimal disparity map is achieved by minimizing the energy function. Due to the inclusion of all image points in non-local constraints, global algorithms usually achieve a higher accuracy than local methods, but at the cost of a higher computational complexity.

In contrast, Semi-Global Matching (SGM), introduced by Hirschmüller [46], represents a compromise, since it offers a reasonable trade-off between accuracy and computational complexity. SGM also sets up a global energy function *E*, consisting of a data term *Edata* and a smoothness term *Esmooth*. The former measures the pixel-wise matching cost, while the latter enforces consistency by penalizing discontinuities in the disparity map. The disparity map is generated by minimizing *E* along several one-dimensional paths using dynamic programming.

$$E(D) = \sum\_{p} \left( \underbrace{C(p, D\_p)}\_{E\_{data}} + \underbrace{\sum\_{q \in N\_p} P\_1 \cdot T\left[ \left| D\_p - D\_q \right| = 1 \right] + \sum\_{q \in N\_p} P\_2 \cdot T\left[ \left| D\_p - D\_q \right| > 1 \right]}\_{E\_{smooth}} \right). \tag{3}$$

In order to meet real-time requirements, we implemented the compute-intensive DIM, specifically SGM, on Graphics Processing Units (GPUs) using Nvidia's CUDA programming model [47]. We parallelized each sub-algorithm of SGM for massively parallel processing and implemented parallel executable functions, so-called kernels, for each of them. In the following, we recall the main steps of the SGM algorithm and, building on that, introduce our parallelization scheme for the GPU implementation.

### A Cost Initialization:

An initial cost volume *C* of size *w* × *h* × *d* has to be built up, where *w* and *h* are the width and height of the image and *d* is the disparity search range. For this purpose, pixel-wise matching costs based on some similarity measure have to be calculated for each pixel *p* in the reference image with respect to its *d* potential correspondences in the target image.

The most widely used similarity measures are the sum of absolute differences (SAD), the sum of squared differences (SSD), and normalized cross-correlation (NCC) [48]. Even though SAD and SSD are easy to implement and real-time capable, they assume that pixel intensities of the same object are almost equal across images, which might not always be true. NCC, in contrast, is less sensitive to intensity changes as well as to Gaussian noise; however, it is computationally intensive.

In contrast, the Census Transform (*CT*) [49] followed by a Hamming distance calculation represents an outstanding approach. The *CT* compares the intensity of a considered pixel with that of its neighbors to generate a binary string, commonly referred to as a *CT* feature, as a representation for that pixel. The advantages of the *CT* are illumination invariance, efficiency, and ease of implementation due to simple operations such as XOR. We use a slightly modified version of the *CT*, the Center-Symmetric Census Transform (*CS*-*CT*) as proposed in [50], which compares only those pairs of neighboring pixels that are mirrored at the considered pixel. Therefore, the total number of operations decreases by around 50%, and a more compact representation using just half of the memory is obtained. The calculation of a *CS*-*CT* feature for a pixel at location (*x*, *y*) is given as:

$$\text{CS-CT}\_{m,n}(x,y) = \bigotimes\_{(i,j)\in L} s\big(I(x-i,\, y-j),\ I(x+i,\, y+j)\big), \quad \text{with } s(u,v) = \begin{cases} 0, & \text{if } u \le v \\ 1, & \text{otherwise} \end{cases}. \tag{4}$$

Subsequently, the matching cost of two pixels is given by the Hamming distance of their *CT* features and is represented by:

$$C(p,q) = \text{Hamming}(p,q) = \left\| \text{CS-CT}\_L(p) \oplus \text{CS-CT}\_R(q) \right\|\_1. \tag{5}$$

We implemented a kernel for the *CS*-*CT*, which operates on one image. The kernel is executed by a two-dimensional layout of threads so that, in the simplest possible way, one pixel is assigned to each thread. The window size is chosen as 7 × 9 pixels, which has the advantage that the resulting bit string of a *CS*-*CT* feature has a length of 31 bits and thus fits into a single 32-bit integer data type. Accordingly, an image with a size of 3840 × 2748 pixels leads to a total memory consumption of 40 MB.

A second kernel works on two *CS*-*CT*-transformed images and calculates the matching costs. The kernel is again based on a two-dimensional thread layout, with each thread calculating the Hamming distance of a specific *CS*-*CT* feature of the reference image to all *d* potential candidates in the target image. The XOR operation on two features and the subsequent population count of the resulting bit string (basically the number of ones) lead to values ranging from 0 to 31. Therefore, the use of a 1-byte data type for the Hamming distance and a disparity range of 80 values results in a total memory consumption of 805 MB for the initial cost volume *C*.
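
For illustration, a CPU-side NumPy sketch of the cost initialization (not the CUDA kernels themselves) could look as follows; window size, disparity range, and the pairing order of mirrored pixels are illustrative choices consistent with the description above.

```python
import numpy as np

def cs_census(img, win_h=7, win_w=9):
    """Center-symmetric census transform; one 31-bit string per pixel.

    For a 7x9 window there are (7*9 - 1) / 2 = 31 mirrored pixel pairs, so the
    feature fits into a single 32-bit integer, as described above.
    """
    h, w = img.shape
    rh, rw = win_h // 2, win_w // 2
    pad = np.pad(img.astype(np.int32), ((rh, rh), (rw, rw)), mode="edge")
    feat = np.zeros((h, w), dtype=np.uint32)
    bit = 0
    for i in range(-rh, rh + 1):
        for j in range(-rw, rw + 1):
            if i < 0 or (i == 0 and j < 0):      # visit each mirrored pair exactly once
                a = pad[rh + i: rh + i + h, rw + j: rw + j + w]
                b = pad[rh - i: rh - i + h, rw - j: rw - j + w]   # mirrored counterpart
                feat |= ((a > b).astype(np.uint32) << bit)
                bit += 1
    return feat

def initial_cost_volume(feat_ref, feat_tgt, d_max=80):
    """Hamming distances between CS-CT features for all candidate disparities."""
    h, w = feat_ref.shape
    cost = np.full((h, w, d_max), 31, dtype=np.uint8)   # 31 = worst possible cost
    for d in range(d_max):
        xor = feat_ref[:, d:] ^ feat_tgt[:, : w - d]    # shift the target image by d
        # population count of the XOR result = Hamming distance
        popcnt = np.unpackbits(xor.view(np.uint8), axis=-1).reshape(h, w - d, 32).sum(-1)
        cost[:, d:, d] = popcnt
    return cost
```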

### B Cost Aggregation:

The costs of the initial cost volume have to be aggregated over several paths considering two penalties *P*1 and *P*2, with *P*2 > *P*1. The recursive aggregation on a path *r* at a specific pixel *p* and disparity value *d* is defined as:

$$\begin{aligned} L\_r(p,d) = {} & C(p,d) \\ & + \min\left(L\_r(p-r,d),\ L\_r(p-r,d-1) + P\_1,\ L\_r(p-r,d+1) + P\_1,\ \min\_i L\_r(p-r,i) + P\_2\right) \\ & - \min\_k L\_r(p-r,k). \end{aligned} \tag{6}$$

Summing the aggregated costs over all paths delivers the aggregated cost volume *S* with:

$$S(p,d) = \sum\_{r} L\_r(p,d). \tag{7}$$

For the aggregation of the cost volume *C*, we implemented only the horizontal and vertical paths *Lr* in order to keep the computing time low. Thus, the aggregation is performed in the four directions left-to-right, right-to-left, top-to-bottom, and bottom-to-top. The recursive calculation prevents parallelization within a path. However, we exploit the independence of different paths in the same direction, i.e., different rows in the case of the horizontal paths and different columns in the case of the vertical paths.

For instance, a kernel was implemented for the aggregation in the left-to-right direction, which is called with a specific column number. Within that column, the parallelization occurs over the rows and the disparity space of the cost volume. In this way, each thread is responsible for the aggregation of the cost at a particular row and disparity. The kernel is called for every column from left to right, with the first column being initialized by the initial matching costs. The procedure is analogous for the three other kernels of the remaining path directions.

Subtracting the minimum path cost *mink Lr*(*p* − *r*, *k*) in each step of the cost aggregation limits the maximum cost along a path to *Cmax* + *P*2. Thus, the aggregated costs never exceed the range of 16 bits. Using images with a size of 3840 × 2748 pixels, a disparity range of 80 values, and a 2-byte data type for the aggregated costs, the total memory consumption can be estimated at around 1.6 GB for each path.
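
A NumPy sketch of the left-to-right aggregation path, vectorized over rows and disparities and sequential over columns, mirrors the parallelization described above; the penalty values are placeholders.

```python
import numpy as np

def aggregate_left_to_right(cost, P1=10, P2=120):
    """SGM cost aggregation along the left-to-right path (Equation (6)).

    cost: h x w x d initial cost volume. Rows and disparities are independent and can
    be processed in parallel (the per-thread work of the CUDA kernel); the column
    loop is inherently sequential because of the recursion.
    """
    h, w, d = cost.shape
    L = np.zeros_like(cost, dtype=np.uint16)
    L[:, 0, :] = cost[:, 0, :]                        # first column: initial matching cost
    for x in range(1, w):
        prev = L[:, x - 1, :].astype(np.int32)        # h x d costs of the previous column
        prev_min = prev.min(axis=1, keepdims=True)
        # candidates: same disparity, +/- 1 disparity with penalty P1, any jump with P2
        same = prev
        up = np.pad(prev[:, 1:], ((0, 0), (0, 1)), constant_values=2**15) + P1
        down = np.pad(prev[:, :-1], ((0, 0), (1, 0)), constant_values=2**15) + P1
        jump = prev_min + P2
        best = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        L[:, x, :] = cost[:, x, :] + best - prev_min  # subtraction keeps the costs bounded
    return L
```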

### C Disparity Selection:

In the simplest case, the disparity *d* for each pixel *p* can finally be chosen based on the minimum cost in the aggregated cost volume. However, a point cloud subsequently generated from that disparity map shows banding artifacts, since there are only 80 possible disparity values, which in turn leads to only 80 different depth coordinates.

Hence, we use sub-pixel interpolation, as proposed in [51], when choosing the disparity values in order to obtain a steady and continuous point cloud. Accordingly, the disparity with the minimum aggregated cost and its two neighbors are used to fit a parabola. The disparity is finally chosen as the abscissa at which the parabola attains its minimum. Thus, the interpolated disparity for a pixel *p* corresponds to the vertex of the parabola and can be expressed as:

$$d\_{\min\\_sub} = d\_{\min} - \frac{S(p, d\_{\min}+1) - S(p, d\_{\min}-1)}{2 \cdot S(p, d\_{\min}-1) + 2 \cdot S(p, d\_{\min}+1) - 4 \cdot S(p, d\_{\min})}, \tag{8}$$

where *dmin* represents the integer disparity with the minimum aggregated cost.
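
A sketch of the disparity selection with sub-pixel refinement according to Equation (8); the boundary handling and the guard against a zero denominator are implementation choices not specified in the text.

```python
import numpy as np

def select_disparity_subpixel(S):
    """WTA disparity from the aggregated cost volume S, refined by a parabola fit (Eq. (8))."""
    h, w, d_range = S.shape
    d_min = S.argmin(axis=2)                     # integer winner-takes-all disparity
    d_c = np.clip(d_min, 1, d_range - 2)         # ensure both neighbours exist
    ys, xs = np.mgrid[0:h, 0:w]
    s_m = S[ys, xs, d_c - 1].astype(np.float32)  # S(p, d_min - 1)
    s_0 = S[ys, xs, d_c].astype(np.float32)      # S(p, d_min)
    s_p = S[ys, xs, d_c + 1].astype(np.float32)  # S(p, d_min + 1)
    denom = 2 * s_m + 2 * s_p - 4 * s_0
    safe = np.where(np.abs(denom) < 1e-6, 1.0, denom)
    correction = np.where(np.abs(denom) < 1e-6, 0.0, (s_p - s_m) / safe)
    return d_c - correction                      # vertex of the fitted parabola
```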

### Disparity Map Fusion and Point Cloud Generation

The actual depth *z* of a point can be derived in the following way:

$$z = \frac{b \cdot f}{d},\tag{9}$$

with *d* the disparity of the point, *f* the focal length of the camera, and *b* the baseline between the two images.

However, the disparity maps are generated individually and usually have overlapping areas. Thus, if the disparity maps were triangulated independently, multiple 3D points would be generated for the same physical object point. To avoid this, a three-dimensional voxel grid could be built up to filter multiple points. However, since our case corresponds to a 2.5D digital elevation model, we build up a two-dimensional regular grid with evenly sized cells parallel to the image planes. The points of the disparity maps are triangulated into these cells. If multiple points fall into a particular cell, we simply pick the median of their depths. Accordingly, for each cell we obtain at most one depth value.
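
The fusion into a 2D grid with a per-cell median depth can be sketched as follows; grid origin, cell size, and the dictionary-based bucketing are illustrative choices not taken from the paper.

```python
import numpy as np

def fuse_depths(points_xyz, cell_size, x_min, y_min, grid_w, grid_h):
    """Fuse triangulated points from overlapping disparity maps on a 2D grid.

    points_xyz: Nx3 array of triangulated points (all disparity maps concatenated).
    Each point falls into one cell of a regular XY grid; if several points share a
    cell, the median of their depths is kept, so every cell holds at most one depth.
    """
    depth_grid = np.full((grid_h, grid_w), np.nan)
    cols = ((points_xyz[:, 0] - x_min) / cell_size).astype(int)
    rows = ((points_xyz[:, 1] - y_min) / cell_size).astype(int)
    valid = (cols >= 0) & (cols < grid_w) & (rows >= 0) & (rows < grid_h)
    buckets = {}
    for r, c, z in zip(rows[valid], cols[valid], points_xyz[valid, 2]):
        buckets.setdefault((r, c), []).append(z)
    for (r, c), depths in buckets.items():
        depth_grid[r, c] = np.median(depths)
    return depth_grid
```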

## 5.2. Adapting Roughness Parameter to 3D Point Clouds

Reconstructed dense point clouds allow for the analysis of the surface topography, which can be used, for example, to estimate the roughness. The roughness can be estimated either from single extracted profile lines (as in the conventional case) or from the entire point cloud for an area-based determination. For this purpose, we adapted the existing parameters (see Section 3.1.2) to 3D point clouds.

### Arithmetical Mean Deviation *Ra*

For the calculation of the parameter *Ra*, the reference plane to which the mean deviation of the points is subsequently calculated first has to be determined. In general, the plane should be estimated by minimizing the squared Euclidean distances of the measured surface points to the plane. This corresponds to orthogonal distance regression (ODR), which, unlike the ordinary least squares (OLS) regression model, also accounts for errors in the independent variables *x* and *y* in addition to the dependent variable *z*. However, the main axes of the point cloud are basically parallel to the XY-plane of the coordinate system, since the surface is captured parallel to the image plane, and hence the difference between OLS and ODR becomes negligible. Consequently, we use OLS to estimate the plane such that the sum of squared vertical distances of the points to the plane becomes minimal.

With the points *pi* = (*xi*, *yi*, *zi*) of the point cloud, we first set up an over-determined equation system for the plane *π*:

$$
\underbrace{\begin{bmatrix} x\_1 & y\_1 & 1\\ x\_2 & y\_2 & 1\\ \vdots & \vdots & \vdots\\ x\_n & y\_n & 1 \end{bmatrix}}\_A \underbrace{\begin{bmatrix} a\\b\\c \end{bmatrix}}\_x = \underbrace{\begin{bmatrix} z\_1\\z\_2\\\vdots\\z\_n \end{bmatrix}}\_b.\tag{10}
$$

where *a* and *b* are the slopes of the plane in the *x*- and *y*-direction and *c* is the intercept of the plane on the *z*-axis. Subsequently, a closed-form solution of the equation system is obtained by multiplying both sides of the equation with the left pseudo-inverse of the design matrix *A*. Thus, we estimate the parameters of the regression plane *π̂* as

$$
\begin{bmatrix} \hat{a} \\ \hat{b} \\ \hat{c} \end{bmatrix} = \left(A^T A\right)^{-1} A^T b.\tag{11}
$$

Finally, the parameter *Ra* is calculated by numerical integration over the points. Accordingly, we sum the vertical distances of the points *pi* to the regression plane *π̂*, which are basically the absolute values of the residuals *êi*, and divide the sum by the total number of points *n*:

$$R\_{a,\text{ pointcloud}} = \frac{1}{n} \sum\_{i=1}^{n} d\_{\text{vertical}}(p\_i, \hat{\pi}) = \frac{1}{n} \sum\_{i=1}^{n} \left|\hat{e}\_i\right|. \tag{12}$$
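
A compact sketch of the adapted *Ra* computation following Equations (10)–(12), assuming the vertical distances are taken as the absolute residuals of the OLS plane fit.

```python
import numpy as np

def ra_from_point_cloud(points):
    """Arithmetical mean deviation Ra of a point cloud (Equations (10)-(12)).

    points: Nx3 array (x_i, y_i, z_i). An OLS plane z = a*x + b*y + c is fitted by
    minimizing the squared vertical distances; Ra is the mean absolute residual.
    """
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    b = points[:, 2]
    # Least-squares solution, equivalent to (A^T A)^{-1} A^T b in Equation (11).
    plane, *_ = np.linalg.lstsq(A, b, rcond=None)
    residuals = b - A @ plane                  # signed vertical distances to the plane
    return np.mean(np.abs(residuals))
```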
