Appendix A. Finding the Optimal Number of Input Images and Pyramid Levels
In this section, we demonstrate the need for the hierarchical processing scheme within the presented approach and evaluate different configurations of the input bundle size. The objective is to find the number n of Gaussian pyramid levels and the size of the input bundle that together provide a good trade-off between:
the error of the resulting depth maps, measured by L1-abs and L1-rel;
the sampling density and the entailed resources needed for the computation;
the resulting processing runtime.
For this experiment, a fronto-parallel plane orientation is used as part of the plane-sweep image matching and the NCC with a support region of
pixels is set as similarity measure. The optimization of the cost volume and the extraction of the optimal depth map are performed by employing the SGM^Π scheme, which is the adaptation of the standard SGM optimization to the use of plane-sweep image matching (see Section 2.2). The smoothness penalty within the SGM optimization is set to
, while the adaptive
penalty is used. This, together with the
sized NCC as matching cost, was chosen in accordance with the work of Scharstein et al. [41]. To find the appropriate height of the Gaussian pyramids, the size of the input bundle, i.e., the number of input images, is set to
Table A1 lists the mean errors of the estimated depth maps when evaluated with different numbers of pyramid levels on the datasets of both the DTU and 3DOMcity benchmarks. The absolute and relative L1 measures are used, averaged over all depth maps within each dataset. It is to be expected that the omission of any hierarchical processing, i.e., the use of only one pyramid level and thus no coarse-to-fine processing, would lead to the smallest error between the estimate and the ground truth. However, the results reveal that in the case of the DTU dataset, the smallest mean error, even if it is only slightly smaller, is achieved when setting n = 3, while the best result in the case of the 3DOMcity dataset is achieved at n = 1.
Table A1.
Mean errors achieved on the DTU and 3DOMcity datasets for a different number of Gaussian pyramid levels (n) as part of the hierarchical processing scheme. The error metrics used are the absolute L1-abs, measured in mm, as well as the relative L1-rel measure. Both are averaged over all evaluated depth maps within each dataset. The best results are underlined.
Dataset | Metric | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 |
DTU | L1-abs (in mm) | 26.394 ± 24.262 | 26.221 ± 23.835 | 23.473 ± 19.656 | 25.045 ± 19.298 | 29.676 ± 19.436 |
DTU | L1-rel | 0.036 ± 0.032 | 0.036 ± 0.032 | 0.032 ± 0.026 | 0.034 ± 0.026 | 0.041 ± 0.026 |
3DOMcity | L1-abs (in mm) | 12.789 ± 6.916 | 14.936 ± 6.754 | 21.801 ± 8.010 | 32.458 ± 9.408 | 47.422 ± 22.292 |
3DOMcity | L1-rel | 0.010 ± 0.006 | 0.012 ± 0.006 | 0.017 ± 0.007 | 0.026 ± 0.009 | 0.037 ± 0.014 |
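The two error measures reported in Table A1 can be illustrated with a minimal sketch. It assumes the common definitions, i.e., L1-abs as the mean absolute difference between estimate and ground truth and L1-rel as the same difference divided by the ground-truth depth. The function name `l1_errors`, the flat list representation of a depth map and the convention that non-positive ground-truth values mark invalid pixels are assumptions made for illustration only.

```python
def l1_errors(estimate, ground_truth):
    """Return (L1-abs, L1-rel), averaged over pixels with valid ground truth."""
    abs_errors = []
    rel_errors = []
    for d_est, d_gt in zip(estimate, ground_truth):
        # Assumption: a non-positive ground-truth depth marks an invalid pixel.
        if d_gt <= 0:
            continue
        diff = abs(d_est - d_gt)
        abs_errors.append(diff)
        rel_errors.append(diff / d_gt)
    if not abs_errors:
        raise ValueError("no valid ground-truth pixels")
    count = len(abs_errors)
    return sum(abs_errors) / count, sum(rel_errors) / count


# Hypothetical depth values in mm; the last pixel has no ground truth.
estimate = [1010.0, 990.0, 2050.0, 500.0]
ground_truth = [1000.0, 1000.0, 2000.0, 0.0]
l1_abs, l1_rel = l1_errors(estimate, ground_truth)
print(f"L1-abs: {l1_abs:.3f} mm, L1-rel: {l1_rel:.3f}")
```

Averaging these per-depth-map values over all depth maps of a dataset would then yield figures of the kind listed in the table.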
As described in Section 2.1.3, the plane distances within the plane-sweep sampling, and thus the sampling points, are selected in such a way that two consecutive planes induce a maximum disparity difference of 1 pixel. Depending on the capturing setup, i.e., the relative poses between the images and their obliqueness and, in turn, the range of the scene depth, this can lead to a very high number of sampling points and with it to a large memory consumption, as the dimensions of the three-dimensional cost volume need to be set accordingly. Thus, in order to not exceed the memory limit, the maximum number of sampling points for the highest pyramid level is restricted to 256 in the implementation of the approach. In the case of the camera setup of the DTU dataset and the configuration of this experiment, i.e., having a bundle size of
, a pyramid height of 3 is the smallest height at which the number of sampling points at the highest level stays below the set limit, as Table A2 shows. Comparing Table A1 and Table A2 further reveals that, on both datasets, the best results are achieved when the highest pyramid level has a maximum of 128 sampling planes.
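The interplay between pyramid height and the plane limit can be sketched as follows. The sketch assumes that each additional pyramid level halves the image resolution and that, with a maximum disparity difference of 1 pixel between consecutive planes, the number of required planes halves accordingly. The full-resolution plane counts (512 for DTU, 128 for 3DOMcity) are hypothetical values chosen so that the output matches the plane counts reported in Table A2; they are not stated in the text. The function name `planes_at_top_level` is likewise illustrative.

```python
import math

MAX_PLANES = 256  # limit for the highest pyramid level, as stated in the text


def planes_at_top_level(full_res_planes, num_levels, limit=MAX_PLANES):
    """Number of sampling planes at the highest (coarsest) pyramid level.

    Assumption: every additional level halves the resolution, so a 1-pixel
    disparity step covers twice the depth range and the plane count halves.
    """
    scale = 2 ** (num_levels - 1)
    return min(limit, math.ceil(full_res_planes / scale))


# Hypothetical full-resolution plane counts (not given in the text).
dtu = [planes_at_top_level(512, n) for n in range(1, 6)]
city = [planes_at_top_level(128, n) for n in range(1, 6)]
print(dtu)
print(city)
```

Under these assumptions, the limit of 256 is active for n = 1 and n = 2 on DTU and is first undercut at n = 3, whereas it is never reached on 3DOMcity.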
Table A2.
Processing runtime measured for different configurations of the pyramid height on the DTU and 3DOMcity datasets. In addition, the maximum number of sampling planes with which the scene was sampled at the highest pyramid level is stated.
Dataset | Metric | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 |
DTU | Runtime (in ms) | 2365 ± 15 | 1315 ± 10 | 386 ± 2 | 220 ± 2 | 187 ± 1 |
DTU | Max. num. planes | 256 | 256 | 128 | 64 | 32 |
3DOMcity | Runtime (in ms) | 613 ± 3 | 431 ± 3 | 225 ± 1 | 196 ± 1 | 192 ± 1 |
3DOMcity | Max. num. planes | 128 | 64 | 32 | 16 | 8 |
Another criterion used to deduce the best configuration of the Gaussian pyramid height is the runtime needed to estimate a single depth map. Table A2 lists the corresponding measurements, i.e., the number of milliseconds it takes to estimate a single depth map given a certain number of pyramid levels, as well as the number of planes used for sampling the scene space at the highest pyramid level. The measurements again show that, up to n = 2 in the case of the DTU dataset, the number of sampling planes at the highest pyramid level is equal to the limit of 256 and that, with a smaller number of sampling points, the runtime decreases drastically. Furthermore, the significant drop of roughly one second in runtime between pyramid heights of 2 and 3 suggests that the reduced use of processing resources on the GPU increases the processing speed and that going from n = 2 to n = 3 significantly improves efficiency. Because a higher number of pyramid levels reduces not only the number of sampling points but also the image size at the highest pyramid level, and with it the number of pixels that need to be matched, hierarchical processing is, depending on the camera setup, very important in order to ensure a high sampling density of the scene space while at the same time efficiently utilizing the processing hardware and, in turn, achieving high processing speeds. In the case of the DTU dataset, this experiment shows that the best number of pyramid levels is n = 3, which is thus used for the subsequent experiments. In the case of the 3DOMcity dataset, Table A1 suggests that the best configuration is to use the original image size. A hierarchical processing scheme is needed, however, in order to use SGM^Π-sn, the extension of the SGM algorithm that considers local surface orientations in order to account for slanted surfaces. Thus, in the case of the 3DOMcity dataset, the subsequent experiments are executed with n = 2, which induces only a slightly higher mean error than the best configuration.
In the second part of this experiment, the effects of a different number of input images and, in turn, the optimal size of the input bundle are evaluated. Here, the settings for the plane-sweep image matching and the subsequent SGM optimization are kept the same as before. The height of the Gaussian pyramids is fixed to n = 3 in the case of the DTU dataset and n = 2 in the case of the 3DOMcity dataset. Table A3 lists the mean errors achieved on both datasets with a different number of input images, as well as the difference in runtime with respect to the best configuration of the first part of the experiment. The results reveal that the best accuracies are achieved when five input images are used for image matching, even though, in the case of the 3DOMcity dataset, the improvement is only marginal. As expected, the use of more input images in the process of image matching also leads to an increase in runtime, since more pixels are matched. At the same time, however, there is more time available to keep up with the image acquisition, as discussed in Section 4.3. In conclusion, in the subsequent experiments, the size of the input bundle is set to 5, while the height of the Gaussian image pyramids is set to n = 3 and n = 2 in the case of the DTU and 3DOMcity datasets, respectively.
Table A3.
Mean errors achieved on the DTU and 3DOMcity datasets for different input bundle sizes, i.e., numbers of images. In addition, the differences in runtime with respect to the measurements of the first part are stated. The best results are underlined.
Dataset | Metric | | 5 | |
DTU | L1-abs (in mm) | 23.473 ± 19.656 | 19.832 ± 16.225 | 21.843 ± 21.605 |
DTU | L1-rel | 0.032 ± 0.026 | 0.027 ± 0.021 | 0.031 ± 0.031 |
DTU | Runtime (in ms) | | +271 | +302 |
3DOMcity | L1-abs (in mm) | 14.936 ± 6.754 | 14.615 ± 6.254 | 16.514 ± 7.569 |
3DOMcity | L1-rel | 0.012 ± 0.006 | 0.012 ± 0.007 | 0.014 ± 0.009 |
3DOMcity | Runtime (in ms) | | +360 | +410 |
Appendix B. Evaluating Different Similarity Measures in the Process of Dense Multi-Image Matching
As part of the plane-sweep multi-image matching, this approach comprises two different similarity measures and cost functions: the Hamming distance of the census transform (CT) as well as the truncated, inverted and scaled normalized cross-correlation (NCC). While the CT is computationally less expensive than the NCC and is thus more suitable for real-time or online processing, it is less discriminative, which might result in a more ambiguous set of matched pixel correspondences. When working with a stereo normal case, in which the input images suffer from only little perspective distortion induced by homographic transformations, the CT outperforms the NCC in both runtime and accuracy [36]. However, as the results in Table A4 show, the perspective distortion, resulting from the warping of images from converging cameras by means of the plane-induced homography within the plane-sweep algorithm, leads to a significant increase in error when using the CT as a similarity measure instead of the NCC.
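To make the two cost functions more concrete, the following minimal sketch computes both for a pair of small image patches. It is not the implementation of the presented approach. In particular, the bit ordering of the census string and the mapping of the NCC to a cost (negative correlations truncated to zero, then inverted and scaled to the range [0, 255]) are assumptions about what "truncated, inverted and scaled" means, and all function names are illustrative.

```python
import math


def census_bits(patch):
    """Census transform of a 2D patch: one bit per non-centre pixel."""
    height, width = len(patch), len(patch[0])
    centre = patch[height // 2][width // 2]
    bits = 0
    for r in range(height):
        for c in range(width):
            if r == height // 2 and c == width // 2:
                continue  # the centre pixel is the reference, not a bit
            bits = (bits << 1) | (1 if patch[r][c] < centre else 0)
    return bits


def census_cost(patch_a, patch_b):
    """Hamming distance between the census bit strings of two patches."""
    return bin(census_bits(patch_a) ^ census_bits(patch_b)).count("1")


def ncc_cost(patch_a, patch_b, max_cost=255.0):
    """Assumed cost mapping: max_cost * (1 - max(0, NCC))."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    ncc = num / den if den > 0 else 0.0
    return max_cost * (1.0 - max(0.0, ncc))


patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
inverted = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(census_cost(patch, patch), ncc_cost(patch, patch))
print(census_cost(patch, inverted), ncc_cost(patch, inverted))
```

The census cost only needs comparisons, an XOR and a bit count, which reflects why it is cheaper than the NCC, while the NCC retains the actual intensity relationship between the patches.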
Table A4.
Mean errors achieved on the DTU and 3DOMcity datasets when using different similarity measures and cost functions with different support regions. The best results are underlined.
Dataset | Metric | | | | | | |
DTU | L1-abs (in mm) | 42.494 ± 39.112 | 42.136 ± 37.958 | 42.305 ± 36.394 | 26.229 ± 17.816 | 19.832 ± 16.225 | 19.667 ± 16.453 |
DTU | L1-rel | 0.056 ± 0.049 | 0.056 ± 0.048 | 0.057 ± 0.046 | 0.037 ± 0.024 | 0.027 ± 0.021 | 0.027 ± 0.021 |
3DOMcity | L1-abs (in mm) | 29.149 ± 17.272 | 22.128 ± 14.218 | 26.005 ± 14.106 | 26.678 ± 10.377 | 14.615 ± 6.254 | 13.789 ± 5.962 |
3DOMcity | L1-rel | 0.024 ± 0.016 | 0.019 ± 0.014 | 0.022 ± 0.014 | 0.023 ± 0.010 | 0.012 ± 0.007 | 0.011 ± 0.006 |
Apart from the two similarity measures, the effects of different support regions are also evaluated in this experiment. For each similarity measure, the most commonly used configurations were tested. A support region of a size of
pixels represents a good trade-off between uniqueness and computational complexity, while, in the case of the CT, a support region of a size of
pixels is the biggest size for which the bit string still fits into a single 64-bit integer. The configuration of the plane-sweep algorithm and the SGM optimization is set in accordance with the values from the first experiment (see Appendix A). In terms of the SGM penalties,
is set to 100 for all
, since the maximum matching cost of the NCC is normalized to 255, independent of the support region. For
, however,
is set to 3, 9 and 24, respectively, which is equivalent to the configuration for NCC, when considering the ratio between
and the maximum matching cost.
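The ratio argument for the penalties can be checked with a short calculation. The sketch below scales the NCC penalty of 100 (for a maximum matching cost of 255) to the maximum Hamming cost of a census window, which is taken here as the number of non-centre pixels. The window sizes 3 × 3, 5 × 5 and 9 × 7 are assumptions made for illustration; they are chosen because they are common census configurations, because the largest one still fits into 64 bits, and because they reproduce the stated values of 3, 9 and 24.

```python
NCC_MAX_COST = 255  # maximum NCC matching cost, as stated in the text
NCC_PENALTY = 100   # penalty used for all NCC configurations


def max_census_cost(width, height):
    """Maximum Hamming distance of a census window (centre pixel excluded)."""
    return width * height - 1


def scaled_penalty(max_cost, ref_penalty=NCC_PENALTY, ref_max_cost=NCC_MAX_COST):
    """Keep the ratio between penalty and maximum matching cost constant."""
    return round(ref_penalty * max_cost / ref_max_cost)


# Hypothetical census window sizes (width, height); not given in the text.
windows = [(3, 3), (5, 5), (9, 7)]
penalties = [scaled_penalty(max_census_cost(w, h)) for w, h in windows]
print(penalties)
```

Keeping this ratio fixed means that the smoothness term has a comparable influence relative to the data term for both similarity measures, so that differences in Table A4 can be attributed to the matching cost rather than to the regularization strength.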