We use the Middlebury and KITTI datasets to validate typical cost aggregation approaches, namely BF [12], GF [11], NL [16], DT [20], LRNL [28], FCGF [18], and ST [17]. The input stereo images are first normalized to [0, 1] and then used to compute matching costs at all candidate disparities. We use the implementations in cross-scale cost aggregation [10] for BF, GF, NL, and ST, and implement FCGF following the description in [18]. Default parameters are used in these approaches to generate satisfactory results. We set the two filtering parameters to 0.05 and 10 on the Middlebury dataset and halve them on the KITTI dataset, as there is a large portion of textureless regions in the KITTI dataset.
4.1.1. Middlebury Dataset
The Middlebury dataset contains various indoor scenes captured under a controlled environment, with pixel-level-accurate ground truth generated by structured light. We adopt pixel-based truncated absolute differences of both the color vector and the gradient to measure the proximity of matching pixels. Thus, the initial matching cost of pixel p at disparity l can be computed as

C(p, l) = (1 − α) · min(‖I(p) − I′(p′)‖, τ₁) + α · min(|∇ₓI(p) − ∇ₓI′(p′)|, τ₂),

where I(p) and I′(p′) are the color vectors of the two candidate matching pixels p and p′ = p − l in the left and right images, and ∇ₓ is the gradient operator in the x direction. α balances the gradient and color terms, and τ₁ and τ₂ correspond to the truncation values of these two terms. In our experiments, α, τ₁, and τ₂ are 0.89, 7/255, and 2/255, respectively. Both the left and right images are used as guidance images to generate the corresponding disparity images. We use the left–right consistency check to classify pixels as stable or unstable, and the non-local refinement method [16] is adopted to propagate reliable disparities from stable pixels to unstable ones.
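As a concrete illustration, the truncated color-and-gradient matching cost described above can be sketched in a few lines of numpy. The array layout (H × W × 3 images in [0, 1]) and the border policy for columns exposed by the disparity shift are assumptions of this sketch, not part of the original implementation.

```python
import numpy as np

def matching_cost(left, right, disp, alpha=0.89, tau1=7/255, tau2=2/255):
    """Truncated absolute-difference cost over color and x-gradient terms.

    left, right: H x W x 3 images normalized to [0, 1]; disp: integer disparity.
    Columns exposed by the shift reuse the right image's edge column
    (an assumption of this sketch).
    """
    H, W, _ = left.shape
    # Shift the right image by `disp` pixels so column x aligns with x - disp.
    shifted = np.empty_like(right)
    shifted[:, disp:] = right[:, :W - disp] if disp > 0 else right
    shifted[:, :disp] = right[:, :1]
    # Color term: mean absolute channel difference, truncated at tau1.
    color = np.minimum(np.abs(left - shifted).mean(axis=2), tau1)
    # Gradient term: absolute difference of horizontal gray-level gradients,
    # truncated at tau2.
    grad_l = np.gradient(left.mean(axis=2), axis=1)
    grad_r = np.gradient(shifted.mean(axis=2), axis=1)
    grad = np.minimum(np.abs(grad_l - grad_r), tau2)
    # alpha = 0.89 weights the gradient term, as in the experiments above.
    return (1 - alpha) * color + alpha * grad
```

Looping `disp` over all candidate disparities stacks these maps into the cost volume that the aggregation methods then filter.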
Table 3 lists the error rates in non-occluded and all regions for typical cost aggregation methods. Our method outperforms BF, LRNL, FCGF, NL, DT, and ST in all metrics. NL, ST, and FCGF are all tree-based non-local filters. ST enforces tight connections between pixels in a neighborhood by building pixel-level and region-level tree structures, so the over-smoothing problem of NL is alleviated, leading to better results. FCGF reduces the support from neighboring pixels and generates noisy disparity images. DT can only propagate informative messages along the horizontal or vertical direction in each pass, so it produces the worst results among these approaches. Both BF and GF take advantage of local structures to evaluate the similarity of neighboring pixels, but GF adopts a linear model that preserves the structure of the guidance image, so it produces more satisfactory results. We perform cost aggregation on two spatial trees, which balances the information propagated along all directions. After post-processing, our method generates results competitive with or even better than those of GF. Moreover, our approach is more efficient than GF, as shown in Table 2.
Figure 3 presents the final disparity images of typical cost aggregation methods. All these approaches can correctly estimate the disparities of pixels in highly textured regions. BF and GF assume that all pixels in the support window lie on the same disparity plane. NL and FCGF utilize the MST of the input image to build long-range connections. However, they tend to over-rely on the piece-wise constant assumption. Thus, BF, GF, NL, and FCGF produce many erroneous disparities on slanted surfaces, as indicated by the boxes in the first row of Figure 3b–e. Compared with BF and GF, NL and FCGF weaken the constraints from neighboring pixels. Hence, NL and FCGF produce better results in regions containing fine-scale details, as indicated by the boxes in the second row of Figure 3b–e. Our method overcomes these shortcomings by recursive filtering and by balancing the information propagated in all directions. Thus, our approach can successfully preserve fine-scale details in highly textured regions and alleviate the over-smoothing problem on slanted surfaces, as shown in Figure 3f.
4.1.2. KITTI Dataset
Both the KITTI 2012 [25] and KITTI 2015 [26] datasets consist of real-world street views of highways and rural areas captured by a driving car under natural conditions, so there is a large portion of textureless regions in these stereo pairs. We adopt both the Census Transform and the correlation of deep features extracted by CNNs to compute matching costs, and then evaluate the performance of these two cost functions under different cost aggregation methods. The main idea of the Census Transform is to use a string of bits to characterize the pixels in a local window. Supposing the size of the local window is n × n, the Census Transform of pixel p can be defined as

B(p) = ⊗_{q ∈ N(p)} ξ(I(p), I(q)),

where B(p) is the string of bits for pixel p after the Census Transform, ⊗ is the bit-wise concatenation operation, N(p) is the n × n window centered at p, and ξ(I(p), I(q)) is 1 if I(q) < I(p) and 0 otherwise. The Hamming distance of two strings of bits is utilized to measure the similarity of pixels. The matching cost of pixel p at disparity l is

C_h(p, l) = Hamming(B(p), B′(p′)),

where B(p) and B′(p′) are the two strings of bits for the matching pixels p and p′ = p − l in the left and right images, Hamming(·, ·) is the operation that calculates the Hamming distance, and C_h(p, l)
indicates the matching cost of the handcrafted feature. As for features extracted by CNNs, we directly use the correlation of the left and right features to evaluate the similarity of matching pixels. PSMNet [24] combines spatial pyramid pooling and dilated convolution to enlarge the receptive field. The resultant local and global features aggregate context information at different scales and locations and are widely used in many state-of-the-art methods [23,30,31]. We also use PSMNet to extract deep features in our experiments. Denoting the learned features of the left and right images as f and f′, the matching cost of pixel p at disparity l can be expressed as

C_d(p, l) = 1 − ⟨f(p), f′(p′)⟩ / (‖f(p)‖₂ ‖f′(p′)‖₂),

where C_d is the matching cost of the features extracted by CNNs, f(p) and f′(p′) are the deep feature vectors of the matching pixels, and ‖·‖₂ is the ℓ₂ norm. Typical cost aggregation methods are then utilized to improve the robustness of the matching costs. In the refinement step, the left–right consistency check is used to identify mismatched pixels, and each mismatched pixel is assigned the lowest disparity value of the spatially closest matched pixels on the same scanline. Finally, we adopt the weighted median filter to remove streak artifacts. The error threshold in our evaluation is 3 px.
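A minimal sketch of the two cost functions follows. The window radius, the edge padding, the use of 64-bit integers as bit strings, and the exact normalization of the feature correlation (written here as a cosine-style correlation turned into a cost) are assumptions of this sketch rather than details fixed by the text.

```python
import numpy as np

def census(img, radius=2):
    """Census Transform: encode each pixel as a bit string recording, for each
    neighbor in the (2r+1) x (2r+1) window, whether that neighbor is darker
    than the center. Returns H x W uint64 codes (24 bits for radius=2)."""
    H, W = img.shape
    pad = np.pad(img, radius, mode='edge')
    code = np.zeros((H, W), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = pad[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            code = (code << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return code

def hamming_cost(code_l, code_r, disp):
    """Handcrafted cost: Hamming distance between the bit string of pixel p in
    the left image and that of p - disp in the right image."""
    H, W = code_l.shape
    shifted = np.empty_like(code_r)
    shifted[:, disp:] = code_r[:, :W - disp] if disp > 0 else code_r
    shifted[:, :disp] = code_r[:, :1]
    diff = code_l ^ shifted
    # Popcount: unpack each 64-bit code into bytes and count the set bits.
    return np.unpackbits(diff.view(np.uint8).reshape(H, W, 8), axis=2).sum(axis=2)

def correlation_cost(feat_l, feat_r, disp, eps=1e-8):
    """Deep-feature cost: one minus the normalized correlation of per-pixel
    feature vectors (H x W x C maps, e.g. from a CNN backbone)."""
    H, W, _ = feat_l.shape
    shifted = np.empty_like(feat_r)
    shifted[:, disp:] = feat_r[:, :W - disp] if disp > 0 else feat_r
    shifted[:, :disp] = feat_r[:, :1]
    num = (feat_l * shifted).sum(axis=2)
    den = np.linalg.norm(feat_l, axis=2) * np.linalg.norm(shifted, axis=2) + eps
    return 1.0 - num / den
```

Either cost map can be stacked over candidate disparities into a volume and handed to the aggregation methods compared below.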
Qualitative evaluation:
Figure 4 and Figure 5 present a comparison of GF and our method using the two kinds of features to compute the matching costs on the KITTI 2012 and KITTI 2015 datasets. GF can effectively filter out noise in textured regions by taking advantage of local structure in the guidance image, but it fails to preserve fine-scale details in real-world stereo pairs. The reason is that the intensity difference between foreground and background is indistinct in many cases, so the local linear regression model used in GF cannot accurately distinguish fine-scale details from the background. Moreover, GF fails to filter out the noise in large homogeneous regions, such as the sky, because there is no informative message in these areas and the matching costs at different candidate disparities become nearly identical. Our method performs non-local cost aggregation on two complementary spatial tree structures, and the geodesic distances in both the spatial and intensity spaces are used to evaluate the similarity of pixels along these two trees. Therefore, our method propagates informative messages across the entire image to deal with challenging cases, for example, the degradation of region-based cost functions for fine-scale structures and the lack of useful information in homogeneous areas.
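The core of such geodesic recursive aggregation can be illustrated by a single left-to-right pass along one scanline. The weight form and the parameter names sigma_s and sigma_r are assumptions of this sketch; the full method aggregates over two complementary spatial trees and both directions, not a single scanline.

```python
import numpy as np

def recursive_aggregate_1d(cost, guide, sigma_s=10.0, sigma_r=0.05):
    """One left-to-right pass of geodesic recursive cost aggregation.

    Each pixel accumulates the aggregated cost of its left neighbor, weighted
    by exp(-(1/sigma_s + |I(x) - I(x-1)|/sigma_r)): a geodesic step combining
    a unit spatial distance with the intensity difference, so propagation is
    damped wherever the guidance image has an edge."""
    out = cost.astype(float).copy()
    for x in range(1, len(cost)):
        w = np.exp(-(1.0 / sigma_s + abs(guide[x] - guide[x - 1]) / sigma_r))
        out[x] = cost[x] + w * out[x - 1]
    return out
```

A symmetric right-to-left pass, and passes along the second tree, would balance the information propagated in all directions.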
The results of each method using different features to compute the matching cost demonstrate that handcrafted features tend to generate more edge-aware disparity images, while deep features produce better results in textureless regions. The reason is that deep CNNs with a large receptive field can utilize textural information from a wide range, improving the robustness of deep features for pixels in homogeneous areas.
Quantitative evaluation:
Table 4 presents the error rate and average end-point error of typical cost aggregation methods on the KITTI 2012 dataset. Our approach surpasses NL, ST, and FCGF in all metrics. Although the error rates of GF's initial and final disparity images are smaller than ours, the average end-point error of our method is close to or even smaller than that of GF. The main reason for this phenomenon is that only the disparities of pixels near the ground are provided in the ground-truth disparity images. Most of those valid pixels are located in highly textured regions, so GF can utilize the structural information in the local area to generate high-quality results. Our method is superior to GF in homogeneous regions, as shown in Figure 4 and Figure 5. HASR incorporates edge information with CT to further compensate for radiometric changes and adopts a hierarchical cross-based cost aggregation scheme to fuse costs at multiple scales. These strategies give their method the lowest error rate in non-occluded regions. However, its end-point errors in non-occluded and all areas are 1.9 px and 2.9 px, which are 0.3 px and 1.02 px larger than ours. The reason is that they combine multi-scale costs with an exponential function, while we take advantage of inter-scale regularization [10].
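For reference, the two metrics reported in these tables, the error rate under a 3 px threshold and the average end-point error, can be computed as follows; evaluating only pixels with valid ground truth is an assumption matching the sparse KITTI disparity maps.

```python
import numpy as np

def evaluate(disp, gt, valid, threshold=3.0):
    """KITTI-style metrics, computed only on pixels with valid ground truth.

    Returns (error rate, average end-point error): the fraction of valid
    pixels whose absolute disparity error exceeds `threshold` (3 px here),
    and the mean absolute disparity error over the same pixels."""
    err = np.abs(disp - gt)[valid]
    error_rate = float((err > threshold).mean())
    avg_epe = float(err.mean())
    return error_rate, avg_epe
```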
Table 5 lists the results of GF, NL, FCGF, ST, LRNL, and our method using deep features extracted by PSMNet [24] to compute the matching cost. Compared with the results generated with the handcrafted feature, the average error rate and end-point error of GF, FCGF, NL, and ST in non-occluded regions decreased from 12.54% and 2.50 px to 9.27% and 2.32 px, improvements of 26.08% and 7.20%, respectively. The average error rates of the final disparity images are also lower than those of the handcrafted feature. The reason is that CNNs can incorporate information from a large receptive field to improve the robustness of features in textureless areas. Our method outperforms LRNL in all metrics by a clear margin. It turns out that keeping the information propagated along all directions in balance is vital for improving the quality of disparity images.
Figure 6 presents examples of disparity images and the corresponding error maps of our method using the handcrafted feature and the deep features to compute the matching cost. Our approach correctly estimates the disparities of most pixels in the guidance image. Pixels with significant errors are mainly located in occluded regions, as they have no matching pixels in the other view. Comparing the error maps of these two kinds of features, the average end-point error of pixels in occluded areas is smaller for the handcrafted feature than for the deep ones. The reason is that the matching costs of the handcrafted feature are less distinctive than those of the deep features, so it is easier to propagate reliable information from non-occluded regions to occluded areas.
Table 6 presents the results of different approaches using the handcrafted feature to compute the matching cost on the KITTI 2015 dataset [26]. Our method achieves the best results in most metrics. Compared with FCGF, the average end-point errors of our initial disparity images in all and non-occluded regions are decreased by 0.63 px and 0.66 px, and the error rates in these regions are reduced by about 4.3% and 4.4%, respectively. Although the error rates of GF are lower than ours after refinement, we achieve the lowest average end-point error among these methods, which indicates that the end-point errors of most outliers of our approach are smaller than those of GF.
Table 7 lists the results of typical methods on the KITTI 2015 dataset using deep features extracted by PSMNet [24] to compute the matching cost. Our approach surpasses all other methods in all metrics. Comparing our statistics with those of the other non-local cost aggregation methods on both KITTI 2012 [25] and KITTI 2015 [26], our method achieves the best performance. Thus, we can conclude that the geodesic distance, which depends on the relative spatial relationship, can be a better measure than the pixel similarity defined on an MST or its variants. The reason could be that the geodesic distance helps to weaken the piece-wise constant assumption, leading to more accurate disparities in regions composed of large slanted planes, such as roads.
Figure 7 shows disparity images and the corresponding error maps of our method using both the handcrafted feature and the deep features to compute the matching cost on the KITTI 2015 dataset [26]. We can see that the matching costs generated by the deep features can handle radiometric variations in real-world images. However, they struggle to propagate useful information to occluded areas at the image border, resulting in a higher average end-point error.