1. Introduction
Binocular stereo matching aims to restore 3D information based on a pair of rectified 2D images obtained from the same scene. Due to its passive and low-cost sensing characteristics, the acquired depth information may play a vital role in the guidance of uncrewed aerial/ground vehicles, high-end security, surveillance, and various 3D manipulation, inspection, and measurement applications.
Traditional stereo matching algorithms can be categorized into global and local algorithms [1], depending on the extent of information used for matching evaluation. Local methods consider only neighboring pixels around each candidate pixel, while global methods exploit the entire image. Traditional matching algorithms normally begin by formulating a stereo matching cost that estimates the degree of match between reference patches and target patches.
Global stereo matching algorithms require computationally demanding optimization algorithms, such as graph cuts [2] and dynamic programming [3,4], to find the disparity of each pixel. To run such algorithms in real time, [5] implemented an integrated scheme combining a dynamic programming algorithm with a local algorithm on a graphics processing unit (GPU), while [6] implemented semi-global stereo matching on a field-programmable gate array (FPGA).
Unlike global algorithms, local and non-local/semi-global algorithms compute the stereo matching cost as an aggregation of primary matching costs. The aggregation is normally conducted as a filtering procedure, and the resultant disparity map is obtained through the winner-take-all (WTA) strategy [1]. Since the computational complexity of local algorithms is usually lower than that of other algorithms, they are widely used in practical applications.
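The WTA selection step can be sketched as follows. The (H, W, D) cost-volume layout and the function name are illustrative assumptions, not the implementation of any cited work:

```python
import numpy as np

def winner_take_all(cost_volume):
    """Select, for each pixel, the disparity candidate with the minimal
    aggregated matching cost.

    cost_volume: array of shape (H, W, D) holding aggregated costs for
    D disparity candidates (hypothetical layout).
    Returns an (H, W) integer disparity map.
    """
    return np.argmin(cost_volume, axis=2)

# Toy example: a 1x2 image with 3 disparity candidates per pixel.
costs = np.array([[[0.9, 0.2, 0.5],
                   [0.4, 0.7, 0.1]]])
print(winner_take_all(costs))  # [[1 2]]
```

The quality of the disparity map therefore hinges entirely on how well the costs were aggregated before this argmin step.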
Among the early efforts to use filters for cost volume aggregation in local algorithms, the matching performance of [7] was constrained by a fixed support window. This shortcoming was alleviated in [8] by introducing an adaptive window-size approach. Later, the Guided Image Filtering (GIF) model [9] was successfully implemented in [10], demonstrating an edge-preserving advantage. Building on [9], [11] further proposed a weighted guided image filter (WGIF) scheme to avoid halo artifacts, which was used in [12] for disparity estimation. As the matching performance of GIF depends on the size of the kernel window, [13] proposed an adaptive guided filtering method to exclude pixels that do not belong to the same region. In addition, an iterative guided filtering approach [14] and an adaptive support weight version [15] were developed to improve matching accuracy. More recently, [16] proposed weights based on structural features and filtered the matching cost volume using the adaptive guided filtering method of [13]. These are typical local stereo matching algorithms, which do not aggregate matching costs outside the support window.
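As a minimal illustration of window-based local aggregation, the following sketch averages the costs of each disparity slice over a fixed square support window. This is the uniform-weight baseline; edge-aware methods such as GIF replace the uniform weights with image-dependent ones. The function name and layout are hypothetical:

```python
import numpy as np

def box_aggregate(cost_volume, radius=1):
    """Aggregate matching costs over a fixed (2r+1)x(2r+1) support
    window by plain averaging (sketch of the simplest local scheme).

    cost_volume: array of shape (H, W, D); edge padding replicates
    border costs so window sums stay well defined at the boundary.
    """
    H, W, D = cost_volume.shape
    k = 2 * radius + 1
    padded = np.pad(cost_volume,
                    ((radius, radius), (radius, radius), (0, 0)),
                    mode="edge")
    out = np.zeros_like(cost_volume)
    for dy in range(k):          # accumulate all window offsets
        for dx in range(k):
            out += padded[dy:dy + H, dx:dx + W, :]
    return out / (k * k)
```

Because the window is fixed, this baseline blurs costs across depth discontinuities, which is precisely the problem the adaptive-window and guided-filter variants above address.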
In [17], matching costs were aggregated along tree structures individually derived from the entire image pair. Similarly, matching algorithms based on the so-called permeability filter [18] and pervasive guided image filtering [19] can effectively aggregate matching costs over the whole image. In addition, [20] integrated multi-scale information into the scheme of [19], significantly improving the matching performance. These algorithms use the whole image as the aggregation window and are called non-local stereo matching algorithms. To address matching ambiguity in low-texture areas and high sensitivity in high-texture areas, [21] proposed using both the local support window and the whole image.
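The scan-based flavor of non-local aggregation can be sketched with a one-directional, two-pass recursion over each cost slice. This is only in the spirit of permeability-style filters: here the transmission weight `mu` is a constant assumption, whereas the actual filters derive per-pixel weights from color differences:

```python
import numpy as np

def horizontal_scan_aggregate(cost_slice, mu=0.5):
    """Two-pass (left-to-right, then right-to-left) recursive
    aggregation of one H-by-W cost slice, so every pixel receives
    exponentially attenuated contributions from the whole row.
    `mu` is a constant transmission weight (illustrative)."""
    H, W = cost_slice.shape
    left = np.zeros_like(cost_slice)
    right = np.zeros_like(cost_slice)
    for x in range(W):                     # forward pass
        left[:, x] = cost_slice[:, x] + (mu * left[:, x - 1] if x > 0 else 0.0)
    for x in range(W - 1, -1, -1):         # backward pass
        right[:, x] = cost_slice[:, x] + (mu * right[:, x + 1] if x < W - 1 else 0.0)
    # each pixel's own cost is counted in both passes; subtract it once
    return left + right - cost_slice
```

A full non-local filter would run an analogous vertical pair of passes on the result, extending the support to the entire image at linear cost.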
In addition to these traditional algorithms, disparity maps can also be computed using deep learning-based methods, which offer high matching accuracy. For instance, [22] developed an autoencoder to generate feature maps for semi-global stereo matching, and [23] implemented an unsupervised disparity estimation neural network based on the principle of disparity consistency. Furthermore, [24] proposed a simplified independent component analysis (ICA)-based local similarity stereo matching algorithm to further improve matching accuracy in texture-less areas and at boundaries. Among the deep learning-based methods, [25,26] are typical end-to-end stereo matching networks, which realize the aggregation of matching costs through 3D convolution operations.
However, as pointed out in [27], deep learning-based stereo matching algorithms still suffer from insufficient generalization ability compared with traditional algorithms. Additionally, they typically require GPU-based computing resources. These characteristics justify continued research on traditional methods.
In most traditional local and non-local stereo matching algorithms, the weights for matching cost aggregation depend on the texture of the image pair. This dependency prevents the weights from being shared across different scenes. To improve the computational efficiency of stereo matching, this paper proposes a texture-independent aggregation method.
The main contributions of this paper are as follows:
- (1) We propose an aggregation algorithm for stereo matching that significantly simplifies computation without sacrificing matching performance. The aggregation weights can be shared between different scene images of the same resolution.
- (2) To provide higher matching accuracy, we integrate the algorithm with a multi-scale scheme that exploits the spatial distribution of texture, achieving improved performance with only a minor increase in computational effort.
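The multi-scale idea in contribution (2) can be illustrated with a simplified cross-scale combination of aggregated cost volumes. The nearest-neighbour upsampling and plain averaging below are assumptions for illustration; the actual scheme's combination weights differ:

```python
import numpy as np

def multiscale_combine(cost_volumes):
    """Combine aggregated cost volumes computed at several scales by
    upsampling each to the finest resolution and averaging (a
    simplified sketch of a cross-scale scheme).

    cost_volumes: list of (h, w, D) arrays, finest scale first.
    """
    H, W, D = cost_volumes[0].shape
    out = np.zeros((H, W, D))
    for cv in cost_volumes:
        h, w, _ = cv.shape
        # nearest-neighbour upsampling to the finest scale
        ys = np.arange(H) * h // H
        xs = np.arange(W) * w // W
        out += cv[np.ix_(ys, xs)]
    return out / len(cost_volumes)
```

Coarse scales contribute smooth, large-support evidence that stabilizes low-texture regions, while the finest scale preserves detail near depth edges.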
3. Results
To investigate the matching performance of the proposed scheme, performance comparisons have been made between eight representative stereo matching schemes and the proposed scheme.
We selected several stereo image pairs from the Middlebury version 3 [31] and KITTI Vision Benchmark Suite [32] datasets for demonstration. The “trainingQ” set of the Middlebury version 3 [31] dataset is composed of 15 groups of pictures, from “Adirondack” to “Vintage”. As summarized in Table 1, the resolutions of the images are around 480-by-720.
According to the previous section’s discussion, the design parameter is positively related to the image resolution under matching. In the following demonstrations, the values of
are simply assigned as
Figure 5 shows the disparity maps obtained by these stereo matching algorithms on four of the image sets.
Table 2 and Table 3 present the error rates and weighted error rates of the algorithms on the complete “trainingQ” set of the Middlebury version 3 [31] dataset. In the tables, the experimental results of the four algorithms used for comparison are obtained from the original literature. For ease of viewing, graphical representations of Table 2 and Table 3 are shown in Figure 6 and Figure 7, respectively.
The performance of FASW [21] and the proposed algorithm is more prominent on this dataset. Specifically, in the non-occluded region, the proposed algorithm has the lowest mismatch rate, followed by FASW [21] and the HGIF [20] algorithm. In the all region, the RTSMNet [25] algorithm has the lowest mismatch rate, followed by FASW [21] and the proposed algorithm.
However, a significant performance deterioration can be observed when RTSMNet [25] is used to predict the disparity maps of the “Jadeplant” and “Vintage” image sets. RTSMNet [25] is a deep learning-based method, so its generalization ability depends on the quality and richness of the training dataset; similar scenes may be rare in its training set. This behavior is not observed in the traditional algorithms because their performance is less sensitive to scene type.
The time required by each algorithm to match a rectified stereo image pair is summarized in Table 4. These computations were executed in MATLAB 2017b on an Intel Core i5-8300H CPU with 16 GB of RAM. Notably, the proposed algorithm requires the least computational time. Considering both matching accuracy and algorithmic complexity, the proposed algorithm is close to the best in accuracy while requiring the least computation.
To verify the stereo matching performance of the proposed algorithm in real scenes, the autonomous driving training dataset of the KITTI Vision Benchmark Suite [32] is used for further demonstration. The dataset contains 194 image pairs with corresponding ground-truth disparity maps. Performance comparisons are made between the following stereo matching schemes:
- Stereo matching based on adaptive guided filtering [13], denoted as AGF.
- Adaptive stereo matching using tree filtering [17], denoted as MST.
- The proposed scheme.
Figure 8 shows the disparity maps generated by these algorithms on the KITTI suite [32]. Each algorithm performs stereo matching on all 194 stereo image pairs; due to space constraints, only three pairs are selected for visual comparison.
Table 5 shows the quantitative matching performance of these algorithms. Four scores are compared: the percentage of erroneous pixels in non-occluded areas, denoted Non-Occ (%); the percentage of erroneous pixels in all regions, denoted All (%); the average disparity error in pixels in non-occluded areas, denoted Non-Occ (pixels); and the average disparity error in pixels in all regions, denoted All (pixels). The scores used for comparison are quoted from [21].
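For one image, the two score types can be computed as in the sketch below. The 3-pixel error threshold is the usual KITTI convention (an assumption here, since the text does not state it), and `valid_mask` selects either the non-occluded or all valid ground-truth pixels:

```python
import numpy as np

def disparity_scores(pred, gt, valid_mask, threshold=3.0):
    """Error-rate and average-error scores for one disparity map.

    pred, gt: (H, W) float disparity maps; valid_mask: (H, W) bool
    mask of pixels with ground truth (non-occluded or all regions).
    Returns (percentage of pixels with error > threshold,
             mean absolute disparity error in pixels).
    """
    err = np.abs(pred - gt)[valid_mask]
    bad_percent = 100.0 * np.mean(err > threshold)
    avg_error = float(np.mean(err))
    return bad_percent, avg_error
```

Dataset-level figures such as those in Table 5 are then averages of these per-image scores over all 194 pairs.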
As seen from Figure 8 and Table 5, compared with the five representative aggregation methods, the proposed algorithm outperforms the others in Non-Occ (%), All (%), and Non-Occ (pixels), with scores of 6.30%, 7.48%, and 1.3 pixels, respectively. Its performance is only slightly inferior to [21] in All (pixels), with a score of 1.58 pixels versus the 1.45 pixels of [21].
Figure 9 shows three disparity maps generated by the proposed algorithm using three sets of multi-view remote sensing images (pictures from https://github.com/whuwuteng/benchmark_ISPRS2021, accessed on 15 February 2019). These additional disparity maps demonstrate the effectiveness of the proposed approach.
4. Discussion
Traditional local and non-local stereo matching algorithms require an efficient aggregation procedure, which is the most critical stage for generating accurate dense disparity maps. Unfortunately, the aggregation weights in most stereo matching algorithms in the literature are scenario-specific.
We re-examined the procedure of cost aggregation from the perspective of matrix operations and treated the aggregation as constraining the degree of difference between adjacent costs. By decoupling the aggregation procedure in the horizontal and vertical directions, we propose a new aggregation algorithm to effectively calculate a cost volume for stereo matching. This process is equivalent to multiplying the initial cost by two constant matrices.
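The matrix-product view of the decoupled aggregation can be sketched as follows. The exponential-decay construction of the two constant matrices below is purely illustrative (the paper derives its own matrices, which are not reproduced here); what the sketch shows is the structure: one left multiplication for the vertical direction and one right multiplication for the horizontal direction:

```python
import numpy as np

def decay_matrix(n, sigma=2.0):
    """Constant, texture-independent aggregation matrix for one
    direction (illustrative: weights decay with index distance and
    each row is normalized to sum to 1)."""
    idx = np.arange(n)
    W = np.exp(-np.abs(idx[:, None] - idx[None, :]) / sigma)
    return W / W.sum(axis=1, keepdims=True)

def aggregate_slice(cost_slice, A_v, A_h):
    """Aggregate one H-by-W cost slice with two constant matrices:
    A_v (H x H) couples rows vertically, A_h (W x W) couples columns
    horizontally. The slice itself never influences the weights."""
    return A_v @ cost_slice @ A_h.T
```

Because `A_v` and `A_h` depend only on the image dimensions, they can be precomputed once and reused for every disparity slice and every scene of the same resolution, which is the source of the computational savings.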
The aggregation algorithm requires two constant weight matrices that depend only on the image resolution and can be calculated beforehand. These matrices are mathematically proven to be independent of image texture and can thus be applied to different scene images of the same resolution. The algorithm can also be integrated with other schemes to provide both computational efficiency and stereo matching accuracy; we demonstrate its integration with the cross-scale scheme of [20,30].
Through numerical experiments using indoor and outdoor benchmark stereo image datasets, we demonstrate that the integrated scheme is not only computationally efficient but also provides disparity maps that are highly comparable to the most accurate algorithms in the literature.
In the future, we will focus on real-time implementation of the proposed algorithm using, for instance, GPU-based systems. Preprocessing techniques, such as the shadow detection and removal algorithm proposed in [33], can be employed to further improve the accuracy of dense disparity maps.