Article

Simplified High-Performance Cost Aggregation for Stereo Matching

1 The 38th Research Institute of China Electronics Technology Group Corporation, Hefei 230093, China
2 Department of Mechanical Engineering, Chang Gung University, Taoyuan 33302, Taiwan
3 Department of Neurosurgery, Chang Gung Memorial Hospital, Taoyuan 33305, Taiwan
4 Department of Mechanical Engineering, Ming Chi University of Technology, New Taipei City 243303, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(3), 1791; https://doi.org/10.3390/app13031791
Submission received: 30 November 2022 / Revised: 21 January 2023 / Accepted: 26 January 2023 / Published: 30 January 2023
(This article belongs to the Special Issue Cutting Edge Advances in Image Information Processing)


Featured Application

The guidance of uncrewed aerial/ground vehicles, high-end security, surveillance, and various 3D manipulation, inspection, and measurements.

Abstract

Applying edge preservation filters for cost aggregation has been a leading technique in generating dense disparity maps. However, traditional approaches usually require intensive calculations, and their design parameters must be tuned for different scenarios to obtain the best performance. This paper shows that a simple texture-independent aggregation approach can achieve similar high performance. The proposed algorithm is equivalent to a sequence of matrix multiplications involving two weighting matrices and a primary matching cost. Notably, the weighting matrices are constant for image pairs with the same resolution. For higher matching accuracy, we integrate the algorithm with a multi-scale scheme to fully exploit the spatial distribution of textures in the image pairs. The resultant hybrid approach is efficient and accurate enough to surpass most existing approaches in stereo matching. The performance of the proposed approach is verified by extensive simulation results using the Middlebury (3rd Edition) benchmark stereo database.

1. Introduction

Binocular stereo matching aims to restore 3D information based on a pair of rectified 2D images obtained from the same scene. Due to its passive and low-cost sensing characteristics, the acquired depth information may play a vital role in the guidance of uncrewed aerial/ground vehicles, high-end security, surveillance, and various 3D manipulation, inspection, and measurement applications.
Traditional stereo matching algorithms can be categorized into global and local algorithms [1], depending on the extent of information used for matching evaluation. Local methods consider only neighboring pixels for each candidate pixel, while global methods exploit information from the entire image. Traditional matching algorithms normally begin by formulating a stereo matching cost that estimates the degree of match between reference patches and target patches.
Global stereo matching algorithms require computationally demanding optimization algorithms, such as graph cuts [2] and dynamic programming [3,4], to find the disparity of each pixel. To achieve real-time operation, [5] implemented an integrated scheme combining a dynamic programming algorithm and a local algorithm on a graphics processing unit (GPU), while [6] implemented semi-global stereo matching on a field-programmable gate array (FPGA).
Unlike global algorithms, the stereo matching cost of local and non-local/semi-global algorithms is an aggregation of primary matching costs. The aggregation is normally conducted as a filtering procedure, and the resultant disparity map is obtained through the winner-take-all (WTA) strategy [1]. Since the computational complexities of local algorithms are usually lower than those of other algorithms, they are widely used in practical applications.
Among early efforts to use filters for cost volume aggregation in local algorithms, the matching performance of [7] was constrained by a fixed supporting window. This shortcoming was alleviated in [8] by introducing an adaptive window-size approach. Later, the Guided Image Filtering (GIF) model [9] was successfully implemented in [10], demonstrating an edge-preserving advantage. Building on [9], [11] further proposed a weighted guided image filter (WGIF) scheme to avoid halo artifacts, which was used in [12] for disparity estimation. As the matching performance of GIF depends on the size of kernel windows, [13] proposed an adaptive guided filtering method to exclude pixels that do not belong to the same region. In addition, an iterative guided filtering approach [14] and an adaptive support weight version [15] were developed to improve matching accuracy. More recently, [16] proposed weights according to structural features and filtered the matching cost volume using the adaptive guided filtering method of [13]. These are typical local stereo matching algorithms that do not aggregate matching costs outside the supporting window.
In [17], matching costs were aggregated according to tree structures individually derived from the entire image pair. Similarly, matching algorithms based on the so-called permeability filter [18] and pervasive guided image filtering [19] can effectively aggregate matching costs over the whole image. In addition, [20] integrated multi-scale information into the scheme of [19], significantly improving the matching performance. These algorithms use the full image for aggregation and are called non-local stereo matching algorithms. To address matching ambiguity in low-texture areas and high sensitivity in high-texture areas, [21] proposed using both the local support window and the whole image.
In addition to these traditional algorithms, disparity maps can also be computed using deep learning-based methods, which offer high matching accuracy. For instance, [22] developed an automatic encoder to generate feature maps for semi-global stereo matching, and [23] implemented an unsupervised disparity estimation neural network based on the principle of disparity consistency. In addition, [24] proposed a simplified independent component correlation algorithm (ICA)-based local similarity stereo matching algorithm to further improve matching accuracy in non-texture areas and at boundaries. Among the deep learning-based methods, both [25,26] are typical end-to-end stereo matching networks, which realize the aggregation of matching costs through 3D convolution operations.
However, as pointed out in [27], compared with traditional stereo matching algorithms, deep learning-based stereo matching algorithms still suffer from insufficient generalization ability. Additionally, they typically require GPU-based computing resources. The current characteristics of deep learning-based methods justify continued research on traditional methods.
In most traditional local and non-local stereo matching algorithms, the weights for the matching cost aggregation depend on the texture of the image pair. This dependency restricts the possibility of sharing weights for different scenes. To improve the computational efficiency of stereo matching, this paper proposes a texture-independent aggregation method.
The main contributions of this paper are as follows:
(1)
We propose an aggregation algorithm for stereo matching that significantly simplifies computation without sacrificing matching performance. The aggregation weights can be shared between different scene images with the same resolution.
(2)
To provide higher matching accuracy, we integrate the algorithm with a multi-scale scheme that exploits the spatial distribution of texture, achieving improved performance with only a minor increase in computational effort.

2. Methods

2.1. Traditional Local and Non-Local Stereo Matching Algorithms

In this section, we examine two representative stereo matching algorithms proposed in [9,17], which are classified as local and non-local, respectively.
Traditional local and non-local stereo matching algorithms start with calculating a primary matching cost that reflects the degree of match between corresponding pixels on an image pair. Many measurements can be used for computing the primary matching cost. The implementation presented in this paper uses a linear combination of the truncated absolute difference of the gradient, denoted as $C_{TGD}$, and the Hamming distance of the census transform:
$$\begin{aligned} C_{TGD}(q,d) &= \min\left(\left|\nabla_x I_L(q) - \nabla_x I_R(q+d)\right|, 2\right) + \min\left(\left|\nabla_y I_L(q) - \nabla_y I_R(q+d)\right|, 2\right) \\ C_{Census}(q,d) &= \mathrm{Ham}\left(T_{Census}(I_L(q)),\ T_{Census}(I_R(q+d))\right) \end{aligned} \qquad (1)$$
where q is the location of a pixel, d is the estimated disparity of this pixel, $I_L$ and $I_R$ are the left and right images, respectively, $\nabla_x$ and $\nabla_y$ are the gradients of this pixel in the horizontal and vertical directions, respectively, $T_{Census}$ represents the Census transformation [28], and $\mathrm{Ham}$ is the Hamming distance computation. Left and right images often appear different at the same location due to lighting conditions and image sensing inhomogeneity. These two matching costs are adopted because they can significantly reduce the misleading effects of pixel-level intensity variations.
To consider both matching costs, we take a linear combination of $C_{TGD}$ and $C_{Census}$ to obtain the primary matching cost, $C_0$:
$$C_0(q,d) = \alpha\, C_{TGD}(q,d) + (1-\alpha)\, C_{Census}(q,d) \qquad (2)$$
where $\alpha$ is a weighting constant. It is set to 0.95 in the following investigations.
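For concreteness, the primary cost of (1) and (2) can be computed as in the following sketch. This is not the authors' implementation: the census window radius, the wrap-around boundary handling via np.roll, and the helper names census and primary_cost are illustrative assumptions.

```python
import numpy as np

def census(img, r=2):
    # Census transform [28]: bit pattern of neighborhood-vs-center comparisons.
    H, W = img.shape
    pad = np.pad(img, r, mode='edge')
    bits = [pad[r + dy:r + dy + H, r + dx:r + dx + W] > img
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if (dy, dx) != (0, 0)]
    return np.stack(bits, axis=-1)

def primary_cost(IL, IR, d_range, alpha=0.95, trunc=2.0):
    # C0(q,d) = alpha * C_TGD + (1 - alpha) * C_Census, Eqs. (1)-(2).
    gyL, gxL = np.gradient(IL)
    gyR, gxR = np.gradient(IR)
    TL, TR = census(IL), census(IR)
    C0 = np.empty(IL.shape + (len(d_range),))
    for k, d in enumerate(d_range):
        # Shift the right image by d columns (naive wrap-around at the border).
        sxR, syR = np.roll(gxR, d, axis=1), np.roll(gyR, d, axis=1)
        c_tgd = (np.minimum(np.abs(gxL - sxR), trunc)
                 + np.minimum(np.abs(gyL - syR), trunc))
        c_cen = (TL != np.roll(TR, d, axis=1)).sum(axis=-1)  # Hamming distance
        C0[..., k] = alpha * c_tgd + (1 - alpha) * c_cen
    return C0
```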
The primary matching cost can be further aggregated to include information from surrounding pixels. The aggregation of the primary matching cost is the most crucial step that affects the stereo matching performance. A general form of the aggregated matching cost, denoted as $\bar{C}(p,d)$, can be written as
$$\bar{C}(p,d) = \sum_{q \in \Omega(p)} W_0(p,q)\, C_0(q,d) \qquad (3)$$
where $\Omega(p)$ is the supporting region centered at p, and $W_0(p,q)$ is the aggregation weight. For the celebrated algorithms based on the weighted guided image filter, such as those proposed in [10,12,16], the aggregation weight, denoted as $W_{GIF}(p,q)$, can be represented in the following form:
$$W_0 = W_{GIF}(p,q) = \sum_{k \in \Omega(p) \cap \Omega(q)} \frac{1}{|\Omega(p)|\,|\Omega(k)|} \left(1 + \frac{\left(I_L(p) - \mu(I_L(k))\right)\left(I_L(q) - \mu(I_L(k))\right)}{\sigma^2(I_L(k)) + \varepsilon}\right) \qquad (4)$$
where $\mu$ and $\sigma^2$ represent the mean value and variance operations, respectively, and $\varepsilon$ is a small constant introduced to avoid division by zero. From (4), we know that $W_{GIF}$ is closely related to the patterns on $I_L$. Besides, $W_{GIF}$ depends on the spatial distribution of the image since its value is a function of the interaction between two regions centered at p and q, denoted as $\Omega(p)$ and $\Omega(q)$.
In [17], the matching cost values are aggregated based on the minimum spanning tree (MST) derived from the guidance image, where the aggregation weight, denoted as $W_{MST}$, is calculated as
$$W_0 = W_{MST}(p,q) = \prod_{(u,v) \in L(p,q)} \exp\left(-\frac{|I_L(u) - I_L(v)|}{\beta}\right) \qquad (5)$$
where $L(p,q)$ is the path determined by the MST between the pixel pair located at p and q, u and v are the coordinates of adjacent pixels along the path, and $\beta$ is a shaping parameter. It is clear from (5) that $W_{MST}$ also depends on the patterns on $I_L$.
Once we have the aggregated matching cost, the estimated disparity map, denoted as $\hat{D}(p)$, can be obtained by the winner-take-all computation:
$$\hat{D}(p) = \arg\min_{d}\ \bar{C}(p,d), \qquad d \in \{d_{\min}, \ldots, d_{\max}\} \qquad (6)$$
where $d_{\min}$ and $d_{\max}$ are the lower and upper bounds of disparity, assumed to be known beforehand.
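As a minimal sketch, the winner-take-all step of (6) is a single argmin over the disparity axis of the cost volume (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def winner_take_all(C, d_min):
    # Eq. (6): for each pixel, pick the disparity whose aggregated cost is minimal.
    # C has shape (M, N, d_max - d_min + 1); the result is the disparity map.
    return d_min + np.argmin(C, axis=2)
```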
Based on the above examination and derivation, we know that the aggregation weights of [9,17] are related to the texture and spatial distance of the participating pixels on an image pair. Different scene images have different aggregation weights, making the weights not sharable. This shortcoming is typical for most local and non-local stereo matching algorithms.

2.2. The Proposed Aggregation Method

We propose conducting a texture-independent cost aggregation scheme in three steps. In the first step, we consider the matching cost continuity in the horizontal direction. Then we enforce the vertical continuity of the cost in the second step. In each of these two steps, the aggregation is derived based on the minimization of an objective function. In the third step, we integrate the proposed algorithm with a cross-scale scheme to provide higher matching accuracy and efficient computation simultaneously.

2.2.1. The First Step: Horizontal Aggregation

In the first step, the proposed aggregation cost assumes the least squared difference from the primary matching cost. In addition, the cost should differ little from that of the two horizontal neighbors of each pixel. Based on these assumptions, we compose an objective function $J_h$ to be minimized:
$$\begin{aligned} J_h ={}& \sum_{p \in I_L} \left[\bar{C}_1(p,d) - C_0(p,d)\right]^2 + \lambda \sum_{p \in I_L} \sum_{q \in N_h(p)} \left[\bar{C}_1(p,d) - \bar{C}_1(q,d)\right]^2 \\ ={}& \sum_{x=1}^{M} \sum_{y=1}^{N} \left[\bar{C}_1(x,y,d) - C_0(x,y,d)\right]^2 + \lambda \sum_{x=1}^{M} \sum_{y=1}^{N} \sum_{(i,j) \in N_h(x,y)} \left[\bar{C}_1(x,y,d) - \bar{C}_1(i,j,d)\right]^2 \\ ={}& \sum_{x=1}^{M} \left( \sum_{y=1}^{N} \left[\bar{C}_1(x,y,d) - C_0(x,y,d)\right]^2 + 2\lambda \sum_{y=2}^{N-1} \left\{ \left[\bar{C}_1(x,y,d) - \bar{C}_1(x,y-1,d)\right]^2 + \left[\bar{C}_1(x,y,d) - \bar{C}_1(x,y+1,d)\right]^2 \right\} \right) \end{aligned} \qquad (7)$$
where $\bar{C}_1$ is the proposed aggregation cost to be defined in the first step, M and N are the vertical and horizontal resolutions of $I_L$, $\lambda$ is a normalization factor, and $N_h$ is the set of two horizontal neighbors of the pixel under consideration, as depicted in Figure 1. In the figure, the coordinates of pixels are expanded as p = (x, y) and q = (i, j). This two-neighbor regularization term limits abrupt changes of the proposed aggregation cost in the horizontal direction.
The aggregated matching cost in this step is constrained only in the horizontal direction and is independent in the vertical direction. We assume the objective function $J_h$ can be decomposed into M independent objective functions, one for each row:
$$J_h = \sum_{x=1}^{M} H_x \qquad (8)$$
where
$$H_x = \sum_{y=1}^{N} \left[\bar{C}_1(x,y,d) - C_0(x,y,d)\right]^2 + 2\lambda \sum_{y=2}^{N-1} \left\{ \left[\bar{C}_1(x,y,d) - \bar{C}_1(x,y-1,d)\right]^2 + \left[\bar{C}_1(x,y,d) - \bar{C}_1(x,y+1,d)\right]^2 \right\} \qquad (9)$$
To minimize each objective function $H_x$, we take its partial derivative with respect to $\bar{C}_1(x,y,d)$:
$$\frac{\partial H_x}{\partial \bar{C}_1(x,y,d)} = 0, \qquad y \in \{1, 2, \ldots, N\} \qquad (10)$$
This results in the following N equations for different y:
$$\begin{cases} (1+2\lambda)\,\bar{C}_1(x,1,d) - 2\lambda\,\bar{C}_1(x,2,d) = C_0(x,1,d) \\ -2\lambda\,\bar{C}_1(x,y-1,d) + (1+4\lambda)\,\bar{C}_1(x,y,d) - 2\lambda\,\bar{C}_1(x,y+1,d) = C_0(x,y,d), \quad y \in \{2, 3, \ldots, N-1\} \\ -2\lambda\,\bar{C}_1(x,N-1,d) + (1+2\lambda)\,\bar{C}_1(x,N,d) = C_0(x,N,d) \end{cases} \qquad (11)$$
If we define
$$A_h = \begin{bmatrix} 1+2\lambda & -2\lambda & & & 0 \\ -2\lambda & 1+4\lambda & -2\lambda & & \\ & \ddots & \ddots & \ddots & \\ & & -2\lambda & 1+4\lambda & -2\lambda \\ 0 & & & -2\lambda & 1+2\lambda \end{bmatrix}, \quad B(x) = \begin{bmatrix} \bar{C}_1(x,1,d) \\ \vdots \\ \bar{C}_1(x,N,d) \end{bmatrix}, \quad D(x) = \begin{bmatrix} C_0(x,1,d) \\ \vdots \\ C_0(x,N,d) \end{bmatrix} \qquad (12)$$
Equation (11) can be written in a matrix form as
$$A_h B(x) = D(x) \qquad (13)$$
Moreover, with the vectors of (12), the costs of all rows can be collected into matrices:
$$\bar{C}_1 = \begin{bmatrix} \bar{C}_1(1,1,d) & \bar{C}_1(1,2,d) & \cdots & \bar{C}_1(1,N,d) \\ \bar{C}_1(2,1,d) & \bar{C}_1(2,2,d) & \cdots & \bar{C}_1(2,N,d) \\ \vdots & \vdots & & \vdots \\ \bar{C}_1(M,1,d) & \bar{C}_1(M,2,d) & \cdots & \bar{C}_1(M,N,d) \end{bmatrix} = \begin{bmatrix} B(1)^T \\ B(2)^T \\ \vdots \\ B(M)^T \end{bmatrix}, \quad C_0 = \begin{bmatrix} C_0(1,1,d) & C_0(1,2,d) & \cdots & C_0(1,N,d) \\ C_0(2,1,d) & C_0(2,2,d) & \cdots & C_0(2,N,d) \\ \vdots & \vdots & & \vdots \\ C_0(M,1,d) & C_0(M,2,d) & \cdots & C_0(M,N,d) \end{bmatrix} = \begin{bmatrix} D(1)^T \\ D(2)^T \\ \vdots \\ D(M)^T \end{bmatrix} \qquad (14)$$
Taking the whole image into consideration, (13) becomes
$$\bar{C}_1 A_h^T = C_0 \qquad (15)$$
The matrix $A_h$ is tridiagonal, symmetric, and sparse. This N-by-N matrix is invertible for positive $\lambda$ [29]. From (15), we may obtain the proposed aggregated matching cost, $\bar{C}_1$, by inverting $A_h$:
$$\bar{C}_1^T = A_h^{-1} C_0^T \qquad (16)$$
This procedure has the advantage that the matching cost can be calculated by a simple linear combination with constant weights, while the resulting cost differs only slightly from its horizontal neighbors, so that only smooth transitions are possible in this direction.
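Because the row-wise system (13) is tridiagonal, it can also be solved directly without forming $A_h^{-1}$. The sketch below uses SciPy's banded solver; the function name and the banded storage layout are implementation choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import solve_banded

def horizontal_aggregate(C0, lam):
    # Solve A_h B(x) = D(x) for every row x and disparity d, Eqs. (11)-(13).
    # A_h depends only on N and lambda, so its banded form is built once.
    M, N, D = C0.shape
    ab = np.zeros((3, N))                  # banded storage of a tridiagonal matrix
    ab[0, 1:] = -2 * lam                   # superdiagonal
    ab[1, :] = 1 + 4 * lam                 # main diagonal ...
    ab[1, 0] = ab[1, -1] = 1 + 2 * lam     # ... with the boundary entries of Eq. (12)
    ab[2, :-1] = -2 * lam                  # subdiagonal
    C1 = np.empty_like(C0)
    for d in range(D):
        # solve_banded accepts multiple right-hand sides: all M rows at once.
        C1[:, :, d] = solve_banded((1, 1), ab, C0[:, :, d].T).T
    return C1
```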

2.2.2. The Second Step: Vertical Aggregation

Similar to the first step, we can create an objective function, denoted as $J_v$, for an aggregated matching cost, $\bar{C}_2$. The function applies constraints in the vertical direction while enforcing the least squared difference from the matching cost aggregated in the first step, $\bar{C}_1$:
$$J_v = \sum_{y=1}^{N} \left\{ \sum_{x=1}^{M} \left[\bar{C}_2(x,y,d) - \bar{C}_1(x,y,d)\right]^2 + 2\lambda \sum_{x=2}^{M-1} \left\{ \left[\bar{C}_2(x,y,d) - \bar{C}_2(x-1,y,d)\right]^2 + \left[\bar{C}_2(x,y,d) - \bar{C}_2(x+1,y,d)\right]^2 \right\} \right\} \equiv \sum_{y=1}^{N} V_y \qquad (17)$$
We assume the objective function is composed of sub-functions $V_y$ that are mutually independent in the horizontal direction. $\bar{C}_2$ can be solved by taking the partial derivative of $V_y$ with respect to $\bar{C}_2$ in each column. Accumulating the results over the whole image, we have
$$\bar{C}_2 = A_v^{-1} \bar{C}_1 \qquad (18)$$
where
$$A_v = \begin{bmatrix} 1+2\lambda & -2\lambda & & & 0 \\ -2\lambda & 1+4\lambda & -2\lambda & & \\ & \ddots & \ddots & \ddots & \\ & & -2\lambda & 1+4\lambda & -2\lambda \\ 0 & & & -2\lambda & 1+2\lambda \end{bmatrix} \qquad (19)$$
Again, the matrix $A_v$ is an M-by-M invertible tridiagonal sparse matrix for positive $\lambda$ [29]. Combining (18) and (16), and considering that $A_h$ is symmetric, we have
$$\bar{C}_2 = A_v^{-1} \bar{C}_1 = A_v^{-1} C_0 (A_h^{-1})^T = A_v^{-1} C_0 A_h^{-1} \qquad (20)$$
From (20), we know that the aggregation of the primary matching cost reduces to a series of matrix multiplications. The horizontal aggregation weight, $A_h$, and the vertical aggregation weight, $A_v$, are constant tridiagonal matrices parameterized by $\lambda$. Their inverses can be calculated beforehand once the image sizes, M and N, are known. Under this condition, the computational complexity of the proposed algorithm is much lower than that of most aggregation approaches.
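When many image pairs share one resolution, it can pay to precompute the dense inverses once and reduce (20) to two matrix products per disparity slice. A minimal sketch with illustrative names follows; the helper tridiag_inverse is our own construction of the matrix in (12)/(19).

```python
import numpy as np

def tridiag_inverse(n, lam):
    # Inverse of the constant matrix of Eq. (12)/(19): tridiagonal with -2*lambda
    # off the diagonal and 1+4*lambda (1+2*lambda at the two ends) on it.
    A = (np.diag(np.full(n, 1 + 4 * lam))
         + np.diag(np.full(n - 1, -2 * lam), 1)
         + np.diag(np.full(n - 1, -2 * lam), -1))
    A[0, 0] = A[-1, -1] = 1 + 2 * lam
    return np.linalg.inv(A)

# Precompute once per resolution; reuse for every scene of the same size.
M, N, lam = 480, 720, 6
Av_inv = tridiag_inverse(M, lam)
Ah_inv = tridiag_inverse(N, lam)

def aggregate(C0):
    # Eq. (20): C2 = Av^{-1} C0 Ah^{-1}, applied slice by slice over disparities.
    return np.einsum('im,mnd,nj->ijd', Av_inv, C0, Ah_inv)
```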
Taking M = 480, N = 720, and $\lambda = 7$ as an example, Figure 2 and Figure 3 visualize $A_v^{-1}$ in two and three dimensions. The values of the matrix are concentrated near the diagonal, and the values on both sides of the diagonal decrease gradually with distance from it. The behavior of $A_h^{-1}$ is similar to that of $A_v^{-1}$.
As $A_v^{-1}$ is symmetric and sparse, its element values in the row and column directions follow the same pattern. For a fixed image size, $A_v^{-1}$ is completely defined by the single parameter $\lambda$. Figure 4 shows the element values of $A_v^{-1}$ as a function of $\lambda$. The image size is again 480-by-720, and only the elements on row 240 are depicted.
As observed in Figure 4, the distribution becomes smoother as $\lambda$ increases, which implies a larger effective range of the aggregation; for smaller $\lambda$, the distribution is sharper and the effective range of the matching cost aggregation is smaller. This observation is consistent with (7) and (17): a larger $\lambda$ imposes stronger constraints on the aggregated matching costs, $\bar{C}_1$ and $\bar{C}_2$.
From the above analysis, the effective range of $A_h^{-1}$ and $A_v^{-1}$ plays a role similar to the size of the aggregation window or the length of an aggregation path in local and non-local algorithms. For images with higher resolutions, a larger value of $\lambda$ is required to aggregate the matching cost over a broader range; for images with smaller resolutions, a smaller value of $\lambda$ suffices.
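The effective-range argument can be checked numerically. The snippet below reuses tridiag_inverse from the sketch above and reproduces the trend of Figure 4; the 1% cutoff used to measure the width is an arbitrary illustrative choice.

```python
import numpy as np

for lam in (1, 7, 30):
    w = tridiag_inverse(480, lam)[240]   # row-240 profile, as in Figure 4
    width = np.sum(w > 0.01 * w.max())   # span where the weight exceeds 1% of peak
    print(f"lambda={lam:>3}: effective aggregation width ~ {width} px")
```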

2.2.3. The Proposed Scheme: Integration of the Texture-Independent Scheme with a Cross-Scale Cost Aggregation Algorithm

To enhance the matching accuracy of the cost aggregation algorithm derived in the last section, we present its integration with a cross-scale scheme of [20,30] in this section. This hybrid approach is recommended since higher matching accuracy can be achieved with only a slight increase in computation. Similarly, other algorithms can be integrated with the proposed scheme.
The cross-scale scheme of [20,30], denoted as CS, considers the primary matching costs of various scales. The scheme begins by down-sampling an image pair into K additional pairs. For each image pair, a matching cost is calculated according to (20). We denote the matching cost of the k-th pair as $\tilde{C}_k(p,d)$, $k \in \{0, 1, \ldots, K\}$, where scale K is the coarsest pair with the lowest resolution. The framework is derived by introducing an objective function for the cross-scale scheme of [20,30]:
$$J_{CS} = \sum_{k=0}^{K} \sum_{q \in \Omega(p)} \left[\hat{C}_k(p,d) - \tilde{C}_k(q,d)\right]^2 + \sum_{z=1}^{K} \gamma^z \left[\hat{C}_z(p,d) - \hat{C}_{z-1}(p,d)\right]^2 \qquad (21)$$
where $\gamma$ is a constant constraining factor and $\Omega(p)$ is the aggregation window centered at p. For K = 2 and $\gamma$ = 1.5, minimizing (21) with respect to $\hat{C}_k$ yields a simple expression for the aggregated matching cost of the zero (finest) layer:
$$\hat{C}_0(p,d) = 0.56\, \frac{\sum_{q \in \Omega(p)} \tilde{C}_0(q,d)}{|\Omega(p)|} + 0.26\, \frac{\sum_{q \in \Omega(p)} \tilde{C}_1(q,d)}{|\Omega(p)|} + 0.18\, \frac{\sum_{q \in \Omega(p)} \tilde{C}_2(q,d)}{|\Omega(p)|} \qquad (22)$$
where $|\Omega(p)|$ is the number of pixels within the aggregation window $\Omega(p)$. As observed in the original scheme [20,30], the larger the value of K, the more multi-scale features the aggregated matching cost contains; the larger the value of $\gamma$, the larger the proportion of down-sampled information in the aggregated matching cost, and vice versa.
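The coefficients of (22) can be recovered numerically. Per pixel, and treating the window averages as the data terms, minimizing (21) couples the K + 1 scale costs through a small tridiagonal system whose off-diagonal entries are the $-\gamma^z$ terms; this reading of (21) is our assumption, but for K = 2 and $\gamma$ = 1.5 it reproduces the weights exactly:

```python
import numpy as np

K, gamma = 2, 1.5
g = gamma ** np.arange(1, K + 1)              # [gamma^1, gamma^2] = [1.5, 2.25]
A = np.diag(1 + np.r_[g, 0] + np.r_[0, g])    # diagonal: 1+g1, 1+g1+g2, 1+g2
A += np.diag(-g, 1) + np.diag(-g, -1)         # couplings between adjacent scales
print(np.linalg.inv(A)[0].round(2))           # -> [0.56 0.26 0.18], as in Eq. (22)
```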
A primary disparity map D ^ can be obtained using the winner-take-all principle:
$$\hat{D}(x,y) = \arg\min_{d \in \{d_{\min}, \ldots, d_{\max}\}} \hat{C}_0(x,y,d) \qquad (23)$$
To reduce the noise effects on $\hat{D}$, we create an updated cost volume, $\breve{C}$, based on the primary disparity map of (23):
$$\breve{C}(x,y,d) = \begin{cases} \left|\hat{D}(x,y) - d\right|, & \text{if } (x,y) \notin S_p \\ 0, & \text{if } (x,y) \in S_p \end{cases} \qquad (24)$$
where $S_p$ is the erroneous region detected via the left-right consistency examination. Implementation details of (21), (22), and (24) can be found in [20]. Finally, we aggregate the updated cost volume using the algorithm of (20):
$$\bar{C}_f = A_v^{-1}\, \breve{C}\, A_h^{-1} \qquad (25)$$
The result, $\bar{C}_f$, is the final aggregated matching cost. We then obtain the final disparity map by applying the winner-take-all computation of (6) to $\bar{C}_f$ once again.
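A sketch of the refinement loop of (23)-(25) follows. The left-right consistency test and its one-pixel threshold are common conventions assumed here rather than taken from the paper, and all names are illustrative.

```python
import numpy as np

def lr_inconsistent(D_L, D_R, thresh=1):
    # S_p: pixels whose left and right disparity estimates disagree.
    H, W = D_L.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xr = np.clip(xs - D_L.astype(int), 0, W - 1)   # matching column in the right view
    return np.abs(D_L - D_R[ys, xr]) > thresh

def refined_cost(D_hat, S_p, d_range):
    # Eq. (24): anchor reliable pixels at |D_hat - d|; zero the cost inside S_p so
    # that re-aggregation with Eq. (25) fills those pixels from reliable neighbors.
    C = np.abs(D_hat[:, :, None] - np.asarray(d_range)[None, None, :]).astype(float)
    C[S_p] = 0.0
    return C

# Final pass: re-aggregate with the same constant matrices and repeat the WTA step.
# C_final = np.einsum('im,mnd,nj->ijd', Av_inv, refined_cost(D_hat, S_p, d_range), Ah_inv)
```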

3. Results

To investigate the matching performance of the proposed scheme, performance comparisons have been made between eight representative stereo matching schemes and the proposed scheme:
  • End-to-end real-time stereo matching network proposed in [25], denoted as RTSMNet.
  • Matching algorithm based on a combination of the adaptive support weight with iterative guided filter and the sum of gradient matching [15], denoted as ISM.
  • Sparse representation over a learned discriminative dictionary for stereo matching [4], denoted as DDL.
  • Stereo matching algorithm based on two-phase adaptive optimization of ad-census and gradient fusion [8], denoted as TPAO.
  • Stereo matching algorithm based on per pixel difference adjustment, iterative guided filter, and graph segmentation [14], denoted as IGF.
  • Local stereo matching using adaptive cross-region-based guided image filtering with orthogonal weights [16], denoted as ACR-GIF-OW.
  • Hierarchical guided-image-filtering for stereo matching [20], denoted as HGIF.
  • Stereo matching with fusing adaptive support weights [21], denoted as FASW.
  • The proposed scheme, integrated with the cross-scale cost aggregation algorithm as described in the previous section.
We selected several stereo image pairs from the Middlebury version 3 [31] and the KITTI Vision Benchmark Suite [32] datasets for demonstration. The “trainingQ” of the Middlebury version 3 [31] dataset is composed of 15 groups of pictures from “Adirondack” to “Vintage”. As summarized in Table 1, resolutions of the images are around 480-by-720.
According to the discussion in the previous section, the design parameter $\lambda$ is positively related to the resolution of the images under matching. In the following demonstrations, the values of $\lambda$ are simply assigned as
$$\lambda = \frac{M}{480} \cdot \frac{N}{720} \cdot 6$$
Figure 5 shows the disparity maps obtained by these stereo matching algorithms on four of the image sets. Table 2 and Table 3 present the error rates and weighted error rates of the algorithms using the complete “trainingQ” of the Middlebury version 3 [31] dataset. In the tables, the experimental results of the four algorithms for comparison are obtained from the original literature. For ease of viewing, the graphical representations of Table 2 and Table 3 are shown in Figure 6 and Figure 7.
FASW [21] and the proposed algorithm are the most prominent performers on this dataset. Specifically, in the non-occluded region, the proposed algorithm has the lowest mismatch rate, followed by FASW [21] and HGIF [20]. Over all regions, RTSMNet [25] has the lowest mismatch rate, followed by FASW [21] and the proposed algorithm.
However, we observe a significant performance deterioration when RTSMNet [25] predicts the disparity maps of the “Jadeplant” and “Vintage” image sets. RTSMNet [25] is a deep learning-based method whose generalization ability depends on the quality and richness of the training dataset; similar scenes may be rare in its training set. This behavior is not observed in the traditional algorithms because their performance is less sensitive to scene type.
The time required to match a rectified stereo image pair with each algorithm is summarized in Table 4. These computations were executed in MATLAB 2017b on an Intel Core i5-8300H with 16 GB RAM. Notably, the proposed algorithm takes the least computational time. Considering both matching accuracy and algorithmic complexity, the proposed algorithm is close to the best in accuracy while requiring the least computation.
To verify the stereo matching performance of the proposed algorithm in real scenes, the autonomous driving training dataset of KITTI Vision Benchmark Suite [32] is used for further demonstration. The dataset contains 194 image pairs with corresponding ground truth disparity maps. Performance comparisons to be presented are between the following stereo matching schemes:
  • Stereo matching based on adaptive guided filtering [13], denoted as AGF.
  • Adaptive stereo matching using tree filtering [17], denoted as MST.
  • The ISM algorithm [15].
  • The HGIF algorithm [20].
  • The FASW algorithm [21].
  • The proposed scheme.
Figure 8 shows the disparity maps generated by these algorithms on the KITTI suite [32]. Each algorithm performs stereo matching on 194 sets of stereo image pairs. Due to space constraints, we only select three stereo-image pairs for visual comparison. Table 5 shows the quantitative matching performance of these algorithms. Four scores are compared: the percentage of erroneous pixels in non-occluded areas, denoted as Non-Occ (%); the percentage of erroneous pixels in all regions, denoted as All (%); the average disparity error in terms of pixel numbers in non-occluded areas, denoted as Non-Occ (pixels); and the average disparity error in terms of pixel numbers in all regions, denoted as All (pixels). The scores used for comparison are quoted from [21].
As seen from Figure 8 and Table 5, compared with the five representative aggregation methods, the proposed algorithm outperforms the other algorithms on Non-Occ (%), All (%), and Non-Occ (pixels), with scores of 6.30%, 7.48%, and 1.30 pixels, respectively. Its performance is only slightly inferior to FASW [21] on All (pixels), with an error of 1.58 pixels versus 1.45 pixels for FASW [21].
Figure 9 shows three disparity maps generated by the proposed algorithm using three sets of multi-view remote sensing images (pictures from https://github.com/whuwuteng/benchmark_ISPRS2021, accessed on 15 February 2019). These additional disparity maps demonstrate the effectiveness of the proposed approach.

4. Discussion

Traditional local and non-local stereo matching algorithms require an efficient aggregation procedure, which is the most critical stage in generating accurate dense disparity maps. Unfortunately, the aggregation weights of most stereo matching algorithms in the literature are scenario-specific.
We re-examined the procedure of cost aggregation from the perspective of matrix operations and treated the aggregation as constraining the degree of difference between adjacent costs. By decoupling the aggregation procedure in the horizontal and vertical directions, we propose a new aggregation algorithm to effectively calculate a cost volume for stereo matching. This process is equivalent to multiplying the initial cost by two constant matrices.
The aggregation algorithm requires two constant weight matrices that are only related to the image resolution and can be calculated beforehand. These matrices are mathematically proven to be independent of image texture and thus can be applied to different scene images of the same resolution. This algorithm can be used for integration with other schemes to provide both computational efficiency and stereo matching accuracy. We demonstrate its integration with the cross-scale scheme of [20,30].
Through numerical experiments using indoor and outdoor benchmark stereo image datasets, we demonstrate that the integrated scheme is not only computationally efficient but also provides disparity maps that are highly comparable to the most accurate algorithms in the literature.
In the future, we will focus on real-time implementation of the proposed algorithm by using, for instance, GPU-based systems. Preprocessing techniques, such as the algorithm proposed in [33] for detecting and removing shadows, can be employed to further improve the accuracy of dense disparity maps.

Author Contributions

Conceptualization, Y.-Z.C. and C.Z.; methodology, C.Z. and Y.-Z.C.; software, C.Z.; validation, Y.-Z.C. and C.Z.; formal analysis, C.Z.; investigation, Y.-Z.C.; resources, Y.-Z.C. and C.Z.; data curation, C.Z.; writing—original draft preparation, C.Z.; writing—review and editing, Y.-Z.C.; visualization, C.Z.; supervision, Y.-Z.C.; project administration, Y.-Z.C.; funding acquisition, Y.-Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science and Technology Council, Taiwan, grant number 110-2221-E-182-034 and 111-2221-E-182-057; and Chang Gung Memorial Hospital, grant number CORPD2J0041 and CORPD2J0042.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable. The Middlebury version 3 datasets are available at https://vision.middlebury.edu/stereo/data/scenes2014/. The KITTI Vision Benchmark Suite datasets are available at https://www.cvlibs.net/datasets/kitti/. The multi-view remote sensing images are available at https://github.com/whuwuteng/benchmark_ISPRS2021. These datasets were accessed on 15 February 2019.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

$A_h$: a constant matrix, parameterized by λ, used in deriving the proposed aggregation cost of the first step.
$A_v$: a constant matrix, parameterized by λ, used in deriving the proposed aggregation cost of the second step.
B: a matrix consisting of elements of $\bar{C}_1$.
$\bar{C}$: the general form of an aggregated matching cost.
$\breve{C}$: an updated disparity cost volume to reduce the noise effects in images.
$\bar{C}_f$: the final aggregated matching cost, obtained by integrating the proposed algorithm with the cross-scale cost aggregation scheme of [20,30].
$C_0$: the primary matching cost used in this paper.
$\bar{C}_1$: the proposed aggregation cost defined in the first step.
$\bar{C}_2$: the proposed aggregation cost defined in the second step.
$C_{Census}$: a matching cost calculated by the Hamming distance of the census transform.
$\tilde{C}_k$: the matching cost of the k-th scale layer, each aggregated by the proposed scheme.
$\hat{C}_k$: the estimated matching cost of the k-th scale layer, each aggregated by an integrated scheme consisting of the proposed scheme and the cross-scale cost aggregation algorithm of [20,30].
$C_{TGD}$: a matching cost calculated by the truncated absolute difference of the gradient.
D: a matrix consisting of elements of $C_0$.
d: the estimated disparity of a pixel.
$\hat{D}$: the disparity map based on an image pair.
Ham: a function to calculate the Hamming distance.
$H_x$: an independent objective function of image rows.
$I_L$, $I_R$: the left and right images, respectively.
$J_{CS}$: an objective function for the cross-scale scheme proposed in [20,30].
$J_h$, $J_v$: objective functions representing the least squared difference between the proposed aggregation cost and the primary matching cost in the horizontal and vertical directions, respectively.
K: the number of down-sampled pairs in the cross-scale cost aggregation. Each image pair is down-sampled from the original pair.
L: a path determined by the minimum spanning tree technique [17].
M: the vertical resolution of $I_L$.
min: a function to calculate the minimum value among several quantities.
N: the horizontal resolution of $I_L$.
$N_h$: a set of two horizontal neighbors.
p, q: the locations of pixels. The symbols are used interchangeably as parameters.
$T_{Census}$: the Census transformation.
$V_y$: an independent objective function of image columns.
$W_0$: the general form of aggregation weights.
$W_{GIF}$: the aggregation weight of the weighted guided image filter.
$W_{MST}$: the aggregation weight based on the minimum spanning tree technique [17].
Greek symbols
α: a weighting constant.
β: a shaping parameter.
γ: a constraining factor.
ε: a small constant introduced to avoid division by zero.
λ: a normalization factor.
μ: the mean value function.
σ²: the variance function.
Ω: the supporting region centered at a pixel.
Mathematical operators
$\nabla_x$, $\nabla_y$: the gradients of intensity in the horizontal and vertical directions, respectively.

References

  1. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
  2. Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239.
  3. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341.
  4. Yin, J.; Zhu, H.; Yuan, D.; Xue, T. Sparse representation over discriminative dictionary for stereo matching. Pattern Recognit. 2017, 71, 278–289.
  5. Hallek, M.; Boukamcha, H.; Mtibaa, A.; Atri, M. Dynamic programming with adaptive and self-adjusting penalty for real-time accurate stereo matching. J. Real-Time Image Proc. 2022, 19, 233–245.
  6. Lu, Z.; Wang, J.; Li, Z.; Chen, S.; Wu, F. A resource-efficient pipelined architecture for real-time semi-global stereo matching. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 660–673.
  7. Yoon, K.J.; Kweon, I.S. Adaptive support-weight approach for correspondence search. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 650–656.
  8. Liu, H.; Zhang, H.; Nie, X.; He, W.; Luo, D.; Jiao, G.; Chen, W. Stereo matching algorithm based on two-phase adaptive optimization of AD-Census and gradient fusion. In Proceedings of the Conference on Real-time Computing and Robotics, Xining, China, 15–19 July 2021; pp. 726–731.
  9. He, K.; Sun, J.; Tang, X. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1397–1409.
  10. Hosni, A.; Rhemann, C.; Bleyer, M.; Rother, C. Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 504–511.
  11. Li, Z.; Zheng, J.; Zhu, Z.; Yao, W.; Wu, S. Weighted guided image filtering. IEEE Trans. Image Process. 2014, 24, 120–129.
  12. Hong, G.S.; Kim, B.G. A local stereo matching algorithm based on weighted guided image filtering for improving the generation of depth range images. Displays 2017, 49, 80–87.
  13. Yang, Q.; Ji, P.; Li, D.; Yao, S.; Zhang, M. Fast stereo matching using adaptive guided filtering. Image Vis. Comput. 2014, 32, 202–211.
  14. Hamzah, R.A.; Ibrahim, H.; Hassan, A.H.A. Stereo matching algorithm based on per pixel difference adjustment, iterative guided filter and graph segmentation. J. Vis. Commun. Image Represent. 2017, 42, 145–160.
  15. Hamzah, R.A.; Kadmin, A.F.; Hamid, M.S.; Ghani, S.F.A.; Ibrahim, H. Improvement of stereo matching algorithm for 3D surface reconstruction. Signal Process. Image Commun. 2018, 65, 165–172.
  16. Kong, L.; Zhu, J.; Ying, S. Local stereo matching using adaptive cross-region-based guided image filtering with orthogonal weights. Math. Probl. Eng. 2021, 2021, 1–20.
  17. Yang, Q. Stereo matching using tree filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 834–846.
  18. Çığla, C.; Alatan, A.A. An efficient recursive edge-aware filter. Signal Process. Image Commun. 2014, 29, 998–1014.
  19. Zhu, C.; Chang, Y.Z. Efficient stereo matching based on pervasive guided image filtering. Math. Probl. Eng. 2019, 2019, 3128172.
  20. Zhu, C.; Chang, Y.Z. Hierarchical guided-image-filtering for efficient stereo matching. Appl. Sci. 2019, 9, 3122.
  21. Wu, W.; Zhu, H.; Yu, S.; Shi, J. Stereo matching with fusing adaptive support weights. IEEE Access 2019, 7, 61960–61974.
  22. Nguyen, V.D.; Nguyen, H.V.; Jeon, J.W. Robust stereo data cost with a learning strategy. IEEE Trans. Intell. Transp. Syst. 2017, 18, 248–258.
  23. Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6602–6611.
  24. Chen, S.; Zhang, J.; Jin, M. A simplified ICA-based local similarity stereo matching. Vis. Comput. 2021, 37, 411–419.
  25. Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 19–21 June 2018; pp. 5410–5418.
  26. Xie, Y.; Zheng, S.; Li, W. Feature-guided spatial attention upsampling for real-time stereo matching network. IEEE MultiMedia 2021, 28, 38–47.
  27. Tian, M.; Yang, B.; Chen, C.; Huang, R.; Huo, L. HPM-TDP: An efficient hierarchical PatchMatch depth estimation approach using tree dynamic programming. ISPRS J. Photogramm. Remote Sens. 2019, 155, 37–57.
  28. Zabih, R.; Woodfill, J. Non-parametric local transforms for computing visual correspondence. In Proceedings of the Third European Conference on Computer Vision, Stockholm, Sweden, 2–6 May 1994; pp. 151–158.
  29. Venetis, I.E.; Kouris, A.; Sobczyk, A.; Gallopoulos, E.; Sameh, A. A direct tridiagonal solver based on Givens rotations for GPU architectures. Parallel Comput. 2015, 49, 101–116.
  30. Zhang, K.; Fang, Y.; Min, D.; Sun, L.; Yang, S.; Yan, S.; Tian, Q. Cross-scale cost aggregation for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1590–1597.
  31. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nesic, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, 12–15 September 2014; pp. 1–12.
  32. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015; pp. 1–10.
  33. Rani, E.F.I.; Pushparaj, T.L.; Raj, E.F.I. Escalating the resolution of an urban aerial image via novel shadow amputation algorithm. Earth Sci. Inform. 2022, 15, 905–913.
Figure 1. A pixel located at p and its two horizontal neighbors, denoted as $N_h(x, y)$.
Figure 2. Two-dimensional visualization of a typical $A_v^{-1}$ (M = 480, N = 720, and $\lambda = 7$), where lighter colors represent higher values. (a) The entire matrix; (b) a partial view of (a) with the length enlarged by a factor of 16.
Figure 3. Three-dimensional visualization of a typical $A_v^{-1}$ (M = 480, N = 720, and $\lambda = 7$). Component values are expressed as heights.
Figure 4. $A_v^{-1}$ as a function of $\lambda$. The image size is 480-by-720. The plots show the element values of $A_v^{-1}$ on row 240. (a) The values of $A_v^{-1}$ for different $\lambda$ values. (b) Partial enlargement of (a).
Figure 5. Visual comparison of disparity maps obtained by different algorithms. (a) The left scene images selected from the Middlebury version 3 [31] dataset (from left to right): Adirondack, Jadeplant, Playroom, and Playtable. (b) Ground-truth disparity maps. (c) Disparity maps generated by RTSMNet [25]. (d) Disparity maps generated by ISM [15]. (e) Disparity maps generated by DDL [4]. (f) Disparity maps generated by TPAO [8]. (g) Disparity maps generated by IGF [14]. (h) Disparity maps generated by ACR-GIF-OW [16]. (i) Disparity maps generated by HGIF [20]. (j) Disparity maps generated by FASW [21]. (k) Disparity maps generated by the proposed scheme.
Figure 6. Comparison of the error rates in the non-occluded region. This diagram is a graphical representation of Table 2.
Figure 7. Comparison of the error rates in the all-region. This diagram is a graphical representation of Table 3.
Figure 8. Visual comparison of disparity maps obtained by different algorithms using the KITTI Vision Benchmark Suite [32]. (a) Three of the left scene images selected from the dataset. (b) Ground-truth disparity maps. (c) Disparity maps generated by ISM [15]. (d) Disparity maps generated by AGF [13]. (e) Disparity maps generated by MST [17]. (f) Disparity maps generated by HGIF [20]. (g) Disparity maps generated by FASW [21]. (h) Disparity maps generated by the proposed scheme.
Figure 9. Disparity maps of remote sensing images. (a) Left images of the scenes. (b) Right images of the scenes. (c) Disparity maps obtained by the proposed algorithm.
Table 1. Resolution of the images for stereo matching.
| Resolution | Adirondack | ArtL | Jadeplant | Motorcycle | MotorcycleE | Piano | PianoL | Pipes |
|---|---|---|---|---|---|---|---|---|
| M | 496 | 277 | 497 | 497 | 497 | 481 | 481 | 485 |
| N | 718 | 347 | 659 | 741 | 741 | 707 | 707 | 735 |

| Resolution | Playroom | Playtable | PlaytableP | Recycle | Shelves | Teddy | Vintage |
|---|---|---|---|---|---|---|---|
| M | 476 | 463 | 462 | 486 | 497 | 450 | 480 |
| N | 699 | 680 | 681 | 720 | 738 | 375 | 722 |
Table 2. Comparison of the error rates in the non-occluded region without refinements using different matching algorithms (%). In calculating the error rates, the error threshold = 1.0.
| Test Sets | RTSMNet | ISM | DDL | TPAO | IGF | ACR-GIF-OW | HGIF | FASW | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Adirondack | **4.5** | 15.5 | 6.5 | 7.1 | 14.9 | 7.9 | 5.7 | 6.8 | 4.8 |
| ArtL | 11.5 | 16.5 | 13.1 | 13.2 | 15.9 | 10.1 | 10.9 | 10.5 | **9.7** |
| Jadeplant | 43.0 | 25.1 | 18.8 | **16.1** | 24.2 | 17.4 | 19.7 | 19.0 | 17.4 |
| Motorcycle | 8.9 | 10.0 | 9.0 | **6.8** | 9.6 | **6.8** | 8.5 | **6.8** | 7.6 |
| MotorcycleE | 8.9 | 11.0 | 7.7 | **5.9** | 10.5 | 6.8 | 8.5 | 6.3 | 6.7 |
| Piano | **13.3** | 23.5 | 14.8 | 17.9 | 24.1 | 19.1 | 14.0 | 13.8 | 14.3 |
| PianoL | **25.2** | 32.6 | 26.6 | 26.7 | 32.1 | 34.2 | 29.3 | 25.8 | 27.2 |
| Pipes | 13.8 | 15.5 | 11.6 | 10.1 | 15.1 | 9.4 | 9.9 | 9.4 | **9.1** |
| Playroom | 18.0 | 21.6 | 16.2 | 17.6 | 21.5 | 18.7 | **15.2** | 16.7 | 16.3 |
| Playtable | **10.6** | 39.6 | 18.4 | 27.3 | 39.6 | 24.3 | 16.7 | 20.8 | 16.6 |
| PlaytableP | **6.7** | 20.9 | 10.6 | 12.5 | 20.9 | 11.4 | 11.0 | 10.2 | 10.3 |
| Recycle | **6.8** | 15.0 | 9.1 | 9.1 | 14.4 | 9.5 | 8.3 | 8.1 | 8.7 |
| Shelves | **22.2** | 34.4 | 37.8 | 38.6 | 36.0 | 38.4 | 31.6 | 32.6 | 33.6 |
| Teddy | 9.4 | 7.2 | 6.2 | 6.2 | 7.0 | 6.1 | 5.6 | **5.2** | 6.3 |
| Vintage | 39.3 | 34.4 | 27.3 | 27.5 | 34.6 | 25.6 | 27.0 | **23.9** | 26.5 |
| Weighted Average | 14.7 | 19.3 | 13.6 | 13.9 | 19.1 | 14.0 | 12.9 | 12.5 | **12.4** |
The best score for each image set is marked in bold.
Table 3. Comparison of the error rates in the all-region without refinement (%). In calculating the error rates, the error threshold = 1.0. The best score for each image set is highlighted in bold.
| Test Sets | RTSMNet | ISM | DDL | TPAO | IGF | ACR-GIF-OW | HGIF | FASW | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Adirondack | **6.2** | 18.4 | 8.8 | 11.7 | 17.6 | 12.5 | 8.0 | 8.7 | 7.1 |
| ArtL | **15.5** | 25.6 | 22.8 | 24.5 | 24.8 | 23.4 | 21.2 | 20.8 | 20.7 |
| Jadeplant | 47.8 | 36.8 | 32.4 | **28.4** | 35.9 | 31.3 | 33.0 | 32.3 | 31.8 |
| Motorcycle | 11.7 | 14.9 | 13.4 | 14.1 | 14.5 | 14.9 | 12.6 | **10.9** | 11.9 |
| MotorcycleE | 11.7 | 15.9 | 11.8 | 13.1 | 15.3 | 14.6 | 12.9 | **10.5** | 11.2 |
| Piano | **16.8** | 27.6 | 19.2 | 22.9 | 28.1 | 23.8 | 18.1 | 18.2 | 18.4 |
| PianoL | **28.4** | 36.3 | 30.5 | 30.9 | 35.7 | 37.9 | 33.1 | 29.5 | 30.7 |
| Pipes | 21.7 | 26.2 | 22.7 | 22.8 | 25.7 | 21.8 | 21.5 | **20.2** | 20.7 |
| Playroom | 25.0 | 29.7 | 25.1 | 26.7 | 29.4 | 28.4 | **23.8** | 25.5 | 25.0 |
| Playtable | **13.6** | 42.9 | 23.3 | 33.1 | 42.9 | 30.5 | 21.2 | 25.0 | 22.6 |
| PlaytableP | **8.8** | 25.9 | 14.3 | 19.2 | 25.9 | 18.5 | 14.3 | 13.4 | 14.4 |
| Recycle | **7.8** | 17.5 | 11.4 | 12.7 | 17.0 | 13.6 | 10.3 | 10.2 | 11.2 |
| Shelves | **23.7** | 35.5 | 39.1 | 39.9 | 37.0 | 40.3 | 32.6 | 34.0 | 34.7 |
| Teddy | 11.2 | 12.0 | 11.8 | 12.7 | 11.8 | 12.3 | 11.2 | **10.9** | 11.7 |
| Vintage | 41.1 | 37.7 | 31.5 | 32.2 | 37.9 | 30.1 | 31.0 | **28.7** | 30.8 |
| Weighted Average | **18.0** | 24.9 | 19.5 | 21.1 | 24.6 | 21.6 | 18.7 | 18.2 | 18.4 |
Table 4. Comparison of the running time of each algorithm for a pair of rectified images (s). These computations were executed in MATLAB 2017b using an Intel Core i5-8300H and 16 GB RAM.
| Test Sets | ACR-GIF-OW | HGIF | Proposed |
|---|---|---|---|
| Adirondack | 215 | 31 | **10** |
| ArtL | 34 | 7 | **1** |
| Jadeplant | 374 | 87 | **21** |
| Motorcycle | 191 | 37 | **10** |
| MotorcycleE | 192 | 36 | **10** |
| Piano | 169 | 27 | **7** |
| PianoL | 171 | 26 | **7** |
| Pipes | 197 | 37 | **11** |
| Playroom | 211 | 31 | **11** |
| Playtable | 169 | 29 | **9** |
| PlaytableP | 167 | 30 | **9** |
| Recycle | 199 | 30 | **8** |
| Shelves | 195 | 29 | **8** |
| Teddy | 68 | 14 | **4** |
| Vintage | 499 | 98 | **25** |
The best score for each image set is marked in bold.
Table 5. Comparison of the effects of each algorithm using the KITTI image dataset.
| Algorithms | Non-Occ (%) | All (%) | Non-Occ (Pixels) | All (Pixels) |
|---|---|---|---|---|
| ISM | 8.88 | 10.01 | 1.87 | 2.07 |
| AGF | 8.59 | 9.73 | 1.77 | 1.99 |
| MST | 23.27 | 24.41 | 3.47 | 4.15 |
| HGIF | 6.57 | 7.78 | 1.35 | 1.61 |
| FASW | 6.89 | 8.12 | 1.31 | **1.45** |
| Proposed | **6.30** | **7.48** | **1.30** | 1.58 |
Non-Occ (%): percentage of erroneous pixels in non-occluded areas; All (%): percentage of erroneous pixels in all regions; Non-Occ (pixels): average disparity error in pixels in non-occluded areas; All (pixels): average disparity error in pixels in all regions. The best score for each metric is marked in bold.
