**2. Proposed Method**

#### *2.1. Cost Aggregation Based on Pervasive Guided-Image-Filtering (PGIF)*

This section presents the basic procedure of the pervasive guided-image-filtering (PGIF) [25] scheme. It also establishes the definitions of the variables and operations used by the proposed scheme presented in the next section.

In GIF-based stereo matching algorithms, such as the WGIF [21], GDGIF [22], and PGIF [25] schemes discussed in the previous section, the aggregated cost $\overline{C}_{d,0}(p)$ is a linear function of the guidance image $I_G$ over a local disparity patch:

$$
\overline{\mathcal{C}}\_{d,0}(p) = a\_{d,0}(p) \cdot I\_{\mathcal{G}}(p) + b\_{d,0}(p), \tag{1}
$$

where $p = (x, y)$ is the location of the central pixel and $d$ indicates the disparity. Note that $\overline{C}_{d,0}(p)$ has three dimensions, $(x, y, d)$, and is often called the aggregated cost volume. Likewise, $a_{d,0}(p)$ and $b_{d,0}(p)$ can be called the parameter volumes of the model. The second subscript, 0, indicates that these values relate to the original resolution. For the following stereo matching operations, we regard the left image as the reference image, such that $I_G = I_{L,0}$, and the right image as the target image.

The disparity map $D(p)$ is composed of the disparity value corresponding to the minimum aggregated cost at each location $p$:

$$D(p) = \arg\min\_{d} \overline{C}\_{d,0}(p). \tag{2}$$

To solve for $a_{d,0}(p)$ and $b_{d,0}(p)$, we proposed an objective function in Reference [25] that comprises a weighted sum of the squared difference between the linear model and a primary matching cost, denoted as $C_{d,0}(q)$, together with a regularization term:

$$E_{d,0}(p) = \sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot \left\{ \left[ a_{d,0}(p) \cdot I_{L,0}(q) + b_{d,0}(p) - C_{d,0}(q) \right]^2 + \varepsilon(p) \cdot \left[ a_{d,0}(p) \right]^2 \right\}, \tag{3}$$

where $\Omega_0(p)$ is the support window for a pixel located at $p$, $\omega_0(p, q)$ is the weight that reflects the contribution of a pixel located at $q$, and $\varepsilon(p)$ is the regularization factor that limits the magnitude of $a_{d,0}(p)$. Thanks to an efficient iterative computation scheme, to be presented below, the support window is the whole image.

The primary matching cost $C_{d,0}(q)$ represents the degree of match between a pixel located at $q$ in the left image and a pixel located at $q + d$ in the right image; the cost is smaller for a higher degree of match. We employ a truncated version of the absolute gradient difference to define the cost:

$$C_{d,0}(q) = \min \left( \left| \nabla_x I_{L,0}(q) - \nabla_x I_{R,0}(q+d) \right|, \tau \right) + \min \left( \left| \nabla_y I_{L,0}(q) - \nabla_y I_{R,0}(q+d) \right|, \tau \right) \tag{4}$$

where $I_{L,0}$ and $I_{R,0}$ are the left and right images of the stereo pair at the original resolution, $\nabla_x$ and $\nabla_y$ are the horizontal and vertical gradients, respectively, and $\tau$ is the truncation threshold, typically set to 2. The threshold helps to reduce mismatches in occluded or noisy regions.
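For illustration, the truncated gradient cost of Equation (4) can be sketched in a few lines of Python. The helper name `primary_cost` is hypothetical, and `np.roll` stands in for the horizontal shift by $d$ while ignoring border effects:

```python
import numpy as np

def primary_cost(left, right, d, tau=2.0):
    """Truncated absolute-gradient-difference cost (Equation (4)) for one
    disparity d.  A minimal sketch: images are 2-D float arrays."""
    # Horizontal and vertical gradients (central differences via np.gradient).
    gx_l, gy_l = np.gradient(left, axis=1), np.gradient(left, axis=0)
    gx_r, gy_r = np.gradient(right, axis=1), np.gradient(right, axis=0)
    # Shift the right image's gradients so that pixel q of the left image
    # is compared with pixel q + d of the right image.
    gx_r = np.roll(gx_r, -d, axis=1)
    gy_r = np.roll(gy_r, -d, axis=1)
    # Truncate each absolute difference at tau to limit outlier influence.
    return (np.minimum(np.abs(gx_l - gx_r), tau)
            + np.minimum(np.abs(gy_l - gy_r), tau))
```

Computing this for every candidate $d$ stacks the 2-D cost slices into the three-dimensional primary cost volume.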

The optimum values of the parameters in Equation (3), denoted $a^*_{d,0}(p)$ and $b^*_{d,0}(p)$, are obtained by minimizing the objective function of Equation (3):

$$\begin{cases} \dfrac{\partial E_{d,0}(p)}{\partial a_{d,0}(p)} = 0 \\[2mm] \dfrac{\partial E_{d,0}(p)}{\partial b_{d,0}(p)} = 0 \end{cases} \tag{5}$$

We have:

$$a^*_{d,0}(p) = \frac{\dfrac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot I_{L,0}(q) \cdot C_{d,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)} - \dfrac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot I_{L,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)} \cdot \dfrac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot C_{d,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)}}{\dfrac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot I^2_{L,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)} - \left[\dfrac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot I_{L,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)}\right]^2 + \varepsilon(p)}$$

$$b^*_{d,0}(p) = \frac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot C_{d,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)} - a^*_{d,0}(p) \cdot \frac{\sum_{q \in \Omega_0(p)} \omega_0(p,q) \cdot I_{L,0}(q)}{\sum_{q \in \Omega_0(p)} \omega_0(p,q)} \tag{6}$$
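For a single pixel, the closed-form solution of Equation (6) is a weighted least-squares fit. The following is a simplified, hypothetical routine (the actual PGIF implementation evaluates these sums efficiently over the whole image using the recursive weights):

```python
import numpy as np

def gif_params(I, C, w, eps=1e-4):
    """Closed-form solution of Equation (6) at one pixel p.  A sketch,
    assuming I (guidance values), C (primary costs) and w (weights
    omega_0(p, q)) are 1-D arrays over the support window Omega_0(p)."""
    wsum = w.sum()
    mean_I  = (w * I).sum() / wsum          # weighted mean of I_L,0
    mean_C  = (w * C).sum() / wsum          # weighted mean of C_d,0
    mean_IC = (w * I * C).sum() / wsum      # weighted mean of I * C
    mean_II = (w * I * I).sum() / wsum      # weighted mean of I^2
    # Weighted covariance over regularized weighted variance, then intercept.
    a = (mean_IC - mean_I * mean_C) / (mean_II - mean_I ** 2 + eps)
    b = mean_C - a * mean_I
    return a, b
```

When the cost is exactly linear in the guidance image, the routine recovers the slope and intercept, as expected from the model of Equation (1).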

In PGIF [25], the weight $\omega_0(p, q)$ is decomposed into horizontal and vertical weighting factors, $W^H_{i,x,0}$ and $W^V_{j,y,0}$:

$$\omega_0(p, q) = \omega_0(x, y, i, j) = W^H_{i,x,0} \cdot W^V_{j,y,0} \tag{7}$$

With the convention that $q = (i, j)$ and $p = (x, y)$, the weighting factors $W^H_{i,x,0}$ and $W^V_{j,y,0}$ can be recursively calculated as:

$$W^H_{i,x,0} = \begin{cases} \displaystyle\prod_{u=\min(i,x)+1}^{\max(i,x)} \exp\left(\frac{f(|I_{L,0}(u,j) - I_{L,0}(u-1,j)|)}{-\beta}\right), & i \neq x \\ 1, & i = x \end{cases}$$

$$W^V_{j,y,0} = \begin{cases} \displaystyle\prod_{u=\min(j,y)+1}^{\max(j,y)} \exp\left(\frac{f(|I_{L,0}(x,u) - I_{L,0}(x,u-1)|)}{-\beta}\right), & j \neq y \\ 1, & j = y \end{cases} \tag{8}$$

$$\text{with } f(\xi) = \begin{cases} 0, & \text{if } \xi = 0 \\ 1, & \text{if } \xi > 0 \end{cases}$$

where the parameter $\beta$ is a constant factor. By this scheme, the weight $\omega_0(p, q)$ depends on both the spatial and intensity differences, so the contribution of any pixel located at $q$ to the pixel located at $p$ can be effectively incorporated. Besides, the introduction of the function $f$ alleviates the effect of abrupt changes in intensity, enhancing the immunity to noise in the recursive calculation.
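The recursion behind Equation (8) can be illustrated along a single image row. This is a simplified sketch using the binary indicator $f$ above; the helper name `horizontal_weights` is hypothetical:

```python
import numpy as np

def horizontal_weights(row, x, beta=2.0):
    """Recursive computation of the horizontal factors W^H_{i,x,0} of
    Equation (8) for a fixed reference column x, where `row` is one
    image row I_L,0(., j)."""
    f = lambda xi: 0.0 if xi == 0 else 1.0   # binary edge indicator
    n = len(row)
    w = np.ones(n)                           # W^H_{x,x,0} = 1
    # Moving away from x, each step multiplies the previous weight by
    # exp(-f(|dI|)/beta); this recursion makes whole-image support cheap.
    for i in range(x + 1, n):
        w[i] = w[i - 1] * np.exp(-f(abs(row[i] - row[i - 1])) / beta)
    for i in range(x - 1, -1, -1):
        w[i] = w[i + 1] * np.exp(-f(abs(row[i + 1] - row[i])) / beta)
    return w
```

On a flat row the weights stay at 1, while each intensity edge crossed between $x$ and $i$ multiplies the weight by $e^{-1/\beta}$, so pixels separated by many edges contribute little.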

#### *2.2. Stereo Matching Based on Hierarchical Guided-Image-Filtering (HGIF)*

A major shortcoming of approaches built on the energy function of Equation (3), including WGIF [21], GDGIF [22], and PGIF [25], is that the cost metrics are restricted to local regions of the support window $\Omega_0(p)$. Even though PGIF [25] exploits the whole image for matching, the effect of patterns far from the pixel under aggregation decays with distance.

The proposed scheme begins with a down-sampling of the original image pair, $I_{L,0}$ and $I_{R,0}$, where the left image is selected as the reference image and the right image as the target image. As the images are repeatedly down-sampled by a factor of 2, we denote the resultant image pairs as $I_{L,z}$ and $I_{R,z}$ for $z \in \{0, 1, \dots, K\}$, with $K$ being the coarsest level.
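The pyramid construction can be sketched as follows. Block averaging is an assumption here for illustration only; the paper does not prescribe a particular anti-aliasing filter:

```python
import numpy as np

def build_pyramid(img, K=2):
    """Build the K+1 level pyramid I_z, z = 0..K, by repeated factor-2
    down-sampling (2x2 block averaging as an illustrative choice)."""
    levels = [img.astype(float)]
    for _ in range(K):
        prev = levels[-1]
        # Crop to even dimensions, then average each 2x2 block.
        h, w = (prev.shape[0] // 2) * 2, (prev.shape[1] // 2) * 2
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        levels.append(blocks.mean(axis=(1, 3)))
    return levels
```

For a 720 by 480 input and $K = 2$, this yields the 360 by 240 and 180 by 120 levels used in the experiments below.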

First, the optimal parameters $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$ of each image pair $I_{L,z}$ and $I_{R,z}$ are calculated using Equation (6). We then define a set of unknown parameters, $\hat{a}_{d,z}(p)$ and $\hat{b}_{d,z}(p)$, as the aggregated parameters to be found. To aggregate the cross-scale patterns, we create an objective function $F_d(p)$ that is the weighted mean-square error between the optimized parameters, $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$, and the unknown parameters, $\hat{a}_{d,z}(p)$ and $\hat{b}_{d,z}(p)$, across all scales:

$$\begin{split} F_d(p) &= \sum_{z=0}^{K} \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot \left[a^*_{d,z}(q) - \hat{a}_{d,z}(p)\right]^2}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} + \sum_{z=0}^{K} \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot \left[b^*_{d,z}(q) - \hat{b}_{d,z}(p)\right]^2}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} \\ &+ \sum_{z=1}^{K} \gamma^z \cdot \left( \left[\hat{a}_{d,z}(p) - \hat{a}_{d,z-1}(p)\right]^2 + \left[\hat{b}_{d,z}(p) - \hat{b}_{d,z-1}(p)\right]^2 \right), \end{split} \tag{9}$$

where the weight $\omega_z(p, q)$ is defined similarly to Equations (7) and (8) with the subscript 0 replaced by $z$. This objective function inherits the weighting scheme of Reference [25], which not only reflects the effects of distance and intensity difference but also applies constraints on the GIF parameters in the scale direction.

In Equation (9), the positive constant $\gamma$ is a constraint factor on the squared differences between the parameters $\hat{a}_{d,z}(p)$ and $\hat{a}_{d,z-1}(p)$, and between $\hat{b}_{d,z}(p)$ and $\hat{b}_{d,z-1}(p)$, respectively. The constraint weights $\gamma^z$, $z \in \{1, \dots, K\}$, are larger between layers that are far from the original image (layer 0) if $\gamma > 1$; this trend is reversed for $\gamma < 1$.

There are in total $2(K + 1)$ unknown parameters in Equation (9), namely $\hat{a}_{d,0}(p), \hat{b}_{d,0}(p), \dots, \hat{a}_{d,K}(p)$, and $\hat{b}_{d,K}(p)$. They can be obtained by setting the partial derivatives of Equation (9) with respect to these parameters to zero:

$$\begin{cases} \dfrac{\partial F_d(p)}{\partial \hat{a}_{d,z}(p)} = 0 \\[2mm] \dfrac{\partial F_d(p)}{\partial \hat{b}_{d,z}(p)} = 0 \end{cases}, \quad z \in \{0, 1, 2, \dots, K\} \tag{10}$$

Equation (10) can be simplified as:

$$\begin{cases} (1+\gamma)\cdot\hat{a}_{d,0}(p) - \gamma\cdot\hat{a}_{d,1}(p) = g_0(a^*_{d,0}(p)), & \text{for } z = 0 \\ -\gamma^z\cdot\hat{a}_{d,z-1}(p) + (1+\gamma^z+\gamma^{z+1})\cdot\hat{a}_{d,z}(p) - \gamma^{z+1}\cdot\hat{a}_{d,z+1}(p) = g_z(a^*_{d,z}(p)), & \text{for } z \in \{1, 2, \dots, K-1\} \\ -\gamma^K\cdot\hat{a}_{d,K-1}(p) + (1+\gamma^K)\cdot\hat{a}_{d,K}(p) = g_K(a^*_{d,K}(p)), & \text{for } z = K \end{cases} \tag{11}$$

and:

$$\begin{cases} (1+\gamma)\cdot\hat{b}_{d,0}(p) - \gamma\cdot\hat{b}_{d,1}(p) = g_0(b^*_{d,0}(p)), & \text{for } z = 0 \\ -\gamma^z\cdot\hat{b}_{d,z-1}(p) + (1+\gamma^z+\gamma^{z+1})\cdot\hat{b}_{d,z}(p) - \gamma^{z+1}\cdot\hat{b}_{d,z+1}(p) = g_z(b^*_{d,z}(p)), & \text{for } z \in \{1, 2, \dots, K-1\} \\ -\gamma^K\cdot\hat{b}_{d,K-1}(p) + (1+\gamma^K)\cdot\hat{b}_{d,K}(p) = g_K(b^*_{d,K}(p)), & \text{for } z = K, \end{cases} \tag{12}$$

where $g_z$ is a nominal average function of the parameters $a^*_{d,z}(q)$ and $b^*_{d,z}(q)$ with the scale-dependent weights $\omega_z(p, q)$:

$$g_z(a^*_{d,z}(p)) = \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot a^*_{d,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)}, \qquad g_z(b^*_{d,z}(p)) = \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot b^*_{d,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} \tag{13}$$

The aggregated parameters $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$ can be solved from the system of linear Equations (11) and (12). Taking the case of $K = 2$ (down-sampled twice) and $\gamma = 1.5$ as an example, Equation (11) becomes:

$$\begin{cases} 2.5\cdot\hat{a}_{d,0}(p) - 1.5\cdot\hat{a}_{d,1}(p) = g_0(a^*_{d,0}(p)) \\ -1.5\cdot\hat{a}_{d,0}(p) + 4.75\cdot\hat{a}_{d,1}(p) - 2.25\cdot\hat{a}_{d,2}(p) = g_1(a^*_{d,1}(p)) \\ -2.25\cdot\hat{a}_{d,1}(p) + 3.25\cdot\hat{a}_{d,2}(p) = g_2(a^*_{d,2}(p)) \end{cases} \tag{14}$$

We have that:

$$\hat{a}_{d,0}(p) = 0.557 \cdot g_0(a^*_{d,0}(p)) + 0.262 \cdot g_1(a^*_{d,1}(p)) + 0.181 \cdot g_2(a^*_{d,2}(p)). \tag{15}$$

Likewise, $\hat{b}_{d,0}(p)$ is a linear combination of the parameters $b^*_{d,0}(p)$, $b^*_{d,1}(p)$, and $b^*_{d,2}(p)$:

$$\hat{b}_{d,0}(p) = 0.557 \cdot g_0(b^*_{d,0}(p)) + 0.262 \cdot g_1(b^*_{d,1}(p)) + 0.181 \cdot g_2(b^*_{d,2}(p)) \tag{16}$$
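The coefficients 0.557, 0.262, and 0.181 follow from inverting the tridiagonal matrix of Equations (11) and (12). A short numerical check (the helper name `cross_scale_coefficients` is hypothetical):

```python
import numpy as np

def cross_scale_coefficients(K=2, gamma=1.5):
    """Build the (K+1)x(K+1) tridiagonal system of Equations (11)/(12)
    and return row 0 of its inverse, i.e. the weights combining g_0..g_K
    into the finest-level aggregated parameter."""
    A = np.zeros((K + 1, K + 1))
    A[0, 0], A[0, 1] = 1 + gamma, -gamma
    for z in range(1, K):
        A[z, z - 1] = -gamma ** z
        A[z, z] = 1 + gamma ** z + gamma ** (z + 1)
        A[z, z + 1] = -gamma ** (z + 1)
    A[K, K - 1], A[K, K] = -gamma ** K, 1 + gamma ** K
    # The matrix depends only on K and gamma, so its inverse can be
    # precomputed once and reused for every pixel and disparity.
    return np.linalg.inv(A)[0]
```

For $K = 2$ and $\gamma = 1.5$ this returns approximately $[0.557, 0.262, 0.181]$, matching Equations (15) and (16).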

Thus, the aggregated parameters $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$ include the effects of features coming from different resolutions. These parameters are three-dimensional with dimensions $(x, y, d)$, where the $x$ and $y$ dimensions lie in the image plane and the $d$ dimension is in the disparity direction.

After obtaining the aggregated parameters, the cost volume $\hat{C}_d(p)$ is calculated according to Equation (1):

$$\hat{C}_d(p) = \hat{a}_{d,0}(p) \cdot I_{L,0}(p) + \hat{b}_{d,0}(p) \tag{17}$$

This calculation is conducted using elementwise multiplications and additions in the image plane for each disparity. Finally, the disparity map $\hat{D}(p)$ is obtained using the winner-take-all (WTA) procedure of Equation (2):

$$
\hat{D}(p) = \operatorname\*{argmin}\_{d} \hat{C}\_{d}(p). \tag{18}
$$
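The WTA step of Equation (18) amounts to an arg-min along the disparity axis of the cost volume. A minimal sketch, assuming a `(D, H, W)` array layout with the disparity index on axis 0:

```python
import numpy as np

def winner_take_all(cost_volume):
    """Equation (18): at each pixel, pick the disparity with the minimum
    aggregated cost.  cost_volume has shape (D, H, W)."""
    return np.argmin(cost_volume, axis=0)
```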

The procedure of our proposed algorithm for stereo matching is depicted in Figure 1, taking *K* = 2 as an example, and summarized in the following steps:

1. Down-sample the image pair to build a pair of image pyramids, $I_{L,z}$ and $I_{R,z}$ for $z \in \{0, 1, \dots, K\}$, with $K + 1$ levels of resolution;

2. Calculate the weight matrix $\omega_z(p, q)$ as a product of two weighting factors, $W^H_{i,x,z}$ and $W^V_{j,y,z}$, for $z \in \{0, 1, \dots, K\}$:

$$\omega_z(p, q) = W^H_{i,x,z} \cdot W^V_{j,y,z} \tag{19}$$

where the weighting factors are recursively calculated for *z* ∈ {0, 1, ... ,*K*} as:

$$W^H_{i,x,z} = \begin{cases} \displaystyle\prod_{u=\min(i,x)+1}^{\max(i,x)} \exp\left(\frac{f(|I_{L,z}(u,j) - I_{L,z}(u-1,j)|)}{-\beta}\right), & i \neq x \\ 1, & i = x \end{cases}$$

$$W^V_{j,y,z} = \begin{cases} \displaystyle\prod_{u=\min(j,y)+1}^{\max(j,y)} \exp\left(\frac{f(|I_{L,z}(x,u) - I_{L,z}(x,u-1)|)}{-\beta}\right), & j \neq y \\ 1, & j = y \end{cases} \tag{20}$$

$$\text{with } f(\xi) = \begin{cases} 0, & \text{if } \xi = 0 \\ 1, & \text{if } \xi > 0 \end{cases}$$

3. Generate the primary matching cost volume for each scale where *z* ∈ {0, 1, ... ,*K*}:

$$C_{d,z}(q) = \min \left( \left| \nabla_x I_{L,z}(q) - \nabla_x I_{R,z}(q+d) \right|, \tau \right) + \min \left( \left| \nabla_y I_{L,z}(q) - \nabla_y I_{R,z}(q+d) \right|, \tau \right). \tag{21}$$

4. Find the optimum parameters for each resolution, where *z* ∈ {0, 1, ... , *K*}:

$$a^*_{d,z}(p) = \frac{\dfrac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot I_{L,z}(q) \cdot C_{d,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} - \dfrac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot I_{L,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} \cdot \dfrac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot C_{d,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)}}{\dfrac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot I^2_{L,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} - \left[\dfrac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot I_{L,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)}\right]^2 + \varepsilon(p)} \tag{22}$$

$$b^*_{d,z}(p) = \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot C_{d,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)} - a^*_{d,z}(p) \cdot \frac{\sum_{q \in \Omega_z(p)} \omega_z(p,q) \cdot I_{L,z}(q)}{\sum_{q \in \Omega_z(p)} \omega_z(p,q)}. \tag{23}$$

5. Calculate the nominal average functions of the parameters, $g_z(a^*_{d,z}(p))$ and $g_z(b^*_{d,z}(p))$, according to Equation (13) for $z \in \{0, 1, \dots, K\}$;


**Figure 1.** An overview of the proposed stereo matching process with two levels of down-sampling as an example.

As depicted in Figure 1, the parameters $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$ are found by solving the system of linear equations composed of Equations (11) and (12). As the matrix inverse can be computed in advance, the calculation reduces to a matrix multiplication, as demonstrated in Equations (14) and (15).

Besides, the cost volume $\hat{C}_d(p)$ is calculated using the linear model of Equation (17). In contrast to the current multi-scale approaches in the literature, such as References [27–29], where the cross-scale aggregation is applied to the costs themselves, the parameter-based cross-resolution aggregation of the proposed procedure is unique and efficient.

## **3. Experimental Results**

In the proposed scheme, there are two design parameters: $\beta$ in Equation (20) and $\gamma$ in Equation (9). According to Equations (19) and (20), a larger $\beta$ causes $W^H_{i,x,z}$ and $W^V_{j,y,z}$ to increase, so $\omega_z$ is larger and $\hat{D}(p)$ becomes smoother. Based on Equation (9), a larger $\gamma$ imposes a stronger constraint between the scale layers.

We conducted experiments using the training dataset of the KITTI Vision Benchmark Suite [30] to determine the proper values for these two parameters. The dataset contains 200 image pairs and 200 ground truth disparity maps. The results of the performance of the proposed scheme using the dataset are summarized in Figure 2.

Figure 2a,b display the mean values of the percentage of erroneous pixels in the disparity maps, denoted as the average error rates, together with their standard deviations. Each mean value and standard deviation was calculated over the 200 image pairs. The best performance was achieved with $\beta = 2$ and $\gamma = 1.5$. We fixed these values for all the experiments presented in this section.

**Figure 2.** The effect of parameter selection on the performance of the proposed scheme using 200 stereo image pairs of the training dataset in the KITTI Vision Benchmark Suite [30]. The dataset was downloaded from the URL: www.cvlibs.net/datasets/kitti/index.php. Each figure shows the mean values of the error rates and their corresponding standard deviations. (**a**) The average error rate with respect to the parameter β when γ = 1.5. (**b**) The average error rate with respect to the parameter γ when β = 2.

To validate the effectiveness of the proposed scheme, extensive comparative experiments were conducted. We studied six state-of-the-art stereo matching algorithms to compare with the proposed scheme:


We tested these frameworks using the Middlebury (version 3) benchmark stereo database, downloaded via the URL: vision.middlebury.edu/stereo/ [31]. The "trainingQ" image set, which contains 15 stereo pairs, from "Adirondack" to "Vintage", was used for the performance evaluation. Of them, only five are shown in Figure 3 due to space limitations: "Adirondack", "Pipes", "Playroom", "Playtable", and "Shelves". These images have a typical resolution of 720 by 480. In the proposed scheme, each image pair is down-sized twice, for example to 360 by 240 and 180 by 120; this arrangement corresponds to $K = 2$.

**Figure 3.** Datasets and their corresponding ground truth disparity maps selected from the experimented data for visual comparison. (**a**) Left images of the image pairs: Adirondack, Pipes, Playroom, Playtable, and Shelves; (**b**) ground truth disparity maps of these images. Image courtesy of the Middlebury (version 3) benchmark stereo database via the URL: vision.middlebury.edu/stereo/ [31].

In addition, the Middlebury benchmark defines two measures for evaluating average error rates: the non-occluded (non-occ) region and the all region. The weighted average error rate is an official metric of the benchmark for measuring matching accuracy, using different weights for different image pairs. These weights compensate for the variation in matching difficulty, as remarked on the website. Specifically, the image pairs "PianoL", "Playroom", "Playtable", "Shelves", and "Vintage" contribute only half of their error rates.

Figure 4 shows the disparity maps obtained by these algorithms. Among them, only the results of SRDD [13] were improved by the disparity refinement procedure. The corresponding error rates in the non-occluded region and the all-region are summarized in Tables 1 and 2, respectively.

A close look at the disparity maps of Figure 4 shows a significant improvement in the matching quality of CS-FCVF [19,29] over FCVF [19], especially in the texture-less regions of the "Playtable" and "Shelves" cases. This improvement was less pronounced, but still observable, for CS-PGIF [25,29] over PGIF [25]. These improvements are due to the effective cross-scale cost aggregation scheme of Reference [29].

According to the weighted average error rates listed in Tables 1 and 2, DSG [8] performed worst. The matching performance of SRDD [13] was better than that of CS-FCVF [19,29] but slightly worse than that of CS-PGIF [25,29]. It is worth noting that SRDD [13] applies semi-global cost aggregation and post-processing refinement to further improve the matching accuracy. Nevertheless, the proposed scheme, even without refinement, performed better than SRDD [13] in most of the cases and achieved the smallest weighted average error rates in both the all region and the non-occluded (non-occ) region.

The experiments were executed in MATLAB 2017b on an Intel Core i5-8300H with 16 GB of RAM. Table 3 summarizes the execution times of these algorithms. As expected, the multi-scale versions, CS-FCVF [19,29] and CS-PGIF [25,29], required more computation time than their original versions, FCVF [19] and PGIF [25], respectively.

Similarly, because the proposed algorithm needs to calculate the filtering parameters in multiple scale layers, it ran longer than the FCVF algorithm [19] and the PGIF algorithm [25]. In addition, the CS-FCVF [19,29] and CS-PGIF [25,29] algorithms aggregate only one matching cost quantity, whereas the proposed algorithm has to compute two parameters, $\hat{a}_{d,0}(p)$ and $\hat{b}_{d,0}(p)$, in Equations (11) and (12). Consequently, the proposed algorithm ran slightly longer than CS-FCVF [19,29] and CS-PGIF [25,29]. However, the increase in computation time was marginal and within the same order of magnitude. Moreover, both DSG [8] and SRDD [13] required much more computational resources than the other algorithms due to their deep neural network and semi-global cost aggregation scheme, respectively, as pointed out in Reference [13]. Based on Table 3, we may conclude that the proposed scheme offers the best overall performance when both matching accuracy and computational efficiency are taken into consideration.

**Figure 4.** Comparisons of disparity maps obtained by different algorithms. All the results are obtained without the refinement procedure except SRDD (The sparse representation over discriminative dictionary scheme) [13]. (**<sup>a</sup>**–**g**) are disparity maps generated by: (**a**) FCVF (The fast cost volume filtering scheme) [19], (**b**) CS-FCVF (A combination of the cross-scale cost aggregation scheme [29] and FCVF), (**c**) PGIF (The pervasive guided image filter scheme) [25], (**d**) CS-PGIF (A combination of [29] and PGIF), (**e**) DSG (The deep self-guided cost aggregation scheme) [8], (**f**) SRDD (The sparse representation over discriminative dictionary scheme) [13], and (**g**) the proposed scheme. Among them, the images of (**e**) DSG and (**f**) SRDD are by courtesy of the Middlebury (version 3) benchmark stereo database via the URL: vision.middlebury.edu/stereo/ [31].


**Table 1.** Comparison of the weighted average error rates in the non-occluded (non-occ) region between seven algorithms (%). All of the results are obtained without the refinement procedure except SRDD [13]. The lowest error records are marked in bold.

**Table 2.** Comparison of the weighted average error rates in the all-region between seven algorithms (%). All of the results are obtained without the refinement procedure except SRDD [13]. The lowest error records are marked in bold.


**Table 3.** Comparison of the computation time for four selected image sets between five algorithms (s).

