**1. Introduction**

Stereo vision aims to provide rich distance information about captured scenes from image pairs. This is normally accomplished with matching algorithms that generate dense disparity maps. By the principle of triangulation, these maps can be transformed into three-dimensional information about the scene, with many potential applications such as autonomous navigation, 3D reconstruction, and vision-based object handling.
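As a brief illustration, the disparity-to-depth conversion by triangulation for a rectified stereo pair can be sketched as follows; the focal length and baseline values here are hypothetical example parameters, not taken from this work:

```python
def disparity_to_depth(d, f=700.0, B=0.12):
    """Triangulation for a rectified stereo rig: depth Z = f * B / d,
    where f is the focal length in pixels, B the baseline in meters,
    and d the disparity in pixels (all values here are illustrative)."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return f * B / d

# A pixel with disparity 35 px maps to 700 * 0.12 / 35 = 2.4 m;
# larger disparities correspond to closer scene points.
```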

Although the stereo matching problem has been under extensive research for decades, it is still difficult to obtain accurate matches under ill-posed conditions such as texture-less regions, repeated patterns, occluded areas, and reflective surfaces. Current stereo matching algorithms can be divided into two main categories: conventional matching algorithms [1] and deep-learning-based approaches.

Deep-learning-based stereo matching algorithms regard the derivation of the disparity map as a classification or regression problem. For instance, Zbontar [2] used convolutional neural networks (CNNs) to estimate the similarity of image patches and used the measures as the matching cost in a traditional stereo matching algorithm. Similarly, Nahar [3] proposed unsupervised pre-trained networks to estimate hierarchical features and combined them with a pixel-based intensity matching cost in a global energy minimization framework for dense disparity estimation. By combining a disparity estimation network with a CNN trained on a synthetically generated dataset, Mayer [4] demonstrated the effectiveness of deep learning in stereo matching. Pang [5] proposed a cascaded CNN architecture composed of two stages: the first stage advances the work of Reference [4] by equipping it with extra up-convolution modules, while the second stage generates residual signals across multiple scales. The summation of the outputs from the two stages gives the final disparity. Kendall [6] used deep unary features to compute a stereo matching cost volume, from which disparity values are regressed by aggregation with 3D convolutions.

Another way to implement deep-learning-based stereo matching is to use networks to exploit context information. For example, Chang [7] developed a spatial pyramid pooling module to aggregate context at different scales into a cost volume. The cost volume is regularized by a stacked network to further improve the utilization of global context information. In addition, Williem [8] applied deep learning to cost volume aggregation based on self-guided filtering.

Deep-learning-based methods are promising because they can apply high-level object detection as a guideline for within-object matching. However, most current schemes use supervised learning, which assumes that the true disparity is known in advance. This assumption is impractical for many applications [9]. Moreover, these approaches may fail in unknown environments and cannot easily be transplanted to robotic and embedded systems [10].

Conventional stereo matching approaches are classified as global or local according to the construction of an objective function that rates the degree of match between an image pair [1]. The objective function of the global methods consists of a data term (the measurement part) and a regularization term (the penalty part). The data term designates the similarity between aggregated matching costs of pixels in the images, and the regularization term provides constraints from neighboring pixels. Belief propagation [11] and dynamic programming [12,13] are the major global methods. However, global approaches demand substantial computing resources and are generally not suitable for real-time applications.
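Under common notation (a sketch with assumed symbols: $C(p, d_p)$ for the data cost of assigning disparity $d_p$ to pixel $p$, $V$ for the smoothness penalty over neighboring pairs $\mathcal{N}$, and $\lambda$ for a weighting factor), a global objective function of the kind described above takes the form:

```latex
E(D) = \underbrace{\sum_{p} C(p, d_p)}_{\text{data term}}
     + \underbrace{\lambda \sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)}_{\text{regularization term}}
```

Belief propagation and dynamic programming are two strategies for (approximately) minimizing such an energy over all disparity assignments $D$.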

In contrast to the global approaches, the objective function of the local methods contains only the measurement part. The local methods generally perform stereo matching in four stages [1]: (1) calculation of the preliminary matching cost, (2) aggregation of the cost over support windows, (3) estimation of the disparity, and (4) refinement of the disparity. Among them, the cost aggregation step is usually cast as an image filtering procedure applied to the matching cost, and the disparity maps are obtained by the winner-takes-all method [1]. The local methods require less computation and are popular for fast disparity calculations.
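The four-stage local pipeline can be sketched in a minimal toy form as follows, using an absolute-difference cost, box-filter aggregation, and winner-takes-all selection; the refinement stage is omitted and all parameter values are assumed for illustration:

```python
import numpy as np

def local_stereo(left, right, max_disp, radius=1):
    """Toy sketch of the local pipeline: (1) absolute-difference matching
    cost, (2) box-filter cost aggregation, (3) winner-takes-all disparity
    selection; stage (4), refinement, is omitted here."""
    H, W = left.shape
    cost = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        # (1) preliminary cost: absolute intensity difference at shift d
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :W - d])
    # (2) aggregation: mean over a (2*radius+1)^2 support window
    k = 2 * radius + 1
    agg = np.empty_like(cost)
    for d in range(max_disp + 1):
        c = np.pad(cost[d], radius, mode='edge')
        agg[d] = sum(c[i:i + H, j:j + W]
                     for i in range(k) for j in range(k)) / k**2
    # (3) winner-takes-all: pick the disparity with minimal aggregated cost
    return np.argmin(agg, axis=0)
```

On a synthetic pair where the left image is the right image shifted by a constant number of pixels, interior pixels recover that shift as the disparity.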

Cost aggregation is crucial for matching performance in local algorithms. Bilateral filtering [14] is among the early approaches, but its computational complexity grows with the support-window size. Later, tree filtering [15], domain transformation [16], the recursive edge-aware filter [17], and full-image guided filtering [18] were proposed to decouple computational complexity from the support-window size. However, all of these approaches suffer from the weight-decay problem when there is a significant intensity difference between neighboring pixels. This behavior deteriorates information propagation and impairs the resulting matching performance.
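The weight-decay problem can be illustrated with a bilateral-style weight, the product of a spatial Gaussian and an intensity (range) Gaussian; the sigma values below are hypothetical:

```python
import math

def edge_aware_weight(spatial_dist, intensity_diff, sigma_s=5.0, sigma_r=10.0):
    """Bilateral-style aggregation weight: decays with both spatial
    distance and intensity difference (sigmas are illustrative)."""
    return (math.exp(-spatial_dist**2 / (2 * sigma_s**2)) *
            math.exp(-intensity_diff**2 / (2 * sigma_r**2)))

# Across a strong edge (e.g., an intensity difference of 50), the weight
# collapses toward zero, so the cost of pixels beyond the edge barely
# contributes to the aggregation -- the weight-decay problem.
```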

Hosni [19] suggested treating the generation of disparity as a labeling problem, implemented through the steps of constructing a cost volume, filtering the cost volume, and winner-takes-all label selection. Along this line, the guided image filter (GIF) [20] is well suited to cost volume filtering because it generates clear edge profiles free from gradient-reversal artifacts. Later, Li [21] introduced an edge-aware weighting, denoted weighted guided-image-filtering (WGIF), to improve GIF. Kou [22] proposed a gradient-domain guided-image-filter (GDGIF) to reduce halo artifacts by incorporating an explicit first-order edge-aware constraint. Nevertheless, because pixel information outside the fixed window is unavailable, the WGIF implementation of Hong [23] delivers restricted performance.
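For reference, the GIF of He et al. [20] fits a local linear model $q = aI + b$ between the guide $I$ and the filter input $p$ in each window. A minimal sketch might look like the following; it uses a naive box mean rather than the O(1) integral-image form of the original, and the window radius and regularization value are assumed:

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1)x(2r+1) window with edge padding (naive form)."""
    k = 2 * r + 1
    H, W = x.shape
    p = np.pad(x, r, mode='edge')
    return sum(p[i:i + H, j:j + W]
               for i in range(k) for j in range(k)) / k**2

def guided_filter(I, p, r=2, eps=1e-3):
    """Sketch of the guided image filter: the output is a local linear
    transform of the guide I, so edges of I are preserved in the result."""
    mI, mp = box_mean(I, r), box_mean(p, r)
    cov_Ip = box_mean(I * p, r) - mI * mp
    var_I = box_mean(I * I, r) - mI * mI
    a = cov_Ip / (var_I + eps)   # local slope; eps regularizes flat regions
    b = mp - a * mI              # local offset
    # average the per-window coefficients before applying them
    return box_mean(a, r) * I + box_mean(b, r)
```

In cost-volume filtering, each disparity slice of the cost volume plays the role of `p`, with a reference image as the guide `I`.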

To remove the fixed-window limitation, approaches with adaptive guided filters were proposed [24]. However, information outside the support windows is still missing. In a recent paper [25], we introduced weights that take both distance and intensity differences into account to extend the GIF scheme. We called our approach pervasive guided-image-filtering, denoted PGIF [25], which exploits the whole image for aggregation.

In addition, recent years have seen the development of coarse-to-fine (CTF) strategies to enhance stereo matching accuracy. For instance, Hu [26] proposed to reduce the search space of local stereo matching by introducing a candidate set of neighboring pixels. Tang [27] introduced a multi-scale pixel feature vector to provide effective matching under radiometric differences. Li [28] provided advanced techniques to improve the disparity range through recursive multi-scale decomposition. These methods assume the existence of disparity consistency. In contrast to these approaches, the matching cost integration method of Zhang [29] achieved superior matching by formulating cost aggregation to enforce consistency of the cost volume among neighboring scales.
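Schematically, and with assumed notation ($\hat{C}^{s}$ the aggregated cost at scale $s$, $S$ the coarsest scale, and $\lambda$ a regularization strength), the cross-scale consistency idea of Zhang [29] can be viewed as selecting cost volumes $\{z^{s}\}$ that minimize a fidelity term plus an inter-scale penalty:

```latex
\sum_{s=0}^{S} \left\| z^{s} - \hat{C}^{s} \right\|^{2}
+ \lambda \sum_{s=1}^{S} \left\| z^{s} - z^{s-1} \right\|^{2}
```

The second sum couples neighboring scales, which is the consistency that our approach transfers from the cost volume to the GIF parameters.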

Inspired by the multiscale scheme of Zhang [29], we extend pervasive guided-image-filtering, PGIF [25], to exploit cross-scale features in the cost volume. In our approach, consistency along the neighboring-scale direction is imposed on the GIF parameters rather than on the cost volume as in Zhang [29]. We call the result hierarchical guided-image-filtering, denoted HGIF.

The main contributions of this paper can be summarized as follows:

