The main objective of this section is to exploit traditional matching methods, with artificially selected simple features, to obtain high-confidence matching points in local regions. These matching points then serve as labeled training samples for the subsequent matching networks. Generally speaking, sparse matching methods based on feature points and lines can obtain stable and accurate matching points, but the number of points they yield is too small to meet the training requirements of subsequent network models. In contrast, dense matching methods can obtain far more matching points. The random walk algorithm proposed by Ham et al. [14] first converts the matching problem into a probability model: the matching cost is regarded as the probability of a match between points, which provides a reference for screening the accuracy of pre-matching points. However, the inherent smoothing assumption of this method makes matching unsatisfactory in areas of occlusion or parallax abrupt change. We therefore observe that the edges of optical images of urban areas are semantically linked to the edges of the disparity map. Texture-similar regions of an optical image exhibit fixed or linearly varying disparity values on the disparity map, without abrupt changes, whereas regions with abrupt texture changes are prone to abrupt disparity changes. In other words, there is a constraint relationship between the optical image and the disparity map; since the disparity map is a representation of the stereo matching relationship, optical image edge information can provide valuable clues for recovering the matching relationship in parallax abrupt change regions.
Based on the above ideas, this paper proposes a pre-matching method based on superpixel random walk; the outline of the pre-matching is given in Figure 2. Using the constraint relationship between the optical image and the disparity image, noisy results in weak-texture regions and mismatches in occlusion or parallax abrupt change regions are removed by superpixel segmentation and two constraint criteria. Specifically, we first construct a point matching cost using selected simple features. Then, we aggregate the point costs into block matching costs based on the superpixel segmentation results. Finally, the matching cost is updated and optimized according to the two constraint criteria of parallax consistency and mutability, so as to obtain a set of stable matching points. The method thus consists of three steps: constructing the point matching cost, constructing the block matching cost, and optimizing and updating the cost.
3.1.1. Point Matching Cost
One of the most basic ideas of stereo matching is to describe the matching correlation between two images by constructing a matching cost function, so that the two points with the greatest correlation can be selected as matching points. In this stage, we first construct the initial matching cost of each pixel using features common in existing matching methods. Gradient features, the census transform, the rank transform, and mutual information are commonly used in stereo matching and offer high accuracy and stability. Unlike the rank transform and mutual information, which require initial disparity values and hierarchical iterations, gradient features and the census transform are computationally efficient and sensitive to building edges. Considering the computational needs of the subsequent block matching and optimization algorithms, and the fact that the urban scenes of interest contain a large number of buildings, the initial matching cost of the pre-matching method is therefore constructed from gradient features and the census transform.
The census transform [27] converts the pixels of the left and right images into binary vectors by comparing each pixel with the surrounding pixels within a finite support region, as shown in Equation (1):

$$C(i,j)=\bigotimes_{(i',j')\in w} H\big(I(i,j),\,I(i',j')\big) \quad (1)$$

where $I(i,j)$ and $I(i',j')$ denote the intensity values of the target pixel and the pixels around the target, respectively, $\bigotimes$ denotes the cascade (bit concatenation), $w$ is the window around $(i,j)$, and $H$ is the binary function that returns 0 or 1. We use a 5 × 5 window to encode a binary vector for each pixel in the census transform. The binary vectors are encoded by comparing the intensity values of the center and its surrounding pixels, as in Equation (2):

$$H\big(I(i,j),\,I(i',j')\big)=\begin{cases}1, & I(i,j)>I(i',j')\\ 0, & \text{otherwise}\end{cases} \quad (2)$$

where $H$ is the binary function of $I(i,j)$ and $I(i',j')$. The binary vector is assigned to each pixel in the left and right images. The matching cost is calculated using the Hamming distance [28] of the two binary vectors, as shown in Equation (3):

$$C_{cen}(i,j,d)=\mathrm{Hamming}\big(C_{l}(i,j),\,C_{r}(i,j-d)\big) \quad (3)$$

where $C_{cen}(i,j,d)$ is the matching cost based on the Hamming distance at disparity $d$, and the subscripts $l$ and $r$ denote the left image and right image, respectively. Since the census transform encodes the image structure based on the relative ordering of pixel intensities, it has better robustness to illumination variations and image noise. However, due to this property, matching blur may result in weakly textured areas with the same or similar textures. To solve these problems, we include gradient features in the calculation of the initial matching cost.
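As a concrete illustration of Equations (1)–(3), the following is a minimal NumPy sketch (not the paper's implementation; the edge padding at borders and the left-referenced cost volume layout are our assumptions):

```python
import numpy as np

def census_transform(img, window=5):
    """Census transform (Equations (1)-(2)): encode each pixel as a binary
    vector by comparing it with its neighbors inside a window."""
    h, w = img.shape
    r = window // 2
    padded = np.pad(img, r, mode="edge")  # border handling is an assumption
    bits = []
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            if di == 0 and dj == 0:
                continue
            neighbor = padded[r + di:r + di + h, r + dj:r + dj + w]
            bits.append((img > neighbor).astype(np.uint8))  # H(., .) of Eq. (2)
    return np.stack(bits, axis=-1)  # shape (h, w, window*window - 1)

def census_cost(left, right, d_max, window=5):
    """Hamming-distance cost (Equation (3)) over disparities 0..d_max-1,
    matching left pixel (i, j) with right pixel (i, j - d)."""
    cl = census_transform(left, window)
    cr = census_transform(right, window)
    h, w, n_bits = cl.shape
    cost = np.full((h, w, d_max), float(n_bits))  # worst cost where invalid
    for d in range(d_max):
        # left columns d..w-1 align with right columns 0..w-1-d
        cost[:, d:, d] = np.sum(cl[:, d:] != cr[:, :w - d], axis=-1)
    return cost
```

Positions whose match would fall outside the right image keep the worst-case cost, so they are never preferred during minimization.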
The matching cost based on image gradient features is defined as in Equation (4):

$$C_{grad}(i,j,d)=\big|\nabla_{x}I_{l}(i,j)-\nabla_{x}I_{r}(i,j-d)\big|+\big|\nabla_{y}I_{l}(i,j)-\nabla_{y}I_{r}(i,j-d)\big| \quad (4)$$

where $C_{grad}(i,j,d)$ is the matching cost based on the gradient features at disparity $d$, and $\nabla_{x}I$ and $\nabla_{y}I$ denote the horizontal and vertical gradient images, respectively. The gradient images are calculated with a 5 × 5 Sobel filter.
The census transform and gradient features are combined by weight to construct the point matching cost, as shown in Equation (5):

$$C(i,j,d)=\alpha\cdot\min\big(C_{cen}(i,j,d),\,\tau_{cen}\big)+\beta\cdot\min\big(C_{grad}(i,j,d),\,\tau_{grad}\big) \quad (5)$$

where $\alpha$ and $\beta$ are the weight parameters that balance the census term and the gradient term, respectively, and $\tau_{cen}$ and $\tau_{grad}$ are truncation values used to limit the influence of outliers. $C(i,j,d)$ is the matching cost of each pixel compared to each point on the epipolar line of the other image.
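Equations (4) and (5) can be sketched as follows; `np.gradient` stands in for the paper's 5 × 5 Sobel filter to keep the example short, and the weight and truncation values are illustrative, not the paper's settings:

```python
import numpy as np

def gradient_cost(left, right, d_max):
    """Gradient-feature cost (Equation (4)): absolute differences of the
    horizontal and vertical gradients of the two images."""
    gl_y, gl_x = np.gradient(left)   # vertical, horizontal gradients
    gr_y, gr_x = np.gradient(right)
    h, w = left.shape
    cost = np.zeros((h, w, d_max))
    for d in range(d_max):
        cost[:, d:, d] = (np.abs(gl_x[:, d:] - gr_x[:, :w - d])
                          + np.abs(gl_y[:, d:] - gr_y[:, :w - d]))
    return cost

def point_cost(c_cen, c_grad, alpha=0.3, beta=0.7, t_cen=30.0, t_grad=10.0):
    """Combined point matching cost (Equation (5)): weighted sum of the
    truncated census and gradient terms."""
    return (alpha * np.minimum(c_cen, t_cen)
            + beta * np.minimum(c_grad, t_grad))
```

The truncation keeps a single outlier pixel from dominating the later block aggregation.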
3.1.2. Block Matching Cost
Urban scenes contain a large number of artificial buildings. The most obvious characteristics of such buildings are the similar texture of the building top surface and the tendency toward texture change at the junctions between buildings and non-buildings. The disparity map exhibits analogous behavior. We use this property to aggregate the point matching cost into a block matching cost, so that a smooth parallax constraint holds within each block while parallax abrupt changes are more likely between blocks, as shown in Figure 3.
A superpixel is an image block consisting of neighboring pixels with similar texture, color, and illumination characteristics. Pixels within one superpixel tend to share geometric features and have similar parallax. Thus, segmenting the optical image into superpixels produces results similar to segmenting the parallax map. For this reason, we use each superpixel block segmented from the optical image as a guide for aggregating the point matching cost. In this paper, we use simple linear iterative clustering (SLIC) [29] to perform superpixel segmentation on the left and right images.
The block matching cost is given by Equation (6):

$$C_{s}(d)=\frac{1}{N_{s}}\sum_{(i,j)\in s} C(i,j,d) \quad (6)$$

where $s$ is a superpixel block, $C_{s}(d)$ is the cost function of the superpixel $s$ when the disparity is $d$, $N_{s}$ is the number of points in the superpixel $s$, and $C(i,j,d)$ represents the point matching cost when the disparity is $d$ at $(i,j)$ in the superpixel $s$. The left image matching cost $C_{s}^{l}(d)$ and the right image matching cost $C_{s}^{r}(d)$ are calculated separately.
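The aggregation of Equation (6) can be sketched with NumPy's `bincount`; `labels` is any integer label map, such as the output of a SLIC implementation (e.g. `skimage.segmentation.slic`):

```python
import numpy as np

def block_cost(point_cost_vol, labels):
    """Block matching cost (Equation (6)): average the point matching costs
    over each superpixel, per disparity."""
    h, w, d_max = point_cost_vol.shape
    flat = labels.reshape(-1)
    k = int(flat.max()) + 1
    counts = np.bincount(flat, minlength=k).astype(float)  # N_s per block
    cost = np.zeros((k, d_max))
    for d in range(d_max):
        sums = np.bincount(flat, weights=point_cost_vol[:, :, d].reshape(-1),
                           minlength=k)
        cost[:, d] = sums / counts
    return cost  # one cost curve per superpixel
```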
Although we construct the matching cost function for local blocks, the segmentation results of the superpixels largely affect the matching results. Specifically, there are two problems: the larger the superpixel block, the more likely under-segmentation becomes, where regions with different disparities fall into one superpixel block; the smaller the superpixel block, the more likely over-segmentation becomes, where regions with the same disparity are split into different superpixel blocks, as shown in Figure 4.
3.1.3. Optimization and Updating
In the actual segmentation process, it is difficult to guarantee accurate segmentation results every time, which introduces a certain error into the block matching cost. For this reason, in this paper, the matching cost of each superpixel block is updated iteratively while considering the influence of surrounding blocks, provided that the superpixel blocks are small enough, so as to reach a stable block matching cost. This idea of iteratively updating the matching cost is similar to the random walk algorithm; both are designed to obtain a smooth, stable probability distribution or matching cost. To this end, this paper improves the random walk algorithm to update the block matching cost and to eliminate the interference caused by over-segmentation through smoothness, consistency, and mutability constraints. The final result of block matching cost aggregation is then similar to that of exact superpixel segmentation.
Random walk was first proposed for image segmentation [30]. It starts from a node in the graph and faces two choices at each step: randomly moving to an adjacent node, or returning to the starting node. The algorithm contains a parameter $c$ for the restart probability and $1-c$ for the probability of moving to an adjacent node. After iterating to stability, the resulting probability distribution can be considered the distribution influenced by the start node. We apply the random walk to the block matching cost update, and the update function is defined as Equation (7):

$$P^{t+1}(d)=(1-c)\,\tilde{W}P^{t}(d)+c\,P^{0}(d) \quad (7)$$

where $P^{0}(d)\in\mathbb{R}^{k}$ denotes the initial matching cost when the disparity value is $d$, $P^{t}(d)$ denotes the updated matching cost, $t$ is the number of iterations, and $k$ is the number of superpixels. The weighting matrix $W\in\mathbb{R}^{k\times k}$ contains the edge weights of all superpixels, and $\tilde{W}$ is obtained by normalizing the rows of $W$. Edge weights describe the probability that the matching cost of a superpixel block is passed to its neighboring blocks. We assume that neighboring superpixel blocks on an optical image tend to have similar disparity values on the disparity map when their color distance is small. Therefore, neighboring superpixels with similar intensities have more influence on each other. The edge weight $w_{uv}$ of the $u$-th and $v$-th superpixel blocks is calculated by Equation (8):

$$w_{uv}=\mu\exp\!\left(-\frac{(I_{u}-I_{v})^{2}}{\sigma^{2}}\right) \quad (8)$$

where $I_{u}$ and $I_{v}$ are the intensities of the $u$-th and $v$-th superpixel blocks, respectively, and $\mu$ and $\sigma$ are parameters that control the shape of the function.
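A minimal sketch of the edge weights and the random-walk-with-restart update of Equations (7) and (8); the Gaussian shape parameter value is an assumption, and any global scale on the weights cancels in the row normalization:

```python
import numpy as np

def edge_weights(intensities, adjacency, sigma=10.0):
    """Edge weights (Equation (8)): Gaussian of the mean-intensity difference
    between superpixels, restricted to adjacent pairs."""
    diff = intensities[:, None] - intensities[None, :]
    return np.exp(-(diff ** 2) / sigma ** 2) * adjacency

def random_walk_update(cost0, w, c=0.2, iters=100):
    """Random-walk-with-restart update (Equation (7)):
    P^{t+1} = (1 - c) * W_norm @ P^t + c * P^0."""
    w_norm = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    cost = cost0.copy()  # cost0: (k, d_max) initial block costs
    for _ in range(iters):
        cost = (1.0 - c) * (w_norm @ cost) + c * cost0
    return cost
```

For two mutually adjacent blocks with equal intensities and initial costs 1 and 0, the iteration converges to the fixed point $c\,(I-(1-c)\tilde{W})^{-1}P^{0}$, here $5/9$ and $4/9$.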
The matching cost gradually converges as the number of iterations $t$ increases. The above method provides a local minimum, but the limitations of the smoothness constraint mean that it does not provide a good solution in regions of occlusion or parallax abrupt change. Therefore, we add parallax consistency and mutability constraints to correct and optimize the matching cost in these regions.
The occluded pixels involved in this paper are the pixels that appear in only one view and are not visible in the other view. To eliminate the effect of occluded pixels on the matching cost update, we use parallax consistency to detect occluded pixel blocks and set them to zero during the matching cost update. Parallax consistency means that the matching relationships obtained in the two views should correspond to each other; occluded pixels do not satisfy this consistency. We therefore propose the following consistency constraint function, as shown in Equation (9):

$$O_{s}=\begin{cases}1, & \big|D_{l}(x_{s},y_{s})-D_{r}\big(x_{s}-D_{l}(x_{s},y_{s}),\,y_{s}\big)\big|\le 1\\ 0, & \text{otherwise}\end{cases} \quad (9)$$

where $D_{l}$ and $D_{r}$ are the current parallax maps of the left image and right image, respectively, and $x_{s}$ and $y_{s}$ are the $x$ and $y$ centroids of superpixel $s$. Superpixel blocks with inconsistent disparities in the left and right disparity maps are classified as occluded superpixels and set to 0, while the other blocks are set to 1 as non-occluded superpixels. The occlusion mask $O$ is obtained by concatenating each $O_{s}$. Finally, the matching cost is multiplied by the occlusion mask to obtain the consistent matching cost after the parallax consistency constraint, as shown in Equation (10):

$$P_{con}^{t}(d)=O\circ P^{t}(d) \quad (10)$$

where $\circ$ denotes the element-wise product.
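The consistency test of Equation (9) can be sketched as follows; the paper applies it per superpixel at block centroids, whereas this sketch shows the simpler pixel-level form of the same left-right check:

```python
import numpy as np

def consistency_mask(disp_left, disp_right, tol=1.0):
    """Left-right consistency check (the idea behind Equation (9)): a pixel
    is non-occluded (mask 1) when the right-image disparity at the matched
    position agrees with the left-image disparity within `tol`."""
    h, w = disp_left.shape
    mask = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            jr = int(round(j - disp_left[i, j]))  # matched column on the right
            if 0 <= jr < w and abs(disp_left[i, j] - disp_right[i, jr]) <= tol:
                mask[i, j] = 1.0
    return mask
```

The consistent cost of Equation (10) is then obtained by multiplying the current cost by this mask, which silences occluded entries in subsequent updates.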
The random walk algorithm assumes that adjacent blocks have more influence on each other, which manifests in the parallax map as a smoothness constraint and is prone to errors in parallax abrupt change regions. For example, at the eaves of a building the disparity value varies greatly, but the disparity boundary becomes blurred by the smoothness constraint. To prevent such problems, we add a mutability constraint. First, we calculate the temporary disparity value of a superpixel from the current matching cost, as shown in Equation (11):

$$\tilde{d}_{u}=\frac{\sum_{v} w_{uv}\,O_{v}\,d_{v}}{\sum_{v} w_{uv}\,O_{v}} \quad (11)$$

where $w_{uv}$ is the edge weight, $O_{v}$ is the consistency constraint, $d_{v}$ is the current disparity of the neighboring superpixel, and $\tilde{d}_{u}$ is the temporary disparity value of the $u$-th superpixel. The mutability matching cost is calculated using the temporary disparity values of all superpixel blocks, as in Equation (12):

$$P_{mut,u}(d)=\min\big(\eta\,|d-\tilde{d}_{u}|,\,\tau\big) \quad (12)$$

where $\tilde{d}_{u}$ is the parallax calculated in Equation (11), $\eta$ is a scalar parameter, and $\tau$ denotes the truncation parameter; together they control the parallax mutability.
The mutability constraint preserves disparity boundaries by maintaining the intensity difference between adjacent superpixels, avoiding blurring small objects into the background and thus preserving more detailed information.
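Equations (11) and (12) can be sketched per superpixel as follows; the parameter names `eta` and `tau` and their values are assumptions for illustration:

```python
import numpy as np

def mutability_cost(w, occ_mask, disp, d_max, eta=1.0, tau=5.0):
    """Temporary disparity (Equation (11)) as the weighted mean of the
    non-occluded neighbors' disparities, then the truncated linear penalty
    of Equation (12) around it."""
    weights = w * occ_mask[None, :]              # drop occluded neighbors
    denom = np.maximum(weights.sum(axis=1), 1e-12)
    d_tmp = (weights @ disp) / denom             # Eq. (11)
    d_range = np.arange(d_max)
    return np.minimum(eta * np.abs(d_range[None, :] - d_tmp[:, None]), tau)
```

The truncation at `tau` is what lets a block keep a disparity far from its neighbors' consensus when the data strongly supports it, preserving disparity boundaries.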
Combining the parallax consistency and mutability matching costs, we construct the following iterative update function for the block matching cost, as shown in Equation (13):

$$P^{t+1}(d)=(1-c)\,\tilde{W}\big[\lambda P_{con}^{t}(d)+(1-\lambda)P_{mut}^{t}(d)\big]+c\,P^{0}(d) \quad (13)$$

where $P_{mut}^{t}(d)$ is the mutability matching cost calculated in Equation (12), $P_{con}^{t}(d)$ is the consistency matching cost calculated according to Equation (10), $\lambda$ is used to balance them, and $c$ is the restart probability. The consistency and mutability matching costs are determined from the current matching cost $P^{t}(d)$. The matching cost propagates along the graph $\tilde{W}$, and the initial matching cost is aggregated into the current matching cost in proportion to the restart probability $c$. The combination of the superpixel matching cost and the initial point matching cost constitutes the final matching cost, and the parallax value is determined by minimizing it, as in Equation (14):

$$d(i,j)=\arg\min_{d}\big[\gamma\,P_{s}(d)+(1-\gamma)\,C(i,j,d)\big] \quad (14)$$

where $s$ is the superpixel corresponding to pixel $(i,j)$, $\gamma$ denotes the weight between the superpixel and the point matching costs, and $\arg\min$ means finding the disparity value that minimizes the matching cost $P$.
Since the matching cost $P$ describes the degree of matching between two image point pairs, we can filter all candidate matching points by setting a threshold, retaining only the pairs with higher matching confidence, which we call pre-matched pairs.
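The final selection of Equation (14), together with the confidence thresholding described above, can be sketched as:

```python
import numpy as np

def select_disparity(block_cost, point_cost_vol, labels, gamma=0.5):
    """Final disparity (Equation (14)): per pixel, minimize the weighted sum
    of its superpixel's block cost and its own point cost. Returns the
    disparity map and the minimum combined cost (the matching confidence)."""
    # block_cost: (k, d_max); labels: (h, w) ints; point_cost_vol: (h, w, d_max)
    combined = gamma * block_cost[labels] + (1.0 - gamma) * point_cost_vol
    return np.argmin(combined, axis=-1), np.min(combined, axis=-1)
```

Pre-matched pairs are then the pixels whose minimum combined cost falls below a chosen threshold; the threshold value itself is a tuning parameter not specified here.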