Article

An Underwater Stereo Matching Method: Exploiting Segment-Based Method Traits without Specific Segment Operations

1 Institute of Deep-Sea Science and Engineering, Chinese Academy of Sciences, Sanya 572000, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(9), 1599; https://doi.org/10.3390/jmse12091599
Submission received: 10 August 2024 / Revised: 2 September 2024 / Accepted: 2 September 2024 / Published: 10 September 2024
(This article belongs to the Special Issue Underwater Observation Technology in Marine Environment)

Abstract

Stereo matching technology, which enables the acquisition of three-dimensional data, holds profound implications for marine engineering. In underwater images, irregular object surfaces and the absence of texture information make it difficult for stereo matching algorithms that rely on discrete disparity values to capture the 3D details of underwater targets accurately. This paper proposes a stereo method based on an energy function of a Markov random field (MRF) with 3D labels to fit the inclined surfaces of underwater objects. By integrating a cross-based patch alignment approach with two label optimization stages, the proposed method exhibits traits akin to segment-based stereo matching methods, enabling it to handle images with sparse textures effectively. In experiments on both the simulated UW-Middlebury dataset and real deteriorated underwater images, our method demonstrates superiority over classical and state-of-the-art methods, as shown by analysis of the acquired disparity maps and the three-dimensional reconstructions of the underwater targets.

1. Introduction

The ability of stereo vision to capture the three-dimensional shapes of underwater structures presents a multitude of potential applications in marine engineering [1,2,3], including seabed mapping [4], underwater robot navigation [5], marine biology studies [6], the preservation of cultural heritage [7], and the maneuvering of marine vehicles [8].
A stereo vision system, consisting of two horizontally displaced cameras, concurrently captures images of a scene from slightly offset perspectives [9]. Stereo matching methods encode the matching point pairs obtained from the two views in the form of a disparity map.
Securing sufficient ground-truth disparity maps for training deep learning methods on underwater imagery is challenging [10,11]. In contrast, non-deep-learning stereo methods are not subject to this difficulty. All stereo methods, however, face challenges from poor underwater image quality when utilized in marine engineering applications [3].
The scattering caused by a multitude of floating particles [12] results in a subtle displacement of object positions in the image, and artificial illumination introduces an imbalance in lighting [13]. These phenomena increase the loss of fine detail and the color discrepancy between the two views [14], compromising image quality. Stereo methods relying on one-dimensional discrete disparity values are suited to cases where adjacent pixels share identical disparity values; however, they struggle to describe irregular underwater objects with curved or inclined surfaces effectively. In addition, missing and repetitive textures [2] diminish the image information accessible to stereo matching. The accuracy of stereo methods is thus compromised by these challenges [15].
The aim of this research is to propose a stereo method designed specifically for marine engineering applications, addressing the common issues arising from deteriorated images, irregular underwater objects, and insufficient texture information. The proposed method adopts a coarse-to-fine matching strategy and is an evolution of segment-based approaches. The main contributions are listed as follows:
  • To address the irregular surfaces present in typical underwater images, we introduce an energy function grounded in 3D labels.
  • Our strategies for harnessing the segment information within cross-based patches diverge for the processes of expansion and propagation.
  • To fortify the robustness of the propagation process, we improved the construction scheme of the cross-based patches.
  • Our algorithm can be applied to parallel computation, be it in the expansion or the propagation process.

2. Related Work

The cost function measures the similarity between a pair of matching points in the left and right images [16]. When only the cost of individual point pairs is examined, the disparity map exhibits many noisy areas. Researchers therefore aggregate the costs of neighboring pixels around the point being matched. The area of cost aggregation is referred to as a support window. All the pixel information contained within the window is treated as information about the central pixel, thereby enhancing its distinctiveness.
Local methods are based on two types of basic units: fixed square grids [17] and adaptive regions [18]. The disparity of each pixel is obtained in isolation, under the assumption that all pixels within a unit share identical disparities.
Assigning a uniform disparity value to all pixels within a local support window imposes a fronto-parallel bias, since a constant disparity cannot depict a slanted plane. Given that a constant value fails to describe a plane, PatchMatch [19] opts to express it mathematically instead. Researchers have developed a refined representation of a geometric plane, providing a description of a disparity plane, as shown in Equation (1).
$d_p = a_p p_x + b_p p_y + c_p$ (1)
In Equation (1), the parameters $(a_p, b_p, c_p)$ are assigned as a 3D label to the pixel $p = (p_x, p_y)$. The theoretical basis of the parameters is derived from the normal vector of a surface plane. Each pixel thus has a huge continuous label space $\mathbb{R}^3$ composed of the label parameters. PatchMatch updates the 3D label through a label propagation process, in which a pixel can generate and propagate a label to its four neighbors. This label is referred to as a candidate label or proposal. Each pixel label is updated over several iterations in raster-scan order.
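As a minimal illustration (in C++, the implementation language used later in this paper), a 3D label can be represented as a small struct; the name PlaneLabel and its interface are ours, not from any published code:

```cpp
// A 3D label (a, b, c) encoding the disparity plane of Equation (1):
// d = a*x + b*y + c. Illustrative sketch only.
struct PlaneLabel {
    double a, b, c;
    // Disparity assigned by this plane to pixel (x, y).
    double disparityAt(double x, double y) const {
        return a * x + b * y + c;
    }
};
```

With this representation, a candidate label generated at one pixel can be evaluated at any neighboring pixel, which is what makes label propagation possible.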
The local method’s susceptibility to noise interference arises from its exclusive consideration of similarity between matching points. Compared to local methods, global methods, based on MRF, are robust to noise [20,21]. They integrate the cost function with an embedded smoothness constraint [22]. The constraint serves to penalize disparities that are inconsistent among adjacent pixels to obtain a smooth disparity map. Furthermore, the integration of the cost function and the smoothness term is referred to as the energy function.
The energy function can be solved by optimizers such as graph cut (GC) [8] and belief propagation (BP) [23]. BP is a sequential optimizer, improving each variable while keeping the others fixed. GC, in contrast, improves all variables simultaneously by accounting for interactions across variables, which helps the optimization avoid local minima. The graph cut operation that estimates the MRF model for an entire image is named the $\alpha$-expansion move.
LocalExp, which combines the characteristics of local and global methods, can optimize labels concurrently within a predefined grid with a local expansion move [24]. This move can be regarded as a subset of the $\alpha$-expansion move. Existing algorithms based on LocalExp [3,21] have been applied in marine engineering.
Compared with other methods, segment-based methods demonstrate superior performance in regions with missing textures [25,26]. With the transition of segmentation techniques from traditional methods [27] to deep learning methods [28,29,30], segment-based stereo methods have emerged, such as SD-MVS [31]. These methods allocate the same label to all pixels in a segmented region [25,26,32]. For each pixel within a segmented region, the label search range in $\mathbb{R}^3$ is reduced, simplifying the process of obtaining an appropriate label. These methods have been applied on land but have not yet been adapted for underwater engineering applications.

3. Materials and Methods

3.1. Overview

As depicted in Figure 1, the workflow of the framework is presented. Our method can be divided into two label optimization stages: a coarse matching stage in Section 3.4.1 and a fine matching stage in Section 3.4.2. The method comprises three primary procedures: subdivision of local units, an expansion process, and a propagation process. Our approach divides the given RGB image areas into two primary components: cross-based patches in Section 3.3.2 and fixed grids in Section 3.3.3. These units, in conjunction with the two processes, constitute the two stages. In the coarse matching stage, our method performs several rounds of the expansion process on fixed grids. This is succeeded by the fine matching stage, wherein a single propagation process is carried out within cross-based patches. The expansion process can be further divided into three distinct optimization steps in Section 3.4.1. Both stages optimize the energy function in Section 3.2 by employing local expansion moves to update pixel labels. The expansion and propagation processes use the segment information within cross-based patches in different ways, limiting the acquisition scope of the proposals used in local expansion moves.

3.2. Energy Function

For the minimization of the energy function, the combination of graph cut with local expansion moves has an advantage [33]. By enabling a single min-cut to assign the same 3D label to multiple pixels within a local region [24], it facilitates the discovery of smooth solutions. Furthermore, the ability to simultaneously update multiple variables helps to avoid being trapped in unfavorable local minima.
Building upon previous studies utilizing energy functions derived from 3D labels, we use Equation (1) to introduce an over-parameterization of the disparity $d_p$ of pixel p. Consequently, we aim to determine a label $l_p = (a_p, b_p, c_p) \in \mathbb{R}^3$ for each pixel in an image $L$. To estimate these labels, we introduce an energy function designed specifically for underwater environments.
$E(l) = \sum_{p \in \Omega} \phi_p(l_p) + \lambda \sum_{(p,q) \in N} \psi_{pq}(l_p, l_q)$. (2)
With Equation (2), we can obtain an appropriate label $l_p$ by minimizing the energy of the local patch $\Omega$ ($\Omega \subseteq L$). Here, $N$ denotes the set of neighboring pixel pairs.

3.2.1. Data Term

The first term of Equation (2) can be called the data term or cost function. This term serves as the principal criterion for a pixel to find its matching pixel. Given a label $l_p$, the data term is defined in Equation (3).
$\phi_p(l_p) = \sum_{s \in \Omega_p} W_{ps}\, \rho_s(l_p)$. (3)
In Equation (3), $\Omega_p$ is the support window of p. The cost function is composed of the raw cost $\rho(\cdot)$ and the filter weight $W_{ps}$ [34]. The weight, defined in Equation (4), is calculated by the guided image-filtering algorithm to achieve edge awareness at low complexity.
$W_{ps}(I) = \frac{1}{|\omega|^2} \sum_{k:(p,s) \in \omega_k} \left( 1 + \frac{(I_p - \mu_k)(I_s - \mu_k)}{\sigma_k^2 + \epsilon} \right)$ (4)
Here, the sum runs over all filtering windows $\omega_k$ containing both p and s, each with $|\omega|$ pixels. The pixels p and s have two normalized color vectors, $I_p$ and $I_s$, and $\mu_k$ and $\sigma_k^2$ are the mean and variance of I within $\omega_k$. The parameter $\epsilon$ is a regularization penalty. This filtering can mitigate the noise in disparity maps.
The raw cost function $\rho_s(l_p)$ is inspired by [35] and integrates two components: the zero-mean normalized cross-correlation (ZNCC) value [36] and the Hamming distance of the census transformation (CT) [37]. ZNCC measures the similarity between two image windows by computing their correlation score. CT is a local descriptor that represents the spatial arrangement of pixel values within an image window.
Both the ZNCC and CT components are conventionally constructed using discrete disparity values. Our method introduces the 3D label into $\rho_s(l_p)$, enhancing performance in areas corresponding to irregular object surfaces. The cost $\rho_s(l_p)$ for a pixel $s(x, y)$ is delineated as follows:
$\rho_s(l_p) = \alpha \min(H(s|l_p), \tau_H) + (1 - \alpha) \min(Z(s|l_p), \tau_Z)$. (5)
In Equation (5), $Z(s|l_p)$ and $H(s|l_p)$ represent the ZNCC-based value and the Hamming distance, respectively, when the label of pixel s is set to $l_p$. The truncation parameters $\tau_H$ and $\tau_Z$ enhance the robustness of the method in occluded regions. To evaluate the similarity between two windows in the image pair, we set the window size for both the ZNCC and CT algorithms to $W = (2m+1)(2n+1)$.
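As a sketch, the truncation and blending in Equation (5) amount to a few lines; the default parameter values follow Section 4.1, and H and Z are assumed to be precomputed for the candidate label:

```cpp
#include <algorithm>

// Truncated blend of the census Hamming distance H and the ZNCC-based
// dissimilarity Z (Equation (5)); alpha = 0.5, tauH = 0.5, tauZ = 0.4
// follow Section 4.1. H and Z are assumed precomputed under label l_p.
double rawCost(double H, double Z,
               double alpha = 0.5, double tauH = 0.5, double tauZ = 0.4) {
    return alpha * std::min(H, tauH) + (1.0 - alpha) * std::min(Z, tauZ);
}
```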
By normalizing the pixel values within an image window to have zero mean and unit variance, ZNCC enhances robustness against variations in brightness and contrast. Normalization, which mitigates the influence of illumination changes and enhances the capture of structural and textural features in images [35,38], is defined as
$Z(s|l_p) = 1 - \frac{\mathrm{cov}^*_{x,y,d}(s, r)}{\mathrm{var}^*_{x-d,y}(r)}$. (6)
The disparity d is obtained by Equation (1), which establishes the coordinate of the corresponding pixel $r(x-d, y)$.
The various parts of the ZNCC can be expressed as follows:
$\mathrm{cov}^*_{x,y,d}(s, r) = \sum_{i=-m}^{m} \sum_{j=-n}^{n} s_{x+i,y+j} \cdot r_{x-d+i,y+j} - W\, \bar{s}_{x,y} \cdot \bar{r}_{x-d,y}$, (7)
$\mathrm{var}^*_{x-d,y}(r) = \sum_{i=-m}^{m} \sum_{j=-n}^{n} r_{x-d+i,y+j}^2 - W \cdot \bar{r}_{x-d,y}^2$. (8)
Here, $s_{x+i,y+j}$ represents the gray value of pixel $(i, j)$ within the image window, and $\bar{s}_{x,y}$ and $\bar{r}_{x-d,y}$ represent the gray averages of the two image windows, respectively.
The census transform [37] encodes the relative ordering of pixel values within an image window rather than the actual pixel values, providing resilience to variations in lighting and noise. The Hamming distance of the census transform, $H(s|l_p)$, is obtained by performing an exclusive-or (XOR) operation between two transformation windows, as in Equation (9).
$H(s|l_p) = \sum_{i=-m}^{m} \sum_{j=-n}^{n} \frac{\mathrm{CT}(s)_{ij} \oplus \mathrm{CT}(r)_{ij}}{(2m+1) \times (2n+1)}$. (9)
The variable $\mathrm{CT}(s)_{ij}$ is one bit of s's census code array, encoded as
$\mathrm{CT}(s)_{ij} = \begin{cases} 0, & G(s_{x,y}) \le G(s_{x+i,y+j}) \\ 1, & G(s_{x,y}) > G(s_{x+i,y+j}) \end{cases}$ (10)
Here, $G(s_{x,y})$ is the gray value of pixel s, where $i \in \{-m, \dots, m\}$ and $j \in \{-n, \dots, n\}$. The length of the census array is $(2m+1)(2n+1)$. The census transformation is crucial due to its capacity to convey relative information among adjacent pixels.
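A hedged sketch of the census code and its normalized Hamming distance follows. With m = 4 and n = 3 (Section 4.1), the 63-bit code fits in a single 64-bit integer, so the XOR of Equation (9) becomes one machine operation; border handling and the exact bit order are our own choices:

```cpp
#include <bitset>
#include <cstdint>
#include <opencv2/core.hpp>

// Census code of a (2m+1)x(2n+1) window (Equation (10)): each bit is 1 when
// the center pixel is strictly brighter than the neighbor.
uint64_t censusCode(const cv::Mat& gray, int x, int y, int m = 4, int n = 3) {
    uint64_t code = 0;
    uint8_t center = gray.at<uint8_t>(y, x);
    for (int i = -m; i <= m; ++i)
        for (int j = -n; j <= n; ++j) {
            code <<= 1;
            if (center > gray.at<uint8_t>(y + j, x + i)) code |= 1;
        }
    return code;
}

// Normalized Hamming distance of two census codes (Equation (9)):
// XOR, count set bits, divide by the number of bits in the window.
double censusHamming(uint64_t a, uint64_t b, int m = 4, int n = 3) {
    int bits = (2 * m + 1) * (2 * n + 1);
    return static_cast<double>(std::bitset<64>(a ^ b).count()) / bits;
}
```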
Neither the ZNCC nor the CT function incorporates information related to color intensity. These functions remain insensitive to the loss of image color caused by light absorption and artificial illumination. Additionally, they exhibit resilience to noise and remain unaffected by the imbalanced gray values of the underwater image pair. By integrating the features of 3D labels to accommodate complex object surfaces [19], the cost function in Equation (3) is capable of adapting to marine engineering environments.

3.2.2. Smoothness Term

The second term of Equation (2), the smoothness term, penalizes differing labels in a local area [24]. This term is formulated as follows:
$\psi_{pq}(l_p, l_q) = \max(w_{pq}, \epsilon)\, \min(\bar{\psi}_{pq}(l_p, l_q), \tau_{dis})$, (11)
where $\epsilon$ is a threshold and $w_{pq}$ is a contrast-sensitive weight, defined in Equation (12).
$w_{pq} = e^{-\|I_L(p) - I_L(q)\|_1 / \gamma}$ (12)
This weight describes the color similarity between two adjacent pixels. The function $\bar{\psi}_{pq}(\cdot)$ penalizes the disparity dissimilarity between p and q, defined as
$\bar{\psi}_{pq}(l_p, l_q) = |d_p(l_p) - d_p(l_q)| + |d_q(l_q) - d_q(l_p)|$. (13)
Here, $d_i(l_j)$ is the disparity of pixel i when it employs the label $l_j$, where $i, j \in \{p, q\}$. The threshold $\tau_{dis}$ ensures that the smoothness term operates exclusively in regions where disparities remain continuous. The smoothness term can enhance the precision of matching in image areas with low texture and repetitive patterns.
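Reusing the PlaneLabel sketch from Section 2, the smoothness term of Equations (11)-(13) can be written directly; the signature is illustrative, and the default parameters follow Section 4.1:

```cpp
#include <algorithm>
#include <cmath>

// Pairwise smoothness term of Equations (11)-(13) for adjacent pixels p, q.
// colorDiffL1 is the L1 color distance ||I_L(p) - I_L(q)||_1.
double smoothness(const PlaneLabel& lp, const PlaneLabel& lq,
                  double px, double py, double qx, double qy,
                  double colorDiffL1,
                  double eps = 0.01, double tauDis = 2.5, double gamma = 25.0) {
    // Contrast-sensitive weight w_pq (Equation (12)).
    double w = std::exp(-colorDiffL1 / gamma);
    // Disparity dissimilarity evaluated at both pixel positions (Equation (13)).
    double psiBar = std::abs(lp.disparityAt(px, py) - lq.disparityAt(px, py))
                  + std::abs(lq.disparityAt(qx, qy) - lp.disparityAt(qx, qy));
    // Weight floored at eps, dissimilarity truncated at tauDis (Equation (11)).
    return std::max(w, eps) * std::min(psiBar, tauDis);
}
```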

3.3. Subdivision of Local Units

The division of the basic units is the foundation of the two matching stages. In this section, we utilize the inherent segment information of cross-based adaptive regions to steer the propagation process of the fine matching stage. The analysis of this process motivates us to improve the construction scheme of the cross-based patch. Additionally, our method employs fixed square grids in the expansion process of the coarse matching stage.

3.3.1. Basic Cross-Based Patch Scheme

Relative to fixed-shape local units, the adaptive unit is less affected by noise [27], rendering it as an ideal choice for underwater environments.
The cross-based patch can capture local structures within an image and handle different types of textures efficiently. It is iteratively expanded by examining the similarity between the central pixel and its neighboring pixels within a cross-shaped window. This patch enhances feature extraction and representation by considering the local context around each pixel.
The cross-based patch is extended by a color-sensitive cross skeleton composed of orthogonal line segments [37,39]. It exhibits superior efficiency in contrast to pixel-wise methods [19,40] that maintain an independent label space $\mathbb{R}^3$ per pixel. The cross-based patch also has good edge awareness and lower complexity.
Take an anchor p as an example, which is the center of a grid $C_{ij}$ as depicted in the left section of Figure 2. We first build a cross skeleton whose left arm is constructed from a set of consecutive pixels $p_l$ to the left of p, where $p_l$ satisfies the following rules:
$\max_{i \in \{R,G,B\}} |I(p) - I(p_l)|_i < \tau$, (14)
$\|p - p_l\| \le L$. (15)
Here, $|\cdot|_i$ denotes the difference in the ith color channel between p and $p_l$, and $\|\cdot\|$ is the distance between these pixels; their thresholds are $\tau$ and L, respectively. The remaining arms of the cross skeleton in the other directions are constructed following a process similar to that of the left arm.
The cross skeleton then gathers all vertical arms whose central pixels are positioned within the horizontal arms of p. The final patch is denoted as a cross-based patch $U_p$.
$U_p = \bigcup_{s \in (L_p \cup R_p)} (T_s \cup B_s)$, (16)
where the pixel s, located in the horizontal arms ($L_p$ and $R_p$) of the cross skeleton, is the center of the vertical arms $T_s$ and $B_s$. The region $U_p$ is shown in the middle of Figure 2.
Constrained within specific boundaries, such as the grid $C_{ij}$, the cross-based patch serves as an imperfect segment region.
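As a sketch of the arm-growth rule, the left arm extends pixel by pixel while Equations (14) and (15) hold; tau = 60 follows Section 4.1, while the concrete value of L is a placeholder, since the paper scales the distance thresholds with image width:

```cpp
#include <algorithm>
#include <cstdlib>
#include <opencv2/core.hpp>

// Length of the left arm of the cross skeleton anchored at (px, py).
// The other three arms are grown symmetrically.
int leftArmLength(const cv::Mat& rgb, int px, int py, int tau = 60, int L = 17) {
    cv::Vec3b anchor = rgb.at<cv::Vec3b>(py, px);
    int len = 0;
    while (len < L && px - (len + 1) >= 0) {          // distance rule, Eq. (15)
        cv::Vec3b cand = rgb.at<cv::Vec3b>(py, px - (len + 1));
        int maxDiff = 0;
        for (int c = 0; c < 3; ++c)
            maxDiff = std::max(maxDiff, std::abs(anchor[c] - cand[c]));
        if (maxDiff >= tau) break;                    // color rule, Eq. (14)
        ++len;
    }
    return len;
}
```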

3.3.2. Extended Cross-Based Patch Scheme

Examining the principle that pixels share a 3D label within a segment region, the cross-based patch’s anchor possesses the capacity to adopt any label from its patch. The propagation process can extend the anchor’s label searching range from its patch to the connectivity domain. Further insights are provided in Section 3.4.2.
However, owing to differences in color features under Equation (14), the propagation process is executed separately on small textures on the object surface and on the remaining surface areas. As a result, labels from other areas cannot propagate into a small texture, leading to poor performance of the propagation process in these textures.
Inspired by [41], we can find all small textures in an image. Based on Equation (14), neighboring pixels with features similar to the center pixel are grouped into a connectivity region in the four directions of the cross skeleton and marked as visited. This iterative process continues until all pixels have been visited.
This method of pixel clustering offers a robust approach for partitioning the connectivity domains across the entire image. Given a connectivity domain whose size is smaller than a specific threshold, pixels within this domain are marked as texture anchors. In this paper, this threshold is harmonized with the max disparity value.
Figure 3 illustrates the utilization of binary marks to annotate whether a pixel is situated in a small texture. If p is a texture anchor, all pixels in its cross-based patch U p are recognized as constituents of the same texture. This recognition is based on the consistent criteria in Equation (14), used to define the cross-based patch and the division of texture.
We refer to the regions in Section 3.3.1 as the basic cross skeleton and the basic cross-based patch, respectively. Given a texture anchor, its cross-based patch must be enhanced to improve the connectivity between the anchor's texture area and its neighboring regions. We elongate all arms of p's basic cross skeleton to their maximum length, defined as follows:
$\max \|p - p_i\| = L$. (17)
Here, $p_i$ represents an endpoint of an arm and L is the threshold in Equation (15). This modified cross skeleton is referred to as the extended cross skeleton. Finally, all vertical arms are consolidated as in Equation (16). These steps form an extended cross-based patch $E_p$, as shown in the right portion of Figure 3.
Some neighbor pixels of a small texture are covered by these extended patches, enabling connectivity between the texture and its surrounding areas. During the propagation process, neighboring areas can propagate their proposals to texture pixels via these covered pixels.
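The texture-anchor marking can be sketched as a 4-connected flood fill under the color rule of Equation (14), flagging every pixel of a connectivity domain smaller than the maximum disparity. Testing similarity against the region seed, as below, rather than against each visited pixel, is our assumption:

```cpp
#include <algorithm>
#include <cstdlib>
#include <queue>
#include <vector>
#include <opencv2/core.hpp>

// Mark pixels lying in connectivity domains smaller than maxDisparity
// (Section 3.3.2). Returns a binary mask of texture anchors.
cv::Mat markTextureAnchors(const cv::Mat& rgb, int tau = 60, int maxDisparity = 64) {
    cv::Mat visited = cv::Mat::zeros(rgb.size(), CV_8U);
    cv::Mat anchor  = cv::Mat::zeros(rgb.size(), CV_8U);
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    for (int y = 0; y < rgb.rows; ++y)
        for (int x = 0; x < rgb.cols; ++x) {
            if (visited.at<uint8_t>(y, x)) continue;
            std::vector<cv::Point> region;
            std::queue<cv::Point> q;
            q.push({x, y});
            visited.at<uint8_t>(y, x) = 1;
            cv::Vec3b seed = rgb.at<cv::Vec3b>(y, x);
            while (!q.empty()) {
                cv::Point p = q.front(); q.pop();
                region.push_back(p);
                for (int k = 0; k < 4; ++k) {
                    int nx = p.x + dx[k], ny = p.y + dy[k];
                    if (nx < 0 || ny < 0 || nx >= rgb.cols || ny >= rgb.rows) continue;
                    if (visited.at<uint8_t>(ny, nx)) continue;
                    cv::Vec3b c = rgb.at<cv::Vec3b>(ny, nx);
                    int d = std::max({std::abs(seed[0] - c[0]),
                                      std::abs(seed[1] - c[1]),
                                      std::abs(seed[2] - c[2])});
                    if (d < tau) { visited.at<uint8_t>(ny, nx) = 1; q.push({nx, ny}); }
                }
            }
            if ((int)region.size() < maxDisparity)   // small domain => texture
                for (const cv::Point& p : region) anchor.at<uint8_t>(p.y, p.x) = 1;
        }
    return anchor;
}
```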

3.3.3. Fixed Grid Scheme

To serve as the basic units in local expansion moves, the image $L$ is divided evenly into numerous square grids, each indexed by 2D coordinates $(i, j)$ and denoted as a center grid $C_{ij} \subseteq L$. An expansion grid $R_{ij}$ is composed of $C_{ij}$ and its eight surrounding center grids, as depicted in the right portion of Figure 2.
In the coarse and fine matching stages, labels in an expansion grid can be updated with a proposal generated by local expansion moves. An expansion grid is outlined by
$R_{ij} = C_{ij} \cup \bigcup_{(m,n) \in N(i,j)} C_{mn}$. (18)
To capitalize on the adaptability of different grid sizes to various image areas [42], our method uses three size levels of the center grid, denoted as $H = \{h_1, h_2, h_3\}$. Throughout this section, we employ h to represent any level. The cross-based region of the grid's center p must be confined within the boundaries of $C_{ij}$, as shown in the middle of Figure 2. The dimension of h is therefore defined as follows:
$h = 2L + 1$. (19)

3.4. Label Optimization Procedure

The label optimization procedure in this study comprises two key processes: expansion in the coarse matching and propagation in the fine matching. Concurrent optimization procedures can be executed across multiple expansion grids within the same grid group, facilitated by the substantial inter-grid spacing.
As the starting step, for each grid within the same group, we initialize labels through parallel computation. We pick a random pixel $p = (x, y)$ within a center grid, assign it a normal vector $\mathbf{n} = (n_x, n_y, n_z)$, and introduce a random disparity $d \in [0, d_{max}]$ [19]. Equation (20) transforms the pixel coordinate and vector into a label $l_p = (a_p, b_p, c_p)$.
$a_p = -\frac{n_x}{n_z}, \quad b_p = -\frac{n_y}{n_z}, \quad c_p = \frac{n_x x + n_y y + n_z d}{n_z}$ (20)
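A sketch of this initialization and the conversion of Equation (20); the re-draw guard against a near-zero n_z is our own safeguard, and PlaneLabel is the struct sketched in Section 2:

```cpp
#include <cmath>
#include <random>

// Random label initialization for pixel (x, y): random disparity in
// [0, dmax] and a random unit normal, converted via Equation (20).
PlaneLabel randomLabel(int x, int y, double dmax, std::mt19937& rng) {
    std::uniform_real_distribution<double> ud(0.0, dmax);
    std::uniform_real_distribution<double> un(-1.0, 1.0);
    double d = ud(rng);
    double nx = un(rng), ny = un(rng), nz = un(rng);
    while (std::abs(nz) < 1e-3) nz = un(rng);  // n_z must not vanish
    double norm = std::sqrt(nx * nx + ny * ny + nz * nz);
    nx /= norm; ny /= norm; nz /= norm;
    return { -nx / nz, -ny / nz, (nx * x + ny * y + nz * d) / nz };
}
```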
To estimate the pixel 3D labels, the method employs graph cut optimization of an energy function defined on the Markov random field [33], namely, local expansion moves. We use Equation (21) to derive a local label mapping $l'$ for an expansion grid $R_{ij}$.
$l' = \operatorname*{argmin}_{l} E(l) \quad \text{s.t.} \quad l_p' \in \{l_p, \alpha_{ij}\}, \; \forall p \in R_{ij}$ (21)
We perform the label optimization of a pixel p on its current label $l_p$, transforming it into $l_p'$. The updated label $l_p'$ is derived from a binary selection, namely $l_p' \in \{l_p, \alpha_{ij}\}$. When the energy associated with adopting a proposal $\alpha_{ij}$ is less than that associated with retaining the existing label $l_p$, p opts for $\alpha_{ij}$ as its updated label $l_p'$. Employing the graph cut algorithm, the energy function can be resolved, enabling the simultaneous assignment of pixels sharing similar features to the proposal $\alpha_{ij}$.
The main time-consuming component of our method is the local expansion moves, which are accelerated via parallel computation. Proper scheduling is essential, as local expansion moves on overlapping expansion grids cannot occur simultaneously.
To enable parallel computation, a strategy is implemented to group partitioned grids [24]. The grouping strategy involves selecting center grids at intervals of four vertical and horizontal units, bringing them into the same group. Grids within the same group do not overlap. These grid gaps ensure the independence of local expansion moves within the same group, preventing interference. This design enables simultaneous optimization of labels across multiple central grids within the same group. Consequently, we can perform them simultaneously in a parallel fashion.
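A sketch of this schedule: grids whose indices agree modulo 4 in both directions form one of 16 groups, and each group is swept as one parallel loop (the per-grid routine is a hypothetical stub):

```cpp
// Hypothetical per-grid optimization of Equation (21); stubbed here.
void expansionMove(int i, int j) { /* local alpha-expansion on R_ij */ }

// Grids inside a group are spaced four units apart, so their 3x3 expansion
// grids R_ij never overlap and can be optimized concurrently.
void runGroupedExpansions(int gridRows, int gridCols) {
    for (int gi = 0; gi < 4; ++gi)
        for (int gj = 0; gj < 4; ++gj) {
            #pragma omp parallel for collapse(2)
            for (int i = gi; i < gridRows; i += 4)
                for (int j = gj; j < gridCols; j += 4)
                    expansionMove(i, j);
        }
}
```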

3.4.1. Coarse Matching Stage

During the coarse matching stage, we directly utilize segment information derived from the cross-based patch within the central grid. In this stage, the center grid is divided into two parts using a cross-based patch, representing the foreground and background of local areas, and pixel labels are then updated for each region. We create a cross-based patch $U_p$ around the center p of the grid $C_{ij}$, as shown in Figure 4. Due to the length constraint in Equation (19), $U_p$ is confined within the grid $C_{ij}$. Consequently, the center grid can be partitioned into $U_p$ and the remaining area $M_p$, where $U_p \cap M_p = \emptyset$ and $U_p \cup M_p = C_{ij}$.
Three steps for generating proposals are employed in the expansion process. The initial step is space propagation. This step entails selecting a random pixel from $U_p$ or $M_p$. Owing to the similar color features under Equation (14), pixels within $U_p$ should be assigned an identical proposal by Equation (21), which concurrently reduces the impact of this label on the pixels in $M_p$. This step is then repeated in $M_p$.
The second step, inspired by [24], is called plane refinement. This step introduces a series of perturbations in $U_p$ and $M_p$, yielding proposals.
Take the plane refinement in $U_p$ as an example. Within each iteration, we randomly sample a pixel q in $U_p$. Its label $l_q = (a_q, b_q, c_q)$ is then converted into the form of a pixel coordinate $q = (x, y, d)$, where d is the current disparity, and a normal vector $\mathbf{n} = (n_x, n_y, n_z)$; the vector parameters follow Equation (20). We estimate a disparity $d'$ by adding a random perturbation $\Delta d \in [-\Delta d_{max}, \Delta d_{max}]$ to d. A new random normal vector $\mathbf{n}'$ is acquired as $\mathbf{n}' = \Delta \mathbf{n} + \mathbf{n}$, with the three components of $\Delta \mathbf{n}$ constrained to $[-\Delta n_{max}, \Delta n_{max}]$. Finally, the perturbed proposal $l_q'$ is derived from the combination of $d'$ and $\mathbf{n}'$ by Equation (20).
Repeating this step $k_r$ times yields a search range of proposals. Notably, with each iteration, the perturbation search ranges of $\Delta d$ and $\Delta \mathbf{n}$ are halved. The labels within the expansion grid $R_{ij}$ are periodically updated with local expansion moves. We replicate the plane refinement in the area $M_p$ in the same manner. Plane refinement thus assigns two separate label spaces to $U_p$ and $M_p$.
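A sketch of one refinement iteration; recovering (d, n) from a label and converting back uses Equation (20), the names are illustrative, and the guard on n_z is our own safeguard. The caller halves dMaxPerturb and nMaxPerturb after each of the k_r iterations:

```cpp
#include <cmath>
#include <random>

// One plane-refinement perturbation of a sampled label at pixel (x, y).
PlaneLabel refineLabel(const PlaneLabel& l, int x, int y,
                       double dMaxPerturb, double nMaxPerturb,
                       std::mt19937& rng) {
    // Recover the current disparity and a (scale-free) normal from the label,
    // using a = -nx/nz, b = -ny/nz with nz fixed to 1.
    double d = l.a * x + l.b * y + l.c;
    double nx = -l.a, ny = -l.b, nz = 1.0;
    std::uniform_real_distribution<double> pd(-dMaxPerturb, dMaxPerturb);
    std::uniform_real_distribution<double> pn(-nMaxPerturb, nMaxPerturb);
    d += pd(rng);
    nx += pn(rng); ny += pn(rng); nz += pn(rng);
    if (std::abs(nz) < 1e-3) nz = 1e-3;        // keep the plane non-vertical
    double norm = std::sqrt(nx * nx + ny * ny + nz * nz);
    nx /= norm; ny /= norm; nz /= norm;
    return { -nx / nz, -ny / nz, (nx * x + ny * y + nz * d) / nz };
}
```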
Lastly, a proposal obtained by RANSAC fitting is applied to the center grid $C_{ij}$; RANSAC estimates the 3D label of a disparity plane from the labels within $C_{ij}$, which may contain outliers. These steps collectively constitute the entirety of the coarse matching stage.
We exemplify the expansion process in Algorithm 1. There are four fundamental steps: groundwork (lines 1–3), space propagation (lines 5, 6), plane refinement (lines 7–12), and RANSAC fitting (lines 13, 14). In groundwork, a cross-based region $E_p$ is established with the grid center point as the anchor p. The unoccupied region $M_p$ is what remains after removing $E_p$ from the grid $C_{ij}$. In space propagation, we obtain a random label $l_q$ and apply $\alpha_{ij}$-expansion in $R_{ij}$. In plane refinement, the $\alpha_{ij}$-expansion is applied with random labels from each region, coupled with a series of perturbations. In RANSAC fitting, $\alpha_{ij}$ is obtained after the preceding steps have been executed. These steps are performed in every center grid.
Algorithm 1: Iterative expansion processes.

3.4.2. Fine Matching Stage

In the fine matching stage, we integrate multiple segments of incomplete information contained within cross-based patches. This integration is accomplished during the propagation process. The fine matching stage enables adjacent cross-based patches with similar features to exchange candidate labels through the propagation process. This allows anchors of cross-based patches to access a broader range of candidate labels beyond their individual patches and into their connected domain.
During the propagation process, all pixels within a center grid $C_{ij}$ are sequentially chosen as points to be updated. Starting from the top-left point o of the grid, the process proceeds from left to right and top to bottom, as shown in Figure 4.
Following the implementation of the extended cross-based patch in Section 3.3.2, small texture regions exhibit connectivity with their adjacent areas, allowing the propagation process to operate seamlessly between them. When the propagation process arrives at a pixel p, the following steps are carried out. A basic cross-based patch or an extended one, $E_p$, is established, depending on whether p lies within a small texture. All pixels within the patch $E_p$ are then recorded as visited, and a temporary expansion grid $R_t$ centered on p is built, whose size is defined by Equation (18) for local expansion moves. Using local expansion moves to update labels in the grid $R_t$, the algorithm iterates over unvisited pixels until the entire image area is covered.
Inspired by PatchMatch [19], the propagation process randomly selects a pixel s in $E_p$. The label of s is designated as a proposal $\alpha_{ij}$ for the grid $R_t$, as shown in Figure 4. Likewise, s acquires proposals stemming from its own cross-based patch. If s has already undergone label updates, then receiving s's label in $E_p$ signifies that $E_p$ employs a proposal from s's patch.
In this study, the aggregation of pixels into a connectivity domain S is based on shared texture characteristics. Given that both the patch $E_p$ and the connectivity domain S adhere to the partitioning rule in Equation (14), with $E_p$ under the further constraint of Equation (15), which applies exclusively to building the cross-based patch, $E_p$ can be considered a subset of S, i.e., $E_p \subseteq S$.
To sum up, the label search range for p extends from $E_p$ to S. Thus, during the fine matching stage, the propagation process exhibits attributes similar to segment-based methods, where pixels within a connectivity domain are assigned a unified label space $\mathbb{R}^3$.

3.4.3. Summary of Optimization Stages

This section provides a summary of two matching stages. Common segment-based stereo methods rely on the pre-segmentation of image regions using diverse image features [28,29,30]. Within these regions, candidate labels are identified and used in conjunction with the graph cut to finalize the disparity estimation process, such as PMSC [25].
In Section 3.3.1, we discussed how the cross-based patch constrained by the distance rule becomes an imperfect segment region. Previous algorithms [27,37] construct cross-based patches for each pixel, allowing candidate labels to be acquired within their respective regions. Those methods, however, rely solely on the incomplete segmentation information provided by these patches. Our method simplifies this approach by focusing, in the coarse matching stage, on cross-based patches centered on the center grid pixel. During the fine matching stage, neighboring cross-based patches with similar features can exchange candidate labels through the propagation process, thus broadening the scope of proposal sources beyond the confines of individual cross-based patches to segment regions based on color features.
As discussed in Section 3.3.3, by subdividing fixed center grids and expansion grids, parallel computation can be accomplished within grids of the same group. Within each group, we incorporated a gap of one center grid width between adjacent expansion grids. This gap is critical for maintaining the independence of local expansion moves.
We summarize the label optimization procedure. Our method commences by defining grid sizes, which are confined by the length of the cross skeleton. Then, we mark the pixels in small textures and initialize the label mappings. Coupled with texture anchors and local expansion moves, we perform iterative expansion processes and the propagation process at two matching stages in parallel.
Following the execution of both matching stages, our method engages in a post-processing stage. This step is designed to enhance the results further, incorporating left-right consistency checks and median filtering of the disparity maps, as described in [24]. The main phases of our methodology are shown in Figure 5.
The framework of our method is shown in Algorithm 2, which commences by defining grid sizes. Then, we mark the pixels in small textures and initialize the label mappings with the smallest grid size. The main loop of our method is shown in lines 4 to 13. Coupled with texture anchors, we perform the iterative expansion process of Algorithm 1 and the iterative propagation process in parallel.
Algorithm 2: Overview of optimization stages.

4. Results

Extensive experiments were conducted employing imagery from diverse sources to evaluate the efficacy of the proposed method for acquiring 3D information of underwater targets within the field of marine engineering. This section commences with an introduction to the algorithm’s operating environment, followed by an elucidation of primary parameter configurations in this study in Section 4.1. The experiments in this section involved the utilization of the UW-Middlebury dataset, which is a customized variant of the Middlebury benchmark dataset tailored specifically for underwater environments, in Section 4.2. We also conducted experiments on real underwater binocular image pairs spanning different scenes in Section 4.3.

4.1. Settings

In our experimental setup, we employed a personal computer equipped with a Xeon E5-1620 CPU (Intel, Santa Clara, CA, USA) (clocked at 3.50 GHz with four physical cores). Parallel acceleration was achieved through the deployment of eight threads and a memory allocation of 36 GB. Our method was implemented using C++ in conjunction with OpenCV.
The configuration of our energy function is defined as follows. In Equation (5), we adopt $\{\tau_H, \tau_Z\} = \{0.5, 0.4\}$ to regulate the range of cost values, following [3,36]. To balance the impact of ZNCC and CT on the data term, $\alpha$ is set to 0.5. The image window parameters for ZNCC and CT are set as $m = 4$, $n = 3$, which accommodates precisely 63 pixels, enabling the generation of a 63-bit binary code by the census transformation. This encoding, nearly equivalent to 64 bits, allows for convenient bitwise operations when computing the Hamming distance. The kernel area of the filter, $\omega_k$, is set to $41 \times 41$ in Equation (4), following [43].
The parameters governing the smoothness term are outlined as follows. In Equation (2), we set $\lambda = 1$ to balance the data term and the smoothness term. The values of $\{\epsilon, \tau_{dis}\}$ and $\gamma$ in Equations (11) and (12) follow LocalExp [24], with $\{\epsilon, \tau_{dis}\}$ fixed at $\{0.01, 2.5\}$ and $\gamma$ set to 25. In addition, we use a color threshold $\tau = 60$ in Equation (14), following [37]. Our method has three distance thresholds $L = \{l_1, l_2, l_3\}$ in Equation (15). Each of these thresholds is directly proportional to the image's width, taking advantage of grids of various sizes to adapt to different textural regions. The number of iterations for plane refinement is set to $k_r = 8$, following [24].

4.2. Experiments on UW-Middlebury Dataset

The findings in [44] establish a correlation between the underwater image degradation and the accuracy of disparity estimation. We introduce a deep learning rendering method [45], which converts the Middlebury dataset [46] into the underwater UW-Middlebury dataset, as shown in Figure 6. The generated dataset utilizes natural light fields to emulate the features of underwater scenes, transforming ordinary images into underwater-style portrayals. This method capitalizes on the properties of natural light, effectively replicating the visual characteristics unique to submerged environments.
The images in the UW-Middlebury dataset exhibit significant color deviations from the original images. Alongside the color transfer, a depth-based turbidity simulator is utilized to generate the degradation characteristic of actual underwater imagery. We apply the algorithm in [47], a probabilistic network that does not require ground-truth images, to enhance images within the UW-Middlebury dataset. This algorithm can adjust the contrast of the simulated images, thereby mitigating color distortion. Figure 6 depicts ten sets of experimentally relevant images. The image pairs of "Piano" and "Motorcycle" in the dataset exhibit imbalanced illumination, simulating the challenges posed by artificial underwater lighting. Each set comprises the left image of a matched pair from the Middlebury dataset, the ground-truth disparity map, their counterpart in the UW-Middlebury dataset, and the result of enhancement processing of the UW-Middlebury image.

4.2.1. Ablation Study

There are two optimization stages: the coarse matching stage and the fine matching stage. We constrain the iterations of each stage to compare their impact. We designed multiple sets of quantitative experiments employing three groups of images (Reindeer, Teddy, and Cones) sourced from the UW-Middlebury dataset. These images are denoted by the abbreviations r, t, and c, respectively, in Figure 7.
In some configurations, we update labels by minimizing the data term alone, omitting the smoothness term and thus avoiding the full energy function computation. The cost function in Equation (3) can likewise derive a label mapping $l'$ for an expansion grid $R_{ij}$ via Equation (22).
$l' = \operatorname*{argmin}_{l} E_{data}(l) \quad \text{s.t.} \quad l_p' \in \{l_p, \alpha_{ij}\}, \; \forall p \in R_{ij}$ (22)
Equations (21) and (22) share the same variables. In Figure 7, in the first two iterations, label updates are steered by the cost function in Equation (22), followed by the adoption of local expansion moves with Equation (21) in the subsequent iterations.
Due to the absence of plane refinement, the propagation process in the fine stage fails to generate sufficient candidate labels. In contrast to the continuous and infinite label space $\mathbb{R}^3$, the fine stage can only engage with a very limited number of candidate labels; it is therefore more appropriate to view this process as a redistribution of pre-existing pixel labels. Assigning appropriate candidate labels to pixels becomes highly challenging, which prevents this stage from being executed independently, and increasing the iteration count in this stage exerts minimal influence on the optimization results. Substituting an energy function for a cost function in local expansion moves enhances algorithmic accuracy, though altering the optimization stages leads to less pronounced improvements in accuracy while notably extending runtime, as shown in Figure 8.
Following the methods that employ a coarse-to-fine matching strategy, such as [26], which comprise multiple coarse matching stages followed by one fine matching stage, our method undergoes a total of six iterations. The choice of iteration rounds represents a trade-off between accuracy and computational efficiency.
In Figure 7, the legend “Prop” represents one coarse stage and one fine stage in the first two iterations and three coarse stages and one fine stage in the subsequent iterations. “Prop_r” signifies the use of these hybrid processes for conducting disparity estimation for Reindeer. Meanwhile, the legend “Exp” encompasses only coarse stages, regardless of whether the label-updating criteria account for the smoothness term. In ablation experiments to assess the roles of two matching stages, the coarse stage can be evaluated separately through multiple executions with “Exp” curves. Drawing from the abundant candidate labels yielded by the step of plane refinement in this stage, the fine stage can iteratively refine these labels, thereby shedding light on the function of the fine stage with “Prop” curves.
The results of our experiments on three groups of images indicate that our method converges readily when only the coarse stage is applied. The application of the fine stage can optimize the results of the coarse stage, yielding a further improvement in accuracy under both label-updating criteria.
In Figure 9, we compared the visualization outcomes of the disparity maps for different matching stages with Cones. Within the framework of local expansion moves utilizing the data item and the energy function, the disparity maps resulting from the method using only the coarse stage and the method combining the coarse and fine stages are referred to as “Expan_d”, “Expan_e”, “Prop_d”, and “Prop_e”, respectively. Following the post-processing of “Prop_e”, the final disparity map is referred to as “Post_processing”, whose error rates are documented in Table 1. In the right panel of Figure 7, the second and sixth iterations of the “Exp_c” line correspond to the maps “Expan_d” and “Expan_e”. Similarly, the “Prop_c” line corresponds to the maps “Prop_d” and “Prop_e”.
Observations within Frame 2 revealed instances of error matching using the coarse stage. While introducing an energy function improved this issue, the algorithm still exhibited misalignment in the repetitive texture region. In the regions outlined in Frames 1 and 3, the algorithm exhibited mismatches near the boundaries of the target when utilizing the coarse stage alone. However, the introduction of a supplementary fine stage significantly improved algorithm performance in these regions, as shown in the disparity map “Prop_e”.
Parallelization of two optimization stages: When updating labels using Equation (22), comparing CPU×8 and CPU×1, we observe a speed-up of about 3.5×. When updating labels using Equation (21), comparing CPU×8 and CPU×1, parallel computation increases operational speed by roughly a factor of 4. Our algorithm does not outperform others in terms of running speed. The coarse matching stage of our algorithm consumes roughly twice the runtime of LocalExp, detailed in [24], an algorithm with a similar framework to our coarse stage. This is primarily due to LocalExp’s simple cost functions, which are based on pixel color features, and the lack of division in the adaptive regions. Despite using an energy function rather than a cost function to update candidate labels in both matching stages, the modest increase in runtime suggests that the method’s computational complexity is primarily influenced by the cost function. The extended runtime of the fine matching stage should stem from the propagation process requiring more iterations within a center grid.

4.2.2. Results on the UW-Middlebury Dataset

We compared our method with LocalExp [24], Zhuang’s [21], and Lv’s [3] methods. The Cones, Reindeer, and Teddy in the UW-Middlebury dataset were subjected to the cost function described in Equation (3), while the remaining images were evaluated using a deep learning-based cost volume [48]. We substituted the raw cost function ρ ( · ) of Equation (5) with the following function:
$C(s, d) = \min(C_{CNN}(s, d), \tau_{CNN})$, (23)
where $\tau_{CNN} = 0.5$ is a truncation coefficient that limits the range of cost values. Given a discrete disparity d, the function $C_{CNN}(s, d)$ denotes an aggregation of the matching costs of all pixels within an $11 \times 11$ square window centered on the point s, following [24].
In contrast, PaLPaBEL [35], PatchMatch [19], and SGBM [41] are limited by their algorithm frameworks to applying their own individual cost functions across the whole dataset.
The error rates of the different methods for disparity estimation on the UW-Middlebury dataset are reported in Tables 1 and 2. Table 2 displays the rankings for the bad 1.0 metric, which measures the percentage of faulty pixels at an error threshold of 1.0 pixels, evaluated over non-occluded regions. Table 1 adopts the same metric for the entire image area. Both tables show the best result for each image in bold.
Our method exhibits a clear advantage over the other algorithms used for comparison. Through the application of distinct matching costs to different image groups in our experiment, we conducted a comparative analysis of the disparity estimation results for Cones, Teddy, and Reindeer. The results show the advantages of our cost function in Equation (3) while also illustrating the superiority of our label optimization process when compared to the matching errors for Adirondack and the six remaining images.

4.3. Experiments on Real Underwater Images

Two authentic datasets, sourced from deep-sea and nearshore environments, respectively, were utilized in this section's experiments. We employed a real underwater dataset acquired by the Hawaii Institute of Marine Biology [10], with images of 512 × 512 and 645 × 515 pixels.
The dataset was accessed on 20 February 2024 at https://github.com/kskin/UWStereo. Because the images in this dataset were pre-processed by prior researchers, we used the raw underwater image pairs directly as input for the experiment. In Figure 10, the left column shows the underwater images; from top to bottom are the left images of Leaf and Seabed. The remaining columns depict the disparity maps generated by different algorithms. We compared our method with all the algorithms employed in Section 4.2 except PatchMatch and SGBM, given their poor performance in Section 4.2.2.
We also utilized another dataset, previously captured by our research institute for the work in [3], a color correction method designed for non-uniform lighting conditions. This dataset comprises five underwater scenes that capture the characteristics of underwater images, showcasing color deviations resulting from underwater light absorption and scattering, as well as image quality degradation due to uneven lighting. The dataset provides calibration data for the images and ground-truth meshes of the target objects obtained from a Kinect.
The dataset from our institute features a resolution of 4K, processed using an enhancement algorithm [49]. Following this, stereo rectification was conducted based on the calibration data obtained from the MATLAB Stereo Camera Calibrator [50]. This dataset was accessed on 26 February 2024 at https://github.com/uwstereo/underwater-datasets. We conducted a comparative experiment on multiple stereo matching methods using the calibrated images, as shown in the left column of Figure 11. These underwater images are labeled Coralstone1, Coralstone2, Shell, Starfish, and Fish from top to bottom. The second column depicts the ground-truth 3D meshes of the underwater targets. The remaining columns depict the disparity maps generated by different algorithms.
To elucidate the significance of each component of the algorithms, we subdivided the compared algorithms into their constituent elements (disparity representation, energy function, and optimization strategy), as detailed in Table 3, irrespective of the specific details of the comparison algorithms.
PaLPaBEL [35], a stereo matching algorithm based on a propagation optimization strategy, is not suitable for parallel computing. It is the only method in Table 3 that employs discrete disparity values for disparity estimation. On irregular surfaces, its resultant disparity maps exhibit block artifacts, with noise regions akin to those illustrated in Coralstone2 of Figure 11, evident when the maps are magnified. The other methods do not present this problem on irregular object surfaces, emphasizing the significance of utilizing 3D labels for disparity representation.
LocalExp [24], which puts forward the local expansion moves, utilizes the same optimizer as our method. Furthermore, its algorithmic framework is akin to the approaches developed by Lv [3] and Zhuang [21]. Notably, it is the sole method in the table that employs an energy function based on color features, comprising pixel color and color gradient. The results indicate that LocalExp yielded disparity maps characterized by significant noise and mismatched points when evaluated on the two underwater datasets. Conversely, other stereo matching methods that employ relative pixel-value information in their cost functions exhibit superior performance. These comparative experiments demonstrate the benefits of our color-intensity-independent energy function when handling degraded underwater images.
The coarse matching stage in our approach shares algorithmic structural similarities with the methods of Lv [3] and Zhuang [21]. The key difference lies in how these methods rely on random sampling within a central grid area to extract candidate labels, whereas our coarse matching stage utilizes the segment information of cross-based patches across a grid to determine the range from which candidate labels are extracted. Experimental results on the two sets of underwater images show that Lv's method excels in deep-sea environments, as shown in Figure 10, while Zhuang's method performs better in nearshore environments, as shown in Figure 11. The robustness of these methods across different underwater datasets is limited. In contrast, by incorporating segment-level information derived from cross-based patches into the extraction of candidate labels for both the expansion and propagation processes, our approach demonstrates superior performance across both datasets. The visual evidence confirms the reduced noise levels and smoother appearance of the disparity maps produced by our method. Our results also contain fewer error regions (completely black areas in the disparity maps), where the disparity value is 0, equivalent to infinite depth.
To assess the accuracy of the different methods, we triangulated the objects segmented in the disparity maps to obtain point clouds [51]. Comparison of these point clouds with their corresponding ground-truth meshes involved an initial manual coarse registration, followed by fine registration using the iterative closest point (ICP) algorithm [52]. The mean and standard deviation of the Hausdorff distance between the clouds and the meshes serve as the evaluation metrics. The statistical findings for each method are summarized in Table 4.
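As a brute-force sketch of this metric, the mean and standard deviation of nearest-neighbor distances from the reconstructed cloud to a sampled ground-truth surface can be computed as below; a practical evaluation would sample the registered mesh and use a k-d tree instead of the quadratic search:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

struct P3 { double x, y, z; };

// Mean and standard deviation of point-to-surface distances, with the
// surface approximated by a sampled ground-truth point set.
std::pair<double, double> distanceStats(const std::vector<P3>& cloud,
                                        const std::vector<P3>& truth) {
    std::vector<double> d(cloud.size());
    for (size_t i = 0; i < cloud.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (const P3& t : truth) {
            double dx = cloud[i].x - t.x, dy = cloud[i].y - t.y,
                   dz = cloud[i].z - t.z;
            best = std::min(best, dx * dx + dy * dy + dz * dz);
        }
        d[i] = std::sqrt(best);  // nearest-neighbor distance for point i
    }
    double mean = 0.0;
    for (double v : d) mean += v;
    mean /= d.size();
    double var = 0.0;
    for (double v : d) var += (v - mean) * (v - mean);
    return { mean, std::sqrt(var / d.size()) };
}
```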
The superiority of our method in accurately reconstructing 3D underwater targets is highlighted in Table 4. Compared to Lv’s method, the most accurate currently available method, our method has shown a significant enhancement in the average precision of target reconstruction. Furthermore, notable improvements have been achieved in the accuracy of deviations for each reconstructed point. These experiment results show that our method delivers favorable outcomes by generating dense disparity maps in demanding underwater conditions characterized by light absorption and scattering.

5. Discussion

Results from the ablation study in Section 4.2.1 reveal that the introduction of an energy function for label optimization can enhance the algorithm’s performance in areas featuring repetitive textures. Combining the coarse and fine matching stages further strengthens the algorithm’s robustness in both textured repetitive regions and target boundary areas.
In Section 4.2.2 and Section 4.3, we evaluated the accuracy of various algorithms using disparity maps and reconstructed point clouds. Our method demonstrated superior performance in accuracy compared to both classical algorithms and current state-of-the-art approaches.
The impact of the working environment is generally more pronounced in classical methods [3]. Underwater environments often contain complex geometry, which can lead to significant occlusions. SGBM and PatchMatch, as classic local stereo methods, may struggle to accurately handle these occlusions, resulting in errors in disparity maps. In addition, color distortion in underwater images poses a challenge, as identical objects project with differing color characteristics across left and right images. This inconsistency in color features hinders the accurate depiction of the consistency of matching points through cost functions for PatchMatch and SGBM. Consequently, they often yield sparse disparity maps. State-of-the-art methods optimize energy functions using optimizers such as local expansion moves. In contrast to classic algorithms that focus on the similarity of matching points, optimizers can improve algorithm performance in occlusions and obtain dense disparity maps by considering the smoothness of the disparity map.
Given the real underwater datasets in Section 4.3, we conducted a visualization analysis using disparity maps to evaluate how our method performs relative to state-of-the-art methods. Relative to PaLPaBEL, our 3D label-based method exhibited adaptability to intricate surfaces. Compared to LocalExp, our non-color-feature-based energy function yielded disparity maps with lower noise levels. Additionally, the propagation process employed in our method demonstrated improved robustness across diverse underwater datasets, outperforming the methods of Lv and Zhuang.

6. Conclusions

This paper presents a stereo matching method designed for obtaining 3D information of underwater targets in marine engineering. The proposed method pioneers the integration of the cross-based patch into a propagation process in order to capitalize on the advantages of segment-based methods. To address the limitation in the propagation process imposed by the small texture, improvements are made to the cross-based patch approach.
Experimental results illustrate the superiority of our method over existing algorithms, highlighting the adaptability of the energy function to handle irregular underwater objects with the incorporation of 3D labels. In underwater simulation and real datasets, our method outperforms others in the level of detail present in the obtained disparity maps. As a variant of segment-based methods, our method has a better performance on poor textured surfaces. Furthermore, it demonstrates greater accuracy than other existing methods in obtaining three-dimensional information for marine engineering tasks.
The running speed of our current method remains a limitation. The forthcoming focus of our efforts will be on accelerating the division of the adaptive regions. One potential strategy involves transitioning from processing RGB images to utilizing alternative image formats for adaptive region division, as described in [53].

Author Contributions

Conceptualization, X.X.; methodology, X.X.; software, X.X. and L.M.; validation, L.M., K.S. and J.Y.; investigation, X.X., L.M., K.S., and J.Y.; resources, H.X., K.S., and J.Y.; data curation, X.X.; writing—original draft preparation, X.X.; writing—review and editing, H.X., L.M., K.S., and J.Y.; visualization, X.X.; supervision, H.X.; project administration, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China (2021YFC2800500).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

Thanks to Deep Sea Video Technology Laboratory, Institute of Deep-sea Science and Engineering, Chinese Academy of Sciences for the great support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MRF    Markov random field
ICP    Iterative closest point
ZNCC   Zero-mean normalized cross-correlation
CT     Census transformation
RANSAC Random sample consensus

References

  1. Beall, C.; Lawrence, B.J.; Ila, V.; Dellaert, F. 3D reconstruction of underwater structures. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 4418–4423.
  2. Li, T.; Rong, S.; Cao, X.; Liu, Y.; Chen, L.; He, B. Underwater image enhancement framework and its application on an autonomous underwater vehicle platform. Opt. Eng. 2020, 59, 083102.
  3. Lv, W.; Jin, X.; Jiang, G. A 3D Label Stereo Matching Method Using Underwater Energy Function. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2445–2449.
  4. Hogue, A.; German, A.; Jenkin, M. Underwater environment reconstruction using stereo and inertial data. In Proceedings of the 2007 IEEE International Conference on Systems, Man and Cybernetics, Montréal, QC, Canada, 7–10 October 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 2372–2377.
  5. Rizzini, D.L.; Kallasi, F.; Aleotti, J.; Oleari, F.; Caselli, S. Integration of a stereo vision system into an autonomous underwater vehicle for pipe manipulation tasks. Comput. Electr. Eng. 2017, 58, 560–571.
  6. Drap, P.; Seinturier, J.; Hijazi, B.; Merad, D.; Boi, J.M.; Chemisky, B.; Seguin, E.; Long, L. The ROV 3D Project: Deep-sea underwater survey using photogrammetry: Applications for underwater archaeology. J. Comput. Cult. Herit. (JOCCH) 2015, 8, 1–24.
  7. Mogstad, A.A.; Ødegård, Ø.; Nornes, S.M.; Ludvigsen, M.; Johnsen, G.; Sørensen, A.J.; Berge, J. Mapping the historical shipwreck Figaro in the high Arctic using underwater sensor-carrying robots. Remote Sens. 2020, 12, 997.
  8. Bobkov, V.; Melman, S.; Kudrashov, A.; Scherbatyuk, A. Vision-based navigation method for a local maneuvering of the autonomous underwater vehicle. In Proceedings of the 2017 IEEE Underwater Technology (UT), Busan, Republic of Korea, 21–24 February 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5.
  9. Sinha, S.N.; Mordohai, P.; Pollefeys, M. Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8.
  10. Skinner, K.A.; Zhang, J.; Olson, E.A.; Johnson-Roberson, M. UWStereoNet: Unsupervised learning for depth estimation and color correction of underwater stereo imagery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7947–7954.
  11. Ichimaru, K.; Furukawa, R.; Kawasaki, H. CNN based dense underwater 3D scene reconstruction by transfer learning using bubble database. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1543–1552.
  12. Chuang, T.Y.; Ting, H.W.; Jaw, J.J. Dense stereo matching with edge-constrained penalty tuning. IEEE Geosci. Remote Sens. Lett. 2018, 15, 664–668.
  13. Ancuti, C.O.; Ancuti, C.; De Vleeschouwer, C.; Bekaert, P. Color balance and fusion for underwater image enhancement. IEEE Trans. Image Process. 2017, 27, 379–393.
  14. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 825–830.
  14. Drews, P.; Nascimento, E.; Moraes, F.; Botelho, S.; Campos, M. Transmission estimation in underwater single images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 825–830. [Google Scholar]
  15. Massot-Campos, M.; Oliver-Codina, G. Optical sensors and methods for underwater 3D reconstruction. Sensors 2015, 15, 31525–31557. [Google Scholar] [CrossRef]
  16. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  17. Hariyama, M.; Takeuchi, T.; Kameyama, M. VLSI processor for reliable stereo matching based on adaptive window-size selection. In Proceedings of the 2001 ICRA—IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), Seoul, Republic of Korea, 21–26 May 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 1168–1173. [Google Scholar]
  18. Yoon, K.J.; Kweon, I.S. Locally adaptive support-weight approach for visual correspondence search. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 924–931. [Google Scholar]
  19. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  20. Xu, Y.; Yu, D.; Ma, Y.; Li, Q.; Zhou, Y. Underwater stereo-matching algorithm based on belief propagation. Signal Image Video Process. 2023, 17, 891–897. [Google Scholar] [CrossRef]
  21. Zhuang, S.; Zhang, X.; Tu, D.; Ji, Y.; Yao, Q. A dense stereo matching method based on optimized direction-information images for the real underwater measurement environment. Measurement 2021, 186, 110142. [Google Scholar] [CrossRef]
  22. Olsson, C.; Ulén, J.; Boykov, Y. In defense of 3d-label stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1730–1737. [Google Scholar]
  23. Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient belief propagation for early vision. Int. J. Comput. Vis. 2006, 70, 41–54. [Google Scholar] [CrossRef]
  24. Taniai, T.; Matsushita, Y.; Sato, Y.; Naemura, T. Continuous 3D label stereo matching using local expansion moves. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2725–2739. [Google Scholar] [CrossRef]
  25. Li, L.; Zhang, S.; Yu, X.; Zhang, L. PMSC: PatchMatch-based superpixel cut for accurate stereo matching. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 679–692. [Google Scholar] [CrossRef]
  26. Li, H.; Li, Z.; Huang, J.; Meng, B.; Zhang, Z. Accurate hierarchical stereo matching based on 3D plane labeling of superpixel for stereo images from rovers. Int. J. Adv. Robot. Syst. 2021, 18, 17298814211002113. [Google Scholar] [CrossRef]
  27. Xu, H.; Chen, X.; Liang, H.; Ren, S.; Wang, Y.; Cai, H. Crosspatch-based rolling label expansion for dense stereo matching. IEEE Access 2020, 8, 63470–63481. [Google Scholar] [CrossRef]
  28. Haq, M.A.; Khan, I.; Ahmed, A.; Eldin, S.M.; Alshehri, A.; Ghamry, N.A. DCNNBT: A novel deep convolution neural network-based brain tumor classification model. Fractals 2023, 31, 2340102. [Google Scholar] [CrossRef]
  29. Yousef, R.; Khan, S.; Gupta, G.; Siddiqui, T.; Albahlal, B.; Alajlan, S.; Haq, M.A. U-Net-based models towards optimal MR brain image segmentation. Diagnostics 2023, 13, 1624. [Google Scholar] [CrossRef]
  30. Haq, M.A.; Rahaman, G.; Baral, P.; Ghosh, A. Deep learning based supervised image classification using UAV images for forest areas classification. J. Indian Soc. Remote Sens. 2021, 49, 601–606. [Google Scholar] [CrossRef]
  31. Yuan, Z.; Cao, J.; Li, Z.; Jiang, H.; Wang, Z. SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6871–6880. [Google Scholar]
  32. Ji, P.; Li, J.; Li, H.; Liu, X. Superpixel alpha-expansion and normal adjustment for stereo matching. J. Vis. Commun. Image Represent. 2021, 79, 103238. [Google Scholar] [CrossRef]
  33. Altantawy, D.A.; Obbaya, M.; Kishk, S. A fast non-local based stereo matching algorithm using graph cuts. In Proceedings of the 2014 9th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, 22–23 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 130–135. [Google Scholar]
  34. Lu, J.; Yang, H.; Min, D.; Do, M.N. Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1854–1861. [Google Scholar]
  35. O’Byrne, M.; Pakrashi, V.; Schoefs, F.; Ghosh, B. A stereo-matching technique for recovering 3D information from underwater inspection imagery. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 193–208. [Google Scholar] [CrossRef]
  36. Giachetti, A. Matching techniques to compute image motion. Image Vis. Comput. 2000, 18, 247–260. [Google Scholar] [CrossRef]
  37. Mei, X.; Sun, X.; Zhou, M.; Jiao, S.; Wang, H.; Zhang, X. On building an accurate stereo matching system on graphics hardware. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 467–474. [Google Scholar]
  38. Lin, C.; Li, Y.; Xu, G.; Cao, Y. Optimizing ZNCC calculation in binocular stereo matching. Signal Process. Image Commun. 2017, 52, 64–73. [Google Scholar] [CrossRef]
  39. Zhang, K.; Lu, J.; Lafruit, G. Cross-based local stereo matching using orthogonal integral images. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 1073–1079. [Google Scholar] [CrossRef]
  40. Boykov, Y.; Veksler, O.; Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 1222–1239. [Google Scholar] [CrossRef]
  41. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef]
  42. Yang, Q.; Ji, P.; Li, D.; Yao, S.; Zhang, M. Fast stereo matching using adaptive guided filtering. Image Vis. Comput. 2014, 32, 202–211. [Google Scholar] [CrossRef]
  43. Besse, F.; Rother, C.; Fitzgibbon, A.; Kautz, J. Pmbp: Patchmatch belief propagation for correspondence field estimation. Int. J. Comput. Vis. 2014, 110, 2–13. [Google Scholar] [CrossRef]
  44. Yu, X.; Xing, X.; Zheng, H.; Fu, X.; Huang, Y.; Ding, X. Man-made object recognition from underwater optical images using deep learning and transfer learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1852–1856. [Google Scholar]
  45. Ye, T.; Chen, S.; Liu, Y.; Ye, Y.; Chen, E.; Li, Y. Underwater light field retention: Neural rendering for underwater imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 488–497. [Google Scholar]
  46. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-resolution stereo datasets with subpixel-accurate ground truth. In Proceedings of the Pattern Recognition: 36th German Conference—GCPR 2014, Münster, Germany, 2–5 September 2014; Proceedings 36. Springer: Berlin, Germany, 2014; pp. 31–42. [Google Scholar]
  47. Fu, Z.; Wang, W.; Huang, Y.; Ding, X.; Ma, K.K. Uncertainty inspired underwater image enhancement. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin, Germany, 2022; pp. 465–482. [Google Scholar]
  48. Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 2287–2318. [Google Scholar]
  49. Fan, Y.; Jin, X.; Deng, R.; Xie, J.; Sun, K.; Yang, J.; Zhang, B. Depth-rectified statistical scattering modeling for deep-sea video descattering. Infrared Laser Eng. 2022, 51, 20210919. [Google Scholar]
  50. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  51. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  52. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures; SPIE: Boston, MA, USA, 1992; Volume 1611, pp. 586–606. [Google Scholar]
  53. Hosni, A.; Bleyer, M.; Gelautz, M. Near real-time stereo with adaptive support weight approaches. In Proceedings of the 3DPVT, Paris, France, 17–20 May 2010; pp. 2–6. [Google Scholar]
Figure 1. The pipeline of our method, which is built on the construction of two units, the fixed grid and the adaptive region, corresponding to the different matching stages in which proposals are generated.
Figure 2. The relation between a cross-based patch and a center grid. The cross-based patch constructed with the center p of grid C_ij as the anchor is covered by grid C_ij.
Figure 3. The illustration of the process of enhancing the cross-based patch. Binary labels, “0” and “1”, are employed to signify whether the pixels are located within the small speckle.
Figure 4. The propagation process and expansion process occurring in a center grid C_ij.
Figure 5. Summary of the proposed methodology.
Figure 6. UW-Middlebury dataset.
Figure 7. Effects of different optimization stages on error rate.
Figure 8. Effect of optimization stages on running time. Prop_r and Prop_c signify the use of the two optimization stages for disparity estimation on Reindeer and Cones, respectively.
Figure 9. Visual effect of optimization stages. The images illustrate the changes in disparity during the coarse-to-fine matching stage, with and without the propagation process. Frames 1 and 3 show the method’s performance in boundary regions, whereas frame 2 highlights its effectiveness in areas with repetitive textures.
Figure 10. Visualization analysis of disparity results from diverse algorithms utilizing the dataset captured by Hawaii Institute [3,21,24,35].
Figure 11. Visualization analysis of disparity results from diverse algorithms utilizing the dataset captured by our institute [3,21,24,35].
Table 1. UW-Middlebury benchmark dataset for 1.0-pixel accuracy. Error percentages for all pixels (all) are shown. The shaded areas indicate the cost functions employed by each algorithm across three images. Best results are highlighted in bold.

Label                | Average (all) | Cones | Teddy | Reind | Adiro | Motor | Piano | Pipes | Playr | Playt | Recyc
Our method           | 11.58 | 10.89 | 14.06 | 14.06 | 5.0  | 8.41  | 11.06 | 15.43 | 17.96 | 9.99  | 8.96
LocalExp [24]        | 13.5  | 14.01 | 17.3  | 19.3  | 4.9  | 8.37  | 12.26 | 16.85 | 17.16 | 11.07 | 9.84
Zhuang’s method [21] | 12.45 | 14.73 | 15.61 | 14.72 | 4.71 | 8.07  | 12.72 | 17.19 | 18.53 | 11.59 | 9.84
Lv’s method [3]      | 14.01 | 12.25 | 19.78 | 22.15 | 4.41 | 8.28  | 14.77 | 16.38 | 20.15 | 11.22 | 10.73
PaLPaBEL [35]        | 31.08 | 16.34 | 25.92 | 30.05 | 33.01 | 25.65 | 34.77 | 34.02 | 45.05 | 36.14 | 29.8
SGBM [41]            | 35.85 | 26.53 | 33.9  | 36.33 | 39.59 | 26.81 | 34.77 | 32.92 | 45.56 | 54.79 | 31.23
PatchMatch [19]      | 32.46 | 20.33 | 38.04 | 24.27 | 35.34 | 26.51 | 33.34 | 36.06 | 49.06 | 32.9  | 28.77
Table 2. UW-Middlebury benchmark dataset for 1.0-pixel accuracy. Error percentages for non-occluded regions (nonocc) are shown. The shaded areas indicate the cost functions employed by each algorithm across three images. Best results are highlighted in bold.

Label                | Avg (nonocc) | Cones | Teddy | Reind | Adiro | Motor | Piano | Pipes | Playr | Playt | Recyc
Our method           | 6.57  | 4.94  | 11.47 | 9.03  | 1.46  | 3.65  | 7.33  | 4.97  | 9.71  | 6.86  | 6.28
LocalExp [24]        | 8.51  | 7.5   | 13.17 | 13.67 | 2.14  | 4.85  | 8.69  | 6.93  | 9.74  | 7.73  | 6.84
Zhuang’s method [21] | 7.64  | 7.73  | 11.99 | 12.16 | 1.98  | 4.47  | 9.33  | 7.5   | 10.38 | 8.38  | 7.17
Lv’s method [3]      | 7.73  | 5.01  | 14.53 | 13.47 | 1.43  | 3.75  | 10.04 | 5.66  | 9.84  | 6.49  | 7.09
PaLPaBEL [35]        | 24.77 | 8.12  | 18.18 | 24.81 | 30.03 | 18.68 | 30.64 | 22.84 | 37.33 | 30.47 | 26.6
SGBM [41]            | 28.28 | 17.08 | 25.88 | 31.74 | 34.22 | 17.8  | 29.92 | 18.83 | 36.75 | 49.29 | 25.7
PatchMatch [19]      | 25.69 | 12.57 | 31.43 | 17.62 | 30.77 | 19.4  | 28.65 | 24.86 | 41.65 | 26.03 | 23.9
Table 3. An overview of comparative algorithm frameworks. The main framework of a method is composed of three components, with each method’s configuration documented by checkmarks.

Method               | Disparity Representation: Discrete Value | Disparity Representation: 3D Label | Energy Function: Color-Based | Energy Function: Non-Color | Optimization: Expansion | Optimization: Propagation
PaLPaBEL [35]        |   |   |   |   |   |
LocalExp [24]        |   |   |   |   |   |
Zhuang’s method [21] |   |   |   |   |   |
Lv’s method [3]      |   |   |   |   |   |
Our method           |   |   |   |   |   |
Table 4. The distances between ground-truth meshes and point clouds obtained by the algorithms, measured in millimeters (mm). Each cell lists the average (avr) and standard deviation (stdev). Best results are highlighted in bold.

Method               | Average (avr/stdev) | Coralstone1 (avr/stdev) | Coralstone2 (avr/stdev) | Shell (avr/stdev) | Starfish (avr/stdev) | Fish (avr/stdev)
Our method           | 0.42/8.29  | 0.14/2.68  | 0.05/4.48    | 0.12/3.57   | 0.51/6.98  | 1.26/23.72
Lv’s method [3]      | 6.21/16.83 | 0.27/3.94  | 12.66/9.56   | 1.77/9.26   | 8.90/25.18 | 7.43/36.20
LocalExp [24]        | 19.18/62.02 | 0.94/4.60 | 38.71/166.60 | 24.94/45.68 | 7.73/17.69 | 23.58/75.49
Zhuang’s method [21] | 8.70/41.60 | 6.73/41.35 | 6.87/46.78   | 9.61/61.71  | 0.10/6.02  | 20.18/52.15
PaLPaBEL [35]        | 15.32/86.75 | 4.77/17.11 | 31.28/85.05 | 9.46/80.23  | 4.66/18.58 | 26.42/78.76