Article

Multi-Beam Sonar Target Segmentation Algorithm Based on BS-Unet

1 College of Information Science and Engineering, Hohai University, Changzhou 213000, China
2 College of Artificial Intelligence and Automation, Hohai University, Changzhou 213000, China
3 The Second Construction Co., Ltd. of CSCEC 7th Division, Suzhou 215300, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2841; https://doi.org/10.3390/electronics13142841
Submission received: 19 June 2024 / Revised: 16 July 2024 / Accepted: 17 July 2024 / Published: 19 July 2024
(This article belongs to the Special Issue AI Used in Mobile Communications and Networks)

Abstract

Multi-beam sonar imaging detection technology is increasingly becoming the mainstream technology in fields such as hydraulic safety inspection and underwater target detection due to its ability to generate clearer images under low-visibility conditions. However, during multi-beam sonar detection, low image resolution and blurred imaging edges reduce target segmentation accuracy, and traditional filtering of the echo signals cannot effectively solve these problems. To address these challenges, this paper introduces, for the first time, a multi-beam sonar dataset built around simulated crack detection for dam safety. The dataset contains simulated cracks detected by multi-beam sonar from various angles, with crack widths ranging from 3 cm to 9 cm and lengths from 0.2 m to 1.5 m. In addition, this paper proposes the BS-UNet semantic segmentation algorithm, which incorporates a bi-level routing attention mechanism into the Swin-UNet model to enhance the accuracy of sonar image detail segmentation. Furthermore, an online convolutional re-parameterization structure is added at the output end of the model to improve its capability to represent image features. Comparisons of the BS-UNet model with commonly used semantic segmentation models on the multi-beam sonar dataset consistently demonstrate its superior performance: it improves semantic segmentation evaluation metrics such as Precision and IoU by around 0.03 compared with the Swin-UNet model. In conclusion, BS-UNet can be effectively applied to multi-beam sonar image segmentation tasks.

1. Introduction

Semantic segmentation is an important part of safety inspection in hydraulic engineering [1,2]. Accurately segmenting engineering defects can help inspection personnel determine the extent of damage [3]. Traditional image segmentation algorithms mainly include threshold segmentation, edge detection, region-based algorithms, and clustering algorithms. The interactive segmentation algorithm proposed by Freedman et al. [4], which combines shape prior knowledge, belongs to the category of edge detection algorithms and partially addresses the problem of inaccurate segmentation when there is edge diffusion or multiple similar objects. Chuang et al. [5] proposed a fuzzy C-means algorithm that integrates spatial information into membership functions for clustering, solving the problem of uneven region segmentation in clustering algorithms. In addition, classical algorithms such as the OTSU algorithm [6] and region-growing algorithm [7] also belong to traditional image segmentation algorithms.
With the rapid development of deep learning, the use of neural networks for semantic segmentation has gradually replaced traditional methods and has become the mainstream approach. Early semantic segmentation methods divided images into small blocks for training neural networks, or directly compressed images before classifying pixels, because the fully connected layers of neural networks require fixed-size inputs; network structures such as LeNet-5 [8], AlexNet [9], and VGG [10] adopted this approach and achieved good results. The fully convolutional network (FCN) proposed by Long et al. [11], which uses convolutions instead of fully connected layers, allows the network to accept input images of any size. FCN demonstrated that neural networks can be trained end-to-end for semantic segmentation [12], laying the foundation for using neural networks for this task.
The quality of sonar images is strongly influenced by the underwater environment, and such images generally have low resolution, so applying semantic segmentation algorithms to underwater image processing has long been a challenging research area. Some specific algorithms have already been applied: for instance, DeepSea [13] is a deep learning model designed for underwater image processing, used for underwater target identification and seafloor terrain mapping, helping personnel quickly and accurately identify targets during underwater detection and operations. SonarNet [14] is a deep learning-based sonar image segmentation model specifically used for processing sonar images, assisting personnel in segmenting different types of underwater targets in sonar images.
Although the aforementioned algorithms have made significant progress in image segmentation for underwater target detection, they still have shortcomings in this application. Training the above-mentioned models on sonar data and performing target segmentation reveals two main issues: limited generalization ability and an inability to cope with the low resolution of sonar images.
In recent years, significant progress has been made in semantic segmentation, including not only the expansion of application scenarios but also innovations in algorithms and model architectures. The introduction of Transformer models has significantly improved the ability of semantic segmentation algorithms to deal with long-range dependencies in images [15]. Transformer is a deep learning architecture based on attention mechanisms, and its introduction has inspired new architecture designs in the field of semantic segmentation. For example, models such as Vision Transformer (ViT) [16] and its subsequent variants, such as SETR [17] and Swin Transformer [18], apply Transformers to visual tasks and significantly improve the performance of semantic segmentation models. To address the issues identified above, this article uses forward-looking multibeam sonar to collect images of simulated dam cracks from multiple angles at a simulated experimental site and establishes, for the first time, a simulated dam crack dataset based on forward-looking multibeam sonar. Additionally, this paper integrates the Swin-UNet model with the Bi-Level Routing Attention mechanism [19] and introduces online convolutional re-parameterization [20] to further reduce computational costs during model training, yielding the proposed BRA-Swin-UNet (BS-UNet) model for image segmentation tasks.

2. Related Works

2.1. Swin Transformer

Swin Transformer is an innovative Transformer architecture designed for computer vision tasks. By introducing a window-based self-attention mechanism, it effectively applies the Transformer model to various visual tasks such as image classification, object detection, semantic segmentation, and more [21,22,23]. The key point of Swin Transformer lies in its ability to handle images in a hierarchical and efficient manner, overcoming some limitations of traditional Transformer models in visual tasks. Many tasks in the visual domain require dense predictions at the pixel level. Traditional Transformer models face challenges when dealing with high-resolution images because the computational complexity of their self-attention mechanism grows quadratically with image size. However, Swin Transformer can create hierarchical feature maps, and its computational complexity scales linearly with image size.
Swin Transformer starts from small image patches and gradually merges neighboring patches in the deeper layers of the network to construct hierarchical feature maps. These hierarchical feature maps enable Swin Transformer to integrate easily with advanced algorithms such as FPN [24] or U-Net [25,26,27] for tasks like semantic segmentation.
The key to achieving linear computational complexity in Swin Transformer is its local computation of self-attention within non-overlapping windows of the image. Each window contains a fixed number of image patches, resulting in computational complexity linearly proportional to the image size. Unlike previous Transformer-based architectures, which often generate single-resolution feature maps with high computational complexity, Swin Transformer can serve as the backbone network for various visual tasks. A comparison between Swin Transformer and ViT is illustrated in Figure 1.
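To make the complexity argument concrete, the following minimal sketch (assuming a recent PyTorch version; not the authors' implementation) partitions a feature map into fixed-size windows and applies standard multi-head self-attention within each window, so the quadratic attention cost depends on the window size rather than on the full image resolution:

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, win*win, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

class WindowAttention(nn.Module):
    """Plain multi-head self-attention applied independently within each window."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(windows, windows, windows)
        return out

# Usage: a 224x224 feature map with 7x7 windows -> 1024 windows of 49 tokens each,
# so attention scales with num_windows * 49^2 instead of (224*224)^2.
x = torch.randn(1, 224, 224, 96)
tokens = window_partition(x, win=7)   # (1024, 49, 96)
y = WindowAttention(96)(tokens)
```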

2.2. Swin-UNet

Swin-UNet is a branch of work that applies the Swin Transformer architecture to image segmentation tasks. This model improves upon the traditional U-Net by replacing the convolutional layers in U-Net with Swin Transformer blocks, combining the Transformer's ability to model long-range dependencies with the efficiency of U-Net in image segmentation tasks.
The Swin-UNet model constructs a symmetric encoder–decoder structure with skip connections based on Swin Transformer blocks. In the encoder, local-to-global self-attention is implemented, while in the decoder, global features are upsampled to the input resolution for pixel-level segmentation predictions. The input images are first divided into several non-overlapping patches, which are fed into a Transformer-based encoder to learn deep feature representations. The decoder uses an upsampling method with patch-expanding layers to upsample the extracted contextual features, which are then fused with the multi-scale features from the encoder stage through skip connections to restore the spatial resolution of the feature maps and perform further segmentation predictions. In Swin-UNet, Swin Transformer blocks appear consecutively in pairs. Each Swin Transformer block consists of a normalization layer, a multi-head self-attention module, a residual connection, and a two-layer perceptron with a GELU activation function. The architecture of Swin-UNet is illustrated in Figure 2, where H and W are the height and width of the feature map, and C represents the feature dimension.
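As an illustration of the block structure just described, the sketch below (a simplification under our own assumptions, not the released Swin-UNet code) stacks LayerNorm, windowed multi-head self-attention, a residual connection, and a two-layer MLP with GELU; the window shifting and relative position bias of the real block are omitted:

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Simplified Swin Transformer block: LN -> (W-)MSA -> residual -> LN -> MLP -> residual."""
    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens of one window
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h                        # residual after self-attention
        x = x + self.mlp(self.norm2(x))  # residual after the two-layer perceptron
        return x
```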

3. Proposed Method

3.1. Bi-Level Routing Attention

The Bi-Level Routing Attention (BRA) structure is an attention mechanism that enables self-attention operations between a region and several other related regions. This helps reduce computational complexity while mitigating loss in long-range correlations. The operational process of the BRA structure is illustrated in Figure 3, where $S^2$ represents the number of non-overlapping regions, k represents the number of selected relevant regions, mm stands for matrix multiplication, and O denotes the final output feature matrix.
In self-attention operations, the query (Q), key (K), and value (V) are used as inputs. For each query, the attention function computes a weighted sum of the values, where the weights are the normalized dot products between the query and the corresponding keys. The calculation formula is as follows, where C represents the feature dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{C}}\right)V$$
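The formula above can be transcribed directly; the following single-head sketch (an illustration, not the authors' code) uses C as the feature dimension of the keys:

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention; q, k, v have shape (batch, tokens, C)."""
    C = q.shape[-1]
    weights = F.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)
    return weights @ v
```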
For vision Transformers, X is the flattened spatial feature map with N = H × W tokens, where H and W are the height and width of the feature map, respectively. "Multi-head" refers to splitting the output along the channel dimension into h heads, with each head using a set of independent projection weights:

$$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_0, \mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^{q}, XW_i^{k}, XW_i^{v})$$

where $\mathrm{head}_i$ is the output of the $i$-th attention head, and $W_i^{q}$, $W_i^{k}$, $W_i^{v}$ are the corresponding projection weights of the input. The complexity of MHSA is $O(N^2)$, and this high computational complexity is a key issue limiting the scalability of Transformers. The proposed Bi-Level Routing Attention (BRA) introduces a dynamic two-level sparse attention mechanism. The key idea is to divide self-attention into two stages: a coarse-grained region-level stage and a fine-grained token-level stage. In the coarse-grained stage, irrelevant key–value pairs are filtered out, leaving only a subset of relevant regions; in the fine-grained stage, attention operations are performed within these relevant regions. In the BRA structure, a 2D feature map X is first decomposed into S × S non-overlapping regions, each containing $HW/S^2$ feature vectors. Then, linear projection is used to derive the query, key, and value vectors:
$$Q = X^{r}W^{q}, \quad K = X^{r}W^{k}, \quad V = X^{r}W^{v}$$

where $W^{q}$, $W^{k}$, and $W^{v}$ are the projection weights for the query, key, and value, respectively. Region-level queries and keys, $Q^{r}$ and $K^{r}$, are obtained by averaging Q and K within each region; a matrix multiplication between $Q^{r}$ and the transpose of $K^{r}$ then yields a matrix expressing how closely the regions are related:
$$A^{r} = Q^{r}(K^{r})^{T}$$

The entries of matrix $A^{r}$ represent the semantic correlation between pairs of regions. Next, the topk operation is applied to derive the k regions most correlated with each target region:

$$I^{r} = \mathrm{topkIndex}(A^{r})$$

By utilizing this region-to-region routing index matrix, fine-grained attention can be computed. For each query in region i, attention operations are performed on all keys and values gathered from the k routed regions indexed by $I^{r}$:

$$K^{g} = \mathrm{gather}(K, I^{r}), \quad V^{g} = \mathrm{gather}(V, I^{r})$$

where $K^{g}$ and $V^{g}$ are the gathered keys and values of the k related regions. Finally, BRA introduces a local context enhancement term, LCE(V):
$$O = \mathrm{Attention}(Q, K^{g}, V^{g}) + \mathrm{LCE}(V)$$
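The sequence of operations above can be condensed into a single module. The sketch below is an assumption-level illustration (not the BiFormer reference implementation): it performs region partitioning, region-level routing with averaged queries and keys, top-k region selection, gathering of the routed keys and values, fine-grained attention, and a local context enhancement term approximated here by a depth-wise convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BRASketch(nn.Module):
    def __init__(self, dim: int, num_regions: int = 7, topk: int = 4):
        super().__init__()
        self.S, self.k = num_regions, topk
        self.qkv = nn.Linear(dim, 3 * dim)
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # LCE(V) as a depth-wise conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), with H and W divisible by S
        B, H, W, C = x.shape
        S, k = self.S, self.k
        n = H * W // (S * S)                                    # tokens per region
        # Partition into S x S regions and project per-token q, k, v.
        xr = x.view(B, S, H // S, S, W // S, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, S * S, n, C)
        q, kk, v = self.qkv(xr).chunk(3, dim=-1)
        # Region-level routing: averaged queries/keys, adjacency matrix, top-k indices.
        a_r = q.mean(2) @ kk.mean(2).transpose(-1, -2)          # (B, S^2, S^2)
        idx = a_r.topk(k, dim=-1).indices                       # (B, S^2, k)
        # Gather key/value tokens of the k routed regions for every query region.
        idx_exp = idx[..., None, None].expand(-1, -1, -1, n, C)
        kg = torch.gather(kk.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
        vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_exp)
        kg, vg = kg.flatten(2, 3), vg.flatten(2, 3)             # (B, S^2, k*n, C)
        # Fine-grained attention restricted to the routed regions.
        attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
        out = (attn @ vg).reshape(B, S, S, H // S, W // S, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        # Add the local context enhancement term computed on V.
        v_map = v.reshape(B, S, S, H // S, W // S, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out + self.lce(v_map).permute(0, 2, 3, 1)
```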

3.2. Online Convolutional Re-Parameterization

Structural re-parameterization is applied in many computer vision tasks, including compact model design, architecture search, and model pruning, among other areas. The core idea of re-parameterization is that different architectures can be transformed into each other through equivalent transformations of parameters. In a re-parameterized model, the Batch Normalization (BN) layer is a crucial component: in a re-parameterization block, a BN layer is added immediately after each convolutional layer, and removing these BN layers would degrade model performance. However, these BN layers introduce a significant computational burden during the training phase. During inference, complex modules can be compressed into a single convolutional layer, but during training, BN layers are non-linear, which prevents the entire module from being merged and results in a large number of intermediate computational operations (high FLOPs) and buffered feature maps (high memory usage). Such high computational complexity severely affects the efficiency of the model.
Online Convolutional Re-parameterization (OREPA) removes all non-linear BN layers and introduces linear scaling layers. These linear scaling layers have similar properties to BN layers; they diversify optimization across different branches, each layer is linear, and they can be merged into convolutional layers during training. In the second stage, the complex linear blocks from the previous step are simplified into a single convolutional layer. OREPA significantly reduces training costs by reducing the computational and storage burden caused by intermediate layer computations, with minimal impact on model performance. The process of Online Convolutional Re-parameterization is illustrated in Figure 4.
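As a toy illustration of this linearization idea (an assumption-level sketch, not the OREPA source code): a convolution followed by a channel-wise linear scaling layer is itself a convolution, so each scaled branch can be folded into a kernel, and several branches can be summed into one equivalent convolution that is executed online during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledConvBranch(nn.Module):
    """Convolution followed by a per-channel linear scaling layer (no BN), foldable into one kernel."""
    def __init__(self, cin: int, cout: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, k, k) * 0.01)
        self.scale = nn.Parameter(torch.ones(cout))   # linear scaling layer

    def folded_weight(self) -> torch.Tensor:
        return self.weight * self.scale.view(-1, 1, 1, 1)

class OnlineReparamConv(nn.Module):
    """Several scaled branches collapsed into a single convolution at forward time."""
    def __init__(self, cin: int, cout: int, k: int = 3, branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(ScaledConvBranch(cin, cout, k) for _ in range(branches))
        self.pad = k // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.stack([b.folded_weight() for b in self.branches]).sum(0)
        return F.conv2d(x, w, padding=self.pad)   # one conv instead of `branches` convs

# Usage: behaves like a single 3x3 conv while training multi-branch parameters.
y = OnlineReparamConv(16, 32)(torch.randn(1, 16, 64, 64))
```

Because the folding happens inside the forward pass, the module optimizes the multi-branch parameters while paying the cost of a single convolution per training step.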

3.3. BS-UNet

This paper proposes a BRA-Swin-UNet network structure by integrating the BRA attention mechanism and online convolutional reparameterization structure on the basis of the Swin-UNet network model.
In the Swin-UNet network structure, the input images need to undergo Patch Embedding first, dividing the input image into several small blocks (i.e., “patches”), and then transforming these blocks into a one-dimensional sequence of vectors. This allows the original two-dimensional image data to be processed by the Transformer model. This paper introduces the BRA attention mechanism between Patch Embedding and the input to the first Swin Transformer block, aiming to optimize the feature fusion process before deep feature extraction by utilizing BRA’s dual-layer routing capability. Swin Transformer divides the image using a fixed-size window. For images with lower resolutions, using a single-size window for division may not fully capture key information in the image. Introducing the BRA mechanism can group relevant regions together during window division, enabling the better handling of detailed information in tasks such as semantic segmentation of sonar images.
In the Swin-UNet architecture, there is a convolutional layer at the output end, mapping the complex information processed by all previous layers to the target space, i.e., converting deep features into the final segmentation image. In the decoder stage of Swin-UNet, the spatial dimensions of the image are gradually restored. The final convolutional layer is responsible for synthesizing these restored details and contextual information to generate refined segmentation results. Therefore, the convolutional layer not only completes the mapping from features to pixel-level classification but also affects the details and accuracy of the model’s segmentation results. Optimizing this layer can significantly improve the model’s performance in multi-beam sonar image segmentation tasks.
This paper replaces the original convolutional layer with an online re-parameterized convolution. Through the re-parameterization design, the model's expressive power can be enhanced without additional computational burden; the model can capture and integrate sonar image features more efficiently, thereby improving the accuracy and quality of sonar image segmentation, while the online convolutional re-parameterization structure further improves computational efficiency during training. The BS-UNet model is illustrated in Figure 5. The red module represents the BRA attention mechanism, which is applied after Patch Embedding, while the OREPA structure is applied just before the final output of the segmentation result.
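The placement of the two modules can be summarized by the schematic sketch below, which assumes a Swin-UNet implementation exposing separate `patch_embed`, `encoder_decoder`, and output stages (these attribute names, the residual connection around BRA, and the channel-last/channel-first handling are our own placeholders, not the authors' code); it reuses the BRASketch and OnlineReparamConv sketches from the previous sections:

```python
import torch.nn as nn

class BSUNetSketch(nn.Module):
    def __init__(self, swin_unet: nn.Module, dim: int = 96, num_classes: int = 2):
        super().__init__()
        self.patch_embed = swin_unet.patch_embed        # placeholder attribute names
        self.backbone = swin_unet.encoder_decoder       # Swin-UNet encoder-decoder with skips
        self.bra = BRASketch(dim)                       # BRA right after patch embedding
        self.head = OnlineReparamConv(dim, num_classes, k=1)  # re-parameterized output conv

    def forward(self, x):
        tokens = self.patch_embed(x)         # assumed (B, H', W', C) patch features
        tokens = tokens + self.bra(tokens)   # bi-level routing attention with a residual
        feats = self.backbone(tokens)        # assumed (B, H, W, C) decoded features
        feats = feats.permute(0, 3, 1, 2)    # to channel-first for the convolutional head
        return self.head(feats)              # pixel-level segmentation logits
```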

4. Experimental Results

4.1. Multibeam Sonar Data Acquisition

Reservoir dams are significant hydraulic infrastructure for water resource allocation projects, and ensuring their safe operation directly relates to flood prevention, comprehensive water resource development and utilization, and ecological environmental protection downstream. The long-term operation of reservoir dams can be affected by natural and human factors, leading to various structural defects, among which crack defects are the most common. This paper artificially induces cracks in underwater concrete structures at the experimental site and utilizes multi-beam sonar for collecting crack data.
During the data collection process, a BlueView M1200d series forward-looking multibeam sonar from Teledyne Marine (Daytona Beach, FL, USA) was used. This sonar provides two operating modes, at 1.2 MHz and 2.1 MHz, with imaging widths of 130° at 1.2 MHz and 80° at 2.1 MHz. Although the 1.2 MHz mode provides a wider imaging range, the imaging quality of the high-frequency mode is higher. Considering that the imaging resolution of the sonar itself is not high, and that both data augmentation using the Pix2Pix network and the training of deep learning segmentation models require high-quality images, this study used the 2.1 MHz operating mode for data collection.
This paper mainly focuses on the simulation of experimental environments for the most common crack defects. Since the basic structure of dams is composed of concrete material, concrete is also used as the basic structure in the simulated experimental environment, and artificial damage is applied to create cracks on its surface. The simulated experimental scene is shown in Figure 6.
Compared to early forward-looking sonars with very small opening angles, modern multibeam sonars achieve wide coverage in the horizontal direction by deploying multiple beams horizontally. Multibeam sonars no longer use the traditional single-beam rotating reception method; instead, they utilize pre-formed multiple receiving beams to obtain horizontal azimuth and distance information within the sectoral scanning range. Multibeam sonars employ beamforming technology to provide the receiving transducer with multiple receiving beams of the same width but in different directions. The directionality of the beams determines the horizontal azimuth of the signal, while the beam width determines the resolution in the azimuth direction. The horizontal and vertical beam opening angles and detection range of a forward-looking multibeam sonar are illustrated in Figure 7.
When using a multibeam sonar for detection, the pitch angle of the sonar (i.e., the degree of tilt of the sonar relative to the horizontal plane) is a key factor influencing the imaging of the multibeam sonar. The pitch angle directly affects the coverage range of the transmitted sound beams, and by adjusting the pitch angle, the detection range and coverage area of the sonar can be optimized according to actual requirements. Proper pitch angle settings can help improve the resolution and accuracy of detecting targets.
By precisely controlling the emission angle of the sound beams, the sonar can more effectively focus on specific areas or targets, thereby obtaining higher-quality echo data and improving image clarity and target recognition capabilities. After considering the requirements of the simulated experimental scene and the imaging quality, this study chose to collect data with the sonar’s pitch angle between 45° and 75°, at a distance of about 1 m from the target plane. To better adjust the sonar’s pitch angle, the sonar was mounted on a small gimbal. The assembly of the multibeam sonar and the gimbal is shown in Figure 8.
During the simulated experiments, in order to expand the dataset as much as possible, two-dimensional sonar images containing crack defects were collected at various angles by adjusting the pitch angle of the sonar and changing the horizontal position of the sonar relative to the defect. In total, 500 valid sonar images were collected; the images used for training, validation, and testing were collected from different areas, with no overlap between regions. Some of the images are shown in Figure 9.

4.2. Analysis of Ablation Experiment Results

The original dataset was first divided into training, validation, and testing sets in an 8:1:1 ratio, with no overlapping regions between them. Data augmentation was then applied to the training set, resulting in a total of 1000 sonar images. The training set was used to train the model, the validation set was used to evaluate the model's performance during training, and the testing set was used to assess the model's accuracy.
All experiments in this paper were implemented under the PyTorch 1.8.0 framework. The BS-UNet model was built on PyTorch, and Res-UNet++ [28], Attention-UNet [29], and Trans-UNet [30] models were constructed as comparison models. The batch size was set to 8, the base learning rate to $10^{-2}$, and training ran for 100 epochs. These hyperparameter values were determined based on validation experiments and the available hardware resources.
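A minimal training-setup sketch matching the stated hyperparameters (batch size 8, base learning rate $10^{-2}$, 100 epochs) is given below; the optimizer choice, loss function, and dataset objects are our own assumptions rather than the authors' released configuration:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda"):
    loader = DataLoader(train_set, batch_size=8, shuffle=True)              # batch size 8
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # base lr 1e-2 (SGD assumed)
    criterion = torch.nn.CrossEntropyLoss()                                 # loss choice assumed
    model.to(device)
    for epoch in range(100):                                                # 100 epochs
        model.train()
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        # validation pass on the held-out set omitted for brevity
```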
The algorithm in this paper is an improvement upon the Swin-UNet framework, and the backbone network still utilizes Swin Transformer. A BRA attention module was introduced between the Patch Embedding layer and the Swin Transformer blocks, and an online convolutional re-parameterization structure was applied to the convolutional layer at the output end. To validate the effectiveness of these two modules, ablation experiments were conducted on the proposed algorithm. Figure 10 shows the visual results of the ablation experiments.
Figure 10 compares the results of the BS-UNet algorithm, the original Swin-UNet algorithm, and the Swin-UNet algorithm with the added BRA attention module. The parts circled in red indicate areas where the segmentation results are missing content or contain spurious content compared to the ground-truth binary mask. From the figure, it can be observed that the Swin-UNet algorithm exhibits discontinuities when segmenting underwater targets and handles details roughly. Although the Swin-UNet + BRA algorithm shows some improvement in detail processing, it still suffers from omissions and redundancies in the segmented content. In comparison, the BS-UNet algorithm shows no obvious omissions in the segmented structure and no discontinuities in the segmented targets. Although there are still deficiencies in handling some edge details, there is a noticeable improvement compared to the other two algorithms.
Based on the performance comparison in Table 1, both the BRA module and the OREPA module enhance the segmentation performance of the Swin-UNet model. Compared to the original Swin-UNet algorithm, BS-UNet increased Precision by 3 percentage points and Recall by 3.5 percentage points. The algorithm proposed in this paper therefore performs better in sonar data segmentation tasks.

4.3. Analysis of Comparative Experiment Results

To accurately evaluate the effectiveness of the BS-UNet network model, this paper compares its performance metrics with representative networks in the field of image segmentation, including Res-UNet, Attention-UNet, and Trans-UNet. The experimental visualization results of different networks on the sonar dataset are shown in Figure 11.
While preserving the advantages of the Swin-UNet architecture, the algorithm proposed in this paper enhances the model's understanding and processing of different regions and features in images through the Bi-Level Routing Attention mechanism. The application of online convolutional re-parameterization allows the model to dynamically adjust its convolutional kernels, enhancing its ability to capture and represent image features without significantly increasing computational complexity. These adjustments make the proposed algorithm superior to the other algorithms in segmenting cracks of different shapes and at different angles. Therefore, the BS-UNet segmentation network plays a positive role in dam crack segmentation. Table 2 presents the segmentation metrics of the various network models in the comparative experiments.
From Table 2, it can be observed that on the self-built sonar dataset, BS-UNet achieves Precision, IoU, Recall, and Dice metrics of 79.4%, 65.4%, 78.7%, and 79.0%, respectively, outperforming other semantic segmentation models and validating the effectiveness of the algorithm proposed in this paper. This study also employed 3-fold cross validation for validation experiments, and the final results effectively support the conclusions of this paper.
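For reference, the four reported metrics can be computed from a predicted binary mask and its ground-truth mask as in the sketch below (a generic formulation; the exact evaluation script used in the paper is not stated):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """Compute Precision, IoU, Recall, and Dice for binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # true positives
    fp = np.logical_and(pred, ~gt).sum()     # false positives
    fn = np.logical_and(~pred, gt).sum()     # false negatives
    return {
        "Precision": tp / (tp + fp + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Recall": tp / (tp + fn + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```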
Figure 12 shows the loss function curves and Dice curves of BS-UNet, Att-UNet, and Swin-UNet. It can be seen that the curves are relatively stable with high fitting, further illustrating the good performance of the algorithm proposed in this paper in handling sonar data.

4.4. Comparison of Model Computational Efficiency

In addition to segmentation accuracy, model size, computational complexity, and inference time are also important evaluation metrics for measuring the quality of segmentation models. Excessive computational complexity can lead to a decrease in model detection efficiency and higher device requirements. The parameter count (Params) refers to the total number of parameters in the model used to measure the model’s spatial complexity. Floating-point operations (FLOPs) can be understood as computational load, used to measure the algorithm’s time complexity. This paper comprehensively considers the time complexity, spatial complexity, and inference time of the models to evaluate their computational efficiency. Table 3 presents the comparative results of computational efficiency for each model.
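The parameter count and average inference time reported in Table 3 can be measured as in the sketch below (FLOPs are typically obtained with an external profiler such as thop or fvcore; since the exact tool used in the paper is not stated, none is hard-coded here, and the input shape and number of timing runs are our own assumptions):

```python
import time
import torch

def count_params(model: torch.nn.Module) -> float:
    """Total number of model parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def average_inference_time(model, input_shape=(1, 3, 224, 224), runs=100, device="cuda"):
    """Average seconds per forward pass on a random input of the given shape."""
    model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs
```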
The experimental results indicate that the Swin-UNet model has significant advantages in terms of time complexity, space complexity, and inference time. Although the algorithm proposed in this paper increases inference time to some extent, it is still more lightweight than the other algorithms. Since the benefit of online convolutional re-parameterization is not reflected in metrics such as inference time, an additional set of comparative experiments was run: under the same hardware and dataset, the time for 200 epochs of training was measured for the Swin-UNet model with ordinary convolutional re-parameterization and for the algorithm proposed in this paper. The training time for the Swin-UNet model with ordinary convolutional re-parameterization was 2 h 11 min, while the training time for the proposed algorithm was 1 h 34 min. The online convolutional re-parameterization structure thus effectively alleviates the computational burden introduced by the multi-branch convolutional structure.

5. Conclusions

This paper primarily proposes a BS-UNet network model that integrates the BRA attention mechanism and online convolutional reparameterization structure for the multi-beam sonar segmentation task and validates the model’s effectiveness using simulated crack image datasets collected by multi-beam sonar.
Swin-UNet is a deep learning model that combines the Swin-Transformer and U-Net architectures and is currently one of the commonly used models in semantic segmentation tasks. In this paper, the BRA attention mechanism was incorporated into the Swin-UNet model, which is primarily used to capture and emphasize important information at different levels. With a dual-layer layout, it can simultaneously focus on both global and local features that have the greatest impact on the final task, enhancing segmentation details for low-resolution sonar images. In addition to the BRA module, an online convolutional reparameterization structure was added to the output end of Swin-UNet, which improved the model’s expressive power and adaptability without significantly increasing the computational complexity of the model.
The dataset used in the segmentation comparative experiments of this paper was collected using forward-looking multibeam sonar. By comparing the segmentation effects of crack data detected at different shapes and angles across multiple models, the conclusion can be drawn that the incorporation of the BRA attention mechanism and online convolutional reparameterization enabled the BS-UNet model to achieve good performance in multi-beam sonar image segmentation tasks. It can accurately and coherently segment crack defects.

Author Contributions

Conceptualization, W.Z., Y.Z. and X.Z.; methodology, W.Z., Y.Z. and X.Z.; software, P.Z. and R.W.; validation, W.Z., Y.Z. and P.Z.; formal analysis, P.Z. and R.W.; investigation, W.Z. and Y.Z.; resources, J.X. and Y.C.; data curation, W.Z., Y.Z. and X.Z.; writing—original draft preparation, W.Z.; writing—review and editing, X.Z.; visualization, W.Z. and X.Z.; supervision, J.X. and Y.C.; project administration, J.X. and Y.C.; funding acquisition, J.X. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2022YFB4703404); Research and Development Project of China Construction Co., Ltd. for the Year 2023 (CSCEC-2023-Z-10); and Ministry of Housing and Urban-Rural Development 2022 Science and Technology Plan Project: Research on Intelligent Diagnosis and Evaluation Technology for Drainage Pipeline Network Operational Efficiency (2022-K-165).

Data Availability Statement

The original contributions presented in this study are included in this article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Junsong Xu and Yang Chen were employed by the company The Second Construction Co., Ltd. of CSCEC 7th Division. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhu, Y.; Tang, H. Automatic Damage Detection and Diagnosis for Hydraulic Structures Using Drones and Artificial Intelligence Techniques. Remote Sens. 2023, 15, 615. [Google Scholar] [CrossRef]
  2. Ren, Q.; Li, M.; Shen, Y.; Zhang, Y.; Bai, S. Pixel-level shape segmentation and feature quantification of hydraulic concrete cracks based on digital images. J. Hydroelectr. Eng. 2021, 40, 234. [Google Scholar] [CrossRef]
  3. Gaugel, S.; Wu, B.; Anand, A.; Reichert, M. Supervised Time Series Segmentation as Enabler of Multi-Phased Time Series Classification: A Study on Hydraulic End-of-Line Testing. In Proceedings of the 2023 IEEE 21st International Conference on Industrial Informatics (INDIN), Lemgo, Germany, 18–20 July 2023. [Google Scholar]
  4. Freedman, D.; Zhang, T. Interactive graph cut based segmentation with shape priors. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  5. Chuang, K.S.; Tzeng, H.L.; Chen, S.; Wu, J.; Chen, T.J. Fuzzy c-means clustering with spatial information for image segmentation. Comput. Med. Imaging Graph. 2006, 30, 9–15. [Google Scholar] [CrossRef] [PubMed]
  6. Min, H. Application of an improved Otsu algorithm in image segmentation. J. Electron. Meas. Instrum. 2010, 24, 443–449. [Google Scholar]
  7. Lu, J.; Lin, H.; Pan, Z. Adaptive Region Growing Algorithm in Medical Images Segmentation. J. Comput. Aided Des. Comput. Graph. 2005, 17, 2168–2173. [Google Scholar]
  8. Lecun, Y.; Bottou, L. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  10. Iglovikov, V.; Shvets, A. TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation. arXiv 2018, arXiv:1801.05746. [Google Scholar]
  11. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 39, pp. 640–651. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; Bai, X. Multi-Oriented Text Detection with Fully Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  13. Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [PubMed]
  14. He, J.; Chen, J.; Xu, H.; Yu, Y. SonarNet: Hybrid CNN-Transformer-HOG Framework and Multifeature Fusion Mechanism for Forward-Looking Sonar Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4203217. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–7 December 2017. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations 2021, Vienna, Austria, 4 May 2021. [Google Scholar]
  17. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  19. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  20. Hu, M.; Feng, J.; Hua, J.; Lai, B.; Huang, J.; Gong, X.; Hua, X. Online Convolutional Re-parameterization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  21. Zhao, Z.; Hu, D.; Wang, H.; Yu, X. Convolutional Transformer Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6009005. [Google Scholar] [CrossRef]
  22. Chen, G.; Mao, Z.; Wang, K.; Shen, J. HTDet: A Hybrid Transformer-Based Approach for Underwater Small Object Detection. Remote Sens. 2023, 15, 1076. [Google Scholar] [CrossRef]
  23. Meng, X.; Yang, Y.; Wang, L.; Wang, T.; Li, R.; Zhang, C. Class-guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6517505. [Google Scholar] [CrossRef]
  24. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  25. Ni, Z.L.; Bian, G.B.; Zhou, X.H.; Hou, Z.G.; Li, Z. RAUNet: Residual Attention U-Net for Semantic Segmentation of Cataract Surgical Instruments. In Proceedings of the 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, 12–15 December 2019. [Google Scholar]
  26. Li, Y.; Yan, B.; Hou, J.; Bai, B.; Huang, X.; Xu, C.; Fang, L. UNet based on dynamic convolution decomposition and triplet attention. Sci. Rep. 2024, 14, 271. [Google Scholar] [CrossRef] [PubMed]
  27. Zhang, X.; Yang, S.; Jiang, Y.; Chen, Y.; Sun, F. FAFS-UNet: Redesigning skip connections in UNet with feature aggregation and feature selection. Comput. Biol. Med. 2024, 170, 108009. [Google Scholar] [CrossRef] [PubMed]
  28. Li, X.; Fang, Z.; Zhao, R.; Mo, H. Brain Tumor MRI Segmentation Method Based on Improved Res-UNet. IEEE J. Radio Freq. Identif. 2024, 1. [Google Scholar] [CrossRef]
  29. Wang, H.; Qiu, S.; Zhang, B.; Xiao, L. Multilevel Attention Unet Segmentation Algorithm for Lung Cancer Based on CT Images. Comput. Mater. Contin. 2024, 78, 1569–1589. [Google Scholar] [CrossRef]
  30. Yu, J.; He, X.; Qin, J.; Zhang, W.; Xiang, J.; Zhao, W. Trans-UNeter: A new Decoder of TransUNet for Medical Image Segmentation. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 5–8 December 2023. [Google Scholar]
Figure 1. Swin Transformer compared to ViT. (a) Swin Transformer’s window partitioning; (b) ViT’s window partitioning.
Figure 2. Swin-UNet structure.
Figure 3. BRA structure operation flow.
Figure 4. Online convolutional re-parameterization. (a) Convolutional input and output under normal circumstances; (b) convolutional input and output after module linearization; (c) convolutional input and output after module fusion.
Figure 5. The BS-UNet network structure, where the BRA attention mechanism sits between Patch Embedding and the first Swin Transformer block, while the re-parameterized convolution sits at the final output end.
Figure 6. Experimental site setup and data collection process.
Figure 7. Horizontal and vertical beam opening angles and detection range of the M1200d.
Figure 8. Assembly diagram of the sonar and gimbal.
Figure 9. Partial two-dimensional sonar images containing crack defects.
Figure 10. Ablation experiment visualization results. Green boxes indicate areas where cracks are located, and red circles indicate missing or excess content in the segmentation results. (a) Original image; (b) segmentation standard after image annotation; (c) segmentation results of Swin-UNet; (d) segmentation results of Swin-UNet + BRA; (e) segmentation results of the proposed algorithm.
Figure 11. Comparison of experimental visualization results. Green boxes indicate areas where cracks are located, and red circles indicate missing or excess content in the segmentation results. (a) Original image; (b) segmentation standard after image annotation; (c) segmentation results of Res-UNet; (d) segmentation results of Att-UNet; (e) segmentation results of Trans-UNet; (f) segmentation results of Swin-UNet; (g) segmentation results of the proposed algorithm.
Figure 12. Loss function graph and Dice graph. (a) Comparison of loss function curves for the Att-UNet, Swin-UNet, and BS-UNet algorithms; (b) comparison of Dice curves for the Att-UNet, Swin-UNet, and BS-UNet algorithms.
Table 1. Performance indices of the ablation experiment models.

Model             Precision   IoU     Recall   Dice
Swin-UNet         0.764       0.610   0.752    0.758
Swin-UNet + BRA   0.771       0.618   0.757    0.764
BS-UNet           0.794       0.654   0.787    0.790

Table 2. Performance indices of the network models in the comparative experiments.

Model        Precision   IoU     Recall   Dice
Res-UNet     0.645       0.447   0.592    0.618
Att-UNet     0.747       0.560   0.693    0.718
Trans-UNet   0.768       0.612   0.752    0.760
Swin-UNet    0.764       0.610   0.752    0.758
BS-UNet      0.794       0.654   0.787    0.790

Table 3. Comparison of model computational efficiency.

Model        Inference Time   FLOPs      Params
Res-UNet     0.0087 s         12.108 G   4.064 M
Att-UNet     0.0596 s         51.018 G   34.897 M
Trans-UNet   0.0672 s         24.065 G   62.094 M
Swin-UNet    0.0232 s         5.916 G    27.145 M
BS-UNet      0.0274 s         6.140 G    27.186 M
Back to TopTop