Article

Underwater Target Detection Using Side-Scan Sonar Images Based on Upsampling and Downsampling

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3874; https://doi.org/10.3390/electronics13193874
Submission received: 19 August 2024 / Revised: 24 September 2024 / Accepted: 27 September 2024 / Published: 30 September 2024

Abstract

Side-scan sonar (SSS) images present unique challenges to computer vision due to their lower resolution, smaller targets, and fewer features. Although mainstream backbone networks have shown promising results on traditional vision tasks, they use standard convolution to reduce the dimensionality of feature maps, which can cause information loss for small targets and degrade performance on SSS images. To address this problem, we propose a new underwater target detection model based on upsampling and downsampling, built on the YOLOv8 network. Firstly, we introduce a new general downsampling module called shallow robust feature downsampling (SRFD) and a receptive field convolution (RFCAConv) in the backbone network, so that multiple feature maps extracted by different downsampling techniques can be fused into a more robust feature map with a complementary set of features. Additionally, an ultra-lightweight and efficient dynamic upsampling module (DySample) is introduced to improve the accuracy of the feature pyramid network (FPN) in fusing different levels of features. On an underwater shipwreck dataset, the mAP50 of our improved model increased by 4.4% compared to the baseline model.

1. Introduction

Currently, the marine economy and shipping industry are developing rapidly, and human activities utilizing and developing the ocean are becoming increasingly frequent. Along with this, the incidence of maritime accidents has been rising yearly, with more and more ships sinking after encountering unexpected situations. Owing to the advantages of sound waves, such as their long propagation distance underwater, acoustic-based target detection methods are more widely used than optical-based ones. Among acoustic methods, side-scan sonar (SSS) provides larger seabed coverage and higher-resolution images than forward-looking sonar, making SSS image target detection a current research hotspot. In the past, underwater target detection mainly relied on manual screening, and the detection results were largely affected by personal experience and cognition. Manual detection is also very inefficient and cannot meet the large-area intelligent detection and speed requirements of underwater unmanned vehicles. To address this problem, researchers have studied target detection methods based on SSS images.
Einsidler et al. studied the ability of the deep learning target detection model YOLO to detect abnormal objects on the seabed in SSS images; the experimental results showed that YOLO can achieve high confidence in anomaly detection [1]. Considering that the existing shipwreck SSS image dataset is small, Tang et al. proposed a YOLOv3 model based on transfer learning to identify seabed shipwrecks and achieved an mAP of 89.49% [2]. Considering the sparsity and barrenness of features in SSS images, Yu et al. proposed the TR-YOLOv5s automatic target detection algorithm, which integrates transformer, attention, and downsampling modules to meet accuracy and efficiency requirements [3]. Considering the imaging mechanism of SSS, Li et al., based on YOLOv5, divided the strip image formed by SSS in a short time into many sub-images, which are sent to a retrained detection model to extract subsea pipelines [4]. Li et al. proposed a YOLO-slimming model based on a convolutional network by introducing an efficient feature encoder, and applied sparsity regularization to speed up inference [5]. Xi et al. proposed a joint ray-model and sonar-equation method for simulating SSS image samples; their results show that training on simulated images can also achieve good results [6]. Li et al. used a method to quickly determine whether suspicious targets are present, and then applied MA-YOLOv7, which incorporates multi-scale information fusion and attention, to detect the screened images [7].
Currently, mainstream target detection models, such as the YOLO series, use standard convolution to reduce the dimension of feature maps while extracting key information from the image in the backbone network. However, standard convolution can exacerbate the difficulties of SSS image target detection, because it causes the loss of information for small targets in the SSS image while reducing the dimension of the feature map. Additionally, a feature pyramid network (FPN) is usually established in target detection models to fuse information at different feature levels and improve detection results. How to effectively fuse these different levels of features in the FPN, maintaining semantic information without losing detailed information, is a challenging problem, and at present few scholars have carried out research in this area.
In summary, to solve the problems that still exist in underwater target detection using SSS images, this paper proposes a new model based on upsampling and downsampling. The network architecture is shown in Figure 1. The main contributions of this paper are as follows:
(1)
The shallow robust feature downsampling (SRFD) module is innovatively introduced; it is specially designed for the downsampling of shallow features, aiming to obtain optimal shallow feature maps.
(2)
The receptive field convolution (RFCAConv) is introduced for downsampling in stages 2, 3, and 4, to ensure that semantic information is maximally retained while reducing the data dimension.
(3)
The dynamic upsampling module (DySample) is introduced, replacing bilinear interpolation upsampling and improving the accuracy of the FPN in fusing different levels of features without significantly increasing computational complexity or training time.
(4)
The impact of upsampling and downsampling on the accuracy of underwater target detection models is discussed.
The rest of this paper is organized as follows. Section 2 describes the underwater target detection model. Section 3 presents the experimental verification and results. Section 4 discusses the key points of this paper and future research. Section 5 presents conclusions.

2. Method

To address the shortcomings of current mainstream target detection models in the backbone network and FPN, and to explore the impact of upsampling and downsampling on underwater target detection in SSS images, this paper introduces SRFD and RFCAConv to fuse multiple feature maps extracted by different downsampling techniques, creating feature maps with complementary feature sets. By capturing contextual information and abstract features in SSS images, the backbone network extracts feature maps with rich semantic information to serve the subsequent feature pyramid network.
In addition, DySample is introduced to improve the accuracy of feature fusion at different levels in the FPN. The network architecture is shown in Figure 1, in which each deep downsampling module (DDM) contains one RFCAConv and three C2f modules.

2.1. Backbone Network

The complex underwater acoustic channel environment, underwater bubbles, water masses with uneven temperatures, and other factors affect the propagation speed of sound waves underwater, resulting in distortion of sonar images and weak target boundary information. Secondary sound waves generated by scatterers such as marine organisms and uneven seabeds form reverberation noise, generating speckle noise in the SSS image.
The original SSS images contain both a large amount of critical information and pixel-level noise. To optimize the detection effectiveness of the model, strategies must be adopted to selectively retain salient information and eliminate noise. Currently, most backbone networks use a standard convolution for downsampling the original image. However, this approach often does not work well for SSS image detection tasks.
This paper innovatively introduces the SRFD module [8] to solve the above problems. The module is specially designed for the downsampling of shallow features, aiming to obtain optimal shallow feature maps. This feature map is further processed in stages 2, 3, and 4 of the backbone network. Given that the feature maps in subsequent stages contain rich semantic information, this paper introduces RFCAConv [9] for downsampling in stages 2, 3, and 4, ensuring that semantic information is maximally retained while the data dimension is reduced.

2.1.1. SRFD Module

This section will elaborate on the functions and processes of the SRFD module. The processing flow of SRFD is shown in Figure 2.
In the SRFD module, a convolution with a kernel size of 7 × 7 and a stride of 1 is first applied to the SSS image for preliminary processing. After this initial elimination of redundant information, the resulting feature map $x$ is replicated into two copies, $y_1$ and $y_2$. For $y_1$, a cut operation is first employed to decompose adjacent pixels into four sub-feature maps: $c_1$, $c_2$, $c_3$, and $c_4$. Subsequently, these four sub-feature maps are merged into a new feature map $y_{11}$ using the concat operation. To reduce the number of channels in $y_{11}$ and enhance its robustness, a convolution with a kernel size of 1 × 1 and a stride of 1 is used to halve its channel count. Finally, an optimized feature map $y_{12}$ is obtained through Batch Normalization (BN). To simplify the description, the series of processing steps starting from the cut operation is defined as the fusion function, as shown in Equation (1). The series of operations to obtain $y_{12}$ from $y_1$ is named cut-slice downsampling ($D_{cut}$), as shown in Equation (2) and Figure 3.
$\mathrm{fusion} = \mathrm{BN}(\mathrm{Conv}(\mathrm{Concat}(x, y, z)))$ (1)
$D_{cut} = \mathrm{fusion}(c_1, c_2, c_3, c_4)$ (2)
where $\mathrm{Concat}$ represents the concatenation operation, which merges multiple feature maps along the channel dimension, and $\mathrm{Conv}$ represents the convolution operation.
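To make the $D_{cut}$ branch concrete, the following is a minimal PyTorch sketch under stated assumptions: the channel sizes, the slicing order, and the module name are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class CutDownsample(nn.Module):
    """Sketch of cut-slice downsampling (D_cut): adjacent pixels are split
    into four sub-feature maps, concatenated along channels, then fused by
    a 1x1 convolution and BatchNorm (the fusion function of Equation (1))."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv halves the channel count of the 4C-channel concatenation
        self.fusion = nn.Sequential(
            nn.Conv2d(4 * channels, 2 * channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(2 * channels),
        )

    def forward(self, x):
        # cut: decompose adjacent pixels into four spatially strided sub-maps
        c1 = x[:, :, 0::2, 0::2]
        c2 = x[:, :, 0::2, 1::2]
        c3 = x[:, :, 1::2, 0::2]
        c4 = x[:, :, 1::2, 1::2]
        return self.fusion(torch.cat([c1, c2, c3, c4], dim=1))
```

Because every pixel is routed into one of the four sub-maps, the 2× spatial reduction discards no pixels, unlike a stride-2 convolution.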
After processing $y_1$ to obtain $y_{12}$, we turn to the processing of $y_2$. Firstly, group convolution (GroupConv) is used to process $y_2$, producing the intermediate feature map $y_{21}$. Next, to further downsample and integrate local feature information, depthwise separable convolution (DWConv) is applied to $y_{21}$. The series of operations to obtain $y_{22}$ from $y_2$ is named $D_{conv}$, and its mathematical expression is shown in Equation (3).
$D_{conv} = \mathrm{BN}(\mathrm{DWConv}(\mathrm{GroupConv}(x)))$ (3)
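Similarly, a minimal sketch of the $D_{conv}$ branch of Equation (3); the 3 × 3 kernels, the group counts, and the use of a stride-2 depthwise convolution (standing in for DWConv) for the spatial reduction are assumptions.

```python
import torch.nn as nn

class ConvDownsample(nn.Module):
    """Sketch of D_conv (Equation (3)): group convolution, then a stride-2
    depthwise convolution, then BatchNorm."""
    def __init__(self, channels):
        super().__init__()
        self.group_conv = nn.Conv2d(channels, 2 * channels, kernel_size=3,
                                    stride=1, padding=1, groups=channels)
        self.dw_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3,
                                 stride=2, padding=1, groups=2 * channels)
        self.bn = nn.BatchNorm2d(2 * channels)

    def forward(self, x):
        return self.bn(self.dw_conv(self.group_conv(x)))
```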
After obtaining the feature maps $y_{12}$ and $y_{22}$, the concat operation is used to concatenate them, forming a more comprehensive feature map $y_3$. Subsequently, a convolution with a kernel size of 1 × 1 and a stride of 1 is applied to $y_3$. Finally, BN is performed to obtain the feature map $X$. The specific process is shown in Equation (4).
$y_{12} = D_{cut}(y_1), \quad y_{22} = D_{conv}(y_2), \quad X = \mathrm{fusion}(y_{12}, y_{22})$ (4)
The series of operations described above, from $x$ to $X$, achieves 2× downsampling from the original size. Next, further downsampling from 2× to 4× is performed. The $D_{conv}$ operation, as defined in Equation (3), is first performed on $X$ to obtain the feature map $X_1$. Subsequently, a group convolution operation is performed on $X$, increasing the number of channels from the $C/2$ of $X$ to $C$; then, the maxpooling operation is applied, yielding the feature map $X_2$. This process is named $D_{max}$, as shown in Equation (5). Through this approach, efficient downsampling of the feature map and expansion of the channel count are achieved while reducing noise interference, serving subsequent feature learning and detection tasks.
$D_{max} = \mathrm{BN}(\mathrm{maxpooling}(\mathrm{GroupConv}(X)))$ (5)
Finally, the $D_{cut}$ operation, as described in Equation (2), is applied to the feature map $X$ for downsampling, resulting in the feature map $X_3$. After completing the above operations, the feature maps $X_1$, $X_2$, and $X_3$ are concatenated using the concat operation, resulting in a combined feature map $Y_1$ with $3C$ channels. To obtain the output feature map $Y$, a convolution with a kernel size of 1 × 1 and a stride of 1 is applied to $Y_1$.
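For completeness, a minimal sketch of the $D_{max}$ branch of Equation (5); the kernel size, the grouping, and the 2 × 2 pooling window are assumptions.

```python
import torch.nn as nn

class MaxDownsample(nn.Module):
    """Sketch of D_max (Equation (5)): a group convolution that doubles the
    channel count, followed by 2x2 max pooling and BatchNorm."""
    def __init__(self, channels):
        super().__init__()
        self.group_conv = nn.Conv2d(channels, 2 * channels, kernel_size=3,
                                    stride=1, padding=1, groups=channels)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(2 * channels)

    def forward(self, x):
        return self.bn(self.pool(self.group_conv(x)))
```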

2.1.2. RFCAConv

The downsampling in the 2nd, 3rd, and 4th stages of current mainstream backbone networks for target detection is usually implemented by standard convolution, which reduces the parameters of the model through parameter sharing. However, this approach neglects the differential information between different locations. Meanwhile, standard convolution does not take into account the importance of each feature during processing, which affects the effectiveness of feature extraction in the backbone network and restricts the performance of the model.
Based on the above analysis, this paper further considers the spatial features of the receptive field, in addition to introducing the SRFD module to extract shallow features. For this purpose, RFCAConv is introduced to extract deep features; its structure is shown in Figure 4.
In RFCAConv, a sequence of operations is first applied to the shallow feature map $Y$ obtained from the SRFD module, resulting in an intermediate feature map $Y_1$. This sequence specifically includes the following steps: firstly, a grouped convolution is performed, which not only enhances feature extraction capability but also reduces the number of parameters; subsequently, a BN operation is carried out; finally, a ReLU activation function is applied to increase the nonlinearity of the model.
After the aforementioned sequence of operations, the feature map's dimensions change from the original $C \times H \times W$ to $Ck^2 \times H \times W$, where $C$ is the number of channels, $H$ and $W$ are the height and width of the feature map, and $k$ is the size of the convolution kernel. A reshape operation then transforms the feature map from $Ck^2 \times H \times W$ to $C \times kH \times kW$, yielding a new feature map $Y_2$ and facilitating more effective feature processing in subsequent steps. After obtaining $Y_2$, a series of operations following the Coordinate Attention method described in [10] is performed to obtain the feature map $Z$, which is the final output and is used as a deep feature in subsequent network layers.
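The expansion-and-reshape step can be sketched as follows; treating the $Ck^2 \times H \times W \to C \times kH \times kW$ rearrangement as a pixel shuffle, along with the grouped convolution settings and the illustrative sizes, are our assumptions, and the Coordinate Attention stage [10] is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, k = 64, 3                   # illustrative channel count and kernel size
y = torch.randn(1, C, 40, 40)  # shallow feature map from the SRFD module

# grouped conv + BN + ReLU: expands each channel into k*k receptive-field features
expand = nn.Sequential(
    nn.Conv2d(C, C * k * k, kernel_size=k, padding=k // 2, groups=C),
    nn.BatchNorm2d(C * k * k),
    nn.ReLU(inplace=True),
)
y1 = expand(y)               # shape (1, C*k^2, H, W)
y2 = F.pixel_shuffle(y1, k)  # shape (1, C, kH, kW): each k x k field laid out spatially
```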
Through this series of carefully designed operations, RFCAConv can more effectively extract and utilize feature information, thereby enhancing the overall network performance.

2.1.3. Improved Backbone Network

In common target detection models, the downsampling module is typically composed of standard convolutions. However, for the task of underwater target detection in SSS images with lower resolution and smaller targets, this downsampling approach is not suitable. The reason is that standard convolutions may change or discard crucial feature information when processing such images, affecting detection accuracy.
To address this issue, this paper introduces the SRFD module in the shallow feature downsampling layer and applies the RFCAConv module in the deep feature downsampling layers. Specifically, the SRFD module, through its unique design, can more effectively remove redundant information and noise from the original image, capturing and preserving detailed information in shallow features, which is crucial for detecting small targets in low-resolution images. The RFCAConv module, with its receptive field convolution characteristics, can better preserve and utilize key features during deep feature downsampling. With these improvements, this paper constructs a new backbone network, as shown in Figure 1.

2.2. FPN Network

In convolutional neural networks, as the number of network layers increases, the size of the feature maps gradually decreases, while semantic information becomes more abstract. Maintaining semantic information without losing detailed information, and effectively fusing these different levels of features, is a challenging task. In FPN, low-resolution features are fused with high-resolution features after upsampling, so improving the feature upsampling module in FPN is one of the keys to enhancing its performance.
In response to the above problems, DySample [11] is introduced. This module replaces bilinear interpolation upsampling and improves the accuracy of the FPN in fusing different levels of features without significantly increasing computational complexity or training time.

2.2.1. Dysample

This section introduces the DySample module in detail. Given the significant computational burden imposed by dynamic convolution, DySample returns to the essence of upsampling by adopting a more direct and efficient approach known as point sampling.
DySample initially proposes a concise design scheme: point-wise offsets are generated through linear projection, and the grid_sample function in PyTorch is utilized to resample point values based on these offsets. To further optimize the upsampling process, a novel upsampler is constructed by controlling the initial sampling positions, adjusting the movement scope of the offsets, and dividing the upsampling process into several independent groups, as shown in Figure 5.
Given an input feature $X$ of size $C \times H_1 \times W_1$ and a sampling set $S$ of size $2 \times H_2 \times W_2$, the grid_sample function uses the position information in $S$ to resample the hypothetically bilinear-interpolated $X$ into $X'$ of size $C \times H_2 \times W_2$, as shown in Equation (6).
$X' = \mathrm{grid\_sample}(X, S)$ (6)
For the generation of the sampling set $S$, we first determine an upsampling scale factor $s$ and an input feature map $X$. Then, a linear layer with $C$ input channels and $2s^2$ output channels is used to process $X$, generating an offset map $O$ with dimensions $2s^2 \times H \times W$. For further processing, pixel shuffle is applied to reshape $O$, transforming its size to $2 \times sH \times sW$. Finally, the sampling set $S$ is obtained by element-wise addition of the offsets $O$ and the original sampling grid $G$, as shown in Equation (7). The flowchart is illustrated in Figure 6.
$O = 0.25 \, \mathrm{linear}(X), \quad S = G + O$ (7)
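The following is a minimal, single-group sketch of this point-sampling upsampler, assuming a 1 × 1 convolution as the linear layer; the grouping and initialization details of [11] are omitted, and the conversion of pixel offsets to normalized grid_sample coordinates is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Minimal single-group sketch of DySample-style point sampling [11]."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # linear projection (1x1 conv) producing 2*s^2 offset channels
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        s = self.scale
        # O = 0.25 * linear(X), pixel-shuffled from (2s^2, H, W) to (2, sH, sW)
        o = 0.25 * F.pixel_shuffle(self.offset(x), s)
        # original sampling grid G at output-pixel centers, normalized to [-1, 1]
        ys = torch.linspace(-1 + 1 / (s * h), 1 - 1 / (s * h), s * h, device=x.device)
        xs = torch.linspace(-1 + 1 / (s * w), 1 - 1 / (s * w), s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        g = torch.stack([gx, gy]).unsqueeze(0)                # (1, 2, sH, sW)
        # S = G + O, with pixel offsets rescaled to normalized coordinates
        px2norm = torch.tensor([2.0 / w, 2.0 / h], device=x.device).view(1, 2, 1, 1)
        sgrid = (g + o * px2norm).permute(0, 2, 3, 1)         # (N, sH, sW, 2)
        return F.grid_sample(x, sgrid, mode="bilinear", align_corners=False)

# e.g., upsampling a (1, 64, 20, 20) feature map to (1, 64, 40, 40)
x2 = DySampleSketch(64)(torch.randn(1, 64, 20, 20))
```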

2.2.2. Improved FPN Network

Traditional upsampling methods, such as nearest-neighbor and bilinear interpolation, are efficient but limited in flexibility. While some learnable upsamplers, like deconvolution and pixel shuffling, enhance flexibility, they may introduce checkerboard artifacts or be ill-suited to high-level tasks.
In recent years, dynamic upsamplers have also shown tremendous potential. For instance, CARAFE achieves feature upsampling through dynamic convolution by generating content-aware upsampling kernels. Both FADE and SAPA combine high-resolution guidance features with low-resolution input features to produce dynamic kernels, so that the upsampling process is guided by higher-resolution structure. However, these upsamplers have complex structures, require significant computation, necessitate customized CUDA implementations, and incur longer inference times. Specifically, FADE and SAPA not only increase the computational load but also limit their application scenarios by introducing high-resolution guidance features. In response to these upsampling problems, this paper introduces DySample to construct a new FPN, as illustrated in Figure 1.

3. Experimental Verification and Results

To demonstrate the effectiveness of our model in underwater target detection, experiments were conducted on an SSS image underwater shipwreck dataset. The experimental results show that under the same initial conditions, the mAP50 of our model improved by 4.4% compared with the baseline model (YOLOv8) [12].

3.1. Training Dataset Preparation

The experimental datasets used in this paper come from a domestic mapping dataset, the SeabedObjects-KLSG dataset [13], the SCTD dataset [14], the AI4Shipwrecks dataset [15], and SSS images published on the Internet. A total of 850 SSS images were collected. The dataset was split into training, validation, and test sets in a ratio of 7:2:1. Some typical SSS images are shown in Figure 7.
Since SSS image data are scarce and valuable, we have not yet obtained a large amount of data other than shipwrecks. Therefore, we cannot discuss in depth the generalization of the model to other types of underwater targets. We are planning to collect data on three further types of objects: underwater pipes, steel balls, and tires. Once we have enough data, we will test the performance of the model more extensively.

3.2. Experimental Environment

The experiments in this paper were conducted with Python 3.8 and PyTorch 2.0.0, in a local environment, and trained on a single GPU; the specific configuration is shown in Table 1.

3.3. Implementation Details

We adjusted the initial learning rate in the range [0.01, 0.1], the batch size and number of workers in multiples of 4, and the momentum in the range [0.80, 0.99], and compared the stochastic gradient descent (SGD) and AdamW optimizers. The parameters that achieved the best results are listed in Table 2.
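For reproducibility, the following is a minimal sketch of a training run with the Table 2 parameters, assuming the Ultralytics YOLOv8 training interface [12]; the model and dataset YAML file names are hypothetical placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolov8-srfd-rfca-dysample.yaml")  # hypothetical model config
model.train(
    data="sss_shipwreck.yaml",  # hypothetical dataset description file
    epochs=500,
    batch=16,
    workers=16,
    optimizer="SGD",
    lr0=0.01,                   # initial learning rate
    momentum=0.938,
    weight_decay=0.0005,
    warmup_epochs=5,
    warmup_momentum=0.8,
    mosaic=0.9,                 # image mosaic probability
)
```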

3.4. Ablation Studies

To objectively demonstrate the efficacy and importance of each introduced module, this paper conducted ablation experiments on the SSS image dataset. The experimental results are shown in Table 3. The ablation experiments cover the following configurations: (1) only SRFD is included; (2) only RFCAConv is included; (3) only DySample is included; (4) both SRFD and RFCAConv are included; (5) all modules in this paper are included.
The “Params” column in Table 3 denotes the number of parameters of the model, in units of M (1 × 10^6). “FLOPs” refers to floating-point operations, in units of G (1 × 10^9).
According to the results of the ablation experiments, it is clear that the upsampling and downsampling modules introduced in this paper are effective.

3.5. Detector Evaluation

This paper evaluates the performance of models using precision (P), recall (R), and mean average precision (mAP) on the test dataset, as shown in Equations (8) and (9).
$P = \dfrac{TP}{TP + FP}, \quad R = \dfrac{TP}{TP + FN}$ (8)
$AP = \int_{0}^{1} P(R) \, dR, \quad mAP = \dfrac{1}{m} \sum_{i=1}^{m} AP_i$ (9)
where $TP$ denotes the case where both the predicted and actual samples are positive, $FP$ denotes the case where the predicted sample is positive but the actual sample is negative, and $FN$ denotes the case where the predicted sample is negative but the actual sample is positive. mAP50 and mAP50-95 represent the $mAP$ score when the IoU threshold is 0.5 and when IoU = 0.5, …, 0.95, respectively.
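As a small illustration of Equations (8) and (9), the following sketch computes precision, recall, and an AP approximation by numerical integration of the PR curve; the trapezoidal rule is our choice of approximation, not necessarily the one used by the evaluation toolkit.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from Equation (8)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP from Equation (9): area under the PR curve, approximated by the
    trapezoidal rule; recall values are assumed sorted in ascending order."""
    return float(np.trapz(precision, recall))

p, r = precision_recall(tp=80, fp=10, fn=20)  # p ~= 0.889, r = 0.8
```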

3.6. Comparison with Existing Methods

Figure 8 shows the Precision–Recall (PR) curves of our model and the baseline model.
Since the papers cited in the introduction do not provide open-source code, comparing with them directly is time-consuming and difficult. However, the baseline models used in those papers are well known. Therefore, in this section, the accuracy of our model is compared with that of the baseline models used in these papers, along with the computational load. Table 4 summarizes all detection results and computational loads on the SSS image shipwreck dataset.
From Table 4, the mAP50 of the improved model in this paper is 4.4% higher than that of the baseline model.
Figure 9 shows some visualization results of the baseline model and our model. In Figure 9, the baseline model mistakenly detects an icon as a shipwreck, and multiple detection boxes appear when detecting a shipwreck covered by sand or mud on the seabed. In addition, the confidence of the detected shipwrecks is not high. In contrast, our model produces better detection results.

4. Discussion

For the SSS image target detection task, extracting key feature information from the rough original image is very important, and using efficient downsampling and upsampling methods is one effective way to do so. However, these methods usually bring a larger amount of computation, resulting in slower detection, especially when deployed on small or edge devices. For such devices, therefore, a lighter network architecture or module must be designed on top of more effective upsampling and downsampling methods.
The validation and deployment of the current model were conducted on a desktop computer with mains power, without considering power consumption. For underwater edge devices and small devices, however, power consumption must also be considered, because the problem of underwater energy supply cannot be ignored; it is as important as the detection speed and accuracy of the model.
It is well known that when the number of training samples is very limited, the model may not be able to learn the complex features in the images, resulting in poor performance. Increasing the number of training samples can help the model learn more useful features, thereby alleviating underfitting. As the number of training samples increases, the model has more opportunities to learn the general laws and characteristics of the data, improving its predictive ability on unseen data, that is, its generalization ability. Conversely, when the number of training samples is too small or the model complexity is too high, the model may begin to “remember” the noise or outliers in the training data instead of learning its general laws.
For the current SSS image underwater target detection task, in addition to the upsampling and downsampling studied in this paper, the limited size of available datasets is an important factor affecting detection performance. Although this paper has constructed the largest SSS image shipwreck dataset to date, it is still not enough; this dataset is orders of magnitude smaller than VOC, COCO, and ImageNet. In response to the small-sample problem, some targeted solutions are also needed.

5. Conclusions

In this paper, we proposed an improved model. First, SRFD and RFCAConv are introduced in the backbone network to fuse multiple feature maps extracted by different downsampling techniques, creating a more robust feature map with a complementary set of features. In addition, DySample is introduced to improve the accuracy of the FPN in fusing different levels of features. Compared with the baseline model, the mAP50 of our model increased from 83.9% to 88.3%, and the FLOPs decreased from 8.1 G to 7.1 G.
Meanwhile, the experimental results also show that although traditional convolutional downsampling reduces the dimension of the feature map, it also causes problems such as the loss of key features, which is fatal to the SSS image underwater target detection task. Furthermore, the most commonly used upsamplers, transposed convolution and bilinear interpolation, follow fixed rules to interpolate upsampled values and fail to produce smooth image transitions. This may cause the upsampled image to appear jagged, blurry, or noticeably pixelated, and details and texture information may be lost during upsampling, seriously affecting the target detection model. Introducing a more robust feature downsampling module and a more efficient upsampler can reduce these effects and improve underwater target detection. In addition, computational cost must also be considered to achieve a balance between accuracy and computation.
For the few-shot problem, there are some tentative solutions, such as transfer learning, using remote sensing images to generate SSS images, zero-shot target detection, and domain adaptation operations using forward-looking sonar images or remote sensing images.
In the future, we will also explore which method is more effective for this problem. Based on the collected SSS image datasets and public datasets, we will explore the target detection algorithm on small devices and edge devices, which is also an important engineering demand at present.

Author Contributions

Conceptualization, Y.C. and R.T.; methodology, Y.C.; software, R.T.; validation, R.T., J.G. and S.H.; formal analysis, H.H.; investigation, Y.C.; resources, Y.C.; data curation, R.T. and H.H.; writing—original draft preparation, R.T.; writing—review and editing, Y.C. and J.G.; visualization, H.H.; supervision, J.G.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China under Grants of 52471347 and 52102469, in part by the Double First-Class Foundation under Grants of 0206022GH0202, and in part by the Scientific Research and Technology Development Major Project in Liuzhou 2022AAA0102.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Einsidler, D.; Dhanak, M.; Beaujean, P.-P. A deep learning approach to target recognition in side-scan sonar imagery. In Proceedings of the OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, USA, 22–25 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4.
  2. Yulin, T.; Jin, S.; Bian, G.; Zhang, Y. Shipwreck target recognition in side-scan sonar images by improved YOLOv3 model based on transfer learning. IEEE Access 2020, 8, 173450–173460.
  3. Yu, Y.; Zhao, J.; Gong, Q.; Huang, C.; Zheng, G.; Ma, J. Real-time underwater maritime object detection in side-scan sonar images based on Transformer-YOLOv5. Remote Sens. 2021, 13, 3555.
  4. Li, Y.; Wu, M.; Guo, J.; Huang, Y. A strategy of subsea pipeline identification with sidescan sonar based on YOLOv5 model. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 12–15 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 500–505.
  5. Li, Z.; Chen, D.; Yip, T.L.; Zhang, J. Sparsity regularization-based real-time target recognition for side scan sonar with embedded GPU. J. Mar. Sci. Eng. 2023, 11, 487.
  6. Xi, Z.; Zhao, J.; Zhu, W. Side-scan sonar image simulation considering imaging mechanism and marine environment for zero-shot shipwreck detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
  7. Li, L.; Li, Y.; Yue, C.; Xu, G.; Wang, H.; Feng, X. Real-time underwater target detection for AUV using side scan sonar images based on deep learning. Appl. Ocean Res. 2023, 138, 103630.
  8. Lu, W.; Chen, S.-B.; Tang, J.; Ding, C.H.; Luo, B. A robust feature downsampling module for remote-sensing visual tasks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12.
  9. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198.
  10. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722.
  11. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 6027–6037.
  12. Ultralytics. YOLOv8. 2024. Available online: https://docs.ultralytics.com/yolov8 (accessed on 23 September 2024).
  13. Huo, G.; Wu, Z.; Li, J. Underwater object classification in sidescan sonar images using deep transfer learning and semisynthetic training data. IEEE Access 2020, 8, 47407–47418.
  14. Zhang, P.; Tang, J.; Zhong, H.; Ning, M.; Liu, D.; Wu, K. Self-trained target detection of radar and sonar images using automatic deep learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
  15. Sethuraman, A.V.; Sheppard, A.; Bagoren, O.; Pinnow, C.; Anderson, J.; Havens, T.C.; Skinner, K.A. Machine learning for shipwreck segmentation from side scan sonar imagery: Dataset and benchmark. Int. J. Robot. Res. 2024.
Figure 1. The network architecture of this paper.
Figure 2. The structure diagram of SRFD.
Figure 3. The schematic diagram of cut-slice downsampling.
Figure 4. The structure diagram of the RFCAConv.
Figure 5. The structure diagram of DySample.
Figure 6. The core components in the DySample.
Figure 7. Some typical images in the training dataset. (a) The three images are characterized by a small target, a blurred target, and a target that can only be identified by its shadow, in turn; (b) the four images are characterized by a target mostly buried by sand, multiple targets, a weak target in the middle of the waterfall, and a target cut in half by the waterfall, in turn (the target is in a rectangular box).
Figure 8. Comparison of the PR curves. (a) The PR curve of the baseline model; (b) the PR curve of our model.
Figure 9. Some visualization results of the baseline model and our model.
Table 1. Experimental environment.

Item | Configuration Value
CPU | Intel(R) Core(TM) i5-13490F 2.50 GHz
GPU | NVIDIA RTX 4070 12 G
Operating System | Ubuntu 20.04
Deep Learning Framework | PyTorch 2.0.0 + cu118
Table 2. Training parameter configuration.

Configuration Item | Configuration Value
optimizer | SGD
momentum | 0.938
weight decay | 0.0005
training epochs | 500
warmup epochs | 5
initial warm-up momentum | 0.8
image mosaic probability | 0.9
batch size | 16
workers | 16
initial learning rate | 0.01
learning rate strategy | poly
Table 3. Ablation experiment results (“✓” means including this module, “✕” means not including this module).

SRFD | RFCAConv | DySample | mAP50 | mAP50-95 | Params (M) | FLOPs (G)
✓ | ✕ | ✕ | 84.4% | 57.9% | 3.028 | 9.4
✕ | ✓ | ✕ | 86.0% | 56.0% | 3.041 | 8.5
✕ | ✕ | ✓ | 84.7% | 55.0% | 3.018 | 7.3
✓ | ✓ | ✕ | 87.6% | 60.2% | 3.046 | 7.5
✓ | ✓ | ✓ | 88.3% | 59.8% | 3.054 | 7.1
Table 4. Comparison of detection results and computational load.

Model | mAP50 | mAP50-95 | Params (M) | FLOPs (G)
YOLOv5 | 72.1% | 37.9% | 7.012 | 15.8
YOLOv6 | 80.6% | 54.1% | 4.700 | 11.4
YOLOv7 | 84.4% | 49.3% | 6.006 | 13.0
Baseline | 83.9% | 53.5% | 3.005 | 8.1
Ours | 88.3% | 59.8% | 3.054 | 7.1
