1. Introduction
In autonomous exploration tasks of AUVs, target detection technology based on SSS imagery plays a crucial role [1,2,3]. Between traditional detection methods and deep-learning-based methods, the latter have shown significant advantages and have seen extensive application in underwater target detection [4,5,6]. However, the quantity and quality of data samples directly determine the detection performance of deep learning methods, and obtaining sufficient high-quality target data is extremely challenging and costly due to an AUV's unstable underwater posture and cumulative positioning errors. Exploring how to generate data from small-sample SSS images, and thereby improve target detection performance, is therefore of great research significance.
In terms of data generation, deep generative models, such as VAEs [7,8], EBMs [9,10], GANs [11,12,13], normalizing flows [14,15], and diffusion models [16,17], have shown great potential in creating new samples that humans cannot reliably distinguish from real ones. However, owing to the unique characteristics of underwater data, research on data augmentation specifically for underwater imagery remains relatively limited; current work generally relies on combining simulation models with GANs for data generation.
If only a single set of original data is available, conditional GANs and some unconditional GANs can be considered. Given target images, or target images with labels, as input, GANs map random noise to simulated images that approach the real ones. Chen et al. [18] proposed a deep convolutional generative adversarial network (SGAN) based on group padding and uniform-sized convolutional kernels for high-quality data augmentation. Xu et al. [19] combined DenseNet and ResNet with WGAN-GP to propose an image generation network called CWGAN-GP&DR, which extends underwater sonar datasets and effectively improves the classification performance of underwater sonar images. Wang et al. [20] addressed the low resolution and poor imaging quality of commonly used image generation methods: they proposed a new model based on DCGAN, improving the network structure and the discriminator's loss function, and introduced a controllable multi-layer transposed convolutional layer structure, enhancing image resolution and imaging quality.
If pixel-level paired training images are available, pix2pix-based GANs can be used to generate images. Jegorova et al. [21] proposed Markov-chain-conditioned pix2pix (MC-pix2pix), a novel method for generating realistic SSS images, and used MC-pix2pix data to train an autonomous target detection network. Jiang et al. [22] presented a pix2pix-based semantic image synthesis model that reconstructs simulated SSS images from simple hand-drawn segmentation maps of target photos and can generate sonar images for different water depths and frequencies. Lee et al. [23] used a segmentation network to obtain mask images from the original images for training a pix2pix network, which was then used to generate sonar images that improve the segmentation network's effectiveness.
In addition to single-group and paired image data, unsupervised GANs such as CycleGAN can also be utilized. CycleGAN takes two sets of unlabeled data as input and, in a manner similar to style transfer, achieves target data generation. Liu et al. [24] developed a novel underwater acoustic image simulator based on 3D modeling software and then applied CycleGAN to the simulator-generated dataset to produce realistic acoustic images. Zhang et al. [25] addressed the mode collapse often caused by imbalanced, speckle-noise-prone acoustic images by proposing a spectrally normalized CycleGAN, in which spectral normalization is applied to both the generator and the discriminator to stabilize training.
The aforementioned GAN-based data generation methods require a relatively large training dataset, and GAN training itself can be difficult: with a small dataset, training easily falls into mode collapse, necessitating complex techniques to stabilize it [26]. Diffusion models, by contrast, train more stably than GANs and do not require an additional discriminator. They have therefore shown great potential in fields such as computer vision [27,28], sequence modeling [29,30], audio processing [31,32], and artificial intelligence research [33,34]. However, there is currently no research on diffusion models for generating underwater SSS data.
The diffusion model has achieved success in many fields, making it well worth exploring in our own research domain. This paper trains a diffusion model on a small-sample SSS dataset collected from sea trials and compares it with GAN-based generation methods, demonstrating the effectiveness of the diffusion model for small-sample SSS data generation. Finally, the generated data are evaluated on mainstream detection networks and on an improved YOLOv7 network, further validating the improvement in detection accuracy obtained by training networks on diffusion-generated data.
The contributions of this paper can be summarized as follows:
- (1)
We present the first application of the DDPM to small-sample SSS data generation, which yielded excellent experimental results. This addresses the challenges of acquiring SSS data with an AUV and reduces data collection costs.
- (2)
We improved the YOLOv7 model by introducing an ECANet attention mechanism into the network, enhancing feature extraction for underwater targets and improving the detection accuracy of small targets in SSS images.
- (3)
A dataset of small underwater targets in SSS imagery was constructed, and current mainstream data generation and object detection methods were comprehensively compared on it, fully demonstrating the effectiveness of the proposed approach.
The rest of this paper is organized as follows: Section 2 introduces the diffusion-model-based data generation method and the improved YOLOv7 network structure used in this study. Section 3 presents the experimental process and results. Section 4 discusses the strengths and limitations of the proposed methods. Section 5 concludes the paper and provides an outlook on future work.
3. Results
To validate the proposed SSS image generation method and the improved YOLOv7 model, we first compared the DDPM-based image generation method with several GAN-based methods. By comparing the quality of the generated images and conducting quantitative analysis, we confirmed the stability and effectiveness of the DDPM-based method for generating small-sample underwater data. We then augmented the training set with DDPM-generated data and tested it on popular detection networks as well as our improved YOLOv7 network. This further demonstrated that training with DDPM-generated data improves detection accuracy, while also highlighting the superiority of our improved YOLOv7 network for SSS object detection.
The proposed method was implemented on a system with the following specifications: an Intel(R) Xeon(R) Platinum 8255C CPU at 2.50 GHz, 24 GB of RAM, an Nvidia GeForce RTX 3090 GPU, CUDA 11.3, 64-bit Ubuntu 20.04, and the PyTorch framework.
3.1. Model Evaluation Metrics
The most commonly used evaluation metrics for generative models are Inception Score (IS), Fréchet Inception Distance (FID), and Perceptual Path Length (PPL). However, the Inception Net-V3 model used in calculating the IS metric is trained on the ImageNet dataset, which does not include underwater images. Therefore, the IS metric is not suitable for assessing the quality of the generated images in this study. The PPL metric utilizes the VGG network and focuses on whether the generator can effectively separate and recombine the features of different images. This metric is typically used in face detection and is not applicable to the SSS images generated in this study. On the other hand, the FID metric, although also utilizing the Inception Net-V3 model, directly considers the distance between generated and real data at the feature level. It does not rely on an additional classifier, making it suitable for evaluating the SSS images generated in this study. Therefore, only the FID metric is used as the measure of the generated image quality in this study.
The FID metric is calculated as follows:

$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

where $\mu_r$ represents the mean of the features of real images, $\mu_g$ represents the mean of the features of generated images, $\Sigma_r$ represents the covariance matrix of real image features, $\Sigma_g$ represents the covariance matrix of generated image features, and $\mathrm{Tr}(\cdot)$ represents the trace of a matrix.
A smaller FID value indicates a closer resemblance between the generated distribution and the real image distribution.
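As a minimal illustration of this metric (not the evaluation code used in this study), the FID between two sets of feature vectors can be computed with numpy alone; the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_r \Sigma_g$:

```python
import numpy as np

def fid(feat_real, feat_gen):
    """Frechet Inception Distance between two feature sets (n_samples x dim)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of sqrt of eigenvalues of Sigma_r @ Sigma_g
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2 * tr_sqrt

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
same = rng.normal(0.0, 1.0, size=(500, 8))      # same distribution as `real`
shifted = rng.normal(2.0, 1.0, size=(500, 8))   # mean-shifted distribution
print(fid(real, same) < fid(real, shifted))     # matching distributions score lower
```

In practice the features are the 768- or 2048-dimensional activations of Inception Net-V3, as used for the classification similarity index in Table 2.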
To evaluate the performance improvement obtained by incorporating generated data, and the detection effectiveness of the improved YOLOv7 model, we utilize precision ($P$), recall ($R$), and mean Average Precision ($\mathrm{mAP}$) as evaluation metrics, defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$\mathrm{AP} = \int_0^1 P(R)\,\mathrm{d}R, \qquad \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$

where true positives ($TP$) are correctly detected positive samples, false positives ($FP$) are negative samples falsely detected as positive, false negatives ($FN$) are undetected positive samples, and $N$ is the number of detected categories.
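The definitions above can be sketched as follows (an illustrative numpy implementation, not the YOLOv7 evaluation code; a full mAP computation additionally matches detections to ground-truth boxes by IoU and averages AP over categories):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, is_positive, n_positives):
    """AP as the area under the precision-recall curve, obtained by
    sweeping a confidence threshold over detections sorted by score."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_positive)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    precision = tp / (tp + fp)
    recall = tp / n_positives
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):   # step-wise integration of P(R)
        ap += p * (r - prev_r)
        prev_r = r
    return ap

p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r)  # 0.8 0.8
```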
3.2. Dataset Preparation
During the data collection phase, we used a 324-caliber AUV equipped with an SSS to navigate, along a predetermined route, through designated areas where targets had been pre-placed. The AUV and the SSS used in the experiment are depicted in Figure 6.
To achieve real-time processing of SSS images, we extract an image segment from the sonar waterfall plot every 30 s, as shown in Figure 7. Because the targets occupy a very small proportion of the entire SSS image, directly inputting the whole image into the network for training would generate a large number of negative samples, impairing training and wasting computational resources. To address this, we crop the images into 200 × 200 patches with a 50-pixel overlap between adjacent patches to prevent loss of target features, and select the patches that contain targets for training, thus avoiding irrelevant background that would introduce negative samples. During the detection phase, we perform the same cropping operation before feeding the image to the detection network.
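The cropping strategy described above can be sketched as follows (illustrative only; the handling of patches at the image border is our assumption, since the text does not specify it):

```python
import numpy as np

def crop_patches(image, patch=200, overlap=50):
    """Slide a patch x patch window over the image with the given overlap,
    returning (row, col, patch) tuples. Edge patches are anchored to the
    image border so no pixels are dropped (an assumed edge policy).
    Assumes the image is at least patch x patch pixels."""
    stride = patch - overlap
    h, w = image.shape[:2]
    rows = list(range(0, h - patch + 1, stride))
    cols = list(range(0, w - patch + 1, stride))
    if rows[-1] != h - patch:
        rows.append(h - patch)
    if cols[-1] != w - patch:
        cols.append(w - patch)
    return [(r, c, image[r:r + patch, c:c + patch]) for r in rows for c in cols]

waterfall = np.zeros((600, 500), dtype=np.uint8)  # stand-in waterfall segment
patches = crop_patches(waterfall)
print(len(patches))  # 12 patches of 200 x 200
```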
Due to the uncertainty of sea conditions, collecting data at sea is extremely challenging. After analysis and comparison, we successfully collected 388 valid target images, including 53 cone targets, 55 cylinder targets, and 280 non-preset seabed targets that could cause interference. This dataset exhibits a significant class imbalance issue, and the data volume is limited. We set the ratio of the training set, validation set, and test set to 0.6:0.2:0.2 to maximize the number of samples, and we refer to this original dataset as DatasetA.
Next, we used the DDPM method to generate data. By selecting suitable samples from the generated data, we increased the total number of cone and cylinder images to match the non-targets at 280 each. To ensure a fair comparison of the experimental results, the generated data were added only to the training set, while the validation and test sets remained unchanged. The dataset with the added generated data is referred to as DatasetB. The sample counts of DatasetA and DatasetB are shown in Table 1.
Finally, based on the two aforementioned datasets, the effectiveness of the generated data was tested on different detection networks.
3.3. Comparison of Data Generated by DDPM and GANs
Our experimental goal is to generate target-containing images that are similar to the original data in order to increase the sample size of the training set and improve the performance of the detection model. In the experiment, we compared the adversarial autoencoder (AAE), auxiliary classifier GAN (ACGAN), boundary-seeking GAN (BGAN), and DDPM methods. The images generated by the DDPM algorithm and the GANs are shown in Figure 8.
From the generated results, it can be observed that the DDPM algorithm produces SSS data with a high degree of similarity to the original data, even when trained on limited data. The image quality of the GAN methods is comparatively poor: they generate only background information without producing the desired targets, and may even suffer from mode collapse.
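The stability of DDPM training stems from its fixed forward noising process, which admits a closed-form sample at any timestep; the network only has to regress the added noise. A minimal numpy sketch under a standard linear beta schedule (illustrative only; the schedule and hyperparameters here are assumptions, not those of our implementation):

```python
import numpy as np

# DDPM forward (noising) process: with a linear beta schedule,
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t, rng):
    """Sample x_t directly from x_0 in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))       # stand-in for a normalized SSS patch
x_final = noisy_sample(x0, T - 1, rng)   # nearly pure Gaussian noise
print(alpha_bar[-1] < 1e-4)              # the signal is almost entirely destroyed
```

The reverse process learns to invert these steps one at a time, which avoids the adversarial minimax game that makes GANs prone to mode collapse on small datasets.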
To quantitatively analyze the similarity between the DDPM-generated data and the real data, we calculated the FID metric for the generated and real images, as shown in Table 2. In terms of the classification similarity index (dim = 768), the difference between the neural-network features of the generated and original data is very small. These results indicate that the generated synthetic sonar images are highly similar to the real images. Furthermore, we computed eight Haralick textural features [45] (angular second moment, contrast, correlation, inverse difference moment, sum entropy, entropy, difference variance, and difference entropy) for both datasets and used multidimensional scaling (MDS) to visualize the texture dissimilarity in two dimensions, as depicted in Figure 9. The vertical and horizontal axes are dimensionless and represent the degree of texture difference. The results show that the generated data exhibit texture features similar to the real sonar data in all three categories and overall.
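As an illustrative sketch of this texture analysis (not the analysis code of this study), two of the eight Haralick features can be computed from a single-offset gray-level co-occurrence matrix; the offset and gray-level quantization chosen here are our own assumptions:

```python
import numpy as np

def glcm(img, levels=8, dx=1, dy=0):
    """Gray-level co-occurrence matrix for one pixel offset, normalized
    to a joint probability distribution."""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[q[y, x], q[y + dy, x + dx]] += 1
    return m / m.sum()

def haralick_subset(p):
    """Two of the eight Haralick features used in the paper."""
    i, j = np.indices(p.shape)
    asm = (p ** 2).sum()                 # angular second moment
    contrast = (p * (i - j) ** 2).sum()  # contrast
    return asm, contrast

rng = np.random.default_rng(0)
smooth = np.tile(np.arange(64), (64, 1))     # smooth gradient image
noisy = rng.integers(0, 256, size=(64, 64))  # speckle-like random image
asm_s, c_s = haralick_subset(glcm(smooth))
asm_n, c_n = haralick_subset(glcm(noisy))
print(c_s < c_n)  # the smooth image has far lower GLCM contrast
```

Each image yields an eight-dimensional feature vector; MDS then embeds the pairwise distances between these vectors into the two-dimensional plane of Figure 9.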
3.4. Performance Comparison of YOLOv7 Networks Trained with DatasetA and DatasetB
In this experiment, although our objective is to generate data highly similar to real SSS images, the most important criterion is improved detection performance. We therefore trained two YOLOv7 models on DatasetA and DatasetB separately (details of the two datasets are given in Table 1) and compared their detection performance.
We conducted a comprehensive comparison of the two trained models on the test set. The confusion matrices are presented in Figure 10; they demonstrate the remarkable performance of the network trained on DatasetB, which shows better accuracy and balance in detecting the target and background classes.
To further analyze model performance, we examined the PR curves presented in Figure 11. The network trained on DatasetB achieved a significantly higher detection mAP than the network trained on DatasetA: the mAP@0.5 value increased by 27.9%.
A visualization of the detection results is shown in Figure 12. The network trained on DatasetB clearly outperformed its counterpart trained on DatasetA, detecting more targets with higher accuracy on the test set.
For a comprehensive view of the specific metrics, refer to Table 3, which provides a detailed breakdown of the performance measures and further reinforces the superiority of the network trained on DatasetB.
3.5. Performance Comparison of YOLOv7 Integrated with Different Attention Mechanisms
To investigate the performance of various commonly used attention mechanisms in YOLOv7, we incorporated each mechanism individually at the same position within the YOLOv7 network. Performance tests were conducted on both the original and the augmented dataset, and the results are presented in Table 4.
From Table 4, it can be observed that incorporating an attention mechanism does not necessarily improve detection performance; an ill-suited attention mechanism can, on the contrary, degrade it. For the dataset and YOLOv7 network used in this study, ECA and CA perform best, both significantly enhancing the network's detection capability. Between the two, ECA introduces fewer additional parameters; given a similar performance improvement, ECA is therefore the preferable choice for underwater platforms with limited computational resources.
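The ECA module's appeal is its tiny parameter count: a single 1D convolution across channel descriptors, with a kernel size that adapts to the channel dimension. An illustrative numpy sketch (random stand-in weights; the module in our network is a learned PyTorch layer):

```python
import numpy as np

def eca(feature_map, gamma=2, b=1):
    """Efficient Channel Attention on a (C, H, W) feature map: global
    average pooling, a 1D convolution across channels with an adaptively
    sized kernel, a sigmoid gate, and channel-wise rescaling."""
    c = feature_map.shape[0]
    # adaptive kernel size: nearest odd number to (log2(C) + b) / gamma
    k = int(abs((np.log2(c) + b) / gamma))
    k = k if k % 2 else k + 1
    pooled = feature_map.mean(axis=(1, 2))            # global average pooling
    kernel = np.random.default_rng(0).normal(size=k)  # stand-in for learned weights
    conv = np.convolve(np.pad(pooled, k // 2, mode="edge"), kernel, mode="valid")
    gate = 1.0 / (1.0 + np.exp(-conv))                # sigmoid channel gate
    return feature_map * gate[:, None, None]          # rescale each channel

x = np.random.default_rng(1).normal(size=(256, 20, 20))
y = eca(x)
print(y.shape)  # attention preserves the feature-map shape
```

Because the only parameters are the k weights of the 1D convolution (k = 5 for 256 channels here), ECA adds almost no computational burden, which matches its suitability for resource-limited underwater platforms.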
3.6. Performance Comparison of Improved YOLOv7 Network against Original YOLOv7 and Other Networks
To validate the performance of our proposed improved YOLOv7 model, we conducted training and testing on DatasetA and DatasetB. The confusion matrices of the improved models on the test set are compared in Figure 13, and the PR curves are illustrated in Figure 14. Compared with Figure 10 and Figure 11, our improved model achieves higher detection accuracy on both datasets. The detection results of the improved models are depicted in Figure 15. Comparing these with the results in Figure 12 shows that the improved YOLOv7 network accurately detects objects that the original YOLOv7 network either missed or misidentified, as demonstrated in images 10.png and 15.png. These findings affirm the improved YOLOv7 network's superior detection performance and efficacy.
The specific metrics are provided in Table 5. Compared to Table 3, the improved model achieved improvements in the mAP metric on both test sets: on DatasetA, the mAP@0.5 increased by 3.4%, while on DatasetB it improved by 2%.
Additionally, we compared the improved YOLOv7 network with several commonly used underwater object detection networks, including Faster-RCNN, SSD, EfficientDet-D0, DETR, YOLOv5, and YOLOv7; the comparison metrics are detailed in Table 6. The data strongly validate the effectiveness of the proposed data augmentation method: enhancing the training data with our approach significantly improved detection performance. The results also confirm the superior performance of our improved YOLOv7 model in detecting small acoustic targets.
We also compared our improved YOLOv7 network with Co-DETR, a state-of-the-art network trained on the COCO dataset. From the data in Table 6, the two networks perform similarly on our dataset, with Co-DETR performing better on the mAP@0.5:0.95 metric on DatasetB. With further fine-tuning, Co-DETR's detection performance on our dataset would likely surpass that of our improved YOLOv7. However, Co-DETR has 348 million parameters, nearly ten times as many as our improved YOLOv7 network (37 million), which makes deploying Co-DETR on underwater platforms with limited computational resources highly challenging. To ensure that models can be deployed on underwater platforms, we therefore trade off some detection performance for a more lightweight network.
4. Discussion
From Figure 8, Figure 9, and Table 2, it can be observed that DDPM performs well in generating small-sample underwater SSS data. Compared with GANs, the diffusion model is easier to train and can generate images similar to the original dataset even with limited raw data. From Table 3, Table 5, and Table 6, it can be seen that training the detection model with generated images significantly improves the mAP metric. However, since the diffusion model samples from the learned probability distribution, categories that are rare in the original data are correspondingly rare in the generated data. In this study, 18,000 images were generated in total, from which 480 were selected for data augmentation. This selection process involves a degree of subjectivity, but it is entirely acceptable compared with the training difficulties and mode collapse issues of GANs.
Furthermore, Table 3, Table 5, and Table 6 also show that the improved YOLOv7 model has better detection performance, with significant improvements in both detection metrics and actual detection results. Some detection errors remain, however: in Figure 15, the model detects the cone in 12.png as a cylinder and misses a non-target in 14.png. This is because different objects can appear very similar in sonar images, which poses a major challenge in underwater acoustic target detection.
In summary, the diffusion model has great potential in underwater applications. It is well known that underwater data collection is costly and challenging. The advantage of the diffusion model lies in its ability to generate data based on small samples and its stable training process. This can significantly reduce the cost of data acquisition, making it highly suitable for underwater target detection tasks.
5. Conclusions
This paper leveraged SSS images collected by an AUV to generate data using DDPM and compared the results with GAN-based methods, demonstrating the superiority of the diffusion model for generating small-sample underwater datasets. An SSS small-target dataset was constructed, addressing the challenges and high costs of underwater target data collection. Considering the characteristics of small underwater targets, the YOLOv7 network was improved by incorporating an ECANet attention module to enhance its feature extraction capability. Finally, the generated data were added to the training set, and tests were conducted on mainstream detection networks as well as our improved YOLOv7 network. The results validated that augmenting the training data with generated samples improves detection accuracy, with an mAP increase of approximately 30% across different detection networks, and confirmed the superiority of the improved YOLOv7 network for detecting small underwater targets, with a 2.0% mAP improvement over the original network.
Additionally, this study had certain limitations. First, when generating side-scan sonar images with the diffusion model, producing a large volume of data and then selecting high-quality samples to augment the training set is highly time-consuming and unsuitable for real-time online generation. Second, the improved YOLOv7 network may still misclassify distinct samples that are highly similar; until side-scan sonar imaging accuracy improves significantly, distinguishing such samples remains a challenging task and a common issue in underwater acoustic target detection. Finally, the dataset used in this study is relatively small, which may prevent some networks from fully demonstrating their performance. However, data collection is inherently difficult in this field, so conducting detection tasks with limited data aligns well with real-world engineering demands.
In future work, we plan to incorporate lightweight network techniques to reduce model complexity and improve the speed of generating images and detecting targets. The aim is to adapt to underwater platforms with limited computational resources. Additionally, we will conduct more experiments, collect more data, expand the dataset, and enhance the diversity of target categories within the dataset to accommodate a broader range of underwater target detection scenarios.