2.1. Overall Structure of DS-SIAUG
The overall training network structure of DS-SIAUG is illustrated in
Figure 1. It comprises a DDPM diffusion generation model and a YOLO target detection model, with the Disrupted Student model incorporated into the training of the YOLO target detection model.
The DDPM diffusion module serves as an image generator, trained to augment the side-scan sonar dataset. The YOLO network acts as a discriminator, detecting targets in the generated images and judging their authenticity. By designing an adversarial interaction between these two modules, we augment the dataset and lay the foundation for subsequent target recognition tasks. The specific adversarial training process is as follows (a code-level sketch summarizing the loop is given after the list):
- (1)
Data preparation. The side-scan sonar shipwreck target is used as an example to explain the dataset augmentation method. A dataset of actual side-scan sonar shipwreck target images is selected (Real SSS Image, 1000 images), part of which is used as the test dataset (Test Dataset, 200 images) and part as the training dataset (Training Dataset, 800 images). To expand the training dataset, conventional image augmentation methods, such as rotation and mirroring, are first used to increase the training set to about 7000 images, called “The First Augmented Dataset”.
- (2)
Initial training phase. The First Augmented Dataset is used to train the DDPM (Denoising Diffusion Probabilistic Model) and the YOLOv5 network. This phase yields the initial diffusion model (DDPM 1) and the initial detection model (YOLO Detection Model 1).
- (3)
Image generation, filtering, and augmentation. Using the trained DDPM 1 model, a dataset of side-scan sonar images is generated (Augmented Images, about 10,000 images) and input into YOLO Detection Model 1 for detection. A detection-confidence threshold is set (different thresholds retain different numbers of side-scan sonar images with shipwreck features); in this paper the threshold is set to 0.5, which retains about 2000 images (YOLO-filtered images). These filtered images are then augmented using conventional image augmentation methods, such as rotation and mirroring, increasing them to about 7000 images, called “Diffusion Generated Dataset 1”.
- (4)
Iterative training. Iterative training includes Disrupted Student training and diffusion model training. Disrupted Student training refers to adding disruptions to “Diffusion Generated Dataset 1” and merging it with “The First Augmented Dataset” to train the YOLO detection model (YOLO Detection Model 2); diffusion model training refers to merging “Diffusion Generated Dataset 1” with “The First Augmented Dataset” to form “The Second Augmented Dataset”, which is then used to train the diffusion model (DDPM 2).
- (5)
Use the DDPM 2 model to regenerate the side-scan sonar image dataset (Augmented Images, about 10,000 images), input it into YOLO Detection Model 2 for filtering, and perform conventional augmentation on the filtered dataset, naming the augmented dataset “Diffusion Generated Dataset 2”.
- (6)
Repeat steps (4) and (5), iteratively generating new datasets to continuously enhance the model’s detection capabilities.
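For clarity, the iterative adversarial procedure of steps (1)–(6) is summarized in the following sketch. It is a minimal, hypothetical outline: all helper callables (training, generation, filtering, and augmentation routines) are placeholders for the components described above, not the authors' released code.

```python
def ds_siaug_loop(
    real_train_set,          # labeled real side-scan sonar training images
    train_ddpm,              # callable: dataset -> trained diffusion model
    train_yolo,              # callable: dataset -> trained YOLO detector
    generate_images,         # callable: (ddpm, n) -> list of generated images
    detector_confidence,     # callable: (yolo, image) -> max detection confidence
    conventional_augment,    # callable: (dataset, target_size) -> augmented dataset
    add_disruptions,         # callable: dataset -> disrupted copy of the dataset
    num_rounds=3,
    conf_threshold=0.5,
):
    """Hypothetical sketch of the DS-SIAUG adversarial augmentation loop."""
    # Step (1): conventional augmentation (rotation, mirroring) of the real data.
    first_augmented = conventional_augment(real_train_set, target_size=7000)

    # Step (2): initial training of the diffusion generator and the YOLO discriminator.
    ddpm = train_ddpm(first_augmented)
    yolo = train_yolo(first_augmented)

    for _ in range(num_rounds):
        # Steps (3)/(5): generate candidates and keep those in which the current
        # detector finds a shipwreck with confidence above the threshold.
        candidates = generate_images(ddpm, 10000)
        kept = [img for img in candidates if detector_confidence(yolo, img) >= conf_threshold]
        diffusion_set = conventional_augment(kept, target_size=7000)

        # Step (4): Disrupted Student training of the detector on the disrupted
        # generated images merged with the first augmented dataset ...
        yolo = train_yolo(first_augmented + add_disruptions(diffusion_set))
        # ... and diffusion model training on the merged (second) augmented dataset.
        ddpm = train_ddpm(first_augmented + diffusion_set)

    # Step (6): the loop repeats, yielding a progressively stronger generator and detector.
    return ddpm, yolo
```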
In the above process, the Disrupted Student training on the diffusion-generated dataset, the DDPM (Denoising Diffusion Probabilistic Model), and the YOLOv5 detection model are the core modules of the adversarial reinforcement training structure.
2.2. Disrupted Student Training Model
The specific structure of the Disrupted Student model is shown in
Figure 2.
This method can be considered an improved version of self-training, which itself is a method within semi-supervised learning (e.g., [34,35]) and knowledge distillation [36]. Initially, we use a teacher model to label the augmented images, and then disruptions (such as noise interference and geometric deformations) are added to these images. These labeled and disrupted images are then used to train the student model, ultimately aiming to achieve better performance than the teacher model.
Taking step four from
Section 2.1 as an example, where the teacher model is the previously trained YOLO Detection Model 1 and the student model is the YOLO Detection Model 2 to be trained in this round, the steps are as follows:
First step: Training of the teacher model and automatic image labeling. The “First Augmented Dataset”, which has already been labeled, is used to train the teacher model, YOLO Detection Model 1. This mature teacher model then automatically labels the unlabeled “Diffusion Generated Dataset 1” with side-scan sonar shipwreck targets.
Second step: Adding disruptions to “Diffusion Generated Dataset 1”. We apply some of the most common types of noise encountered in side-scan sonar imagery, together with random rotations, mirroring, and deformations, to increase the difficulty of recognition for the student model.
Third step: The teacher model guides the student model’s training. The student model is trained using the disrupted “Diffusion Generated Dataset 1” along with the labels provided by the teacher model before the disruptions were added.
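The three steps above can be outlined as follows. This is an illustrative sketch, assuming NumPy images in the range [0, 1] and a detector object exposing hypothetical `predict` and `train` methods; the disruption parameters are placeholders, and geometric disruptions are omitted here because they would also require transforming the bounding-box labels.

```python
import random
import numpy as np

def add_disruption(image):
    """Illustrative disruption: additive noise plus a mild intensity perturbation."""
    noise = np.random.normal(0.0, 0.05, image.shape)  # speckle-like additive noise
    gain = random.uniform(0.9, 1.1)                    # mild intensity change
    return np.clip(gain * image + noise, 0.0, 1.0)

def disrupted_student_round(teacher, student, generated_images):
    # First step: the trained teacher labels the unlabeled generated images.
    pseudo_labels = [teacher.predict(img) for img in generated_images]

    # Second step: disrupt the images after labeling, so the student must recover
    # the teacher's (clean-image) predictions from harder inputs.
    disrupted = [add_disruption(img) for img in generated_images]

    # Third step: train the student on the disrupted images with the
    # teacher-provided labels, aiming to surpass the teacher.
    student.train(disrupted, pseudo_labels)
    return student
```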
2.3. DDPM Model Structure
The diffusion model consists of two phases: “forward diffusion” and “reverse diffusion”. “Forward diffusion” refers to the process of repeatedly adding small “mist-like” effects (random noise) to a clear image, causing the image to gradually become blurred until it turns into an indistinguishable haze. “Reverse diffusion” is the inverse process of “forward diffusion”, where noise is progressively removed from this hazy image, restoring it to its original clarity. The diffusion model is trained and generates images through this process, as shown in
Figure 3.
The diffusion model (DDPM) is defined in the form of a parameterized Gaussian Markov chain.
The “forward diffusion” process involves repeatedly adding a controlled amount of random noise to the image until it becomes “fogged”. It is assumed that the training data follow the distribution $x_0 \sim q(x_0)$. The forward diffusion process adds Gaussian noise to the sample images sequentially over $T$ time steps, as shown in Equation (1):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right) \qquad (1)$$

In the equation, $q(x_t \mid x_{t-1})$ represents the conditional distribution, and $\beta_t$ is the variance of the Gaussian noise added at each step, which satisfies $0 < \beta_t < 1$. If $T$ is sufficiently large, the diffused data $x_T$ lose the characteristics of the original data $x_0$, becoming random noise.
According to Equation (1), $x_t$ can be obtained through reparameterization sampling. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. After derivation, the relationship between $x_t$ and $x_0$ is obtained:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
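As a concrete illustration of this closed-form forward sampling, the following minimal sketch assumes PyTorch and a linear β schedule; the schedule values and tensor shapes are illustrative assumptions, not the paper's exact settings.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear beta schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # cumulative product, \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t directly from x_0 via the reparameterization above.
    x0: (B, C, H, W) image batch; t: (B,) tensor of zero-based time-step indices."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over image dimensions
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise
```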
“Reverse diffusion” is what the model learns in order to remove noise: starting from random noise, the trained model progressively eliminates the noise and restores the initial image. This requires the model to be able to predict the appearance of the image after each denoising step. The process relies on a large amount of data and highly complex computation; the model must learn how to accurately restore images at different noise levels.
If the reverse conditional distribution $q(x_{t-1} \mid x_t)$ is obtained, that is, if the state at step $t-1$ can be predicted from the state at step $t$, an image can be gradually reconstructed from random noise $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. DDPM uses a neural network to fit the reverse process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Therefore, it can be derived that

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right), \qquad \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon\right)$$

In the DDPM paper, the variance is not learned; instead, the mean $\mu_\theta(x_t, t)$ is fitted through the neural network to obtain $x_{t-1}$. Because $\alpha_t$ and $\beta_t$ are known, the neural network only needs to fit the noise term $\epsilon_\theta(x_t, t)$.
The training process is as shown in Algorithm 1. First, extract a sample $x_0$ from the data and randomly select a time step $t$ from $\mathrm{Uniform}(\{1, \ldots, T\})$. Then, pass $x_0$ into the forward process of Gaussian Diffusion, sample a random noise $\epsilon$, and add it to $x_0$, forming $x_t$. Next, feed $x_t$ and $t$ into the Unet. The Unet generates a sinusoidal positional encoding from the time step $t$, combines it with $x_t$, predicts the added noise, and returns it. Finally, Gaussian Diffusion computes the L2 loss between the noise predicted by the Unet and the previously sampled random noise, the gradient is computed, and the weights are updated.
Algorithm 1 Training
1: repeat
2: $x_0 \sim q(x_0)$
3: $t \sim \mathrm{Uniform}(\{1, \ldots, T\})$
4: $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5: Take gradient descent step on $\nabla_\theta \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2$
6: until converged
Repeat the above steps until the Unet network is fully trained.
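A minimal sketch of one such training iteration is given below, assuming PyTorch, a noise-prediction network called with `unet(x_t, t)`, and the `T` and `q_sample` definitions from the earlier sketch; it illustrates Algorithm 1 and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0):
    """One DDPM training iteration (cf. Algorithm 1)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random time steps (zero-based)
    noise = torch.randn_like(x0)                                # epsilon ~ N(0, I)
    x_t = q_sample(x0, t, noise)                                # forward-diffused sample
    predicted_noise = unet(x_t, t)                              # Unet predicts the added noise
    loss = F.mse_loss(predicted_noise, noise)                   # L2 loss on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```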
Once the diffusion model is well trained, the sampling process begins by sampling $x_T$ from the standard normal distribution. From $t = T$ down to $t = 1$, the following steps are repeated: sample $z$ from the standard normal distribution in preparation for reparameterization, compute the predicted noise $\epsilon_\theta(x_t, t)$ with the model, and combine the resulting mean with $\sigma_t z$ using the reparameterization technique to obtain $x_{t-1}$. After the loop ends, return $x_0$, which is the newly generated image. The specific sampling pseudocode is shown in Algorithm 2.
Algorithm 2 Sampling
1: $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2: for $t = T, \ldots, 1$ do
3: $z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $z = \mathbf{0}$
4: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z$
5: end for
6: return $x_0$
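For illustration, the sampling loop of Algorithm 2 corresponds to the following sketch, assuming PyTorch, the schedule tensors (`betas`, `alphas`, `alpha_bars`, `T`) defined in the earlier sketch, and a trained `unet`; here $\sigma_t$ is taken as $\sqrt{\beta_t}$, one of the variance choices in the original DDPM paper.

```python
import torch

@torch.no_grad()
def sample(unet, shape):
    """Generate an image by reversing the diffusion process (cf. Algorithm 2)."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):                             # t = T-1, ..., 0 (zero-based)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(x, t_batch)                               # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])      # predicted mean of x_{t-1}
        sigma = torch.sqrt(betas[t])                         # sigma_t = sqrt(beta_t)
        x = mean + sigma * z                                 # reparameterized sample
    return x                                                 # x_0, the newly generated image
```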