Article

SVDDD: SAR Vehicle Target Detection Dataset Augmentation Based on Diffusion Model

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
4 Key Laboratory of Target Cognition and Application Technology (TCAT), Beijing 100190, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(2), 286; https://doi.org/10.3390/rs17020286
Submission received: 18 December 2024 / Revised: 12 January 2025 / Accepted: 13 January 2025 / Published: 15 January 2025

Abstract

In the field of target detection using synthetic aperture radar (SAR) images, deep learning-based supervised methods have demonstrated outstanding performance. However, the effectiveness of deep learning methods is largely determined by the quantity and diversity of samples in the dataset. Unfortunately, due to various constraints, the availability of labeled image data for training SAR vehicle detection networks is quite limited, and this scarcity has become one of the main obstacles hindering further progress in SAR vehicle detection. In response, this paper collects SAR images in the Ka, Ku, and X bands to construct a labeled dataset for training Stable Diffusion, and then proposes a diffusion model-based data augmentation framework for SAR vehicle detection. The framework consists of a fine-tuned Stable Diffusion model, a ControlNet, and a series of methods for processing and filtering images based on image clarity, histograms, and an influence function, which enhance the diversity of the original dataset and thereby improve the performance of deep learning detection models. In our experiments, the samples we generated and screened achieved an average improvement of 2.32%, and a maximum of 6.6%, in $mAP_{75}$ across five different strong baseline detectors.

1. Introduction

Synthetic aperture radar (SAR) is an advanced active detection technology that can perform continuous imaging regardless of time and weather conditions and is able to penetrate certain types of obstructions. This capability has made target detection using SAR images a hot topic for researchers. In recent years, deep learning has been widely applied in various research fields due to its powerful feature abstraction capabilities and adaptability to different types of data. As research on deep learning for SAR imagery has deepened, significant progress has been made in deep learning-based target detection and recognition (ATR) for aircraft, ships, buildings, and other objects [1,2].
Vehicle detection is an important branch of remote sensing target detection, which has high practical application value in several areas, including assessing economic development in urban areas, urban infrastructure planning, natural environment monitoring and protection, and the construction of intelligent transportation systems. However, due to the smaller size of vehicles compared to other types of targets like aircraft or ships, higher-resolution SAR image data are required for vehicle detection. Addressing the scarcity of such data has been a persistent challenge for researchers.

1.1. Related Works

In the area of SAR vehicle detection, early works primarily utilized the only existing public vehicle SAR image dataset at the time, MSTAR [3]. However, since MSTAR only contains vehicle targets and clutter data, lacking complex scene data, most of these studies involved synthesizing the targets from MSTAR onto background images. Long et al. [4] extracted vehicle targets from MSTAR and embedded them into clutter data, proposing a rotation detection model based on these. Zhang et al. [5] constructed the SAR-OD dataset by embedding targets from MSTAR into background images and used data augmentation techniques to improve detection accuracy. Sun et al. [6] created the LGSVOD dataset by embedding vehicle targets from MSTAR into background images and proposed an improved YOLOv5 model using transformers to enhance detection accuracy.
The vehicle target detection datasets constructed in this way lack sufficient authenticity, with drawbacks such as homogeneous backgrounds and discrepancies in vehicle distribution compared to real data, and thus still fall short of meeting the requirements of SAR vehicle target detection tasks. In 2023, Lin et al. [7] proposed SIVED, a vehicle detection dataset constructed using semi-automatic annotation algorithms based on FARAD, MiniSAR [8], and MSTAR image data. Yang et al. [9] introduced a two-stage training-based universal SAR object detection framework, which achieved performance improvements in target classification and detection under small-sample conditions through self-supervised learning on multiple datasets, including SIVED.
In recent years, the performance of image generation networks has significantly improved, bringing new ideas for alleviating the scarcity of SAR vehicle target data. Compared to traditional data augmentation methods, data generated by image generation networks can better enhance data diversity and balance data distribution. In the field of SAR, research on data augmentation using GANs [10] has achieved significant results, primarily in target recognition and target detection tasks. In the field of vehicle target recognition, there have been several studies based on the MSTAR dataset: Kong et al. [11] augmented the MSTAR dataset using GANs, improving the accuracy of the classifier model. Mao et al. [12] proposed CN-GAN, which combines LSGAN and Pix2Pix to enhance the signal-to-noise ratio and stability of GAN-generated SAR samples. Cui et al. [13] used WGAN-GP to generate images and designed a filter to select high-quality data at specific angles, improving the quality of the augmented data. Zhu et al. [14] applied LIME to GAN-based data augmentation to evaluate the contribution of augmentation samples and their matching degree with targets, selecting the most representative generated images. Du et al. [15] proposed an augmentation method based on WGAN-GP, using a pre-trained classifier as a constraint to mitigate the instability caused by the Jensen–Shannon divergence, completing the augmentation of multi-class data from MSTAR.
In other SAR target detection areas, there are also GAN-based data augmentation efforts using more complex images: Kwon et al. [16] used a conditional GAN to convert EO images into SAR images for SAR ship target detection, employing a cycle consistency loss to ensure structural consistency between the generated SAR images and the original EO images. Huang et al. [17] used a VAE to replace the generator in GANs with the VAE-wgangp algorithm to overcome the low signal-to-noise ratio and limited diversity of GAN-generated images. Wu et al. [18] introduced cross-domain attention mechanisms and cross-domain multi-scale feature fusion into GANs, proposing CDA-GAN, which converts optical images into high-quality SAR images and is applied to SAR ship target detection. Sun et al. [6] designed an attribute-guided GAN trained with spectral normalization, conditioning on category and angle information as inputs to generate target images of specified categories and angles.
Diffusion [19] is a newer image generation network structure that, compared to GANs, often produces more diverse and more realistic images. Additionally, Diffusion networks are more stable and reliable for generating large-sized images and do not suffer from the mode collapse issues frequently encountered during GAN training, providing suitable conditions for generating SAR images that include complex backgrounds. Qosja et al. [20] applied DDPM to the MSTAR dataset and confirmed through experimental results that the DDPM algorithm benefits from training on large-scale clutter data. Zhang et al. [21] utilized a DDPM algorithm with multiple conditions and trainable environmental prompts to generate SAR ship image data for both inshore and offshore scenarios, allowing control over the direction and position of ship targets by inputting images containing processed ship instances.
Beyond image generation, variants of Diffusion networks can also accomplish various image tasks such as SAR target detection, SAR image denoising, and style transfer between optical and SAR images. DiffDet4SAR (Zhou et al. [22]) introduced DiffusionDet into SAR aircraft target detection. Perera et al. [23] applied DDPM to SAR image denoising, achieving good removal of speckle noise. Seo et al. [24] trained a Brownian bridge diffusion model on paired SAR and EO data to achieve style transfer from SAR to EO images, enhancing performance in a flood segmentation task based on a single instance. Guo et al. [25] proposed a Brownian bridge diffusion model called CM-Diffusion, which maps between SAR and optical images by memorizing color, achieving improvements over GAN-based methods.
It can be seen that previous data augmentation efforts for vehicle target detection have been based on the MSTAR dataset, which, aside from targets located at the center of the images, contains only monotonous backgrounds without any other objects. The MSTAR dataset does not depict variations of vehicle targets across different scenes, nor does it include examples of vehicle targets occluded by other objects. Therefore, a model trained on MSTAR can only generate vehicle targets against a monotonous background, which implies that these data augmentation methods lack the ability to generate complex backgrounds, making it difficult to meet the demands of vehicle target detection in real-world scenarios with complex backgrounds.

1.2. Motivation and Contributions

The ability to handle diverse backgrounds, in which vehicle targets are frequently obscured or confused with non-vehicle objects, is crucial for the performance of SAR vehicle target detectors. To generate augmented data with complex backgrounds, we train a Stable Diffusion model using high-resolution SAR images suitable for vehicle target detection. During the training process, we address the following two needs: (1) To control the type of background in which the vehicle target is situated, we construct a dataset with object prompts, which serve as conditions for the objects in the generated image backgrounds when training the Stable Diffusion model. (2) To reduce the cost of labeling generated samples, a ControlNet [26] is incorporated into the fine-tuned Stable Diffusion model to add new control conditions; by inputting images that encode target and background segmentation information into the model, the orientation and position of the generated targets can be controlled.
Building on the above ideas, this paper examines publicly available SAR datasets for vehicle target detection and finds that, compared to image slices containing vehicle targets, slices that contain only the scene without vehicle targets often constitute a larger portion. Vehicle targets are concentrated in a few specific scenarios, and slices that do not contain vehicles often have a richer variety of object types and spatial arrangements. Based on these observations, this paper posits that by training the image generation model using both SAR slices containing and not containing vehicle targets, the image generation network can learn more about the reflective features and arrangements of objects in SAR images, thereby providing conditions for generating vehicle targets in diverse environments and enhancing the richness of the vehicle target detection dataset.
Finally, the diffusion sampling process is inherently random, meaning that the generated images do not always maintain good quality; thus, a filtering step for the generated data is necessary. Additionally, different target detectors possess distinct characteristics, and the samples that can enhance the performance of these detectors are not entirely the same. This implies that our filtering method should adaptively select augmented samples for each different detector. Therefore, in addition to traditional image processing methods, we specifically introduce an influence function [27] as one of the criteria for measuring sample quality.
To the best of our knowledge, this paper is the first to apply diffusion models for data augmentation of SAR vehicle target detection datasets. The main contributions of this paper are as follows:
  • A total of 9989 SAR image slices sized 512 × 512 were collected from the FARAD and MiniSAR datasets, covering the Ka, Ku, and X bands, and detailed annotations of the various objects present in each slice were provided, forming a labeled dataset usable for multiple SAR image-related tasks.
  • Based on this dataset, we fine-tuned a Stable Diffusion model that had been trained on optical remote sensing images, obtaining a model capable of generating various SAR images, particularly SAR image slices containing vehicle targets. We further fine-tuned the model with ControlNet to make the positions and orientations of the vehicle targets in the generated image slices controllable.
  • We also propose a histogram-based image adjustment method and filtering methods based on clarity and an influence function, forming a framework to augment the SAR vehicle target detection training dataset, and achieved performance improvements across five classic strong baseline detectors.
The remainder of this paper is organized as follows. Section 2 presents the methodology of our work. Section 3 introduces the dataset we collected, annotated, and utilized in subsequent training and presents the experimental results and analysis. Section 4 provides further discussion on our work. Section 5 summarizes the entire work of this paper.

2. Materials and Methods

2.1. Preliminaries

2.1.1. DDPM

The concept of DDPM [19] is derived from thermodynamics, and its core idea is to transform between a standard normal distribution and the target distribution through a forward noising process and a backward denoising process, both modeled as Markov processes. The forward process of DDPM gradually adds noise to the data, causing them to approach a standard normal distribution. This process is represented as follows:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, q denotes the forward-process distribution, $q(x_t \mid x_{t-1})$ is the conditional distribution obtained by adding noise to the data at time $t-1$ to produce the data at time $t$, and $\{\beta_t\}$ is a predesigned schedule of noise variances for each time step $t$. As $t$ approaches infinity, the data distribution approaches a standard normal distribution. From the above two equations, we can derive the marginal distribution corresponding to any time $t$:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. After applying the reparameterization trick to the above process, the forward process can be written as
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t$$
where $\epsilon_t \sim \mathcal{N}(0, I)$.
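For illustration, a minimal PyTorch sketch of this closed-form forward sampling, assuming a linear noise schedule (the schedule values and tensor shapes are illustrative, not the configuration used in this paper):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in a single step using the closed-form expression."""
    eps = torch.randn_like(x0)               # eps_t ~ N(0, I)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
```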
The backward process of DDPM uses a neural network (in Stable Diffusion, a U-Net model) to fit the denoising process from the standard Gaussian distribution back to the true data distribution. Specifically, in the backward process, the neural network $p_\theta(x_{t-1} \mid x_t)$ approximates the true posterior $q(x_{t-1} \mid x_t, x_0)$. Using the expression for the forward process, the following formula can be derived:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}(x_t, x_0),\ \tilde{\beta}_t I\right)$$
where $\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$ and $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$.
The distribution parameterized by the neural network can be expressed as follows:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Here, $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are the mean and variance of the distribution estimated from the neural network parameters.
DDPM aims to minimize the negative log-likelihood of the data at time 0. Through the variational bound, the negative log-likelihood can be bounded as
$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = \mathbb{E}_q\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right]$$
By introducing $x_0$ as a condition, the above expression can be reformulated in terms of KL divergences:
$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[ D_{KL}\left(q(x_T \mid x_0) \,\|\, p(x_T)\right) + \sum_{t > 1} D_{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1) \right]$$
Therefore, the optimization objective is to minimize the KL divergence between the model at each step and the true distribution, which are both Gaussian distributions. From the formula of the KL divergence between the two Gaussian distributions, we can obtain
$$L_{t-1} = \mathbb{E}_{t, x_0, x_t}\left[\left\| \tilde{\mu}_t(x_0, x_t) - \mu_\theta(x_t, t) \right\|^2\right] + C$$
Since $x_t$ is obtained by adding Gaussian noise to $x_0$, predicting the mean is equivalent to predicting the injected noise; thus, the final simplified loss function is
$$L_{simple} = \mathbb{E}_{t, x_0, \epsilon_t}\left[\left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2\right]$$
In practice, diffusion models have shown stronger capabilities than GANs in image generation tasks.
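A corresponding sketch of the simplified objective $L_{simple}$, reusing `alpha_bars` from the sketch above and assuming a noise-prediction network `eps_model(x_t, t)` as a stand-in for the U-Net (names are illustrative):

```python
import torch
import torch.nn.functional as F

def ddpm_simple_loss(eps_model, x0, alpha_bars):
    """L_simple: train the network to predict the noise injected into x_t."""
    alpha_bars = alpha_bars.to(x0.device)
    b = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (b,), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(b, 1, 1, 1)                             # broadcast over (C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps               # forward noising
    return F.mse_loss(eps_model(x_t, t), eps)                          # ||eps - eps_theta(x_t, t)||^2
```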

2.1.2. Latent Diffusion

Latent diffusion [28] is an improvement on DDPM. Before performing the diffusion process of DDPM, it first maps images from pixel space to a lower-dimensional latent space using a VAE encoder. By compressing the representation of image data, it models important information in the image structure in advance, thereby reducing the difficulty and computational load of fitting the diffusion process.
To facilitate control over the generated images, conditioning inputs of other data types, such as text and images, are first encoded and then injected into the image features during the diffusion process through cross-attention.
Stable Diffusion is a text-to-image generation model based on latent diffusion, allowing users to input text descriptions from which high-quality images are generated. In Stable Diffusion, the input text descriptions are first encoded by CLIP [29] into embeddings aligned with the image latent space and then integrated into the diffusion process through cross-attention.
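As a usage illustration, text-to-image generation with a Stable Diffusion checkpoint can be run with the Hugging Face `diffusers` library; the checkpoint name and prompt below are placeholders, and this paper instead uses its own SAR-specific fine-tuned weights:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image pipeline (checkpoint name is a placeholder).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded by CLIP and injected into the latent U-Net via cross-attention.
image = pipe(
    "SAR image slice, road, trees, buildings",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("generated_slice.png")
```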

2.1.3. ControlNet

ControlNet [26] is a model fine-tuning method based on zero convolution. Its fundamental idea is to create a bypass at certain layers of the model's information path by duplicating modules with the same parameters and structure as the original model. The output of each duplicated module is passed through a zero-convolution layer and then added to the output of the original information path, thereby fine-tuning the model. Because of the zero convolution, the output of the bypass is zero before training begins, ensuring that the ControlNet branch has no impact on the model's output prior to training while gradually becoming effective as training progresses.
By freezing the parameters of the original model and training only the parameters of the newly added bypass, ControlNet can fine-tune the model to a certain degree while keeping the original parameters unchanged. Since the original model is frozen, the number of parameters adjusted during training is also reduced. Moreover, a new conditioning input can be fed into the ControlNet bypass, providing the original model with a pathway to accept additional inputs.
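A schematic PyTorch sketch of the zero-convolution bypass described above (simplified; the real ControlNet duplicates the Stable Diffusion encoder blocks and processes the conditioning image with its own small encoder):

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so the bypass contributes nothing at the start."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """One frozen block with a trainable ControlNet-style bypass (schematic)."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen_block = frozen_block.requires_grad_(False)   # original weights stay fixed
        self.trainable_copy = trainable_copy                     # trainable duplicate of the block
        self.zero_in = zero_conv(channels)                       # injects the new condition
        self.zero_out = zero_conv(channels)                      # gates the bypass output

    def forward(self, x, condition):
        y = self.frozen_block(x)
        y_ctrl = self.trainable_copy(x + self.zero_in(condition))
        return y + self.zero_out(y_ctrl)                         # zero at init -> no effect on output
```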

2.1.4. Influence Function

Influence functions [27] are a concept derived from statistical learning and are commonly used in research on the interpretability of machine learning. Their core idea is to investigate how the weight of a particular sample affects the model parameters. Specifically, suppose the training set of a neural network is $z_1, z_2, \ldots, z_n$, the trainable model parameters are $\theta$, and the loss function for sample $z_i$ is $L(z_i, \theta)$. The model parameters obtained by minimizing the loss on the training set are
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$$
If the weight of a specific sample $z$ in the training set is altered by $\epsilon$, the resulting model parameters will also change, denoted as $\hat{\theta}_{\epsilon, z}$:
$$\hat{\theta}_{\epsilon, z} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$$
Based on this, the influence function is defined as the derivative of the model parameters with respect to $\epsilon$:
$$I(z) = \left.\frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$
where $H_{\hat{\theta}}$ is the Hessian matrix of the training loss with respect to the model parameters:
$$H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})$$
Furthermore, if we want to consider the effect of changing the weight of sample $z$ on the loss of another sample $z_{test}$, we can calculate the derivative of the loss of $z_{test}$ with respect to the coefficient $\epsilon$:
$$I(z, z_{test}) = \left.\frac{dL(z_{test}, \hat{\theta}_{\epsilon, z})}{d\epsilon}\right|_{\epsilon=0} = -\nabla_\theta L(z_{test}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$
By definition, a positive influence value indicates that increasing the weight of sample $z$ will increase the loss of $z_{test}$, meaning $z$ has a negative impact on $z_{test}$. Conversely, a negative value implies that $z$ has a positive impact on $z_{test}$. The smaller the influence value, the more beneficial the effect of $z$.
In practical computation, due to the large number of parameters in neural networks, the corresponding Hessian matrix has a very high dimension, making it difficult to compute its inverse directly. Instead, the influence function can be estimated iteratively: randomly sample a sequence of samples $\{z_{s_1}, z_{s_2}, \ldots, z_{s_t}\}$ from the training set, let $v = \nabla_\theta L(z, \hat{\theta})$, and set $\tilde{H}_0^{-1} v = v$. The following recursion can then be used to approximate $H_{\hat{\theta}}^{-1} v$:
$$\tilde{H}_j^{-1} v = v + \left(I - \nabla_\theta^2 L(z_{s_j}, \hat{\theta})\right) \tilde{H}_{j-1}^{-1} v$$
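A hedged PyTorch sketch of this iterative estimate, computing Hessian-vector products with automatic differentiation instead of forming the full Hessian (function and argument names are illustrative; the damping/scale constants are common numerical stabilizers rather than values from this paper):

```python
import torch

def inverse_hvp_estimate(loss_fn, params, v, samples, damping=0.0, scale=1.0):
    """Approximate H^{-1} v via the recursion h_j = v + (I - H_{z_sj}) h_{j-1}.
    `loss_fn(z)` returns a scalar loss; `params` is a list of model parameters;
    `v` is a list of tensors with the same shapes as `params`."""
    h = [vi.clone() for vi in v]                                  # \tilde{H}_0^{-1} v = v
    for z in samples:
        grads = torch.autograd.grad(loss_fn(z), params, create_graph=True)
        # Hessian-vector product: d/d(theta) [ grad(L)^T h ]
        dot = sum((g * hi).sum() for g, hi in zip(grads, h))
        hvps = torch.autograd.grad(dot, params)
        h = [vi + (1.0 - damping) * hi - hvp / scale
             for vi, hi, hvp in zip(v, h, hvps)]                  # one recursion step
    return h
```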

2.2. Our Method

Based on the aforementioned techniques, we propose our diffusion model-based SAR vehicle target detection dataset augmentation framework.
The goal of this framework is to generate images similar to those in the real SAR vehicle detection dataset from randomly sampled Gaussian noise using a diffusion model. By providing rotatable bounding box annotation information for the vehicle targets and prompt words describing the objects within the images, we can control the positions of the vehicle targets in the generated images and the objects in their backgrounds. Furthermore, we process the generated images through a series of methods, filtering out the images that are least effective at enhancing the detector's performance, thereby increasing the diversity of the SAR vehicle target detection dataset and improving the performance of the SAR vehicle target detector.
Our framework consists of a Stable Diffusion model, a ControlNet model trained on top of it, and a series of methods for processing and filtering the generated images. Figure 1 provides an overview of our entire work.
We first fine-tuned the Stable Diffusion model using the SAR image dataset we annotated with background object information, resulting in a Stable Diffusion model for SAR images. This model can generate SAR image slices containing corresponding objects from randomly sampled Gaussian noise, guided by prompt words.
Building on the fine-tuned Stable Diffusion model, we trained ControlNet using data from the FARAD part of the SIVED training set. During this process, we annotated the objects in these images with the same label types as in our annotated dataset and used the rotatable bounding box annotation information from SIVED to generate a target–background segmentation image for each slice. These images were added as additional conditions in the training of ControlNet to control the positions of the generated vehicle targets. Specifically, the white areas of the input target–background segmentation images correspond to the vehicle targets, while the black areas correspond to the background. The Stable Diffusion + ControlNet model trained in this step can generate images containing vehicle targets at specified locations, with backgrounds that include the objects described in the prompt words.
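For illustration, a target–background segmentation image can be rendered from rotated-box annotations roughly as follows (the box tuple format here is an assumption for the sketch; SIVED's actual annotation format may differ):

```python
import cv2
import numpy as np

def rotated_boxes_to_mask(boxes, size=512):
    """Render a target-background segmentation image: vehicle boxes white (255),
    background black (0). Each box is assumed to be (cx, cy, w, h, angle_deg)."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for cx, cy, w, h, angle in boxes:
        pts = cv2.boxPoints(((cx, cy), (w, h), angle))          # 4 corners of the rotated box
        cv2.fillPoly(mask, [np.round(pts).astype(np.int32)], 255)
    return mask
```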
We used rotatable bounding box annotations from the training set, along with randomly selected labels, as conditional inputs to control the model when generating corresponding samples. In practice, we found that, due to the instability of images produced by the generative model, directly adding these generated images to the training set could easily degrade the performance of a detector trained on the augmented set. To address this, we propose a series of methods for filtering and processing the generated images.
First, we filtered the generated samples based on clarity: some of the generated images were very blurry, and including them in the training set would lead to a significant decline in detector performance, while the influence function cannot effectively distinguish them. We believe this is because such blurry images often have a high loss value, and, by its definition, the influence function scales with the gradient of the sample's loss. As a result, these blurry images tend to have extremely large positive or extremely large negative influence values, preventing the influence function from working properly. Therefore, we applied the Laplacian operator to the generated image slices and used the variance of the filtered response as a measure of image blur, ultimately removing the blurriest 30% of the images.
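A minimal sketch of this clarity filter with OpenCV (the 30% threshold matches the setting above; file handling is simplified):

```python
import cv2

def laplacian_sharpness(path: str) -> float:
    """Variance of the Laplacian response; lower values indicate blurrier slices."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return float(cv2.Laplacian(img, cv2.CV_64F).var())

def drop_blurriest(paths, ratio=0.3):
    """Discard the blurriest `ratio` of generated slices (30% in our setting)."""
    scored = sorted(paths, key=laplacian_sharpness)     # ascending sharpness
    return scored[int(len(scored) * ratio):]
```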
Next, we applied histogram specification guided by average grayscale matching (HSGAGM). The images generated through the above steps are already quite similar to the real data in terms of shape and position, but slight differences remain in their histograms. Visually, the generated samples tend to be brighter, and such images would not enhance the detector effectively in practical use. For the SIVED training set, we used the Earth Mover's Distance [30] to calculate the distance between the histograms of any two images, and we also computed the absolute difference between their average grayscales. We then plotted a scatter plot with these two values as the axes. Our observations revealed a strong linear relationship between the Earth Mover's distance of the histograms of two SIVED images and the absolute difference of their average grayscales. Moreover, when the absolute difference of the average grayscales of two images was close to zero, the Earth Mover's distance between their histograms was also close to zero. Figure 2 illustrates the scatter plot of these two quantities, which indicates that when the average grayscales of two images are close, their histograms are also very similar.
This results in a method to adjust the histograms of the generated samples towards those of the real samples, guided by average grayscale matching:
For the real samples used to train the model in SIVED, we calculated the average grayscale of each sample, denoted as $\{m_i\}_{i=1}^{n}$, and then computed the mean $\mu$ and standard deviation $\sigma$ of $\{m_i\}_{i=1}^{n}$. Next, we standardized the average grayscales of all real samples as $m_i' = \frac{m_i - \mu}{\sigma}$.
For the generated samples, we similarly calculated the average grayscale of each sample, denoted as $\{a_i\}_{i=1}^{n_a}$, and computed the mean $\mu_a$ and standard deviation $\sigma_a$ of $\{a_i\}_{i=1}^{n_a}$. We then standardized the average grayscales of all generated samples as $a_i' = \frac{a_i - \mu_a}{\sigma_a}$.
By doing this, the distribution of the average grayscales of the generated samples was aligned with the distribution of the average grayscales of the real samples on the same scale, allowing them to be matched with each other. Consequently, for each generated sample, we searched the training set for the slice with the closest standardized average grayscale and adjusted its histogram to match the histogram of the corresponding slice in the training set.
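A sketch of the HSGAGM procedure described above, using scikit-image's `match_histograms` as a stand-in for the histogram specification step (array handling is simplified):

```python
import numpy as np
from skimage.exposure import match_histograms

def hsgagm(generated, real):
    """Adjust generated slices' histograms toward real slices, guided by
    standardized average-grayscale matching. Inputs are lists of 2-D grayscale arrays."""
    g_means = np.array([img.mean() for img in generated])
    r_means = np.array([img.mean() for img in real])
    g_std = (g_means - g_means.mean()) / g_means.std()      # standardize generated means
    r_std = (r_means - r_means.mean()) / r_means.std()      # standardize real means
    adjusted = []
    for i, img in enumerate(generated):
        j = int(np.argmin(np.abs(r_std - g_std[i])))         # closest real slice on the common scale
        adjusted.append(match_histograms(img, real[j]))      # specify histogram to that slice
    return adjusted
```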
Data filtering based on the influence function: We used the impact of a generated sample on all samples in the entire training set as a measure of its quality. We first trained a detector on the training set before augmentation and then calculated, for this detector, the sum of the influence of a generated sample over all samples in the training set:
$$I(z_{aug}, T) = \sum_{z_i \in T} \left[ -\nabla_\theta L(z_i, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_{aug}, \hat{\theta}) \right] = -\left[ \sum_{z_i \in T} \nabla_\theta L(z_i, \hat{\theta}) \right]^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_{aug}, \hat{\theta})$$
Here, $z_{aug}$ denotes a newly generated sample and $T$ denotes the original training set of the object detector. To reduce computational complexity, we slightly modified the computation of the influence function to avoid running the iterative procedure separately for each newly generated sample. We first randomly sample a sequence of samples from the training set, denoted as $\{z_{s_1}, z_{s_2}, \ldots, z_{s_t}\}$, and set $v = \sum_{z_i \in T} \nabla_\theta L(z_i, \hat{\theta})$. We then set $v^\top \tilde{H}_0^{-1} = v^\top$ and approximate $v^\top H_{\hat{\theta}}^{-1}$ using the following recursion:
$$v^\top \tilde{H}_j^{-1} = v^\top + v^\top \tilde{H}_{j-1}^{-1} \left(I - \nabla_\theta^2 L(z_{s_j}, \hat{\theta})\right)$$
The computed vector $\left[\sum_{z_i \in T} \nabla_\theta L(z_i, \hat{\theta})\right]^\top H_{\hat{\theta}}^{-1}$ is then multiplied directly by the gradient of each newly generated sample, which yields a quality score for each generated image. We discard the worst 50% of the samples according to this criterion (in our experience, this typically includes all samples with a detrimental influence and a few samples with an insignificantly beneficial influence) and randomly sample from the remaining images to use as augmentation training samples.
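A short sketch of this scoring step, assuming the vector $v^\top H_{\hat{\theta}}^{-1}$ has already been estimated (for example, with the iterative estimate sketched in Section 2.1.4; names are illustrative):

```python
import torch

def influence_scores(loss_fn, params, s_vec, generated):
    """Score each generated sample z_aug by I(z_aug, T) = -s^T grad L(z_aug),
    where s = H^{-1} [sum of training-set gradients] is precomputed once.
    Samples with the largest (most detrimental) scores are then discarded."""
    scores = []
    for z in generated:
        grads = torch.autograd.grad(loss_fn(z), params)
        scores.append(-sum((s * g).sum() for s, g in zip(s_vec, grads)).item())
    return scores
```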

2.3. Dataset Construction

2.3.1. Our Dataset

Original Data

We used the original data from FARAD and MiniSAR [8] mentioned above to construct a high-resolution SAR image dataset, and detailed information about these datasets is presented in Table 1.

Data Annotation Methods and Examples

The construction method of the dataset is as follows:
(1)
The images from the aforementioned datasets were sliced into 512 × 512 pixel patches with a step size of 400 pixels, resulting in a total of 9989 patches.
(2)
Each patch was annotated with the objects present in it. Specifically, we annotated seven categories of objects in the images: playground, hill, river, road, vehicle, building, and tree. Each patch may contain multiple labels from the above categories.
Specific examples of the annotations are shown in Figure 3.
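A minimal sketch of the slicing in step (1), assuming simple top-left-anchored windows (edge padding and clipping details are omitted):

```python
import numpy as np

def slice_image(img: np.ndarray, patch: int = 512, stride: int = 400):
    """Slice a large SAR image into 512 x 512 patches with a 400-pixel step."""
    patches = []
    h, w = img.shape[:2]
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append(img[y:y + patch, x:x + patch])
    return patches
```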

Data Distribution

Table 2 records all the labels in our annotated dataset along with the number and percentage of slices that contain each label.
In particular, to gain a better understanding of the environment in which the vehicle targets are situated, we counted the number of other labels in the slices that contained the ‘vehicle’ label. Table 3 shows the result.

2.3.2. SIVED

SIVED is a rotatable bounding box annotation dataset for SAR vehicle target detection. This dataset selects patches containing vehicle targets from FARAD and MiniSAR and stitches patches from MSTAR in groups of 4 × 4, resulting in a total of 1044 patches of size 512 × 512 that include 12,013 vehicle targets. SIVED manually annotated FARAD and MiniSAR data through visual interpretation, with the assistance of Google Earth data at corresponding locations to further distinguish between easily confused small buildings and vehicle targets. For the data from MSTAR, a semi-automatic annotation method based on object detection networks was employed. In this paper, we use the training set of SIVED as the base dataset for augmentation and evaluate model performance based on the detector’s performance on the SIVED test set. To facilitate the subsequent training and generation processes using images from SIVED, we also performed further annotations on the SIVED training set following the methods described in the previous section.

3. Experimental Results and Analysis

3.1. Train and Infer

Our training process is divided into two steps.
(1) We performed full fine-tuning of the Stable Diffusion model on the dataset we constructed before. Given the similarities between optical remote sensing images and SAR images in terms of perspective, object types, and target sizes, we chose to further fine-tune the Stable Diffusion model trained on optical remote sensing images (Yuan et al. [31], which is a fine-tuned version of Stable Diffusion v1.4) to obtain a Stable Diffusion model for SAR image generation.
In particular, to ensure fairness in subsequent experiments, we removed the samples containing the “vehicle” label from the dataset and replaced them with the training set from SIVED because both our dataset and SIVED were constructed based on FARAD and MiniSAR. This ensures that all slices containing vehicle samples in the training set for the Stable Diffusion fine-tuning come from the training set of SIVED. We trained the model on one Nvidia V100-SXM2-32GB from https://www.autodl.com/ (accessed on 14 January 2024), using a batch size of 1, resulting in a Stable Diffusion model suitable for SAR images.
Samples of SAR images generated using the fine-tuned Stable Diffusion model are shown in Figure 4.
It can be seen that our fine-tuned Stable Diffusion model demonstrates a good understanding of prompts and objects; however, it still has shortcomings in the details of the generated images. In images generated directly with our fine-tuned Stable Diffusion model, the vehicles often have incorrect structures, blurry details, and overlapping instances. We believe this is due to the small proportion of each vehicle within the slice, along with the relatively vague descriptions in the prompt, which lack specific information about the vehicles' locations, orientations, and the number of vehicles in the slice. This makes it difficult for Stable Diffusion to learn the connection between the vehicles in the images and the input prompt "vehicle". Additionally, since Stable Diffusion's prompts can only control the type of object generated and not its specific location, the positions of the vehicles in the generated images are completely random. This randomness poses a significant difficulty for subsequent work if these data are used directly. Therefore, being able to input information about the vehicles' positions and orientations into the network is crucial for applying the generative model to vehicle target detection data augmentation. For this reason, we continue to fine-tune the model using ControlNet.
(2) Further Fine-Tuning for Vehicle Target Detection Based on ControlNet: In this part, we further fine-tune Stable Diffusion using ControlNet to improve its performance in generating vehicle details. To ensure fairness in subsequent experiments, all vehicle samples used during the training of ControlNet came from the SIVED training set and did not include data from the validation or test sets. Since vehicle target detection in the MSTAR data is relatively easy, meaning augmentation of the MSTAR data is unnecessary, we trained ControlNet only on the FARAD-sourced data in the SIVED training set, together with the target–background segmentation images generated from their rotatable bounding box annotations. We trained the corresponding model on an NVIDIA GeForce RTX 3090 GPU with a batch size of 2 and a learning rate of 5 × 10⁻⁶ for 15,000 steps. Samples of images at different stages of the whole augmentation process are shown in Figure 5.
It can be observed that the model is able to generate vehicle targets effectively along with their complex backgrounds. When trees are generated above the vehicle’s bounding box, the model can accurately produce vehicles that are partially obscured by the trees. Additionally, in situations where vehicles are closely arranged, the model can clearly differentiate each generated vehicle target. After applying histogram specification based on average grayscale matching, the generated targets appear visually closer to reality compared to before the adjustment. In terms of the appearance and arrangement of the generated targets, there is also a good level of diversity.

3.2. Computational Resources

As mentioned earlier, our approach augments the original training set by incorporating samples generated by the generative model after filtering and processing. To obtain these newly added samples, it is necessary to fine-tune the Stable Diffusion model, train the ControlNet, generate samples with the Stable Diffusion + ControlNet model, and compute an influence function for each detector to be enhanced. These steps require extra computational resources to achieve the performance improvements on the detectors. Table 4 shows the computational resources used for each part.

3.3. Experiment Settings

This paper selects five different rotatable bounding box object detectors that performed well on the SIVED dataset as experimental subjects for data augmentation. They are as follows:
  • Rotated Faster R-CNN [32]: This is a two-stage rotatable bounding box object detection algorithm that extracts feature maps from the target images using a backbone. The Region Proposal Network (RPN) screens all potential bounding box locations, and through ROI pooling, fixed-size features are obtained for each location’s box. Finally, classification and regression are performed to determine the box size and rotation angle.
  • Gliding Vertex [33]: Gliding Vertex is a regression strategy for rotatable bounding boxes. It predicts the positions of the four corners of the rotatable box by estimating the movement distances of the four corner points of a horizontal box along its edges. The entire detector framework is based on Rotated Faster R-CNN.
  • R3Det [34]: R3Det is a single-stage object detector based on RetinaNet that combines the advantages of horizontal and rotatable boxes. It features a Feature Refinement Module (FRM) that utilizes the location information of refined anchors to reconstruct the feature map for feature alignment.
  • KLD [35]: KLD is another regression strategy for rotatable bounding boxes. It converts the rotatable bounding box into a two-dimensional Gaussian distribution and calculates the KL divergence between the Gaussian distributions as the regression loss. This paper conducts experiments by adding the KLD structure to R3Det.
  • Oriented Reppoints [36]: Oriented Reppoints is a single-stage object detector based on an adaptive point representation learning method, which can generate adaptive point sets for geometrical structures in any orientation. It accurately classifies and locates targets using three directed transformation functions.
The code for the object detectors we used is based on the mmrotate [37] detection framework, and all detectors use ResNet50 as the backbone. We trained all five detectors on one NVIDIA GeForce RTX 3090 GPU using the SGD optimizer with the following hyperparameters: an initial learning rate of 0.01, an optimizer momentum of 0.937, a weight decay set to 0.0005, and a training batch size of 8, for a total of 180 epochs. The learning rate was reduced to 0.1 times the initial learning rate after the 90th epoch, and the training started with a warm-up of 2000 steps at one-third of the learning rate. (We used a longer warm-up period because we found that, after adding augmentation samples, some models, such as R3Det, became more unstable at the beginning of training, leading to failures. This situation can be solved by setting a longer warm-up period. We believe this is because the newly added samples have greater background diversity compared to real samples, resulting in a significant difference in the background distribution between the new and real data, which leads to this training instability.)
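A hedged sketch of this schedule in the mmcv-style config format used by mmrotate 0.x (key names follow common mmdetection conventions and may differ across versions; this is not the authors' exact configuration file):

```python
# Illustrative training-schedule snippet in mmcv/mmrotate config style.
optimizer = dict(type='SGD', lr=0.01, momentum=0.937, weight_decay=0.0005)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=2000,        # longer warm-up to stabilize training with augmented samples
    warmup_ratio=1.0 / 3,     # start at one-third of the base learning rate
    step=[90])                # drop the learning rate to 0.1x after epoch 90
runner = dict(type='EpochBasedRunner', max_epochs=180)
data = dict(samples_per_gpu=8, workers_per_gpu=2)
```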

3.4. Visual Analysis

In this section, the effectiveness of augmentation data for the object detector is qualitatively analyzed by visualizing the specific results of the experimental process.

3.4.1. Visual Analysis on Diversity

In this part, we visualize the generated samples using the same target–background segmentation information and prompts but with different random seeds. As shown in Figure 6, the images differ in both vehicle targets and backgrounds due to the different random seeds used during generation, demonstrating good diversity. Additionally, since the prompts are consistent, most images in the same column have similar object types in the backgrounds, but there are instances where the images generated by diffusion do not adhere to the prompts. This is attributed to the randomness inherent in the generation process.
In summary, the randomness in the diffusion backward process provides significant diversity to the generated vehicle targets and backgrounds.

3.4.2. Visual Analysis on Detection Performance

In this section, we visualize the detection results of the detector before and after enhancement, along with the ground-truth objects. We used Oriented RepPoints as the detector, which achieved an improvement of 2.1% in $mAP_{75}$ after the training set augmentation. We visualized the detection performance on the SIVED test set before and after data augmentation, with the results shown in Figure 7. Upon observation, we found that in most images, the augmented detector shows a reduction in the number of false positives for objects that are visually similar to vehicles, while the increase in detected ground-truth targets is relatively minor. We believe this is primarily because the significant enhancement of background diversity provides a large number of false-positive-like samples to the dataset, effectively contributing to the reduction in the model's false-positive rate.

3.5. Metrics

We used mAP (mean average precision) and recall as metrics to evaluate the performance of the detector. The definitions of these evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP}; \qquad R = \frac{TP}{TP + FN}; \qquad mAP = \int_0^1 P(R)\, dR$$
Here, TP (True Positive) refers to the detected true targets, FP (False Positive) refers to the detected false targets, and FN (False Negative) refers to the true targets that were not detected.

3.6. Main Results

We took the checkpoint of the epoch that performed best on the validation set during the entire training process and tested its performance on the test set. We used mAP and recall at IoU = 0.75 as metrics to measure detector performance. Due to randomness in the selection process, we conducted three experiments for each configuration and took the best result among them. We also trained several image-to-image generation networks on our dataset as baseline methods to compare with our approach. These baseline models also utilize target–background segmentation information to generate SAR images, and we randomly sampled from these generated images to augment the training set. We conducted experiments by adding different numbers of samples to the training set. The best results are shown in Table 5.
We can see that our method achieved varying degrees of improvement across the five strong baseline detectors, most notably a 6.6% increase in $mAP_{75}$ for Rotated Faster R-CNN. Additionally, for most detectors, our method achieved better results than the baseline image generation models. This indicates that further selecting and processing the samples produced by the generative model is effective for data augmentation.

3.7. Ablation Test

Because Stable Diffusion cannot generate SAR images before fine-tuning on our dataset and cannot generate vehicle targets effectively before the ControlNet fine-tuning, the images generated at these two stages cannot be used as augmented data. Therefore, our ablation experiments focus on the processing steps applied to the model's generated samples. To test the effectiveness of each step in the data processing part of our proposed augmentation framework, we conducted ablation experiments for each component step, using the Oriented RepPoints detector and adding 25% generated samples to the training set. Because there is randomness involved in the selection process, we conducted three experiments for each configuration and took the best result. Table 6 shows the experimental results for the various ablation combinations.
We can see that each method contributes to enhancing the performance of the detector. Among the three methods, we believe that histogram specification guided by average grayscale matching plays the most significant role; without it, the samples cannot effectively enhance the model's performance. It is worth noting that even when the generated samples are selected merely at random without filtering, some samples that positively impact the model may still be chosen, which can still lead to beneficial enhancement effects.

3.8. Experiments on the Number of Added Samples

In our experiments, we found that the detectors achieved different performance with varying amounts of augmentation samples. We therefore experimented with adding different quantities of samples to the training set and observed the performance of each detector. Due to the randomness involved in the selection process, we conducted three experiments for each configuration and took the best result. For our method, we added generated samples to the training set at evenly spaced intervals ranging from 6.25% to 50% of the original dataset size and tested the performance of the detectors. For the baseline methods, due to time and computational resource constraints, we only experimented with adding samples equivalent to 12.5% and 25% of the original training set size. Table 7 lists the changes in performance of the five detectors as different quantities of samples were added.
We can see that the five detectors are more likely to achieve performance improvements when the added generated samples range from 6.25% to 25%. After adding more samples, the performance of the detectors noticeably decreases. This may be because, despite the fact that the generated samples are quite similar to the real samples, they still have a different distribution. Adding too many generated samples can cause the model to become biased toward the generated samples, resulting in decreased performance on the real samples.

3.9. Experiments on Influence Function of Different Detectors

In this section, we investigate the effect of samples selected using the influence functions of other detectors. For each detector, we select samples using the influence functions of all five detectors, with the number of added samples set to the value that gave that detector its best performance in the previous section, and we compare the resulting performance. Table 8 shows the experimental results.
It can be seen that, due to the different structures of various detectors, the samples selected using the influence function of a particular detector do not always perform well on other detectors. This further demonstrates the necessity of using the influence function of each detector separately to select samples.
As mentioned earlier, the Gliding Vertex we used is a modified version of Rotated Faster R-CNN, while the KLD we used is a modified version of R3Det. This means that most of the structure of Gliding Vertex is similar to that of Rotated Faster R-CNN, and KLD is similar to R3Det. In this experiment, we observed that in most cases, the samples filtered by the influence function of a detector with a structure similar to the detector being enhanced performed better than those filtered by the influence functions of other detectors. Since the calculation of the influence function involves second derivatives with respect to the model parameters, this result aligns with what one would expect from the definition of the influence function.

4. Discussion

We first annotated a high-resolution SAR image dataset used for training Stable Diffusion. This dataset can be used not only for SAR vehicle target detection dataset augmentation but also for tasks such as SAR image classification and SAR ground object classification.
To address the scarcity of SAR vehicle detection data caused by the limited availability of high-resolution SAR imagery and high annotation costs, we proposed an effective, flexible, and scalable SAR data generation method. Its effectiveness is reflected in the ability of the generated samples to enhance the performance of the target detector on an existing SAR vehicle target detection dataset. Its flexibility lies in the capability to specify the positions of vehicle targets in the samples and to control the types of objects present in the slices through prompt words, allowing for the generation of complex backgrounds and enhancing the diversity of the vehicle target detection dataset. Its scalability comes from the large community surrounding Stable Diffusion, whose existing methods enable further research built on our work and open up more applications. Our work demonstrates the feasibility of using generated data to augment the training set for SAR vehicle object detection.

Limitations

Due to resource limitations, the research presented in this paper was conducted using image slices of size 512 × 512, adhering to the resolutions of FARAD and MiniSAR. In future work, we will explore methods for using diffusion models to augment datasets with remote sensing images of varying sizes and resolutions. Another focus of our upcoming research is how to better control the generated samples, not only by using object labels but also by employing more precise text descriptions, such as specifying the size and type of buildings, the direction of roads, or the density of trees.
In terms of the quality of the generated images, although the detector-enhancement experiments show that the generated vehicle targets are quite similar to real targets, the structures of houses and roads in the generated images often exhibit significant flaws. The instability in the quality of the generated images is also an issue that needs to be addressed in the future. Regarding the distribution of the generated images, the model tends to favor generating the structures most common in the training set; for structures that appear less frequently, it is difficult to control the model's output even with prompts. This is a common issue with generative models, and exploring whether the characteristics of SAR images can be exploited to alleviate it is one of our future research directions. In this paper, we only controlled the positions of the generated vehicle targets; controlling the positions of targets other than vehicles is also an important area that needs further research. Additionally, since training diffusion models requires more data than training GANs, our approach may encounter challenges when applied to fields with fewer images than are available for SAR vehicle detection.

5. Conclusions

We proposed a framework for augmenting SAR vehicle target detection datasets based on Diffusion models. The highlight of this framework is the ability to control the objects within the scene through prompt words, as well as to control the position of target generation using segmentation information for both the target and background. Additionally, by employing three different data processing methods, we achieved augmentation data of higher quality. Experiments on the SIVED dataset demonstrate that this method can generate more realistic images containing SAR vehicle targets, thereby enhancing the diversity of SAR vehicle target detection images and improving the robustness of deep learning-based SAR vehicle target detectors.

Author Contributions

K.W. and Z.P. conceived and designed the experiments; Z.P. contributed materials and computing resources; K.W. performed the experiments; K.W. and Z.W. analyzed the data; K.W. wrote the original draft; Z.P. checked the experimental data, examined the experimental results, and revised the original draft. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Youth Innovation Promotion Association, CAS, under funding number 2022119.

Data Availability Statement

The dataset has been shared on GitHub for researchers. The link is https://github.com/WKA2000/SVDDD-Dataset/tree/main (accessed on 14 January 2024).

Acknowledgments

The authors would like to thank Sandia National Laboratory (USA) for providing public MiniSAR and FARAD data online, the U.S. Air Force for providing MSTAR data online, and Lin et al. [7] for the SIVED dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Z.; Zhao, L.; Ji, K.; Kuang, G. A domain adaptive few-shot SAR ship detection algorithm driven by the latent similarity between optical and SAR images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
  2. He, Q.; Zhao, L.; Ji, K.; Kuang, G. SAR target recognition based on task-driven domain adaptation using simulated data. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  3. Keydel, E.R.; Lee, S.W.; Moore, J.T. MSTAR extended operating conditions: A tutorial. Algorithms Synth. Aperture Radar Imag. III 1996, 2757, 228–242. [Google Scholar]
  4. Long, Y.; Jiang, X.; Liu, X.; Zhang, Y. SAR ATR with rotated region based on convolution neural network. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1184–1187. [Google Scholar]
  5. Zhang, X.; Chai, X.; Chen, Y.; Yang, Z.; Liu, G.; He, A.; Li, Y. A novel data augmentation method for sar image target detection and recognition. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3581–3584. [Google Scholar]
  6. Sun, Y.; Wang, W.; Zhang, Q.; Ni, H.; Zhang, X. Improved YOLOv5 with transformer for large scene military vehicle detection on SAR image. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 87–93. [Google Scholar]
  7. Lin, X.; Zhang, B.; Wu, F.; Wang, C.; Yang, Y.; Chen, H. SIVED: A SAR Image Dataset for Vehicle Detection Based on Rotatable Bounding Box. Remote Sens. 2023, 15, 2825. [Google Scholar] [CrossRef]
  8. Sandia National Laboratory. Complex SAR Data. Available online: https://www.sandia.gov/radar/pathfinder-radar-isr-and-synthetic-aperture-radar-sarsystems/complex-data/ (accessed on 12 November 2023).
  9. Yang, W.; Hou, Y.; Liu, L.; Liu, Y.; Li, X. SARATR-X: A foundation model for synthetic aperture radar images target recognition. arXiv 2024, arXiv:2405.09365. [Google Scholar]
  10. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  11. Kong, J.; Zhang, F. SAR target recognition with generative adversarial network (GAN)-based data augmentation. In Proceedings of the 2021 13th International Conference on Advanced Infocomm Technology (ICAIT), Yanji, China, 15–18 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 215–218. [Google Scholar]
  12. Mao, C.; Huang, L.; Xiao, Y.; He, F.; Liu, Y. Target recognition of SAR image based on CN-GAN and CNN in complex environment. IEEE Access 2021, 9, 39608–39617. [Google Scholar] [CrossRef]
  13. Cui, Z.; Zhang, M.; Cao, Z.; Cao, C. Image data augmentation for SAR sensor via generative adversarial nets. IEEE Access 2019, 7, 42255–42268. [Google Scholar] [CrossRef]
  14. Zhu, M.; Zang, B.; Ding, L.; Lei, T.; Feng, Z.; Fan, J. LIME-based data selection method for SAR images generation using GAN. Remote Sens. 2022, 14, 204. [Google Scholar] [CrossRef]
  15. Du, S.; Hong, J.; Wang, Y.; Xing, K.; Qiu, T. Multi-category SAR images generation based on improved generative adversarial network. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4260–4263. [Google Scholar]
  16. Kwon, H.; Jeong, S.; Kim, S.; Lee, J.; Sohn, K. Deep-learning based SAR Ship Detection with Generative Data Augmentation. J. Korea Multimed. Soc. 2022, 25, 1–9. [Google Scholar]
  17. Huang, Y.; Mei, W.; Liu, S.; Li, T. Asymmetric training of generative adversarial network for high fidelity SAR image generation. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1576–1579. [Google Scholar]
  18. Wu, B.; Wang, H.; Zhang, C.; Chen, J. Optical-to-SAR Translation Based on CDA-GAN for High-Quality Training Sample Generation for Ship Detection in SAR Amplitude Images. Remote Sens. 2024, 16, 3001. [Google Scholar] [CrossRef]
  19. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  20. Qosja, D.; Wagner, S.; O’Hagan, D. SAR Image Synthesis with Diffusion Models. In Proceedings of the 2024 IEEE Radar Conference (RadarConf24), Denver, CO, USA, 6–10 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  21. Zhang, X.; Li, Y.; Li, F.; Jiang, H.; Wang, Y.; Zhang, L.; Zheng, L.; Ding, Z. Ship-Go: SAR ship images inpainting via instance-to-image generative diffusion models. ISPRS J. Photogramm. Remote Sens. 2024, 207, 203–217. [Google Scholar] [CrossRef]
  22. Zhou, J.; Xiao, C.; Peng, B.; Liu, Z.; Liu, L.; Liu, Y.; Li, X. DiffDet4SAR: Diffusion-based aircraft target detection network for SAR images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  23. Perera, M.V.; Nair, N.G.; Bandara, W.G.C.; Patel, V.M. SAR despeckling using a denoising diffusion probabilistic model. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  24. Seo, M.; Oh, Y.; Kim, D.; Kang, D.; Choi, Y. Improved flood insights: Diffusion-based SAR to EO image translation. arXiv 2023, arXiv:2307.07123. [Google Scholar]
  25. Guo, Z.; Liu, J.; Cai, Q.; Zhang, Z.; Mei, S. Learning SAR-to-Optical Image Translation via Diffusion Models with Color Memory. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14454–14470. [Google Scholar] [CrossRef]
  26. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3836–3847. [Google Scholar]
  27. Koh, P.W.; Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the International Conference on Machine Learning, (PMLR: 2017), Sydney, Australia, 6–11 August 2017; pp. 1885–1894. [Google Scholar]
  28. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Rubner, Y.; Tomasi, C.; Guibas, L.J. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  31. Yuan, Z. Stable Diffusion for Remote Sensing Image Generation. Available online: https://github.com/xiaoyuan1996/Stable-Diffusion-for-Remote-Sensing-Image-Generation (accessed on 27 October 2024).
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  33. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  34. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 19–21 May 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  35. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
  36. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
  37. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 7331–7334. [Google Scholar]
  38. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30, 465–476. [Google Scholar]
  39. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  40. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
Figure 1. Overview of our framework, which consists of a fine-tuned Stable Diffusion model, a ControlNet model trained on top of it, a clarity-based sample filter, histogram specification guided by average grayscale matching (HSGAGM), and a sample filter based on an influence function.
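The clarity-based sample filter named in this caption can be prototyped with a standard sharpness proxy. The sketch below uses the variance of the Laplacian response and an illustrative threshold; the paper's exact clarity metric and cutoff are not reproduced here, so both should be read as assumptions.

```python
import numpy as np
from scipy.ndimage import laplace

def clarity_score(img: np.ndarray) -> float:
    """Sharpness proxy: variance of the Laplacian response.
    This is an assumed stand-in for the paper's clarity measure."""
    return float(laplace(img.astype(np.float64)).var())

def clarity_filter(images, threshold=50.0):
    """Keep generated samples whose clarity score exceeds a threshold.
    The threshold value is purely illustrative."""
    return [img for img in images if clarity_score(img) >= threshold]
```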
Figure 2. The horizontal axis represents the difference in average grayscale between the two images of a pair, while the vertical axis represents the Earth Mover's Distance between their grayscale histograms. We randomly selected 1000 image pairs for the plot.
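The two quantities plotted in Figure 2 can be computed per image pair as in the sketch below, which assumes 8-bit grayscale slices and uses SciPy's 1-D Wasserstein distance on 256-bin histograms; the helper names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def gray_stats(img: np.ndarray):
    """Mean grayscale value and normalized 256-bin histogram of a uint8 image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    return float(img.mean()), hist / hist.sum()

def pair_metrics(img_a: np.ndarray, img_b: np.ndarray):
    """Return |difference in average grayscale| and the EMD between histograms,
    i.e., the horizontal and vertical coordinates of one point in Figure 2."""
    mean_a, hist_a = gray_stats(img_a)
    mean_b, hist_b = gray_stats(img_b)
    bins = np.arange(256)
    emd = wasserstein_distance(bins, bins, hist_a, hist_b)
    return abs(mean_a - mean_b), emd
```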
Figure 3. Annotation examples. The labels corresponding to the images above are (a) building; (b) hill building; (c) playground; (d) river tree; (e) road vehicle building tree; and (f) road vehicle tree.
Figure 4. The SAR images above, generated by the fine-tuned Stable Diffusion model, correspond to the following labels: (a) tree; (b) tree building; (c) road vehicle tree; (d) road vehicle building tree; (e) vehicle tree; and (f) road vehicle building tree. It can be observed that the generated images do not adhere well to the prompt "vehicle": the vehicles are distorted and sometimes overlap.
Figure 5. Image samples at different stages of the entire process. Column (a) contains real samples from the SIVED training set; column (b) shows the segmentation images of targets and backgrounds generated based on the rotatable bounding box annotations of the samples in column (a); column (c) presents images generated by the Stable Diffusion + ControlNet model using the images in column (b); and column (d) consists of samples obtained through histogram specification processing based on average grayscale matching. From a visual perspective, compared to the real samples, the samples in column (c) exhibit higher brightness.
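Column (d) of Figure 5 is produced by histogram specification guided by average grayscale matching (HSGAGM). A minimal sketch of that idea follows: the reference real image is the one whose mean gray level is closest to the generated sample's, and the generated image's histogram is then specified toward it. The function names and the nearest-mean selection rule are our assumptions rather than the authors' code.

```python
import numpy as np

def match_histogram(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Classic histogram specification: remap source gray levels so that the
    source CDF follows the reference CDF (both uint8, single channel)."""
    src_hist, _ = np.histogram(source, bins=256, range=(0, 256))
    ref_hist, _ = np.histogram(reference, bins=256, range=(0, 256))
    src_cdf = np.cumsum(src_hist) / source.size
    ref_cdf = np.cumsum(ref_hist) / reference.size
    mapping = np.searchsorted(ref_cdf, src_cdf).clip(0, 255).astype(np.uint8)
    return mapping[source]

def hsgagm(generated: np.ndarray, real_pool: list) -> np.ndarray:
    """Pick the real training image whose average grayscale is closest to the
    generated sample, then specify the generated histogram toward it."""
    means = np.array([img.mean() for img in real_pool])
    ref = real_pool[int(np.argmin(np.abs(means - generated.mean())))]
    return match_histogram(generated, ref)
```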
Figure 6. Comparison of images generated by the model using the same segmentation information and prompts. The first row of each column shows the image of the segmentation information, followed by samples generated based on this image under different random seeds. The labels used to generate these images are as follows: (a) vehicle tree; (b) road vehicle building tree; (c) vehicle building tree; (d) vehicle building tree; (e) road vehicle building tree.
Figure 7. Comparison of detection results before and after augmentation, using the Oriented Reppoints detector. The results shown are from the detector with the best performance across the experimental configurations, with all images sourced from the SIVED test set. Columns (a,c) show the results of the detector before augmentation, while columns (b,d) show the results of the augmented detector. Red boxes mark targets detected by the detector, and blue boxes mark the ground-truth targets.
Table 1. Source of the original data, geographical location, bands, polarization, and resolution information.
| Dataset | Source | Location | Bands | Polarization | Resolution |
| FARAD | Sandia National Laboratory | Albuquerque, NM, USA | Ka/X | VV/HH | 0.1 m × 0.1 m |
| MiniSAR | Sandia National Laboratory | Albuquerque, NM, USA | Ku | - | 0.1 m × 0.1 m |
Table 2. Data distribution.
| Label | Playground | Hill | River | Road | Vehicle | Building | Tree |
| Number | 68 | 101 | 710 | 2411 | 2122 | 3784 | 5745 |
| Percentage | 0.7% | 1.0% | 7.1% | 24.1% | 21.2% | 37.9% | 57.5% |
A slice can contain multiple labels, so the percentages do not sum to 100%.
Table 3. Data distribution of slices with label ‘vehicle’.
| Label | Playground | Road | Building | Tree |
| Number | 2 | 697 | 1076 | 1429 |
| Percentage | 0.08% | 30.1% | 46.6% | 61.8% |
A slice can contain multiple labels, so the percentages do not sum to 100%.
Table 4. Computational resources.
| Task | GPU | Time |
| Fine-tuning Stable Diffusion | 1 × V100 | 11 days |
| Training ControlNet | 1 × 3090 | 1 day |
| Generating an image | 1 × 3090 | A few seconds |
| Calculating $[\sum_i \nabla_\theta L(z_i, \hat{\theta})^{T}] H_{\hat{\theta}}^{-1}$ for one detector | 1 × 3090 | 0.5 days |
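The quantity in the last row of Table 4 is the reusable part of the influence function of Koh and Liang [27]: the summed gradient over a reference set multiplied by the inverse Hessian, which is afterwards dotted with the gradient of each candidate generated sample to score it. The PyTorch sketch below approximates the inverse-Hessian-vector product with the LiSSA recursion; the hyperparameters (damping, scale, steps) and all function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def grad_sum(model, loss_fn, ref_loader, params):
    """v = sum_i grad_theta L(z_i, theta_hat) over a reference set."""
    v = [torch.zeros_like(p) for p in params]
    for x, y in ref_loader:
        grads = torch.autograd.grad(loss_fn(model(x), y), params)
        v = [vi + gi.detach() for vi, gi in zip(v, grads)]
    return v

def hvp(loss, params, vec):
    """Hessian-vector product via double backpropagation."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * w).sum() for g, w in zip(grads, vec))
    return [h.detach() for h in torch.autograd.grad(dot, params)]

def lissa_inverse_hvp(model, loss_fn, train_loader, params, v,
                      damping=0.01, scale=25.0, steps=100):
    """Approximate H^{-1} v with the LiSSA recursion used by Koh and Liang."""
    estimate = [vi.clone() for vi in v]
    data_iter = iter(train_loader)
    for _ in range(steps):
        x, y = next(data_iter)  # assumes the loader yields at least `steps` batches
        hv = hvp(loss_fn(model(x), y), params, estimate)
        estimate = [vi + (1 - damping) * ei - hi / scale
                    for vi, ei, hi in zip(v, estimate, hv)]
    return [ei / scale for ei in estimate]

def influence_score(model, loss_fn, z_new, inv_hvp, params):
    """Score one candidate generated sample z_new = (x, y) as
    -[H^{-1} v]^T grad_theta L(z_new, theta_hat)."""
    x, y = z_new
    grads = torch.autograd.grad(loss_fn(model(x), y), params)
    return -sum((ih * g).sum() for ih, g in zip(inv_hvp, grads)).item()
```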
Table 5. The performance of the five detectors before data augmentation and after data augmentation using the aforementioned methods.
| Metric | Method | Rotated Faster R-CNN | Gliding Vertex | R3Det | KLD | Oriented Reppoints |
| Recall | No Augmentation | 79.1% | 76.0% | 76.6% | 82.3% | 82.0% |
| | BicycleGAN [38] | 79.3% | 76.0% | 77.2% | 81.7% | 82.4% |
| | Pix2pix [39] | 79.6% | 75.1% | 78.0% | 81.9% | 83.1% |
| | Palette [40] | 79.0% | 75.4% | 78.5% | 82.7% | 82.8% |
| | Our Method | 81.1% | 77.6% | 78.2% | 83.2% | 83.6% |
| mAP75 | No Augmentation | 64.7% | 63.6% | 64.3% | 74.7% | 72.4% |
| | BicycleGAN | 64.2% | 63.0% | 65.4% | 74.6% | 72.6% |
| | Pix2pix | 65.5% | 63.8% | 64.8% | 75.0% | 73.0% |
| | Palette | 63.8% | 63.0% | 65.9% | 75.7% | 72.8% |
| | Our Method | 71.3% | 64.4% | 65.4% | 75.7% | 74.5% |
The bolded part is the highest value in the column.
Table 6. Ablation test.
| Clarity Filter | HSGAGM | Influence Function Filter | Recall | mAP75 |
| | | | 83.6% | 74.5% |
| | | | 83.4% | 74.1% |
| | | | 83.1% | 73.6% |
| | | | 83.6% | 74.0% |
| | | | 82.9% | 72.6% |
| | | | 83.2% | 73.5% |
| | | | 83.0% | 73.3% |
| | | | 82.8% | 72.9% |
✓ indicates that the image generated in the corresponding experiment for that row has passed through that step.
Table 7. mAP75 of the five detectors as different quantities of samples were added.
| Method | Num_n/Num_o * | Rotated Faster R-CNN | Gliding Vertex | R3Det | KLD | Oriented Reppoints |
| No Augmentation | 0 | 64.7% | 63.6% | 64.3% | 74.7% | 72.4% |
| Our Method | 6.25% | 65.5% | 63.3% | 64.2% | 74.9% | 73.0% |
| | 12.5% | 71.3% | 64.4% | 65.1% | 75.7% | 73.7% |
| | 18.75% | 66.8% | 63.7% | 65.4% | 75.2% | 73.3% |
| | 25% | 64.7% | 63.5% | 64.6% | 75.0% | 74.5% |
| | 31.25% | 63.5% | 62.7% | 64.7% | 75.2% | 73.1% |
| | 37.5% | 62.6% | 63.0% | 63.3% | 74.3% | 73.2% |
| | 43.75% | 61.6% | 61.9% | 63.3% | 73.4% | 72.3% |
| | 50% | 62.1% | 59.4% | 63.0% | 73.5% | 73.4% |
| BicycleGAN | 12.5% | 64.2% | 63.0% | 65.4% | 74.6% | 72.6% |
| | 25% | 63.6% | 62.6% | 65.1% | 74.3% | 72.6% |
| Pix2Pix | 12.5% | 65.5% | 63.8% | 64.8% | 75.0% | 72.7% |
| | 25% | 65.0% | 63.6% | 64.0% | 74.6% | 73.0% |
| Palette | 12.5% | 63.8% | 63.0% | 65.9% | 75.7% | 72.4% |
| | 25% | 63.0% | 63.0% | 65.4% | 75.2% | 72.8% |
* The ratio of new samples to real samples. The bolded part is the highest value in the column.
Table 8. mAP75 of each detector after the training set was augmented with samples selected using the influence functions of other detectors. Each column reports the performance of the corresponding detector when the dataset was augmented with samples selected by a different detector's influence function.
| Influence Function | Rotated Faster R-CNN | Gliding Vertex | R3Det | KLD | Oriented Reppoints |
| Rotated Faster R-CNN | 71.3% | 62.7% | 64.8% | 73.8% | 73.4% |
| Gliding Vertex | 71.3% | 64.4% | 65.2% | 74.4% | 73.3% |
| R3Det | 64.2% | 64.2% | 65.4% | 75.2% | 73.1% |
| KLD | 72.1% | 63.5% | 65.4% | 75.7% | 73.6% |
| Oriented Reppoints | 64.6% | 63.7% | 64.5% | 74.6% | 74.5% |
The bolded part is the highest value in the column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
