1. Introduction
Terahertz (THz) waves easily penetrate clothing; compared with X-rays, their photon energy is relatively low, so they are non-ionizing and harmless to biological tissue, and compared with the microwave band, they achieve a higher imaging resolution [1,2]. Owing to these advantages, THz imaging systems have attracted extensive attention in security inspection, non-destructive testing (NDT) and other applications [3,4].
In the field of security inspection, THz imaging can be divided into active imaging and passive imaging. Active systems use a transmitter to emit electromagnetic waves in the terahertz band and reconstruct images from the target echoes, whereas passive systems exploit the natural radiation emitted by the targets themselves; thus, the two types of systems offer different advantages in imaging performance and scene visualization [5]. For the hidden object detection task, the radiation brightness temperatures of the human body and of a concealed object usually differ significantly, so target boundaries are more distinct in THz passive images, which benefits detection. Furthermore, passive systems have additional advantages regarding safety concerns and privacy issues in security inspection applications.
The dataset plays a key role in model training and generalization performance. THz security inspection equipment has only recently been deployed in a few security inspection scenarios, and at present there is no public dataset available for training models to identify specific dangerous targets, since such data involve sensitive public-safety issues. Therefore, using data augmentation technology to fully exploit the existing dataset is a necessary means of improving model performance. Most existing data augmentation techniques apply random cropping, splicing, rotation and other operations to the original images during training [6,7]. However, such techniques cannot generate genuinely unseen images and do not fundamentally solve the problem; especially in the few-shot case, the resulting improvement in network performance is limited.
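For concreteness, a conventional pipeline of this kind can be sketched as follows; this is a minimal illustration using torchvision, and the specific transforms and parameters are assumptions for demonstration, not settings from the cited works.

```python
# Minimal sketch of a conventional augmentation pipeline (illustrative
# transform choices, not the settings of any cited work).
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),       # random mirror
    T.RandomRotation(degrees=10),        # small random rotation
    T.RandomCrop(size=224, padding=8),   # random crop with zero-padding
    T.ToTensor(),                        # PIL image -> float tensor
])
```

Because every such transform only rearranges pixels already present in the training images, no genuinely new sample content is created.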
With the continuous progress of deep-learning algorithms, image generation methods based on deep convolutional neural networks provide a solution for data augmentation. The generative adversarial network (GAN) [8] is a typical image generation method: GAN models learn the feature distribution of real data and improve the quality of generated images through adversarial game play between the generator and the discriminator. Early GAN models performed feature extraction through fully connected layers, which ignored the spatial correlation of the image, so the quality of the generated images was poor. The deep convolutional GAN (DCGAN) [9] then introduced a convolutional network structure, which improved the generated image quality. Subsequently, to generate images of multiple categories within a single training process, a supervised GAN conditioned on image class information was proposed [10]. However, introducing classification labels into the dynamic training process has a negative impact on the categories of the generated images [11].
In addition, the abovementioned GAN models use the Jensen–Shannon (JS) divergence to measure the difference between the generated and real image distributions. The JS divergence is constant in most cases, resulting in invalid gradient backpropagation and unstable training. Arjovsky et al. therefore replaced the JS divergence with the Wasserstein distance combined with weight clipping and proposed the WGAN model [12], which improved training stability. Gulrajani et al. then proposed an improved WGAN, named WGAN-GP, in which a gradient penalty replaces the weight clipping operation, avoiding the extreme weight distributions that clipping can cause [13]. All of the abovementioned GAN models use simply stacked deep convolutional structures. Generally speaking, the most direct way to enhance the feature extraction ability of a model is to increase the network depth, but studies have found that blindly increasing the depth causes network degradation.
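For reference, the critic loss of WGAN-GP [13] augments the Wasserstein objective with a penalty on the gradient norm:

$$ L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[ D(\tilde{x}) \right] - \mathbb{E}_{x \sim \mathbb{P}_r}\left[ D(x) \right] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[ \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1 \right)^2 \right], $$

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated distributions, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the penalty coefficient ($\lambda = 10$ in [13]).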
To enhance the feature-learning ability of the deep convolutional network and to avoid the network degradation and vanishing-gradient problems of deep structures, this paper proposes a generative adversarial network based on the residual structure [14], named Res-WGAN-GP. The WGAN-GP loss function is applied to the proposed model to maintain training stability, and, according to the specific requirements of this loss function, different residual block structures are designed for the generator and the discriminator. In addition, a quality evaluation method suitable for the THz passive image generation task is proposed. With these components, the proposed model can generate high-quality THz passive images to augment the dataset. The main contribution of this paper is the first attempt to apply deep-learning technology to low-cost THz passive image data augmentation, which aims to support the further application of THz passive imaging systems in the field of security inspection.
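To make the design constraint concrete: the gradient penalty in WGAN-GP is computed per sample, so batch-wise normalization is avoided in the discriminator, while the generator can retain it. The following PyTorch fragment is a hypothetical sketch of this asymmetry only; the exact residual blocks used in this work are specified in Section 2.

```python
# Hypothetical residual blocks for a WGAN-GP setup (illustrative only;
# see Section 2 for the actual block design).
import torch
import torch.nn as nn

class GenResBlock(nn.Module):
    """Generator block: batch normalization is safe here."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # identity shortcut

class DiscResBlock(nn.Module):
    """Discriminator block: the per-sample gradient penalty conflicts
    with batch statistics, so no batch normalization is used."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut
```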
The rest of this paper is organized as follows. A detailed description of the Res-WGAN-GP model is given in Section 2. The experimental results based on a 0.2 THz passive imaging dataset are presented in Section 3. The analyses of the generated images are discussed in Section 4. Finally, conclusions are drawn in Section 5.
4. Results Analysis
In this section, the performance of the proposed Res-WGAN-GP model and the original WGAN-GP model on the terahertz passive image generation task is evaluated and compared. It is worth noting that, owing to the particularity of terahertz passive images, there is no public dataset and, to the best of our knowledge, no published work on terahertz passive image generation methods, so no general evaluation method can be applied directly. In addition, the Inception Score [19] and Fréchet Inception Distance [20] metrics, commonly used in the optical imagery field, cannot be used for the terahertz passive image generation task. Thus, this section draws on the relevant literature to propose an objective evaluation method suitable for the current task.
The proposed GAN model aims to provide data augmentation for object detection tasks; thus, the generated images should meet the following requirements:
(1) Visual quality: the generated images should be of high quality;
(2) Category consistency: each generated image must represent the desired class;
(3) Diversity: the generated images must not be repetitive;
(4) Usability: the generated images must be different from the real images already in the training set.
The proposed Res-WGAN-GP model and the original WGAN-GP model were each trained three times to obtain generative models for the three image categories, and eight groups of random noise inputs were then used to produce generated images. The comparison is shown in Figure 8. In terms of visual quality, the images generated by the proposed model are comparable to real images and essentially meet the first requirement. The images generated by the original WGAN-GP model are mostly blurry, with the desired object area not clear enough or even missing, which differs considerably from the real dataset.
To verify that the three separately trained models can accurately generate realistic images of the corresponding classes, cross and hybrid tests are performed [11]. First, the original dataset is divided by category and a classification model is trained; here, the GoogLeNet [21] model is used as the pretrained classifier. Then, 100 fake images per category are used as the test set, and the resulting classification accuracy constitutes the cross-test result. Second, a hybrid test is performed: a batch of generated images is mixed into the real dataset for retraining, and another batch of generated images is used as the test set to verify whether the classification performance of the network improves, which is closer to the actual application scenario. The statistical results of the cross test and the hybrid test are shown in Table 2. The images generated by the Res-WGAN-GP model achieve higher classification accuracy than those generated by the original WGAN-GP model, indicating that the generated images belong to the desired categories and satisfy the second requirement.
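A minimal sketch of the cross test is given below; the checkpoint path and data loader are hypothetical placeholders, and only the overall procedure (a GoogLeNet classifier fitted on real data, then evaluated on generated images) follows the description above.

```python
# Sketch of the cross test: classify generated images with a GoogLeNet
# model trained on the real dataset (paths/loaders are placeholders).
import torch
import torchvision

model = torchvision.models.googlenet(weights="IMAGENET1K_V1")  # ImageNet init
model.fc = torch.nn.Linear(model.fc.in_features, 3)            # 3 THz classes
# model.load_state_dict(torch.load("googlenet_real.pt"))       # hypothetical checkpoint
model.eval()

@torch.no_grad()
def cross_test_accuracy(fake_loader):
    """Accuracy of the real-data classifier on generated images."""
    correct = total = 0
    for images, labels in fake_loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```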
In order to verify the diversity of the generated images, that is, that the model does not repeatedly generate one or a few images, the SSIM index [22] is used to evaluate the similarity between generated images. The SSIM between a reference image $x$ and an image to be evaluated $y$ can be expressed as

$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, $$

where $\mu_x$ and $\mu_y$, $\sigma_x^2$ and $\sigma_y^2$, and $\sigma_{xy}$ represent the means, variances and covariance of the two images, respectively. $C_1$ and $C_2$ are constants used to ensure the stability of the calculation process and can be expressed as

$$ C_1 = (K_1 L)^2, \qquad C_2 = (K_2 L)^2, $$

where $K_1 = 0.01$, $K_2 = 0.03$ and $L$ is the dynamic range of the pixel values ($L = 255$ for 8-bit images).
Generally, SSIM is a value between 0 and 1. The larger the SSIM value, the higher the similarity between the image to be evaluated and the reference image. If the two images are exactly the same, the SSIM is equal to 1.
Specifically, a batch of images was generated for each category, and pairs of images were randomly selected to calculate SSIM values. If the SSIM of a pair equals 1, a repeated generation has occurred; conversely, a relatively small SSIM value means the generated images are clearly different. In this paper, 400 images were generated for each category, 100 pairs of generated images were randomly drawn from within each category to calculate SSIM values, and the mean value is defined as the intraclass SSIM; a small intraclass SSIM indicates better diversity of the generated images. The average SSIM of the real images under the same protocol is then calculated as a reference; real images are more variable than generated images, so they usually yield a smaller intraclass SSIM value [23]. The test results are shown in Figure 9. The average intraclass SSIM of the three categories of real data is 0.287, indicating that the real data have good diversity. The average intraclass SSIM of the three categories of samples generated by Res-WGAN-GP is 0.323, which also indicates good diversity. In addition, the trend of the intraclass SSIM values across the three generated categories is similar to that of the real samples, indicating that the data generated by the proposed model carry the same category information as the real samples. Although the images generated by WGAN-GP yield very low SSIM values, even lower than those of the real samples, this is due to the limited learning ability of the network, which results in overall low image quality: the gray-level distribution is uneven, so the differences between generated images are large, as can also be seen from the visual comparison in Figure 8. In summary, the images generated by the proposed model satisfy the third requirement.
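The intraclass SSIM described above can be computed as in the following sketch, assuming each category's generated images are available as 2-D 8-bit grayscale NumPy arrays (the function and variable names are ours, for illustration).

```python
# Sketch of the intraclass-SSIM diversity check: mean SSIM over random
# pairs of generated images from one category (lower = more diverse).
import random
import numpy as np
from skimage.metrics import structural_similarity as ssim

def intraclass_ssim(images, n_pairs=100, seed=0):
    rng = random.Random(seed)
    vals = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(images)), 2)  # two distinct images
        vals.append(ssim(images[i], images[j], data_range=255))  # 8-bit range
    return float(np.mean(vals))
```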
For the fourth requirement: the input of a GAN model is random noise sampled from a normal distribution, and although the model has a strong learning ability, the probability of exactly reproducing the real data distribution is very small. Overfitting is therefore uncommon in the GAN field; nevertheless, to rule out this unexpected situation and avoid generating useless data, the SSIM index is used again. Specifically, 100 images were generated for each of the 3 categories, and the SSIM values between all generated images and all real images were calculated to obtain the maximum SSIM value per category. A maximum below 1 means that the generated images do not duplicate any image in the real dataset. The computed maximum SSIM values of the three categories are 0.77, 0.81 and 0.73, respectively. It can therefore be concluded that the rare overfitting phenomenon does not occur in the model, and the generated data can be used for data augmentation in target detection after simple screening to eliminate abnormal samples.
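The corresponding overlap check can be sketched as follows (again with illustrative names, assuming 8-bit grayscale arrays): the maximum SSIM over all generated/real pairs is taken per category, and a value clearly below 1 indicates that no generated image duplicates a training image.

```python
# Sketch of the usability check: maximum SSIM between any generated
# image and any real image of one category (assumes 8-bit grayscale).
from skimage.metrics import structural_similarity as ssim

def max_ssim_vs_real(generated, real):
    return max(
        ssim(g, r, data_range=255)
        for g in generated for r in real
    )
```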
5. Conclusions
In this paper, a Res-WGAN-GP generative model and a quality evaluation method are proposed for THz passive image data augmentation. Based on the framework of the deep convolutional generative adversarial network, generator and discriminator models built on the residual structure are designed, and the WGAN-GP loss function is used to ensure the stability of the training process. The generated images are evaluated in terms of visual quality, category consistency, diversity and usability. The results show that the proposed model improves the quality of the generated images and meets the requirements of data augmentation applications. In the classification tests, the classification accuracy was improved by applying the augmented dataset, so the augmented data are expected to be applicable to object detection tasks with more target categories.
It is worth noting that the main contribution of this paper is the first attempt to use deep-learning methods for data augmentation of terahertz passive image datasets, aiming to provide a low-cost solution to the poor detection performance caused by insufficient data volume.
In the future, the network structure, training strategy and loss function need to be further optimized, and the proposed method needs to be verified on a dataset containing more categories. In addition, effective clarity evaluation methods suitable for GAN models remain to be investigated.