1. Introduction
Component encapsulation is an important process in the production of solar panels, and inspecting the welding quality of bus-bars and interconnectors is one of its key steps. Vision-based detection methods, with their advantages of non-contact operation, repeatability, high efficiency, and strong generalization, have gradually been applied in various industries [1,2]. They can be divided into design-feature-based and learning-feature-based methods. Design-feature-based methods include statistical methods (histograms [3], co-occurrence matrices [4], local binary patterns [5]), structural methods (morphology [6]), filtering methods [7,8] (Canny, Gabor, FFT, wavelet, Sobel, etc.), model-based methods [9], and multi-feature combination methods [10]. However, design-feature-based methods usually require specialists to tailor them to a specific task, which makes them hard to transfer to new applications and limits their robustness to interference.
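For illustration only, the following is a minimal sketch of two of the design features listed above, a Canny edge map and a local binary pattern histogram, using OpenCV and scikit-image; the file name, thresholds, and LBP parameters are assumptions, not settings from any cited work.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

# Load a grayscale weld image ("weld.png" is a placeholder path).
img = cv2.imread("weld.png", cv2.IMREAD_GRAYSCALE)

# Filtering-based feature: Canny edge map (thresholds are task-specific guesses).
edges = cv2.Canny(img, 50, 150)

# Statistical feature: uniform LBP histogram over 8 neighbors at radius 1.
lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
hist, _ = np.histogram(lbp, bins=int(lbp.max()) + 1, density=True)

# `edges` and `hist` would typically feed a hand-designed classifier (e.g., an SVM).
```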
Learning-feature-based methods can automatically extract features and useful information from images. In recent years, a large body of research has applied deep neural networks to defect detection tasks. However, the good performance of deep neural networks depends heavily on large amounts of high-quality data. For weld seam images, defective samples are generally scarce, especially for some rare defect types. Transfer learning from models pre-trained on large-scale image datasets can reduce the amount of data required for training, but the scarcity of defect data remains a challenge: the limited data are still insufficient to reach the detection accuracy required in production. Existing solutions include geometric transformations, contrast adjustments, edge feature extraction, and synthesizing new minority-class images with generative algorithms.
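As a concrete example of the simple geometric and contrast augmentations mentioned above, here is a hedged sketch using torchvision; the specific transforms and magnitudes are illustrative assumptions, not the pipeline used in this paper.

```python
import torchvision.transforms as T

# Illustrative geometric and contrast augmentations for weld images;
# operations and parameter values are assumptions for exposition.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # geometric transformation
    T.RandomRotation(degrees=10),          # geometric transformation
    T.ColorJitter(brightness=0.2, contrast=0.2),  # contrast adjustment
    T.ToTensor(),
])
# Applying `augment` to each PIL image yields a perturbed training sample.
```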
Chiraz et al. [11] improved the classification accuracy of the VGGNet network by approximately 3% by replacing the B and G channels of X-ray weld seam images with binary images obtained from Canny edge detection and adaptive Gaussian thresholding. Wei et al. [12] generated different minority-class samples of resistance spot welding (RSW) images using the balancing augmentation GAN (BAGAN) and the balancing augmentation GAN with gradient penalty (BAGAN-GP), and trained a ResNet50 to classify the types of RSW defects. Cao et al. [13] divided each large crack image of gusset plate welded joints of steel bridges into 192 small sub-images, increasing the amount of training data while also reducing the computation the network spends on large images. They fine-tuned VGG16 combined with data augmentation and achieved good detection performance. Topias et al. [14] proposed a method for synthesizing X-ray data of aerospace welds: they first modeled the component in CAD, then used ray-tracing simulation software to generate simulated X-ray images, added noise masks derived from experimentally measured images, and finally used Mask R-CNN for defect segmentation. Hou [15] applied three resampling methods, random over-sampling (ROS), random under-sampling (RUS), and the synthetic minority over-sampling technique (SMOTE), to X-ray weld images, and extracted three kinds of features: statistical features based on the gray-level co-occurrence matrix (GLCM), the histogram of oriented gradients (HOG), and features learned by a stacked sparse autoencoder (SSAE). Their experiments showed that the model trained with SSAE features and SMOTE resampling achieved the highest accuracy (97.2%).
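To make the resampling step concrete, the following is a minimal sketch of SMOTE over HOG feature vectors using imbalanced-learn; the array shapes, class counts, and parameters are illustrative assumptions, not those of the cited study.

```python
import numpy as np
from skimage.feature import hog
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
images = rng.random((20, 128, 128))    # stand-in weld images (N, H, W)
labels = np.array([0] * 17 + [1] * 3)  # heavily imbalanced: 3 defect samples

# One HOG descriptor per image, as in the HOG feature pipeline above.
X = np.stack([hog(im, pixels_per_cell=(16, 16)) for im in images])

# SMOTE interpolates new minority-class vectors between nearest neighbors;
# k_neighbors must be smaller than the minority-class count (3 here).
X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X, labels)
print(np.bincount(y_res))              # both classes now have 17 samples
```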
Previous studies have achieved good results, but research on defect classification under an extreme shortage of defect samples is still lacking. One remedy for insufficient defect samples is to apply simple geometric transformations to images or to generate virtual defect samples with generative algorithms. Since generative adversarial networks (GANs) were introduced for image generation in 2014 [16], scholars have conducted extensive research on them. The deep convolutional GAN (DCGAN) [17] introduced convolutional layers into the GAN. The conditional GAN (CGAN) [18] feeds auxiliary information, such as class labels, to the generator. The auxiliary classifier GAN (ACGAN) [19] adds class information to the discriminator. The Wasserstein GAN (WGAN) [20] stabilizes and accelerates training, and the Wasserstein GAN with gradient penalty (WGAN-GP) [21] further improves the robustness of the optimization process. The balancing GAN (BAGAN) [22] and the balancing GAN with gradient penalty (BAGAN-GP) [23] can be trained on imbalanced datasets and used to generate minority-class samples.
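To make the gradient-penalty idea behind WGAN-GP [21] concrete, here is a minimal PyTorch sketch of the penalty term in its standard formulation; this is an expository sketch, not code from the cited work.

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on
    random interpolations between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores), create_graph=True)
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()

# Critic loss: fake_scores.mean() - real_scores.mean() + lambda_gp * penalty
```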
In addition, researchers have studied training GANs with very few samples. Ojha et al. [24] and Xiao et al. [25] proposed an anchor-based strategy and the relaxed spatial structural alignment method, respectively, to adapt pre-trained generative models with extremely few samples. Ding et al. [26] proposed an attribute-editing method that synthesizes images of unseen categories by editing category-independent attributes of a GAN trained on known categories. However, a compatible pre-training dataset is not always available, and fine-tuning may even degrade performance [27]. Furthermore, Yang et al. [28] decomposed the encoded features into multiple frequency components and applied low-frequency skip connections to preserve contour and structural information, providing the generator with rich frequency information for image generation. Liu et al. [29] proposed a self-supervised discriminator that learns more descriptive feature maps covering more regions of the input images, giving the generator a more comprehensive training signal.
Most learning-feature-based approaches use convolutional structures for feature extraction. Thanks to its good parallelism and its ability to model long-distance dependencies, the Vision Transformer (ViT) [30], built on the self-attention mechanism, has shown outstanding performance in image classification tasks. A CNN learns feature maps at different scales and emphasizes the most expressive local patterns, whereas the attention mechanism diffuses information from local to global to find feature representations. The transformer focuses on the relationships between features rather than on their absolute values; because it does not depend entirely on the data themselves, it generalizes better. Therefore, fusing convolutional and attention features can achieve a bidirectional fusion of local and global features.
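For reference, the scaled dot-product self-attention underlying ViT, in a minimal single-head PyTorch form; projections are passed in as plain matrices for exposition only.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a patch sequence.

    x: (batch, num_patches, dim); w_*: (dim, dim) projection matrices.
    Each output token is a weighted mix of ALL tokens, which is how the
    transformer models global, pairwise relationships between features.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```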
The CNN’s focus on capturing intricate local details limits its capacity for holistic modeling, while the ViT struggles to fully exploit the spatial information in visual signals. Thus, integrating local and global features has emerged as a strategy for enhancing model performance. Dutta et al. [31,32] extended the concept of particle interactions in quantum mechanics to imaging problems, treating each image patch as an individual particle that interacts with adjacent particles (i.e., neighboring image patches); this approach effectively combines local and non-local information. Liu et al. [33] introduced the Swin Transformer, which uses hierarchical feature maps and shifted windows to facilitate interactions among image patches. Lou et al. [34] proposed TransXNet, in which a CNN performs initial feature extraction and the extracted local features are then fed into transformer modules for global modeling; the model leverages both local and global attention to enhance performance. Wu et al. [35] introduced CvT, a hierarchical structure consisting of multiple convolutional transformer blocks per layer; by replacing linear projections with convolutional projections in the multi-head attention, the model better exploits local image information while still modeling global context. Building upon the classic ViT architecture, Guo et al. [36] introduced CMT, which adds a convolutional stem composed of 3 × 3 convolutions and CMT modules combining depth-wise convolutions with self-attention, improving the accuracy of visual networks. Mehta et al. [37] proposed MobileViT, a lightweight visual model designed to model both local and global information in input tensors with fewer parameters.
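The hybrid models above share a common pattern: a convolution captures local structure, and self-attention then models global context. Below is a hedged sketch of that generic pattern in PyTorch; the layer sizes and residual arrangement are illustrative assumptions and do not reproduce any cited architecture.

```python
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    """Illustrative hybrid block: depthwise convolution (local features)
    followed by multi-head self-attention over flattened patches (global)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3,
                               padding=1, groups=dim)   # depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, C, H, W)
        x = x + self.local(x)             # local residual branch
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        n = self.norm(seq)
        seq = seq + self.attn(n, n, n)[0]             # global residual branch
        return seq.transpose(1, 2).reshape(b, c, h, w)
```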
Based on the above analysis, this article proposes a generative adversarial network with fused multi-scale features to generate virtual images of defective welds; convolutional features and global attention features are then extracted from the weld images for defect classification. The main contributions are as follows:
- (1) We establish a defect classification framework for small-scale weld datasets, including a GAN-based defect image generator and a multi-feature-fusion-based defect classifier;
- (2) We propose a defect image generation method for the case of an extreme shortage of defective weld samples and class imbalance, and we experimentally demonstrate that the distribution of the weld images generated by the proposed model matches the real data better than that of several other models;
- (3) To extract as much information as possible from the limited images, we combine convolutional and attention features, exploiting the local strength of the former and the global strength of the latter. This allows the network to fully learn the features of the available image data and thus classify defects more accurately, while also reducing the complexity of the algorithm and improving the generalization ability of the model.
The remainder of this paper is organized as follows. Section 2 describes the proposed method, including the welding defect datasets, the GAN-based data enhancement method, and the structure and characteristics of the network combining CNN and self-attention features. Section 3 presents the experimental results and discussion on the weld defect datasets. Finally, conclusions are given in Section 4.