1. Introduction
Component encapsulation is an important process in the production of solar panels, and inspecting the welding quality of bus-bars and interconnectors is one of its key steps. Vision-based detection methods, with their advantages of non-contact operation, repeatability, high efficiency, and strong generalization, have gradually been applied in various industries [1,2]. They can be divided into design-feature-based and learning-feature-based methods. Design-feature-based methods include statistical methods (histograms [3], co-occurrence matrices [4], local binary patterns [5]), structural methods (morphology [6]), filtering methods [7,8] (Canny, Gabor, FFT, wavelet, Sobel, etc.), model-based methods [9], and multi-feature combination methods [10]. However, design-feature-based methods usually require specialists to tailor them to a specific task, which makes them hard to transfer to new applications and limits their robustness to interference.
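For illustration only, the following is a minimal sketch of two of the design features listed above, a Canny edge map and a local binary pattern histogram, using OpenCV and scikit-image; the file name, thresholds, and LBP parameters are assumptions, not settings from any cited work.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

# Load a grayscale weld image ("weld.png" is a placeholder path).
img = cv2.imread("weld.png", cv2.IMREAD_GRAYSCALE)

# Filtering-based feature: Canny edge map (thresholds are task-specific guesses).
edges = cv2.Canny(img, 50, 150)

# Statistical feature: uniform LBP histogram over 8 neighbors at radius 1.
lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
hist, _ = np.histogram(lbp, bins=int(lbp.max()) + 1, density=True)

# `edges` and `hist` would typically feed a hand-designed classifier (e.g., an SVM).
```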
Learning-feature-based methods can automatically extract features and useful information from images. In recent years, a large body of research has applied deep neural networks to defect detection tasks. However, the good performance of deep neural networks depends heavily on large amounts of high-quality data. For weld seam images, defective samples are generally scarce, especially for some rare defect types. Transfer learning from models pre-trained on large-scale image datasets can reduce the amount of data required for training, but the scarcity of defect data remains a challenge: the limited data are still insufficient to reach the detection accuracy required in production. Existing solutions include geometric transformations, contrast adjustments, edge feature extraction, and synthesizing new minority-class images with generative algorithms.
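As a concrete example of the simple geometric and contrast augmentations mentioned above, here is a hedged sketch using torchvision; the specific transforms and magnitudes are illustrative assumptions, not the pipeline used in this paper.

```python
import torchvision.transforms as T

# Illustrative geometric and contrast augmentations for weld images;
# operations and parameter values are assumptions for exposition.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # geometric transformation
    T.RandomRotation(degrees=10),          # geometric transformation
    T.ColorJitter(brightness=0.2, contrast=0.2),  # contrast adjustment
    T.ToTensor(),
])
# Applying `augment` to each PIL image yields a perturbed training sample.
```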
Chiraz et al. [11] improved the classification accuracy of the VGGNet network by approximately 3% by replacing the B and G channels of X-ray weld seam images with binary images obtained from Canny edge detection and adaptive Gaussian thresholding. Wei et al. [12] generated different minority-class samples of resistance spot welding (RSW) images using the balancing augmentation GAN (BAGAN) and the balancing augmentation GAN with gradient penalty (BAGAN-GP), and trained a ResNet50 to classify the types of RSW defects. Cao et al. [13] divided each large crack image of gusset plate welded joints of steel bridges into 192 small sub-images, increasing the amount of training data while also reducing the computation the network spends on large images. They fine-tuned VGG16 combined with data augmentation and achieved good detection performance. Topias et al. [14] proposed a method for synthesizing X-ray data of aerospace welds: they first modeled the component in CAD, then used ray-tracing simulation software to generate simulated X-ray images, added noise masks derived from experimentally measured images, and finally used Mask R-CNN for defect segmentation. Hou [15] applied three resampling methods, random over-sampling (ROS), random under-sampling (RUS), and the synthetic minority over-sampling technique (SMOTE), to X-ray weld images, and extracted three kinds of features: statistical features based on the gray-level co-occurrence matrix (GLCM), the histogram of oriented gradients (HOG), and features learned by a stacked sparse autoencoder (SSAE). Their experiments showed that the model trained with SSAE features and SMOTE resampling achieved the highest accuracy (97.2%).
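To make the resampling step concrete, the following is a minimal sketch of SMOTE over HOG feature vectors using imbalanced-learn; the array shapes, class counts, and parameters are illustrative assumptions, not those of the cited study.

```python
import numpy as np
from skimage.feature import hog
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
images = rng.random((20, 128, 128))    # stand-in weld images (N, H, W)
labels = np.array([0] * 17 + [1] * 3)  # heavily imbalanced: 3 defect samples

# One HOG descriptor per image, as in the HOG feature pipeline above.
X = np.stack([hog(im, pixels_per_cell=(16, 16)) for im in images])

# SMOTE interpolates new minority-class vectors between nearest neighbors;
# k_neighbors must be smaller than the minority-class count (3 here).
X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X, labels)
print(np.bincount(y_res))              # both classes now have 17 samples
```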
Previous studies have achieved good results, but research on defect classification under an extreme shortage of defect samples is still lacking. One remedy for insufficient defect samples is to apply simple geometric transformations to images or to generate virtual defect samples with generative algorithms. Since generative adversarial networks (GANs) were introduced for image generation in 2014 [16], scholars have conducted extensive research on them. The deep convolutional GAN (DCGAN) [17] introduced convolutional layers into the GAN. The conditional GAN (CGAN) [18] feeds auxiliary information, such as class labels, to the generator. The auxiliary classifier GAN (ACGAN) [19] adds class information to the discriminator. The Wasserstein GAN (WGAN) [20] stabilizes and accelerates training, and the Wasserstein GAN with gradient penalty (WGAN-GP) [21] further improves the robustness of the optimization process. The balancing GAN (BAGAN) [22] and the balancing GAN with gradient penalty (BAGAN-GP) [23] can be trained on imbalanced datasets and used to generate minority-class samples.
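To make the gradient-penalty idea behind WGAN-GP [21] concrete, here is a minimal PyTorch sketch of the penalty term in its standard formulation; this is an expository sketch, not code from the cited work.

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on
    random interpolations between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores), create_graph=True)
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()

# Critic loss: fake_scores.mean() - real_scores.mean() + lambda_gp * penalty
```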
In addition, researchers have studied training GANs with very few samples. Ojha et al. [24] and Xiao et al. [25] proposed an anchor-based strategy and the relaxed spatial structural alignment method, respectively, to adapt pre-trained generative models with extremely few samples. Ding et al. [26] proposed an attribute-editing method that synthesizes images of unseen categories by editing category-independent attributes of a GAN trained on known categories. However, a compatible pre-training dataset is not always available, and fine-tuning may even degrade performance [27]. Furthermore, Yang et al. [28] decomposed the encoded features into multiple frequency components and applied low-frequency skip connections to preserve contour and structural information, providing the generator with rich frequency information for image generation. Liu et al. [29] proposed a self-supervised discriminator that learns more descriptive feature maps covering more regions of the input images, giving the generator a more comprehensive training signal.
Most learning-feature-based approaches use convolutional structures for feature extraction. Thanks to its good parallelism and its ability to model long-distance dependencies, the Vision Transformer (ViT) [30], built on the self-attention mechanism, has shown outstanding performance in image classification tasks. A CNN learns feature maps at different scales and emphasizes the most expressive local patterns, whereas the attention mechanism diffuses information from local to global to find feature representations. The transformer focuses on the relationships between features rather than on their absolute values; because it does not depend entirely on the data themselves, it generalizes better. Therefore, fusing convolutional and attention features can achieve a bidirectional fusion of local and global features.
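For reference, the scaled dot-product self-attention underlying ViT, in a minimal single-head PyTorch form; projections are passed in as plain matrices for exposition only.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a patch sequence.

    x: (batch, num_patches, dim); w_*: (dim, dim) projection matrices.
    Each output token is a weighted mix of ALL tokens, which is how the
    transformer models global, pairwise relationships between features.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```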
The CNN’s focus on capturing intricate local details limits its capacity for holistic modeling, while the ViT struggles to fully exploit the spatial information in visual signals. Thus, integrating local and global features has emerged as a strategy for enhancing model performance. Dutta et al. [31,32] extended the concept of particle interactions in quantum mechanics to imaging problems, treating each image patch as an individual particle that interacts with adjacent particles (i.e., neighboring image patches); this approach effectively combines local and non-local information. Liu et al. [33] introduced the Swin Transformer, which uses hierarchical feature maps and shifted windows to facilitate interactions among image patches. Lou et al. [34] proposed TransXNet, in which a CNN performs initial feature extraction and the extracted local features are then fed into transformer modules for global modeling; the model leverages both local and global attention to enhance performance. Wu et al. [35] introduced CvT, a hierarchical structure consisting of multiple convolutional transformer blocks per layer; by replacing linear projections with convolutional projections in the multi-head attention, the model better exploits local image information while still modeling global context. Building upon the classic ViT architecture, Guo et al. [36] introduced CMT, which adds a convolutional stem composed of 3 × 3 convolutions and CMT modules combining depth-wise convolutions with self-attention, improving the accuracy of visual networks. Mehta et al. [37] proposed MobileViT, a lightweight visual model designed to model both local and global information in input tensors with fewer parameters.
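The hybrid models above share a common pattern: a convolution captures local structure, and self-attention then models global context. Below is a hedged sketch of that generic pattern in PyTorch; the layer sizes and residual arrangement are illustrative assumptions and do not reproduce any cited architecture.

```python
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    """Illustrative hybrid block: depthwise convolution (local features)
    followed by multi-head self-attention over flattened patches (global)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3,
                               padding=1, groups=dim)   # depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (B, C, H, W)
        x = x + self.local(x)             # local residual branch
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        n = self.norm(seq)
        seq = seq + self.attn(n, n, n)[0]             # global residual branch
        return seq.transpose(1, 2).reshape(b, c, h, w)
```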
Based on the above analysis, this article proposes a generative adversarial network with fused multi-scale features to generate virtual images of defective welds; convolutional features and global attention features are then extracted from the weld images for defect classification. The main contributions are as follows:
- (1) We establish a defect classification framework for small-scale weld datasets, including a GAN-based defect image generator and a multi-feature-fusion-based defect classifier;
- (2) We propose a defect image generation method for the case of an extreme shortage of defective weld samples and class imbalance, and we experimentally demonstrate that the distribution of the weld images generated by the proposed model matches the real data better than that of several other models;
- (3) To extract as much information as possible from the limited images, we combine convolutional and attention features, exploiting the local strength of the former and the global strength of the latter. This allows the network to fully learn the features of the available image data and thus classify defects more accurately, while also reducing the complexity of the algorithm and improving the generalization ability of the model.
The remainder of this paper is organized as follows. Section 2 describes the proposed method, including the welding defect datasets, the GAN-based data enhancement method, and the structure and characteristics of the network combining CNN and self-attention features. Section 3 presents the experimental results and discussion on the weld defect datasets. Finally, conclusions are given in Section 4.