Article

DEU-Net: A Multi-Scale Fusion Staged Network for Magnetic Tile Defect Detection

1 School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2 School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4724; https://doi.org/10.3390/app14114724
Submission received: 17 March 2024 / Revised: 27 May 2024 / Accepted: 28 May 2024 / Published: 30 May 2024

Abstract

Surface defect detection is a critical task in the manufacturing industry for ensuring product quality and machining efficiency. Image-based precise defect detection faces significant challenges because defects lack fixed shapes and detection is heavily influenced by lighting conditions. Addressing the efficiency demands of defect detection algorithms, which are often deployed on embedded devices, and the highly imbalanced pixel ratio between foreground and background, this paper introduces a multi-scale fusion staged U-shaped convolutional neural network (DEU-Net). The network provides segmentation results for defect anomalies while indicating the probability of defect presence, and it can be trained with fewer parameters, a crucial requirement for practical applications. The proposed model achieves an mIoU of 66.94% and an F1 score of 74.89% with low FLOPs (36.675 GMac) and Params (19.710 M). Comparative analysis with FCN, U-Net, DeepLab v3+, U-Net++, Attention U-Net, and Trans U-Net demonstrates the superiority of the proposed approach in surface defect detection.

1. Introduction

Surface defect detection plays a crucial role in assessing product quality in the manufacturing industry. With the development of computer vision, automated visual inspection has become an important means of improving industrial automation. However, many existing computer vision defect detection algorithms rely on knowledge-based approaches [1,2], involving handcrafted feature extraction and classification. Such traditional image processing pipelines are inaccurate and inflexible, so a new approach is needed for surface defect detection. In the context of magnetic tiles, defect detection is particularly important because these components are critical to the performance and safety of electric motors. Defects such as blowholes, breaks, cracks, frays, and uneven surfaces can severely compromise the lifespan of the motor and pose significant safety risks. Consequently, it is essential to thoroughly inspect magnetic tiles for defects before they are used in motors.
In recent years, with the advancement of deep learning, deep learning models represented by convolutional neural networks (CNNs) [3] have demonstrated success across a range of computer vision domains, including face recognition and scene text detection. Surface defect detection based on deep learning has been able to fulfill many industrial requirements through classification, object detection, and semantic segmentation. Classification algorithms such as Visual Geometry Group (VGG) [4] can differentiate whether defects are present but cannot locate the defect’s position. Object detection algorithms like Region-CNN (RCNN) [5] can locate defects and acquire their positions but cannot provide geometric information about the defects. Semantic segmentation algorithms like fully convolutional networks (FCNs) [6] can provide detailed segmentation information. Over the years, numerous deep learning algorithms, including DeepLab v3+, U-Net, U-Net++, Attention U-Net, Trans U-Net, and others, have been proposed to address semantic segmentation challenges [7,8,9,10,11,12,13].
In practical scenarios, pixel-level defect detection differs significantly from the semantic segmentation of natural images for three main reasons: (1) defects have no fixed shape and often exhibit extreme aspect ratios compared with common semantic segmentation targets; (2) the pixel ratio between foreground and background is extremely imbalanced; and (3) currently available datasets are relatively small, limiting the number of annotated images for training. Therefore, a semantic segmentation model that performs well on natural images may not necessarily excel in defect detection scenarios.
The challenge of obtaining industrial data adds complexity to the application of semantic segmentation in surface defect detection. While U-Net requires relatively few samples for network training to achieve high performance, its inefficient encoder cannot meet real-time requirements in industrial settings. To balance efficiency and accuracy, the proposed approach adopts the MBConv [14] block as the backbone for the encoder and decoder of the semantic segmentation model. In the decoder, a multi-layer output fusion technique is applied to better learn convolutional feature information from different levels.
Pixel-level defect detection involves classifying each individual pixel in an image as defective or non-defective, allowing for the precise identification of anomalies at the most granular level. This method is crucial for applications requiring detailed inspection, such as identifying micro-cracks in materials. Semantic segmentation, on the other hand, assigns each pixel in an image to a specific category based on the object it belongs to. This technique is vital for tasks where understanding the context of each pixel contributes to overall image analysis, such as distinguishing between different types of defects. In surface defect control, the accurate classification of defects in each image is often more important than precise defect localization. To overcome the challenge of limited sample quantities in deep learning, the proposed method is designed in two stages: the first stage implements a segmentation network for pixel-wise defect localization, and the second stage involves a supplementary network built on top of the segmentation network and utilizes segmentation outputs and features for binary image classification. The main contributions of this work include:
(1) Introducing an efficient encoder-decoder architecture based on the MBConv block, a module that combines depthwise separable convolution with an inverted residual structure. This architecture significantly reduces the number of parameters and the computational cost while optimizing the network's learning capability and feature extraction efficiency through precise adjustment of the convolution kernels and expansion ratios;
(2) Employing a multi-layer fusion-improved loss function in the decoder, which enhances segmentation accuracy for objects at different scales by better learning convolutional feature information from various levels;
(3) Proposing a two-stage method for industrial product surface defect detection. In the first stage, a designed lightweight U-Net network generates a mask for the defect area; in the second stage, a compact decision network, combined with the refined mask from the first stage, detects and assesses defects.

2. Related Works

Based on differences in features, traditional machine vision-based defect detection methods can be mainly classified into three categories: texture-based, color-based, and shape-based methods [15,16]. Song et al. [17] successfully identified wood-grain defects using histogram features, achieving a recognition rate of 99.8%. Additionally, Putri et al. [18] detected defects on ceramic surfaces using the gray-level co-occurrence matrix technique, with an accuracy of 92.31%.
Today, classical image segmentation techniques have been overshadowed by the widespread use of deep learning models, which offer improved segmentation results [19]. Badrinarayanan et al. introduced SegNet [20], an architecture based on fully convolutional networks (FCNs) that performs non-linear upsampling using the max-pooling indices stored during encoding. Yang et al. enhanced U-Net by incorporating residual blocks [21], facilitating the extraction of additional features at each layer. Owing to its superior performance and efficiency in feature extraction, Res-UNet serves as the foundation for numerous deep learning architectures [22]. Kermi et al. used a slightly modified U-Net architecture, integrating weighted cross-entropy and generalized Dice loss into a composite loss function, resulting in a substantial improvement in segmentation accuracy [23]. However, these methods transmit all extracted features to the decoder stage through skip connections.
Several detection methods incorporating multi-scale and multi-level feature extraction have been proposed for powerful feature representation [21,24]. These methods use feature pyramid modules as multi-scale feature extractors, capturing rich contextual information under various resolution conditions. Additionally, Mei et al. [25] employed a densely connected convolutional neural network for crack detection, introducing a connectivity loss function between pixels to overcome crack dispersion in the output of deconvolution layers. Fei et al. proposed CrackNet-V, an efficient deep neural network for 3D asphalt pavement crack detection, building upon the previous work of CrackNet [26,27]. CrackNet-V uses multiple small convolutional kernels (3 × 3) to increase the network depth, improving accuracy and computational efficiency without adding extra parameters. Huang proposed an adaptive deep fusion capsule network (ADFCNet) [28,29], which involves the deep multimodal fusion of high-level features and employs a capsule classifier for roughness recognition. This approach utilizes surface images of artifacts to detect roughness levels.
Traditional image classification models often struggle to precisely locate defects, while object detection models require a large number of prior boxes to cope with defects that vary in shape and size, which can lead to inaccurate predictions in cases of small sample sizes. Moreover, although semantic segmentation models can visualize defects, they are unable to automatically decide on over-detection or under-detection. Therefore, this research proposes a new approach that combines a segmentation network with a decision network to detect defects. Specifically, this paper uses the segmentation network to accurately locate defects and the decision network to make decisions about over-detection or under-detection in the detection results.
The deep learning techniques discussed above are applied to defect segmentation and localization in surface defect detection. However, for the accurate segmentation and localization of defect areas, it is crucial to obtain a qualitative result indicating the presence of defects. Therefore, this paper proposes a novel DEU-Net that integrates segmentation and classification. Improvements are made to the U-Net in the segmentation part, and the model is applied to surface defect detection.

3. Proposed Methodology

The proposed architecture is illustrated in Figure 1. The model consists of two staged networks: a segmentation network and a decision network. The enhanced U-Net serves as the segmentation network; its last decoding layer, together with the segmentation result, is fed to the decision network, which outputs the probability that a defect is present.

3.1. Segmentation Network

This section begins by introducing the segmentation component of the DEU-Net, which combines MBConv and a multi-scale fusion. It then provides a detailed description of the structure of the MBConv module and the multi-scale fusion mechanism.
The segmentation structure used in this paper is illustrated in Figure 2, incorporating MBConv and a multi-scale fusion mechanism to improve segmentation accuracy while maintaining efficiency. The pivotal module of the segmentation network is the MBConv block, in which the 5 × 5 convolution kernel is replaced with a 3 × 3 kernel to curtail computational cost and parameters. In the proposed methodology, MBConv1, k3 × 3 denotes an MBConv block layer with an expansion factor of 1 and a convolution kernel size of 3 × 3 (as detailed in Section 3.1.1). The expansion factor is the ratio of output channels to input channels in the expansion convolution. Likewise, MBConv6, k3 × 3 denotes an MBConv block layer with an expansion factor of 6 and a convolution kernel size of 3 × 3. For overall network efficiency, MBConv1 is employed in the first two encoding and decoding layers. It is essential to keep the MBConv block types consistent at corresponding positions in the encoder and decoder, which ensures better compatibility between the two.
The skip connection method of the U-Net structure is also adopted, introducing a direct connection between max pooling and deconvolution. This allows certain features to be propagated directly from the encoder to the corresponding decoder, retaining contextual information from the encoding part and introducing multiple paths for gradient backpropagation, which alleviates gradient vanishing.
The outputs En1, En2, En3, and En4 of all layers in the decoding part are fused to obtain the final feature. The deeper features are relatively coarse, providing strong responses for larger targets and the partial edges of targets. Meanwhile, the shallow features can complement the deep features with sufficient detailed information. Additionally, each unit’s receptive field is different, contributing well-integrated features to the final feature map.

3.1.1. MBConv Block

The MBConv structure, as shown in Figure 3, consists primarily of a 1 × 1 regular convolution (for dimensionality expansion, including BN and Swish activation), a k × k depthwise convolution (including BN and Swish activation, where k is 3 or 5, as in the EfficientNet-B0 architecture), an SE (squeeze-and-excitation) module, and a 1 × 1 regular convolution (for dimensionality reduction, including BN), followed by a dropout layer.
The first layer of dimension expansion employs a 1 × 1 convolutional layer, with the number of convolutional kernels being $n$ times the dimension of the input feature matrix (the 1 and 6 in MBConv1 and MBConv6 represent the multiplicative factor $n$). The depthwise convolution is a special case of standard convolution. The standard convolution operation takes an input feature map $X$ of dimensions $D_X \times D_X \times M$, applies a convolution kernel $K$ of dimensions $D_K \times D_K \times M \times N$, and produces an output feature map $Y$ of dimensions $D_Y \times D_Y \times N$. Here, $D_X$ is the spatial width and height of the square input feature map, $M$ is the number of input channels (input depth), $D_Y$ is the spatial width and height of the square output feature map, $N$ is the number of output channels (output depth), and $D_K$ is the spatial dimension of the (square) convolution kernel. The standard convolution is given in Equation (1):
$$Y_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot X_{k+i-1,\,l+j-1,\,m} \quad (1)$$
The depthwise convolution is given in Equation (2), where $\tilde{K}$ is a depthwise convolution kernel of dimensions $D_K \times D_K \times M$. The $m$-th filter of $\tilde{K}$, of size $D_K \times D_K$, is applied only to the $m$-th channel of the input feature map $X$, producing the $m$-th channel of the output feature map $\tilde{Y}$ with $M$ channels:
$$\tilde{Y}_{k,l,m} = \sum_{i,j} \tilde{K}_{i,j,m} \cdot X_{k+i-1,\,l+j-1,\,m} \quad (2)$$
SE (squeeze-and-excitation) [30] is a lightweight attention mechanism that adaptively recalibrates channel-wise feature responses. In addition, a shortcut connection between the block's input and output helps prevent the gradient explosion and vanishing that can arise as the number of network layers increases.
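To make the block structure concrete, the following is a minimal PyTorch sketch of an MBConv block as described above, assuming SiLU as the Swish activation, a channel-reduction ratio of 4 in the SE module, and an illustrative dropout rate; the exact channel widths and dropout value of the paper's configuration are not specified and are treated as parameters here.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: squeeze (global pool), then excite (two 1x1 convs)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, squeezed, 1), nn.SiLU(),
            nn.Conv2d(squeezed, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class MBConv(nn.Module):
    """MBConv block: 1x1 expansion -> k x k depthwise -> SE -> 1x1 projection.

    `expand` is the expansion factor (1 for MBConv1, 6 for MBConv6); the
    residual shortcut is applied only when input and output shapes match.
    """
    def __init__(self, in_ch, out_ch, k=3, expand=1, drop=0.2):
        super().__init__()
        mid = in_ch * expand
        layers = []
        if expand != 1:  # MBConv1 skips the expansion convolution
            layers += [nn.Conv2d(in_ch, mid, 1, bias=False),
                       nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            # depthwise convolution: groups == channels
            nn.Conv2d(mid, mid, k, padding=k // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid),
            # 1x1 projection back down, BN only (no activation)
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        self.drop = nn.Dropout2d(drop)
        self.use_residual = in_ch == out_ch

    def forward(self, x):
        y = self.drop(self.block(x))
        return x + y if self.use_residual else y
```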

3.1.2. Multi-Scale Fusion Mechanism

Currently, most fully convolutional semantic segmentation networks, such as SegNet and DeepLab v3+, typically only utilize the last layer of the convolutional network as the output, which may result in the loss of shallow-level detailed information in deep features. Additionally, for defective images, there is a severe imbalance between foreground and background pixels, which can lead to gradient vanishing or exploding, making network training slow or even preventing convergence.
Inspired by the HED network architecture [31] and U2-Net [32], we calculate a loss in each unit (block) of the decoder and weight these losses to form the final loss function (as detailed in Section 3.1.3). The advantage of this approach is the better utilization of convolutional feature information from different levels. Additionally, we fuse the feature information from all decoder layers to obtain the final feature representation. Deep features typically respond strongly to larger targets and partial target edges, while shallow features provide more detailed information. Furthermore, since the receptive field of each unit differs, they are effectively integrated into the final feature map, as shown in Figure 4.
After the output of each block in the decoder is obtained, as illustrated in Figure 4, we use a 1 × 1 convolutional kernel to reduce the number of feature channels. Subsequently, the probability map is restored to the original size through deconvolution. Finally, these probability maps are stacked, and the final prediction result is generated through a 1 × 1 convolution. The advantage of this approach lies in its ability to better capture feature information at different levels in the image while addressing the imbalance between foreground and background pixels.
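A minimal sketch of this fusion head in PyTorch is shown below. The decoder channel counts are illustrative, not the paper's exact configuration, and bilinear upsampling stands in for the deconvolution step described above; the head returns both the fused map and the per-block side maps so that each can receive its own guidance loss (Section 3.1.3).

```python
import torch
import torch.nn as nn

class MultiScaleFusionHead(nn.Module):
    """Fuses side outputs from each decoder block (Section 3.1.2).

    For each decoder feature map: a 1x1 conv produces a single-channel
    map, which is upsampled to the input resolution; the maps are then
    stacked and fused with a final 1x1 convolution.
    """
    def __init__(self, decoder_channels=(512, 256, 128, 64)):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, 1, kernel_size=1) for c in decoder_channels)
        self.fuse = nn.Conv2d(len(decoder_channels), 1, kernel_size=1)

    def forward(self, feats, out_size):
        # feats: list of decoder block outputs, deepest first
        side_maps = []
        for f, conv in zip(feats, self.reduce):
            m = conv(f)
            m = nn.functional.interpolate(
                m, size=out_size, mode="bilinear", align_corners=False)
            side_maps.append(m)
        fused = self.fuse(torch.cat(side_maps, dim=1))
        # side maps are kept for the per-layer guidance losses
        return fused, side_maps
```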

3.1.3. Loss Function

The segmentation network is trained as a binary segmentation problem, so the binary cross-entropy loss function can be used directly. The training dataset is defined as $S = \{X, Y\}$, where $X = \{x_i^{(n)}, i = 1, \dots, M\}$ is the input image and $Y = \{y_i^{(n)}, i = 1, \dots, M\}$ is the defect annotation map; $M$ is the number of pixels in each image. The loss function is defined in Equation (3):
$$L_{seg} = -\frac{1}{M} \sum_{i=1}^{M} \left( y_i^{(n)} \log F(x_i^{(n)}; w) + \left(1 - y_i^{(n)}\right) \log\left(1 - F(x_i^{(n)}; w)\right) \right) \quad (3)$$
where $w$ denotes the trainable parameters of the neural network model, and $F(x_i^{(n)}; w)$ is the predicted probability that pixel $i$ belongs to a defect.
The combined loss trains each decoder layer separately by generating a prediction at each decoder layer and comparing it with the ground truth. The loss for each layer is referred to as that layer's guidance loss ($L_{seg}$). The total combined loss is computed as follows (Equation (4)):
$$CL_{seg} = \alpha \cdot L_{seg}^{1} + \beta \cdot L_{seg}^{2} + \gamma \cdot L_{seg}^{3} + \delta \cdot L_{seg}^{4} + \theta \cdot L_{seg}^{5} \quad (4)$$
where $\alpha$, $\beta$, $\gamma$, $\delta$, and $\theta$ are the weights for the guidance losses of the first, second, third, and fourth layers and the combined output of the four decoder layers, respectively. Higher weights are assigned to the fourth layer and the final output, as they constitute the primary outputs of the prediction model. Different values were experimented with, and $\alpha = \beta = \gamma = 0.125$ and $\delta = \theta = 0.25$ were chosen.
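A compact sketch of the combined guidance loss of Equation (4) is given below. It assumes the network emits logits (so binary cross-entropy with logits is used) and that the four side maps and the fused map have already been upsampled to the target resolution, as in the fusion head above.

```python
import torch
import torch.nn as nn

def combined_seg_loss(side_logits, fused_logits, target,
                      weights=(0.125, 0.125, 0.125, 0.25, 0.25)):
    """Combined guidance loss of Equation (4).

    side_logits: the four per-layer prediction maps; fused_logits: the
    fused output. The default weights follow the paper's chosen values
    (alpha = beta = gamma = 0.125, delta = theta = 0.25).
    """
    bce = nn.BCEWithLogitsLoss()
    outputs = list(side_logits) + [fused_logits]
    return sum(w * bce(o, target) for w, o in zip(weights, outputs))
```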

3.2. Decision Network

The design of the decision network is inspired by the classification design for CNNs proposed by Tabernik et al. [33] and follows two core principles to optimize the classification of image segmentation results. The first principle is to introduce multi-layer convolution operations and downsampling so that the network has sufficient capacity to handle large, complex-shaped targets effectively. This enables the network to capture not only local shape features of targets but also global shape information across the entire image region, enhancing the model's perceptual range and recognition capability.
As shown in Figure 5, our decision network fully leverages two key input sources. Firstly, we utilize the output features of the last convolution operation in the segmentation network, which employs a 1 × 1 kernel. Secondly, we integrate the final segmentation image output of the segmentation network, also utilizing a 1 × 1 kernel. This design cleverly introduces an optional shortcut in the network, allowing the network to skip the use of many unnecessary feature maps when needed, reducing the model’s complexity and parameter count, and effectively preventing overfitting issues.
These shortcuts occur at two different levels. Firstly, at the beginning of the decision network, we transmit the segmentation output image to multiple convolution layers in the decision network to guide the classification process. Secondly, at the end of the decision network, we combine the global average and maximum values of the segmentation output image with the input of the final fully connected layer. The use of these two shortcuts enhances the performance of the decision network, making it more flexible and efficient in making classification decisions.
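The following is a minimal PyTorch sketch of such a decision stage, assuming the two inputs described above (the segmentation network's last feature map and its output segmentation map) and both shortcut paths; the number of convolutional stages and the channel widths are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DecisionNet(nn.Module):
    """Sketch of the decision stage: classifies an image as defective or
    non-defective from (a) the segmentation network's last feature map
    and (b) its output segmentation map.
    """
    def __init__(self, feat_ch=32):
        super().__init__()
        # shortcut 1: the segmentation map is concatenated to the
        # features before the convolution/downsampling stages
        self.stages = nn.Sequential(
            nn.Conv2d(feat_ch + 1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # 32 pooled conv features + global average and maximum of the
        # segmentation map (shortcut 2) feed the final linear layer
        self.fc = nn.Linear(32 + 2, 1)

    def forward(self, feats, seg_map):
        x = torch.cat([feats, seg_map], dim=1)
        x = self.stages(x)
        g = x.mean(dim=(2, 3))            # global average pool of features
        s_avg = seg_map.mean(dim=(2, 3))  # shortcut: average of the seg map
        s_max = seg_map.amax(dim=(2, 3))  # shortcut: maximum of the seg map
        return self.fc(torch.cat([g, s_avg, s_max], dim=1))  # defect logit
```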

Loss Function

The decision network is trained using the cross-entropy loss function, as shown in Equation (5):
$$L_{dec} = -\frac{1}{N} \sum_{n=1}^{N} \left( y^{(n)} \log \tilde{y}^{(n)} + \left(1 - y^{(n)}\right) \log\left(1 - \tilde{y}^{(n)}\right) \right) \quad (5)$$
where $y^{(n)}$ is the class label, and $\tilde{y}^{(n)}$ is the predicted output of the decision network. Learning is performed separately for the segmentation and decision networks: the segmentation network is trained first, after which its weights are frozen and only the decision network layers are trained. Training only the decision layers mitigates the overfitting that could arise from the numerous weights in the segmentation network.
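A minimal sketch of this second training stage is shown below. The names seg_net, dec_net, and train_loader are hypothetical; it is assumed that seg_net returns the fused segmentation map together with its last feature map (as in the sketches above), and the SGD hyperparameters follow the training setup reported in Section 4.2.

```python
import torch

# Stage 1 trains the segmentation network alone (combined_seg_loss above).
# Stage 2: freeze its weights and train only the decision network.
for p in seg_net.parameters():
    p.requires_grad = False
seg_net.eval()

optimizer = torch.optim.SGD(dec_net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
bce = torch.nn.BCEWithLogitsLoss()

for images, labels in train_loader:        # labels: 1 = defective image
    with torch.no_grad():                   # segmentation stage is frozen
        fused, feats = seg_net(images)
    logit = dec_net(feats, torch.sigmoid(fused))
    loss = bce(logit.squeeze(1), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```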

4. Experimental Setup

4.1. Dataset Description

We evaluated our proposed method on the industrial magnetic tile defect dataset [34], as shown in Figure 6. This study covers the five defect types in the dataset (Blowhole, Break, Crack, Fray, and Uneven) as well as the Free (defect-free) condition. The dataset comprises 1344 samples, reflecting the practical difficulty of acquiring samples for industrial defect detection. The samples exhibit strong diversity, are influenced by varying lighting conditions, and pose a relatively high detection difficulty. To enlarge the dataset and prevent overfitting, we applied various image augmentation techniques, including image rotation, flipping, cropping, and transposition (a possible pipeline is sketched below). Table 1 shows the number of images of each type.
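One possible augmentation pipeline covering the listed operations is sketched below using the albumentations library, which transforms the image and its defect mask jointly; the specific probabilities and the pre-crop resize value are illustrative assumptions, since the paper does not state its exact augmentation parameters (only the 320 × 320 input size from Section 4.2 is used).

```python
import albumentations as A

# image and mask are numpy arrays: (H, W, C) and (H, W), respectively.
augment = A.Compose([
    A.Resize(352, 352),              # slight oversize before cropping (assumed)
    A.RandomCrop(320, 320),          # cropping to the network input size
    A.Rotate(limit=90, p=0.5),       # image rotation
    A.HorizontalFlip(p=0.5),         # flipping
    A.VerticalFlip(p=0.5),
    A.Transpose(p=0.5),              # transposition
])

augmented = augment(image=image, mask=mask)   # mask is transformed jointly
image_aug, mask_aug = augmented["image"], augmented["mask"]
```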
The dataset is partitioned into three subsets: 60% of the dataset is allocated to the training set, 20% to the validation set, and the remaining 20% to the test set. We adopted the Monte Carlo cross-validation method and conducted three random splits of the dataset. Instances featuring defects are denoted as positive examples, while instances without defects are labeled as negative examples.
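A minimal sketch of one such Monte Carlo repetition is given below: each repetition draws an independent random 60/20/20 split, and the procedure is repeated three times with different seeds, as described above.

```python
import numpy as np

def monte_carlo_split(n_samples, seed):
    """One random 60/20/20 train/validation/test split; repeating with
    different seeds gives Monte Carlo cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.6 * n_samples), int(0.2 * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# three independent random splits of the 1344-sample dataset
splits = [monte_carlo_split(1344, seed) for seed in range(3)]
```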

4.2. Implementation Details

  • Training Setup: the proposed network is built with the PyTorch framework 2.3 and trained using the SGD optimizer with a momentum of 0.9 and a weight decay of 1 × 10−4. The initial learning rate is set to 0.01 with a “Poly” decay strategy. All experiments were conducted on an NVIDIA GeForce RTX 3090 Ti 24 GB GPU (produced by NVIDIA, Santa Clara, CA, USA, and purchased in China). The batch size was set to 8, and the maximum number of epochs was 160.
  • Evaluation Metrics: we assess the model performance using the Intersection over Union (IoU) and F1-Score. IoU assesses how well the predicted defect regions overlap with the true defect regions. This is crucial in defect detection where the precise localization of each defect can significantly impact the subsequent decision-making processes in quality control systems. These metrics are based on a confusion matrix, which includes four elements: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). For each class, the IoU is defined as the intersection over the union of the predicted and true values, calculated as follows:
$$IoU = \frac{TP}{TP + FP + FN}$$
The F1 scores for each category are calculated as follows:
$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
where precision = TP/(TP + FP) and recall = TP/(TP + FN). In addition, mIoU is the average IoU over all categories, and the F1Avg score is the average F1 score over all categories. FLOPs (floating-point operations) denote the number of floating-point operations and reflect the computational efficiency of a model; in this paper's experiments, FLOPs are compared among models using an input image of size 320 × 320, and smaller values indicate better computational efficiency and lower time complexity. Params is the number of parameters in the algorithm and reflects its memory footprint; smaller values make the algorithm easier to deploy on mobile or embedded devices.
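The per-class IoU and F1 computations above reduce to a few lines of code; the following sketch derives both from confusion-matrix counts and averages them over classes (a small epsilon, an implementation convenience not mentioned in the paper, guards against division by zero for absent classes).

```python
import numpy as np

def iou_f1_from_counts(tp, fp, fn, eps=1e-9):
    """Per-class IoU and F1 from confusion-matrix counts."""
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, f1

def mean_scores(preds, targets, n_classes):
    """mIoU and average F1 over all classes from integer label maps."""
    ious, f1s = [], []
    for c in range(n_classes):
        tp = np.sum((preds == c) & (targets == c))
        fp = np.sum((preds == c) & (targets != c))
        fn = np.sum((preds != c) & (targets == c))
        iou, f1 = iou_f1_from_counts(tp, fp, fn)
        ious.append(iou)
        f1s.append(f1)
    return np.mean(ious), np.mean(f1s)
```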

5. Results and Discussion

5.1. Comparisons with State-of-the-Art Methods

In this study, the proposed architecture is compared with state-of-the-art semantic segmentation models. Predictions of the proposed segmentation structure on samples randomly selected from the test set are compared with advanced models such as FCN, U-Net, DeepLab v3+, U-Net++, Attention U-Net, and Trans U-Net, as shown in Figure 7. From the figure, it is evident that the proposed model segments these defect areas most closely compared to the other segmentation methods. The most challenging areas to segment are Fray and its connection with Break; most segmentation models perform poorly on Break and Fray, while our proposed segmentation network segments these regions satisfactorily.
From Table 2, it is evident that the proposed segmentation network outperforms the other semantic segmentation algorithms, achieving the highest mIoU together with the lowest FLOPs and low Params. U-Net++, which combines DenseNet-like structures with dense skip connections to improve gradient flow, slightly improves the mIoU but increases FLOPs and Params significantly, making it less suitable for real-time, low-memory applications. Attention U-Net introduces an attention mechanism into U-Net, improving the mIoU but with increased FLOPs and Params. Trans U-Net, which incorporates transformers, achieves the best segmentation accuracy among the competing models but at the cost of higher FLOPs and Params, similar to U-Net++.
Our model achieved an efficient inference time of 6.25 milliseconds and high accuracy metrics, surpassing competitors like DeepLab v3+, U-Net++, and Trans U-Net in terms of performance. Although FCN is the fastest with an inference time of only 2.56 milliseconds, its accuracy is lower compared to DEU-Net. U-Net and its advanced versions, despite their moderate inference speeds, also do not match the combined speed and accuracy of DEU-Net.
From Figure 8, it is visually clear that, although U-Net++, Trans U-Net, and DeepLab v3+ achieve good mIoU and F1 scores, the proposed segmentation network demonstrates its advantage by achieving comparable mIoU and F1 scores at lower FLOPs and Params. Meanwhile, FCN, U-Net, and Attention U-Net have lower FLOPs and Params but poorer segmentation accuracy.
To further validate our segmentation model, we also used the feature-similar KSDD dataset. The KSDD dataset contains surface images of 50 electronic commutators, capturing minor damage or cracks on their plastic-encapsulated surfaces; the images were collected under controlled conditions, such as uniform lighting, and provide pixel-level annotations of defects. The dataset comprises 399 images in total, of which 52 contain defects and 347 are defect-free.
Since the images were resized to the same dimensions as those in the magnetic tile dataset before being input into the network, the number of parameters and FLOPs are the same as those in Table 2. According to the results shown in Table 3, while Attention U-Net performs best on the mIoU metric, the model proposed in this paper follows closely, demonstrating competitive segmentation accuracy. Furthermore, the proposed method performs equally well on the F1 metric compared with Attention U-Net and exhibits superior predictive accuracy compared with the other high-performance models.

5.2. Ablation Results and Analysis

From Table 4, it is evident that incorporating multi-scale fusion significantly improves the mIoU and F1 scores of U-Net, but it also adds to U-Net's already large FLOPs. Using MBConv as the backbone reduces FLOPs and Params while providing a certain improvement in mIoU and F1. The architecture proposed in this paper combines both approaches in U-Net, achieving substantial improvements in mIoU and F1 while cutting FLOPs nearly in half compared with U-Net.
Using only MBConv as the backbone reduces FLOPs and Params but yields only moderate segmentation performance. In contrast, the proposed architecture, which incorporates both MBConv and multi-scale fusion, balances efficiency and accuracy effectively.
From Table 5, the combined loss function yields an mIoU of 66.94%, compared with 64.71% for the cross-entropy loss. This suggests that the combined loss function better handles the class imbalance in the dataset, effectively improving the accuracy of pixel-wise classification.

5.3. Visualization Results and Analysis

Based on Pos (only defective image samples) and PosNeg (both non-defective and defective image samples), we evaluated the network's performance. Figure 9 shows visual examples of correct and incorrect predictions. In the experiments, most samples were correctly segmented and classified; however, there are some false positives and false negatives due to insufficient lighting in certain corner areas. We also assessed performance in terms of the absolute number of misclassified test examples. Table 6 reports the performance of our network on industrial magnetic tile defects. Because non-defective examples greatly outnumber defective ones and are recognized at a high rate, which can inflate the overall PosNeg accuracy, we randomly selected 60 non-defective samples to add to PosNeg. It can be observed that our network performs well on Fray and Crack, with no misclassified examples, while maintaining high accuracy on negative test examples (i.e., non-defective examples).

6. Conclusions

This study explores a deep learning surface defect detection method based on a segmentation network. The proposed method consists of two stages. In the first stage, the architecture includes a lightweight encoder and a multi-scale fusion mechanism; the fusion mechanism incorporates the feature maps generated by each layer of the decoder into the loss function to improve the learning process. On the magnetic tile dataset, with FLOPs of 36.675 GMac and Params of 19.710 M, the model achieves an mIoU of 66.94 and an F1 score of 74.89, demonstrating that it outperforms its baseline model. The second stage adds a decision network on top of the segmentation network to predict whether the entire image contains anomalies. It performs well on Fray and Crack, with no misclassified examples, while maintaining high accuracy on negative test examples; the other defect types also exhibit high accuracy.
Existing defect detection algorithms based on deep learning primarily leverage the powerful fitting capability of neural networks. However, these methods do not consider the geometric constraints of defects. Therefore, combining the geometric features of cracks with the learning capability of convolutional neural networks to achieve more precise defect detection will be a future research direction. In addition, we will further study the detection robustness of the deep learning algorithm.

Author Contributions

Conceptualization, T.J. and Y.H.; methodology, T.J.; software, Y.H.; data curation, Z.H.; writing—original draft preparation, Y.H.; writing—review and editing, Z.H. and T.J.; visualization, Y.H. and Z.H.; supervision, T.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China, grant number: 2023YFF1103405.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface defect detection methods for industrial products: A review. Appl. Sci. 2021, 11, 7657. [Google Scholar] [CrossRef]
  2. Qi, S.; Yang, J.; Zhong, Z. A review on industrial surface defect detection based on deep learning technology. In Proceedings of the International Conference on Machine Learning and Machine Intelligence (MLMI), Online, 26–28 August 2020; ACM Press: New York, NY, USA, 2020; pp. 24–30. [Google Scholar]
  3. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar]
  4. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  7. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; pp. 234–241. [Google Scholar]
  9. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef] [PubMed]
  10. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.J.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  11. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  12. Ren, X.; Ahmad, S.; Zhang, L.; Xiang, L.; Nie, D.; Yang, F.; Wang, Q.; Shen, D. Task decomposition and synchronization for semantic biomedical image segmentation. IEEE Trans. Image Process. 2020, 29, 7497–7510. [Google Scholar] [CrossRef]
  13. Yang, W.; Zhang, J.; Chen, Z.; Xu, Z. An efficient semantic segmentation method based on transfer learning from object detection. IET Image Process. 2021, 15, 57–64. [Google Scholar] [CrossRef]
  14. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  15. Liu, J.; Xie, G.; Wang, J.; Li, S.; Wang, C.; Zheng, F.; Jin, Y. Deep industrial image anomaly detection: A survey. Mach. Intell. Res. 2024, 21, 104–135. [Google Scholar] [CrossRef]
  16. Chen, Z.; Deng, J.; Zhu, Q.; Wang, H.; Chen, Y. A systematic review of machine-vision-based leather surface defect inspection. Electronics 2022, 11, 2383. [Google Scholar] [CrossRef]
  17. Song, X.Y.; Bai, F.Z.; Wu, J.X.; Chen, X.; Zhang, T. Wood knot defects recognition with gray-scale histogram features. Laser Optoelectron. Prog. 2015, 52, 205–210. [Google Scholar]
  18. Putri, A.P.; Rachmat, H.; Atmaja DS, E. Design of automation system for ceramic surface quality control using fuzzy logic method at Balai Besar Keramik. MATEC Web Conf. 2017, 135, 00053. [Google Scholar] [CrossRef]
  19. Nisha, C.M.; Thangarasu, N. Deep Learning Algorithms and Their Relevance: A Review. Int. J. Data Inform. Intell. Comput. 2023, 2, 1–10. [Google Scholar] [CrossRef]
  20. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv 2015, arXiv:1511.00561. [Google Scholar] [CrossRef]
  21. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535. [Google Scholar] [CrossRef]
  22. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 327–331. [Google Scholar]
  23. Kermi, A.; Mahmoudi, I.; Khadir, M.T. Deep Convolutional Neural Networks Using U-Net for Automatic Brain Tumor Segmentation in Multimodal MRI Volumes. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2018; Lecture Notes in Computer Science; Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T., Eds.; Springer: Cham, Switzerland, 2019; Volume 11384. [Google Scholar] [CrossRef]
  24. Srivastava, A.; Kumar, V.V.; Mahesh, T.R. Detection of Covid-19 from X-ray Images using Deep Learning Techniques. Int. J. Data Inform. Intell. Comput. 2022, 1, 1–7. [Google Scholar] [CrossRef]
  25. Mei, Q.; Gül, M.; Azim, M.R. Densely connected deep neural network considering connectivity of pixels for automatic crack detection. Autom. Constr. 2020, 110, 103018. [Google Scholar] [CrossRef]
  26. Fei, Y.; Wang, K.C.P.; Zhang, A.; Chen, C.; Li, J.Q.; Liu, Y.; Yang, G.; Li, B. Pixel-level cracking detection on 3D asphalt pavement images through deep-learning-based CrackNet-V. IEEE Trans. Intell. Transp. Syst. 2019, 21, 273–284. [Google Scholar] [CrossRef]
  27. Zhang, A.; Wang, K.C.; Li, B.; Yang, E.; Dai, X.; Peng, Y.; Fei, Y.; Liu, Y.; Li, J.Q.; Chen, C. Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Comput. Aided Civ. Infrastruct. Eng. 2017, 32, 805–819. [Google Scholar] [CrossRef]
  28. Yang, C.; Guo, X.; Wang, T.; Yang, Y.; Ji, N.; Li, D.; Lv, H.; Ma, T. Automatic brain tumor segmentation method based on modified convolutional neural network. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 998–1001. [Google Scholar]
  29. Huang, Z.; Zhang, Q.; Shao, J.; Li, W.; Zhu, J.; Fang, D. Machining surface roughness detection by adaptive deep fusion capsule network with low illumination and noise robustness. Meas. Sci. Technol. 2023, 35, 015037. [Google Scholar] [CrossRef]
  30. Jin, X.; Xie, Y.; Wei, X.S.; Zhao, B.R.; Chen, Z.M.; Tan, X. Delving deep into spatial pooling for squeeze-and-excitation networks. Pattern Recognit. 2022, 121, 108159. [Google Scholar] [CrossRef]
  31. Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1395–1403. [Google Scholar]
  32. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  33. Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2020, 31, 759–776. [Google Scholar] [CrossRef]
  34. Huang, Y.; Qiu, C.; Yuan, K. Surface defect saliency of magnetic tile. Vis. Comput. 2020, 36, 85–96. [Google Scholar] [CrossRef]
Figure 1. The overall structure of the proposed defect detection.
Figure 2. The architecture of the segmentation network (DEU-Net).
Figure 3. Illustration of the MBConv block.
Figure 4. Network architecture diagram of the multi-scale fusion mechanism.
Figure 5. Architecture diagram of the decision network.
Figure 6. Examples of magnetic tile surface defects, labeled with pixel-level ground truths (GTs).
Figure 7. Detection results comparison on the magnetic tile dataset.
Figure 8. Performance and accuracy comparison on the magnetic tile dataset.
Figure 9. Examples of true positive and true negative (top two rows) as well as false positive and false negative (bottom two rows) detection on magnetic tiles using the proposed method. The classification scores for each example are displayed in the top-left corner.
Table 1. The number of various types of defects in the used dataset.

BlowHole   Break   Crack   Fray   Uneven   Free
115        85      57      32     103      952
Table 2. Segmentation results comparison on the industrial magnetic tile defect dataset.

Algorithm                FLOPs (GMac)   Inference Time (ms)   Params (M)   mIoU (%)   F1 (%)
FCN                      39.842         2.56                  18.644       61.37      70.16
U-Net                    62.791         4.35                  17.263       63.81      71.34
Deeplab v3+              64.109         13.46                 39.63        63.31      72.52
U-Net++                  312.615        16.03                 47.189       64.38      72.05
Attention U-Net          50.382         7.31                  93.231       62.46      70.94
Trans U-Net              104.112        17.14                 34.897       66.44      74.24
DEU-Net (Segmentation)   36.675         6.25                  19.710       66.94      74.89
Table 3. Comparison of segmentation results on the KSDD dataset.

Algorithm                mIoU (%)   F1 (%)
FCN                      87.23      90.23
U-Net                    89.23      92.35
Deeplab v3+              91.73      94.56
U-Net++                  90.97      93.78
Attention U-Net          91.73      94.34
Trans U-Net              91.42      93.98
DEU-Net (Segmentation)   91.59      94.76
Table 4. Comparison of our proposed model and the baseline models on the industrial magnetic tile defect dataset.

Algorithm      FLOPs (GMac)   mIoU (%)   F1 (%)
U-Net          62.791         63.81      71.34
U-Net+Mul      63.034         65.14      73.69
U-Net+MBConv   31.584         64.98      73.18
Ours           36.675         66.94      74.89
Table 5. Comparison of the proposed model on different loss functions.

Loss Variant         mIoU (%)   F1 (%)
Cross-entropy loss   64.71      72.94
Combined loss        66.94      74.89
Table 6. Classification results on the magnetic tile dataset (number of misclassified examples in parentheses).

Defect Type   Pos          PosNeg
Blowhole      88.8 (4)     95.8 (5)
Fray          100.0 (0)    94.2 (3)
Crack         100.0 (0)    96.6 (3)
Break         81.8 (3)     91.4 (6)
Uneven        100.0 (0)    94.3 (3)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

