MST: Multiscale Flow-Based Student–Teacher Network for Unsupervised Anomaly Detection

Yang, Yi; Yang, Yi; Zhou, Shubo; Gao, Yongbin; Zhu, Yadong; Wan, Xuefen; Hu, Weiyu; Jiang, Xueqin

doi:10.3390/electronics13163224


by Yi Yang ¹, Yi Yang ¹, Shubo Zhou ¹,*, Yongbin Gao ², Yadong Zhu ³, Xuefen Wan ⁴, Weiyu Hu ¹ and Xueqin Jiang ¹

¹ College of Information Science and Technology, Donghua University, Shanghai 201620, China

² School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

³ Operation Support Technology Research Institute, Commercial Flying Service Company, Shanghai 200241, China

⁴ North China Institute of Science and Technology, Langfang 065201, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(16), 3224; https://doi.org/10.3390/electronics13163224

Submission received: 18 July 2024 / Revised: 12 August 2024 / Accepted: 13 August 2024 / Published: 14 August 2024


Abstract: Student–teacher networks have shown promise in unsupervised anomaly detection; however, issues such as semantic confusion and abnormal deformations still restrict the detection accuracy. To address these issues, we propose a novel student–teacher network named MST by integrating the multistage pixel-reserving bridge (MPRB) and the spatial compression autoencoder (SCA) into the MMR network. The MPRB enhances inter-level information interaction and local feature extraction, improving anomaly localization and reducing the false detection area. The SCA bolsters global feature extraction, making the detection boundaries of larger defects clearer. Tested across various datasets, our method achieves state-of-the-art (SOTA) performance on the AeBAD-S, AeBAD-V, and MPDD datasets, with image-level AUROC scores of 87.5%, 78.5%, and 96.5%, respectively. Furthermore, our method exhibits competitive performance on the widely used MVTec AD dataset.

Keywords: unsupervised learning; anomaly detection; student–teacher network; MPRB; SCA

1. Introduction

Anomaly detection (AD) is essential in fields such as industry [1,2] and medical care [3]. In industry, visual inspection has traditionally been used to assess the quality of parts. However, this method is time-consuming and error-prone due to human limitations such as eye fatigue. Therefore, the automation of AD using image recognition is expected to alleviate these problems. Recently, image-based AD networks have achieved outstanding results in industrial part anomaly detection, leading to active research in this area. However, challenges remain due to the imbalance between normal and abnormal samples, as well as the highly variable scales of detection objects and defects. To address these challenges, we propose a novel student–teacher network model that handles the scale variability of objects and defects.

Image-based AD can be categorized into supervised methods and unsupervised learning methods. Currently, the application of supervised methods [4,5,6,7,8] is constrained by the high costs associated with data annotation and the paucity of genuine anomalous data. Therefore, most current research is directed toward unsupervised learning methods [9]. Unsupervised learning methods cluster datasets based on the similarity between data, thereby learning the distribution of the data or the relationships between data. This approach has been applied in various fields, including anomaly detection tasks [10,11] and graph clustering tasks [12,13]. Among these, unsupervised anomaly detection is widely used in the industrial sector.

Unsupervised anomaly detection involves identifying and localizing anomalous patterns with no labeled samples. This method has found extensive applications across a diverse range of domains, including industrial anomaly detection [14,15,16,17,18], medical diagnosis [19], and video surveillance [20,21,22].

Current unsupervised AD methods use only normal samples during training. In the inference phase, anomalies are detected and localized by calculating the distance between the features extracted from the test samples and the features learned from the training samples. Early approaches to unsupervised AD relied on generative algorithms such as generative adversarial networks (GANs) [23,24] and variational autoencoders (VAEs) [25,26]. These methods detect anomalies by utilizing pixel-wise reconstruction errors or assessing the density from the model’s probability distribution. However, they often yield considerable residuals due to minor localization inaccuracies around edges and may fail to detect abnormal regions with relatively consistent intensity values. Another unsupervised AD approach involves using discriminative embeddings from pretrained networks, which constructs a feature space through transfer learning, and employs weighted networks to score the similarity between synthetic images and real images. However, these methods necessitate substantial training data subsampling because of their inadequate capacity to model complex data distributions with numerous training examples. To address these challenges, the student–teacher method [27] implicitly learns the distribution of training features. In this framework, a pretrained descriptive feature extractor, acting as the teacher, transfers knowledge of normal patterns to student networks on anomaly-free training data. The student network is trained to mimic the teacher’s output, enabling it to adequately learn the features of normal samples. Consequently, during inference, the student’s features on unseen anomalous data differ from those output by the teacher, facilitating anomaly detection through comparative feature analysis [27].

The complexity of industrial environments results in a wide variety of anomalies with significant scale variations. Although the student–teacher method has achieved a significant performance improvement in anomaly detection, handling objects with significant scale variations remains a challenge. To overcome this limitation, we introduce an innovative student–teacher network model. This model pairs a reconstruction-based network with two auxiliary modules: the multistage pixel-reserving bridge (MPRB) and the spatial compression autoencoder (SCA). The MPRB is proposed to bolster the interaction between the output multiscale features, refining the capture of local details through the fusion of adjacent feature layers. This multiscale feature fusion strategy strengthens inter-level information exchange and local feature extraction, improving anomaly localization and reducing false detection areas. On the other hand, the SCA enhances the model’s ability to acquire global information via feature compression and recovery, leading to clearer detection boundaries for larger defects. By hierarchically fusing the multiscale features from the MPRB and SCA, our method improves the model’s ability to acquire global and local information and enhances the interaction between multiscale features, reducing the false alarms caused by insufficient feature extraction. Extensive experiments demonstrate the effectiveness of our model. Our contributions can be summarized as follows:

1. We propose a multistage pixel-reserving bridge (MPRB) block, designed to capture local semantic information effectively. Within this block, a pixel shuffling strategy is employed to upsample the feature map, and the convolutional block attention module (CBAM) is utilized to highlight discriminative features along both the spatial and channel dimensions. This integration bolsters the model’s ability to acquire local details, thereby improving the interaction between multiscale features.
2. We propose a spatial compression autoencoder (SCA) to capture global semantic information. This block features an encoder and a decoder, each consisting of multiple sampling layers. This autoencoder structure empowers the model to effectively capture global features and to achieve hierarchical fusion with the multiscale features extracted by the MPRB. This integration further enhances the model’s proficiency in detecting anomalies across various scales within images.
3. We integrated the MPRB and SCA into the MMR, substituting the original simple feature pyramid network (simple FPN) with our novel student network architecture. Extensive experiments and ablation studies were conducted to validate the effectiveness of our network. Moreover, our model was tested on diverse datasets, showcasing competitive results on the MVTec AD dataset and attaining state-of-the-art (SOTA) performance across several challenging and industrially relevant datasets, such as AeBAD-S, AeBAD-V, and MPDD.

The rest of this paper is organized as follows. In Section 2, the related work is introduced. Section 3 describes the proposed model in detail. Section 4 reports all the experimental results. This paper is concluded in Section 5.

2. Related Works

Anomaly detection and localization methods can be primarily classified into three categories: synthesis, embedding, and reconstruction-based approaches.

2.1. Synthesis-Based Approaches

Synthesis-based approaches typically integrate fragments of training or random images into regular images and generate a synthetic mask for supervised segmentation training. Various techniques can be used for this purpose, including autoencoders [28] and GANs [29]. Liu et al. [30] introduced a fabric defect detection framework based on a GAN by generating defects on defect-free fabric images to facilitate semantic segmentation training. Rippel et al. [31] established a fundamental architecture for transferring defects from one fabric to another by utilizing CycleGAN [32] with ResNet/U-Net as the generator. Wei et al. [33] proposed a model named defect style transfer (DST) to simulate defective samples by introducing a masked histogram matching module and maintaining the color consistency between the generated areas and real defects, making the generated images more realistic. Wei et al. [34] proposed defect segmentation with simulation (DSS), which uses a traditional GAN to reconstruct the defect structure in specified regions of defect-free samples. Zhang et al. [35] proposed layering defects over normal backgrounds, treating the defects as the foreground. Jain et al. [36] explored the use of DCGANs [37], ACGANs [38], and InfoGANs [39] to generate defect images by adding noise, improving the classification accuracy. Wang et al. [40] proposed the DTGAN based on StarGANv2 [41], which introduces a foreground–background decoupling strategy for style control and evaluates image generation quality using FID [42].

With respect to autoencoder-based methods, Zavrtanik et al. [43] proposed the discriminatively trained reconstruction anomaly embedding model (DRAEM), which was designed to identify synthetically generated out-of-distribution patterns end-to-end. This method learns a joint representation of anomaly images and their anomaly-free reconstructions, enabling direct anomaly localization without complex postprocessing. Schlüter et al. [44] introduced a model named natural synthetic anomalies (NSA), which integrates the Poisson image editing functionality to blend patches of various sizes from different images seamlessly. This method can produce synthetic anomalies that are more similar to irregularities in natural subimages. Although DRAEM makes it possible to substantially optimize the detection performance, generating a comprehensive anomaly set containing all anomalies remains challenging due to defect diversity and unpredictability. Additionally, synthetic anomalies often have difficulty matching real anomalies.

2.2. Feature Embedding-Based Approaches

Feature embedding-based approaches learn the distribution of features that can distinguish between normal and abnormal samples. Typically, the feature extractors of these methods [10,11,45,46] utilize a network pretrained on large datasets such as ImageNet to extract shallow features from images and achieve anomaly detection through feature comparison. Cohen et al. [47] utilized a k-NN approach using feature maps from various network layers for fine-grained anomaly detection and localization. Rippel et al. [48] encoded features as multivariate Gaussian distributions and computed anomaly scores using the Mahalanobis distance [49]. Defard et al. [10] proposed patch distribution modeling (PaDiM), which utilizes a pretrained convolutional neural network (CNN) for patch embedding. It obtains a probabilistic representation of the normal class using a multivariate Gaussian distribution and leverages the correlation between different semantic layers of the CNN for better anomaly localization. Li et al. [50] proposed an unsupervised anomaly detection approach based on a self-organizing map (SOM). This method maintains normal features using topological memory based on multiscale features and computes image-level anomaly scores by calculating the maximum score between anomalous pixels using the Mahalanobis distance. Roth et al. [11] proposed PatchCore, which aggregates features extracted using average pooling to compute features for each patch. During training, features of normal data are stored in a memory bank. During inference, patch-level anomaly scores are obtained via the k-NN method, and the maximum value of these patch-level anomaly scores is considered the image-level anomaly score.

For student–teacher methods, Bergmann et al. [27] introduced the pioneering framework in which the teacher network is a pretrained frozen CNN, and the student network is trained to mimic the outputs of the teacher network on training images. Since the student network is never exposed to anomaly images during training, its outputs differ significantly from the teacher’s in anomaly regions. Wang et al. [51] utilized multiscale features extracted from different network layers for knowledge distillation to improve the similarity between the features of the teacher and student networks in normal samples while promoting dissimilarity in anomaly samples. Salehi et al. [52] observed that a lightweight student network structure outperforms an equivalent student network and improved knowledge distillation by designing different student networks. Deng et al. [45] introduced multiscale feature fusion (MFF) blocks and a one-class bottleneck (OCB), embedding teacher features into MFF and OCB to eliminate redundant features across multiple scales. Rudolph et al. [53] suggested that student–teacher models with the same structure tend to extract features too similar to those extracted from anomaly images. To address this issue, they proposed an asymmetric student–teacher framework. In this framework, a normalizing flow for density estimation is trained as a teacher. Additionally, a traditional feedforward network is trained as a student, triggering anomalies at a significant distance. They also introduced a normalizing flow to prevent estimation bias arising from the structural differences between the two network frameworks. Feature embedding-based approaches learn target feature embeddings to separate normal and anomalous samples in the feature space. However, directly using pretrained features in these methods may lead to mismatch problems. Moreover, these approaches rely on the consistency of object scales, which makes it difficult to optimize them when the object position varies significantly.

2.3. Reconstruction-Based Approaches

Compared to feature embedding-based approaches, reconstruction-based approaches have better robustness in dealing with objects with variable scales [54,55]. Images are reconstructed by training encoders and decoders, reducing the reliance on pretrained models and decreasing the sensitivity to scale variations. Earlier studies employed autoencoders [56,57] for image reconstruction. Shi et al. [56] proposed an effective unsupervised anomaly segmentation method capable of detecting and segmenting anomalies within narrow image regions. They generated multiple spatial context-aware representations for each subregion of an image using a pretrained deep convolutional network. These region representations describe local features of corresponding regions and encode multiple spatial contextual information to make them distinguishable for anomaly detection. Liu et al. [57] proposed a reconstruction network capable of reconstructing original RGB images from grayscale edge values. This network utilizes a UNet-type denoising autoencoder with skip connections to effectively preserve high-frequency information from the original images using input edges and skip connections. Akcay et al. [58] trained on normal samples to learn the normal distribution of the domain, detecting anomalies based on deviations from the model. They employed an adversarial training scheme in the chosen architecture to perform the reconstruction task in image space and low-dimensional embedded vector space coding. Liang et al. [59] proposed the frequency decoupling (FD) method, which decouples input images into different frequency components and models the reconstruction process as a combination of parallel multifrequency image restoration. Considering the correlation between multiple frequencies, they proposed the channel selection (CS) module to realize frequency interaction between different encoders by adaptively selecting optimal channels.

However, the above methods suffer from overgeneralization, which may cause normal reconstruction results in anomalous regions. To address this problem, some researchers have proposed image inpainting-based approaches [17,54,60,61], which utilize a mask to remove a portion of the original image, preventing anomalous region reconstruction. Zavrtanik et al. [60] proposed reconstruction-by-inpainting-based anomaly detection (RIAD). This method randomly removes some image regions and reconstructs the image by partial inpainting, addressing the drawback of autoencoder-based methods where anomalous regions generalize well. Pirnay et al. [17] proposed the inpainting transformer (InTra), which is trained to overlay patches from many image patch sequences to integrate information from various regions in the input image. Jiang et al. [61] first applied anomaly simulation and masking strategies on anomaly-free samples to generate simulated anomalies, addressing insufficient anomaly samples in the training stage. Then, they utilized the powerful global learning capability of the Swin transformer [62] to inpaint the masked regions. Finally, end-to-end anomaly detection was performed using a convolution-based U-Net. Zhang et al. [54] bolstered the ability of the model to infer causal relationships among patches in normal samples with a masking reconstruction task to address limitations when the domains of normal samples in the test set are shifted. However, for images with irregular scale variations, the loss of original information due to inpainting may limit the ability of the model to fully capture semantic information and the global context, thus restricting reconstruction capability and reducing accuracy.

3. Methodology

This section provides a detailed introduction to our proposed model and the submodules within the student network. In Section 3.1, we will describe the overall architecture. Section 3.2 will introduce the MPRB, and Section 3.3 will introduce the SCA.

3.1. Overall Architecture

In this section, we introduce the architectural framework of our network, as depicted in Figure 1. The network comprises three modules: a vision transformer (ViT) for the feature reconstruction of randomly masked images, a frozen pretrained teacher network T, and a student network S involving an MPRB and an SCA.

During the training phase, we denote the training set as $X_{train}$. For the student network, input normal images $I$ are randomly masked and subsequently processed by a vanilla ViT [63] to reconstruct unmasked features $I_{F}^{'} \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are feature height, width, and channel depth, respectively. Then, the reconstructed features $I_{F}^{'}$ are fed into the student network to extract multiscale student features $o_{f}^{k}$ of normalcy. For the teacher, input normal images are directly fed into a frozen pretrained WideResNet50 [64] to extract multiscale teacher features $o_{t}^{k}$. Finally, feature alignment and comparison strategies are utilized to assess the student and teacher features, with the cosine similarity metric employed to compute the loss for network optimization. The process of a training epoch is detailed in Algorithm 1.

Algorithm 1 One-epoch training of our model

For each $I$ in $X_{train}$:
    Randomly mask the input normal image $I$ and feed it into a ViT to reconstruct the entire unmasked features $I_{F}^{'}$:
        $I_{F}^{'} = \mathrm{ViT}(I)$
    Feed the reconstructed features $I_{F}^{'}$ into the student network $S$, which involves an MPRB and an SCA, to extract multiscale features:
        $o_{sm}^{1}, o_{sm}^{2}, o_{sm}^{3} = \mathrm{MPRB}(I_{F}^{'})$
        $o_{ss}^{1}, o_{ss}^{2}, o_{ss}^{3} = \mathrm{SCA}(I_{F}^{'})$
    Concatenate the features from the MPRB and the SCA, and process them with a CBAM attention module to obtain the student features:
        $o_{f}^{1} = \mathrm{CBAM}(\mathrm{Cat}(o_{sm}^{1}, o_{ss}^{1}))$
        $o_{f}^{2} = \mathrm{CBAM}(\mathrm{Cat}(o_{sm}^{2}, o_{ss}^{2}))$
        $o_{f}^{3} = \mathrm{CBAM}(\mathrm{Cat}(o_{sm}^{3}, o_{ss}^{3}))$
    Feed the input normal image $I$ into the teacher network $T$ to extract teacher features:
        $o_{t}^{1}, o_{t}^{2}, o_{t}^{3} = T(I)$
    Calculate the total loss and backpropagate to update the network:
        $L = \sum_{k=1}^{3} \mathrm{Loss}(o_{f}^{k}, o_{t}^{k})$
        L.backward()

During the inference phase, we denote the test set as $X_{test}$. The architecture of the inference network mirrors that of the training network. Anomalies are identified by computing and fusing the heatmaps based on the cosine similarity distance between the student features $o_{f}^{k}$ and the teacher features $o_{t}^{k}$.
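The inference step can be illustrated with a small pure-Python sketch. The text does not specify how the per-scale heatmaps are resized or fused, so nearest-neighbour resizing, summation across scales, and taking the maximum as an image-level score are assumptions made here for illustration only; all function names are hypothetical.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm, eps)

def distance_map(student, teacher):
    """Per-position cosine distance for feature maps stored as [h][w][c] lists."""
    return [[cosine_distance(s, t) for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(student, teacher)]

def upsample_nearest(grid, size):
    """Nearest-neighbour resize of a square 2-D grid (assumed resizing choice)."""
    n = len(grid)
    return [[grid[i * n // size][j * n // size] for j in range(size)]
            for i in range(size)]

def fuse_heatmaps(student_feats, teacher_feats, size):
    """Sum the per-scale distance maps after resizing them to size x size."""
    fused = [[0.0] * size for _ in range(size)]
    for s_map, t_map in zip(student_feats, teacher_feats):
        resized = upsample_nearest(distance_map(s_map, t_map), size)
        for i in range(size):
            for j in range(size):
                fused[i][j] += resized[i][j]
    return fused

# Toy features at two scales (1 x 1 and 2 x 2) with two channels each.
teacher = [[[[1.0, 0.0]]],
           [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]]
student = [[[[1.0, 0.0]]],
           [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]]  # deviates at (1, 1)

heatmap = fuse_heatmaps(student, teacher, size=2)
image_score = max(max(row) for row in heatmap)  # assumed image-level score
```

In this toy case the student matches the teacher everywhere except at position (1, 1) of the finer scale, so only that location of the fused heatmap is nonzero.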

3.2. Multistage Pixel-Reserving Bridge

In anomaly detection, the variability in the scale of objects and defects poses a significant challenge for the single-scale model to capture adequate semantic information. To address this issue, MMR [54] has employed a simple FPN to extract multiscale features for obtaining semantic information. However, this approach overlooks the feature interaction across different scales, resulting in weak feature representation. Therefore, we propose the MPRB, which facilitates effective feature interaction across different layers through feature concatenation and attention mechanisms. The MPRB framework is illustrated in Figure 2.

MPRB consists of $l$ layers. The first layer is an original resolution layer, where the input feature $I_{F}^{'}$ is processed with a convolution (Conv)-batch normalization (BN)-rectified linear unit (ReLU) layer to obtain the original resolution feature, denoted as $o_{m}^{1}$. Subsequent layers include an upsampled feature extraction layer and an adjacent feature interaction layer. For the $k$-th layer, where $k$ ranges from 2 to $l$, the input feature is upsampled using a pixel shuffling [65] operation with a scale factor of $2^{k-1}$, followed by a Conv-BN-ReLU layer to obtain the upscaled feature, denoted as $z_{h,w}^{k}$. Then, $z_{h,w}^{k}$ is concatenated with the output feature of the $(k-1)$-th layer, which has also been upsampled by a factor of 2 using pixel shuffling. Finally, the concatenated feature is processed with a CBAM [66] to emphasize discriminative features across the channel and spatial dimensions, and a 1 × 1 convolutional layer is employed to fuse the channels to obtain the interactive feature $o_{m}^{k}$. The overall calculation of our MPRB can be expressed as follows:

$$o_{m}^{k} = \begin{cases} f_{cbr}(I_{F}^{'}), & k = 1 \\ f_{cbam}(f_{cat}(z_{h,w}^{k}, f_{u}(z_{h,w}^{k-1}))), & k = 2, \dots, l \end{cases} \tag{1}$$

where $f_{cbr}(\cdot)$ represents the Conv-BN-ReLU layer, $f_{u}(\cdot)$ signifies the upsampling operation, $f_{cbam}(\cdot)$ is the CBAM, and $f_{cat}(\cdot)$ denotes the channelwise concatenation process.

As shown in Figure 3, in (a), the bilinear upsampling leads to disrupted and inconsistent fine structures in the ground truth. In (b), deconvolution tends to cause checkerboard artifacts [67], where certain areas of the image appear darker than others. With increased sampling, these methods alter more pixel values at each stage, leading to inconsistent ground truth across different stages. In contrast, in (c), the pixel shuffle rearranges the multi-channel image $I^{H \times W \times r^{2}}$ into a ground truth $I^{rH \times rW \times 1}$ without discarding or altering pixel values, thereby preserving structural properties. Consequently, in this work, we choose the pixel shuffling strategy for image upsampling.
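To make the "no pixel is discarded or altered" property concrete, the following is a minimal pure-Python sketch of pixel shuffling on nested lists. It is an illustration, not the authors' implementation; the channel ordering follows the common convention in which input channel `c*r*r + dy*r + dx` supplies output pixel `(h*r + dy, w*r + dx)`, and the function name is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channel count must be divisible by r^2"
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for dy in range(r):
            for dx in range(r):
                src = x[c * r * r + dy * r + dx]
                for h in range(height):
                    for w in range(width):
                        # Each input value is copied to exactly one output pixel.
                        out[c][h * r + dy][w * r + dx] = src[h][w]
    return out

# Toy input: 4 channels of size 1 x 2, upsampled with r = 2 into 1 channel of 2 x 4.
x = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
y = pixel_shuffle(x, 2)

# The multiset of values is unchanged; only their positions differ.
values_in = sorted(v for ch in x for row in ch for v in row)
values_out = sorted(v for ch in y for row in ch for v in row)
```

Because the operation is a pure rearrangement, `values_in` and `values_out` are identical, which is the property that bilinear interpolation and deconvolution do not share.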

Our MPRB shares a similar structure with the FPN [68] in its design as a multiscale feature extractor. However, our module incorporates a scale-by-scale feature fusion strategy, facilitating dynamic interaction across various scales. This approach not only enriches feature representation but also mitigates the issues of blurring and artifacts commonly associated with the high upsampling interpolations in standard FPN architectures.

3.3. Spatial Compression Autoencoder

In the process of anomaly detection, the effective extraction of global information is crucial for detecting anomalies, such as mismatch or position shifts caused by the changes in illumination and view. Inspired by [69], we propose a spatial compression autoencoder (SCA) based on feature compression and reconstruction. The SCA framework is illustrated in Figure 4.

The SCA network includes two submodules: an image feature compression and reconstruction module based on an asymmetric U-Net structure, and an FPN module. Feature compression is performed using two downsampling layers to compress the input features to a single pixel, while feature reconstruction uses progressive feature upsampling to reconstruct to a resolution of H × W. This calculation process can be expressed as follows:

$$s_{h,w} = f_{up}(f_{down}(I_{F}^{'})) \tag{2}$$

where $s_{h,w}$ is the output feature. $f_{down}(\cdot)$ is the downsampling layer consisting of two cascaded convolutional layers: the first is a 4 × 4 convolution with a stride of 2, and the second is an H/2 × W/2 convolution with a stride of 1. $f_{up}(\cdot)$ is the upsampling layer consisting of five cascaded convolutional layers, each with a 4 × 4 convolution, and a bilinear interpolation layer with an upsampling rate of 2.
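A quick size calculation shows why these two downsampling convolutions reach a single pixel. The sketch below uses the standard convolution output-size formula; the padding of 1 in the first layer is an assumption (the text does not state it), and H = 14 is taken from the 224 × 224 input divided into 16 × 16 patches described in the implementation details.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

H = 14  # 224 / 16 patches per side

# First layer: 4 x 4 convolution, stride 2 (padding=1 is an assumed value).
after_first = conv_out(H, kernel=4, stride=2, padding=1)

# Second layer: (H/2) x (W/2) convolution, stride 1, no padding.
after_second = conv_out(after_first, kernel=H // 2, stride=1)
```

Under these assumptions the first layer halves the 14 × 14 map to 7 × 7, and the 7 × 7 kernel of the second layer covers the whole map, so every output value depends on every input position.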

To match the multiscale outputs of the MPRB, we employ the FPN for multiscale feature extraction, with the output feature of the SCA denoted as $o_{s}^{k}$. This calculation process is as follows:

$$o_{s}^{k} = \begin{cases} f_{cbr}(s_{h,w}), & k = 1 \\ f_{u}(s_{h,w}, 2^{k-1}(H, W)), & k = 2, \dots, l \end{cases} \tag{3}$$

Our SCA module achieves global receptive field acquisition by compressing the features to a single pixel, resulting in final features that are rich in global semantic information. This complements the local features obtained by the MPRB effectively. By fusing the $o_{m}^{k}$ and $o_{s}^{k}$ features, our student network ultimately acquires features that effectively integrate both global and local information, denoted as $o_{f}^{k}$. By concatenating global and local features along the channel dimension and processing through the CBAM, the resulting feature map $o_{f}^{k}$ not only retains both global and local information but also enhances the model’s focus on key features by integrating channel and spatial attention. This approach improves the model’s representational capacity and recognition performance.

3.4. Loss Function

We follow [54] in using the cosine similarity distance as the loss function for our model, which can be expressed as follows:

$$\mathrm{Loss} = \sum_{k=1}^{l} \frac{1}{2^{k-1}H \times 2^{k-1}W} \sum \left(1 - \frac{o_{f}^{k} (o_{t}^{k})^{T}}{\|o_{f}^{k}\| \, \|o_{t}^{k}\|}\right) \tag{4}$$

where $H$ and $W$ denote the height and width of the input feature maps, $l$ denotes the number of feature scales, the inner sum runs over the spatial positions of the $k$-th feature map, and $o_{t}^{k}$ and $o_{f}^{k}$ denote the feature outputs from the teacher and student networks, respectively.

We choose a cosine similarity loss function because it focuses on the direction of feature vectors, which is more relevant in anomaly detection than their magnitude. Unlike $L_{1}$ or $L_{2}$ norms that are sensitive to scale, the cosine similarity is better at identifying subtle changes in patterns or structures, making it more robust to intensity or scale variations. Additionally, the cosine similarity normalizes the input vectors, reducing the influence of feature scale variations in high-dimensional spaces. For these reasons, it is the preferred loss function for our task.
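Equation (4) and the scale argument above can be sketched in a few lines of pure Python. This is an illustrative reading of the loss, not the authors' code: feature maps are nested lists indexed `[h][w][c]`, the per-scale term is taken as the mean cosine distance over spatial positions, and the helper names are hypothetical.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm, eps)

def multiscale_cosine_loss(student_feats, teacher_feats):
    """Sum over scales of the mean per-position cosine distance."""
    total = 0.0
    for s_map, t_map in zip(student_feats, teacher_feats):
        n_positions = len(s_map) * len(s_map[0])
        dist_sum = sum(cosine_distance(s, t)
                       for s_row, t_row in zip(s_map, t_map)
                       for s, t in zip(s_row, t_row))
        total += dist_sum / n_positions
    return total

def l2_distance(u, v):
    """Euclidean distance, shown only for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# One scale, one spatial position, two channels.
teacher = [[[[3.0, 4.0]]]]
student = [[[[4.0, 3.0]]]]
student_scaled = [[[[8.0, 6.0]]]]  # same direction, doubled magnitude

loss = multiscale_cosine_loss(student, teacher)
loss_scaled = multiscale_cosine_loss(student_scaled, teacher)
l2 = l2_distance(student[0][0][0], teacher[0][0][0])
l2_scaled = l2_distance(student_scaled[0][0][0], teacher[0][0][0])
```

Doubling the student feature leaves the cosine loss unchanged at about 0.04, whereas the $L_{2}$ distance grows, which illustrates the insensitivity to feature magnitude described above.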

4. Experiments

In this section, we first describe three anomaly detection datasets and present implementation details. Then, we compare our model with other state-of-the-art approaches. Finally, we perform an ablation study of the proposed module to validate its effectiveness.

4.1. Datasets and Implementation Details

Datasets: we use Aero-engine Blade Anomaly Detection (AeBAD) [54], Metal Parts Defect Detection (MPDD) [1], and MVTec Anomaly Detection (MVTec AD) [14] to validate our methods.

The AeBAD comprises two subdatasets: the single-blade dataset (AeBAD-S) and the video anomaly detection of blades (AeBAD-V). The dataset comprises a total of 1228 training samples and 4342 testing samples.

The MPDD is a dataset for detecting defects in the manufacturing process of painted metal parts. The training set of the dataset comprises 888 normal samples; the test set comprises 176 normal samples and 282 abnormal samples.

The MVTec AD consists of 15 categories, with an average of about five defect types per category, for a total of 73 distinct defect types. The dataset consists of 3629 images used for training and validation and 1725 images used for testing.

Implementation Details: We use PyTorch [70] to construct our model and perform experiments on an RTX 3090. For the training process, the input image is augmented with random cropping, resized to 256 × 256, and then center cropped to 224 × 224. Following the masked autoencoder (MAE) [71], we divide an image into regular non-overlapping patches with a resolution of (16, 16) pixels. Then, we sample random patches with a masking ratio of 0.4 and feed only the unmasked patches into ViT. The backbone of ViT is the vanilla ViT-B [63], which has been pretrained using MAE. In MPRB, the feature scale number l is set to 3. For the teacher network, the features from layers 1, 2, and 3 of the pretrained WideResNet50 are adopted, resulting in multiscale feature maps with sizes of 14 × 14 pixels, 28 × 28 pixels, and 56 × 56 pixels. Finally, the optimization of the network is carried out using the AdamW optimizer [72] with stepwise learning rate decay.
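The patch masking described above can be sketched as follows. This is a minimal illustration under stated assumptions: the number of kept patches is rounded down, the seed exists only to make the example reproducible, and the function name is hypothetical.

```python
import random

def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.4, seed=0):
    """Return (kept, masked) patch indices for one image."""
    grid = image_size // patch_size      # 14 patches per side
    num_patches = grid * grid            # 196 patches in total
    # Rounding down is an assumption about how the ratio is applied.
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)   # seed is illustrative only
    kept = sorted(order[:num_keep])      # indices fed into the ViT
    masked = sorted(order[num_keep:])    # indices hidden from the ViT
    return kept, masked

kept, masked = random_patch_mask()
```

With a 224 × 224 input and 16 × 16 patches, the image yields 196 patches, of which 117 are kept and 79 are masked under these assumptions.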

For the inference process, the image processing method mirrors that of the training phase. We use the image-level area under the receiver operating characteristic curve (AUROC) as the evaluation metric for anomaly detection. The pixel-level AUROC and the per-region overlap (PRO) are used as the evaluation metrics for anomaly localization.
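For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The sketch below computes it by direct pairwise comparison; it is a small reference illustration rather than the evaluation code used in the experiments, and it does not cover PRO.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; labels are 1 (anomalous) or 0 (normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # anomalous sample ranked above normal sample
            elif p == n:
                wins += 0.5      # ties count as one half
    return wins / (len(pos) * len(neg))

# Four samples: two normal (label 0) and two anomalous (label 1).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
value = auroc(scores, labels)
```

Here three of the four anomalous-normal pairs are ranked correctly, giving an AUROC of 0.75. The same function applies at the image level (one score per image) or at the pixel level (one score per pixel).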

4.2. Comparison with State-of-the-Art Methods

In this section, we quantitatively and qualitatively compare our model with some state-of-the-art methods, including MMR [54], PatchCore [11], RD4AD [45], DRAEM [43], NSA [44], RIAD [60], InTra [17], SPADE [47], DAGAN [73], PaDiM [10], CFLOW-AD [18], and NtNR [55]. Considering the similarity of the algorithm structure, we adopt MMR as the baseline algorithm.

(1) Results on AeBAD-S: We first evaluate our method on AeBAD-S. The experimental results are presented in Table 1. Our model achieves the highest mean image-level AUROC and pixel-level PRO, 87.5% and 89.9%, surpassing the baseline MMR [54] by 2.8 and 0.8 percentage points, respectively. In the “background” and “view” categories, our model improves markedly, reaching state-of-the-art (SOTA) image-level AUROC scores of 91.6% and 82.6%, respectively. Our model also improves on the baseline in the remaining categories. In terms of error (100% minus the image-level AUROC), we achieve 12.5%, a relative reduction of 18.3% compared with the previous SOTA error of 15.3%.

(2) Results on AeBAD-V: Table 2 lists the results on AeBAD-V. Since the AeBAD-V dataset does not provide pixel-level ground-truth annotations, detection is evaluated by image-level AUROC only, and we use the same metric for comparison. Our method surpasses the previous SOTA method MMR [54] (78.2%) and achieves a new SOTA image-level AUROC of 78.5%.

(3) Results on MPDD: Table 3 lists the results on MPDD. Our model achieves the highest mean image-level AUROC and pixel-level AUROC, 96.5% and 98.8%, surpassing the previous SOTA method NtNR [55] by 1.6 and 1.0 percentage points, respectively. In the “Bracket Brown” category, our model achieves a significant improvement, reaching an image-level AUROC of 99.5%. Our model also obtains the highest image-level AUROC, counting ties, in five of the six categories. In terms of error (100% minus the image-level AUROC), we achieve 3.5%, a relative reduction of 31.4% compared with the previous SOTA error of 5.1%.
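The error-reduction figures quoted for AeBAD-S and MPDD follow from treating 100% minus the image-level AUROC as the error and comparing it with the previous best method. The short sketch below reproduces that arithmetic from the mean values in Tables 1 and 3; the helper name `relative_error_reduction` is an assumption introduced here for illustration.

```python
def relative_error_reduction(prev_auroc: float, new_auroc: float) -> float:
    """Relative reduction (in %) of the error, where error = 100 - AUROC."""
    prev_error = 100.0 - prev_auroc
    new_error = 100.0 - new_auroc
    return 100.0 * (prev_error - new_error) / prev_error


# Mean image-level AUROC values taken from Tables 1 and 3.
aebad_s = round(relative_error_reduction(84.7, 87.5), 1)  # MMR vs. ours
mpdd = round(relative_error_reduction(94.9, 96.5), 1)     # NtNR vs. ours
print(aebad_s, mpdd)
```

Rounded to one decimal place, this gives 18.3 for AeBAD-S and 31.4 for MPDD, matching the relative reductions reported in the text.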

(4) Results on MVTec AD: Table 4 lists the results on MVTec AD. Our model achieves competitive mean image-level and pixel-level AUROC scores of 99.0% and 97.3%, surpassing the baseline MMR [54] by 0.6 and 0.1 percentage points, respectively. In the “screw” category, our model shows a significant improvement, surpassing the baseline by 4.7 percentage points in image-level AUROC.

4.3. Ablation Study

We conducted ablation studies on our method using the AeBAD-S and MPDD datasets to demonstrate the improvement contributed by each of our proposed modules, as shown in Table 5.

(1) Effect of MPRB: When the original simple feature pyramid network in MMR [54] is replaced with our designed MPRB, the image-level AUROC on AeBAD-S and MPDD increases by 2.6 and 3.1 percentage points, respectively. This demonstrates the effectiveness of our designed MPRB.

(2) Effect of SCA: When the SCA is incorporated on top of the MPRB, we observe further improvements of 0.2 and 0.5 percentage points on AeBAD-S and MPDD, respectively. This confirms that integrating multiscale features with global information can further enhance the model’s ability to detect anomalies.

(3) The anomaly localization ability of our model: Figure 5 shows visual examples of our model’s results on AeBAD-S and MPDD. Compared with the baseline MMR, our model produces a significantly smaller false detection area on bright backgrounds and anomaly-free objects. For small defects, its anomaly localization is more distinct, and for larger defects, its detection boundary is clearer. This indicates that our model captures both local and global information and learns decision boundaries that better separate normal and abnormal samples.

Overall, our network outperforms MMR [54], with image-level AUROC improvements of 2.8 and 3.6 percentage points on AeBAD-S and MPDD, respectively. This demonstrates the strong detection capability of our method on these datasets.

5. Conclusions

In this paper, we propose a novel and effective student–teacher network for unsupervised anomaly detection. In this network, the teacher model is pretrained on a diverse set of natural images. The student network incorporates two proposed modules: the multistage pixel-reserving bridge (MPRB) and the spatial compression autoencoder (SCA). Our network is easy to train and apply in industrial scenarios. By reconstructing features from an input image and passing them through the designed MPRB and SCA, the network achieves SOTA performance in terms of image-level AUROC on the AeBAD-S, AeBAD-V, and MPDD datasets, reaching 87.5%, 78.5%, and 96.5%, respectively. Moreover, as a reconstruction-based network, it achieves competitive performance on MVTec AD. Our network provides a new approach to bridging the gap between academic research and industrial application in anomaly detection.

Author Contributions

Conceptualization, Y.Y. (Yi Yang 1), Y.Y. (Yi Yang 2) and S.Z.; methodology, Y.Y. (Yi Yang 1), Y.Y. (Yi Yang 2) and W.H.; software, Y.Y. (Yi Yang 2) and W.H.; investigation, Y.Y. (Yi Yang 1), Y.Y. (Yi Yang 2) and Y.G.; resources, X.J. and Y.Z.; data curation, X.W. and W.H.; writing—original draft preparation, Y.Y. (Yi Yang 1), Y.Y. (Yi Yang 2); writing—review and editing, S.Z., Y.Z. and X.W.; visualization, Y.Y. (Yi Yang 2), Y.G. and X.W.; supervision, X.J.; project administration, S.Z. and X.J.; funding acquisition, S.Z., Y.G. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China under Grant 61803372, the Shanghai Local Capacity Enhancement project under Grant 21010501500, and the Langfang Science and Technology Research and Development Program under Grant 2021011035.

Data Availability Statement

The experiments in this article used publicly available datasets AeBAD, MPDD, and MVTec AD.

Conflicts of Interest

Author Yadong Zhu was employed by the company Operation Support Technology Research Institute, Commercial Flying Service Company. The remaining authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

1. Jezek, S.; Jonak, M.; Burget, R.; Dvorak, P.; Skotak, M. Deep learning-based defect detection of metal parts: Evaluating current methods in complex conditions. In Proceedings of the 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Brno, Czech Republic, 25–27 October 2021; pp. 66–71.
2. Zhao, Y. Just noticeable learning for unsupervised anomaly localization and detection. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
3. Chalapathy, R.; Borzeshi, E.Z.; Piccardi, M. An investigation of recurrent neural architectures for drug name recognition. arXiv 2016, arXiv:1609.07585.
4. Chalapathy, R.; Borzeshi, E.Z.; Piccardi, M. Bidirectional LSTM-CRF for clinical concept extraction. arXiv 2016, arXiv:1611.08373.
5. Song, X.; Cao, S.; Zhang, J.; Hou, Z. Steel Surface Defect Detection Algorithm Based on YOLOv8. Electronics 2024, 13, 988.
6. Cai, Z.; Wang, T.; Han, W.; Ding, A. PGE-YOLO: A Multi-Fault-Detection Method for Transmission Lines Based on Cross-Scale Feature Fusion. Electronics 2024, 13, 2738.
7. Wu, Y.; Liao, T.; Chen, F.; Zeng, H.; Ouyang, S.; Guan, J. Overhead Power Line Damage Detection: An Innovative Approach Using Enhanced YOLOv8. Electronics 2024, 13, 739.
8. Li, J.; Pan, H.; Li, J. ESD-YOLOv5: A Full-Surface Defect Detection Network for Bearing Collars. Electronics 2023, 12, 3446.
9. Gao, R.; Cao, J.; Cao, X.; Du, J.; Xue, H.; Liang, D. Wind Turbine Gearbox Gear Surface Defect Detection Based on Multiscale Feature Reconstruction. Electronics 2023, 12, 3039.
10. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition, Virtual, 10–15 January 2021; pp. 475–489.
11. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328.
12. Tu, W.; Guan, R.; Zhou, S.; Ma, C.; Peng, X.; Cai, Z.; Liu, Z.; Cheng, J.; Liu, X. Attribute-missing graph clustering network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22–25 February 2024; Volume 38, pp. 15392–15401.
13. Xu, K.; Chen, L.; Wang, S. Data-driven kernel subspace clustering with local manifold preservation. In Proceedings of the 2022 IEEE International Conference on Data Mining Workshops (ICDMW), Orlando, FL, USA, 28 November–1 December 2022; pp. 876–884.
14. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600.
15. Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454.
16. Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv 2021, arXiv:2111.07677.
17. Pirnay, J.; Chai, K. Inpainting transformer for anomaly detection. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022; pp. 394–406.
18. Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107.
19. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024.
20. Wan, B.; Fang, Y.; Xia, X.; Mei, J. Weakly supervised video anomaly detection via center-guided discriminative learning. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
21. Gong, Y.; Wang, C.; Dai, X.; Yu, S.; Xiang, L.; Wu, J. Multi-Scale Continuity-Aware Refinement Network for Weakly Supervised Video Anomaly Detection. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
22. Liu, W.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Diversity-measurable anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12147–12156.
23. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 2019, 54, 30–44.
24. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Proceedings of the International Conference on Information Processing in Medical Imaging, Boone, NC, USA, 25–30 June 2017; pp. 146–157.
25. Baur, C.; Wiestler, B.; Albarqouni, S.; Navab, N. Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers, Part I 4; Springer: Cham, Switzerland, 2019; pp. 161–169.
26. Vasilev, A.; Golkov, V.; Meissner, M.; Lipp, I.; Sgarlata, E.; Tomassini, V.; Jones, D.K.; Cremers, D. q-Space novelty detection with variational autoencoders. In Computational Diffusion MRI; Bonet-Carne, E., Hutter, J., Palombo, M., Pizzolato, M., Sepehrband, F., Zhang, F., Eds.; Springer: Cham, Switzerland, 2020; pp. 113–124.
27. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4183–4192.
28. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
29. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
30. Liu, J.; Wang, C.; Su, H.; Du, B.; Tao, D. Multistage GAN for fabric defect detection. IEEE Trans. Image Process. 2019, 29, 3388–3400.
31. Rippel, O.; Müller, M.; Merhof, D. GAN-based defect synthesis for anomaly detection in fabrics. In Proceedings of the 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Vienna, Austria, 8–11 September 2020; Volume 1, pp. 534–540.
32. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
33. Wei, T.; Cao, D.; Jiang, X.; Zheng, C.; Liu, L. Defective samples simulation through neural style transfer for automatic surface defect segment. In Proceedings of the 2019 International Conference on Optical Instruments and Technology: Optoelectronic Measurement Technology and Systems, Beijing, China, 26–28 October 2019; Volume 11439, pp. 15–26.
34. Wei, T.; Cao, D.; Zheng, C.; Yang, Q. A simulation-based few samples learning method for surface defect segmentation. Neurocomputing 2020, 412, 461–476.
35. Zhang, G.; Cui, K.; Hung, T.Y.; Lu, S. Defect-GAN: High-fidelity defect synthesis for automated defect inspection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 2524–2534.
36. Jain, S.; Seth, G.; Paruthi, A.; Soni, U.; Kumar, G. Synthetic data augmentation for surface defect detection and classification using deep learning. J. Intell. Manuf. 2022, 33, 1007–1020.
37. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
38. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2642–2651.
39. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29.
40. Wang, R.; Hoppe, S.; Monari, E.; Huber, M.F. Defect transfer gan: Diverse defect synthesis for data augmentation. arXiv 2023, arXiv:2302.08366.
41. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797.
42. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
43. Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8330–8339.
44. Schlüter, H.M.; Tan, J.; Hou, B.; Kainz, B. Natural synthetic anomalies for self-supervised anomaly detection and localization. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–27 October 2022; pp. 474–489.
45. Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746.
46. Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Fully convolutional cross-scale-flows for image-based defect detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1088–1097.
47. Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357.
48. Rippel, O.; Mertens, P.; Merhof, D. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6726–6733.
49. Mahalanobis, P.C. On the generalized distance in statistics. Sankhyā Indian J. Stat. Ser. A 2018, 80, S1–S7.
50. Li, N.; Jiang, K.; Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Anomaly detection via self-organizing map. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 974–978.
51. Wang, G.; Han, S.; Ding, E.; Huang, D. Student-teacher feature pyramid matching for anomaly detection. arXiv 2021, arXiv:2103.04257.
52. Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14902–14912.
53. Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Asymmetric student-teacher networks for industrial anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2592–2602.
54. Zhang, Z.; Zhao, Z.; Zhang, X.; Sun, C.; Chen, X. Industrial anomaly detection with domain shift: A real-world dataset and masked multi-scale reconstruction. Comput. Ind. 2023, 151, 103990.
55. Deng, S.; Sun, Z.; Zhuang, R.; Gong, J. Noise-to-Norm Reconstruction for Industrial Anomaly Detection and Localization. Appl. Sci. 2023, 13, 12436.
56. Shi, Y.; Yang, J.; Qi, Z. Unsupervised anomaly segmentation via deep feature reconstruction. Neurocomputing 2021, 424, 9–22.
57. Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; Geng, L. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv 2022, arXiv:2210.14485.
58. Akçay, S.; Atapour-Abarghouei, A.; Breckon, T.P. Skip-ganomaly: Skip connected and adversarially trained encoder-decoder anomaly detection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
59. Liang, Y.; Zhang, J.; Zhao, S.; Wu, R.; Liu, Y.; Pan, S. Omni-frequency channel-selection representations for unsupervised anomaly detection. IEEE Trans. Image Process. 2023.
60. Zavrtanik, V.; Kristan, M.; Skočaj, D. Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 2021, 112, 107706.
61. Jiang, J.; Zhu, J.; Bilal, M.; Cui, Y.; Kumar, N.; Dou, R.; Su, F.; Xu, X. Masked swin transformer unet for industrial anomaly detection. IEEE Trans. Ind. Inform. 2022, 19, 2200–2209.
62. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
63. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
64. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
65. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1874–1883.
66. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
67. Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and checkerboard artifacts. Distill 2016, 1, e3.
68. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
69. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; pp. 234–241.
70. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
71. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
72. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
73. Tang, T.W.; Kuo, W.H.; Lan, J.H.; Ding, C.F.; Hsu, H.; Young, H.T. Anomaly detection neural network with dual auto-encoders GAN and its industrial inspection applications. Sensors 2020, 20, 3336.
74. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–27 October 2022; pp. 280–296.

Figure 1. The architecture of the proposed network integrates a vision transformer (ViT) with a student–teacher framework. The student network incorporates two innovative modules: a multistage pixel-reserving bridge (MPRB) and a spatial compression autoencoder (SCA). The teacher network is a frozen pretrained WideResNet50 model.

Figure 2. The network architecture of MPRB consists of multiple convolutional layers, batch normalization layers, and ReLU activation functions. The upsampling is performed using the pixel shuffling strategy. Feature fusion is optimized using CBAM after concatenation, and the channels are adjusted through a convolutional layer.

Figure 3. Visual comparisons of various upsampling methods. (a) Bilinear interpolation, (b) deconvolution, (c) pixel shuffling strategy, (d–f) enlarged best views after sampling by bilinear interpolation, deconvolution, and pixel shuffling strategy, respectively.

Figure 4. The network architecture of the SCA consists of two downsampled convolutional layers and multiple upsampling blocks. An upsampling block consists of a bilinear interpolation layer, a convolutional layer, a batch normalization layer, and a ReLU activation function.

Figure 5. Visualization of test results for AeBAD-S and MPDD. Input image, ground truth, MMR heatmap, and our heatmap are shown for each dataset.

Table 1. The comparisons of image-level AUROC (%) and pixel-level PRO (%) on AeBAD-S. The best results are highlighted in bold, and the second-best results are underlined.

Method	Same	Background	Illumination	View	Mean
DRAEM [43]	64.0/71.4	62.1/44.3	61.6/67.6	62.3/71.1	62.5/63.6
PatchCore [11]	75.2/89.5	74.1/89.4	74.6/88.2	60.1/84.0	71.0/87.8
RD4AD [45]	82.4/86.4	84.3/86.4	85.5/86.7	71.9/82.9	81.0/85.6
NSA [44]	66.5/43.0	48.8/29.7	55.5/59.9	55.9/51.1	56.7/45.9
RIAD [60]	38.6/71.9	41.6/33.4	46.8/65.3	33.0/62.2	40.0/58.2
InTra [17]	39.8/76.8	46.1/74.8	44.7/73.7	46.3/73.4	44.2/74.7
MMR [54]	85.6/89.6	84.4/90.1	88.8/90.2	79.9/86.3	84.7/89.1
Ours	85.9/90.5	91.6/91.2	89.9/90.9	82.6/86.8	87.5/89.9

Table 2. The comparisons of image-level AUROC (%) on AeBAD-V. The best results are highlighted in bold, and the second-best results are underlined.

Method	Video 1	Video 2	Video 3	Mean
DRAEM [43]	79.5	71.2	53.6	68.1
PatchCore [11]	71.1	86.0	55.1	70.7
RD4AD [45]	66.0	84.8	62.1	71.0
NSA [44]	59.4	72.7	61.9	64.6
RIAD [60]	78.0	47.1	43.2	56.1
InTra [17]	62.7	55.8	43.7	54.1
MMR [54]	75.7	88.3	70.7	78.2
Ours	75.9	87.9	71.8	78.5

Table 3. The comparisons of image-level AUROC (%) and pixel-level AUROC (%) on MPDD. The best results are highlighted in bold, and the second-best results are underlined.

Method	Bracket Black	Bracket Brown	Bracket White	Connector	Metal Plate	Tubes	Mean
PatchCore [11]	81.9/98.4	78.4/91.5	76.0/97.4	96.7/95.0	100.0/96.6	59.7/95.1	82.1/95.7
SPADE [47]	44.7/94.3	91.0/97.2	79.9/96.8	95.2/98.4	95.6/93.0	56.1/95.9	77.1/95.9
DAGAN [73]	68.6/89.7	77.1/81.5	72.1/70.6	99.8/85.7	85.4/90.0	32.0/82.3	72.5/83.3
PaDiM [10]	75.6/94.2	85.4/92.4	82.2/98.1	91.7/97.9	56.3/92.9	57.5/93.9	74.8/96.7
CFLOW-AD [18]	72.7/96.9	88.8/97.8	87.8/98.6	94.8/98.4	99.5/98.2	73.1/96.4	86.1/97.7
MMR [54]	79.8/97.2	93.5/97.5	88.0/99.3	100.0/99.2	100.0/99.0	96.0/99.3	92.9/98.6
NtNR [55]	93.4/99.0	93.1/93.1	89.3/97.8	100.0/99.0	99.6/98.8	94.2/99.2	94.9/97.8
Ours	84.8/98.2	99.5/97.6	96.0/99.5	100.0/99.2	100.0/98.8	98.6/99.4	96.5/98.8

Table 4. The comparisons of image-level AUROC (%) and pixel-level AUROC (%) on MVTec AD. The best results are highlighted in bold, and the second-best results are underlined.

Category	PatchCore [11]	RD4AD [45]	DRAEM [43]	NSA [44]	RIAD [60]	InTra [17]	MMR [54]	Ours
Bottle	100.0/98.6	100.0/98.7	99.2/99.1	97.7/98.3	99.9/98.4	100.0/97.1	100.0/98.3	100.0/98.1
Cable	99.5/98.5	95.0/97.4	91.8/94.7	94.5/96.0	81.9/84.2	84.2/93.2	97.8/95.4	98.7/95.8
Capsule	98.1/98.9	96.3/98.7	98.5/94.3	95.2/97.6	88.4/92.8	86.5/97.7	96.9/98.0	97.1/98.0
Carpet	98.7/99.1	98.9/98.9	97.0/95.5	95.6/95.5	84.2/96.3	98.8/99.2	99.6/98.8	99.9/99.1
Grid	98.2/98.7	100.0/99.3	99.9/99.7	99.9/99.2	99.6/98.8	100.0/99.4	100.0/99.0	100.0/99.0
Hazelnut	100.0/98.7	99.9/98.9	100.0/99.7	94.7/97.6	83.3/96.1	95.7/98.3	100.0/98.5	100.0/98.6
Leather	100.0/99.3	100.0/99.4	100.0/98.6	99.9/99.5	100.0/99.4	100.0/99.5	100.0/99.2	100.0/99.3
MetalNut	100.0/98.4	100.0/97.3	98.7/99.5	98.7/98.4	88.5/92.5	96.9/93.3	99.9/95.9	99.9/96.1
Pill	96.6/97.6	96.6/98.2	98.9/97.6	99.2/98.5	83.8/95.7	90.2/98.3	98.2/98.4	98.7/98.4
Screw	98.1/99.4	97.0/99.6	93.9/97.6	90.2/96.5	84.5/89.1	95.7/99.5	92.5/99.5	97.2/99.6
Tile	98.7/95.9	99.3/95.6	99.6/99.2	100.0/99.3	98.7/89.1	98.2/94.4	98.7/95.6	99.4/94.4
Toothbrush	100.0/95.9	99.5/99.1	100.0/98.1	100.0/94.9	100.0/98.9	99.7/99.0	100.0/98.4	100.0/98.4
Transistor	100.0/96.4	96.7/92.5	93.1/90.9	95.1/88.0	90.9/87.7	95.8/96.1	95.1/90.2	95.7/91.0
Wood	99.2/95.1	99.2/95.3	99.1/96.4	97.5/90.7	93.0/85.8	98.0/90.5	99.1/94.8	99.5/95.5
Zipper	99.4/98.9	98.5/98.2	100.0/98.8	99.8/94.2	98.1/97.8	99.4/99.2	97.6/98.0	98.2/98.0
Mean	99.1/98.9	98.5/97.8	98.0/97.3	97.2/96.3	91.7/94.2	95.9/97.0	98.4/97.2	99.0/97.3

Table 5. Ablation studies on the anomaly detection performance (image-level AUROC %) based upon the AeBAD-S and MPDD datasets to investigate the MPRB and SCA. The compared baseline (MMR [54]) is composed of ViT and a simple FPN [74].

Dataset	Method	Simple FPN	MPRB	SCA	Image AUROC
AeBAD-S	Baseline [54]	✓	✕	✕	84.7
	ViT+MPRB	✕	✓	✕	87.3 (+2.6)
	ViT+MPRB+SCA	✕	✓	✓	87.5 (+2.6+0.2)
MPDD	Baseline [54]	✓	✕	✕	92.9
	ViT+MPRB	✕	✓	✕	96.0 (+3.1)
	ViT+MPRB+SCA	✕	✓	✓	96.5 (+3.1+0.5)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
