Article

Multi-Layer Feature Restoration and Projection Model for Unsupervised Anomaly Detection

by Fuzhen Cai and Siyu Xia *

Key Laboratory of Measurement and Control of CSE, Ministry of Education, Southeast University, Nanjing 210096, China

* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2480; https://doi.org/10.3390/math12162480
Submission received: 2 July 2024 / Revised: 4 August 2024 / Accepted: 5 August 2024 / Published: 11 August 2024
(This article belongs to the Special Issue Object Detection: Algorithms, Computations and Practices)

Abstract

The anomaly detection of products is a classical problem in computer vision. Image reconstruction-based methods have shown promising results for anomaly detection. Most existing methods use convolutional neural networks (CNNs) to build encoder–decoder structures for image restoration. However, the limited receptive field of CNNs restricts the information available during restoration, and the downsampling in the encoder causes information loss; both hinder fine-grained restoration of images. To solve this problem, we propose a multi-layer feature restoration and projection model (MLFRP), which performs restoration on multi-scale feature maps through a block-level feature restoration module that fully accounts for the detail and semantic information required by the restoration process. In-depth experiments on the MVTecAD anomaly detection benchmark show that our model outperforms current state-of-the-art anomaly detection methods.

1. Introduction

The detection of anomalies in products is an indispensable part of industrial manufacturing and an important means of preventing defective products from entering the market [1]. Traditional anomaly detection that relies on human inspectors is not only inefficient but also susceptible to subjective factors, and contact with the products may cause damage to them.
With the development of machine vision, automatic anomaly detection solutions are gradually replacing manual inspection. Traditional methods usually require manually designed features, followed by machine learning algorithms that process the extracted feature vectors to obtain the desired results. However, manual feature design is not very versatile and must be adapted to each specific situation, which seriously hampers the generalization of such algorithms [2]. With the advancement of the Internet and the growth of parallel computing power, large amounts of data have accumulated, making it possible to train larger and deeper neural networks. Deep learning, which requires large amounts of data and substantial computing power, has gradually emerged. Deep learning-based anomaly detection can automatically learn anomaly features that are difficult to design manually and offers high precision and generalization in many application scenarios, such as metal surface defect detection [3], medical disease diagnosis [4], and time series anomaly detection [5].
Among deep learning-based anomaly detection methods, image restoration-based approaches show excellent detection ability. They detect anomalies by comparing the input image with its restored counterpart. Such models usually consist of a U-shaped encoder–decoder network built on convolutional neural networks (CNNs). The encoder compresses the input image into a latent space by repeated downsampling, and the decoder then generates a new image by upsampling from the feature embedding in the latent space. However, the receptive field of a CNN is limited, so the region the restoration model can see is limited, which greatly restricts its restoration ability [6]. Some methods replace the CNN with a Swin Transformer [7] to obtain a larger receptive field, but the receptive field of the window-attention-based Swin Transformer is still confined.
To solve the above problems, we propose a novel image restoration network, the multi-layer feature restoration and projection model (MLFRP), which combines a Transformer and a CNN to address the insufficient receptive field of CNNs. The network exploits the Transformer's ability to encode global information to enhance restoration performance and overcome the CNN's need for repeated downsampling to encode long-range information. Our method corrupts an image by applying a random mask to a normal sample and then extracts multi-layer features of both the normal and the corrupted image with a pre-trained convolutional neural network [8]. Feature restoration is carried out independently on each layer of feature maps by a Transformer-based self-attention network. Although performing feature restoration directly on shallow, high-resolution feature maps can effectively reduce the information loss caused by downsampling, self-attention on high-resolution maps is computationally expensive. To reduce this cost, MLFRP performs restoration on block-level features, taking blocks rather than pixels as the basic units. This not only reduces the amount of computation but also makes the restoration network easier to train. To give each layer of feature maps a rich information representation, MLFRP also fuses features across layers; this fusion diversifies the information in each layer, enlarges the receptive field of shallow feature maps, and enhances restoration performance.

2. Related Works

In this section, we review common image anomaly detection methods and point out the similarities and differences between our approach and these methods. The main families are feature embedding-based and reconstruction-based methods, which we describe separately below.

2.1. Feature-Based Anomaly Detection

Feature-based approaches can be divided into two main categories: single-stream models and dual-stream models. Single-stream methods usually build a feature pool to store normal patterns, and in the inference stage the distance between the features of the input image and the features in the pool serves as the anomaly metric [9,10,11,12,13]. These methods are time-consuming because they must perform retrieval in the feature pool. In contrast, in dual-stream networks, anomalies can be measured directly by the difference between the activation layers of two networks, which saves considerable retrieval time. Knowledge distillation in dual-stream networks is a promising direction in anomaly detection and has been extensively explored. US [14] captures anomalies at multiple scales by training multiple randomly initialized student networks on normal samples at different scales, while MKD [15] introduces intermediate-layer knowledge distillation and uses a smaller-capacity student network to prevent noise interference. Reverse Distillation [16] solves the non-distinguishing filter problem by reversing the direction of data flow in the student network with the help of the UNet [17] structure. RD++ [18] proposes a self-supervised optimal transport training method to obtain compact features; additionally, by adding simplex noise to the image and reconstructing the original, it forces the network to learn to remove noise from the input data. Remembering Normality [19] introduces a memory module into the knowledge distillation framework: the teacher network's knowledge of normal images is stored in the memory module, and in the inference phase the student network only needs to obtain the features of normal samples from it. This greatly increases the difference between the student and teacher networks in abnormal regions and substantially improves anomaly detection performance.

2.2. Reconstruction-Based Anomaly Detection

Image reconstruction is a well-known class of anomaly detection methods, including GAN [20], VAE [21], and their variants. These methods assume that a reconstruction model trained on normal samples can reconstruct only normal regions, not anomalous ones, and perform anomaly detection based on reconstruction residuals. However, deep neural networks have a certain generalization ability, and subtle anomalous regions may have representations similar to normal regions, so anomalous patterns never seen during training may be partially or completely reconstructed. To alleviate this problem, MemAE [22] inserts a memory module between the encoder and decoder that selectively stores the feature embeddings of normal samples; in the inference phase, the embeddings of the test image are replaced in the decoder by a combination of the stored vectors weighted by their similarity, suppressing the model's generalizability. RIAD [23] draws on self-supervised learning and transforms image reconstruction into image restoration by masking out part of the image, forcing the network to learn a more robust representation of normal patterns by filling the masked regions with semantically reasonable content. DRAEM [24], on the other hand, starts from the evaluation metric, using pseudo-anomalies and an additional network to learn a similarity metric between images. However, methods that use CNNs for image reconstruction face the problem of an insufficient receptive field. To address this, MSTUnet [25] replaces the traditional CNN with a Swin Transformer to enlarge the receptive field and improve reconstruction performance. In addition to image-level reconstruction, MLDFR [6] proposes feature-level reconstruction that considers both global and local information, greatly improving anomaly detection performance. Our proposed MLFRP is also based on feature reconstruction but differs from previous work in the reconstruction method.

3. Method

We propose an image restoration network combining a CNN and a Transformer for unsupervised anomaly detection. The network includes a random mask generation module, a feature fusion module for fusing multi-scale feature maps, and a Transformer-based feature restoration module.

3.1. Random Mask Generation

Assume that anomaly-free samples $D_t = \{I_1^t, I_2^t, \ldots, I_m^t\}$ are available for training and that a dataset containing both normal and anomalous samples, $D_q = \{I_1^q, I_2^q, \ldots, I_n^q\}$, is used for testing. We first scale all images in $D_t$ to a fixed size of $H_t \times W_t$ and generate a single-channel image of the same size whose pixel values are all 1. We choose the size of the masked blocks $k$ and the number of masked blocks $n$, randomly select $n$ non-overlapping regions of size $k \times k$ in the generated single-channel image, and set them to 0 to obtain the 2-D mask $M_g$, as shown in Figure 1. For a normal image $I_n \in D_t$, we multiply it with the 2-D mask at the pixel level to obtain the artificially corrupted image $I_d = I_n \odot M_g$, where $\odot$ denotes pixel-wise multiplication.
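To make the masking step concrete, the following sketch (an illustrative PyTorch version, not the authors' code; sampling block positions on a $k$-aligned grid is one simple way to guarantee non-overlap) generates the 2-D mask $M_g$ and the corrupted image $I_d$:

```python
import torch

def generate_mask(h: int, w: int, k: int = 16, n: int = 80) -> torch.Tensor:
    # Start from an all-ones single-channel map and zero out n non-overlapping
    # k x k blocks; drawing positions from a k-aligned grid guarantees that
    # the chosen blocks never overlap.
    mask = torch.ones(1, h, w)
    cells = [(i, j) for i in range(0, h - k + 1, k)
                    for j in range(0, w - k + 1, k)]
    for idx in torch.randperm(len(cells))[:n].tolist():
        i, j = cells[idx]
        mask[:, i:i + k, j:j + k] = 0.0
    return mask

I_normal = torch.rand(3, 256, 256)   # stand-in for a normal training image
M_g = generate_mask(256, 256)        # k = 16, n = 80, as in Section 4.1.2
I_d = I_normal * M_g                 # corrupted image: I_d = I_n ⊙ M_g
```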
Then, the normal image and the corrupted image are fed simultaneously into the pre-trained convolutional neural network to extract multi-level features. The dark upper half of Figure 1 extracts the multi-level features of the corrupted image, and the light lower half those of the normal image. Let $\Phi$ denote the pre-trained feature extraction network, and let $F_d^l$ and $F_n^l$ denote the feature maps of the $l$th layer for the corrupted and normal images, respectively. The extraction of multi-level features can then be expressed as follows:
$$F_d^l = \Phi^l(I_d), \qquad F_n^l = \Phi^l(I_n)$$
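The multi-level extraction can be sketched with torchvision's feature-extraction utility; the layer names below assume the ResNet34 backbone used in Section 4.1.2, and `generate_mask` is the sketch above:

```python
import torch
from torchvision.models import resnet34, ResNet34_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Frozen pre-trained backbone; its first three residual stages provide the
# multi-level features for L = 3 layers, matching Section 4.1.2.
backbone = resnet34(weights=ResNet34_Weights.DEFAULT).eval()
extractor = create_feature_extractor(
    backbone, return_nodes={"layer1": "F1", "layer2": "F2", "layer3": "F3"})

I_n = torch.rand(1, 3, 256, 256)        # normal image (batched)
I_d = I_n * generate_mask(256, 256)     # corrupted image

with torch.no_grad():
    F_n = extractor(I_n)   # {"F1": (1,64,64,64), "F2": (1,128,32,32), "F3": (1,256,16,16)}
    F_d = extractor(I_d)
```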

3.2. Multi-Scale Feature Fusion

To give the feature maps of the corrupted image at different layers a rich information representation, so that the feature restoration model can restore the corrupted region more easily and accurately, we propose a multi-scale feature fusion module that fuses information between layers and ensures that the feature maps of every layer carry a rich representation.
Taking the fusion of three layers of feature maps as an example, Figure 2 shows a schematic of the feature fusion process. The left (a), middle (b), and right (c) subimages represent the fusion of high-, medium-, and low-resolution feature maps, respectively; the feature maps are fused additively. To allow feature maps of different scales to be fused in an additive manner, the corresponding maps are downsampled or upsampled so that they have the same dimensions. Downsampling uses a convolutional layer (Conv) cascaded with a batch normalization layer and a ReLU activation, while upsampling uses a transposed convolutional layer (ConvTrans), likewise cascaded with batch normalization and ReLU.
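A minimal sketch of the additive fusion at the middle scale (Figure 2b), assuming the ResNet34 channel widths of 64/128/256 from the setup in Section 4.1.2; the kernel sizes are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def down(c_in: int, c_out: int) -> nn.Module:
    # Stride-2 Conv + BN + ReLU: brings a higher-resolution map down one scale.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def up(c_in: int, c_out: int) -> nn.Module:
    # Stride-2 ConvTrans + BN + ReLU: brings a lower-resolution map up one scale.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class FuseMiddle(nn.Module):
    """Additive fusion at the middle scale: F1 is downsampled, F3 is
    upsampled, and both are added to F2."""
    def __init__(self):
        super().__init__()
        self.d1 = down(64, 128)   # F1: (64, H/4, W/4)   -> (128, H/8, W/8)
        self.u3 = up(256, 128)    # F3: (256, H/16, W/16) -> (128, H/8, W/8)

    def forward(self, f1, f2, f3):
        return f2 + self.d1(f1) + self.u3(f3)
```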

3.3. Block-Wise Feature Restoration

Our method detects anomalies by comparing the difference between the input image and the repaired image on multi-layer feature maps, so the restoration of the corrupted region is performed on the feature maps. However, traditional CNN-based feature restoration networks may suffer from insufficient receptive fields, which limits their ability to repair corrupted regions.
To solve this problem, we use self-attention to restore the corrupted regions. Self-attention has a global receptive field, can capture long-range dependencies between different locations in the image, and obtains more comprehensive contextual information. The restoration network can thus understand the context of the corrupted region more fully and generate more accurate results. Feature restoration is performed individually on each layer of the feature map after feature fusion. Specifically, for a feature map $F_n^l$ of size $C_l \times H_l \times W_l$, where $C_l$, $H_l$, and $W_l$ denote the number of channels, height, and width of the $l$th layer, we first convert it into a sequence of length $H_l \times W_l$ and dimension $C_l$. The sequence is fed into a self-attention-based feature restoration module to obtain the repaired sequence, which is finally reshaped back into a feature map of size $C_l \times H_l \times W_l$ with the corrupted region restored. The design of the feature restoration module follows [26] but removes the class token, since it contributes nothing to repairing the damaged regions.
Although performing feature restoration directly on shallow, high-resolution feature maps can effectively reduce the information loss caused by downsampling, the computational cost of self-attention on high-resolution maps is large. To solve this problem, we use block-level features for restoration, taking image blocks within the feature map, rather than individual pixels, as the basic units. Specifically, for a feature map $F_n^l$ of size $C_l \times H_l \times W_l$, if each sub-feature map has side length $p_l$, then $F_n^l$ is divided into $(H_l/p_l) \times (W_l/p_l)$ sub-feature maps of size $C_l \times p_l \times p_l$. These are transformed into a sequence whose length shrinks from $H_l \times W_l$ to $(H_l/p_l) \times (W_l/p_l)$, effectively reducing the computation. At the same time, to prevent information loss, the dimension of each sequence element grows from $C_l$ to $C_l \times p_l \times p_l$, so the total number of elements, $(H_l/p_l) \times (W_l/p_l) \times C_l \times p_l \times p_l = C_l \times H_l \times W_l$, equals the number of elements in the original feature map, effectively avoiding any loss of information.
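The block-to-token conversion and its inverse can be sketched as follows; the plain `nn.TransformerEncoder` stands in for the CaiT-style module of [26], and the depth, head count, and positional-embedding omission are illustrative simplifications:

```python
import torch
import torch.nn as nn

class BlockRestorer(nn.Module):
    """Split a C x H x W feature map into (H/p) x (W/p) blocks, flatten each
    block into a token of dimension C*p*p, run a class-token-free Transformer
    encoder over the tokens, and fold them back into a feature map.
    Positional embeddings are omitted for brevity."""
    def __init__(self, c: int, p: int, depth: int = 12, heads: int = 8):
        super().__init__()
        d = c * p * p   # token dimension; must be divisible by heads
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.p = p

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C, H, W)
        b, c, h, w = f.shape
        p = self.p
        # (B, C, H, W) -> (B, (H/p)*(W/p), C*p*p): blocks become tokens.
        tokens = (f.reshape(b, c, h // p, p, w // p, p)
                   .permute(0, 2, 4, 1, 3, 5)
                   .reshape(b, (h // p) * (w // p), c * p * p))
        tokens = self.encoder(tokens)
        # Inverse reshape: tokens back to a (B, C, H, W) feature map.
        return (tokens.reshape(b, h // p, w // p, c, p, p)
                      .permute(0, 3, 1, 4, 2, 5)
                      .reshape(b, c, h, w))

restorer = BlockRestorer(c=64, p=2)   # shallow layer: p_1 = 2 (Section 4.1.2)
restored = restorer(torch.rand(1, 64, 64, 64))
```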

3.4. Loss Function and Anomaly Location

After feeding a normal image into the MLFRP network, we obtain the feature map $F_n^l$ of the normal image and the map $F_r^l$ with the corrupted areas repaired, where $F_n^l, F_r^l \in \mathbb{R}^{C_l \times H_l \times W_l}$. In the training phase, since the purpose is to repair the feature maps of the corrupted images, the model minimizes the difference between the feature maps of the normal images and the corresponding repaired layers. The loss function considers both the mean square error and the cosine similarity between the two feature maps.
For the cosine similarity loss, the channel-wise feature vectors $C_{(i,j)}(F_n^l)$ and $C_{(i,j)}(F_r^l)$ of the feature maps $F_n^l$ and $F_r^l$ are first extracted, where $i \in \{1, \ldots, H_l\}$, $j \in \{1, \ldots, W_l\}$, and $C_{(i,j)}$ denotes the feature vector of the feature map at position $(i,j)$. We then compute the cosine similarity between the two vectors at each position, yielding a map of size $H_l \times W_l$ in which each entry is the cosine similarity between the corresponding channel vectors of $F_n^l$ and $F_r^l$. The cosine similarity loss between $F_n^l$ and $F_r^l$ is obtained by subtracting each value in this map from 1 and averaging the result, which can be expressed as follows:
$$\mathcal{L}_{cos}^{l} = \frac{1}{H_l \times W_l} \sum_{i}^{H_l} \sum_{j}^{W_l} \left( 1 - \frac{C_{(i,j)}(F_n^l)^{T} \cdot C_{(i,j)}(F_r^l)}{\left\| C_{(i,j)}(F_n^l) \right\| \left\| C_{(i,j)}(F_r^l) \right\|} \right)$$
For the mean square error loss, the corresponding elements of $F_n^l$ and $F_r^l$ are first subtracted to obtain the difference feature map $F_s^l \in \mathbb{R}^{C_l \times H_l \times W_l}$; each value in this map is squared, and all values are averaged to obtain a scalar characterizing the positionwise difference between the two feature maps. Since it is very difficult to make the absolute difference between the normal and repaired feature maps identical at every position, the mean square error term is given a very small weight in the loss function and serves only to assist training. The mean square error is given below,
$$\mathcal{L}_{mse}^{l} = \frac{1}{H_l \times W_l \times C_l} \sum_{i}^{H_l} \sum_{j}^{W_l} \sum_{k}^{C_l} \left( F_n^l(i,j,k) - F_r^l(i,j,k) \right)^2$$
where $i$, $j$, and $k$ index the height, width, and channel positions of the feature map, respectively. The loss function for the $l$th layer is obtained by summing the two losses with a weighting factor,
$$\mathcal{L}^{l} = \mathcal{L}_{cos}^{l} + \lambda \mathcal{L}_{mse}^{l}$$
where $\lambda$ is the weight of the mean square error loss within the overall loss function of the model.
The overall loss function of the model is
$$\mathcal{L}_{total} = \sum_{l=1}^{L} \mathcal{L}^{l}$$
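Under these definitions, the per-layer and total losses reduce to a few lines; this is an illustrative PyTorch sketch, with $\lambda = 0.001$ following Section 4.1.2:

```python
import torch
import torch.nn.functional as F

def layer_loss(f_n: torch.Tensor, f_r: torch.Tensor, lam: float = 0.001) -> torch.Tensor:
    # Cosine term: 1 - channel-wise cosine similarity, averaged over positions.
    l_cos = (1.0 - F.cosine_similarity(f_n, f_r, dim=1)).mean()
    # MSE term, down-weighted by lambda so it only assists training.
    l_mse = F.mse_loss(f_n, f_r)
    return l_cos + lam * l_mse

def total_loss(feats_n, feats_r, lam: float = 0.001) -> torch.Tensor:
    # Sum the per-layer losses over the L layers of fused feature maps.
    return sum(layer_loss(a, b, lam) for a, b in zip(feats_n, feats_r))
```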
In the testing phase, for an input image $I \in D_q$, the model extracts its original multi-layer features $F_n^l$ as well as the features $F_r^l$ generated by the image restoration network. The feature difference map between the two feature maps is computed with cosine similarity, analogously to the cosine similarity loss but without the averaging operation. The feature difference map is obtained as follows,
$$A^{l}(i,j) = 1 - \frac{C_{(i,j)}(F_n^l)^{T} \cdot C_{(i,j)}(F_r^l)}{\left\| C_{(i,j)}(F_n^l) \right\| \left\| C_{(i,j)}(F_r^l) \right\|}$$
where $A^l(i,j)$ denotes the difference between the two feature maps at position $(i,j)$ in the $l$th layer, and $A^l$ denotes the feature difference map of the $l$th layer. The feature difference maps of all layers are upsampled to the original image size and summed at corresponding positions to obtain the overall difference map $A$ of the model, which serves as the anomaly score map.
$$A(i,j) = \sum_{l=1}^{L} \mathrm{Upsample}(A^{l})(i,j)$$
where $A(i,j)$ denotes the value of the anomaly score map at position $(i,j)$; the larger $A(i,j)$, the greater the likelihood that image $I$ is anomalous at position $(i,j)$.
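The inference-time score map then follows directly; this is a sketch, and the bilinear interpolation mode is our assumption, as the paper does not specify it:

```python
import torch
import torch.nn.functional as F

def anomaly_map(feats_n, feats_r, out_size=(256, 256)) -> torch.Tensor:
    # Per-layer difference map A^l = 1 - cosine similarity, upsampled to the
    # input resolution and summed over layers; larger values = more anomalous.
    score = torch.zeros(feats_n[0].shape[0], 1, *out_size)
    for f_n, f_r in zip(feats_n, feats_r):
        a_l = 1.0 - F.cosine_similarity(f_n, f_r, dim=1)   # (B, H_l, W_l)
        score += F.interpolate(a_l.unsqueeze(1), size=out_size,
                               mode="bilinear", align_corners=False)
    return score.squeeze(1)
```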

4. Experimental Results

In this section, we conduct extensive experiments on the anomaly detection benchmark datasets MVTec and BTAD to demonstrate the effectiveness of our method. We compare our results with state-of-the-art methods and show that our method achieves the best detection results. Finally, ablation experiments on the MVTec dataset investigate the effect of different module choices.

4.1. Setup

4.1.1. Datasets

We test our proposed method on the MVTecAD [27] and BTAD [28] datasets. The MVTecAD dataset contains more than 5000 high-resolution images in 5 texture classes and 10 object classes, for a total of 15 sub-datasets, each with its own characteristics and its own training and test sets. The varying resolution of the sample images and the large variation in anomaly size make the anomaly detection task challenging. The area under the per-region overlap curve (AUPRO) is used as the evaluation metric for anomaly localization.
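A rough sketch of how AUPRO can be computed (our paraphrase of the metric, not the official evaluation code; `skimage` and `sklearn` availability and the 0.3 FPR cutoff common in the literature are assumptions):

```python
import numpy as np
from skimage.measure import label
from sklearn.metrics import auc

def aupro(scores: np.ndarray, masks: np.ndarray,
          max_fpr: float = 0.3, steps: int = 100) -> float:
    # scores, masks: (N, H, W) anomaly score maps and binary ground truth.
    # Pre-compute the connected anomalous regions of every ground-truth mask.
    regions = []
    for i, m in enumerate(masks):
        lab = label(m)
        regions += [(i, lab == r) for r in range(1, lab.max() + 1)]
    normal = masks == 0
    fprs, pros = [], []
    for t in np.linspace(scores.max(), scores.min(), steps):
        binary = scores >= t
        # False-positive rate over all normal pixels.
        fprs.append(np.logical_and(binary, normal).sum() / normal.sum())
        # Mean overlap of the binarized prediction with each region (PRO).
        pros.append(np.mean([np.logical_and(binary[i], reg).sum() / reg.sum()
                             for i, reg in regions]))
    fprs, pros = np.array(fprs), np.array(pros)
    keep = fprs <= max_fpr
    return auc(fprs[keep], pros[keep]) / max_fpr   # normalize to [0, 1]
```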

4.1.2. Implementation

The feature extraction network is a pre-trained ResNet34. We use the first three layers of feature maps for feature restoration, i.e., $L = 3$. The sub-feature map sizes for each layer are 2, 1, and 1 from shallow to deep, i.e., $p_1 = 2$, $p_2 = 1$, and $p_3 = 1$. The size and number of mask blocks in the corrupted image generation module are set to 16 and 80, respectively, and $\lambda$ in the loss function is set to 0.001. To speed up training and testing, all images are scaled to $256 \times 256$. The Adam optimizer [29] with $\beta = (0.5, 0.999)$ is used to train our model, with the learning rate fixed at 0.001 and a batch size of 8; the model is trained for 200 epochs.
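For reference, the optimizer setup described above looks as follows in PyTorch; the `model` here is a placeholder, since the real network combines the extraction, fusion, and restoration modules of Section 3:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the full MLFRP network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.999))
# Training schedule from Section 4.1.2: 200 epochs, batch size 8,
# all inputs resized to 256 x 256 before entering the network.
```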

4.2. Quantitative and Qualitative Results

To evaluate the effectiveness of the proposed method, we compare MLFRP with several classical and state-of-the-art methods, namely SSIM-AD [30], DRAEM [24], FAIR [31], THFR [32], and MLDFR [6]. SSIM-AD helps a UNet learn the structural feature distribution of normal images by introducing an SSIM structural similarity loss. DRAEM uses synthetic anomalies so that the model learns to distinguish between normal and anomalous feature embeddings. FAIR exploits the frequency difference between normal and reconstructed anomalous images to balance reconstruction fidelity in normal regions against distinguishability in anomalous regions. THFR uses both a global bottleneck and local bottlenecks to filter abnormal features and facilitate the reconstruction of normal features by the decoder, and additionally introduces a template pool of normal images to compensate for the loss of normal features caused by feature filtering. MLDFR extracts feature maps with a pre-trained CNN and a pre-trained ViT simultaneously, so that the extracted maps contain both the local features captured by the CNN and the global context captured by the ViT, which is expected to enhance the ability of the feature restoration network to repair corrupted regions.
Table 1 gives the pixel-level AUPRO results on the MVTecAD dataset for the above methods and for MLFRP, the multi-layer feature restoration and projection network proposed in this paper; the best result for each category is shown in bold. A result of "-" means that the paper corresponding to that method does not report values for those subcategories. The results show that our proposed method achieves SOTA anomaly detection results on the object classes of MVTecAD and the best average localization accuracy of 95.9% across its 15 subclasses, demonstrating the generality of the method for anomaly detection tasks.
Figure 3 shows the anomaly detection results of MLFRP on a subset of the MVTecAD dataset. The anomalous regions predicted by our method agree closely with the ground-truth anomalous regions, and the model does not incorrectly flag normal regions as anomalous, which intuitively demonstrates the effectiveness of our method.
To verify the generalizability of the proposed anomaly detection method, we additionally report the pixel-level AUPRO metrics of several classical unsupervised anomaly detection methods on the BTAD dataset. Since the latest unsupervised methods do not report metrics on BTAD, we compare only recent open-source methods, including MSFlow [33], FAIR [31], and MMR [34]. The results are presented in Table 2. The proposed MLFRP method achieves the best pixel-level AUPRO on the BTAD dataset, which confirms its ability to generalize.

4.3. Ablation Analysis

To further evaluate the proposed MLFRP algorithm, we investigate the impact of several parameters and of the network structure on the anomaly detection results. The ablation experiments use the average AUPRO over the 15 categories of the MVTecAD dataset as the evaluation metric. The parameters investigated include the number of mask blocks $n$, the size of mask blocks $k$, the number of Transformer encoder blocks $N$, and the sub-feature map sizes $p_l$.
Table 3 shows how different values of these parameters affect the final performance of the model: the first column lists the parameter under study, the second its value, and the third the corresponding average AUPRO score on the MVTecAD dataset. When investigating the effect of $k$ on model performance, we first fix $n = 80$ and $N = 12$; $n = 80$ follows the setting in RIAD [23], and $N = 12$ follows the setting in CaiT [26]. The first part of the table shows that the model achieves the best pixel-level AUPRO when $k = 16$. Next, the effect of $n$ is investigated with $k = 16$ and $N = 12$ fixed; the second part of the table shows that the model performs best when $n = 80$. Similarly, fixing $k = 16$ and $n = 80$, the model achieves the optimal pixel-level AUPRO when $N = 12$.
The sub-feature map sizes $p_l$ produced by dividing each feature map into blocks also affect model performance; the fourth part of Table 3 reports the pixel-level AUPRO at different values for the different layers. The triple $(x_1, x_2, x_3)$ in the value column gives the sub-feature map sizes obtained by dividing the large-, medium-, and small-scale feature maps, respectively. The table shows that the model achieves the optimal pixel-level AUPRO when the sizes for the three scales are set to 2, 1, and 1, respectively.
The feature fusion module supplies each layer's feature maps with rich information, which makes it easier for the subsequent restoration module to repair corrupted regions. To prove its effectiveness, we conducted an ablation experiment, with results shown in Table 4; "with FFN" indicates that the proposed feature fusion module is used, and "without FFN" that it is not. The results show that the feature fusion module significantly improves anomaly localization performance.
To verify that the proposed block-level feature restoration module not only reduces the computation of self-attention but also significantly improves anomaly localization, we additionally investigated alternative methods for reducing the cost of self-attention, including Swin Transformer (ST) [35], External Attention (EA) [36], and Deformable Attention (DA) [37]. Swin Transformer replaces global attention with attention within local windows and uses shifted windows so that attention is not confined to a single window. External Attention implicitly models the relationships between different vectors with an external memory unit and limits its length to reduce computation. Deformable Attention increases the representational power of the attention map by moving keys and values to important positions via a set of offsets learned from the input image. The ablation results are shown in Table 5, where ST, EA, and DA denote replacing the self-attention mechanism in the ViT with window attention, external attention, and deformable attention, respectively. The pixel-level AUPRO metrics in the table show that the proposed block-level feature restoration method significantly improves anomaly localization performance.

5. Conclusions

We propose a novel MLFRP model for unsupervised anomaly detection. Compared with previous reconstruction-based methods, our method performs image restoration on the multi-scale feature maps extracted by a pre-trained neural network, which effectively reduces the information loss of U-shaped image restoration networks. At the same time, it uses self-attention, which can attend to global information at once, to restore the anomalous regions, effectively solving the problem caused by the insufficient receptive field of CNN-based restoration networks. The experimental results show that our method exceeds current SOTA anomaly detection methods. However, although repairing the feature maps of corrupted regions directly with a Transformer fully accounts for global information, its computational and memory costs must be considered: the proposed model achieves the best anomaly detection results but sacrifices some computational efficiency. Existing methods for reducing the computational cost of self-attention usually require large amounts of training data, which is not applicable to unsupervised anomaly detection; designing a Transformer structure that is computationally efficient without relying on large amounts of data therefore remains to be explored.

Author Contributions

Writing—original draft, F.C.; Writing—review and editing, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by ZTE Industry-University-Institute Cooperation Funds under Grant No. HC-CN-20221107001.

Data Availability Statement

The data presented in this study are openly available at https://www.mvtec.com/company/research/datasets/mvtec-ad.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, W.; Zhang, H.; Wang, G.; Xiong, G.; Zhao, M.; Li, G.; Li, R. Deep learning based online metallic surface defect detection method for wire and arc additive manufacturing. Robot. Comput.-Integr. Manuf. 2023, 80, 102470. [Google Scholar] [CrossRef]
  2. Liu, J.; Xie, G.; Wang, J.; Li, S.; Wang, C.; Zheng, F.; Jin, Y. Deep industrial image anomaly detection: A survey. Mach. Intell. Res. 2024, 21, 104–135. [Google Scholar] [CrossRef]
  3. Shao, L.; Zhang, E.; Duan, J.; Ma, Q. Enriched multi-scale cascade pyramid features and guided context attention network for industrial surface defect detection. Eng. Appl. Artif. Intell. 2023, 123, 106369. [Google Scholar] [CrossRef]
  4. Bao, J.; Sun, H.; Deng, H.; He, Y.; Zhang, Z.; Li, X. Bmad: Benchmarks for medical anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 4042–4053. [Google Scholar]
  5. Song, J.; Kim, K.; Oh, J.; Cho, S. Memto: Memory-guided transformer for multivariate time series anomaly detection. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  6. Guo, Y.; Jiang, M.; Huang, Q.; Cheng, Y.; Gong, J. Mldfr: A multilevel features restoration method based on damaged images for anomaly detection and localization. IEEE Trans. Ind. Inform. 2023, 20, 2477–2486. [Google Scholar] [CrossRef]
  7. Ma, M.; Han, L.; Zhou, C. Research and application of transformer based anomaly detection model: A literature review. arXiv 2024, arXiv:2402.08975. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778. [Google Scholar]
  9. Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
  10. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489. [Google Scholar]
  11. Li, C.-L.; Sohn, K.; Yoon, J.; Pfister, T. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 9664–9674. [Google Scholar]
  12. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 14318–14328. [Google Scholar]
  13. Zhang, X.; Xu, M.; Zhou, X. Realnet: A feature selection network with realistic synthetic anomaly for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16699–16708. [Google Scholar]
  14. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4183–4192. [Google Scholar]
  15. Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14902–14912. [Google Scholar]
  16. Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 9737–9746. [Google Scholar]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  18. Tien, T.D.; Nguyen, A.T.; Tran, N.H.; Huy, T.D.; Duong, S.; Nguyen, C.D.T.; Truong, S.Q. Revisiting reverse distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 24511–24520. [Google Scholar]
  19. Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; Ma, L. Remembering normality: Memory-guided knowledge distillation for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16401–16409. [Google Scholar]
  20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  21. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  22. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]
  23. Zavrtanik, V.; Kristan, M.; Skočaj, D. Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 2021, 112, 107706. [Google Scholar] [CrossRef]
  24. Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 8330–8339. [Google Scholar]
  25. Jiang, J.; Zhu, J.; Bilal, M.; Cui, Y.; Kumar, N.; Dou, R.; Su, F.; Xu, X. Masked swin transformer unet for industrial anomaly detection. IEEE Trans. Ind. Inform. 2022, 19, 2200–2209. [Google Scholar] [CrossRef]
  26. Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 32–42. [Google Scholar]
  27. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Mvtec ad–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9592–9600. [Google Scholar]
  28. Mishra, P.; Verk, R.; Fornasier, D.; Piciarelli, C.; Foresti, G.L. Vt-adl: A vision transformer network for image anomaly detection and localization. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  30. Bergmann, P.; Löwe, S.; Fauser, M.; Sattlegger, D.; Steger, C. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv 2018, arXiv:1807.02011. [Google Scholar]
  31. Liu, T.; Li, B.; Du, X.; Jiang, B.; Geng, L.; Wang, F.; Zhao, Z. Fair: Frequency-aware image restoration for industrial visual anomaly detection. arXiv 2023, arXiv:2309.07068. [Google Scholar]
  32. Guo, H.; Ren, L.; Fu, J.; Wang, Y.; Zhang, Z.; Lan, C.; Wang, H.; Hou, X. Template-guided hierarchical feature restoration for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6447–6458. [Google Scholar]
  33. Zhou, Y.; Xu, X.; Song, J.; Shen, F.; Shen, H.T. Msflow: Multiscale flow-based framework for unsupervised anomaly detection. IEEE Trans. Neural Netw. Learn. Syst. 2024. [CrossRef]
  34. Zhang, Z.; Zhao, Z.; Zhang, X.; Sun, C.; Chen, X. Industrial anomaly detection with domain shift: A real-world dataset and masked multi-scale reconstruction. Comput. Ind. 2023, 151, 103990. [Google Scholar] [CrossRef]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Guo, M.-H.; Liu, Z.-N.; Mu, T.-J.; Hu, S.-M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447. [Google Scholar] [CrossRef]
  37. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4794–4803. [Google Scholar]
Figure 1. Overview of our proposed MLFRP method. Normal and damaged images are both fed into a pre-trained network to obtain multi-layer feature maps. After feature fusion and feature restoration, the anomaly score map is obtained by integrating the feature difference maps of the different layers. In the training phase, the cosine similarity and mean square error between the multi-layer feature maps of the normal and corrupted images are computed, and their weighted sum is used as the loss function for training the network.
Figure 2. Overview of our proposed multi-scale feature fusion module. Taking the fusion of three layers of feature maps as an example, (a–c) denote the fusion of high-, medium-, and low-resolution feature maps, respectively.
Figure 3. Visualization of multi-scale anomalies. From top to bottom: grid, tile, bottle, capsule, pill. (a) The original image; (b) the anomaly label; (c) the heat map of the anomaly locations predicted by the model; and (d) the heat map overlaid on the original image.
Table 1. AUPRO results on MVTecAD. The best result for each category is shown in bold.

|          | Category/Method | SSIM-AD | DRAEM | FAIR | THFR | MLDFR | Ours |
|----------|-----------------|---------|-------|------|------|-------|------|
| Textures | Carpet          | 92.2    | 93.5  | 98.4 | -    | **98.5** | 97.4 |
|          | Grid            | 91.4    | 96.5  | 97.7 | -    | **97.9** | 96.2 |
|          | Leather         | 87.1    | 94.6  | 98.9 | -    | **99.4** | 94.4 |
|          | Tile            | 75.8    | 94.2  | 95.4 | -    | 96.1  | **97.9** |
|          | Wood            | 79.7    | 90.8  | 94.2 | -    | **94.6** | 94.2 |
|          | Average         | 85.2    | 93.9  | 96.9 | -    | **97.3** | 96.0 |
| Objects  | Bottle          | 92.4    | 93.5  | 94.1 | -    | 96.8  | **97.5** |
|          | Cable           | 81.2    | 88.8  | 92.3 | -    | 93.5  | **99.3** |
|          | Capsule         | 78.0    | 92.3  | 83.6 | -    | **96.3** | 94.8 |
|          | Hazelnut        | 82.7    | 93.8  | 96.0 | -    | 96.1  | **97.8** |
|          | Metal Nut       | 85.1    | 89.5  | 88.7 | -    | 92.5  | **97.1** |
|          | Pill            | 78.9    | 90.7  | 95.0 | -    | **96.9** | 90.3 |
|          | Screw           | 83.1    | 87.3  | 93.2 | -    | **98.8** | 95.0 |
|          | Toothbrush      | 91.2    | 91.2  | 94.6 | -    | 92.3  | **94.9** |
|          | Transistor      | 84.6    | 88.6  | 90.0 | -    | 87.6  | **95.9** |
|          | Zipper          | 78.8    | 85.2  | **98.1** | - | 97.5  | 95.1 |
|          | Average         | 83.2    | 90.1  | 92.6 | -    | 94.8  | **95.8** |
|          | Total Average   | 84.1    | 91.4  | 94.0 | 95.0 | 95.7  | **95.9** |
Table 2. AUPRO results on BTAD.

| Category | MSFlow | FAIR | MMR  | Ours |
|----------|--------|------|------|------|
| 01       | 83.8   | 81.6 | 76.2 | 87.1 |
| 02       | 59.2   | 58.2 | 57.5 | 62.9 |
| 03       | 98.9   | 98.8 | 97.1 | 98.8 |
| Average  | 80.6   | 79.5 | 76.9 | 83.0 |
Table 3. Ablation on parameters.

| Parameter | Value   | AUPRO Result |
|-----------|---------|--------------|
| k         | 8       | 92.6 |
|           | 16      | 95.9 |
|           | 32      | 95.3 |
|           | 64      | 93.2 |
| n         | 60      | 95.3 |
|           | 80      | 95.9 |
|           | 100     | 95.7 |
| N         | 8       | 94.9 |
|           | 10      | 95.7 |
|           | 12      | 95.9 |
|           | 14      | 95.4 |
| $p_l$     | (2,1,1) | 95.9 |
|           | (4,1,1) | 95.1 |
|           | (2,2,1) | 94.7 |
|           | (2,2,2) | 91.9 |
Table 4. Ablation on the feature fusion module.

| Configuration | AUPRO Result |
|---------------|--------------|
| with FFN      | 95.9 |
| without FFN   | 94.7 |
Table 5. Ablation on the feature restoration module.

| Method | AUPRO Result |
|--------|--------------|
| ST     | 94.2 |
| EA     | 95.2 |
| DA     | 94.7 |
| Ours   | 95.9 |