Industrial Image Anomaly Detection via Self-Supervised Learning with Feature Enhancement Assistance

Wu, Bin; Wang, Xiaoqi

doi:10.3390/app14167301

by Bin Wu * and Xiaoqi Wang

Hebei Key Laboratory of Marine Perception Network and Data Processing, School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(16), 7301; https://doi.org/10.3390/app14167301

Submission received: 23 July 2024 / Revised: 13 August 2024 / Accepted: 19 August 2024 / Published: 19 August 2024

(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)

Abstract

Industrial anomaly detection is constrained by the scarcity of anomaly samples, which limits the applicability of supervised learning methods. Many studies have therefore focused on generating anomaly images and adopting self-supervised learning approaches, and networks pre-trained on ImageNet have been explored to assist in this training process. However, achieving accurate anomaly detection remains time-consuming because the depth and parameter count of the network are not reduced. In this paper, we propose a self-supervised learning method based on Feature Enhancement Patch Distribution Modeling (FEPDM), which generates simulated anomalies. Unlike direct training on the original feature extraction network, our approach utilizes a pre-trained network to extract multi-scale features. By aggregating these multi-scale features, we are able to train at the feature level, thereby adapting more efficiently to various network structures and reducing domain bias with respect to natural image classification. Additionally, training at the feature level significantly reduces the number of parameters in the training process. This approach not only enhances the model’s generalization ability but also significantly improves the efficiency of anomaly detection. The method was evaluated on the MVTec AD and BTAD datasets, and (image-level, pixel-level) AUROC scores of (95.7%, 96.2%) and (93.4%, 97.6%) were obtained, respectively. The experimental results demonstrate the efficacy of our method in tackling the scarcity of abnormal samples in industrial scenarios, while also highlighting its broad generalizability.

Keywords: self-supervised learning; anomaly detection; feature enhancement

1. Introduction

Anomaly detection [1] aims to accurately identify instances of anomalous or defective patterns that deviate significantly from those prevalent in regular instances, thereby ensuring data quality and analysis accuracy. Due to its wide range of application scenarios, including medical image analysis [2], industrial defect detection [3], video surveillance [4], etc., anomaly detection has been extensively studied. One of the important branches is industrial defect detection. Industrial image processing encounters unique complexities, among which the acquisition of anomaly images is particularly challenging. This difficulty arises not only due to their rarity and the complexity of labeling but also because of the uncertainty and diversity of the anomalies themselves. Defects can appear as extremely subtle surface imperfections like scratches, or serious structural issues such as missing parts. These factors collectively contribute to the challenge of accurately detecting and analyzing anomaly images in industrial environments. To address the above issues, researchers generally prefer to model this problem in the framework of unsupervised learning. This approach aims to efficiently and accurately identify and analyze anomalous images despite the lack of prior knowledge, i.e., using only normal samples for training.

Recently, some embedding-based methods [5,6,7] have demonstrated excellent performance in anomaly detection and location. These methods utilize a network pre-trained on ImageNet to extract features from normal samples for learning the normal paradigm of the data. During inference, they calculate anomaly scores based on the distance between test samples and normal samples. However, unlike the diverse, abstract, and complex semantic information typically handled by object detection in natural scene images, anomalies in industrial images tend to be highly irregular. This irregularity creates a significant domain bias between industrial images and natural scenes, a major challenge that impedes the further development of embedding-based methods in the field of industrial anomaly detection.

On the other hand, in order to reduce the dependence on data labeling and to address the challenge of missing anomaly images, some methods [8,9,10] employ a self-supervised learning framework. These methods enrich the distributional properties of the data by means of data augmentation (e.g., rotation, contrast, brightness, etc.). CutPaste [10], for instance, artificially generates pseudo-anomaly images by means of cut-and-paste so that the model learns to distinguish between normal and anomalous images. In this way, the anomaly detection problem is transformed into a binary classification task that serves as the training target of the model, thus effectively overcoming the scarcity of anomaly samples. However, the anomalies generated by CutPaste have limited coverage and cannot comprehensively describe real-world defects.

To overcome these two problems, we propose Feature Enhancement Patch Distribution Modeling (FEPDM), a framework for anomaly detection and location based on feature enhancement and self-supervised learning. FEPDM introduces a novel data augmentation technique that generates pseudo-anomaly images using Perlin noise [11] combined with anomaly source images. It also incorporates a self-supervised proxy task, which trains the model to distinguish between normal images and anomalous images. Perlin noise is capable of generating anomalous blocks with varying sizes, shapes, and natural textures, significantly enriching the diversity of anomalies. In addition, instead of training the entire network, we utilize a pre-trained network to extract multi-scale features and train at the feature level, which reduces inter-domain bias and the number of model training parameters. Furthermore, to obtain multi-scale features, we have designed a Multi-Scale Feature Aggregation (MSFA) module to aggregate features extracted from different layers of the network, thereby minimizing feature redundancy and allowing for tighter feature space aggregation. After obtaining the multi-scale features, we learn the latent space representation of the features through a simple encoder and finally obtain the classification results through a classifier. With this design, we address the challenge of scarce samples in industrial production environments and mitigate the domain bias commonly encountered when pre-trained image classification models are applied directly to anomaly detection, thereby achieving more precise and reliable model performance. Our main contributions are outlined as follows:

We have designed a new unsupervised anomaly detection framework, FEPDM, which reduces bias between the source and target domains while significantly decreasing the number of model training parameters by conducting training at the feature level.
We have introduced a novel anomaly simulation strategy for self-supervised learning of the model. This strategy generates anomaly blocks with diverse textures and structures, closely resembling real-world anomalies.
We have elaborated a multi-scale feature aggregation module capable of integrating multi-resolution, multi-semantic features from different network hierarchies. This module effectively reduces the dimensionality of these features to the target dimensions through an efficient dimensionality reduction strategy.
Extensive experiments on today’s most challenging real-world datasets, such as MVTec and BTAD, demonstrate that our approach exhibits excellent performance and generalization.

The subsequent sections of this article are structured as follows. In Section 2, the existing literature on unsupervised anomaly detection and self-supervised learning is comprehensively reviewed. Section 3 presents a detailed introduction of the method proposed in this paper. Section 4 comprehensively outlines the experimental setup, presents the results, conducts thorough analyses, and includes ablation studies to validate the proposed method. Lastly, Section 5 summarizes the content of this article and discusses potential future research directions.

2. Related Work

Unsupervised learning methods for anomaly detection using neural networks have been extensively analyzed. Current industrial anomaly detection can be classified into the following three types according to different perspectives: reconstruction-based methods, embedding-based methods, and self-supervised learning methods constructed by simulating anomalies.

2.1. Reconstruction-Based Approach

One family of unsupervised anomaly detection methods is reconstruction-based. The basic assumption is that the model reconstructs normal images well, whereas anomalous images produce a large reconstruction error, which serves as an indicator for anomaly detection. Most reconstruction-based methods train the network for reconstruction using an Auto-Encoder (AE) [12], a Variational Auto-Encoder (VAE) [13], or Generative Adversarial Networks (GAN) [14]. However, because such neural networks generalize strongly, the anomalous regions are also reconstructed well, which seriously affects the anomaly detection performance.

To solve this problem, RIAD [15] transforms anomaly detection into an image repair problem by introducing a random mask and attempting to use the mask to cover anomalous regions, making anomalies invisible to the autoencoder. InTra [16] further formulates anomaly detection as a patch sequence inpainting problem and presents a solution through a deep Transformer network architecture solely comprising stacked multi-headed self-attention blocks, where convolutional operations are entirely excluded. The literature [17] utilized a Vision Transformer (ViT) [18] as the backbone network to comprehend the global context among image patches, while concurrently introducing a memory module to alleviate the issue of reduced detection accuracy caused by abnormal generalization. SCGAN [19] uses a CopyPaste image enhancement module to simulate anomalies and assist the network in semantic reconstruction. FEGAN [20] employs a feature extraction network to capture the essential characteristics of images, while an Improved Generative Adversarial Network (IGAN) structured on an asymmetric encoder–decoder framework performs the task of the normal reconstruction of abnormal regions under the condition of insensitivity to distinguishing positive and negative class images. DSR [21] generates feature-level anomalies by sampling the learned quantized feature space and uses a dual-decoder architecture to improve reconstruction accuracy. DRAEM [22], by generating anomalies, employs a reconstruction sub-network and a discriminative sub-network to convert the anomaly detection problem into a discriminative problem. SSPCAB [23] employs the concept of masked autoencoders to reconstruct feature maps, thereby learning the distribution of normal images.

2.2. Embedding-Based Approach

Embedding-based methods capture the general paradigm of normal samples by constructing an embedded feature space and subsequently evaluating the features using different statistical models or distance metrics. The main ones include Gaussian density estimation, normalizing flows [24,25], non-parametric kernel density estimation (KDE) [26], and K-nearest neighbors (KNN) [27], etc. SPADE [5] uses a memory bank composed of normal features extracted from a pre-trained backbone network and employs KNN for anomaly detection at both the image and pixel level. PatchCore [7] builds on this by extracting multi-level features to obtain a better feature representation and aggregates local neighborhood feature blocks to increase the receptive field size and robustness to small spatial deviations. Additionally, PatchCore reduces the memory bank size through core-set downsampling. Gaussian-AD [28] uses multivariate Gaussian distribution for modelling image-level features and employs Mahalanobis distance to compute anomaly scores. PaDiM [6] follows this paradigm and further extracts multi-level patch-level features instead of image-level features, employing multivariate Gaussian modelling at patch level for enhanced anomaly detection performance. Furthermore, one-class classification is inherently integrated within embedding-based methods as follows: OC-SVM [29] aims to achieve anomaly detection by using only positive samples to learn a hyperplane that describes the features of the positive samples and keeps negative samples as far away as possible from this hyperplane. The basic idea of SVDD [30] is to construct a minimal hypersphere to contain as much of the data as possible, with variants such as [31,32] also derived from this concept.

2.3. Self-Supervised Learning Based Approach

The scarcity of anomaly samples limits the application of supervised learning in anomaly detection. Thus, some researchers have tried to use self-supervision [8,9,10]. Self-supervised learning is often used as a proxy task that learns features from unlabeled data, thereby generalizing to different downstream tasks. In anomaly detection tasks, a self-supervised task is often used to fine-tune a pre-trained CNN feature extractor to reduce the domain bias between natural image classification scenarios and industrial anomaly detection scenarios. Commonly used self-supervised enhancement strategies include rotation [33], deletion [34], color transformation [35], and mixing [8]. MixUp [8] takes two images and their labels, superposing them according to their weights to generate a new dataset and corresponding labels. Based on this, techniques such as [36,37] were developed. MixMatch [9] performs self-supervised learning by augmenting labeled and unlabeled data multiple times. ReConPatch [38] leverages the similarity between patch-level features, employing pairwise similarity and contextual similarity as pseudo-labels for self-supervised learning. The literature [39] achieved self-supervised learning by employing a compact loss to normalize features into a unified distribution prior to reconstructing the feature map.

There are also methods that use the generation of simulated anomalies for self-supervised training. CutPaste [10] performs self-supervised training by cutting a rectangular block from an image and pasting it into the original image. SimpleNet [40] performs discriminative training by adding random Gaussian noise to the feature space. P2RW [41] proposes a method to create pseudo-anomaly samples using pixel-point random walk, which realizes more diversified random data generation through random pixel walking.

3. Materials and Methods

This section specifically describes the methodology proposed and the datasets used in this paper, which includes generating pseudo-anomaly images and labels, a multi-scale feature aggregation module, a self-supervised training module, and calculating the anomaly score map. The proposed method is illustrated in Figure 1.

3.1. Pseudo-Anomaly Image and Label Generation

Within the framework of self-supervised learning, in which reliance on manually labeled data is unnecessary, models can be trained by automatically generating pseudo-labels based on the structure and characteristics of the data itself. Therefore, an effective strategy involves training the model to discern between normal and abnormal patterns by generating pseudo-abnormal images. In this way, the model is able to learn the intrinsic features of normal samples and identify samples that deviate significantly from these features as abnormal, thereby facilitating effective anomaly detection.

Inspired by DRAEM [22] and MemSeg [42], we introduce an effective anomaly simulation strategy to assist self-supervised training. We utilize Perlin noise and the Describable Textures Dataset (DTD) [43] for anomaly simulation. Perlin noise is an algorithm used to generate natural, continuous, and controlled random textures and shapes, widely applied in computer graphics, game development, and art creation. It produces textures with natural variations and details, enhancing the realism of generated images or shapes, which is essential for simulating anomalies. The DTD is a texture image dataset, which consists of images depicting various texture classes. We generate the simulated anomalies as shown in Figure 2 with the following steps:

(a): A random two-dimensional Perlin noise P is first generated and randomly rotated to increase diversity. Subsequently, the noise is binarized using a set threshold to obtain the original anomaly mask $M_{P}$. For the normal image I, we obtain the foreground mask $M_{IF}$ of the input image through threshold binarization. This ensures that our simulated anomalies are generated on the detected target rather than in the background, avoiding unnecessary noise. The simulated anomaly mask M is obtained by multiplying the foreground mask with the Perlin mask, as shown in Figure 2a.

$M = M_{IF} \odot M_{P}$ (1)

(b): The foreground of the anomaly block $I_{AF}$ is obtained by element-wise multiplication of the simulated anomaly mask M with both the anomaly source image $I_{S}$ and the normal image I, blended as follows:

$I_{AF} = \lambda (M \otimes I_{S}) + (1 - \lambda)(M \otimes I)$ (2)

A transparency factor $\lambda \in [0.10, 0.95]$ in Equation (2) is introduced to mix the anomaly source information with the original image information, enhancing the realism of the synthetic anomaly. Like $\odot$, $\otimes$ denotes element-wise multiplication; the two symbols are used only to distinguish operations on masks from operations on images. There are two types of anomaly source images: texture and structure. Texture-based anomaly source images are sourced from the DTD dataset. Structural anomaly sources are derived directly from the input images themselves: we randomly select three functions from a series of augmentation functions and apply them to the original input image. These functions include gamma contrast adjustment, luminance multiplication, sharpness enhancement, hue and saturation adjustment, and affine transformation (including rotation), aimed at diversifying the data. Subsequently, the augmented image is divided into an 8 × 8 grid whose cells are randomly shuffled to create the structural anomaly source image. Each grid cell has size $[W_{I}/8, H_{I}/8]$, where $[W_{I}, H_{I}]$ denotes the resolution of the input image.

(c): The input image is multiplied by the inverse of the simulated anomaly mask, $M_{Inv}$ (the complement of M), to obtain the image without anomalous blocks (Figure 2b). This image is then added to the anomaly block foreground $I_{AF}$ to obtain the final pseudo-anomalous image $I_{A}$:

$I_{A} = I_{AF} + (I \otimes M_{Inv})$ (3)

In order to distinguish between normal images, texture-class anomalies, and structure-class anomalies, we assign pseudo-labels to the samples as they are generated, using 0 for normal images, 1 for texture-class anomalies, and 2 for structure-class anomalies:

$label = \begin{cases} 0, & \text{normal} \\ 1, & \text{texture} \\ 2, & \text{structure} \end{cases}$ (4)
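
Steps (a) to (c) and the grid shuffle used for structural anomaly sources can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `shuffle_grid` and `synthesize_anomaly` are hypothetical, the binarized Perlin mask and foreground mask are assumed to be given as 0/1 arrays, images are assumed to be floating-point H × W × C arrays, and the inverse mask is taken to be 1 − M.

```python
import numpy as np

def shuffle_grid(image, grid=8, rng=None):
    """Split an H x W x C image into grid x grid tiles and reorder them randomly.

    Assumes H and W are divisible by `grid` (e.g. 256 / 8 = 32).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid  # tile height and width
    tiles = [image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
             for r in range(grid) for c in range(grid)]
    order = rng.permutation(len(tiles))
    rows = [np.concatenate([tiles[order[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

def synthesize_anomaly(image, source, perlin_mask, fg_mask, lam):
    """Blend `source` into `image` inside the simulated anomaly mask.

    perlin_mask, fg_mask: H x W arrays of 0/1 (assumed to be precomputed).
    lam: transparency factor, drawn from [0.10, 0.95] in the paper.
    """
    m = (fg_mask * perlin_mask)[..., None].astype(image.dtype)  # Eq. (1)
    i_af = lam * (m * source) + (1.0 - lam) * (m * image)       # Eq. (2)
    i_a = i_af + image * (1.0 - m)                              # Eq. (3), M_Inv = 1 - M (assumption)
    return i_a, m[..., 0]
```

Outside the mask the output equals the input image exactly; inside the mask it is a λ-weighted blend of the source and the original pixels, which is why a small λ yields a faint, subtle defect.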

3.2. Multi-Scale Feature Aggregation

The core of anomaly detection lies in distinguishing between feature representations of normal and anomalous data. To tackle the complexity and diversity of the data, researchers often employ pre-trained ResNet-like networks [44] as feature extractors. The initial layers of a ResNet network typically capture low-level features such as edges and textures of an image, whereas deeper layers are able to learn more advanced semantic concepts such as shapes, object parts, and even representations of the whole object. In anomaly detection, these multi-layered features can provide comprehensive information about different aspects of a data point thus helping to distinguish between normal and abnormal patterns.

However, the direct use of all the extracted intermediate layer features results in a large number of redundant features and computations, although sufficient semantic and resolution information are obtained. Therefore, we process the features using an attention mechanism to filter out critical features.

Our overall feature extraction process is illustrated in Figure 3. Firstly, the intermediate layer features $f_{e}^{j}(x_{i})$ are obtained for each image through the feature extractor ResNet18, where $j \in \{1, 2, 3\}$ indexes the features from the first three layers, and $x_{i}$ denotes both the normal images and their generated pseudo-abnormal images. Subsequently, the multi-semantic low-resolution features from layers 2 and 3 are upsampled to match the resolution of layer 1, and then merged into a preliminary aggregated feature $f_{pa}$ through channel concatenation as follows:

$f_{pa} = Concat(f_{e}^{1}, UP_{s1}(f_{e}^{2}), UP_{s2}(f_{e}^{3}) \mid (W, H))$ (5)

where $UP_{s1}$ and $UP_{s2}$ in Equation (5) denote the upsampling of $f_{e}^{2}$ and $f_{e}^{3}$ to the same resolution as $f_{e}^{1}$, respectively. $[W, H]$ denotes the resolution of the feature map $f_{e}^{1}$. The resolutions of $f_{e}^{2}$ and $f_{e}^{3}$ are $[W/2, H/2]$ and $[W/4, H/4]$, respectively.
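
The shape bookkeeping in Equation (5) can be illustrated with a short NumPy sketch. The helper names are hypothetical, and nearest-neighbour upsampling is an assumption made here for simplicity, since the interpolation mode is not specified in the text. The channel counts (64, 128, 256) and spatial sizes (56, 28, 14) are those of the first three ResNet18 stages for a 224 × 224 input.

```python
import numpy as np

def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a C x H x W feature map by an integer factor."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)

def preliminary_aggregate(f1, f2, f3):
    """Eq. (5): bring f2 and f3 to the resolution of f1, then concatenate along channels."""
    s1 = f1.shape[1] // f2.shape[1]  # upsampling factor for layer 2 (2 here)
    s2 = f1.shape[1] // f3.shape[1]  # upsampling factor for layer 3 (4 here)
    return np.concatenate(
        [f1, upsample_nearest(f2, s1), upsample_nearest(f3, s2)], axis=0)

# Placeholder feature maps with ResNet18 stage shapes for a 224 x 224 input.
f1 = np.zeros((64, 56, 56))
f2 = np.zeros((128, 28, 28))
f3 = np.zeros((256, 14, 14))
f_pa = preliminary_aggregate(f1, f2, f3)
print(f_pa.shape)  # (448, 56, 56)
```

The concatenated feature has 64 + 128 + 256 = 448 channels at the layer-1 resolution, which is the redundancy that the subsequent convolution and attention steps are meant to reduce.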

Subsequently, a 1 × 1 point-wise convolution is applied to the concatenated preliminary aggregated feature $f_{pa}$ to initially fuse the channel information. The resulting fused feature, enriched with semantic information, is further enhanced by utilizing Coordinate Attention [45] to boost the representational power of the feature map and highlight crucial spatial location information, resulting in the intermediate aggregated feature $f_{ia}$ as follows:

$f_{ia} = CA(PWC(f_{pa}))$ (6)

where $CA$ denotes the Coordinate Attention module and $PWC$ denotes the 1 × 1 point-wise convolution.

The preliminary aggregated feature $f_{pa}$ in Equation (5) encapsulates some fundamental yet pivotal information, while the intermediate aggregated feature $f_{ia}$ in Equation (6), after convolution and channel attention processing, places greater emphasis on the important parts and details in the image. Combining these two features can integrate the rich information from the original features with high-level representations from the processed features, enabling the model to utilize more comprehensive and effective feature information in subsequent processing. Finally, two 3 × 3 convolutions are applied to further refine the features and reduce the channel dimension to the target dimension, resulting in the final multi-scale aggregated feature $f_{agg}$ as follows:

$f_{agg} = CONV_{2}(CONV_{1}(f_{pa} \oplus f_{ia}))$ (7)

3.3. Self-Supervised Training

In the previous section, we artificially generated pseudo-anomaly images and pseudo-labels. This section focuses on learning self-supervised representations by constructing a one-class classifier using normal images and the generated anomaly images. Our goal is to discriminatively distinguish between the feature distributions of the normal and abnormal images, rather than carefully classifying the features. In other words, our aim is for the model to learn to recognize what is normal and what is abnormal, rather than accurately identifying specific types of anomalies, due to the diversity and uncertainty of abnormalities in reality.

Unlike some other self-supervised models, our approach, inspired by embedded methods, does not involve training the entire network from scratch but rather focuses on the feature level. Firstly, we extract features from the middle layer through the pre-trained network. Subsequently, using multi-scale feature fusion techniques, we obtain fused features with varying semantic information and resolutions. Next, these fused features serve as inputs to our model where we construct a concise encoder to project them into a latent space for discrimination and classification. This design enables our model to leverage the intrinsic structure of the data more effectively, thereby enhancing classification performance.

By generating simulated anomalies, we obtain normal images, texture-class anomalous images, and structure-class anomalous images. These are processed through a pre-trained feature extractor to get multi-scale fusion features, which are then input into an encoder to generate latent vectors. Finally, these vectors are fed into a classifier for classification, using the cross-entropy loss function for training. The training objective is as follows:

$L = \mathbb{E}\{\mathrm{CE}(g(\Phi(f_{agg}(x))), 0) + \mathrm{CE}(g(\Phi(f_{agg}(x_{Tex}))), 1) + \mathrm{CE}(g(\Phi(f_{agg}(x_{Str}))), 2)\}$ (8)

In Equation (8), $x \in X_{N}$ denotes the input normal image, and $x_{Tex}, x_{Str} \in X_{A}$ denote the generated texture-class and structure-class anomalies, respectively. Furthermore, 0, 1, and 2 denote the corresponding pseudo-labels. $f_{agg}(\cdot)$ denotes the multi-scale aggregated features in Equation (7), $\Phi$ and $g$ denote the encoder and classifier, respectively, and $\mathrm{CE}(\cdot, \cdot)$ denotes the cross-entropy loss function. The structures of the encoder and classifier are shown in Table 1, where the encoder consists of three convolutional blocks and corresponding pooling layers, each containing a convolutional layer, BatchNorm layer, and ReLU layer. The classifier consists of a multi-layer perceptron (MLP) and fully connected (FC) layers.
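
As a rough illustration of Equation (8), the three-term objective can be written in NumPy given the classifier logits for each group of images. The function names are hypothetical, and the logits are assumed to be the outputs $g(\Phi(f_{agg}(\cdot)))$ with shape B × 3; the expectation is approximated here by a batch mean.

```python
import numpy as np

def cross_entropy(logits, label):
    """Mean cross-entropy of B x 3 logits against one shared integer label."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[:, label].mean()

def fepdm_loss(logits_normal, logits_texture, logits_structure):
    """Sum of the three cross-entropy terms in Eq. (8) with pseudo-labels 0, 1, 2."""
    return (cross_entropy(logits_normal, 0)
            + cross_entropy(logits_texture, 1)
            + cross_entropy(logits_structure, 2))
```

For an untrained classifier that outputs equal logits for all three classes, each term equals ln 3, so the loss starts near 3 ln 3 ≈ 3.30 and decreases as the three groups become separable.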

3.4. Calculate Anomaly Score Map

We used a multivariate Gaussian distribution $N(\mu, \Sigma)$ to model the normal samples and utilized the Mahalanobis distance between a test sample and the normal distribution at position $(i, j)$ as the anomaly score. The Mahalanobis distance is a commonly used measure of the difference between two samples, which takes into account the characteristics of the sample distribution, correlation between features, and scale invariance. It effectively measures differences between samples, thereby identifying anomalous data more efficiently. The details are as follows:

Firstly, the multi-scale feature map $f_{agg}(x^{k})$ of the normal samples in the training set is extracted by the model obtained from self-supervised learning, where $k \in \{1, 2, \dots, N\}$ and $N$ represents the number of normal samples. Each feature vector of the feature map is treated as a patch, and the patch-level multivariate Gaussian distribution model $N(\mu_{(i, j)}, \Sigma_{(i, j)})$ is obtained by calculating its mean and covariance matrices at each patch $(i, j)$. The mean and covariance matrices of the samples are calculated as follows:

$\mu_{(i, j)} = \frac{1}{N} \sum_{k = 1}^{N} f_{agg}(x^{k})_{(i, j)}$ (9)

$\Sigma_{(i, j)} = \frac{1}{N - 1} \sum_{k = 1}^{N} (f_{agg}(x^{k})_{(i, j)} - \mu_{(i, j)})(f_{agg}(x^{k})_{(i, j)} - \mu_{(i, j)})^{T} + \alpha E$ (10)

The $\mu_{(i, j)}$ and $\Sigma_{(i, j)}$ in Equations (9) and (10) represent the mean and covariance at each patch, respectively, where $i \in [1, W]$, $j \in [1, H]$, and $W \times H$ denotes the resolution of the multi-scale feature map. $\alpha$ is a very small number, typically taken as 0.01, and $E$ is the identity matrix. $\alpha E$ is used as a regularization term to ensure that the covariance matrix is full rank and invertible, allowing it to be used to compute the Mahalanobis distance.
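
Equations (9) and (10) amount to fitting one Gaussian per spatial position. A minimal NumPy sketch is given below; the function name and the N × C × H × W array layout are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def fit_patch_gaussians(features, alpha=0.01):
    """Fit a Gaussian at every patch position.

    features: N x C x H x W array of aggregated features from normal images.
    Returns mu with shape H x W x C and cov with shape H x W x C x C.
    """
    n, c, h, w = features.shape
    x = features.transpose(2, 3, 0, 1)        # H x W x N x C
    mu = x.mean(axis=2)                       # Eq. (9)
    centered = x - mu[:, :, None, :]
    # Eq. (10): unbiased sample covariance per patch, plus the alpha * E regularizer.
    cov = np.einsum('hwnc,hwnd->hwcd', centered, centered) / (n - 1)
    cov += alpha * np.eye(c)
    return mu, cov
```

The regularizer matters whenever N is smaller than the feature dimension C: the sample covariance then has rank at most N − 1 and cannot be inverted, whereas adding αE raises every eigenvalue to at least α.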

During the inference process, the same model is used to extract the multi-scale feature maps $f_{agg}(x_{t})$ of the test samples in the test set. An anomaly score $M(f_{agg}(x_{t})_{(i, j)})$ is assigned to each patch by calculating the Mahalanobis distance of the feature maps from the set of normal samples at each patch $(i, j)$. The Mahalanobis distances are calculated as follows:

$M(f_{agg}(x_{t})_{(i, j)}) = \sqrt{(f_{agg}(x_{t})_{(i, j)} - \mu_{(i, j)})^{T} \Sigma_{(i, j)}^{-1} (f_{agg}(x_{t})_{(i, j)} - \mu_{(i, j)})}$ (11)

In Equation (11), $\Sigma_{(i, j)}^{-1}$ denotes the inverse of the covariance matrix, and we end up with a Mahalanobis distance matrix serving as the anomaly score map. It contains the anomaly scores of all feature vectors in the feature map. A low score means that the region is closer to the normal samples and is hence considered normal. Conversely, a high score indicates that the region is further from the normal samples, indicating abnormality.
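
Equation (11) can be evaluated for all patches at once. The sketch below is an assumption-laden illustration rather than the authors' code: the function name and array layouts are hypothetical, and a batched linear solve is used in place of an explicit matrix inverse, which gives the same quantity but is usually better conditioned.

```python
import numpy as np

def mahalanobis_map(feat, mu, cov):
    """Per-patch Mahalanobis distance.

    feat: C x H x W test feature map; mu: H x W x C; cov: H x W x C x C.
    Returns an H x W anomaly score map.
    """
    diff = feat.transpose(1, 2, 0) - mu                      # H x W x C
    solved = np.linalg.solve(cov, diff[..., None])[..., 0]   # Sigma^{-1} (f - mu) per patch
    quad = np.einsum('hwc,hwc->hw', diff, solved)
    return np.sqrt(np.maximum(quad, 0.0))                    # clamp tiny negative round-off
```

With an identity covariance the score reduces to the Euclidean distance from the patch mean; a covariance of 4E halves that distance, which shows how directions with large normal variation are down-weighted.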

3.5. Datasets

In order to validate the effectiveness and generalization of the algorithm proposed in this paper, we evaluate the proposed anomaly detection algorithm on two challenging publicly available unsupervised anomaly detection datasets—the MVTec [3] and the BTAD dataset [46].

MVTec AD is an industrial anomaly detection dataset specifically proposed for unsupervised anomaly detection, which contains 15 subsets totaling 5354 high-resolution images across 10 target categories and 5 texture categories. The training set includes 3629 normal images, while the test set includes 1725 images of both normal and abnormal instances. Anomalies in the dataset include scratches, dents, missing defects, amounting to a total of 73 different defect categories to simulate real-world industrial anomaly detection scenarios. It is noteworthy that MVTec provides precise pixel-level annotations for each anomaly image in the test set, representing the anomalous areas. The BeanTech Anomaly Detection Dataset (BTAD) is a real-world industrial image dataset consisting of normal and defective products, with a total of 2542 images across three types of industrial products. Similar to MVTec, the training set comprises only normal images, whereas the test set contains both normal and abnormal images.

4. Results

4.1. Experimental Details

Our experimental environment is the PyTorch deep learning framework running on Python 3.8, Torch 2.0.1, and CUDA 11.8. The experiments were conducted on an Intel Core i7-13700F (Intel, Santa Clara, CA, USA) with an NVIDIA RTX 3050 OEM (8 G) (NVIDIA, Santa Clara, CA, USA). For image preprocessing, we first resized the image to 256 × 256 and performed pseudo-anomaly generation at that resolution. Afterwards, the image was center-cropped to 224 × 224 to fit the input of our model. For the feature extractor, we chose ResNet18 pre-trained on ImageNet (v.3.0). During the training process, we set the learning rate to 0.03 and used the SGD optimizer with a momentum of 0.9 and a weight decay of 0.00003. To optimize the model parameters more efficiently and speed up the training process, we used the CosineAnnealingWarmRestarts [47] scheduler. This scheduler combines the strategies of cosine annealing and periodic learning rate restarts and is able to adaptively adjust the learning rate at different stages of training. The number of training epochs is set to 256, and the batch size is 64. It is worth noting that since our training goal is to enable the model to correctly distinguish normal and anomalous inputs, we only need to continuously input different normal and anomalous images for the model to learn. Therefore, an epoch in this paper refers to training on one batch, where each batch contains normal, texture-abnormal, and structure-abnormal images.
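
The learning rate produced by cosine annealing with warm restarts follows a simple closed form, sketched below in plain Python. Only the base learning rate of 0.03 comes from the text; the cycle length `t_0`, the multiplier `t_mult`, and the minimum rate `eta_min` are not reported, so the defaults here are illustrative assumptions.

```python
import math

def cosine_warm_restart_lr(step, base_lr=0.03, eta_min=0.0, t_0=64, t_mult=1):
    """Learning rate at `step` under cosine annealing with warm restarts.

    t_0, t_mult and eta_min are assumed values, not taken from the paper.
    """
    t_i, t_cur = t_0, step
    while t_cur >= t_i:   # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult     # each new cycle is t_mult times longer
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
```

Within a cycle the rate decays from `base_lr` toward `eta_min` along a half cosine and then jumps back to `base_lr` at the restart; under the assumed `t_0 = 64`, the 256 training steps would cover four such cycles.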

4.2. Experimental Results

4.2.1. Evaluation Metrics

The ROC curve (Receiver Operating Characteristic curve) is plotted with the False Positive Rate (FPR) on the horizontal axis and the True Positive Rate (TPR) on the vertical axis. It shows the performance of the model at various classification thresholds. The ROC-AUC is the area under the receiver operating characteristic curve, which takes values between 0 and 1. The larger the AUC value, the better the model distinguishes between positive and negative samples. Therefore, in order to validate the anomaly detection and defect location performance, we evaluate the model using both image-level and pixel-level ROC-AUC metrics.
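
For reference, ROC-AUC can be computed without explicitly sweeping thresholds, using its equivalence to the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one. The following NumPy sketch uses that rank-sum identity; the function name is hypothetical and the code is an illustration, not the evaluation script used in the paper.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity; ties get average ranks.

    scores: anomaly scores; labels: 1 for anomalous, 0 for normal.
    Both classes must be present.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The same function covers both metrics: for image-level AUC each image contributes one score and one label, while for pixel-level AUC the anomaly score maps and ground-truth masks are flattened so that every pixel counts as a sample.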

4.2.2. Image-Level Anomaly Detection

Table 2 compares the image-level anomaly detection results of our method with those of current state-of-the-art methods on the MVTec dataset, including SPADE [5], RIAD [15], P-SVDD [31], P-SVDD-C [32], PaDiM [6], CutPaste [10], and SCGAN [19]. Our method achieved the highest average ROC-AUC score of 95.7%, which is 0.2% and 0.5% higher than PaDiM and CutPaste, respectively. Although our method is not the best in most individual categories, its highest overall average suggests good generalizability across categories.

4.2.3. Pixel-Level Anomaly Location

Table 3 shows the anomaly location results of our method compared to the current state-of-the-art methods on the MVTec dataset. Our method achieves the second-best average result: it is only 0.3% lower than PaDiM, 0.2% higher than CutPaste, and 2% higher than RIAD. The visualization of the anomaly location is shown in Figure 4. To demonstrate the generality of our method, we also compared it with further state-of-the-art methods (PatchCore [7] and VT-ADL [46]) on BTAD; the results are shown in Table 4, and the visualization is depicted in Figure 5. We achieved the best average anomaly detection and location performance on BTAD. Although the single-class performance was not always the best, the highest average AUC scores were achieved, further illustrating the generalization performance of our model.

4.3. Ablation Experiment

4.3.1. Features Layer Selection

We investigated the effect of using features extracted from different network layers on anomaly detection results. ResNet18 consists of four main layers, of which we primarily utilized the first three. Table 5 displays the AUC scores obtained using the different feature layers. It can be seen that the highest image-level and pixel-level AUC scores are obtained when using all three layers of multi-scale features. It is worth noting that using only the third layer of features yields the same image-level AUC score of 95.7% as the multi-scale features, but a lower pixel-level AUC score of 94.2%. This discrepancy may be attributed to the fact that deeper features contain richer semantic and global information, which is advantageous for the classification of anomalous images. However, their lower resolution may result in missing some local information, leading to lower pixel-level scores.

Figure 6 visualizes the results obtained with different feature layers and compares them with PaDiM. We focus on multi-scale features, selecting the combinations of layers 1 and 3; layers 2 and 3; and layers 1, 2, and 3. The results show that our multi-scale feature aggregation achieves more accurate localization than directly concatenating multi-layer features (as in PaDiM). The most accurate localization is attained with layers 1, 2, and 3, which combine the semantic information of the deeper layers with the higher resolution of the shallower ones. Additionally, compared to PaDiM, our method produces fewer false detections. We attribute this to our self-supervised proxy task, which reduces the domain bias with respect to natural image classification and thus enhances adaptability to anomaly detection tasks.
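For reference, the baseline that this comparison is made against, direct concatenation of multi-layer features, can be sketched as follows. This is not the MSFA module; it only illustrates the PaDiM-style baseline. The shapes are those of the first three ResNet18 stages for a 224 × 224 input (64, 128, and 256 channels at 56 × 56, 28 × 28, and 14 × 14); the random arrays and the helper name `upsample_nearest` are stand-ins introduced for illustration.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
# Random stand-ins for the outputs of ResNet18 layers 1-3.
layer1 = rng.standard_normal((64, 56, 56)).astype(np.float32)
layer2 = rng.standard_normal((128, 28, 28)).astype(np.float32)
layer3 = rng.standard_normal((256, 14, 14)).astype(np.float32)

# Bring every map to the 56 x 56 grid of layer 1, then stack along channels.
fused = np.concatenate(
    [layer1, upsample_nearest(layer2, 2), upsample_nearest(layer3, 4)],
    axis=0,
)
# fused has 64 + 128 + 256 = 448 channels at 56 x 56 positions.
```

The 448-channel result is why a full-dimensional ResNet18 memory bank is considerably larger than one built on a reduced feature dimension.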

4.3.2. Attention Mechanisms

We also investigated the effects of different attention mechanisms on model performance, comparing no attention, SE attention [48], CBAM attention [49], and CA (coordinate attention) [45]. SE is a channel attention mechanism that emphasizes important features by weighting each channel. CBAM integrates both channel and spatial attention modules, and thus considers channel and spatial information jointly. The CA attention mechanism captures the correlation between different regions in the image and models the global information of the image, thus improving the feature representation.

The results are presented in Table 6. CA attention achieves the highest AUC scores for both the texture and object classes (for the object classes, it ties with the no-attention baseline on image-level AUC). On average, it outperforms the other three configurations (no attention, SE, and CBAM) in image-level anomaly detection by 0.5%, 1.1%, and 1.3%, respectively, and in pixel-level anomaly location by 0.5% in each case. This improvement can be attributed to its introduction of coordinate information, which enables the model to better capture the relationship between different locations and to adaptively learn weights from the input, so that it can dynamically adjust the degree of attention paid to different parts. SE, on the other hand, weights features by channel only, so the weighted features may lose part of the local information. Although CBAM takes both channel and spatial information into account, its high computational complexity may lead to performance degradation in our setting.

4.3.3. Dimensional Ablation

We investigated the effect of the target feature dimension on the AUC-ROC scores for multi-scale features extracted with different attention mechanisms. Experiments were conducted at four dimensions, 100, 160, 200, and 256, as detailed in Table 7. For the SE and CBAM attention mechanisms, the highest image-level and pixel-level AUC scores were achieved at 256 dimensions. In contrast, with the CA attention mechanism, we obtained the highest image-level AUC score of 95.7% at 100 dimensions and the highest pixel-level AUC score of 96.6% at 200 dimensions, which are 0.6% and 0.2% higher, respectively, than those at the highest dimension, 256. Although increasing the dimensionality is generally expected to improve model performance, the results with the CA attention mechanism did not follow this trend in our study. This may be because the ability of CA to model global information makes it more effective at extracting and retaining performance-critical features during dimensionality reduction. Consequently, even at reduced dimensionality, the model still captures sufficient useful information to maintain or potentially improve overall performance.

4.3.4. Different Anomaly Simulation Strategies

Finally, we investigated the impact of different anomaly simulation strategies on anomaly detection performance. As shown in Figure 7, there are four different synthetic anomalies. The first two are from CutPaste, which randomly cuts a piece from the image and pastes it at another location, using rectangular blocks and small patches (‘scar’), respectively. The last two are the anomaly simulation strategies of this paper, without and with the foreground mask, respectively. Table 8 reports the anomaly detection performance of self-supervised training with these strategies. The ‘Rect & Scar’ column indicates that rectangular blocks and small patches were treated as separate classes in a three-way classification task. The ‘2-Way’ column indicates a binary classification task using the anomaly simulation of this paper without distinguishing between texture and structure anomalies, focusing solely on normal vs. anomalous instances. The experimental results show that our anomaly simulation strategy outperforms rectangular blocks and small patches, obtaining the highest image-level AUC score of 95.7%, while its pixel-level AUC is 0.2% lower than that of the strategy without a foreground mask. For the texture classes, the highest image-level and pixel-level AUC scores (98.7%, 96.7%) were obtained without the foreground mask. This is because texture samples occupy the entire image, so there is no distinction between background and foreground: the whole image is foreground. Threshold binarization of the image in this case could damage the original information of the image and degrade anomaly detection performance.
For the object classes, the best scores (93.9%, 96.5%) were achieved using the foreground mask, because we restricted the region of simulated anomalies to the target object rather than the background, thereby avoiding unnecessary noise in the background during anomaly generation. The random cut-and-paste method is likely to place anomalies in the background, especially in the screw category, where the target occupies only a small portion of the image, which hurts anomaly detection performance. Additionally, the anomalies simulated using Perlin noise tend to appear closer to real-world anomalies.
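The role of the foreground mask can be illustrated with a simplified sketch. It thresholds a noise map into a binary anomaly mask, optionally restricts that mask to the foreground, and pastes an anomaly source image into the masked region. Several simplifications are assumed here and are not taken from the paper: uniform random noise stands in for Perlin noise, the blend is a plain replacement rather than the exact form of Equations (2) and (3), the threshold of 0.5 is arbitrary, and the foreground is obtained by a simple intensity threshold on a toy image.

```python
import numpy as np

def simulate_anomaly(image, source, noise, fg_mask=None, threshold=0.5):
    """Paste `source` into `image` where the thresholded noise is active.

    All arrays are (H, W) floats. If `fg_mask` is given, anomalies are
    restricted to the foreground (mask value 1).
    """
    mask = (noise > threshold).astype(image.dtype)
    if fg_mask is not None:
        mask = mask * fg_mask
    augmented = (1.0 - mask) * image + mask * source
    return augmented, mask

rng = np.random.default_rng(0)
h = w = 8
image = np.zeros((h, w))
image[2:6, 2:6] = 1.0                    # bright object on a dark background
source = np.full((h, w), 0.5)            # toy anomaly source image
noise = rng.random((h, w))               # stand-in for Perlin noise
fg_mask = (image > 0.5).astype(float)    # toy foreground segmentation

aug_masked, mask_fg = simulate_anomaly(image, source, noise, fg_mask)
aug_plain, mask_plain = simulate_anomaly(image, source, noise)
```

With the foreground mask, `mask_fg` is zero everywhere outside the object, so the background is left untouched; without it, `mask_plain` can also cover background pixels, which is the situation described for small objects such as screws.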

4.4. Discussions

4.4.1. Computational Complexity and Memory Complexity

Table 9 presents a further comparison with two state-of-the-art algorithms, PaDiM and CutPaste, to illustrate the advantages of our algorithm. R18 and WR50 stand for ResNet18 and Wide-ResNet50 [50], respectively; RD100 and RD550 denote random dimensionality reduction to 100 and 550 channels. Using the same backbone network (ResNet18) and dimension (100), our algorithm has the same memory bank size as PaDiM. However, our image-level AUC score surpasses PaDiM by 5.2%, while the pixel-level AUC score is marginally lower by 0.3%. Compared to PaDiM-WR50-RD550 with Wide-ResNet50, our image-level AUC remains 0.2% higher, and our memory bank is nearly 30 times smaller (0.12 GB vs. 3.5 GB). Despite a 1.1% lower pixel-level AUC, the visualized anomaly score maps show better results than PaDiM, as shown in Figure 6, possibly owing to our training process. In comparison with CutPaste, our image-level and pixel-level AUCs are higher by 0.5% and 0.2%, respectively. Meanwhile, CutPaste trains the entire ResNet18 with a total of 11.8 M trainable parameters, while we train only at the feature level with the MSFA module and encoder part, reducing the parameter count to 3.8 M, which is roughly one third of that of CutPaste.
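The memory bank sizes in Table 9 can be approximately reproduced with a back-of-the-envelope calculation. The sketch below assumes that one mean vector and one full covariance matrix are stored per spatial position, that the feature map has 56 × 56 positions, that values are stored as 32-bit floats, and that the figure is per category; these are assumptions made for illustration rather than details stated in the text. The function name `memory_bank_gib` is hypothetical.

```python
def memory_bank_gib(dim, height=56, width=56, bytes_per_value=4):
    """Approximate size (GiB) of per-position Gaussian statistics."""
    values_per_position = dim + dim * dim   # mean vector + covariance matrix
    total_bytes = height * width * values_per_position * bytes_per_value
    return total_bytes / 2**30

sizes = {
    "R18-RD100": memory_bank_gib(100),    # about 0.12
    "R18-Full": memory_bank_gib(448),     # about 2.35 (64 + 128 + 256 channels)
    "WR50-RD550": memory_bank_gib(550),   # about 3.54
}
```

Under these assumptions, the estimates are close to the 0.12, 2.3, and 3.5 GB listed in Table 9. Because the covariance term grows quadratically with the feature dimension, reducing the dimension from 550 to 100 shrinks the memory bank by roughly a factor of 30.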

4.4.2. Limitations

Although the proposed method achieves strong results on the datasets overall, it still has some limitations. In the anomaly detection phase, we model the sample data with a multivariate Gaussian distribution, but in actual industrial production some samples may not conform to such a distribution, which makes detection difficult. Moreover, calculating the Mahalanobis distance at each location as the anomaly score places high demands on image alignment. We assumed that images were acquired under the same conditions and that the features satisfy a multivariate Gaussian distribution. Deviations from this assumption may lead to poor detection results.

For instance, our method performs relatively poorly on the Pill, Capsule, and Screw categories in Table 2. For Pill, it is challenging to learn the Gaussian distribution of normal samples because of the irregularly distributed small red dots in its normal images, as shown in Figure 8a. For Capsule and Screw, the normal images vary significantly at each position because of different rotations and angles, as shown in Figure 8b,c, respectively, resulting in lower scores that leave room for improvement.
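The per-position Gaussian scoring discussed in this subsection can be sketched as follows. The code fits one mean and one covariance matrix per spatial position from features of normal images and scores a test feature map by its Mahalanobis distance at each position. The array shapes, the regularization term `eps = 0.01`, and the function names are assumptions made for illustration; the random features are stand-ins for real embeddings.

```python
import numpy as np

def fit_gaussians(train_feats, eps=0.01):
    """Fit a Gaussian per position.

    train_feats: (N, P, D) features of N normal images at P positions.
    Returns means (P, D) and inverse covariances (P, D, D).
    """
    n, _, d = train_feats.shape
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    cov = np.einsum("npi,npj->pij", centered, centered) / (n - 1)
    cov += eps * np.eye(d)  # regularization keeps each covariance invertible
    return mean, np.linalg.inv(cov)

def mahalanobis_map(test_feats, mean, cov_inv):
    """Mahalanobis distance at each of the P positions for (P, D) features."""
    delta = test_feats - mean
    return np.sqrt(np.einsum("pi,pij,pj->p", delta, cov_inv, delta))

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 4, 3))   # 200 images, 4 positions, 3-D features
mean, cov_inv = fit_gaussians(train)

typical = mahalanobis_map(mean, mean, cov_inv)          # features at the mean
shifted = mahalanobis_map(mean + 5.0, mean, cov_inv)    # strongly deviating features
image_score = shifted.max()   # one common choice: image score = maximum over positions
```

Because each position has its own statistics, a normal object that is rotated or shifted presents features at positions where they were rarely seen during fitting, which inflates the distance even though no defect is present. This is the alignment sensitivity noted above.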

5. Conclusions

This paper proposes an anomaly detection framework, Feature Enhancement Patch Distribution Modeling (FEPDM), based on anomaly generation and feature enhancement. Perlin noise and anomaly source images are used to generate simulated anomalies that closely resemble real-world ones, addressing the scarcity of anomaly samples and allowing the model to be trained in a self-supervised manner. The Multi-Scale Feature Aggregation (MSFA) module allows the model to be trained directly at the feature level rather than training the entire feature extraction network, which significantly reduces the hardware dependency and makes the model lightweight. Extensive experiments on the real-world industrial datasets MVTec AD and BTAD demonstrate that our method performs well, achieving (image-level, pixel-level) AUC scores of (95.7%, 96.2%) on MVTec AD and (93.4%, 97.6%) on BTAD. In summary, by simulating anomalies, we overcome the scarcity of anomaly samples, providing a novel and effective solution for industrial anomaly detection. By training at the feature level and leveraging multi-scale features, FEPDM enhances the model’s generalization ability while reducing computational costs, making anomaly detection more efficient.

However, modeling with multivariate Gaussian distributions has some limitations, as discussed in Section 4.4.2. In addition, the sizes of the mean and covariance matrices of the normal samples stored in the memory bank, although independent of the number of samples, grow with the feature size: as the feature size increases, the memory bank also grows substantially. Future work will be devoted to identifying more suitable distributions for modeling the samples and to optimizing the memory bank so that it grows less as the feature dimension increases. Moreover, with the widespread development and application of generative models, such as diffusion models, we will explore novel anomaly simulation approaches that can generate anomalies even closer to those in real-world industrial scenarios.

Author Contributions

Conceptualization, B.W. and X.W.; methodology, B.W.; software, X.W.; validation, B.W. and X.W.; formal analysis, B.W. and X.W.; investigation, B.W. and X.W.; resources, B.W.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, B.W. and X.W.; supervision, B.W.; project administration, B.W.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China (NSFC) under Grant 62101113, China; the Natural Science Foundation of Hebei Province under Grant F2020501035; and the Fundamental Research Funds for the Central Universities under Grants N2123024 and N2023008.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pang, G.; Shen, C.; Cao, L.; van den Hengel, A. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2022, 54, 1–38.
2. Fernando, T.; Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Deep Learning for Medical Anomaly Detection—A Survey. arXiv 2021, arXiv:2012.02364.
3. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 9584–9592.
4. Adam, A.; Rivlin, E.; Shimshoni, I.; Reinitz, D. Robust Real-Time Unusual Event Detection Using Multiple Fixed-Location Monitors. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 555–560.
5. Cohen, N.; Hoshen, Y. Sub-Image Anomaly Detection with Deep Pyramid Correspondences. arXiv 2021, arXiv:2005.02357.
6. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization. In Pattern Recognition. ICPR International Workshops and Challenges, Virtual, 10–15 January 2021; Del Bimbo, A., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12664, pp. 475–489.
7. Roth, K.; Pemula, L.; Zepeda, J.; Scholkopf, B.; Brox, T.; Gehler, P. Towards Total Recall in Industrial Anomaly Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 14298–14308.
8. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412.
9. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32.
10. Li, C.-L.; Sohn, K.; Yoon, J.; Pfister, T. CutPaste: Self-Supervised Learning for Anomaly Detection and Localization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 9659–9669.
11. Perlin, K. An Image Synthesizer. ACM Siggraph Comput. Graph. 1985, 19, 287–296.
12. Bergmann, P.; Löwe, S.; Fauser, M.; Sattlegger, D.; Steger, C. Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Prague, Czech Republic, 25–27 February 2019; pp. 372–380.
13. Liu, W.; Li, R.; Zheng, M.; Karanam, S.; Wu, Z.; Bhanu, B.; Radke, R.J.; Camps, O. Towards Visually Explaining Variational Autoencoders. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA, 13–19 June 2020; pp. 8639–8648.
14. Sabokrou, M.; Khalooei, M.; Fathy, M.; Adeli, E. Adversarially Learned One-Class Classifier for Novelty Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3379–3388.
15. Zavrtanik, V.; Kristan, M.; Skočaj, D. Reconstruction by Inpainting for Visual Anomaly Detection. Pattern Recognit. 2021, 112, 107706.
16. Pirnay, J.; Chai, K. Inpainting Transformer for Anomaly Detection. In Image Analysis and Processing–ICIAP 2022, Proceedings of the 21st International Conference, Lecce, Italy, 23–27 May 2022; Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2022; Volume 13232, pp. 394–406.
17. Yang, Q.; Guo, R. An Unsupervised Method for Industrial Image Anomaly Detection with Vision Transformer-Based Autoencoder. Sensors 2024, 24, 2440.
18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
19. Dai, Y.; Zhang, L.; Fan, F.-Y.; Wu, Y.-J.; Zhao, Z.-K. SCGAN: Extract Features From Normal Semantics for Unsupervised Anomaly Detection. IEEE Access 2023, 11, 137957–137968.
20. Fan, F.-Y.; Zhang, L.; Dai, Y. FEGAN: A Feature Extraction Based Approach for GAN Anomaly Detection and Localization. IEEE Access 2024, 12, 76154–76168.
21. Zavrtanik, V.; Kristan, M.; Skočaj, D. DSR–A Dual Subspace Re-Projection Network for Surface Anomaly Detection. In Computer Vision–ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2022; Volume 13691, pp. 539–554.
22. Zavrtanik, V.; Kristan, M.; Skočaj, D. DRÆM–A Discriminatively Trained Reconstruction Embedding for Surface Anomaly Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Montreal, QC, Canada, 11–17 October 2021; pp. 8310–8319.
23. Ristea, N.-C.; Madan, N.; Ionescu, R.T.; Nasrollahi, K.; Khan, F.S.; Moeslund, T.B.; Shah, M. Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 13566–13576.
24. Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows. arXiv 2021, arXiv:2111.07677.
25. Gudovskiy, D.; Ishizaka, S.; Kozuka, K. CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 1819–1828.
26. Sohn, K.; Li, C.-L.; Yoon, J.; Jin, M.; Pfister, T. Learning and Evaluating Representations for Deep One-Class Classification. arXiv 2021, arXiv:2011.02578.
27. Bergman, L.; Cohen, N.; Hoshen, Y. Deep Nearest Neighbor Anomaly Detection. arXiv 2020, arXiv:2002.10445.
28. Rippel, O.; Mertens, P.; Konig, E.; Merhof, D. Gaussian Anomaly Detection by Modeling the Distribution of Normal Data in Pretrained Deep Features. IEEE Trans. Instrum. Meas. 2021, 70, 1–13.
29. Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471.
30. Tax, D.M.J.; Duin, R.P.W. Support Vector Data Description. Mach. Learn. 2004, 54, 45–66.
31. Yi, J.; Yoon, S. Patch SVDD: Patch-Level SVDD for Anomaly Detection and Segmentation. In Proceedings of the 15th Asian Conference on Computer Vision–ACCV 2020, Virtual, 30 November–4 December 2020; Ishikawa, H., Liu, C.-L., Pajdla, T., Shi, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12627, pp. 375–390.
32. Ahn, J.-Y.; Kim, G. Application of Optimal Clustering and Metric Learning to Patch-Based Anomaly Detection. Pattern Recognit. Lett. 2022, 154, 110–115.
33. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. arXiv 2018, arXiv:1803.07728.
34. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
35. Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9907, pp. 649–666.
36. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6022–6031.
37. Olsson, V.; Tranheden, W.; Pinto, J.; Svensson, L. ClassMix: Segmentation-Based Data Augmentation for Semi-Supervised Learning. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 1368–1377.
38. Hyun, J.; Kim, S.; Jeon, G.; Kim, S.H.; Bae, K.; Kang, B.J. ReConPatch: Contrastive Patch Representation Learning for Industrial Anomaly Detection. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 2041–2050.
39. Zuo, Z.; Wu, Z.; Chen, B.; Zhong, X. A Reconstruction-Based Feature Adaptation for Anomaly Detection with Self-Supervised Multi-Scale Aggregation. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14 April 2024; pp. 5840–5844.
40. Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. SimpleNet: A Simple Network for Image Anomaly Detection and Localization. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 20402–20411.
41. Hua, L.; Qi, Q.; Long, J. P2 Random Walk: Self-Supervised Anomaly Detection with Pixel-Point Random Walk. Complex Intell. Syst. 2024, 10, 2541–2555.
42. Yang, M.; Wu, P.; Feng, H. MemSeg: A Semi-Supervised Method for Image Surface Defect Detection Using Differences and Commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835.
43. Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing Textures in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 3606–3613.
44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
45. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717.
46. Mishra, P.; Verk, R.; Fornasier, D.; Piciarelli, C.; Foresti, G.L. VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20 June 2021; pp. 1–6.
47. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983.
48. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
49. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19.
50. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.

Figure 1. The model framework diagram of the proposed method consists of two main parts as follows: the upper half is for the proxy task of self-supervised training, and the lower half is for anomaly detection. The self-supervised training section comprises a feature extractor, a multi-scale feature aggregation module, an encoder, and a classifier. The classifier’s outputs, denoted as 0, 1, and 2, respectively, represent normal, texture class anomalies, and structural anomalies, as defined in Equation (4). The anomaly detection section includes a feature extractor and a multi-scale feature aggregation module with the same parameters as those in the upper section, as well as a multivariate Gaussian modeling component.

Figure 2. Anomaly generation strategy. It mainly consists of two parts: (a) Generation of anomaly mask by threshold binarization (Equation (1)). (b) Generation of synthetic anomaly image by anomaly mask, anomaly source image, and input image (Equations (2) and (3)).

Figure 3. The Pipeline of the Multi-scale Feature Aggregation Module.

Figure 4. Anomaly visualization of our method on the MVTec dataset. The figure shows the anomaly detection results for a number of categories, for each of which from left to right are test images, ground truth, and predicted heat maps.

Figure 5. Anomaly visualization of our method on the BTAD dataset. BTAD has a total of three categories, 01, 02, and 03, corresponding to the first to third rows. From left to right, they represent test images, ground truth, and predicted heat maps, respectively.

Figure 6. Ablation results for different feature layers and visualization comparison with PaDiM.

Figure 7. Pseudo-anomaly images generated using different anomaly simulation strategies. From left to right: (a) normal sample; (b) cut-paste with rectangular blocks; (c) cut-paste with small patches (scar); (d) our method without foreground mask; (e) our method using foreground mask.

Figure 8. Diversity of normal samples. From top to bottom, they are (a) Pill, (b) Capsule, and (c) Screw.

Table 1. The structure of encoder and classifier.

Structure	Input	Output	Channel
Conv_block1	56 × 56	56 × 56	128
Avgpool1	56 × 56	28 × 28	128
Conv_block2	28 × 28	28 × 28	256
Avgpool2	28 × 28	14 × 14	256
Conv_block3	14 × 14	14 × 14	512
Avgpool3	14 × 14	7 × 7	512
AdaptiveAvgPool	7 × 7	1 × 1	512
MLP	[512, 512, 128]
FC	3

Table 2. Comparison of image-level AUC% with different methods on MVTec.

Type	Category	SPADE [5]	RIAD [15]	P-SVDD [31]	P-SVDD-C [32]	PaDiM [6]	CutPaste [10]	SCGAN [19]	Ours
Texture	Carpet	92.8	84.2	92.9	94.4	99.9	93.1	97.0	98.9
Texture	Grid	47.3	99.6	94.6	95.6	95.7	99.9	96.3	96.9
Texture	Leather	61.5	100.0	90.9	96.1	100.0	100.0	94.7	100.0
Texture	Tile	96.5	98.7	97.8	93.5	97.4	93.4	97.4	97.3
Texture	Wood	95.8	93.0	96.5	98.0	98.8	98.6	100.0	97.8
Texture	Average	85.6	95.1	94.5	95.5	98.4	97.0	97.1	98.2
Object	Bottle	97.2	99.9	98.6	99.5	99.8	98.3	98.3	99.9
Object	Cable	84.8	81.9	90.3	97.8	92.2	80.6	98.2	94.9
Object	Capsule	89.7	88.4	76.7	88.7	91.5	96.2	83.2	90.3
Object	Hazelnut	88.1	83.3	92.0	97.9	93.3	97.3	97.5	94.2
Object	Metal nut	71.0	88.5	94.0	96.5	99.2	99.3	90.1	99.1
Object	Pill	80.1	83.8	86.1	91.9	94.4	92.4	89.4	89.7
Object	Screw	66.7	84.5	81.3	83.3	84.4	86.3	100.0	87.5
Object	Toothbrush	88.9	100.0	100.0	95.6	97.2	98.3	100.0	94.7
Object	Transistor	90.3	90.9	91.5	92.1	97.8	95.5	91.3	95.7
Object	Zipper	96.6	98.1	97.9	95.9	90.9	99.4	92.4	98.2
Object	Average	85.3	89.9	90.8	93.9	94.1	94.3	94.0	93.9
All	Average	85.4	91.7	92.1	94.4	95.5	95.2	95.1	95.7

Table 3. Comparison of pixel-level AUC% with different methods on MVTec.

Type	Category	SPADE [5]	RIAD [15]	P-SVDD [31]	P-SVDD-C [32]	PaDiM [6]	CutPaste [10]	Ours
Texture	Carpet	97.5	96.3	92.6	92.9	98.8	98.3	98.9
Texture	Grid	93.7	98.8	96.2	97.2	93.6	97.5	94.0
Texture	Leather	97.6	99.4	97.4	98.2	99.0	99.5	99.2
Texture	Tile	87.4	89.1	91.4	91.9	91.7	90.5	94.0
Texture	Wood	85.5	85.8	90.8	92.1	94.0	95.5	91.6
Texture	Average	92.3	93.9	93.7	94.5	95.3	96.3	95.5
Object	Bottle	98.4	98.4	98.1	98.6	98.1	97.6	97.3
Object	Cable	97.2	84.2	96.8	97.6	94.9	90.0	96.0
Object	Capsule	99.0	92.8	95.8	96.3	98.2	97.4	97.5
Object	Hazelnut	99.1	96.1	97.5	98.2	97.9	97.3	96.6
Object	Metal nut	98.1	92.5	98.0	98.1	96.7	93.1	94.1
Object	Pill	96.5	95.7	95.1	92.4	94.6	95.7	95.7
Object	Screw	98.9	98.8	95.7	95.3	97.2	96.7	96.7
Object	Toothbrush	97.9	98.9	98.1	96.0	98.6	98.1	97.2
Object	Transistor	94.1	87.7	97.0	93.5	96.8	93.0	96.4
Object	Zipper	96.5	97.8	95.1	96.0	97.6	99.3	97.9
Object	Average	97.6	94.3	96.7	96.2	97.1	95.8	96.5
All	Average	96.0	94.2	95.7	95.6	96.5	96.0	96.2

Table 4. Comparison of (image-level, pixel-level) AUC% on BTAD with different methods.

Category	SPADE [5]	P-SVDD [31]	PatchCore [7]	VT-ADL [46]	Ours
01	(91.4, 97.3)	(95.7, 91.6)	(90.9, 95.5)	(-, 99.0)	(97.3, 97.0)
02	(71.4, 94.4)	(72.1, 93.6)	(79.3, 94.7)	(-, 94.0)	(84.1, 96.6)
03	(99.9, 99.1)	(82.1, 91.0)	(99.8, 99.3)	(-, 77.0)	(98.7, 99.2)
Average	(87.6, 96.9)	(83.3, 92.1)	(90.0, 96.5)	(-, 90.0)	(93.4, 97.6)
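
The Average row above appears to be the arithmetic mean of the three per-category values, rounded to one decimal place. The following sketch checks this for a few columns; the dictionary layout and helper names are illustrative assumptions, and the numbers are copied from the table.

```python
# (image-level, pixel-level) AUROC per BTAD category, copied from Table 4.
# None marks an entry the table reports as "-".
btad = {
    "P-SVDD": [(95.7, 91.6), (72.1, 93.6), (82.1, 91.0)],
    "PatchCore": [(90.9, 95.5), (79.3, 94.7), (99.8, 99.3)],
    "VT-ADL": [(None, 99.0), (None, 94.0), (None, 77.0)],
    "Ours": [(97.3, 97.0), (84.1, 96.6), (98.7, 99.2)],
}


def column_mean(values):
    """Mean of the reported values, rounded to one decimal; None if absent."""
    present = [v for v in values if v is not None]
    return round(sum(present) / len(present), 1) if present else None


def average_row(pairs):
    """Average the image-level and pixel-level columns separately."""
    image = column_mean([p[0] for p in pairs])
    pixel = column_mean([p[1] for p in pairs])
    return image, pixel


averages = {method: average_row(pairs) for method, pairs in btad.items()}
for method, (image, pixel) in averages.items():
    print(method, image, pixel)
```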

Table 5. The impact of using different feature layers on anomaly detection performance on MVTec AD.

Layer 1	Layer 2	Layer 3	I-AUROC	P-AUROC
√			89.1	94.5
	√		95.4	94.9
		√	95.7	94.2
√	√		93.2	95.3
√		√	95.4	95.8
	√	√	95.6	95.6
√	√	√	95.7	96.2

Table 6. The impact of different attention mechanisms on anomaly detection performance on MVTec AD.

Category	w/o Atten	+SE	+CBAM	+CA
Texture	(97.8, 95.1)	(97.3, 94.6)	(96.5, 94.7)	(98.2, 95.5)
Object	(93.9, 96.0)	(93.2, 96.2)	(93.4, 96.3)	(93.9, 96.5)
Average	(95.2, 95.7)	(94.6, 95.7)	(94.4, 95.7)	(95.7, 96.2)

Table 7. The impact of different feature dimensions on anomaly detection performance on MVTec AD.

Component	256d	200d	160d	100d
+SE	(95.4, 96.2)	(94.4, 95.9)	(95.2, 96.0)	(94.6, 95.7)
+CBAM	(95.2, 96.4)	(94.9, 96.3)	(94.6, 96.3)	(94.4, 95.7)
+CA	(95.1, 96.4)	(95.5, 96.6)	(94.9, 96.2)	(95.7, 96.2)

Table 8. The impact of different anomaly simulation strategies on anomaly detection performance on MVTec AD.

Category	Rect	Scar	Rect & Scar	2-Way	w/o Mask	Mask
Texture	(97.8, 95.3)	(98.4, 95.5)	(98.3, 95.6)	(97.7, 94.9)	(98.7, 96.7)	(98.2, 95.5)
Object	(91.4, 95.7)	(91.5, 95.4)	(92.7, 95.6)	(92.9, 95.9)	(92.4, 96.3)	(93.9, 96.5)
Average	(93.5, 95.6)	(93.8, 95.5)	(94.6, 95.6)	(94.5, 95.6)	(94.5, 96.4)	(95.7, 96.2)

Table 9. Comparison of our method with state-of-the-art methods (PaDiM and CutPaste) on MVTec.

Models	I-AUROC	P-AUROC	Memory Bank (GB)	Parameters (M)
PaDiM (R18-RD100)	90.5	96.5	0.12	—
PaDiM (R18-Full)	93.9	97.1	2.3	—
PaDiM (WR50-RD550)	95.5	97.3	3.5	—
CutPaste	95.2	96.0	—	11.8
Ours	95.7	96.2	0.12	3.8


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, B.; Wang, X. Industrial Image Anomaly Detection via Self-Supervised Learning with Feature Enhancement Assistance. Appl. Sci. 2024, 14, 7301. https://doi.org/10.3390/app14167301

