To verify the effectiveness of the proposed method, we evaluate it on the MVTec AD and BTAD datasets. Additionally, an ablation study is conducted on the MVTec AD dataset to examine the impact of individual blocks and parameters on the results.
4.2. Evaluation Protocol
Two evaluations are performed: image-level evaluation and pixel-level evaluation. Both use the area under the receiver operating characteristic curve as the evaluation metric, called ROC-AUC. The ROC curve is a graph that shows the performance of a classification model across all thresholds; its two parameters are the true positive rate (TPR) and the false positive rate (FPR), defined in (2) and (3).
The AUC is the area under the ROC curve in the square between (0,0) and (1,1). The network is trained separately for each object class, and the ROC-AUC score is reported per class.
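As a concrete illustration, the threshold sweep behind the ROC-AUC metric can be sketched as follows (a minimal NumPy version; library implementations such as scikit-learn's `roc_curve` handle ties and edge cases more carefully):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC via a threshold sweep over anomaly scores.

    scores: anomaly scores (higher = more anomalous).
    labels: ground truth (1 = anomalous, 0 = normal).
    """
    order = np.argsort(-np.asarray(scores))           # descending by score
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()            # TPR, Eq. (2)
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # FPR, Eq. (3)
    tpr = np.concatenate(([0.0], tpr))                # start curve at (0, 0)
    fpr = np.concatenate(([0.0], fpr))
    # trapezoidal integration of the ROC curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# perfectly separated scores give the maximum AUC of 1.0
print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 1.0
```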
For a more accurate evaluation of anomaly localization performance, the per-region-overlap (PRO) metric is also used. Unlike pixel-by-pixel measurements, the PRO score weights anomalous regions of any size equally.
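A minimal sketch of the PRO computation, assuming the ground-truth connected components have already been extracted (e.g., with `scipy.ndimage.label`) and are passed in as separate binary masks:

```python
import numpy as np

def pro_score(pred_mask, gt_regions):
    """Per-region overlap: for each ground-truth anomalous region,
    compute the fraction covered by the binary prediction, then
    average -- so small and large regions count equally.

    pred_mask:  H x W binary array (1 = predicted anomalous).
    gt_regions: list of H x W binary masks, one per connected region.
    """
    overlaps = [(pred_mask & region).sum() / region.sum()
                for region in gt_regions]
    return float(np.mean(overlaps))

# a 2-pixel region and an 8-pixel region, each half covered -> PRO = 0.5
small = np.zeros((4, 4), dtype=int); small[0, :2] = 1
large = np.zeros((4, 4), dtype=int); large[2:, :] = 1
pred = np.zeros((4, 4), dtype=int);  pred[0, 0] = 1; pred[2, :] = 1
print(pro_score(pred, [small, large]))  # -> 0.5
```

A plain pixel-level score would let the 8-pixel region dominate; PRO gives both regions the same vote.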
4.4. Results
For comparison with other work, we use the results reported in the original papers of the compared methods. In image-level anomaly detection, we compare our model with five kinds of existing models: image restoration models, embedding-based models, abnormal sample synthesis models, flow-based models, and knowledge distillation models. The image restoration models contain AnoGAN [44], UniAD [45] and RIAD [46]. The embedding-based models consist of Psvdd [47] and PaDiM [35]. The abnormal sample synthesis models contain Cutpaste [6], NSA [7] and Draem [38]. The knowledge distillation models involve US [14] and ADRD [15]. CFLOW-AD [48] is a flow-based model. Our baseline is the ADRD model.
Table 1 shows the results of our image-level anomaly detection. From Table 1, it can be seen that our model outperforms the other models in terms of the overall average, achieving 99.43. In addition, the results of our method are also higher than those of the other models for both the texture and object classes, reaching 99.68 and 99.18, respectively. This shows that knowledge distillation models are valuable for image anomaly detection.
In addition, our model clearly gains a further improvement over the ADRD baseline and outperforms the other models, demonstrating that the two teacher networks, the attention mechanism, and the inconsistent teacher–student network module all have a positive effect. US is similar to our method but does not use the inverse teacher–student architecture; we believe that without the inverse structure, the student network may not be good at identifying anomalous regions. Other methods achieve better results on a few categories, such as Grid, Capsule, and Pill; however, our results are almost the same as theirs.
For pixel-level anomaly detection, the comparison baselines include US [14], SPADE [32], PaDiM [35], Cutpaste [6], Draem [38], ADRD [15], NSA [7], CFLOW-AD [48] and UniAD [45]. Compared with the image-level comparison, some models are omitted from this table because their papers do not provide pixel-level results.
From Table 2, it can be seen that our method outperforms the other methods in pixel-level anomaly detection with 97.87, which proves that our method is effective. Between the texture and object classes, our method achieves the best result in the object class, reaching 98.27, but not in the texture class. The Draem method segments the simulated anomaly samples after synthesizing them; this approach may yield good results for pixel-level discrimination, whereas our method directly compares the similarity between regional feature vectors.
In Table 3, we show the PRO scores of various methods on the MVTec dataset. As can be seen from the table, we achieve the best result of 94.66. We only compare against three methods, since the other works did not report PRO scores.
We also compare our model with a few others on the BTAD dataset, such as the autoencoder [50] and the VT-ADL method [49]. From Table 4, it can be seen that our method outperforms the other models in image-level anomaly detection. Since the other methods do not give pixel-level anomaly detection scores, we cannot compare them at that level. Individually, the AE + MSE + SSIM method, which is mainly based on an autoencoder, is better than ours on categories 1 and 2. Here, MSE is the mean squared error, and SSIM is the structural similarity index, which compares the luminance, structure, and contrast of images.
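The two reconstruction criteria named above can be sketched as follows (a single-window SSIM for illustration only; practical implementations such as scikit-image's `structural_similarity` average the statistic over local sliding windows):

```python
import numpy as np

def mse(x, y):
    """Mean squared reconstruction error."""
    return float(np.mean((x - y) ** 2))

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM combining luminance, contrast and structure.
    Images are assumed to be scaled to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(mse(img, img), ssim_global(img, img))  # -> 0.0 1.0
```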
Figure 3 shows the visualization results of our image anomaly detection. Red indicates regions the model considers highly anomalous, blue indicates regions with a relatively low anomaly score, and green indicates regions with a slight anomaly. The first row shows the original image, the second row the mask image, and the third row the visualization result. From Figure 3, it can be seen that the defective regions are located correctly, including cases with multiple anomalous regions. However, some areas are incorrectly localized; for example, most of the background areas of the screws and carpet show slight anomalies.
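The color coding described above amounts to normalizing the anomaly map and binning it; a small sketch with illustrative thresholds (the paper's exact colormap and cut-offs are not specified here):

```python
import numpy as np

def colorize(anomaly_map, t_low=1 / 3, t_high=2 / 3):
    """Bin a normalized anomaly map into the blue/green/red coding:
    blue = low, green = slight, red = high anomaly.
    The two thresholds are illustrative, not the paper's values."""
    rng = anomaly_map.max() - anomaly_map.min()
    a = (anomaly_map - anomaly_map.min()) / (rng + 1e-8)
    rgb = np.zeros(a.shape + (3,))
    rgb[..., 2] = a < t_low                    # blue channel
    rgb[..., 1] = (a >= t_low) & (a < t_high)  # green channel
    rgb[..., 0] = a >= t_high                  # red channel
    return rgb

amap = np.array([[0.0, 0.5], [1.0, 0.2]])
print(colorize(amap)[1, 0])  # strongest pixel is pure red: [1. 0. 0.]
```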
In Figure 4, failure cases are shown; we selected a few samples from our visualization results where errors occurred. First, the background may affect the model predictions in the object class. In the first and second columns of Figure 4, there is a little contamination in the background, marked with red boxes, and the visualization results show a high level of outliers in these areas. Second, the anomalous region is not always fully localized: in the third column of Figure 4, the display of the anomalous area is incomplete and only a part of it is shown.
In Figure 5, we show statistics of the anomaly scores. Normal samples are shown in blue and abnormal samples in yellow. As can be seen from Figure 5, the model is able to distinguish normal samples from abnormal samples well. However, some categories cannot be distinguished well and may require a better model or more training epochs.
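How well the two histograms separate can be quantified with a best-single-threshold accuracy; a hypothetical helper for this check (not part of the paper's evaluation protocol) might look like:

```python
import numpy as np

def separability(normal_scores, abnormal_scores):
    """Best accuracy achievable with a single score threshold;
    1.0 means the normal/abnormal histograms do not overlap."""
    scores = np.concatenate([normal_scores, abnormal_scores])
    labels = np.concatenate([np.zeros(len(normal_scores)),
                             np.ones(len(abnormal_scores))])
    # try every observed score as the decision threshold
    return float(max(np.mean((scores >= t) == labels) for t in scores))

print(separability(np.array([0.1, 0.2]), np.array([0.8, 0.9])))  # -> 1.0
```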
4.5. Ablation Study
Teacher networks are important for feature extraction from images. In Table 5, we verify the effect of pretrained versus non-pretrained networks on the image-level and pixel-level anomaly detection results. From Table 5, it can be seen that pretraining improves the results by about 20 points, indicating that a pretrained network extracts image features well.
In the attention mechanism module, we investigated the effect of the image feature channel division parameter r on the image-level and pixel-level anomaly detection results; the results are shown in Table 6. As can be seen from Table 6, the pixel-level results change relatively little as r varies, while the image-level results show some variation. This indicates that the division of image feature channels has some influence on the image-level anomaly detection results.
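To make the role of r concrete, here is a squeeze-and-excitation-style channel attention sketch (the paper's attention module differs in detail, and the weights below are random stand-ins for learned parameters):

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Channel attention with reduction ratio r: C channels are
    squeezed to C // r and expanded back, so r trades attention-branch
    capacity against parameter cost.
    features: C x H x W;  w1: (C // r) x C;  w2: C x (C // r)."""
    squeezed = features.mean(axis=(1, 2))        # global average pool -> C
    hidden = np.maximum(w1 @ squeezed, 0.0)      # reduce to C // r, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # expand to C, sigmoid
    return features * gate[:, None, None]        # reweight channels

C, r = 8, 4
gen = np.random.default_rng(0)
feats = gen.standard_normal((C, 6, 6))
out = channel_attention(feats,
                        gen.standard_normal((C // r, C)),
                        gen.standard_normal((C, C // r)))
print(out.shape)  # (8, 6, 6)
```

A larger r shrinks the hidden dimension C // r, making the attention branch cheaper but less expressive, which matches the sensitivity observed at the image level.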
In Table 7, Baseline denotes the original network with one teacher and one student network. New Student Network denotes replacing the original student network with a new student network designed by ourselves. Two Teachers Network means replacing the single teacher network with two teacher networks. Mscam denotes applying the attention mechanism module to the two teacher networks' feature channels separately. Iaff denotes fusing the feature channels of the two teacher networks again.
Both image-level and pixel-level anomaly detection results are improved by using different architectures for the teacher and student networks. This suggests that a student network as powerful as the teacher may overfit and reconstruct the anomalies as well.
Better feature representations can be extracted using two teacher networks. In Table 7, it can be observed that the two-teacher network achieves good results in image-level anomaly detection, while the pixel-level anomaly detection results are almost unchanged.
The attention mechanism and feature fusion module enable better extraction of useful features. As can be seen from Table 7, both the image-level and pixel-level anomaly detection results improve. The attention mechanism weights feature channels according to their importance, and the feature fusion module assigns a higher weight to the teacher network that provides more valuable features.
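As a rough illustration of the fusion idea (the actual Iaff module learns its weights; the energy-based gate below is a hypothetical stand-in), weighting two teachers' feature maps per channel could look like:

```python
import numpy as np

def fuse_teachers(feat_a, feat_b):
    """Weighted fusion of two teachers' C x H x W feature maps: the
    teacher whose channel carries more energy receives a higher weight."""
    ea = np.mean(feat_a ** 2, axis=(1, 2))  # per-channel energy, teacher A
    eb = np.mean(feat_b ** 2, axis=(1, 2))  # per-channel energy, teacher B
    alpha = ea / (ea + eb + 1e-8)           # fusion weight in [0, 1]
    return (alpha[:, None, None] * feat_a
            + (1.0 - alpha)[:, None, None] * feat_b)

a = np.ones((2, 3, 3))
b = np.zeros((2, 3, 3))
print(np.allclose(fuse_teachers(a, b), a, atol=1e-6))  # True
```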