The upper section represents the registration module, while the lower section represents the memory bank module. Both sections operate concurrently. Two images from samples belonging to the same category are randomly selected, and their features are extracted using the convolutional neural network and spatial transformer network (CNN + STN) for registration. The feature registration process is supervised by minimizing a negative cosine similarity loss. After registration, the extracted fusion features are stored in a dedicated memory bank and continuously updated with the training losses after each registration iteration.
3.1. Registration Module
Neural network training necessitates task-driven learning. Therefore, the essence of self-supervised learning lies in the thoughtful design of tasks that facilitate effective model learning. Inspired by the work of [
25], which obtained a Gaussian distribution model of normal data through feature-level registration training, we leverage the registration task as a pretext task to enhance the model’s understanding of features and emphasize spatial and positional differences. Accordingly, we construct the registration module, consisting of a feature extractor, feature encoder, and predictor, as illustrated in
Figure 2.
In feature registration, since spatial transformation can be represented as matrix operations, it is advantageous to allow the network to learn the generation of matrix parameters, thereby acquiring spatial transformation capabilities. Common network frameworks in deep learning include the CNN and the transformer. Additionally, the spatial transformer network, which plays a pivotal role in our research, can seamlessly integrate into any component of the CNN architecture. To ensure a fair comparison with state-of-the-art methods, we have selected the wide_resnet_50_2 [
26] network as the backbone for our experiments among the various CNNs commonly employed in anomaly detection tasks. This network has demonstrated exceptional performance on the ImageNet dataset, achieving a Top-1 accuracy of 78.51% and a Top-5 accuracy of 94.09%. For the specific task addressed in this paper, we conducted an ablation experiment (Section 4.5.1) to compare the feature extraction capabilities of ResNet and ViT models, where wide_resnet_50_2 exhibited superior performance.
We incorporated the
STN module into the wide_resnet_50_2 architecture. The overall structure of the
STN module is illustrated in
Figure 2 and consists of a localization network, a grid generator, and a sampler. In the first component, a feature map is given as Formula (1):
After several convolutional or fully connected layers, a regression layer follows, leading to the output of the regression transformation parameter θ. The dimension of θ depends on the specific transformation type chosen by the network.
In the second component, the grid generator uses θ and the transformation mode specified by the localization network output to perform further spatial transformations of the feature map, determining the mapping T(θ) between the output and input features. It employs the predicted transformation parameters to create a sampling grid, which represents the set of points at which the input map should be sampled to produce the transformed output. The sampler then uses the sampling grid to determine which points in the input feature map will be used for the transformation: it samples the input feature map at the sampling grid to obtain the final output.
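The grid generator and sampler described above can be sketched in NumPy. This is a simplified illustration rather than the paper's implementation: it assumes a 2 × 3 affine parameter matrix θ and uses nearest-neighbour sampling, whereas a real STN predicts θ with the localization network and samples bilinearly so the module stays differentiable.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Grid generator: map each output pixel to normalized input coordinates
    in [-1, 1] using the 2x3 affine parameters theta."""
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, out_h),
        np.linspace(-1.0, 1.0, out_w),
        indexing="ij",
    )
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return coords @ theta.T                                  # (H, W, 2)

def sample(feature_map, grid):
    """Sampler: nearest-neighbour sampling of the input feature map at the
    grid locations (the actual STN uses bilinear sampling)."""
    h, w = feature_map.shape
    xs = np.clip(np.round((grid[..., 0] + 1) * (w - 1) / 2), 0, w - 1).astype(int)
    ys = np.clip(np.round((grid[..., 1] + 1) * (h - 1) / 2), 0, h - 1).astype(int)
    return feature_map[ys, xs]

# Identity parameters leave the feature map unchanged.
theta_id = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
fmap = np.arange(16, dtype=float).reshape(4, 4)
out = sample(fmap, affine_grid(theta_id, 4, 4))
```

Rotation, translation, and scaling are all expressed by choosing different entries of θ, which is exactly what the localization network learns to regress.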
In Step 1, assuming that the input RGB image has a resolution of (224,224), Formula (2) is applied.
The fourth layer of the wide_resnet_50_2 network is excluded to preserve more comprehensive spatial information, and the spatial transformer network (STN) is integrated after the initial three layers of the network. The input feature map U undergoes a transformation function T(θ). The mapping relationship between the input and output feature maps is defined as Formula (3).
In this case, the form of the feature vector is shown in Formula (4):
When no key points are labeled, the STN allows the neural network to actively transform the feature map based on input features and learn spatial transformation parameters without requiring additional training supervision or modifications to the optimization process. As illustrated in
Figure 2, the STN can effectively align input images or learned features during training, thereby mitigating the impact of spatial geometric transformations such as rotation, translation, scale, and distortion on tasks like classification and localization. The STN facilitates the spatial transformation of the input data, thereby enhancing feature classification and enabling the network to achieve rotational invariance dynamically. Moreover, it intelligently selects the most salient region of the image and optimally transforms it into a suitable orientation.
Figure 2 illustrates the depiction of feeding an inverted screw image into the STN module. Through a series of transformations, the input is effectively rectified to face forward.
As shown in Figure 2, we employ a Siamese network for feature encoding, applying a negative cosine similarity loss as per Formulas (5) and (6).
The negative cosine similarity loss is an appropriate metric for quantifying the similarity between two vectors. Furthermore, it possesses the capability to map similar vectors to adjacent points and dissimilar vectors to distant points. This characteristic facilitates feature clustering, as described in
Section 3.2.3.
The objective is to maximize the similarity between p1 and z2, as well as between p2 and z1, where p and z denote the outputs of the predictor and the feature encoder for the two branches, respectively. To prevent the input data from converging to a constant value after convolution activation, resulting in identical outputs regardless of the input image, we adopt the approach described in [27] by halting the gradient operation on one of the branches to avoid model collapse. Finally, we define Formula (7) as the registration loss for symmetric features.
The STN is employed in this stage to perform feature rotation and inversion, facilitating the model’s determination of image similarity. Following each training iteration, a negative cosine similarity loss is obtained.
3.2. Memory Bank Acquisition
3.2.1. Feature Extraction
Let the feature map φ of the second layer have height h, width w, and c channels. Like PatchCore, the patch-level features of local features in clustered neighborhoods can be represented as Formula (8):
Here, f_agg represents the aggregation function within the neighborhood. As shown in
Figure 1, we use a combination of the first three layers of wide_resnet_50_2 and
STN to extract features and build the memory library, with each layer followed by the
STN. Inspired by the PatchCore method, we likewise did not adopt the last layer of the ResNet network, since it loses much of the features' spatial information. As shown in
Figure 3, the features of the second and third layers of the
STN can retain global information while containing more local feature information. However, if the features of the first three or more layers are fused, the features stored in the memory bank will not contain enough information for accurate detection.
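The neighborhood aggregation of Formula (8) can be illustrated with a short sketch. Here the aggregation function f_agg is assumed to be plain average pooling over a p × p neighbourhood with stride 1, and the zero padding at the borders is a simplification of whatever boundary handling the actual implementation uses.

```python
import numpy as np

def patch_features(fmap, p=3):
    """Aggregate each spatial position's p x p neighbourhood by average
    pooling (stride 1, zero padding), giving one locally aware patch
    feature per position, PatchCore-style."""
    c, h, w = fmap.shape
    pad = p // 2
    padded = np.pad(fmap, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(fmap)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + p, j:j + p].mean(axis=(1, 2))
    return out

fmap = np.ones((1, 5, 5))          # toy single-channel feature map
patches = patch_features(fmap)      # same spatial resolution, smoothed
```

Each output position now summarizes its neighbourhood rather than a single pixel, which is what gives the patch features their enlarged effective receptive field.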
3.2.2. Similar to Pyramid Pooling Module (SPPM)
After feature extraction, a three-dimensional tensor with the shape (C, H, W) is obtained, where C is the sum of the channel dimensions of the second and third layers. We then flatten this three-dimensional feature tensor along all dimensions except the channels. This results in a two-dimensional tensor with a shape of (C, H × W), which is randomly projected. Thus, we flatten the original feature tensor into a feature matrix whose columns correspond to spatial positions. However, as mentioned earlier, the PatchCore method employs a pixel-by-pixel search approach, conducting nearest neighbor searches on each pixel and disregarding the relationships between pixels. We recognize that 2D average pooling can increase the receptive field, which is crucial for anomaly detection tasks. Hence, we employ the pooling approach illustrated in
Figure 4.
Assuming an input feature size of 64 × 64, we perform three pooling operations with pooling kernel sizes of 3 × 3, 4 × 4, and 5 × 5, obtaining tensors of sizes 64 × 64, 32 × 32, and 16 × 16, respectively. Next, the three pooled tensors are upsampled to the same size of 64 × 64, and their dimensions are concatenated. With this approach, earlier pooling regions are included within subsequent pooling regions, enlarging the receptive field; the overlap between pooling regions also gives more attention to edge information, thus establishing closer relationships between the pixels.
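This overlapping pooling scheme can be sketched as follows. The stride/padding pairs (1/1, 2/1, 4/2) are assumptions chosen so that the kernel sizes 3, 4, and 5 reproduce the stated output sizes 64, 32, and 16 (the text specifies only the kernel and output sizes), and nearest-neighbour repetition stands in for the upsampling step.

```python
import numpy as np

def avg_pool2d(x, k, s, p):
    """Plain 2-D average pooling with kernel k, stride s, padding p."""
    x = np.pad(x, p)
    h = (x.shape[0] - k) // s + 1
    w = (x.shape[1] - k) // s + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].mean()
    return out

def sppm(x):
    """Three overlapping poolings (3x3 -> 64, 4x4 -> 32, 5x5 -> 16),
    upsampled back to the input size and concatenated channel-wise."""
    h, w = x.shape
    branches = []
    for k, s, p in [(3, 1, 1), (4, 2, 1), (5, 4, 2)]:
        pooled = avg_pool2d(x, k, s, p)
        factor = h // pooled.shape[0]
        branches.append(np.repeat(np.repeat(pooled, factor, 0), factor, 1))
    return np.stack(branches)

feat = np.random.rand(64, 64)
fused = sppm(feat)                 # three pooled views stacked as channels
```

Because every 3 × 3 window lies inside some 4 × 4 and 5 × 5 window, each output position mixes three receptive-field scales, tying neighbouring pixels together before the memory bank is built.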
After the pooling step, the neighborhood feature vector, as represented by Formula (9), is obtained.
The feature aggregation operation is utilized to obtain the locally aware patch feature set P of the feature map tensor φ, enabling the clustering of the feature tensor, as shown in Formula (10).
In this case, the feature memory bank M can be described as Formula (11):
3.2.3. Anomaly Detection
The samples used in self-supervised learning are exclusively normal. The training process aims to identify representative features of the "normal category" and utilize them as a reference for distinguishing positive from negative samples, ultimately achieving anomaly detection. During testing, we index the memory bank that stores the characteristic information of positive samples and calculate the Euclidean distance between sample patches to obtain an anomaly score. In n-dimensional space, if there are two points x = (x1, x2, …, xn) and y = (y1, y2, …, yn), then the Euclidean distance d(x, y) is defined as Formula (12).
After completing the feature aggregation and pooling module in Step 2, we proceed to the feature clustering operation. The negative cosine similarity loss used during training greatly facilitates feature clustering by ensuring that similar vectors are consistently mapped to proximate locations with each iteration, while dissimilar vectors are assigned to distant points. This effectively reduces the clustering time and enhances the operational efficiency of the model. To optimize the memory bank size and improve testing efficiency, we follow the PatchCore method and employ greedy subsampling to reduce and optimize the memory bank based on Formula (13).
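The greedy subsampling can be sketched as a k-center greedy selection, which is the strategy PatchCore uses: each step keeps the feature farthest from everything already selected, so the reduced bank stays well spread over the normal-feature distribution. The random data and the choice of the first centre here are purely illustrative.

```python
import numpy as np

def greedy_coreset(features, m):
    """k-center greedy subsampling: repeatedly add the feature with the
    largest distance to the current coreset until m features are kept."""
    selected = [0]                                        # arbitrary first centre
    d = np.linalg.norm(features - features[0], axis=1)    # dist to coreset
    while len(selected) < m:
        idx = int(np.argmax(d))                           # farthest remaining point
        selected.append(idx)
        d = np.minimum(d, np.linalg.norm(features - features[idx], axis=1))
    return features[selected]

rng = np.random.default_rng(0)
bank = rng.normal(size=(200, 8))       # toy memory bank of patch features
coreset = greedy_coreset(bank, 20)     # reduced bank, 10% of the original
```

The max-min criterion trades a small loss of coverage for a large reduction in nearest-neighbour search cost at test time.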
Our objective is to streamline the testing process by conducting nearest neighbor searches for the test sample's features only in the subsampled memory bank, identifying the closest neighboring feature and subsequently calculating the maximum Euclidean distance from this feature to its clustering center. This approach allows us to obtain an anomaly score, facilitating effective anomaly detection.
In the memory bank, for each patch-level feature of the training data, we select the k nearest neighbors from the patch-level features of the test data. The patch-level anomaly score s of the test image is estimated based on the distance between the test patch-level feature and its nearest neighbor in the memory bank, as shown in Formulas (14) and (15).
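A simplified version of this score computation can be sketched as follows; the exact form of Formulas (14) and (15) involves the k nearest neighbors, while this sketch keeps only the single nearest-neighbour distance per patch and the image-level maximum.

```python
import numpy as np

def anomaly_score(test_patches, memory_bank):
    """For every test patch feature, take the Euclidean distance to its
    nearest memory-bank feature; the image-level score is the maximum
    of these patch-level scores."""
    # pairwise Euclidean distances, shape (num_patches, bank_size)
    d = np.linalg.norm(test_patches[:, None, :] - memory_bank[None, :, :], axis=2)
    patch_scores = d.min(axis=1)       # distance to nearest normal feature
    return patch_scores.max(), patch_scores

bank = np.array([[0.0, 0.0], [1.0, 1.0]])      # toy memory bank
patches = np.array([[0.0, 0.1], [5.0, 5.0]])   # one normal-ish, one anomalous patch
score, per_patch = anomaly_score(patches, bank)
```

Reshaping `per_patch` back to the spatial grid yields the score map used for anomaly localization in the next subsection.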
3.2.4. Post-Processing Method
In anomaly detection, the ground truth serves as a baseline measurement obtained from a reliable method and is used to calibrate and improve the accuracy of new measurement methods. Many current anomaly detection methods rely on the calculated AUROC curve to determine a threshold for the entire dataset, setting it according to the data distribution and other characteristics. Regions above the threshold are considered abnormal, while those below are considered normal. However, if the overall score of an abnormal image is lower than the dataset threshold, the anomaly cannot be detected; conversely, false detections may occur. To address this, we propose a threshold determination method based on the image itself, which identifies the critical point between normal and anomalous regions, enabling the accurate labelling and localization of anomalies.
For each test sample, as shown in
Figure 5a, we artificially construct five rectangular sampling frames in its upper left, lower left, upper right, lower right, and right center regions, as shown in
Figure 5b.
The constructed image is also fed into the network for feature extraction, and then we search for distances in the feature database, compute scores, and obtain the score matrix. Its heat map visualization is shown in
Figure 5c. Obviously, as artificially constructed anomalies, these five areas will have significantly higher anomaly scores than the rest of the picture. We set the pixel values of these five areas to 0 and the other parts to 1, thus obtaining its ground truth, as depicted in
Figure 5d. Then, we traverse the five regions, find the maximum and minimum score values, and define the iteration interval as outlined in Formula (16), setting iter_times to 10 to ensure the code runs efficiently:
For each iteration, we set the threshold as outlined in Formula (17):
The iteration is carried out in the range of minimum and maximum thresholds, and each th can output an anomaly mask accordingly, where the white area is abnormal and the black area is normal, as shown in
Figure 6a–e, which are the masks corresponding to the thresholds 9.0, 8.3, 7.6, 6.4 and 6.1, respectively. As can be seen from the five pictures shown in
Figure 6, as the threshold decreases, the area of the abnormal region in the corresponding mask image gradually increases. Finally, we calculate the intersection over union (IOU) between the prediction box and the ground truth box. The threshold is updated according to the highest IOU, and the best threshold (
best_th) is obtained after 10 iterations. The mask graphs corresponding to different thresholds are presented in
Figure 6a–e, serving the purpose of facilitating the description of this method’s principle. It is important to note that these graphs do not represent the final segmentation results but rather serve as temporary variables within the code and are subsequently deleted after calculation to optimize memory usage.
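The threshold search can be sketched as below. As a simplification of the procedure above, the score range is taken directly from the score map rather than from the five sampling frames, and the IOU is computed on full binary masks rather than boxes.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def best_threshold(score_map, gt_mask, iter_times=10):
    """Sweep iter_times thresholds between the minimum and maximum scores
    and keep the one whose anomaly mask best matches the ground truth."""
    lo, hi = score_map.min(), score_map.max()
    best_th, best_iou = lo, -1.0
    for th in np.linspace(lo, hi, iter_times):
        mask = score_map > th          # white (True) = abnormal
        value = iou(mask, gt_mask)
        if value > best_iou:
            best_th, best_iou = th, value
    return best_th, best_iou

scores = np.array([[9.0, 1.0], [1.0, 1.0]])        # toy score map
gt = np.array([[True, False], [False, False]])     # constructed ground truth
th, quality = best_threshold(scores, gt)
```

Because the ground truth is constructed from the artificial sampling frames, this search needs no manual annotation of the test image.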
After determining the optimal threshold value (best_th), based on industrial detection expertise, we establish a minimum abnormal-region area (S = 25 pixels) and devise a filtering algorithm: for each connected region of pixels exceeding best_th, if its area is smaller than S, the pixel values in that region are set to 0 (normal); otherwise, they are set to 1 (abnormal). This setting effectively eliminates false positives in small areas, reducing the overall false alarm rate observed in the test results.
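A minimal sketch of this small-region suppression, assuming 4-connectivity for the connected regions (the connectivity is not specified in the text):

```python
import numpy as np
from collections import deque

def suppress_small_regions(mask, s=25):
    """Remove 4-connected abnormal regions whose area is below s pixels,
    keeping only regions large enough to be credible defects."""
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                # flood fill to collect one connected abnormal region
                region, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(region) >= s:       # keep only regions of area >= S
                    for y, x in region:
                        out[y, x] = True
    return out

m = np.zeros((10, 10), dtype=bool)
m[0:6, 0:6] = True       # 36-pixel region: large enough, kept
m[9, 9] = True           # isolated pixel: suppressed as a false positive
cleaned = suppress_small_regions(m, s=25)
```

In practice the same effect can be obtained with a connected-component labelling routine such as `scipy.ndimage.label` followed by an area filter.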