The core innovation of this framework lies in its multi-component feature region selection method, which first divides the image into multiple regions representing the structural features of different components. In industrial visual inspection applications, these regions effectively capture the unique distribution patterns and structural variations of individual components, providing precise and fine-grained input data for subsequent processing stages.
During the training phase, the model first divides the image into multiple local regions and extracts component features for each region. Subsequently, based on the multi-component feature region selection strategy, the feature-enhancement module utilizes a self-attention mechanism to enhance the expression of key information regions in the component features. Specifically, the model first processes the input image $I$ through a pre-trained encoder to generate the feature map $F$. A greedy sampling algorithm [30] is then employed to extract and stack representative point features from key regions, followed by clustering to produce the multi-kernel feature vector $V$. Subsequently, the tensor product operation between the feature map $F$ and $V$, combined with Conditional Random Field (CRF) interpolation, yields the component confidence map $C_i$. Threshold segmentation and contour extraction are applied to obtain the component position encoding $R$, which is then mapped back to the original image to crop the region image $I_r$ and the region confidence feature $C_r$. Then, the module combines feature fusion strategies to integrate weighted features from different channels, thereby generating candidate feature images. Finally, by calculating the similarity scores between the confidence images and the candidate feature images and matching the feature map with the highest score, further feature enhancement is achieved.
In the test phase, the test image undergoes multi-component feature region selection and feature enhancement through the feature-enhancement module. An anomaly-detection method based on region, color, and histogram features is then used to calculate the similarity between the test image and normal images, and the anomaly degree is evaluated by combining K-Nearest Neighbors (KNN) with Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Through the aforementioned component feature extraction and optimization, the method can more precisely capture the anomalous features of local regions. The following sections provide a detailed introduction to the two aforementioned methods.
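As a hedged illustration of this test-phase scoring step, the following sketch combines a KNN distance to normal features with DBSCAN noise labeling. The feature arrays, the fusion weight `alpha`, and the parameters `k` and `eps` are illustrative placeholders, not the paper's exact formulation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def anomaly_score(test_feats, normal_feats, k=5, eps=0.5, alpha=0.5):
    """Score test feature vectors against a bank of normal features.

    test_feats:   (n_test, d) region/color/histogram features of the test image
    normal_feats: (n_normal, d) features collected from normal images
    """
    # KNN term: mean distance to the k nearest normal features.
    nn = NearestNeighbors(n_neighbors=k).fit(normal_feats)
    dists, _ = nn.kneighbors(test_feats)
    knn_score = dists.mean(axis=1)

    # DBSCAN term: cluster normal and test features together; test points
    # labeled as noise (-1) are treated as density outliers.
    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(
        np.vstack([normal_feats, test_feats]))
    is_noise = (labels[len(normal_feats):] == -1).astype(float)

    # Weighted fusion of the two evidence sources (alpha is a free parameter).
    return alpha * knn_score + (1 - alpha) * is_noise
```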
3.2. Multi-Component Feature Region Selection
The core objective of the multi-component feature region selection method is to divide an image into multiple regions, each representing the structural features of different components in the image. In the field of industrial visual inspection, the features of individual components often exhibit unique regional distributions and structural differences. Therefore, this method enables the effective identification of potential anomaly regions, providing more precise and fine-grained input information for subsequent feature-enhancement steps.
The multi-component feature region selection method not only comprehensively considers the overall features of each component in the image but also ensures, through fine-grained partitioning of local regions, that the detection model can focus on subtle changes within components, such as defects, damage, or the intrusion of foreign objects. This approach allows the model to pay greater attention to local regions that are easily overlooked in traditional component feature detection, thereby significantly improving the sensitivity of local anomaly detection.
Figure 3 illustrates the detailed workflow of multi-component feature region selection, including key steps such as feature encoding, foreground and background partitioning, and positional information extraction.
When initiating multi-component feature region selection, the training image is first input into an image feature encoder to transform it into a feature map. A critical step in processing the feature map is selecting representative points. Feature maps typically contain a large amount of information, and directly processing all points not only significantly increases the computational load but is also susceptible to interference from redundant information, thereby affecting the model's efficiency and accuracy. It is therefore necessary to select the most representative key points to optimize feature map processing. Specifically, this paper employs a greedy sampling algorithm [30] to select the $N$ most representative key points from the feature map. Compared to random sampling [33], uniform sampling [34], or the K-means clustering algorithm [16], this greedy sampling algorithm offers higher efficiency, particularly when the feature map is large and computational resources are limited, as it maximizes the retention of key information in the feature map with a limited number of representative points.
Specifically, during the training phase, given the original training image $I$, we generate the corresponding feature map $F$ using a pre-trained image feature encoder, where $F$ is a three-dimensional feature map represented as $F \in \mathbb{R}^{H \times W \times C}$, with $H$ and $W$ being the height and width of the feature map, respectively, and $C$ being the number of channels. The feature vector $F_{h,w}$ at each position $(h, w)$ can be evaluated for its expressive power by calculating its $\ell_2$-norm:

$$\| F_{h,w} \|_2 = \sqrt{ \sum_{k=1}^{C} F_{h,w,k}^2 } \tag{1}$$

where $F_{h,w,k}$ represents the value of $F$ at position $(h, w)$ and channel $k$. The greedy sampling algorithm [30] aims to iteratively select representative points that optimally summarize the key patterns and important regions in the feature map. A set $S$ is defined to store the selected representative points. Initially, $S$ is empty. In each iteration, a point $p$ is selected from the remaining candidate points and added to set $S$, maximizing the expressive power gain of the component representative points. The gain is calculated as:

$$G(p) = \begin{cases} \| F_p \|_2, & S = \emptyset \\ \min\limits_{s \in S} \| F_p - F_s \|_2, & \text{otherwise} \end{cases} \tag{2}$$
The position $p$ that maximizes the gain in Equation (2) is selected and added to the set $S$. This process iterates until the number of points in set $S$ reaches the predetermined $N$ representative points. The final set $S$ contains points that both reflect the component's overall expressive power and focus on local feature variations, thereby effectively capturing the key feature information of the image. Next, the representative point features of all training images are stacked to generate the feature stack matrix $K$. Each image's representative point set $S$ corresponds to a feature subset of that image, and stacking them yields $K$, the collection of representative point features from all training images. Subsequently, a clustering algorithm (e.g., K-means) is applied to the feature stack $K$ for cluster analysis, dividing these features into multiple categories to obtain the multi-kernel feature vector $V$, where the $k$-th component $V_k$ represents the $k$-th component in the image.
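To make the sampling and clustering procedure concrete, here is a minimal sketch of Equations (1) and (2), assuming the gain takes the farthest-point form reconstructed above. The list `feature_maps` (per-image $H \times W \times C$ arrays), the sample count `64`, and the component count `M` are illustrative placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def greedy_sample(F, n_points):
    """Select n_points representative features from F of shape (H*W, C).

    The first point maximizes the l2-norm (Eq. 1); each subsequent point
    maximizes the minimum distance to the already-selected set (Eq. 2).
    """
    norms = np.linalg.norm(F, axis=1)
    selected = [int(norms.argmax())]           # S starts from the strongest point
    min_dist = np.linalg.norm(F - F[selected[0]], axis=1)
    for _ in range(n_points - 1):
        p = int(min_dist.argmax())             # point maximizing the gain G(p)
        selected.append(p)
        # Update each candidate's distance to its nearest selected point.
        min_dist = np.minimum(min_dist, np.linalg.norm(F - F[p], axis=1))
    return F[selected]                         # representative point features

# Stack representative features of all training images into K, then cluster
# to obtain the multi-kernel feature vector V (one kernel per component).
K_stack = np.vstack([greedy_sample(f.reshape(-1, f.shape[-1]), 64)
                     for f in feature_maps])
V = KMeans(n_clusters=M).fit(K_stack).cluster_centers_
```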
The feature map $F$ is then tensor-multiplied with the multi-kernel feature vector $V$ to generate region response maps related to the feature vector components. These region response maps are interpolated to restore their resolution to the original image size, and a conditional random field (CRF) is used to further refine the results, enhancing the boundary accuracy and consistency of local regions. Through these operations, we obtain $M$ component confidence images $C_i$, where $i$ represents the component index and $M$ is the total number of components.
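A sketch of how the response maps might be formed, assuming $F$ is an $H \times W \times C$ tensor and $V$ holds the $M$ cluster centers. The softmax normalization is one plausible way to turn raw responses into confidences, and the CRF refinement is abbreviated to a comment since its configuration is implementation-specific:

```python
import torch
import torch.nn.functional as F_nn

def component_confidence(F_map, V, out_size):
    """F_map: (H, W, C) feature map; V: (M, C) multi-kernel feature vectors.

    Returns M confidence maps at the original image resolution.
    """
    # Tensor product between the feature map and each kernel -> (M, H, W).
    resp = torch.einsum('hwc,mc->mhw', F_map, V)
    resp = torch.softmax(resp, dim=0)        # competition between components
    # Interpolate back to the original image size (H_img, W_img).
    resp = F_nn.interpolate(resp.unsqueeze(0), size=out_size,
                            mode='bilinear', align_corners=False)[0]
    # A dense CRF over the RGB image would refine region boundaries here.
    return resp                              # (M, H_img, W_img) confidence maps C_i
```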
For different components, key local regions are extracted, and corresponding region images $I_r$ and region confidence images $C_r$ are generated, where $r$ represents the region image index. To distinguish the different information levels in the image more finely and retain more detail, thereby improving the recognition accuracy of anomaly regions and the granularity of subsequent analysis, we employ a piecewise thresholding operation. Unlike single-threshold segmentation methods, piecewise thresholding divides the component confidence image into multiple confidence levels, enabling more precise differentiation between high-, medium-, and low-confidence regions. This allows regions with different confidence levels to be processed independently, facilitating better identification of potential anomaly regions. The specific steps are as follows:
First, a piecewise thresholding operation is performed. Specifically, the confidence value $C_i(x, y)$ at each pixel position $(x, y)$ in the image is divided into multiple regions according to the thresholds $T_1$, $T_2$, and $T_3$, corresponding to different confidence levels. The formula is defined as:

$$L(x, y) = \begin{cases} 3, & C_i(x, y) \geq T_3 \\ 2, & T_2 \leq C_i(x, y) < T_3 \\ 1, & T_1 \leq C_i(x, y) < T_2 \\ 0, & C_i(x, y) < T_1 \end{cases} \tag{3}$$
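The piecewise thresholding of Equation (3) amounts to binning confidence values; a one-line NumPy version, with the three thresholds as free parameters:

```python
import numpy as np

def confidence_levels(C_i, t1=0.25, t2=0.5, t3=0.75):
    """Map a confidence image C_i to levels 0..3 (0 = background), per Eq. (3)."""
    return np.digitize(C_i, bins=[t1, t2, t3])
```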
In this process, the foreground region $\{(x, y) : L(x, y) \geq 1\}$ comprises pixel positions with higher confidence levels; after piecewise thresholding, it is subdivided into different confidence levels based on the value of $L(x, y)$. Through this piecewise thresholding operation, we can more accurately partition the image into regions of different confidence levels, providing a clear foundation for subsequent anomaly analysis. After completing the partitioning of foreground and background, the next step is to extract the positional encoding $R$ of the components, which is used to clarify the relationship between the region image $I_r$ and its corresponding region confidence feature image $C_r$.
The model extracts the structural information of the foreground region and applies a depth-first search to perform a contour search on the foreground, identifying the parts within each contour as component regions; these component regions are then converted into the positional encoding. The region-growing procedure operates as follows. For any position $(x, y)$, its four-neighborhood is defined as $N_4(x, y) = \{(x-1, y),\, (x+1, y),\, (x, y-1),\, (x, y+1)\}$. When an unvisited foreground pixel $(x, y)$ is encountered, a depth-first search is initiated from that point, aggregating all connected foreground pixels into a unified region $\Omega_j$. Subsequently, for each connected region $\Omega_j$, its boundary $\partial \Omega_j$ is defined as:

$$\partial \Omega_j = \left\{ (x, y) \in \Omega_j : N_4(x, y) \not\subseteq \Omega_j \right\} \tag{4}$$
That is, if a position $(x, y)$ belongs to region $\Omega_j$ and at least one of its neighboring positions does not belong to $\Omega_j$, then this position is considered a boundary point. The boundary set of all regions is $\partial \Omega = \bigcup_j \partial \Omega_j$. The minimum and maximum values of all coordinates in $\partial \Omega_j$ in the horizontal and vertical directions are calculated as:

$$x_{\min} = \min_{(x, y) \in \partial \Omega_j} x, \quad x_{\max} = \max_{(x, y) \in \partial \Omega_j} x, \quad y_{\min} = \min_{(x, y) \in \partial \Omega_j} y, \quad y_{\max} = \max_{(x, y) \in \partial \Omega_j} y \tag{5}$$
where $x_{\min}$, $x_{\max}$, $y_{\min}$, and $y_{\max}$ are the extremal boundary coordinates of the region, representing its extent. For each region, the positional encoding $R_j$ is determined by the four key coordinates that delimit it:

$$R_j = (x_{\min}, y_{\min}, x_{\max}, y_{\max}) \tag{6}$$
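A compact sketch of the depth-first contour search and the bounding-box positional encoding of Equations (4)-(6), operating on the binary foreground mask; an explicit stack stands in for recursion:

```python
import numpy as np

def extract_position_encodings(fg):
    """fg: binary foreground mask (H, W). Returns one bounding box
    (x_min, y_min, x_max, y_max) per connected region, via DFS flood fill."""
    H, W = fg.shape
    visited = np.zeros_like(fg, dtype=bool)
    boxes = []
    for sx in range(H):
        for sy in range(W):
            if fg[sx, sy] and not visited[sx, sy]:
                stack, region = [(sx, sy)], []
                visited[sx, sy] = True
                while stack:                               # depth-first search
                    x, y = stack.pop()
                    region.append((x, y))
                    # Visit the four-neighborhood N4(x, y).
                    for nx, ny in ((x-1, y), (x+1, y), (x, y-1), (x, y+1)):
                        if 0 <= nx < H and 0 <= ny < W \
                                and fg[nx, ny] and not visited[nx, ny]:
                            visited[nx, ny] = True
                            stack.append((nx, ny))
                xs, ys = zip(*region)
                boxes.append((min(xs), min(ys), max(xs), max(ys)))  # Eqs. (5)-(6)
    return boxes
```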
Based on the positional encoding, the position data of the components in the confidence map are extracted. Using this positional information as a reference, the encoding is mapped back to the original image and the confidence map. The region corresponding to the foreground is cropped from the original image as the region image $I_r$; simultaneously, the corresponding foreground part is extracted from the confidence map as the region confidence feature image $C_r$. These two types of features serve as inputs for the region feature enhancement in the next subsection.
3.3. Feature Enhancement
After completing the multi-component feature region selection, we obtain the region images, region confidence features, and corresponding positional information for each component. To enhance the model's ability to focus on component-level features, particularly in detecting minor defects or local anomalies, we need to extract more detailed information from local features. To this end, we introduce a feature-enhancement module, as shown in Figure 4. The goal of the feature-enhancement module is to achieve feature enhancement by comprehensively utilizing self-attention mechanisms, feature fusion strategies, and similarity scores between features, based on the output of the multi-component feature region selection module. Specifically, the component images obtained from the multi-component feature region selection module are input into a pre-trained image feature encoder to generate feature maps. Subsequently, to enable the model to better focus on key local features, the feature-enhancement module introduces a self-attention mechanism, which automatically adjusts the model's attention to local information through feature weighting. This mechanism enhances the model's ability to express key information regions in the image while suppressing irrelevant background information. Then, the feature-enhancement module divides the feature maps into multiple groups, with each group serving as a candidate feature image. Finally, by calculating the similarity scores between the region confidence images and the candidate feature images and matching the feature map with the highest score, the selected map is multiplied with the positional encoding from Section 3.2, thereby enhancing the representation of anomaly regions.
It is important to note that the feature-enhancement module, through refined similarity analysis and efficient region alignment, enables the model to automatically focus on regions exhibiting high consistency in the local feature space, thereby significantly improving sensitivity to minor defects and local anomalies. From a mathematical perspective, the above process can be described as follows. Let the feature-enhancement module be $\Phi$, where $C_r$ is the region confidence image, $I_r$ is the region image, $R$ is the positional encoding, $\mathcal{S}(\cdot, \cdot)$ is the structural similarity score calculation function, and $\mathcal{G}(\cdot)$ is the candidate feature image-generation function. The feature-enhancement process can then be expressed as:

$$\hat{C} = \Phi(C_r, I_r, R) = \Big( \underset{\tilde{F} \in \mathcal{G}(I_r)}{\arg\max}\, \mathcal{S}(C_r, \tilde{F}) \Big) \odot R \tag{7}$$
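Read as pseudocode, Equation (7) composes the pieces developed below. The helper names `encode_and_attend`, `candidate_features`, and `bbox_to_mask` are hypothetical and used only to show the data flow, and `psnr` refers to the sketch accompanying Equation (12):

```python
import numpy as np

def feature_enhancement(C_r, I_r, R, k):
    """Phi(C_r, I_r, R): select the best-matching candidate feature image
    and gate it with the positional encoding (Eq. 7)."""
    F_w = encode_and_attend(I_r)                 # encoder + channel self-attention
    candidates = candidate_features(F_w, k)      # k group-averaged feature images
    scores = [psnr(C_r, f) for f in candidates]  # structural similarity S(., .)
    best = candidates[int(np.argmax(scores))]    # arg max over candidates
    # R is treated here as a binary mask derived from the bounding box (assumption).
    return best * bbox_to_mask(R, best.shape)
```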
The following is a detailed explanation of the specific workflow of the feature-enhancement process. The region image $I_r$ is input into a pre-trained image feature encoder to generate the feature map $F$. Each position in the feature map $F$ contains certain local information. To enhance the model's focus on key local features, inspired by the method of Tongkun Liu et al. [19], this paper employs a self-attention mechanism in the feature encoder. This mechanism calculates dependencies between channels to determine the correlation between each channel and the others, adjusting the attention paid to different channels through weighting operations. Specifically, the input feature map $F$ is first linearly transformed to obtain the query ($Q$), key ($K$), and value ($V$) matrices:

$$Q = F W^Q, \quad K = F W^K, \quad V = F W^V \tag{8}$$
The attention weights are computed using the trainable projection matrices $W^Q$, $W^K$, and $W^V$, where the scaled dot-product operation applies a softmax normalization that converts the raw similarity scores $QK^\top$ into probability distributions, yielding the final formulation:

$$F' = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V \tag{9}$$

where $d_k$ is the dimension of the key vectors.
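A minimal PyTorch sketch of the channel-wise attention in Equations (8) and (9), treating each channel's flattened spatial map as a token; the projection dimension `d_k` is illustrative:

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention across channels: tokens are channels, features are
    flattened spatial maps (Eqs. 8-9)."""
    def __init__(self, spatial_dim, d_k=64):
        super().__init__()
        self.W_q = nn.Linear(spatial_dim, d_k, bias=False)
        self.W_k = nn.Linear(spatial_dim, d_k, bias=False)
        self.W_v = nn.Linear(spatial_dim, spatial_dim, bias=False)
        self.d_k = d_k

    def forward(self, F_map):                    # F_map: (C, H, W)
        C, H, W = F_map.shape
        tokens = F_map.reshape(C, H * W)         # one token per channel
        Q, K, V = self.W_q(tokens), self.W_k(tokens), self.W_v(tokens)
        # Scaled dot-product attention over channel tokens -> (C, C) weights.
        attn = torch.softmax(Q @ K.T / self.d_k ** 0.5, dim=-1)
        return (attn @ V).reshape(C, H, W)       # weighted feature map F'
```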
Next, the feature map is weighted according to the attention relationships between channels, yielding the weighted feature map $F'$, which captures and enhances the similarities and dependencies between channels. To further enhance the model's focus on key local features, clustering is introduced to help the model concentrate on the local features represented by each channel group. Specifically, cosine similarity is used to calculate the correlation between channels in the weighted feature map $F'$. Let $f_i$ denote the flattened feature of the $i$-th channel. The similarity matrix $M$ between channels in the weighted feature map can be calculated as:

$$M_{ij} = \frac{f_i \cdot f_j}{\| f_i \|_2 \, \| f_j \|_2} \tag{10}$$

where $f_i$ and $f_j$ are the feature vectors of the $i$-th and $j$-th channels in $F'$. Using the calculated similarity matrix $M$, the channels are clustered. Based on the clustering results, the channels are divided into $k$ groups $\{G_1, G_2, \dots, G_k\}$. Each group $G_i$ contains $N$ channels and is treated as a feature group. To obtain the set of candidate feature images $\{\tilde{F}_1, \dots, \tilde{F}_k\}$, each candidate feature image $\tilde{F}_i$ is generated by averaging all channels within its corresponding group $G_i$:

$$\tilde{F}_i = \frac{1}{N} \sum_{c \in G_i} F'_c \tag{11}$$
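The grouping step of Equations (10) and (11) can be sketched as follows. Spectral clustering on the cosine-similarity matrix is one reasonable reading of "clustering the channels using $M$" (the paper does not name the algorithm), and the group count `k` is a free parameter:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def candidate_feature_images(F_w, k):
    """F_w: weighted feature map (C, H, W). Returns k candidate feature
    images, each the mean of one channel group (Eq. 11)."""
    C, H, W = F_w.shape
    flat = F_w.reshape(C, -1)
    # Cosine similarity matrix between channels (Eq. 10).
    norm = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    M_sim = norm @ norm.T
    # Cluster channels on the (shifted, nonnegative) similarity matrix.
    labels = SpectralClustering(n_clusters=k, affinity='precomputed') \
        .fit_predict((M_sim + 1.0) / 2.0)
    # Average the channels of each group into one candidate feature image.
    return [F_w[labels == g].mean(axis=0) for g in range(k)]
```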
In this way, the candidate feature image effectively represents the deep structural information of the current component, providing reliable input for subsequent feature matching and particularly aiding the detection of minor defects or local anomalies. The region confidence image is mapped to each candidate feature image to ensure alignment in the same space, and the structural similarity between the two is calculated from the mapping results. The structural similarity is defined as follows: for the region confidence image $C_r$ and the candidate feature image $\tilde{F}_i$, we define a similarity metric function $\mathcal{S}(C_r, \tilde{F}_i)$. $\mathcal{S}$ can be implemented using candidate metric functions such as MSE, cosine similarity, SSIM, and PSNR. In this paper, PSNR is selected as the similarity metric function, specifically defined as:

$$\mathcal{S}(C_r, \tilde{F}_i) = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\frac{1}{mn} \sum_{x=1}^{m} \sum_{y=1}^{n} \left( C_r(x, y) - \tilde{F}_i(x, y) \right)^2} \right) \tag{12}$$

where $\mathrm{MAX}$ represents the maximum pixel value of the confidence image in our implementation, and $m \times n$ is the size of the region confidence image $C_r$ and the candidate feature image $\tilde{F}_i$.
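The PSNR score of Equation (12), written out directly; both inputs are assumed to be resized to the same $m \times n$ grid and scaled to a common range before comparison:

```python
import numpy as np

def psnr(C_r, F_cand, max_val=None):
    """PSNR between region confidence image C_r and candidate feature image
    F_cand (Eq. 12); higher means a better structural match."""
    if max_val is None:
        max_val = C_r.max()                  # MAX: peak value of the confidence image
    mse = np.mean((C_r.astype(np.float64) - F_cand.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')                  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```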
For the $k$ candidate feature images $\{\tilde{F}_1, \dots, \tilde{F}_k\}$, we obtain $k$ corresponding structural similarity scores $\{s_1, \dots, s_k\}$. Specifically, the structural similarity score $s_i$ between the region confidence image $C_r$ and the candidate feature image $\tilde{F}_i$ is calculated as:

$$s_i = \mathcal{S}(C_r, \tilde{F}_i), \quad i = 1, \dots, k \tag{13}$$
All candidate feature images are sorted in descending order based on their similarity scores, where:

$$s_{(1)} \geq s_{(2)} \geq \dots \geq s_{(k)} \tag{14}$$

The candidate feature image with the highest similarity is selected as the best match for the current component, $\tilde{F}^{*} = \arg\max_i \, s_i$. The best-match confidence image $\tilde{F}^{*}$ is then multiplied with the positional encoding $R$ using element-wise multiplication:

$$\hat{C} = \tilde{F}^{*} \odot R \tag{15}$$

$\hat{C}$ is the enhanced confidence map, which retains the information of the best-match confidence image while enhancing the representation of anomaly patterns through similarity optimization.
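Putting the last steps together: score every candidate, take the arg max (Eqs. 13-15), and gate the winner with the positional-encoding mask. This reuses `psnr` from the sketch above, and, as before, treating $R$ as a binary bounding-box mask is an assumption:

```python
import numpy as np

def enhance_confidence(C_r, candidates, R):
    """Select the best-matching candidate and apply the positional encoding.

    candidates: list of (H, W) candidate feature images
    R:          (x_min, y_min, x_max, y_max) positional encoding
    """
    scores = np.array([psnr(C_r, f) for f in candidates])  # s_i (Eq. 13)
    best = candidates[int(scores.argmax())]                # F* with score s_(1)
    mask = np.zeros_like(best)
    x0, y0, x1, y1 = R
    mask[x0:x1 + 1, y0:y1 + 1] = 1.0        # binary mask from the encoding (assumption)
    return best * mask                       # enhanced confidence map (Eq. 15)
```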