Article

SAID: Segment All Industrial Defects with Scene Prompts

1 College of Mechatronics and Control Engineering, Shenzhen University, Nanhai Ave., Shenzhen 518060, China
2 School of Sino-Germany Intelligent Production, Shenzhen City Polytechnic, Shenzhen 518116, China
3 College of Urban Transportation and Logistics, Shenzhen Technology University, Shenzhen 518118, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(16), 4929; https://doi.org/10.3390/s25164929
Submission received: 23 June 2025 / Revised: 25 July 2025 / Accepted: 7 August 2025 / Published: 9 August 2025
(This article belongs to the Section Industrial Sensors)

Abstract

In the field of industrial inspection, image segmentation is a common method for surface inspection, capable of locating and segmenting the appearance defect areas of products. Most existing methods are trained specifically for particular products. The recent SAM (Segment Anything Model) serves as an image segmentation foundation model capable of zero-shot segmentation through diverse prompts. Nevertheless, SAM's performance in specialized downstream tasks is not satisfactory. Additionally, SAM requires prior manual interaction to complete segmentation and post-processing of the segmentation results. This paper proposes SAID (Segment All Industrial Defects) to deal with these issues. The SAID model encodes a single-annotated prompt–image pair into a scene embedding via the Scene Encoder, achieving automatic segmentation and eliminating the reliance on manual intervention. Meanwhile, SAID's Feature Alignment and Fusion Module effectively addresses the alignment issue between the scene embedding and the image embedding. Experimental results demonstrate that SAID outperforms SAM in segmentation capability across various industrial scenes. Under the one-shot target scene segmentation task, SAID also improves the mIoU metric by 5.79 and 0.87 compared to MSNet and SegGPT, respectively.

1. Introduction

Industrial surface defect detection is a critical quality control technology aimed at identifying appearance defects on surfaces, such as scratches, dirt, and damage. Image segmentation transforms the surface defect detection task into a semantic segmentation problem between defective and normal regions, enabling fine segmentation of defect areas and obtaining their location and geometric attributes. Currently, industrial image segmentation methods based on deep learning can be divided into two categories: the first uses professional models, usually trained from scratch with supervised methods; the second uses a foundation model together with a prompt learning mechanism for segmentation.
For the first category, although professional segmentation methods like Mask-RCNN [1], DeepLabV3 [2], and SegFormer [3] have achieved significant success in the field of image segmentation, the segmentation task in industrial surface defect detection still faces several challenges: (1) Lack of defect samples [4]. Normal samples of industrial products are easy to obtain, but defect samples are relatively scarce. (2) Numerous scene types and complex defect patterns [5]: surface defects of industrial products exhibit diverse shapes, colors, and textures. (3) Ambiguous defect evaluation criteria. Industrial products are prone to undefined defect types, posing challenges to traditional supervised learning-based segmentation methods.
For the second, with the development of computing power and large-scale datasets, foundation models [6] continue to emerge. Recently, Meta AI’s foundational visual segmentation model SAM [7] has garnered attention for its ability to generate precise object masks in an interactive manner. SAM is trained on the large-scale dataset SA-1B, possessing powerful zero-shot generalization capability. However, due to significant differences between natural images and downstream task images from other fields (such as industrial images, medical images [8], and remote sensing images [9]), directly using SAM for segmentation in downstream fine-grained fields does not yield satisfactory results. Notably, SAM requires manual interaction prompts, which poses real-time issues in practical applications.
In summary, traditional deep learning methods have poor generalization across different scenes, while SAM lacks capability in downstream industrial fields and its prompt mechanism is not suitable for industrial detection. To extend SAM's powerful segmentation capability to automatic, cross-scene-transferable segmentation of industrial images, we design the network SAID for industrial cross-scene single-annotation prompt segmentation. (1) SAID utilizes scene embedding without real-time human interaction. As shown in Figure 1, the input for the scene prompt is a set of product single-annotated prompt–image pairs, which the model encodes into the scene prompt to assist with segmentation. SAID possesses generalization capability for industrial scene images, requiring only the product image and the corresponding annotation mask as the product prompt–image pair to complete the segmentation of abnormal regions. (2) We note that the introduction of the Scene Encoder module inevitably leads to feature alignment issues between the image embedding and the scene embedding. We design a Lightweight Feature Alignment and Fusion Module, including the Neck module and Lightweight Fusion module, to effectively address this issue. As illustrated in Figure 2, traditional industrial defect detection models typically require training dedicated models for each specific scene, which significantly limits their generalization capability. In contrast, SAID exhibits robust cross-scene generalization by leveraging product prompt pairs—each consisting of a product image and its corresponding annotation—from a given scene to form a transferable scene prompt. While SAM relies on manual interaction to generate segmentation prompts, SAID eliminates this need by automatically encoding the prompt pairs through the Scene Encoder, enabling effective segmentation in unseen scenes without additional human input. Our code will be publicly available at the following URL: https://github.com/KLIVIS/SAID-IVD (accessed on 24 July 2025).
The main contributions of our work are as follows:
1.
We propose SAID for the defect segmentation task of industrial images. SAID eliminates the reliance on human priors and does not require complex post-processing, achieving automatic defect segmentation.
2.
We design the Scene Encoder to encode a set of user-input annotated product images into a scene embedding, enhancing the model's segmentation capability. To address the misalignment between the features output by the Scene Encoder and the Image Encoder, a Lightweight Feature Alignment and Fusion Module is designed.
3.
Experiments on multiple industrial scene datasets show that our SAID model exhibits excellent capabilities under both one-shot and supervised settings.

2. Related Work

2.1. Surface Defect Detection

Surface defect detection is a significant technology in the field of computer vision. Traditional vision-based defect detection methods [10,11,12] primarily relied on handcrafted features such as visual textures and colors to detect defects, and perform poorly under complex backgrounds and lighting variations. In recent years, deep learning-based segmentation methods have become the preferred approach for surface defect detection. These methods can be categorized into two types: supervised learning-based methods and unsupervised learning-based methods. Supervised learning-based methods require a large amount of annotated data to train models, such as the commonly used FCN [13] and Mask RCNN [1]. These methods perform well when there is sufficient annotated data, but obtaining large amounts of annotated data is often costly and time-consuming. In contrast, unsupervised learning-based methods do not need annotated data and are mainly divided into reconstruction-based methods and embedding-based methods. Reconstruction-based methods [14,15] use reconstruction networks to reconstruct normal images, and abnormal images are easily detected as they cannot be well reconstructed [16,17]. Embedding-based methods [18,19,20] typically use deep neural networks pre-trained on large datasets such as ImageNet [21] as feature extractors, extracting meaningful vectors that describe the entire image, with anomaly scores usually represented by the distance between the embedding vector of the test image and a reference vector representing the normality of the training dataset. These unsupervised methods provide an effective solution in the absence of annotated data.

2.2. Few-Shot Image Segmentation Methods

In order to address the issue of limited annotated samples for segmentation, some researchers have proposed various few-shot image segmentation methods. These methods can be primarily categorized into meta-learning [22,23], Siamese Network-based [24,25], and prompt learning [26] approaches. Compared to unsupervised methods, few-shot learning provides the model with a clear learning objective to achieve better model performance. Meta-learning enables the model to quickly learn and adapt to new tasks from a small number of samples. By pre-learning some general knowledge or parameters, the model can better adapt to segmentation tasks of different categories. Siamese Network-based methods compare the target image with a few samples to learn a general feature representation. The Siamese Network can quickly adapt on a few samples, thereby achieving accurate segmentation. The core of prompt learning methods lies in designing clever templates that make the training approach for downstream fine-tuning tasks more similar to the pre-training tasks [27]. This design helps to reduce the semantic gap between pre-training and fine-tuning, making the training process more effective and efficient. However, prompt learning requires a foundation model with sufficiently strong emergent capabilities.

2.3. Fundamental Visual Segmentation Model

SAM [7] is a foundation model designed by Meta AI for image segmentation tasks. Its architecture is based on the Transformer [28], including an Image Encoder, a Prompt Encoder, and a Mask Decoder. The Image Encoder, using ViT (Vision Transformer) [29] as its backbone, is pre-trained with MAE (Masked Autoencoder) [30] to encode the input image into an image embedding. The Prompt Encoder consists of dense and sparse branches. SAM is trained on a large-scale annotated dataset, SA-1B, and demonstrates robust zero-shot generalization capability. However, despite its excellent performance on natural images, SAM's effectiveness in downstream specialized tasks such as agriculture, remote sensing, and medical imaging is poor [8,31,32]. To enhance the performance of SAM in downstream specialized tasks, researchers have proposed various methods. Among these, PEFT (Parameter-Efficient Fine-Tuning) [33] is a common and effective technique. PEFT allows SAM to be adapted to specific domains, as demonstrated by studies like SAM-Adapter [34] and Medical SAM Adapter [35]. Additionally, SonarSAM [36] introduces SAM into the field of sonar images and enhances its performance through LoRA [37]. Beyond fine-tuning techniques, some works have attempted to modify the prompt network to improve SAM. RSPrompter [9] integrates human prompts into the network itself, achieving superior segmentation performance on remote sensing datasets. SEEM [38], on the other hand, is a model employing a universal encoder–decoder architecture, providing new insights into addressing the performance issues of foundational segmentation models across various downstream tasks through complex query and prompt interactions.

3. Methods

Traditional deep learning-based defect detection methods [14,39,40] exhibit poor generalization across different scenes and often require an expert model to be trained for each distinct scene. SAM, while effective in some contexts, suffers in specialized industrial domains due to its need for manual interaction and its tendency to generate excessive redundant masks. Building upon SAM, we introduce a Scene Encoder and a Lightweight Feature Alignment and Fusion Module to construct a model named SAID, which can perform automatic segmentation on industrial data using single-annotation information.

3.1. Overview of the SAID Architecture

The network SAID consists of four main components: Image Encoder, Scene Encoder, Feature Alignment and Fusion Module, and Mask Decoder. Figure 3 illustrates the model architecture of SAID. The Image Encoder, adopted from SAM, is frozen during training and not subject to parameter updates. The Scene Encoder is designed to encode a pair of annotated product images into scene embedding, aiding SAID in achieving automatic segmentation without additional human intervention. The Feature Alignment and Fusion Module efficiently integrates feature maps from both the Image Encoder and the Scene Encoder before feeding them into the decoder for mask prediction. The Mask Decoder in SAID is identical to that in SAM and is fine-tuned during training.
The workflow of the model includes the following steps:
(1) The image to be detected $I_{input} \in \mathbb{R}^{3 \times H \times W}$ is encoded by the Image Encoder to produce the image embedding $E_{image} \in \mathbb{R}^{256 \times 64 \times 64}$.
(2) The Scene Encoder encodes a product image $I_{sample} \in \mathbb{R}^{3 \times 1024 \times 1024}$ and its corresponding mask image $M_{sample} \in \mathbb{R}^{1 \times 1024 \times 1024}$, belonging to the same scene as the image to be detected, into the scene embedding $E_{scene} \in \mathbb{R}^{256 \times 64 \times 64}$.
(3) $E_{image}$ and $E_{scene}$ are fused through the designed Feature Alignment and Fusion Module and then fed into the Mask Decoder for mask prediction, yielding the segmentation result. A minimal sketch of this workflow is given below.
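The following PyTorch-style sketch mirrors these three steps. The module names and wiring are illustrative assumptions based on the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SAIDSketch(nn.Module):
    """Minimal sketch of the SAID inference flow (not the official implementation)."""
    def __init__(self, image_encoder, scene_encoder, fusion_module, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder    # frozen SAM ViT encoder
        self.scene_encoder = scene_encoder    # trainable U-Net-style prompt encoder
        self.fusion_module = fusion_module    # Neck MLPs + Lightweight Fusion
        self.mask_decoder = mask_decoder      # fine-tuned SAM mask decoder

    def forward(self, image, prompt_image, prompt_mask):
        with torch.no_grad():                                    # Image Encoder is frozen
            e_image = self.image_encoder(image)                  # step (1): (B, 256, 64, 64)
        e_scene = self.scene_encoder(prompt_image, prompt_mask)  # step (2): (B, 256, 64, 64)
        fused = self.fusion_module(e_image, e_scene)             # step (3): align and fuse
        return self.mask_decoder(fused)                          # predicted defect mask
```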
The SAID model is trained on the Industrial-5i dataset under a four-fold cross-validation setting, where one fold of industrial scenes is held out for testing to evaluate cross-scene generalization. All input images and corresponding masks are resized to 1024 × 1024 and normalized to the [0, 1] range. To improve robustness, data augmentation techniques such as random horizontal flipping, brightness adjustment, and affine transformations are applied. During training, the Image Encoder is frozen to preserve pre-trained representations, while the Scene Encoder, Feature Alignment and Fusion Module, and Mask Decoder are optimized jointly in an end-to-end manner. The model is trained for 100 epochs using the Adam optimizer with a learning rate of $1 \times 10^{-3}$ and a batch size of 16. A cosine annealing scheduler is employed to gradually reduce the learning rate. The loss is the binary cross-entropy between the predicted and ground-truth masks (Section 3.4), which also helps to cope with the class imbalance in defect regions. All experiments are implemented in PyTorch 2.1 with mixed-precision training on an NVIDIA RTX 3090Ti GPU (Colorful, Shenzhen, China). During inference, the model performs segmentation using only the target image and a single-annotated product image pair from the same scene, without requiring any manual interaction or online fine-tuning.
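A hedged PyTorch sketch of this optimization setup is given below; the attribute name `model.image_encoder` and the exact scheduler arguments are assumptions rather than the released training script.

```python
import torch

def build_training(model, epochs=100, lr=1e-3):
    """Optimizer, scheduler and loss as described in the text (details are assumed)."""
    for p in model.image_encoder.parameters():   # keep the SAM Image Encoder frozen
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.BCEWithLogitsLoss()     # binary cross-entropy on defect masks
    scaler = torch.cuda.amp.GradScaler()         # mixed-precision training
    return optimizer, scheduler, criterion, scaler
```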

3.2. Scene Encoder

The Scene Encoder serves as a replacement for the complex human–model interaction mechanism in SAM, functioning as a prompt encoder. As illustrated in Figure 4, it is capable of encoding a pair of annotated scene prompts, namely a defective product image $I_{sample}$ and its corresponding label mask image $M_{sample}$, into a scene embedding that carries domain expert knowledge. This embedding is then aligned and fused with the image embedding before being fed into the Mask Decoder. The Scene Encoder transforms the product input image $I_{sample} \in \mathbb{R}^{3 \times H \times W}$ and its annotation $M_{sample} \in \mathbb{R}^{1 \times H \times W}$ into a scene embedding $E_{scene} \in \mathbb{R}^{bs \times 256 \times 64 \times 64}$. The input image to be detected $I_{input} \in \mathbb{R}^{bs \times 3 \times H \times W}$, after being processed by the Image Encoder, yields the image embedding $E_{image} \in \mathbb{R}^{bs \times 256 \times 64 \times 64}$. Thanks to the robust feature extraction capabilities of the Image Encoder and the prompt encoding capabilities of the Scene Encoder, SAID exhibits excellent single-annotation cross-scene defect segmentation abilities.
The Scene Encoder is a module designed for encoding scene information, based on CNNs (Convolutional Neural Networks), and its architecture is inspired by the encoder–decoder structure of U-Net [41]. The U-Net architecture effectively extracts both local and global features from images, enabling the extraction of prior knowledge from a single-annotation prompt case. The scene prompt encoding component consists of two parts: the pre-encoder and the fusion encoder. The inputs are a product image $I_{sample} \in \mathbb{R}^{3 \times 1024 \times 1024}$ and its corresponding mask image $M_{sample} \in \mathbb{R}^{1 \times 1024 \times 1024}$. These two images are separately processed by two pre-encoders $f_1(\cdot)$ and $f_2(\cdot)$ to extract features, resulting in feature maps of size $256 \times 64 \times 64$.
$F_{image} = f_1(I_{sample}), \quad F_{mask} = f_2(M_{sample}),$ (1)
$f_1(\cdot)$ represents the pre-encoder for the sample image $I_{sample}$, and $f_2(\cdot)$ represents the pre-encoder for its corresponding mask $M_{sample}$. Both $f_1(\cdot)$ and $f_2(\cdot)$ have similar network architectures, consisting of four convolutional blocks. Each convolutional block $conv_i$ includes a $3 \times 3$ convolutional layer, a BN (Batch Normalization) layer, and a LeakyReLU activation function.
$conv_i = \mathrm{LeakyReLU}(\mathrm{BN}(\mathrm{conv}(input))),$ (2)
where $i$ denotes the $i$-th convolutional block, with values ranging from 1 to 4, and $input$ refers to $I_{sample}$ for $f_1(\cdot)$ or $M_{sample}$ for $f_2(\cdot)$.
The feature maps $F_{image}$ and $F_{mask}$ are summed element-wise, and the resulting sum is processed by the fusion encoder $f_{fusion}$. The fusion encoder consists of three convolutional blocks, each followed by a downsampling operation that reduces the feature map's height and width to half of their original size while doubling the number of channels. Consequently, the shape of the input feature map transitions from $\mathbb{R}^{256 \times 64 \times 64}$ to $\mathbb{R}^{512 \times 16 \times 16}$.
$conv_{f\_fusion}^{i} = \mathrm{DownSamp}(\mathrm{LeakyReLU}(\mathrm{BN}(\mathrm{conv}(F_{image} + F_{mask})))).$ (3)
The scene prompt decoding part consists of three convolutional blocks, each preceded by an upsampling layer. The upsampling operation doubles the spatial dimensions of the feature map. Following each upsampling, a convolutional operation further extracts features while halving the number of channels. This transition results in the feature map's shape changing from $\mathbb{R}^{1024 \times 8 \times 8}$ to $\mathbb{R}^{256 \times 64 \times 64}$.
$conv_{decoder}^{i}(x, y) = \mathrm{LeakyReLU}(\mathrm{conv}(\mathrm{concat}(\mathrm{UpSamp}(x), y))),$ (4)
$\mathrm{UpSamp}(x)$ is implemented using nearest-neighbor interpolation, which enlarges the spatial dimensions of the low-resolution feature map through interpolation. This is followed by a $1 \times 1$ convolutional layer to adjust the dimensions. In each convolutional operation of the decoder, after the upsampling and dimensionality reduction, the output is concatenated with the corresponding convolutional block output from the encoder. This concatenation aids the model in better feature extraction. Ultimately, after the encoding–decoding process, a scene embedding $E_{scene} \in \mathbb{R}^{bs \times 256 \times 64 \times 64}$ is obtained, which has the same shape as the image embedding $E_{image}$. The final results are shown in Figure 5.
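To make the encoder–decoder structure concrete, a compact PyTorch sketch is given below. It follows the description above (two pre-encoders of four convolutional blocks each, an element-wise sum, a downsampling fusion encoder, and a skip-connected nearest-neighbor upsampling decoder), but the exact channel widths per stage are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # 3x3 convolution -> BatchNorm -> LeakyReLU, as in each convolutional block above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(inplace=True),
    )

class SceneEncoderSketch(nn.Module):
    """Illustrative U-Net-style Scene Encoder; channel widths are assumptions."""
    def __init__(self):
        super().__init__()
        def pre_encoder(c_in):  # four conv blocks: 1024x1024 input -> 256x64x64 feature map
            return nn.Sequential(conv_block(c_in, 32, 2), conv_block(32, 64, 2),
                                 conv_block(64, 128, 2), conv_block(128, 256, 2))
        self.f1 = pre_encoder(3)   # for the product image I_sample
        self.f2 = pre_encoder(1)   # for the annotation mask M_sample
        # Fusion encoder: each stage halves the spatial size
        self.enc1 = conv_block(256, 512, 2)
        self.enc2 = conv_block(512, 1024, 2)
        self.enc3 = conv_block(1024, 1024, 2)
        # Decoder: nearest-neighbor upsampling + 1x1 conv, skip concatenation, conv block
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.adj3, self.dec3 = nn.Conv2d(1024, 1024, 1), conv_block(2048, 512)
        self.adj2, self.dec2 = nn.Conv2d(512, 512, 1), conv_block(1024, 256)
        self.adj1, self.dec1 = nn.Conv2d(256, 256, 1), conv_block(512, 256)

    def forward(self, i_sample, m_sample):
        x0 = self.f1(i_sample) + self.f2(m_sample)   # (B, 256, 64, 64), element-wise sum
        x1 = self.enc1(x0)                           # (B, 512, 32, 32)
        x2 = self.enc2(x1)                           # (B, 1024, 16, 16)
        x3 = self.enc3(x2)                           # (B, 1024, 8, 8), bottleneck
        d2 = self.dec3(torch.cat([self.adj3(self.up(x3)), x2], dim=1))    # (B, 512, 16, 16)
        d1 = self.dec2(torch.cat([self.adj2(self.up(d2)), x1], dim=1))    # (B, 256, 32, 32)
        return self.dec1(torch.cat([self.adj1(self.up(d1)), x0], dim=1))  # E_scene: (B, 256, 64, 64)
```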

3.3. Feature Alignment and Fusion Module

The $E_{image}$ output by the Image Encoder naturally exhibits misalignment in the feature space with the $E_{scene}$ output by the Scene Encoder. A direct approach to address this issue would be to fine-tune the Image Encoder. However, due to the limited amount of data for downstream tasks, retraining could easily lead to overfitting and fail to fully leverage the zero-shot emergence capabilities of SAM. To address this, we introduce a Lightweight Feature Alignment and Fusion Module, which includes Neck modules and a Lightweight Fusion module, as shown in Figure 3. Before being fused with $E_{scene}$, $E_{image}$ undergoes alignment through the Neck modules, followed by feature fusion through the Lightweight Fusion module.
$E_{image} = \mathrm{ReLU}(W_{image\_2} \cdot \mathrm{ReLU}(W_{image\_1} \cdot E_{image} + b_{image\_1}) + b_{image\_2}),$
$E_{scene} = \mathrm{ReLU}(W_{scene\_2} \cdot \mathrm{ReLU}(W_{scene\_1} \cdot E_{scene} + b_{scene\_1}) + b_{scene\_2}),$ (5)
$W_{image\_1}$, $b_{image\_1}$, $W_{image\_2}$, and $b_{image\_2}$ represent the weight matrices and bias vectors for the first and second layers of the MLP processing $E_{image}$, respectively. Similarly, $W_{scene\_1}$, $b_{scene\_1}$, $W_{scene\_2}$, and $b_{scene\_2}$ denote the weight matrices and bias vectors for the first and second layers of the MLP processing $E_{scene}$. The symbol $\mathrm{ReLU}(\cdot)$ represents the nonlinear ReLU activation function.
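As a concrete reading of Equation (5), the Neck can be realized as a shape-preserving pixel-wise MLP built from 1 × 1 convolutions over the 256-channel embeddings. The sketch below is an assumed implementation; the hidden width is illustrative.

```python
import torch.nn as nn

class NeckSketch(nn.Module):
    """Shape-preserving pixel-wise MLP of Equation (5): (B, 256, 64, 64) -> (B, 256, 64, 64)."""
    def __init__(self, channels=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),   # W_1 (.) + b_1
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),   # W_2 (.) + b_2
            nn.ReLU(inplace=True),
        )

    def forward(self, embedding):
        return self.mlp(embedding)

# Separate Neck weights for each branch, as in Figure 3.
neck_image, neck_scene = NeckSketch(), NeckSketch()
```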
To align and fuse the $E_{image}$ output by the Image Encoder with the $E_{scene}$ output by the Scene Encoder in the feature space, multiple experiments were conducted. The Neck module designed in this paper is a two-layer MLP that follows both the Image Encoder and the Scene Encoder. This module does not alter the shape of the embedding but ensures that the image embedding and scene embedding are aligned in the feature dimension, thereby reducing the likelihood of suboptimal segmentation results due to feature misalignment introduced by the Scene Encoder module. After the Neck modules align the image embedding and scene embedding, the Lightweight Fusion module further integrates these embeddings. Inspired by [42], the Lightweight Fusion module designed in this paper is a lightweight feature fusion module that incorporates a pixel-level 3D attention mechanism. As shown in Figure 6c, the $E_{image}$ and $E_{scene}$ from Equation (5) are concatenated and then processed through two simple convolutional operations followed by a pixel-level attention module for feature fusion. The two convolutional layers use a skip connection that concatenates with the input, and finally, a parameter-free attention mechanism is used to further fuse the combined features.
$x = \mathrm{conv}(\mathrm{concat}(E_{image}, E_{scene})),$ (6)
where $\mathrm{conv}(\cdot)$ represents a two-layer convolutional operation. The combined features from Equation (6) are further fused through Equation (7).
$E_{feature} = x \times \sigma\left(\frac{(x - \mu)^2}{4\left(\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} + \epsilon\right)} + 0.5\right),$ (7)
where $\mu$ denotes the mean of the vector $x$, computed as $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, and $x_i$ represents an individual feature point within $x$. $n$ is the number of effective pixel values within $x$, determined by $n = w \times h - 1$, where $w$ and $h$ are the width and height of the feature map; the subtraction of 1 accounts for excluded boundary pixels. $\epsilon$ is a small constant introduced to prevent division by zero and ensure numerical stability; in the conducted experiments it is set to $1 \times 10^{-4}$. The symbol $\sigma(\cdot)$ denotes the Sigmoid activation function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps a real value to the range (0, 1). Finally, as shown in Equation (8), the fused feature $E_{feature}$ is decoded by the Mask Decoder to produce the output predicted mask image $M_{Output}$.
$M_{Output} = dec_{Mask}(E_{feature}),$ (8)
where $dec_{Mask}(\cdot)$ is the Mask Decoder of SAID.
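A compact sketch of Equations (6) and (7) is given below; the convolution widths are assumptions, the skip connection described above is simplified to keep the example short, and the parameter-free attention follows the SimAM-style formulation of Equation (7).

```python
import torch
import torch.nn as nn

class LightweightFusionSketch(nn.Module):
    """Concatenate E_image and E_scene, apply two convolutions, then the
    parameter-free pixel-level attention of Equation (7). Widths are assumptions."""
    def __init__(self, channels=256, eps=1e-4):
        super().__init__()
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, e_image, e_scene):
        x = self.conv(torch.cat([e_image, e_scene], dim=1))       # Equation (6)
        # Equation (7): weight = sigmoid((x - mu)^2 / (4 * (var + eps)) + 0.5)
        n = x.shape[2] * x.shape[3] - 1                           # effective pixel count
        mu = x.mean(dim=(2, 3), keepdim=True)
        var = ((x - mu) ** 2).sum(dim=(2, 3), keepdim=True) / n
        weight = torch.sigmoid((x - mu) ** 2 / (4 * (var + self.eps)) + 0.5)
        return x * weight                                         # fused feature E_feature
```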
Figure 6. The structure of different fusion modules. (a) Concat fusion mechanism. (b) Cross-attention fusion mechanism. (c) Lightweight 3D Fusion Mechanism.

3.4. Loss Function

For the proposed SAID network, the loss function is the cross-entropy loss between the predicted mask and the GT (Ground Truth) mask, a commonly used loss function in segmentation networks, denoted as $L_{seg}$. The cross-entropy loss $L_{seg}$ quantifies the difference between the predicted mask and the true label mask.
$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right],$ (9)
$L_{seg}$ evaluates the model's performance by calculating the difference between the model's predicted probability distribution and the true probability distribution at each pixel. Here, $N = H \times W$ represents the total number of pixels, $y_i$ is the true label of the $i$-th pixel, and $p_i$ is the model's predicted probability for the $i$-th pixel.
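For reference, Equation (9) corresponds directly to the following pixel-averaged binary cross-entropy; the clamping constant is added for numerical stability and is an assumption.

```python
import torch

def seg_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy of Equation (9).
    pred: predicted probabilities in (0, 1); target: binary ground-truth mask of the same shape."""
    pred = pred.clamp(eps, 1 - eps)                                       # avoid log(0)
    bce = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    return bce.mean()                                                     # average over N = H x W pixels
```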

4. Experiments

4.1. Setup

Datasets To validate the generality of our approach, we use the Industrial-5i dataset compiled in [43]. This dataset includes images from 20 different industrial scenes sourced from MVtec-AD [44], KolektorSDD [45], Magnetic Tile Defect [46], RSDDs [47], and BSData [48]. The Industrial-5i dataset was divided into four folds, and the network was trained and validated using a cross-validation approach. Specifically, one fold of the scenes was selected as the test set to evaluate the model’s generalization capability, while the remaining scene categories served as the training set to train the network model. During training, the training set data underwent data augmentation to expand to five times its original size. Detailed information about the dataset can be found in Table 1.
Evaluation Metrics Following prior research [43,49], we adopted mIoU as the primary evaluation metric for our experiments. mIoU is a widely recognized metric in the field of image segmentation, used to quantify the overlap between the predicted segmentation and the ground truth segmentation. For a single category, the intersection-over-union (IoU) is defined as the ratio of the area of intersection between the predicted segmentation and the ground truth mask to the area of their union. mIoU is the average of IoU across all categories. The formula for calculating mIoU is as follows,
$\mathrm{mIoU} = \frac{1}{2}\sum_{c=0}^{1}\frac{TP_c}{TP_c + FP_c + FN_c}.$ (10)
Here, mIoU represents the mean intersection-over-union over the two classes (0 and 1). $TP_c$, $FP_c$, and $FN_c$ denote the True Positives, False Positives, and False Negatives for class $c$, respectively. The IoU for each class is computed as the ratio of the intersection area to the union area between the predicted and ground truth labels, and the mIoU is the average of these two IoU values. The mIoU ranges from 0 to 1, with higher values indicating superior segmentation performance in binary semantic segmentation tasks.
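A minimal NumPy implementation of Equation (10) is shown below; the handling of an absent class (returning an IoU of 1.0 when a class appears in neither prediction nor ground truth) is a convention we assume, not specified in the text.

```python
import numpy as np

def binary_miou(pred, gt):
    """Mean IoU over background (0) and defect (1) classes, per Equation (10).
    pred, gt: binary arrays of identical shape."""
    ious = []
    for c in (0, 1):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)  # absent class counted as perfect
    return float(np.mean(ious))
```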
Implementation details We use the loss function in Equation (9) to supervise the training of our network SAID. To address the large number of parameters in the Image Encoder, we introduce the Lightweight EfficientSAM [50] Image Encoder (based on different versions of ViT) for comparative experiments. During training, the parameters of the Image Encoder remain frozen and are not updated. All experiments use the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$ and a batch size of 16. We employ a warm-up learning rate strategy combined with cosine decay [51]. The experiments are conducted on an NVIDIA RTX 3090Ti GPU using the PyTorch 2.1 framework.

4.2. Main Results

4.2.1. Cross-Scene One-Shot Segmentation

For the cross-scene one-shot segmentation task, we conduct four sets of cross-validation experiments on the Industrial-5i dataset mentioned in Section 4.1. The Industrial-5i dataset consists of four folds, and we sequentially select one fold as the validation set, with the remaining three folds serving as the training set. The training set underwent data augmentation, expanding its size to five times the original. To investigate the impact of different backbone models on the segmentation performance of SAID, we study the Image Encoder of SAM in two versions: ViT-L and ViT-B. Additionally, following the work of EfficientSAM [50] on distilling a Lightweight version of SAM, we also examine the Image Encoder of EfficientSAM in two versions: ViT-T and ViT-S. Table 2 presents the experimental results on the Industrial-5i dataset, with SAID compared to advanced single-annotation prompt segmentation models. Among them, FSS-1000 [52] serves as an important benchmark dataset in the few-shot segmentation field, providing a unified platform for evaluating the generalization performance of subsequent models. Based on this task, MMNet [23] enhances the semantic correlation between support and query images by introducing a multi-level memory mechanism, improving the model's contextual understanding ability. MSNet [25] emphasizes the fusion and interaction of multi-scale features to adapt to the diversity of target sizes and shapes. SegGPT [26] transforms image segmentation tasks into sequence-to-sequence generation problems, achieving strong generalization ability under zero- and few-shot conditions through pixel-level modeling of image content. The results show that SAID performs exceptionally well in industrial defect detection, achieving higher mIoU metrics compared to other one-shot segmentation methods. Under the one-shot setting, our method improved the mIoU by 5.79 and 0.87 compared to MSNet [25] and SegGPT [26], respectively. The performance comparison between them can be clearly seen in Figure 7.
To evaluate the inference efficiency of the models, we measured their inference time on the RTX 3090 Ti, as summarized in Table 3. It can be observed that SAID requires a relatively long time for full inference. This is primarily due to its Image Encoder, which is based on a computationally intensive Vision Transformer (ViT). However, if the input image is pre-encoded into an image embedding, the inference time per image can be reduced to only 15–20 ms, which is comparable to that of the other methods.
Traditional one-shot or few-shot segmentation methods typically rely on comparison with a support image to locate the target (appearance defects) in the query image. The core task of these methods is to learn features from one or a few normal scene images for defect detection in the query image. In contrast, SAID outperforms other one-shot segmentation models because it leverages scene-consistent prompts derived from the same industrial environment, enabling better alignment between support and query images. This reduces domain gaps and improves defect localization accuracy. Figure 8 demonstrates the defect segmentation results of our method on some industrial images. Even in situations with complex and varied image textures and colors, our method can generate accurate defect segmentation maps, showing its effectiveness and robustness in practical applications.

4.2.2. Supervised Experiment

The goal of one-shot learning is to train a model that can quickly adapt to new categories. However, due to the limited amount of training data in one-shot learning, the model may be at risk of overfitting. To further validate the capabilities of SAID, we conduct a supervised experiment to verify the model's performance on specific data (single scene category data). We perform supervised training and testing on 15 industrial scenes from MVtec-AD [44], with 80% of the images from each industrial scene category used for training and the remaining 20% for testing. We compare our model with three prompting modes of SAM, and the results are shown in Table 4. Compared with SAM, SAID increases the mIoU metric to 0.725, whereas SAM with strong human-prior box prompting achieves 0.635. The final results are shown in Figure 9.
In Table 4, the everything mode used by SAM generates a uniform grid of point prompts ($m \times m$) on the image to be detected, where $m$ is set to 32 by default; the point mode selects a positive prompt point (a prompt point in the defect area); and the box mode uses the bounding rectangle of the GT as the prompt box, which is a very precise human prior. It can be clearly seen from Table 4 that, except for a few scenes (such as capsule and transistor), our model's segmentation capability in specific scenes can basically reach or even surpass that of SAM with strong human prior knowledge in box mode.
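For reference, the everything-mode baseline can be reproduced with the publicly released segment-anything package roughly as follows; the checkpoint path and variable names are placeholders, and post-processing to select the defect mask is still required, as discussed in Section 4.3.

```python
# Hedged sketch of SAM's "everything" mode: a uniform 32 x 32 grid of point prompts
# produces many candidate masks that must be filtered afterwards.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")        # checkpoint path is assumed
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)  # m = 32 grid of points
masks = mask_generator.generate(image)  # image: H x W x 3 uint8 array; returns a list of mask dicts
```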
To further evaluate the practical usability of our proposed method, we compare the interaction efforts required by SAM and SAID in generating segmentation masks. For SAM, producing an accurate segmentation result generally demands manual input for each image. In point-based prompting, users typically need to provide 3–5 point clicks, and the interaction process takes approximately 8–12 s per image, depending on the complexity of the defect and the precision of the user input. In box-based prompting, segmentation can often be achieved with 1–3 bounding boxes, requiring around 3–5 s per image. However, both modes still involve manual interaction and iterative refinement to achieve acceptable segmentation quality.

4.3. Ablation Experiment

Following the conclusions drawn from Section 4.2, we conduct ablation experiments, summarized in Table 5, to explore the impact of the Scene Encoder module and the Feature Alignment and Fusion Module on the performance of the SAID model. The experimental data used the four folds from Table 1, and the evaluation metric was the mIoU of the model's one-shot segmentation results. Among them, $SAM_{Everything}$ is the everything mode of SAM, the same as in Section 4.2.2. Using this segmentation method, the model outputs a large number of masks, and we select the mask with the highest overlap with the GT as the final output for comparison. FT (no prompt) removes the prompt encoder from SAM and fine-tunes the Mask Decoder using the industrial image data from the remaining three folds to adapt it to the distribution of industrial images. At the same time, it specifies the output of a single mask, avoiding multiple outputs and complex post-processing operations. The Scene Encoder proposed in this paper provides scene information to assist in segmentation through a pair of example images. Finally, on this basis, the Feature Alignment and Fusion Module (FA-F) is added.
From Table 5, it can be seen that although the "$SAM_{Everything}$ + Post-processing" mode achieved a high mIoU in some cases (such as on the fold1 data), the images segmented by $SAM_{Everything}$ have multiple mask outputs, and the best mask must be selected by comparing multiple outputs with the GT. However, in real detection scenarios there is no GT available, so this method is not applicable. In contrast, our model benefits from the Scene Encoder and Lightweight Fusion: its detection performance exceeds that of the $SAM_{Everything}$ mode while also achieving automatic detection without subsequent post-processing operations, making it a more practical detection method.
The efficient fusion of $E_{image}$ and $E_{scene}$ is crucial for SAID to enhance the model's segmentation performance. This paper compares the impact of different fusion mechanisms on model performance, primarily contrasting: (a) Concat Fusion; (b) Attention Fusion; and (c) Lightweight Fusion. These three fusion mechanisms are illustrated in Figure 6. Concat Fusion involves concatenating $E_{image}$ and $E_{scene}$ and then passing them through a convolutional module for fusion output. Attention Fusion employs a cross-attention mechanism, using $E_{scene}$ as the query Q and $E_{image}$ as the key K and the value V, to fuse features by extracting useful information through cross-attention. Lightweight Fusion concatenates the features through a simple convolution and then fuses them through the 3D attention mechanism in Equation (7).
The results in Table 6 clearly demonstrate the superiority of the proposed Lightweight 3D Fusion Mechanism, which can be attributed to its design: it explicitly aligns the scene prompt features with the query image features in both spatial and semantic dimensions using a low-cost bottleneck structure. Unlike simple concatenation or element-wise addition, the 3D fusion incorporates both channel-wise and token-wise interactions, enhancing the feature consistency between prompt and query inputs. Therefore, this paper selects this fusion mechanism to fuse the $E_{image}$ and $E_{scene}$ embeddings, aiming to achieve efficient feature fusion and improve model performance.

5. Conclusions

This study addresses the limited generalization of traditional deep learning-based defect detection models and the practical shortcomings of the general-purpose SAM model in industrial downstream tasks, including its reliance on manual interaction, lack of automatic segmentation capability, and complex post-processing. To overcome these issues, we propose SAID—a single-annotation scene prompt segmentation framework. SAID integrates a Scene Encoder to extract contextual scene-level embeddings from annotated product image pairs and introduces a Lightweight Fusion module to effectively combine image and scene embeddings. By leveraging scene-specific prior knowledge and fine-tuning on multi-domain defect data, SAID achieves automatic and accurate segmentation across diverse industrial environments.
While SAID demonstrates strong segmentation performance, certain limitations remain. First, the model relies on a representative annotated prompt image from the same scene, which may not always be readily available in real-world online inspection scenarios involving unseen or evolving product types. Second, although the model eliminates the need for real-time human interaction, its inference speed is limited by the high computational cost of the frozen Image Encoder inherited from SAM, making it less suitable for latency-sensitive or edge device deployment. Additionally, SAID may experience performance degradation under severe occlusion, extreme lighting conditions, or highly ambiguous defect textures.
Future research will address these challenges by exploring a dynamic prompt selection mechanism capable of automatically identifying the most relevant support samples from a historical defect database, reducing reliance on manually chosen prompt images. We also aim to incorporate online adaptive learning to continuously refine the model as new product types or defect styles emerge. Furthermore, we plan to compress and distill the model to enable real-time deployment on edge devices in resource-constrained environments. A more comprehensive robustness evaluation will also be conducted to systematically assess performance under varied industrial conditions, including low-contrast defects, noisy backgrounds, and camera misalignments.

Author Contributions

Conceptualization, X.Z. and Y.D.; methodology, Y.H. and X.Z.; software, Y.H.; formal analysis, X.Z. and J.Z.; investigation, Y.H. and J.Z.; data curation, Y.H.; writing-original draft preparation, Y.H.; writing-review and editing, Y.H., J.Z., and X.Z.; visualization, Y.H.; supervision, X.Z. and Y.D.; project administration, X.Z.; funding acquisition, X.Z. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the General Project of the National Natural Science Foundation of China under Grant (62171288), Guangdong Province Innovation Team for Ordinary Universities (Natural Science) Project (2024KCXTD065) and Shenzhen Fundamental Research Fund (No. JCYJ20230808105212023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Acknowledgments

The authors would like to thank the anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  2. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  3. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  4. Tao, X.; Hou, W.; Xu, D. A survey of surface defect detection methods based on deep learning. Acta Autom. Sin. 2021, 47, 1017–1034. [Google Scholar]
  5. Li, S.; Yang, J.; Wang, Z.; Zhu, S.; Yang, G. Review of development and application of defect detection technology. Acta Autom. Sin. 2020, 46, 2319–2336. [Google Scholar]
  6. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar]
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  8. Mazurowski, M.A.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment anything model for medical image analysis: An experimental study. Med. Image Anal. 2023, 89, 102918. [Google Scholar] [CrossRef]
  9. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117. [Google Scholar] [CrossRef]
  10. Suresh, B.R.; Fundakowski, R.A.; Levitt, T.S.; Overland, J.E. A real-time automated visual inspection system for hot steel slabs. IEEE Trans. Pattern Anal. Mach. Intell. 1983, 6, 563–572. [Google Scholar] [CrossRef]
  11. Schael, M. Texture defect detection using invariant textural features. In Pattern Recognition: 23rd DAGM Symposium, Munich, Germany, 12–14 September 2001; Proceedings 23; Springer: Berlin/Heidelberg, Germany, 2001; pp. 17–24. [Google Scholar]
  12. Tsai, D.M.; Lin, C.P.; Huang, K.T. Defect detection in coloured texture surfaces using Gabor filters. Imaging Sci. J. 2005, 53, 27–37. [Google Scholar] [CrossRef]
  13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  14. Haselmann, M.; Gruber, D.P.; Tabatabai, P. Anomaly detection using deep learning based image completion. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1237–1242. [Google Scholar]
  15. Yang, M.; Wu, P.; Feng, H. MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835. [Google Scholar] [CrossRef]
  16. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  17. Ke, M.; Lin, C.; Huang, Q. Anomaly detection of Logo images in the mobile phone using convolutional autoencoder. In Proceedings of the 2017 4th International Conference on Systems and Informatics (ICSAI), Hangzhou, China, 11–13 November 2017; pp. 1163–1168. [Google Scholar]
  18. Lai, Y.T.K.; Hu, J.S. A texture generation approach for detection of novel surface defects. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 4357–4362. [Google Scholar]
  19. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
  20. Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
  21. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  22. Luo, S.; Li, Y.; Gao, P.; Wang, Y.; Serikawa, S. Meta-seg: A survey of meta-learning for image segmentation. Pattern Recognit. 2022, 126, 108586. [Google Scholar] [CrossRef]
  23. Wu, Z.; Shi, X.; Lin, G.; Cai, J. Learning meta-class memory for few-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 517–526. [Google Scholar]
  24. Lu, X.; Wang, W.; Ma, C.; Shen, J.; Shao, L.; Porikli, F. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3623–3632. [Google Scholar]
  25. Shi, X.; Cui, Z.; Zhang, S.; Cheng, M.; He, L.; Tang, X. Multi-similarity based hyperrelation network for few-shot segmentation. IET Image Process. 2023, 17, 204–214. [Google Scholar] [CrossRef]
  26. Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. Seggpt: Towards segmenting everything in context. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1130–1140. [Google Scholar]
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R.B. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15979–15988. [Google Scholar]
  31. Ji, W.; Li, J.; Bi, Q.; Li, W.; Cheng, L. Segment anything is not always perfect: An investigation of sam on different real-world applications. arXiv 2023, arXiv:2304.05750. [Google Scholar] [CrossRef]
  32. He, C.; Li, K.; Zhang, Y.; Xu, G.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Weakly-supervised concealed object segmentation with sam-based pseudo labeling and multi-scale feature grouping. In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
  33. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
  34. Chen, T.; Zhu, L.; Deng, C.; Cao, R.; Wang, Y.; Zhang, S.; Li, Z.; Sun, L.; Zang, Y.; Mao, P. Sam-adapter: Adapting segment anything in underperformed scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3367–3375. [Google Scholar]
  35. Wu, J.; Fu, R.; Fang, H.; Liu, Y.; Wang, Z.; Xu, Y.; Jin, Y.; Arbel, T. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv 2023, arXiv:2304.12620. [Google Scholar] [CrossRef]
  36. Hedlund, W. Zero-shot Segmentation for Change Detection: Change Detection in Synthetic Aperture Sonar Imagery Using Segment Anything Model. Master’s Thesis, Linköping University, Linköping, Sweden, 2024. [Google Scholar]
  37. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  38. Zou, X.; Yang, J.; Zhang, H.; Li, F.; Li, L.; Wang, J.; Wang, L.; Gao, J.; Lee, Y.J. Segment everything everywhere all at once. In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
  39. Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411. [Google Scholar]
  40. Batzner, K.; Heckler, L.; König, R. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 128–138. [Google Scholar]
  41. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  42. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  43. Shi, X.; Zhang, S.; Cheng, M.; He, L.; Tang, X.; Cui, Z. Few-shot semantic segmentation for industrial defect recognition. Comput. Ind. 2023, 148, 103901. [Google Scholar] [CrossRef]
  44. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
  45. Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2020, 31, 759–776. [Google Scholar] [CrossRef]
  46. Huang, Y.; Qiu, C.; Yuan, K. Surface defect saliency of magnetic tile. Vis. Comput. 2020, 36, 85–96. [Google Scholar] [CrossRef]
  47. Gan, J.; Li, Q.; Wang, J.; Yu, H. A hierarchical extractor-based visual rail surface inspection system. IEEE Sens. J. 2017, 17, 7935–7944. [Google Scholar] [CrossRef]
  48. Schlagenhauf, T.; Landwehr, M. Industrial machine tool component surface defect dataset. Data Brief 2021, 39, 107643. [Google Scholar] [CrossRef]
  49. Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065. [Google Scholar] [CrossRef]
  50. Xiong, Y.; Varadarajan, B.; Wu, L.; Xiang, X.; Xiao, F.; Zhu, C.; Dai, X.; Wang, D.; Sun, F.; Iandola, F.; et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16111–16121. [Google Scholar]
  51. Antoniou, A.; Edwards, H.; Storkey, A. How to train your MAML. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  52. Li, X.; Wei, T.; Chen, Y.P.; Tai, Y.W.; Tang, C.K. Fss-1000: A 1000-class dataset for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2869–2878. [Google Scholar]
Figure 1. Multiple products in the industrial field have various defects.
Figure 2. Illustration of the differences between traditional methods (a), SAM-based methods (b), and the proposed SAID framework (c).
Figure 3. The SAID model architecture consists of an Image Encoder in the upper part, which is frozen and does not participate in training. The Scene Encoder in the lower part requires training.
Figure 4. Scene Encoder. The single-annotated prompt–image pair $I_{sample}$ and $M_{sample}$ is encoded to obtain the scene embedding $E_{scene}$.
Figure 5. Visualization of single-annotated prompt–image pairs $I_{sample}$, $M_{sample}$, and the scene embedding $E_{scene}$.
Figure 7. Visualization of mIoU performance for single segmentation on the Industrial-5i dataset.
Figure 8. Visual comparison of one-shot segmentation results of SAID and SegGPT on various industrial scenes.
Figure 9. Performance results of SAID on various industrial datasets.
Table 1. Classes and the original corresponding number of images in the industry-5i dataset.
Fold1: Wood, Pill, BSD, Railway, Toothbrush
601414269430
Fold2: Leather, Mutou, Metal-Nut, Kolektor-SSD2, Bottle
9218387043663
Fold3: Carpet, Hazelnut, Phone, Tile, Grid
89701008457
Fold4: Magnetic Tile, Capsule, Cable, Kolektor-SSD, Zipper
39210992522119
Table 2. The mIoU performance of one-shot segmentation on Industrial-5i dataset.
Methods                  Fold1   Fold2   Fold3   Fold4   Mean
FSS-1000 [52]            10.37   13.23    8.54    7.11    9.81
MMNet [23]               16.59   31.66   22.12   16.55   21.73
MSNet [25]               21.25   31.98   29.24   14.18   24.16
SegGPT [26]              31.16   22.98   28.69   33.47   29.08
SAID (EfficientSAM-T)    24.67   27.69   27.66   20.41   25.61
SAID (EfficientSAM-S)    26.01   27.42   28.72   24.37   26.13
SAID (SAM-B)             25.79   26.68   29.64   26.09   26.80
SAID (SAM-L)             27.49   28.24   29.94   34.17   29.96
Table 3. Model comparison on RTX 3090 Ti.
Model                 Inference Time (ms)   Params (M)   Architecture Characteristics
FSS-1000              30–60                 45           ResNet-101 backbone with prototype matching
MMNet                 10–20                 2.1          Lightweight multi-scale CNN with attention
MSNet                 8–15                  10           Multi-scale autoencoder with memory bank
SAID (SAM-B)          250–350               91           Vision Transformer-Base with mask decoder
SAID (SAM-L)          500–800               308          Vision Transformer-Large with mask decoder
SAID (pre-encoded)    15–20                 308          Vision Transformer-Huge with mask decoder
Table 4. SAID performance in supervised training on the MVtec dataset, with mIoU as the evaluation metric. “Human” indicates whether manual interaction is required.
Category       SAM Everything      SAM Point           SAM Box             SAID (ours)
               mIoU     Human      mIoU     Human      mIoU     Human      mIoU     Human
bottle         0.298    N          0.489    Y          0.675    Y          0.831    N
cable          0.410    N          0.560    Y          0.676    Y          0.589    N
capsule        0.316    N          0.444    Y          0.562    Y          0.482    N
carpet         0.045    N          0.296    Y          0.475    Y          0.673    N
grid           0.144    N          0.265    Y          0.526    Y          0.494    N
hazelnut       0.439    N          0.589    Y          0.705    Y          0.891    N
leather        0.291    N          0.485    Y          0.631    Y          0.762    N
metal_nut      0.355    N          0.696    Y          0.671    Y          0.894    N
pill           0.374    N          0.570    Y          0.743    Y          0.703    N
screw          0.208    N          0.455    Y          0.635    Y          0.797    N
tile           0.337    N          0.730    Y          0.726    Y          0.835    N
toothbrush     0.263    N          0.446    Y          0.735    Y          0.877    N
transistor     0.324    N          0.325    Y          0.445    Y          0.339    N
wood           0.176    N          0.325    Y          0.650    Y          0.834    N
zipper         0.149    N          0.257    Y          0.588    Y          0.806    N
Mean           0.279    N          0.468    Y          0.635    Y          0.725    N
Table 5. Ablation experiment results. FT stands for the fine-tuned Mask Decoder, and FA-F refers to the proposed Feature Alignment and Fusion Module.
Configuration        Fold1   Fold2   Fold3   Fold4   Mean
SAM_Everything       33.70   24.86   24.36   20.01   25.73
FT (no prompt)       18.65   20.42   26.43   18.42   20.98
Scene Encoder        23.74   24.12   29.58   25.16   25.65
Ours                 27.49   28.24   29.94   34.17   29.29
Table 6. The impact of various fusion modules on model performance, with the evaluation metric being mIoU.
Fusion Modules        Fold1   Fold2   Fold3   Fold4   Mean
Concat Fusion         22.65   24.43   30.54   23.13   25.19
Attention Fusion      25.62   25.61   29.11   26.69   26.76
Lightweight Fusion    27.49   28.24   29.94   34.17   29.95
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
