SAID: Segment All Industrial Defects with Scene Prompts
Abstract
1. Introduction
- 1. We propose SAID for defect segmentation in industrial images. SAID eliminates the reliance on human priors and requires no complex post-processing, achieving automatic defect segmentation.
- 2. We design the Scene Encoder, which encodes a set of user-input product images together with their annotation masks into a scene embedding, enhancing the model's segmentation capability. To address the misalignment between the features output by the Scene Encoder and the Image Encoder, a Lightweight Feature Alignment and Fusion Module is designed.
- 3. Experiments on multiple industrial scene datasets show that our SAID model performs strongly under both one-shot and supervised settings.
2. Related Work
2.1. Surface Defect Detection
2.2. Few-Shot Image Segmentation Methods
2.3. Fundamental Visual Segmentation Model
3. Methods
3.1. Overview of the SAID Architecture
- (1). The image to be detected is encoded by the Image Encoder to produce the image embedding.
- (2). The Scene Encoder encodes a pair consisting of a product image and its corresponding mask image, drawn from the same scene as the image to be detected, into the scene embedding.
- (3). The image embedding and the scene embedding are fused through the designed Feature Alignment and Fusion Module and then fed into the Mask Decoder for mask prediction, yielding the segmentation result (see the sketch after this list).
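A minimal PyTorch-style sketch of this three-step pipeline follows. The module names and signatures (ImageEncoder, SceneEncoder, and so on) are illustrative assumptions rather than the authors' exact implementation; the structure simply mirrors steps (1)-(3) above.

```python
import torch.nn as nn

class SAID(nn.Module):
    """Sketch of the SAID pipeline; submodule internals are assumptions."""

    def __init__(self, image_encoder, scene_encoder, fusion, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder  # e.g., a SAM ViT backbone
        self.scene_encoder = scene_encoder  # encodes (product image, mask) pairs
        self.fusion = fusion                # Feature Alignment and Fusion Module
        self.mask_decoder = mask_decoder    # SAM-style mask decoder

    def forward(self, query_image, support_image, support_mask):
        # (1) Encode the image to be detected into the image embedding.
        image_emb = self.image_encoder(query_image)
        # (2) Encode a same-scene (product image, mask) pair into the scene embedding.
        scene_emb = self.scene_encoder(support_image, support_mask)
        # (3) Align and fuse the two embeddings, then decode the defect mask.
        fused = self.fusion(image_emb, scene_emb)
        return self.mask_decoder(fused)
```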
3.2. Scene Encoder
3.3. Feature Alignment and Fusion Module
3.4. Loss Function
4. Experiments
4.1. Setup
4.2. Main Results
4.2.1. Cross-Scene One-Shot Segmentation
4.2.2. Supervised Experiment
4.3. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Tao, X.; Hou, W.; Xu, D. A survey of surface defect detection methods based on deep learning. Acta Autom. Sin. 2021, 47, 1017–1034.
- Li, S.; Yang, J.; Wang, Z.; Zhu, S.; Yang, G. Review of development and application of defect detection technology. Acta Autom. Sin. 2020, 46, 2319–2336.
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026.
- Mazurowski, M.A.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment anything model for medical image analysis: An experimental study. Med. Image Anal. 2023, 89, 102918.
- Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
- Suresh, B.R.; Fundakowski, R.A.; Levitt, T.S.; Overland, J.E. A real-time automated visual inspection system for hot steel slabs. IEEE Trans. Pattern Anal. Mach. Intell. 1983, 6, 563–572.
- Schael, M. Texture defect detection using invariant textural features. In Pattern Recognition: 23rd DAGM Symposium, Munich, Germany, 12–14 September 2001; Proceedings 23; Springer: Berlin/Heidelberg, Germany, 2001; pp. 17–24.
- Tsai, D.M.; Lin, C.P.; Huang, K.T. Defect detection in coloured texture surfaces using Gabor filters. Imaging Sci. J. 2005, 53, 27–37.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Haselmann, M.; Gruber, D.P.; Tabatabai, P. Anomaly detection using deep learning based image completion. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1237–1242.
- Yang, M.; Wu, P.; Feng, H. MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835.
- Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
- Ke, M.; Lin, C.; Huang, Q. Anomaly detection of Logo images in the mobile phone using convolutional autoencoder. In Proceedings of the 2017 4th International Conference on Systems and Informatics (ICSAI), Hangzhou, China, 11–13 November 2017; pp. 1163–1168.
- Lai, Y.T.K.; Hu, J.S. A texture generation approach for detection of novel surface defects. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 4357–4362.
- Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346.
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Luo, S.; Li, Y.; Gao, P.; Wang, Y.; Serikawa, S. Meta-seg: A survey of meta-learning for image segmentation. Pattern Recognit. 2022, 126, 108586.
- Wu, Z.; Shi, X.; Lin, G.; Cai, J. Learning meta-class memory for few-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 517–526.
- Lu, X.; Wang, W.; Ma, C.; Shen, J.; Shao, L.; Porikli, F. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3623–3632.
- Shi, X.; Cui, Z.; Zhang, S.; Cheng, M.; He, L.; Tang, X. Multi-similarity based hyperrelation network for few-shot segmentation. IET Image Process. 2023, 17, 204–214.
- Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. SegGPT: Towards segmenting everything in context. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1130–1140.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R.B. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15979–15988.
- Ji, W.; Li, J.; Bi, Q.; Li, W.; Cheng, L. Segment anything is not always perfect: An investigation of SAM on different real-world applications. arXiv 2023, arXiv:2304.05750.
- He, C.; Li, K.; Zhang, Y.; Xu, G.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Weakly-supervised concealed object segmentation with SAM-based pseudo labeling and multi-scale feature grouping. In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36.
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799.
- Chen, T.; Zhu, L.; Deng, C.; Cao, R.; Wang, Y.; Zhang, S.; Li, Z.; Sun, L.; Zang, Y.; Mao, P. SAM-Adapter: Adapting Segment Anything in underperformed scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3367–3375.
- Wu, J.; Fu, R.; Fang, H.; Liu, Y.; Wang, Z.; Xu, Y.; Jin, Y.; Arbel, T. Medical SAM Adapter: Adapting Segment Anything Model for medical image segmentation. arXiv 2023, arXiv:2304.12620.
- Hedlund, W. Zero-Shot Segmentation for Change Detection: Change Detection in Synthetic Aperture Sonar Imagery Using Segment Anything Model. Master's Thesis, Linköping University, Linköping, Sweden, 2024.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
- Zou, X.; Yang, J.; Zhang, H.; Li, F.; Li, L.; Wang, J.; Wang, L.; Gao, J.; Lee, Y.J. Segment everything everywhere all at once. In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; Volume 36.
- Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. SimpleNet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411.
- Batzner, K.; Heckler, L.; König, R. EfficientAD: Accurate visual anomaly detection at millisecond-level latencies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 128–138.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874.
- Shi, X.; Zhang, S.; Cheng, M.; He, L.; Tang, X.; Cui, Z. Few-shot semantic segmentation for industrial defect recognition. Comput. Ind. 2023, 148, 103901.
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600.
- Tabernik, D.; Šela, S.; Skvarč, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2020, 31, 759–776.
- Huang, Y.; Qiu, C.; Yuan, K. Surface defect saliency of magnetic tile. Vis. Comput. 2020, 36, 85–96.
- Gan, J.; Li, Q.; Wang, J.; Yu, H. A hierarchical extractor-based visual rail surface inspection system. IEEE Sens. J. 2017, 17, 7935–7944.
- Schlagenhauf, T.; Landwehr, M. Industrial machine tool component surface defect dataset. Data Brief 2021, 39, 107643.
- Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065.
- Xiong, Y.; Varadarajan, B.; Wu, L.; Xiang, X.; Xiao, F.; Zhu, C.; Dai, X.; Wang, D.; Sun, F.; Iandola, F.; et al. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16111–16121.
- Antoniou, A.; Edwards, H.; Storkey, A. How to train your MAML. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Li, X.; Wei, T.; Chen, Y.P.; Tai, Y.W.; Tang, C.K. FSS-1000: A 1000-class dataset for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2869–2878.
Fold | Categories (Number of Images) |
---|---|
Fold1 | Wood (60), Pill (141), BSD (426), Railway (94), Toothbrush (30) |
Fold2 | Leather (92), Mutou (1838), Metal-Nut (70), Kolektor-SSD2 (436), Bottle (63) |
Fold3 | Carpet (89), Hazelnut (70), Phone (100), Tile (84), Grid (57) |
Fold4 | Magnetic Tile (392), Capsule (109), Cable (92), Kolektor-SSD (522), Zipper (119) |
Methods | Fold1 | Fold2 | Fold3 | Fold4 | Mean |
---|---|---|---|---|---|
FSS-1000 [52] | 10.37 | 13.23 | 8.54 | 7.11 | 9.81 |
MMNet [23] | 16.59 | 31.66 | 22.12 | 16.55 | 21.73 |
MSNet [25] | 21.25 | 31.98 | 29.24 | 14.18 | 24.16 |
SegGPT [26] | 31.16 | 22.98 | 28.69 | 33.47 | 29.08 |
SAID (EfficientSAM-T) | 24.67 | 27.69 | 27.66 | 20.41 | 25.61 |
SAID (EfficientSAM-S) | 26.01 | 27.42 | 28.72 | 24.37 | 26.13 |
SAID (SAM-B) | 25.79 | 26.68 | 29.64 | 26.09 | 26.80 |
SAID (SAM-L) | 27.49 | 28.24 | 29.94 | 34.17 | 29.96 |
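The scores in this table appear to be mIoU in percent, with the Mean column averaging over the four folds. For reference, a minimal sketch of the standard per-image IoU and mean-IoU computation (an assumption about the usual metric, not the authors' evaluation script):

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary masks; returns 1.0 when both masks are empty."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def miou(pairs) -> float:
    """Mean IoU over an iterable of (predicted mask, ground-truth mask) pairs."""
    return float(np.mean([iou(p, g) for p, g in pairs]))
```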
Model | Inference Time (ms) | Params (M) | Architecture Characteristics |
---|---|---|---|
FSS-1000 | 30–60 | 45 | ResNet-101 backbone with prototype matching |
MMNet | 10–20 | 2.1 | Lightweight multi-scale CNN with attention |
MSNet | 8–15 | 10 | Multi-scale autoencoder with memory bank |
SAID (SAM-B) | 250–350 | 91 | Vision Transformer-Base with mask decoder |
SAID (SAM-L) | 500–800 | 308 | Vision Transformer-Large with mask decoder |
SAID (pre-encoded) | 15–20 | 308 | Vision Transformer-Huge with mask decoder |
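The "pre-encoded" row suggests that most of SAID's latency comes from the heavy ViT encoders rather than the fusion module and decoder. One plausible reading, sketched below under that assumption, is that the embeddings are computed ahead of time (the scene embedding depends only on the support pair, so it can always be cached) and only the lightweight stages run per query. All names here are illustrative, reusing the SAID-style module from the earlier sketch.

```python
import torch

def precompute_embeddings(model, support_image, support_mask, query_images):
    """Run the heavy encoders once, offline; 'model' is a SAID-style module."""
    with torch.no_grad():
        scene_emb = model.scene_encoder(support_image, support_mask)  # cache once
        image_embs = [model.image_encoder(x) for x in query_images]
    return scene_emb, image_embs

def fast_inference(model, scene_emb, image_embs):
    """Per-query cost is only the fusion module and the mask decoder."""
    with torch.no_grad():
        return [model.mask_decoder(model.fusion(e, scene_emb)) for e in image_embs]
```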
Category | mIoU | Human | mIoU | Human | mIoU | Human | mIoU | Human |
---|---|---|---|---|---|---|---|---|
bottle | 0.298 | N | 0.489 | Y | 0.675 | Y | 0.831 | N |
cable | 0.410 | N | 0.560 | Y | 0.676 | Y | 0.589 | N |
capsule | 0.316 | N | 0.444 | Y | 0.562 | Y | 0.482 | N |
carpet | 0.045 | N | 0.296 | Y | 0.475 | Y | 0.673 | N |
grid | 0.144 | N | 0.265 | Y | 0.526 | Y | 0.494 | N |
hazelnut | 0.439 | N | 0.589 | Y | 0.705 | Y | 0.891 | N |
leather | 0.291 | N | 0.485 | Y | 0.631 | Y | 0.762 | N |
metal_nut | 0.355 | N | 0.696 | Y | 0.671 | Y | 0.894 | N |
pill | 0.374 | N | 0.570 | Y | 0.743 | Y | 0.703 | N |
screw | 0.208 | N | 0.455 | Y | 0.635 | Y | 0.797 | N |
tile | 0.337 | N | 0.730 | Y | 0.726 | Y | 0.835 | N |
toothbrush | 0.263 | N | 0.446 | Y | 0.735 | Y | 0.877 | N |
transistor | 0.324 | N | 0.325 | Y | 0.445 | Y | 0.339 | N |
wood | 0.176 | N | 0.325 | Y | 0.650 | Y | 0.834 | N |
zipper | 0.149 | N | 0.257 | Y | 0.588 | Y | 0.806 | N |
Mean | 0.279 | N | 0.468 | Y | 0.635 | Y | 0.725 | N |
Method | FT | Scene Encoder | FA-F | Fold1 | Fold2 | Fold3 | Fold4 | Mean |
---|---|---|---|---|---|---|---|---|
– | ✗ | ✗ | ✗ | 33.70 | 24.86 | 24.36 | 20.01 | 25.73 |
FT (No prompt) | ✓ | ✗ | ✗ | 18.65 | 20.42 | 26.43 | 18.42 | 20.98 |
Scene Encoder | ✓ | ✓ | ✗ | 23.74 | 24.12 | 29.58 | 25.16 | 25.65 |
Ours | ✓ | ✓ | ✓ | 27.49 | 28.24 | 29.94 | 34.17 | 29.96 |
Fusion Modules | Fold1 | Fold2 | Fold3 | Fold4 | Mean |
---|---|---|---|---|---|
Concat Fusion | 22.65 | 24.43 | 30.54 | 23.13 | 25.19 |
Attention Fusion | 25.62 | 25.61 | 29.11 | 26.69 | 26.76 |
Lightweight Fusion | 27.49 | 28.24 | 29.94 | 34.17 | 29.96 |
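To make the three rows concrete, here are hedged sketches of what each fusion strategy could look like. The paper's exact layer choices are not reproduced here, so every class below (ConcatFusion, AttentionFusion, LightweightFusion) is an illustrative stand-in operating on (B, C, H, W) embeddings.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Channel-concatenate the two embeddings, then mix with a 1x1 conv."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, image_emb, scene_emb):
        return self.proj(torch.cat([image_emb, scene_emb], dim=1))

class AttentionFusion(nn.Module):
    """Cross-attention: image tokens query the scene tokens (dim % heads == 0)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_emb, scene_emb):
        b, c, h, w = image_emb.shape
        q = image_emb.flatten(2).transpose(1, 2)   # (B, HW, C)
        kv = scene_emb.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

class LightweightFusion(nn.Module):
    """Depthwise-conv alignment of the scene embedding, then a gated residual add."""
    def __init__(self, dim):
        super().__init__()
        self.align = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, kernel_size=1), nn.Sigmoid())

    def forward(self, image_emb, scene_emb):
        aligned = self.align(scene_emb)
        g = self.gate(torch.cat([image_emb, aligned], dim=1))
        return image_emb + g * aligned
```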
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).