Improved Test Input Prioritization Using Verification Monitors with False Prediction Cluster Centroids
Abstract
1. Introduction
- We propose a new scoring function for test input prioritization (TIP) based on runtime verification monitors; it scores an unlabeled test instance by its distance to false prediction cluster (FPC) centroids extracted from the training data (see the sketch after this list).
- We analyze the proposed DeepFPC method in terms of the intra-class feature compactness and inter-class feature separability of the training set.
- The proposed DeepFPC approach significantly outperforms existing TIP methods in image classification and active learning across several datasets.
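As a rough illustration of this centroid-distance scoring idea, the sketch below scores one unlabeled instance against precomputed FPC centroids. This is not the authors' implementation: the names (`score_instance`, `fpc_centroids`) are hypothetical, and it assumes per-class FPC centroids have already been mined from the training set's feature space.

```python
import numpy as np

def score_instance(feature: np.ndarray, fpc_centroids: np.ndarray) -> float:
    """Hypothetical TIP score: distance from a test instance's feature
    vector to the nearest false-prediction-cluster (FPC) centroid.
    A smaller distance suggests the instance resembles training inputs
    the model got wrong, so it should be tested/labeled first.

    feature:       (d,) feature vector of one unlabeled test input
    fpc_centroids: (k, d) centroids of FPCs mined from training data
    """
    dists = np.linalg.norm(fpc_centroids - feature, axis=1)  # Euclidean distance to each centroid
    return float(dists.min())  # lower score = higher test priority

# Toy usage: two FPC centroids in a 3-D feature space.
centroids = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
x = np.array([0.1, 0.0, 0.9])
print(score_instance(x, centroids))
```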
2. Methods
2.1. Runtime Verification Monitor
2.2. DeepFPC
2.3. Intra-Class Feature Compactness and Inter-Class Feature Separability
3. Experimental Results
3.1. Experimental Setups
3.1.1. Datasets and Models
3.1.2. Compared Methods
3.1.3. Evaluation Metrics
3.2. Comparisons between Different TIP Methods from the Perspective of Identifying Test Instances with Errors
3.3. Inference Time Comparisons with Different TIP Methods
3.4. Sample Visualization
4. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yang, W.; Gao, H.; Jiang, Y.; Zhang, X. A Novel Approach to Maritime Image Dehazing Based on a Large Kernel Encoder–Decoder Network with Multihead Pyramids. Electronics 2022, 11, 3351.
- Dhanya, V.; Subeesh, A.; Kushwaha, N.; Vishwakarma, D.K.; Kumar, T.N.; Ritika, G.; Singh, A. Deep Learning Based Computer Vision Approaches for Smart Agricultural Applications. Artif. Intell. Agric. 2022, 6, 211–229.
- Ahmed, W.; Morerio, P.; Murino, V. Cleaning Noisy Labels by Negative Ensemble Learning for Source-Free Unsupervised Domain Adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa, HI, USA, 4–8 January 2022; pp. 1616–1625.
- Vardi, G. On the Implicit Bias in Deep-Learning Algorithms. Commun. ACM 2023, 66, 86–93.
- Ma, L.; Juefei-Xu, F.; Zhang, F.; Sun, J.; Xue, M.; Li, B.; Chen, C.; Su, T.; Li, L.; Liu, Y. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), Montpellier, France, 3–7 September 2018; pp. 120–131.
- Yuan, Y.; Pang, Q.; Wang, S. Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion. In Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE 2023), Melbourne, Australia, 14–20 May 2023; pp. 1200–1212.
- Yan, R.; Chen, Y.; Gao, H.; Yan, J. Test Case Prioritization with Neuron Valuation Based Pattern. Sci. Comput. Program. 2022, 215, 102761.
- Kim, J.; Feldt, R.; Yoo, S. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE 2019), Montreal, QC, Canada, 24–31 May 2019; pp. 1039–1049.
- Kim, S.; Yoo, S. Multimodal Surprise Adequacy Analysis of Inputs for Natural Language Processing DNN Models. In Proceedings of the IEEE/ACM International Conference on Automation of Software Test (AST 2021), Madrid, Spain, 20–21 May 2021; pp. 80–89.
- Feng, Y.; Shi, Q.; Gao, X.; Wan, J.; Fang, C.; Chen, Z. DeepGini: Prioritizing Massive Tests to Enhance the Robustness of Deep Neural Networks. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2020), Virtual, 18–22 July 2020; pp. 177–188.
- Ma, W.; Papadakis, M.; Tsakmalis, A.; Cordy, M.; Traon, Y.L. Test Selection for Deep Learning Systems. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2021, 30, 1–22.
- Wang, Z.; You, H.; Chen, J.; Zhang, Y.; Dong, X.; Zhang, W. Prioritizing Test Inputs for Deep Neural Networks via Mutation Analysis. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering (ICSE 2021), Virtual, 25–28 May 2021; pp. 397–409.
- Li, Y.; Li, M.; Lai, Q.; Liu, Y.; Xu, Q. TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 20874–20886.
- Al-Qadasi, H.; Wu, C.; Falcone, Y.; Bensalem, S. DeepAbstraction: 2-Level Prioritization for Unlabeled Test Inputs in Deep Neural Networks. In Proceedings of the IEEE International Conference on Artificial Intelligence Testing (AITest 2022), Newark, CA, USA, 15–18 August 2022; pp. 64–71.
- Cheng, C.; Nührenberg, G.; Yasuoka, H. Runtime Monitoring Neuron Activation Patterns. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE 2019), Florence, Italy, 25–29 March 2019; pp. 300–303.
- Wu, C.; Falcone, Y.; Bensalem, S. Customizable Reference Runtime Monitoring of Neural Networks Using Resolution Boxes. In Proceedings of the 23rd International Conference on Runtime Verification (RV 2023), Thessaloniki, Greece, 3–6 October 2023; pp. 23–41.
- Liu, W.; Yu, L.; Weller, A.; Schölkopf, B. Generalizing and Decoupling Neural Collapse via Hyperspherical Uniformity Gap. arXiv 2023, arXiv:2303.06484.
- Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Process. Mag. 2012, 29, 141–142.
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master's Thesis, University of Toronto, Toronto, ON, Canada, 2009.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.
- Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; pp. 10096–10106.
- Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv 2019, arXiv:1903.12261.
- Weiss, M.; Tonella, P. Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study). In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2022), Virtual, 18–22 July 2022; pp. 139–150.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual, 3–7 May 2021.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Hendrycks, D.; Zhao, K.; Basart, S.; Steinhardt, J.; Song, D. Natural Adversarial Examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual, 19–25 June 2021; pp. 15262–15271.
- Li, X.; Chen, Y.; Zhu, Y.; Wang, S.; Zhang, R.; Xue, H. ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 20371–20381.
ID | Dataset | Model | Train Acc (%) | Test Acc (%) | Intra-Class Compactness | Inter-Class Separability
---|---|---|---|---|---|---
A | CIFAR10 | MobileNetV2 | 99.51 | 93.38 | 0.99 | −0.114
B | CIFAR10 | ResNet50 | 98.69 | 93.81 | 0.98 | −0.105
C | MNIST | ResNet18 | 98.03 | 97.66 | 0.88 | −0.061
D | FMNIST | WideResNet50 | 94.01 | 87.70 | 0.86 | 0.001
E | CIFAR100 | EfficientNetV2-S | 99.52 | 88.20 | 0.84 | 0.002
F | CIFAR10-C | ResNet50 | - | 77.29 | - | -
G | MNIST-C | ResNet18 | - | 70.19 | - | -
H | FMNIST-C | WideResNet50 | - | 47.83 | - | -
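One common way to quantify intra-class compactness and inter-class separability, consistent with the value ranges in the table above, uses cosine similarity in feature space: similarity of each feature to its own class centroid (compactness, near 1 for tight clusters) and average similarity between distinct class centroids (separability, near or below 0 when classes are well separated). The sketch below is illustrative only; the paper's exact definitions and the function name may differ.

```python
import numpy as np

def compactness_separability(feats: np.ndarray, labels: np.ndarray):
    """Illustrative (not necessarily the paper's) cosine-based measures.
    Compactness:  mean cosine similarity of each feature to its class
                  centroid (higher = tighter clusters).
    Separability: mean cosine similarity between distinct class
                  centroids (lower/negative = better separated)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    # Intra-class: similarity of every feature to its own class centroid.
    sims = np.concatenate([feats[labels == c] @ centroids[i]
                           for i, c in enumerate(classes)])
    compactness = float(sims.mean())
    # Inter-class: average off-diagonal centroid-to-centroid similarity.
    sim_mat = centroids @ centroids.T
    k = len(classes)
    separability = float((sim_mat.sum() - np.trace(sim_mat)) / (k * (k - 1)))
    return compactness, separability

# Toy usage with random 16-D features over 5 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
labels = rng.integers(0, 5, size=100)
print(compactness_separability(feats, labels))
```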
ID | Gini [10] | DSA [8] | MLSA [9] | NBC [5] | NLC [6] | FD+ [7] | DA [14] | DeepFPC |
---|---|---|---|---|---|---|---|---|
A | 54.53 | 53.85 | 52.3 | 51.835 | 34.818 | 70.37 | 78.88 | 87.26 |
B | 52.62 | 47.58 | 48.48 | 56.071 | 40.109 | 35.46 | 71.61 | 78.81 |
C | 51.8 | 63.79 | 56.94 | 7.51 | 3.13 | 46.58 | 64.36 | 69.06 |
D | 59.13 | 55.66 | 41.63 | 9.45 | 11.79 | 35.48 | 70.13 | 68.56 |
E | 61.65 | 60.11 | 61.39 | 10.62 | 12.26 | 28.81 | 60.4 | 58.18 |
F | 66.33 | 29.13 | 31.17 | 52.74 | 25.29 | 57.57 | 78.0 | 83.61 |
G | 58.68 | 55.82 | 56.96 | 71.23 | 49.32 | 66.83 | 85.41 | 86.08 |
H | 73.58 | 71.29 | 73.37 | 71.18 | 70.16 | 49.06 | 92.39 | 95.11 |
Variant | A | B | C | D | E | F | G | H
---|---|---|---|---|---|---|---|---|
Angular term only | 82.44 | 77.94 | 63.92 | 65.26 | 55.24 | 81.22 | 82.39 | 91.28 |
Euclidean term only | 72.96 | 72.85 | 67.04 | 66.79 | 49.25 | 77.94 | 75.89 | 77.23 |
DeepFPC | 87.26 | 78.81 | 69.06 | 68.56 | 58.18 | 83.61 | 86.08 | 95.11 |
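The ablation above contrasts an angular (cosine-based) term with a Euclidean term, and the full DeepFPC score uses both. Below is a minimal sketch of how two such centroid-distance terms could be fused into one priority score; the linear weighting (`alpha`) and all names are assumptions, not the paper's actual fusion rule.

```python
import numpy as np

def combined_score(feature: np.ndarray, fpc_centroids: np.ndarray,
                   alpha: float = 0.5) -> float:
    """Hypothetical fusion of the two distance terms ablated above.
    Angular term:   cosine distance to the nearest FPC centroid.
    Euclidean term: L2 distance to the nearest FPC centroid.
    Lower combined score = closer to known failure regions = higher priority."""
    f = feature / np.linalg.norm(feature)
    c = fpc_centroids / np.linalg.norm(fpc_centroids, axis=1, keepdims=True)
    angular = 1.0 - float((c @ f).max())                                   # cosine distance
    euclidean = float(np.linalg.norm(fpc_centroids - feature, axis=1).min())  # L2 distance
    return alpha * angular + (1.0 - alpha) * euclidean
```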
Category | TIP Metric | Evaluation Split | MNIST—ResNet18 (Nominal Active Split) | MNIST—ResNet18 (OOD Active Split) | FMNIST—WideResNet50 (Nominal Active Split) | FMNIST—WideResNet50 (OOD Active Split)
---|---|---|---|---|---|---
- | Random Selection | Nominal | 91.18 | 69.26 | 89.46 | 49.26
- | Random Selection | OOD | 93.48 | 78.72 | 88.42 | 58.52
Intrinsic Function | Gini [10] | Nominal | 93.92 | 74.14 | 90.78 | 49.54
Intrinsic Function | Gini [10] | OOD | 93.34 | 80.32 | 89.04 | 66.94
Surprise Adequacy | DSA [8] | Nominal | 90.74 | 69.88 | 90.78 | 51.26
Surprise Adequacy | DSA [8] | OOD | 94.80 | 81.82 | 87.54 | 68.24
Surprise Adequacy | MLSA [9] | Nominal | 91.78 | 69.88 | 91.94 | 51.96
Surprise Adequacy | MLSA [9] | OOD | 93.78 | 73.12 | 88.50 | 62.98
Neuron Coverage | NBC [5] | Nominal | 90.28 | 67.48 | 91.12 | 51.04
Neuron Coverage | NBC [5] | OOD | 92.98 | 79.56 | 87.64 | 62.94
Neuron Coverage | NLC [6] | Nominal | 91.94 | 70.38 | 87.70 | 41.38
Neuron Coverage | NLC [6] | OOD | 89.50 | 73.54 | 91.04 | 61.44
Neuron Coverage | FD+ [7] | Nominal | 91.88 | 71.06 | 89.80 | 52.44
Neuron Coverage | FD+ [7] | OOD | 91.18 | 72.98 | 87.10 | 60.50
Monitor-Based | DA [14] | Nominal | 93.92 | 74.36 | 92.32 | 52.34
Monitor-Based | DA [14] | OOD | 93.60 | 80.60 | 88.34 | 65.18
Monitor-Based | DeepFPC | Nominal | 92.20 | 74.16 | 92.48 | 52.44
Monitor-Based | DeepFPC | OOD | 93.80 | 81.60 | 88.65 | 65.39
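In the active-learning comparison above, each TIP metric ranks an unlabeled pool and the top-ranked inputs are sent for labeling and retraining. The following generic sketch shows that selection step with a monitor-style centroid score; all names are hypothetical and the retraining loop itself is omitted.

```python
import numpy as np

def select_for_labeling(pool_feats: np.ndarray, fpc_centroids: np.ndarray,
                        budget: int) -> np.ndarray:
    """Illustrative TIP-driven active-learning selection step:
    rank the unlabeled pool by distance to the nearest FPC centroid
    and return indices of the `budget` most error-prone-looking inputs.

    pool_feats:    (n, d) features of the unlabeled pool
    fpc_centroids: (k, d) FPC centroids from training data
    """
    # Pairwise distances (n, k), then distance to the nearest centroid (n,).
    d = np.linalg.norm(pool_feats[:, None, :] - fpc_centroids[None, :, :], axis=2)
    scores = d.min(axis=1)
    # Smallest distance = closest to known failure regions = labeled first.
    return np.argsort(scores)[:budget]

# Toy usage: pick 10 of 200 pool inputs for the next labeling round.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 16))
centroids = rng.normal(size=(5, 16))
print(select_for_labeling(pool, centroids, budget=10))
```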
Citation: Hwang, H.; Chun, I.Y.; Shin, J. Improved Test Input Prioritization Using Verification Monitors with False Prediction Cluster Centroids. Electronics 2024, 13, 21. https://doi.org/10.3390/electronics13010021