DermViT: Diagnosis-Guided Vision Transformer for Robust and Efficient Skin Lesion Classification
Abstract
1. Introduction
1. Dermoscopic Hierarchical Attention (DHA): DHA mimics the attention-shifting behavior of dermatologists, allocating computation through a two-stage routing mechanism. In the first stage, DHA computes region-level attention weights to quickly discard diagnostically irrelevant background regions. In the second stage, fine-grained token-level attention is applied only within the retained regions, concentrating on dermatologically relevant local features. This hierarchical coarse-to-fine strategy reduces computational overhead while improving sensitivity to subtle lesions (a minimal sketch of the routing follows this list).
2. Dermoscopic Context Pyramid (DCP): Inspired by the multi-magnification collaborative mechanism of pathological diagnosis, DCP handles the high intra-class variability of lesions such as melanoma through cross-scale feature fusion (see the second sketch after this list).
3. Dermoscopic Feature Gate (DFG): Inspired by the observation–verification cycle of clinical diagnosis, DFG suppresses semantic noise through channel decoupling and dynamic gating. It splits input features into a primary pathway for morphological analysis and an auxiliary pathway for detail verification: the primary pathway captures local context (e.g., pigment texture), while the auxiliary pathway calibrates response intensity through channel-wise multiplication. This emulates the cognitive process of dermatologists, who first observe globally and then zoom in to verify local details, and it effectively attenuates high-frequency artifacts (e.g., hair) that can interfere with diagnosis (see the third sketch after this list).
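The block below is a minimal, self-contained sketch of the two-stage routing idea behind DHA, written in the style of the bi-level routing attention of BiFormer [23] on which it builds. The module name, the window count `num_regions`, and the routing budget `top_k` are illustrative assumptions, not the authors' implementation; multi-head splitting and positional encoding are omitted for brevity.

```python
# Sketch of coarse-to-fine (region-level -> token-level) routing attention.
# Assumption: BiFormer-style bi-level routing; not DermViT's exact block design.
import torch
import torch.nn as nn

class HierarchicalRoutingAttention(nn.Module):
    def __init__(self, dim, num_regions=7, top_k=4):
        super().__init__()
        self.num_regions = num_regions      # feature map split into num_regions x num_regions windows
        self.top_k = top_k                  # regions kept after coarse routing
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by num_regions
        B, H, W, C = x.shape
        r = self.num_regions
        hr, wr = H // r, W // r             # tokens per region side
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):                  # (B, H, W, C) -> (B, R, n, C), R = r*r, n = hr*wr
            t = t.view(B, r, hr, r, wr, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, r * r, hr * wr, C)

        q_r, k_r, v_r = map(to_regions, (q, k, v))

        # Stage 1: region-level routing, coarse attention over region descriptors
        q_region = q_r.mean(dim=2)                             # (B, R, C)
        k_region = k_r.mean(dim=2)
        affinity = q_region @ k_region.transpose(-1, -2)       # (B, R, R)
        topk_idx = affinity.topk(self.top_k, dim=-1).indices   # keep most relevant regions per query region

        # Stage 2: token-level attention restricted to the routed regions
        idx = topk_idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
        k_sel = torch.gather(k_r.unsqueeze(1).expand(-1, r * r, -1, -1, -1), 2, idx)
        v_sel = torch.gather(v_r.unsqueeze(1).expand(-1, r * r, -1, -1, -1), 2, idx)
        k_sel = k_sel.reshape(B, r * r, self.top_k * hr * wr, C)
        v_sel = v_sel.reshape(B, r * r, self.top_k * hr * wr, C)

        attn = (q_r @ k_sel.transpose(-1, -2)) * self.scale    # (B, R, n, top_k*n)
        out = attn.softmax(dim=-1) @ v_sel                     # (B, R, n, C)

        # fold regions back into the (B, H, W, C) layout
        out = out.view(B, r, r, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)
```

Presumably this attention replaces dense global self-attention inside the transformer blocks, so each query region attends to only `top_k` routed regions instead of the full token grid, which is where the FLOP reduction comes from.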
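DCP is described only at the level of cross-scale feature fusion, so the second sketch shows one common way such a block can be realized: parallel depthwise convolutions with different receptive fields whose outputs are concatenated and fused by a 1×1 convolution. The branch kernel sizes and the fusion scheme are assumptions, not the paper's design.

```python
# Illustrative multi-scale context-fusion block in the spirit of DCP.
import torch
import torch.nn as nn

class ContextPyramid(nn.Module):
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)   # depthwise conv, one scale per branch
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv2d(dim * (len(kernel_sizes) + 1), dim, 1)  # 1x1 cross-scale fusion

    def forward(self, x):                     # x: (B, C, H, W)
        feats = [x] + [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```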
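The third sketch illustrates the decouple-and-gate idea behind DFG: channels are split into a primary path that models local morphology and an auxiliary path that produces a gate, and the gate rescales the primary response channel-wise to damp high-frequency distractors such as hair. The split ratio, convolution sizes, and sigmoid gating are assumptions rather than the paper's exact design.

```python
# Illustrative channel-decoupled gating block in the spirit of DFG.
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.primary = nn.Sequential(                       # morphological / local-context path
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.GELU(),
        )
        self.gate = nn.Sequential(                          # detail-verification path -> channel gate
            nn.Conv2d(half, half, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(half, dim, 1)                 # restore channel count

    def forward(self, x):                    # x: (B, C, H, W), C even
        main, aux = x.chunk(2, dim=1)        # channel decoupling
        out = self.primary(main) * self.gate(aux)   # channel-wise calibration of the primary response
        return self.proj(out)
```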
2. Materials and Methods
2.1. Dermoscopic Hierarchical Attention
2.2. Dermoscopic Feature Gate
2.3. Dermoscopic Context Pyramid
2.4. The Entire Network Structure
2.5. Experimental Configuration
3. Results
3.1. Comparison Experiments
3.2. Ablation Experiments
3.3. Training Process Analysis
3.4. Visualization
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Balch, C.M.; Gershenwald, J.E.; Soong, S.J.; Thompson, J.F.; Atkins, M.B.; Byrd, D.R.; Buzaid, A.C.; Cochran, A.J.; Coit, D.G.; Ding, S.; et al. Final Version of 2009 AJCC Melanoma Staging and Classification. J. Clin. Oncol. 2009, 27, 6199–6206.
2. Braun, R.P.; Rabinovitz, H.S.; Oliviero, M.; Kopf, A.W.; Saurat, J.H. Dermoscopy of pigmented skin lesions. J. Am. Acad. Dermatol. 2005, 52, 109–121.
3. Steppan, J.; Hanke, S. Analysis of skin lesion images with deep learning. arXiv 2021, arXiv:2101.03814.
4. Vestergaard, M.; Macaskill, P.; Holt, P.; Menzies, S. Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: A meta-analysis of studies performed in a clinical setting. Br. J. Dermatol. 2008, 159, 669–676.
5. Gouda, W.; Sama, N.U.; Al-Waakid, G.; Humayun, M.; Jhanjhi, N.Z. Detection of Skin Cancer Based on Skin Lesion Images Using Deep Learning. Healthcare 2022, 10, 1183.
6. Hasan, M.K.; Ahamad, M.A.; Yap, C.H.; Yang, G. A survey, review, and future trends of skin lesion segmentation and classification. Comput. Biol. Med. 2023, 155, 106624.
7. Celebi, M.E.; Kingravi, H.A.; Uddin, B.; Iyatomi, H.; Aslandogan, Y.A.; Stoecker, W.V.; Moss, R.H. A methodological approach to the classification of dermoscopy images. Comput. Med. Imaging Graph. 2007, 31, 362–373.
8. Abbas, Q.; Celebi, M.; Serrano, C.; Fondón García, I.; Ma, G. Pattern classification of dermoscopy images: A perceptually uniform model. Pattern Recognit. 2013, 46, 86–97.
9. Goceri, E. Classification of skin cancer using adjustable and fully convolutional capsule layers. Biomed. Signal Process. Control 2023, 85, 104949.
10. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118.
11. Harangi, B. Skin lesion classification with ensembles of deep convolutional neural networks. J. Biomed. Inform. 2018, 86, 25–32.
12. Aldhyani, T.H.H.; Verma, A.; Al-Adhaileh, M.H.; Koundal, D. Multi-Class Skin Lesion Classification Using a Lightweight Dynamic Kernel Deep-Learning-Based Convolutional Neural Network. Diagnostics 2022, 12, 2048.
13. He, X.; Tan, E.L.; Bi, H.; Zhang, X.; Zhao, S.; Lei, B. Fully Transformer Network for Skin Lesion Analysis. Med. Image Anal. 2022, 77, 102357.
14. Gessert, N.; Sentker, T.; Madesta, F.; Schmitz, R.; Kniep, H.; Baltruschat, I.; Werner, R.; Schlaefer, A. Skin lesion classification using CNNs with patch-based attention and diagnosis-guided loss weighting. IEEE Trans. Biomed. Eng. 2019, 67, 495–503.
15. Yu, Y.; Jia, H.; Zhang, L.; Xu, S.; Zhu, X.; Wang, J.; Wang, F.; Han, L.; Jiang, H.; Zhou, Q.; et al. Deep Multi-Modal Skin-Imaging-Based Information-Switching Network for Skin Lesion Recognition. Bioengineering 2025, 12, 282.
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
18. Datta, S.K.; Shaikh, M.A.; Srihari, S.N.; Gao, M. Soft Attention Improves Skin Cancer Classification Performance. In Interpretability of Machine Intelligence in Medical Image Computing, and Topological Data Analysis and Its Applications for Medical Data; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; pp. 13–23.
19. Li, A.; Zhang, D.; Yu, L.; Kang, X.; Tian, S.; Wu, W.; You, H.; Huo, X. Residual cosine similar attention and bidirectional convolution in dual-branch network for skin lesion image classification. Eng. Appl. Artif. Intell. 2024, 133, 108386.
20. Matsoukas, C.; Haslum, J.; Söderberg, M.; Smith, K. Is it Time to Replace CNNs with Transformers for Medical Images? arXiv 2021, arXiv:2108.09038.
21. Wu, H.; Chen, S.; Chen, G.; Wang, W.; Lei, B.; Wen, Z. FAT-Net: Feature adaptive transformers for automated skin lesion segmentation. Med. Image Anal. 2022, 76, 102327.
22. Parvaiz, A.; Khalid, M.; Zafar, R.; Ameer, H.; Ali, M.; Fraz, M. Vision Transformers in Medical Computer Vision—A Contemplative Retrospection. Eng. Appl. Artif. Intell. 2023, 122, 106126.
23. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. BiFormer: Vision Transformer With Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333.
24. Wang, S.; Li, B.; Khabsa, M.; Han, F.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768.
25. Madan, S.; Li, Y.; Zhang, M.; Pfister, H.; Kreiman, G. Improving generalization by mimicking the human visual diet. arXiv 2022, arXiv:2206.07802.
26. Huang, L.; Yuan, Y.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Interlaced Sparse Self-Attention for Semantic Segmentation. arXiv 2019, arXiv:1907.12273.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
28. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
29. Rezvantalab, A.; Safigholi, H.; Karimijeshni, S. Dermatologist Level Dermoscopy Skin Cancer Classification Using Different Deep Learning Convolutional Neural Networks Algorithms. arXiv 2018, arXiv:1810.10348.
30. Foahom Gouabou, A.C.; Damoiseaux, J.L.; Monnier, J.; Iguernaissi, R.; Moudafi, A.; Merad, D. Ensemble Method of Convolutional Neural Networks with Directed Acyclic Graph Using Dermoscopic Images: Melanoma Detection Application. Sensors 2021, 21, 3999.
31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
32. Fan, Q.; Huang, H.; Chen, M.; He, R. Vision Transformer with Sparse Scan Prior. arXiv 2024, arXiv:2405.13335.
33. Codella, N.; Rotemberg, V.; Tschandl, P.; Çelebi, M.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2019, arXiv:1902.03368.
34. Onakpojeruo, E.P.; Mustapha, M.T.; Ozsahin, D.U.; Ozsahin, I. A Comparative Analysis of the Novel Conditional Deep Convolutional Neural Network Model, Using Conditional Deep Convolutional Generative Adversarial Network-Generated Synthetic and Augmented Brain Tumor Datasets for Image Classification. Brain Sci. 2024, 14, 559.
35. Onakpojeruo, E.P.; Mustapha, M.T.; Ozsahin, D.U.; Ozsahin, I. Enhanced MRI-based brain tumour classification with a novel Pix2pix generative adversarial network augmentation framework. Brain Commun. 2024, 6, fcae372.
36. Mikołajczyk, A.; Grochowski, M. Data augmentation for improving deep learning in image classification problem. In Proceedings of the 2018 International Interdisciplinary PhD Workshop (IIPhDW), Swinoujscie, Poland, 9–12 May 2018; pp. 117–122.
37. Shen, S.; Xu, M.; Zhang, F.; Shao, P.; Liu, H.; Xu, L.; Zhang, C.; Liu, P.; Yao, P.; Xu, R. A low-cost high-performance data augmentation for deep learning-based skin lesion classification. BME Front. 2022, 2022, 9765307.
38. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
39. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
40. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
41. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
42. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
43. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
44. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
45. Zhang, C.; Yang, Z.; He, X.; Deng, L. Multimodal Intelligence: Representation Learning, Information Fusion, and Applications. IEEE J. Sel. Top. Signal Process. 2020, 14, 478–493.
46. Azam, M.A.; Khan, K.B.; Salahuddin, S.; Rehman, E.; Khan, S.A.; Khan, M.A.; Kadry, S.; Gandomi, A.H. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput. Biol. Med. 2022, 144, 105253.
Comparison experiments on the ISIC2018 dataset (mean ± standard deviation; "—" = not reported).

| Model | FLOPs (G) | Parameters (M) | Precision (%) | Recall (%) | F1 (%) | Accuracy (%) | MAUC (%) |
|---|---|---|---|---|---|---|---|
| ResNet34 | 3.68 | 21.8 | 69.02 ± 2.38 | 66.80 ± 0.90 | 67.24 ± 1.42 | 81.67 ± 0.52 | 95.56 ± 0.20 |
| ResNet50 | 4.13 | 23.52 | 71.33 ± 1.04 | 69.58 ± 1.60 | 70.14 ± 0.94 | 82.35 ± 0.46 | 95.58 ± 0.17 |
| ResNet101 | 7.87 | 44.55 | 68.92 ± 1.77 | 69.20 ± 1.36 | 68.67 ± 0.92 | 81.93 ± 0.41 | 95.68 ± 0.11 |
| ConvNeXt-Tiny | 4.45 | 27.8 | 61.50 ± 1.16 | 59.05 ± 1.76 | 59.18 ± 1.33 | 78.09 ± 0.33 | 94.27 ± 0.07 |
| Swin-Tiny | 8.74 | 27.5 | 62.28 ± 0.95 | 62.41 ± 1.14 | 61.88 ± 1.11 | 80.04 ± 0.25 | 95.42 ± 0.04 |
| Swin-Small | 17.09 | 48.8 | 52.80 ± 0.64 | 47.63 ± 0.73 | 49.70 ± 0.51 | 74.11 ± 0.27 | 92.80 ± 0.04 |
| ViT-Base | 16.86 | 85.65 | 60.86 ± 2.22 | 55.03 ± 1.38 | 56.88 ± 1.50 | 77.75 ± 0.42 | 93.00 ± 0.21 |
| RCSABC [19] | 22.89 | 81.19 | 80.38 | 76.66 | 78.3 | 87.39 | — |
| BiFormer | 4.42 | 27.65 | 71.17 ± 1.10 | 71.44 ± 1.49 | 69.86 ± 0.97 | 83.00 ± 0.45 | 95.18 ± 0.13 |
| DermViT | 2.75 | 16.82 | 80.75 ± 1.07 | 76.82 ± 0.50 | 77.90 ± 0.43 | 85.34 ± 0.23 | 96.29 ± 0.05 |
Comparison experiments on the ISIC2019 dataset (mean ± standard deviation; "—" = not reported).

| Model | FLOPs (G) | Parameters (M) | Precision (%) | Recall (%) | F1 (%) | Accuracy (%) | MAUC (%) |
|---|---|---|---|---|---|---|---|
| ResNet34 | 3.68 | 21.8 | 68.42 ± 0.93 | 66.59 ± 0.91 | 67.11 ± 0.87 | 78.92 ± 0.46 | 95.75 ± 0.06 |
| ResNet50 | 4.13 | 23.52 | 70.31 ± 0.86 | 68.00 ± 0.43 | 68.83 ± 0.51 | 79.38 ± 0.21 | 96.07 ± 0.06 |
| ResNet101 | 7.87 | 44.55 | 66.47 ± 0.73 | 65.49 ± 0.88 | 65.57 ± 0.61 | 77.44 ± 0.34 | 95.41 ± 0.07 |
| ConvNeXt-Tiny | 4.45 | 27.8 | 58.90 ± 1.21 | 51.71 ± 1.50 | 53.94 ± 0.93 | 71.55 ± 0.27 | 93.21 ± 0.05 |
| Swin-Tiny | 8.74 | 27.5 | 56.96 ± 1.40 | 42.23 ± 1.39 | 45.17 ± 1.36 | 67.48 ± 1.19 | 89.79 ± 0.23 |
| Swin-Small | 17.09 | 48.8 | 63.40 ± 0.38 | 58.53 ± 0.37 | 60.43 ± 0.15 | 74.80 ± 0.12 | 93.89 ± 0.02 |
| ViT-Base | 16.86 | 85.65 | 58.66 ± 0.52 | 52.61 ± 0.90 | 54.54 ± 0.78 | 70.40 ± 0.33 | 92.39 ± 0.07 |
| RCSABC [19] | 22.89 | 81.19 | — | — | — | — | — |
| BiFormer | 4.42 | 27.65 | 72.35 ± 1.04 | 69.52 ± 1.17 | 70.68 ± 0.75 | 78.77 ± 0.31 | 95.71 ± 0.12 |
| DermViT | 2.75 | 16.82 | 72.39 ± 0.41 | 71.24 ± 0.31 | 71.57 ± 0.30 | 80.82 ± 0.20 | 96.42 ± 0.03 |
Ablation experiments on the proposed modules (DHA, DCP, DFG).

| Model | FLOPs (G) | Parameters (M) | Precision (%) | Recall (%) | F1 (%) | Accuracy (%) | MAUC (%) |
|---|---|---|---|---|---|---|---|
| Baseline | 16.86 | 85.65 | 58.89 | 53.00 | 55.79 | 78.30 | 94.01 |
| Baseline + DHA | 4.42 | 27.65 | 73.34 | 70.50 | 71.89 | 83.26 | 96.02 |
| Baseline + DCP | 5.12 | 32.36 | 74.76 | 71.18 | 72.93 | 82.99 | 95.21 |
| Baseline + DFG | 2.02 | 11.52 | 75.40 | 71.69 | 73.37 | 82.52 | 95.40 |
| Baseline + DHA + DCP + DFG | 2.75 | 16.82 | 83.24 | 77.25 | 78.88 | 86.12 | 96.54 |