1. Introduction
Skin diseases, especially skin cancers such as melanoma, pose a considerable and escalating health issue due to their rising incidence and potential severity [1,2,3,4]. Timely and precise diagnosis is essential for enhancing patient outcomes and lowering the mortality rates linked to these conditions. Conventional clinical approaches rely heavily on visual assessments by experienced dermatologists, a process that can be subjective and constrained by the availability of specialists [5,6]. As a result, automated skin disease detection through artificial intelligence (AI) has become an essential focus for overcoming these challenges by providing reliable, efficient, and scalable diagnostic solutions. The diversity of skin lesions in color, texture, and morphology poses considerable obstacles to creating universally effective AI models: different diseases frequently resemble one another visually, while lesions of the same condition may differ only subtly, complicating precise classification [1,7].
This study aimed to evaluate the effectiveness of various AI models across several dermatological conditions, focusing on their accuracy, robustness, and generalizability. We also aimed to identify optimal approaches for accurate and reliable multi-class skin disease classification through a comparative evaluation of a custom convolutional neural network (CNN) and established transfer learning architectures, namely ResNet50, DenseNet201, and InceptionResNetV2. The evaluation employed the publicly accessible FYP Skin Disease Dataset, which includes images depicting a range of common and significant skin disorders.
This study contributes a thorough comparative analysis of AI model performance in multi-class skin disease classification, highlighting the advantages and drawbacks of both a tailored CNN architecture and widely used transfer learning models. These insights can guide ongoing improvements in AI architectures and training methodologies, advancing diagnostic tools that can effectively classify a range of dermatological conditions [5,8,9].
2. Materials and Methods
2.1. Dataset
The research utilized the FYP Skin Disease Dataset, which is publicly accessible on Kaggle. This dataset originally comprised 22,982 images classified into nine distinct dermatological conditions: acne, melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), squamous cell carcinoma (SCC), actinic keratosis (AK), seborrheic keratosis (SEK), dermatofibroma (DF), and vascular lesions (VASC) [10]. The dataset was refined to 4500 images to maintain balanced representation and enhance computational efficiency, with 500 images designated for each class. Images were standardized to a resolution of 300 × 300 pixels and divided into training (80%), validation (10%), and testing (10%) subsets.
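A minimal sketch of such a split is given below, assuming a directory layout with one subfolder per class; the path, file extension, and random seed are hypothetical. Splitting within each class keeps all three subsets balanced across the nine conditions.

```python
import pathlib
import random

# Hypothetical sketch: stratified 80/10/10 split over a directory tree
# with one subfolder per class ("fyp_skin_disease/<class>/*.jpg" assumed).
random.seed(42)
root = pathlib.Path("fyp_skin_disease")

train, val, test = [], [], []
for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    files = sorted(class_dir.glob("*.jpg"))
    random.shuffle(files)                  # shuffle within each class
    n = len(files)                         # 500 per class after curation
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train += files[:n_train]               # 400 images per class
    val += files[n_train:n_train + n_val]  # 50 images per class
    test += files[n_train + n_val:]        # 50 images per class
```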
2.2. Data Preprocessing
All images were resized to 224 × 224 pixels to match the models’ input specifications, and pixel values were normalized to the [0,1] range by dividing by 255. Data loading and preprocessing used TensorFlow’s tf.data pipeline, incorporating caching, shuffling with a buffer size of 1000, and prefetching set to AUTOTUNE to improve throughput [11].
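A minimal sketch of this pipeline, assuming a directory-per-class layout under a hypothetical "data/train" path, might look as follows:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

# Load images from a directory tree with one subfolder per class
# ("data/train" is a hypothetical path; adjust to the actual layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=IMG_SIZE,   # resize to the models' 224 x 224 input
    batch_size=BATCH_SIZE,
)

# Rescale pixel values from [0, 255] to [0, 1].
rescale = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda x, y: (rescale(x), y))

# Cache decoded images, shuffle with a 1000-element buffer, and prefetch
# upcoming batches while the GPU processes the current one.
train_ds = train_ds.cache().shuffle(1000).prefetch(tf.data.AUTOTUNE)
```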
2.3. Model Architectures
2.3.1. Custom CNN
The custom convolutional neural network used convolutional layers (Conv2D) with a 3 × 3 kernel, ReLU activation functions, and MaxPooling layers of size 2 × 2. The architecture comprised Conv2D layers with progressively increasing filter depth (32, 64, 128), followed by a flattening layer, a dense layer with 128 units and ReLU activation, a dropout layer with a rate of 0.5, and a final softmax layer for the nine classes.
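A sketch of this architecture in Keras follows; the text does not state the exact number of convolutional blocks per filter depth, so one Conv2D/MaxPooling pair per depth is assumed here:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Custom CNN sketch: three Conv2D (3 x 3, ReLU) / MaxPooling (2 x 2) blocks
# with 32, 64, and 128 filters, then Flatten, Dense(128, ReLU),
# Dropout(0.5), and a 9-way softmax output.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(9, activation="softmax"),
])
```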
2.3.2. Transfer Learning Models
Pre-trained ImageNet models—ResNet50 [12], DenseNet201 [13], and InceptionResNetV2 [14]—were utilized without their top classification layers. A global average pooling (GAP) layer was added, followed by a dense layer with 512 units and ReLU activation, a dropout layer with a rate of 0.5, and a final softmax layer for classification across the nine classes.
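A sketch of this setup is shown below, using DenseNet201 as the example backbone (ResNet50 and InceptionResNetV2 follow the same pattern via their tf.keras.applications constructors); whether the backbone weights were frozen during training is not stated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained backbone without its top classification layers.
base = tf.keras.applications.DenseNet201(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),
)
# base.trainable = False  # freezing is an option; the text does not specify

# Classification head described in the text:
# GAP -> Dense(512, ReLU) -> Dropout(0.5) -> 9-way softmax.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(9, activation="softmax"),
])
```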
2.4. Training Setup
We developed all models using the TensorFlow 2.20.0 framework and the Keras API (version 3.11.3) on an NVIDIA GeForce RTX 3080 GPU. Training employed the Adam optimizer with a learning rate of 0.001, sparse categorical cross-entropy loss, a batch size of 32, and up to 50 epochs, with early stopping on validation loss and a patience of 5 epochs.
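Continuing the sketches above, the stated hyperparameters translate into the following compile and fit calls (restore_best_weights is our assumption, not stated in the text):

```python
import tensorflow as tf

# Adam (lr = 0.001), sparse categorical cross-entropy, up to 50 epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Stop training when validation loss fails to improve for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,  # assumption; not stated in the text
)

history = model.fit(
    train_ds,  # batch size of 32 is set in the tf.data pipeline
    validation_data=val_ds,
    epochs=50,
    callbacks=[early_stop],
)
```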
2.5. Evaluation Metrics
The evaluation of the model’s performance involved analyzing accuracy, loss curves, confusion matrices, and classification reports, which included precision, recall, and F1-score. Additionally, multi-class ROC curves were utilized to assess the class-wise AUC.
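The paper does not name the library used to compute these metrics; a plausible sketch with scikit-learn, assuming a test dataset built like the training pipeline but without shuffling, is:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Gather true labels and predicted probabilities from the test pipeline.
# This assumes test_ds yields batches in a fixed (unshuffled) order.
y_true = np.concatenate([y.numpy() for _, y in test_ds])
y_prob = model.predict(test_ds)
y_pred = y_prob.argmax(axis=1)

# Confusion matrix and per-class precision, recall, and F1-score.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

# Class-wise ROC/AUC via one-vs-rest binarization.
y_bin = label_binarize(y_true, classes=list(range(9)))
for c in range(9):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_prob[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```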
3. Results
The findings demonstrate that the custom CNN model achieved perfect accuracy (100%) alongside negligible loss (0.0017). DenseNet201 and InceptionResNetV2 achieved accuracy rates of 99.33% and 98.66%, respectively, while maintaining remarkably low losses. In contrast, ResNet50 performed notably worse, achieving an accuracy of only 44.41% with high loss, suggesting difficulties in effectively classifying multi-class skin lesions with this model, as shown in Figure 1.
The model evaluation is shown in Table 1 below.
The training and validation accuracy curves for ResNet50 demonstrated steady improvement, ultimately stabilizing at approximately 44%. The loss curves declined consistently but remained comparatively high, indicating that ResNet50 was not sufficiently adapted to identify the distinguishing features of the multi-class skin lesion dataset used in this analysis, as shown in Figure 2 below.
The accuracy curves of the custom CNN rose swiftly and steadily to peak performance, achieving 100% accuracy with negligible fluctuations. Similarly, the loss curves dropped sharply and remained close to zero throughout training, underscoring the custom CNN model’s strong capacity to fit and generalize the training data. The results are shown in Figure 3.
DenseNet201 demonstrated notable training performance, swiftly achieving accuracy levels close to 99%. The loss values decreased significantly during the early epochs and remained low and stable throughout the training phase, suggesting robust generalization ability. The results are shown in Figure 4 below.
The accuracy curves for InceptionResNetV2 demonstrated consistent enhancements, ultimately reaching a stabilization near 98%. The loss curves exhibited a steep decline at the beginning and maintained a consistently low level, indicating successful learning and strong generalization to the validation set.
3.1. Confusion Matrix Analysis
The confusion matrix indicates that ResNet50 struggled to differentiate among several classes, frequently confusing acne, NV, and MEL with other lesion types. This ambiguity is consistent with the model’s limited overall accuracy and suggests that ResNet50 failed to learn distinctive features for each skin disease category within this dataset, as illustrated in Figure 5. The CNN results are shown in Figure 6.
The matrix indicates that the custom CNN model successfully classified all test samples, resulting in zero misclassifications across the nine classes. This outcome supports earlier findings of the model achieving 100% test accuracy, highlighting its exceptional ability to learn and differentiate the features of each skin disease.
The matrix shows that DenseNet201, presented in Figure 7, produced highly accurate predictions, with only a handful of minor misclassifications (for instance, one AK image classified as BCC). This result reinforces the earlier finding of 99.33% accuracy and underscores the model’s robust generalization ability and discriminative strength across all categories.
The confusion matrix indicates that InceptionResNetV2, shown in Figure 8, attained high accuracy across all classes, with only slight misclassifications, such as NV being mistaken for MEL or VASC. This result confirms the model’s strong capability to generalize the characteristics of various skin conditions while maintaining high predictive accuracy.
3.2. ROC Curve and AUC Analysis
Table 2 shows the area under the curve (AUC) values for each class across all four models, providing valuable insights into their discriminative capabilities.
The results further substantiate previous findings, demonstrating that the custom CNN, DenseNet201, and InceptionResNetV2 models attained perfect AUC scores across all classes. In contrast, ResNet50 showed notably lower AUCs, especially for NV and DF, which suggests a lack of class separability, as illustrated in Figure 9. The ROC curves and AUC values for the custom CNN are shown in Figure 10, those for DenseNet201 in Figure 11, and those for InceptionResNetV2 in Figure 12.
3.3. Classification Report
The findings indicate exceptional class discrimination (AUC = 1.00) for the custom CNN, DenseNet201, and InceptionResNetV2, whereas ResNet50 demonstrated lower and more variable AUC values, corroborating previous observations of inadequate adaptation, as shown in Table 3 below. The classification report for the custom CNN is presented in Table 4, the results for DenseNet201 in Table 5, and those for InceptionResNetV2 in Table 6.
3.4. Visualization of Sample Predictions
To assess each model’s behavior on individual cases, predictions were generated for sample images drawn from the dataset. The test results for ResNet50 are shown in Figure 13, for the custom CNN in Figure 14, and for DenseNet201 in Figure 15. Figure 16 shows the results for the InceptionResNetV2 model.
4. Discussion
This study presents a detailed comparison of a custom-built CNN model with three transfer learning architectures—ResNet50, DenseNet201, and InceptionResNetV2—focusing on the multi-class classification of dermatological conditions. The exceptional performance of the Custom CNN, DenseNet201, and InceptionResNetV2 models highlights the capabilities of deep learning in automating skin disease diagnosis with remarkable accuracy.
ResNet50 performed notably worse than the other models, which can be attributed to its limited ability to adapt to the unique characteristics of the dermatological images in this dataset. The confusion matrix and ROC curves corroborated this finding, indicating that ResNet50 struggled to differentiate effectively among specific classes such as NV, DF, and MEL. The lower AUC values and classification metrics suggest that ResNet50’s deep residual connections may not have adequately captured the fine-grained features of skin lesion textures and variations without further tuning.
The custom CNN attained an impeccable classification score, indicating that a specialized architecture crafted for skin lesion images can surpass standard pre-trained models, particularly when computational resources permit comprehensive training from the ground up. Similarly, DenseNet201 and InceptionResNetV2 attained nearly flawless scores, demonstrating the efficacy of deep transfer learning models when utilized with suitable preprocessing and training methodologies.
An important finding is that all models, with the exception of ResNet50, exhibited consistently strong performance across all classes, even among visually similar categories like SEK and AK, which are typically challenging to differentiate. The classification reports and AUC tables indicate that models utilizing transfer learning, when designed with adequate depth and suitable architectural selections, demonstrate strong generalization capabilities upon fine-tuning.
The results highlight the critical role of choosing and fine-tuning models in the realm of medical imaging tasks. Models that are custom-designed might prove to be more effective in specific contexts, whereas carefully selected transfer learning models can deliver top-tier results with less data and reduced training durations.
5. Conclusions
This investigation provided a comparative assessment of four deep learning models aimed at multi-class skin disease classification, utilizing a balanced dataset comprising 4500 dermatoscopic images. The models comprised a tailored CNN, ResNet50, DenseNet201, and InceptionResNetV2. The findings indicated that the custom CNN attained an impressive 100% accuracy, with DenseNet201 not far behind at 99.33% and InceptionResNetV2 at 98.66%. In contrast, ResNet50 showed a notable deficiency, achieving only 44.41% accuracy.
The evaluation utilizing confusion matrices, AUC, and precision-recall metrics demonstrates that tailored architectures and effectively optimized transfer learning models can deliver exceptional accuracy and dependable performance for medical image classification tasks. Nonetheless, it is important to note that not every pre-trained model demonstrates effective generalization without undergoing fine-tuning.
In summary, this study underscores the practicality and effectiveness of AI-based diagnosis in dermatology, advocating for the continued advancement of refined deep learning solutions specifically designed for medical datasets. Future investigations could delve into the integration of clinical metadata, the utilization of advanced ensemble techniques, and the development of real-time diagnostic applications to enhance the significance of this study.
Author Contributions
Conceptualization, R.A.M., D.I.M. and M.N.R.; methodology, R.A.M., I.L.K. and K.; software, R.A.M. and I.L.K.; validation, I.L.K.; formal analysis, R.A.M.; resources, R.A.M. and D.I.M.; writing—original draft preparation, R.A.M., D.I.M. and M.N.R.; writing—review and editing, R.A.M. and I.L.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study, because the data were obtained from publicly available datasets.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Wei, L.; Ding, K.; Hu, H. Automatic Skin Cancer Detection in Dermoscopy Images Based on Ensemble Lightweight Deep Learning Network. IEEE Access 2020, 8, 99633–99647. [Google Scholar] [CrossRef]
- Hasan, M.K.; Dahal, L.; Samarakoon, P.N.; Tushar, F.I.; Martí, R. DSNet: Automatic dermoscopic skin lesion segmentation. Comput. Biol. Med. 2020, 120, 103738. [Google Scholar] [CrossRef] [PubMed]
- Imran, A.; Nasir, A.; Bilal, M.; Sun, G.; Alzahrani, A.; Almuhaimeed, A. Skin Cancer Detection Using Combined Decision of Deep Learners. IEEE Access 2022, 10, 118198–118212. [Google Scholar] [CrossRef]
- Harangi, B. Skin Lesion Classification with Ensembles of Deep Convolutional Neural Networks. J. Biomed. Inform. 2018, 86, 25–32. [Google Scholar] [CrossRef] [PubMed]
- Diame, Z.E.; Al-Berry, M.N.; Salem, M.A.-M.; Roushdy, M. Autoencoder Performance Analysis of Skin Lesion Detection. Xi’nan Jiaotong Daxue Xuebao 2021, 56, 937–947. [Google Scholar] [CrossRef]
- Capurro, N.; Pastore, V.P.; Touijer, L.; Odone, F.; Cozzani, E.; Gasparini, G.; Parodi, A. A Deep Learning Approach to Direct Immunofluorescence Pattern Recognition in Autoimmune Bullous Diseases. Br. J. Dermatol. 2024, 191, 261–266. [Google Scholar] [CrossRef] [PubMed]
- Pacheco, A.G.C.; Krohling, R.A. The Impact of Patient Clinical Information on Automated Skin Cancer Detection. Comput. Biol. Med. 2020, 116, 103545. [Google Scholar] [CrossRef] [PubMed]
- Magdy, A.; Hussein, H.; Abdel-Kader, R.F.; Abd El Salam, K. Performance Enhancement of Skin Cancer Classification Using Computer Vision. IEEE Access 2023, 11, 72120–72133. [Google Scholar] [CrossRef]
- Sengupta, S.; Mittal, N.; Modi, M. Improved Skin Lesions Detection Using Color Space and Artificial Intelligence Techniques. J. Dermatol. Treat. 2020, 31, 511–518. [Google Scholar] [CrossRef] [PubMed]
- Kaggle. FYP Skin Disease Dataset. Available online: https://www.kaggle.com/datasets/bilalmanzoor2/fyp-skin-disease-dataset (accessed on 20 June 2025).
- TensorFlow. tf.data: Build TensorFlow Input Pipelines. TensorFlow Documentation. Available online: https://www.tensorflow.org/guide/data (accessed on 20 June 2025).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]