1. Introduction
Skin diseases, especially skin cancers such as melanoma, pose a considerable and escalating health issue due to their rising incidence and potential severity [1,2,3,4]. Timely and precise diagnosis is essential for enhancing patient outcomes and lowering the mortality rates linked to these conditions. Conventional clinical approaches rely heavily on visual assessments by experienced dermatologists, a process that can be subjective and constrained by the availability of specialists [5,6]. As a result, automated skin disease detection through artificial intelligence (AI) has become an essential focus for overcoming these challenges by providing reliable, efficient, and scalable diagnostic solutions. The diversity of skin lesions in color, texture, and morphology poses considerable obstacles to creating universally effective AI models: different diseases frequently resemble one another visually, while lesions of the same condition may differ only subtly, complicating precise classification [1,7].
This study aimed to evaluate the effectiveness of various AI models across several dermatological conditions, focusing on their accuracy, robustness, and generalizability. We also aimed to identify optimal approaches for accurate and reliable multi-class skin disease classification through a comparative evaluation of a custom convolutional neural network (CNN) and established transfer learning architectures, namely ResNet50, DenseNet201, and InceptionResNetV2. The evaluation employed the publicly accessible FYP Skin Disease Dataset, which includes images depicting a range of common and significant skin disorders.
This study contributes a thorough comparative analysis of AI model performance in multi-class skin disease classification, highlighting the advantages and drawbacks of both a tailored CNN architecture and widely used transfer learning models. These insights can guide ongoing improvements in AI architectures and training methodologies, advancing diagnostic tools that can effectively classify a range of dermatological conditions [5,8,9].
2. Materials and Methods
2.1. Dataset
The research utilized the FYP Skin Disease Dataset, which is publicly accessible on Kaggle. This dataset originally comprised 22,982 images classified into nine distinct dermatological conditions: acne, melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), squamous cell carcinoma (SCC), actinic keratosis (AK), seborrheic keratosis (SEK), dermatofibroma (DF), and vascular lesions (VASC) [10]. The dataset was refined to 4500 images to maintain balanced representation and enhance computational efficiency, with 500 images designated for each class. Images were standardized to a resolution of 300 × 300 pixels and divided into training (80%), validation (10%), and testing (10%) subsets.
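A minimal sketch of such a split is given below, assuming a directory layout with one subfolder per class; the path, file extension, and random seed are hypothetical. Splitting within each class keeps all three subsets balanced across the nine conditions.

```python
import pathlib
import random

# Hypothetical sketch: stratified 80/10/10 split over a directory tree
# with one subfolder per class ("fyp_skin_disease/<class>/*.jpg" assumed).
random.seed(42)
root = pathlib.Path("fyp_skin_disease")

train, val, test = [], [], []
for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    files = sorted(class_dir.glob("*.jpg"))
    random.shuffle(files)                  # shuffle within each class
    n = len(files)                         # 500 per class after curation
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train += files[:n_train]               # 400 images per class
    val += files[n_train:n_train + n_val]  # 50 images per class
    test += files[n_train + n_val:]        # 50 images per class
```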
2.2. Data Preprocessing
All images were resized to 224 × 224 pixels to match the models’ input specifications, and pixel values were normalized to the [0,1] range by dividing by 255. Data loading and preprocessing used TensorFlow’s tf.data pipeline, incorporating caching, shuffling with a buffer size of 1000, and prefetching set to AUTOTUNE to improve throughput [11].
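A minimal sketch of this pipeline, assuming a directory-per-class layout under a hypothetical "data/train" path, might look as follows:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

# Load images from a directory tree with one subfolder per class
# ("data/train" is a hypothetical path; adjust to the actual layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",
    image_size=IMG_SIZE,   # resize to the models' 224 x 224 input
    batch_size=BATCH_SIZE,
)

# Rescale pixel values from [0, 255] to [0, 1].
rescale = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda x, y: (rescale(x), y))

# Cache decoded images, shuffle with a 1000-element buffer, and prefetch
# upcoming batches while the GPU processes the current one.
train_ds = train_ds.cache().shuffle(1000).prefetch(tf.data.AUTOTUNE)
```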
2.3. Model Architectures
2.3.1. Custom CNN
The custom convolutional neural network used convolutional layers (Conv2D) with a 3 × 3 kernel, ReLU activation functions, and MaxPooling layers of size 2 × 2. The architecture comprised Conv2D layers with progressively increasing filter depth (32, 64, 128), followed by a flattening layer, a dense layer with 128 units and ReLU activation, a dropout layer with a rate of 0.5, and a final softmax layer for the nine classes.
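A sketch of this architecture in Keras follows; the text does not state the exact number of convolutional blocks per filter depth, so one Conv2D/MaxPooling pair per depth is assumed here:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Custom CNN sketch: three Conv2D (3 x 3, ReLU) / MaxPooling (2 x 2) blocks
# with 32, 64, and 128 filters, then Flatten, Dense(128, ReLU),
# Dropout(0.5), and a 9-way softmax output.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(9, activation="softmax"),
])
```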
2.3.2. Transfer Learning Models
Pre-trained ImageNet models—ResNet50 [12], DenseNet201 [13], and InceptionResNetV2 [14]—were utilized without their top classification layers. A global average pooling (GAP) layer was added, followed by a dense layer with 512 units and ReLU activation, a dropout layer with a rate of 0.5, and a final softmax layer for classification across the nine classes.
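A sketch of this setup is shown below, using DenseNet201 as the example backbone (ResNet50 and InceptionResNetV2 follow the same pattern via their tf.keras.applications constructors); whether the backbone weights were frozen during training is not stated in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained backbone without its top classification layers.
base = tf.keras.applications.DenseNet201(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),
)
# base.trainable = False  # freezing is an option; the text does not specify

# Classification head described in the text:
# GAP -> Dense(512, ReLU) -> Dropout(0.5) -> 9-way softmax.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(9, activation="softmax"),
])
```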
2.4. Training Setup
We developed all models using the TensorFlow 2.20.0 framework and the Keras API (version 3.11.3) on an NVIDIA GeForce RTX 3080 GPU. Training employed the Adam optimizer with a learning rate of 0.001, sparse categorical cross-entropy loss, a batch size of 32, and up to 50 epochs, with early stopping on validation loss and a patience of 5 epochs.
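Continuing the sketches above, the stated hyperparameters translate into the following compile and fit calls (restore_best_weights is our assumption, not stated in the text):

```python
import tensorflow as tf

# Adam (lr = 0.001), sparse categorical cross-entropy, up to 50 epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Stop training when validation loss fails to improve for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,  # assumption; not stated in the text
)

history = model.fit(
    train_ds,  # batch size of 32 is set in the tf.data pipeline
    validation_data=val_ds,
    epochs=50,
    callbacks=[early_stop],
)
```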
2.5. Evaluation Metrics
The evaluation of the model’s performance involved analyzing accuracy, loss curves, confusion matrices, and classification reports, which included precision, recall, and F1-score. Additionally, multi-class ROC curves were utilized to assess the class-wise AUC.
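The paper does not name the library used to compute these metrics; a plausible sketch with scikit-learn, assuming a test dataset built like the training pipeline but without shuffling, is:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Gather true labels and predicted probabilities from the test pipeline.
# This assumes test_ds yields batches in a fixed (unshuffled) order.
y_true = np.concatenate([y.numpy() for _, y in test_ds])
y_prob = model.predict(test_ds)
y_pred = y_prob.argmax(axis=1)

# Confusion matrix and per-class precision, recall, and F1-score.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

# Class-wise ROC/AUC via one-vs-rest binarization.
y_bin = label_binarize(y_true, classes=list(range(9)))
for c in range(9):
    fpr, tpr, _ = roc_curve(y_bin[:, c], y_prob[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```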
3. Results
The findings demonstrate that the custom CNN model achieved perfect accuracy (100%) alongside negligible loss (0.0017). DenseNet201 and InceptionResNetV2 achieved accuracy rates of 99.33% and 98.66%, respectively, while maintaining remarkably low losses. In contrast, ResNet50 performed notably worse, achieving an accuracy of only 44.41% with high loss, suggesting difficulties in effectively classifying multi-class skin lesions with this model, as shown in Figure 1.
The model evaluation is shown in Table 1 below.
The training and validation accuracy curves for ResNet50 demonstrated steady improvement, ultimately stabilizing at approximately 44%. The loss curves declined consistently but remained comparatively high, indicating that ResNet50 was not sufficiently adapted to identify the distinguishing features of the multi-class skin lesion dataset used in this analysis, as shown in Figure 2 below.
The accuracy curves of the custom CNN rose swiftly and steadily to peak performance, achieving 100% accuracy with negligible fluctuations. Similarly, the loss curves dropped sharply and remained close to zero throughout training, underscoring the custom CNN model’s strong capacity to fit and generalize the training data. The results are shown in Figure 3.
DenseNet201 demonstrated notable training performance, swiftly achieving accuracy levels close to 99%. The loss values decreased significantly during the early epochs and remained low and stable throughout the training phase, suggesting robust generalization ability. The results are shown in Figure 4 below.
The accuracy curves for InceptionResNetV2 demonstrated consistent enhancements, ultimately reaching a stabilization near 98%. The loss curves exhibited a steep decline at the beginning and maintained a consistently low level, indicating successful learning and strong generalization to the validation set.
3.1. Confusion Matrix Analysis
The confusion matrix indicates that ResNet50 struggled to differentiate among several classes, frequently confusing acne, NV, and MEL with other lesion types. This ambiguity is consistent with the model’s limited overall accuracy and suggests that ResNet50 failed to learn distinctive features for each skin disease category within this dataset, as illustrated in Figure 5. The CNN results are shown in Figure 6.
The matrix indicates that the custom CNN model successfully classified all test samples, resulting in zero misclassifications across the nine classes. This outcome supports earlier findings of the model achieving 100% test accuracy, highlighting its exceptional ability to learn and differentiate the features of each skin disease.
The matrix shows that DenseNet201, presented in Figure 7, produced highly accurate predictions, with only a handful of minor misclassifications (for instance, one AK image classified as BCC). This result reinforces the earlier finding of 99.33% accuracy and underscores the model’s robust generalization ability and discriminative strength across all categories.
The confusion matrix indicates that InceptionResNetV2, shown in Figure 8, attained high accuracy across all classes, with only slight misclassifications, such as NV being mistaken for MEL or VASC. This result confirms the model’s strong capability to generalize the characteristics of various skin conditions while maintaining high predictive accuracy.
3.2. ROC Curve and AUC Analysis
Table 2 shows the area under the curve (AUC) values for each class across all four models, providing valuable insights into their discriminative capabilities.
The results further substantiate previous findings, demonstrating that the custom CNN, DenseNet201, and InceptionResNetV2 models attained perfect AUC scores across all classes. In contrast, ResNet50 showed notably lower AUCs, especially for NV and DF, which suggests a lack of class separability, as illustrated in Figure 9. The ROC curves and AUC values for the custom CNN are shown in Figure 10, those for DenseNet201 in Figure 11, and those for InceptionResNetV2 in Figure 12.
3.3. Classification Report
The findings indicate exceptional class discrimination (AUC = 1.00) for the custom CNN, DenseNet201, and InceptionResNetV2, whereas ResNet50 demonstrated lower and more variable AUC values, corroborating previous observations of inadequate adaptation, as shown in Table 3 below. The classification report for the custom CNN is presented in Table 4, the results for DenseNet201 in Table 5, and those for InceptionResNetV2 in Table 6.
3.4. Visualization of Sample Predictions
To assess each model’s behavior on individual cases, predictions were generated for sample images drawn from the dataset. The test results for ResNet50 are shown in Figure 13, for the custom CNN in Figure 14, and for DenseNet201 in Figure 15. Figure 16 shows the results for the InceptionResNetV2 model.
4. Discussion
This study presents a detailed comparison of a custom-built CNN model with three transfer learning architectures—ResNet50, DenseNet201, and InceptionResNetV2—focusing on the multi-class classification of dermatological conditions. The exceptional performance of the Custom CNN, DenseNet201, and InceptionResNetV2 models highlights the capabilities of deep learning in automating skin disease diagnosis with remarkable accuracy.
ResNet50 performed notably worse than the other models, which can be attributed to its limited ability to adapt to the unique characteristics of the dermatological images in this dataset. The confusion matrix and ROC curves corroborated this finding, indicating that ResNet50 struggled to differentiate effectively among specific classes such as NV, DF, and MEL. The lower AUC values and classification metrics suggest that ResNet50’s deep residual connections may not have adequately captured the fine-grained features of skin lesion textures and variations without further tuning.
The custom CNN attained an impeccable classification score, indicating that a specialized architecture crafted for skin lesion images can surpass standard pre-trained models, particularly when computational resources permit comprehensive training from the ground up. Similarly, DenseNet201 and InceptionResNetV2 attained nearly flawless scores, demonstrating the efficacy of deep transfer learning models when utilized with suitable preprocessing and training methodologies.
An important finding is that all models, with the exception of ResNet50, exhibited consistently strong performance across all classes, even among visually similar categories like SEK and AK, which are typically challenging to differentiate. The classification reports and AUC tables indicate that models utilizing transfer learning, when designed with adequate depth and suitable architectural selections, demonstrate strong generalization capabilities upon fine-tuning.
The results highlight the critical role of choosing and fine-tuning models in the realm of medical imaging tasks. Models that are custom-designed might prove to be more effective in specific contexts, whereas carefully selected transfer learning models can deliver top-tier results with less data and reduced training durations.
5. Conclusions
This investigation provided a comparative assessment of four deep learning models aimed at multi-class skin disease classification, utilizing a balanced dataset comprising 4500 dermatoscopic images. The models comprised a tailored CNN, ResNet50, DenseNet201, and InceptionResNetV2. The findings indicated that the custom CNN attained an impressive 100% accuracy, with DenseNet201 not far behind at 99.33% and InceptionResNetV2 at 98.66%. In contrast, ResNet50 showed a notable deficiency, achieving only 44.41% accuracy.
The evaluation utilizing confusion matrices, AUC, and precision-recall metrics demonstrates that tailored architectures and effectively optimized transfer learning models can deliver exceptional accuracy and dependable performance for medical image classification tasks. Nonetheless, it is important to note that not every pre-trained model demonstrates effective generalization without undergoing fine-tuning.
In summary, this study underscores the practicality and effectiveness of AI-based diagnosis in dermatology, advocating for the continued advancement of refined deep learning solutions specifically designed for medical datasets. Future investigations could delve into the integration of clinical metadata, the utilization of advanced ensemble techniques, and the development of real-time diagnostic applications to enhance the significance of this study.
Author Contributions
Conceptualization, R.A.M., D.I.M. and M.N.R.; methodology, R.A.M., I.L.K. and K.; software, R.A.M. and I.L.K.; validation, I.L.K.; formal analysis, R.A.M.; resources, R.A.M. and D.I.M.; writing—original draft preparation, R.A.M., D.I.M. and M.N.R.; writing—review and editing, R.A.M. and I.L.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study, because the data were obtained from publicly available datasets.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Wei, L.; Ding, K.; Hu, H. Automatic Skin Cancer Detection in Dermoscopy Images Based on Ensemble Lightweight Deep Learning Network. IEEE Access 2020, 8, 99633–99647. [Google Scholar] [CrossRef]
- Hasan, M.K.; Dahal, L.; Samarakoon, P.N.; Tushar, F.I.; Martí, R. DSNet: Automatic dermoscopic skin lesion segmentation. Comput. Biol. Med. 2020, 120, 103738. [Google Scholar] [CrossRef] [PubMed]
- Imran, A.; Nasir, A.; Bilal, M.; Sun, G.; Alzahrani, A.; Almuhaimeed, A. Skin Cancer Detection Using Combined Decision of Deep Learners. IEEE Access 2022, 10, 118198–118212. [Google Scholar] [CrossRef]
- Harangi, B. Skin Lesion Classification with Ensembles of Deep Convolutional Neural Networks. J. Biomed. Inform. 2018, 86, 25–32. [Google Scholar] [CrossRef] [PubMed]
- Diame, Z.E.; Al-Berry, M.N.; Salem, M.A.-M.; Roushdy, M. Autoencoder Performance Analysis of Skin Lesion Detection. Xi’nan Jiaotong Daxue Xuebao 2021, 56, 937–947. [Google Scholar] [CrossRef]
- Capurro, N.; Pastore, V.P.; Touijer, L.; Odone, F.; Cozzani, E.; Gasparini, G.; Parodi, A. A Deep Learning Approach to Direct Immunofluorescence Pattern Recognition in Autoimmune Bullous Diseases. Br. J. Dermatol. 2024, 191, 261–266. [Google Scholar] [CrossRef] [PubMed]
- Pacheco, A.G.C.; Krohling, R.A. The Impact of Patient Clinical Information on Automated Skin Cancer Detection. Comput. Biol. Med. 2020, 116, 103545. [Google Scholar] [CrossRef] [PubMed]
- Magdy, A.; Hussein, H.; Abdel-Kader, R.F.; Abd El Salam, K. Performance Enhancement of Skin Cancer Classification Using Computer Vision. IEEE Access 2023, 11, 72120–72133. [Google Scholar] [CrossRef]
- Sengupta, S.; Mittal, N.; Modi, M. Improved Skin Lesions Detection Using Color Space and Artificial Intelligence Techniques. J. Dermatol. Treat. 2020, 31, 511–518. [Google Scholar] [CrossRef] [PubMed]
- Kaggle. FYP Skin Disease Dataset. Available online: https://www.kaggle.com/datasets/bilalmanzoor2/fyp-skin-disease-dataset (accessed on 20 June 2025).
- TensorFlow. tf.data: Build TensorFlow Input Pipelines. TensorFlow Documentation. Available online: https://www.tensorflow.org/guide/data (accessed on 20 June 2025).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]