Article

Deep Learning-Based Web Application for Automated Skin Lesion Classification and Analysis

1 Institute of Computer Science, Ludwig Maximilian University of Munich (LMU), Oettingenstrasse 67, 80538 Munich, Germany
2 Institute of Materials Science, Technical University of Munich (TUM), Boltzmannstr. 15, 85748 Garching bei Munich, Germany
3 Department of Mechanical Engineering, Aydin Adnan Menderes University (ADU), Aytepe, Aydin 09010, Turkey
* Author to whom correspondence should be addressed.
Submission received: 14 March 2025 / Revised: 14 April 2025 / Accepted: 18 April 2025 / Published: 24 April 2025
(This article belongs to the Collection Artificial Intelligence in Dermatology)

Abstract
Background/Objectives: Skin lesions, ranging from benign to malignant conditions, pose a difficult dermatological challenge because of their great diversity and variable severity. Early detection and proper classification, particularly among benign Nevus (NV), precancerous Actinic Keratosis (AK), and Squamous Cell Carcinoma (SCC), are crucial for improving treatment effectiveness and patient prognosis. The goal of this study was to test deep learning (DL) models to determine the best architecture for classifying lesions and to create a web-based platform for improved diagnostic and educational accessibility. Methods: Several DL models, namely Xception, DenseNet169, ResNet152V2, InceptionV3, MobileNetV2, EfficientNetV2 Small, and NASNetMobile, were compared for classification accuracy. The top model was incorporated into a web application that allows users to upload images for automatic classification and reports confidence scores as a measure of prediction reliability. The tool also provides enhanced visualization capabilities, allowing users to investigate feature maps derived from convolutional layers and thereby improving interpretability. Web scraping and summarization techniques were also employed to offer concise, evidence-based dermatological information from established sources. Results: Of the models evaluated, DenseNet169 achieved the best classification accuracy of 85% and was therefore chosen as the base architecture for the web application. The application enhances diagnostic clarity by visualizing features and promotes access to trustworthy medical information on dermatological disorders. Conclusions: The developed web application serves as both a diagnostic support system for dermatologists and an educational tool for the general public. By combining DL-based classification, interpretability techniques, and automatic medical information extraction, it facilitates early intervention and raises awareness of skin health.

1. Introduction

Several studies have investigated AI-based skin lesion classification using machine learning (ML) and deep learning (DL) methods. This section provides an overview of the key studies, their approaches, parameters of interest, and findings:
Calderón et al. [1] employed a Bilinear CNN trained on dermatoscopic images to achieve precise skin lesion classification.
Mehmood et al. [2] made use of the Xception network, which is a shallow and wide network, and they conducted a fine-tuning of the network specifically to push its performance in the classification of skin lesions further. Through this fine-tuning, they achieved successful classification outcomes that verified the potential of the network in this regard.
Pathania and Behki [3] employed DenseNet169, a sophisticated deep learning model, meticulously trained on a skin cancer dataset comprising a variety of dermoscopy images. The model achieved a high accuracy of 89.7%. Notably, it was also integrated into clinical workflows, thereby facilitating improved diagnosis in a real-world medical setting.
Singh et al. [4] suggested a comprehensive deep learning framework incorporating uncertainty estimation as one of the key components. The framework was developed to effectively combine uncertainty estimation and explainability concepts with the ultimate goal of significantly boosting diagnostic confidence for users and practitioners alike.
Rahman et al. [5] employed an Ensemble Learning Approach, in which various models were combined strategically to achieve better performance and boost classification accuracy. Through this strategy, they were able to classify skin lesions correctly into a number of different classes.
Mporas et al. [6] combined traditional ML with image preprocessing, applying techniques such as segmentation and hair removal, and used AdaBoost and Random Forest classifiers.
Hoang et al. [7] presented a novel concept named Wide-ShuffleNet, a variant of the ShuffleNet architecture. It utilizes techniques such as group convolution, channel shuffle operations, skip connections, and batch normalization to produce a thin deep learning model. The architecture is claimed to significantly improve training efficiency without increasing model size.
Saba et al. [8] applied transfer learning using a deep CNN, which enhanced contrast, boundary definition of lesions, and feature extraction, ultimately resulting in automatic lesion detection.
Chaturvedi et al. [9] conducted a thorough comparative study focused on well-known CNNs such as Xception and InceptionV3, comparing in detail the performance of pre-trained CNNs and ensemble models on a multiclass classification task.
Maqsood and Damaševičius [10] suggested an innovative deep learning framework with feature selection as a major component. They place strong emphasis on feature engineering, designed to enhance classification accuracy and yield more precise results.
Salma and Eltrass [11] introduced a new automated deep learning method, specifically tailored for skin lesion classification. Grab-cut segmentation, a state-of-the-art method, was utilized in combination with pre-trained Convolutional Neural Network (CNN) models such as VGG-16, ResNet50, and others in this methodology. Through the integration of advanced deep learning techniques and effective image preprocessing methods, they accomplished a significant improvement in classification accuracy.
Aksoy [12] conducted a study on the classification of melanoma based on the MobileNet-V3-Large model as a feature extractor in a multi-input deep learning model. The research was successful in incorporating both visual data and structured data (age, gender, and location of the lesion), proving that multi-modal learning greatly improves classification efficacy. The research reflected a remarkable accuracy rate of 99.56%, indicating the enormous potential of deep learning in melanoma diagnosis.
Ozdemir et al. [13] proposed a new hybrid deep learning architecture for the multiclass classification of skin cancer using the ISIC 2019 dataset, which consists of eight distinct classes of skin lesions. The proposed architecture integrates ConvNeXtV2 blocks with separable self-attention mechanisms for enhancing feature extraction and classification accuracy while maintaining computational efficiency.
Shakya et al. [14] used deep learning to classify skin cancer images from the ISIC 2018 dataset segmented by active contour. Their experiments compare fine-tuned pre-trained networks, various pre-trained networks as feature extractors with different optimizers, and the fusion of both. The highest accuracy (92.87%) was obtained by fusing ResNet-18 and MobileNet with an SVM classifier.
Chatterjee et al. [15] made a thorough study of Cross-Correlation-Based Features, paying close attention to their usage in Spatial and Frequency Domains. With this in-depth analysis, they determined a considerable enhancement in the accuracy attainable using these specially extracted features.
Shetty et al. [16] integrated ML algorithms with CNN models, leveraging the benefits of both methodologies for more precise classification.
Fraiwan and Faouri [17] utilized deep transfer learning techniques, i.e., utilizing pre-trained models, in an attempt to achieve extremely high accuracy even when dealing with limited data.
Jain et al. [18] and Huang et al. [19] worked on transfer learning-based methods, proving the significance of pre-trained models in the classification of skin lesions.
Tajerian et al. [20] utilized EfficientNet-B1 with Global Average Pooling and a softmax layer for lesion classification, making the model more efficient.
Ali et al. [21] proposed an EfficientNet Image Pipeline, which included image preprocessing tasks like hair removal, dataset augmentation, and image resizing, leading to optimized classification on the EfficientNet models (B0–B7).
Srinivasu et al. [22] combined MobileNet V2 with LSTM, improving classification accuracy through sequence learning techniques.
Salamaa et al. [23] compared ResNet50 and VGG-16 models with the addition of preprocessing techniques, including SVM for inclusion/exclusion, to enhance classification results.
Himel et al. [24] proposed Vision Transformer Architecture, utilizing self-attention mechanisms for spatial feature extraction to enhance lesion classification accuracy.
Mehr et al. [25] used Inception-ResNet-v2 along with Patient Data, such as age, sex, and anatomical site information, to further improve the model’s performance.
Chen et al. [26] provided a deep learning-based method for pigmented skin disease image classification, referred to as the skin-global attention block (Skin-GAB). The suggested approach proved remarkably effective in terms of both accuracy and feasibility. Compared to an Xception classification architecture with the convolutional block attention module (CBAM) as the attention mechanism on the HAM10000 dataset, the proposed framework attained an accuracy improvement of 2.89%.
Hanum et al. [27] merged 39 skin lesion types from five datasets. They compared five state-of-the-art deep neural models, including MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer. In order to optimize model robustness and accuracy, they introduced Efficient Channel Attention (ECA) and Convolutional Block Attention Module (CBAM). A comparison showed that the Vision Transformer model with CBAM outperformed others with 93.46% accuracy, 94% precision, 93% recall, 93% F1-measure, and 93.67% specificity.
Georgiadis et al. [28] illustrated that hyperdatasets with more data enhance the classification accuracy of models, whether they are trained from scratch or through transfer learning. The Vision Transformer (ViT) performed better in comparison to CNNs, achieving 91.87% accuracy for nine classes and 58% accuracy for 32 classes on larger hyperdatasets.
Skin lesions are visible changes in the skin, such as abnormal spots, bumps, or color changes. They can result from excess sun exposure, heredity, infections, or environmental exposure. Some, like moles and warts, are benign, but others, like melanoma, are more severe and can require treatment [29,30]. Skin cancer is the most commonly diagnosed cancer in the world [1,31] and a significant public health problem, with millions of new cases annually. Both non-melanoma and melanoma skin cancers have been growing in number: approximately 2 to 3 million non-melanoma skin cancer cases and 132,000 melanoma cases are diagnosed worldwide each year [32].
The increasing incidence of melanoma heightens the need for early detection so that the disease can be treated successfully. It is critical to differentiate between malignant melanoma and benign lesions in order to provide patients with the best treatment [2,33]. Skin cancer falls into two main classes: non-melanoma skin cancers (NMSCs) and melanoma [34]. While less prevalent, melanoma is far more dangerous and tends to grow rapidly, so its early detection is crucial. The most prevalent NMSCs are basal cell carcinoma (BCC) and Squamous Cell Carcinoma (SCC); BCC does not generally spread but can cause severe problems when left untreated, while SCC has a higher chance of spreading but is usually treatable when detected early. The rise in NMSCs is essentially caused by increased UV radiation exposure. Early detection of melanoma is especially important, as survival exceeds 90% when it is diagnosed early and drops significantly as the illness progresses [35]. It can be difficult to determine whether a skin lesion is benign or malignant, even for skilled dermatologists, and such uncertainty occasionally leads to treatment delays or unnecessary procedures [36,37]. The advent of AI, particularly DL methods, has recently generated significant enthusiasm because of their high accuracy and performance in skin lesion classification.
The primary technical contribution of this work is the explicit classification of three clinically relevant types of skin lesions: Actinic Keratosis, Squamous Cell Carcinoma, and Nevus. Explainability is achieved through visualization of intermediate convolutional layers at inference time, so users can inspect the model's internal feature representations in an understandable manner. Furthermore, the system features an automatic medical knowledge extraction pipeline based on web scraping, pulling structured information from trusted clinical sources, in this case the Cleveland Clinic. The pipeline is augmented with Meta's BART-CNN-Large natural language processing model, which allows the creation of brief, contextually pertinent, evidence-based summaries within the web application. This combined architecture improves diagnostic transparency and interpretability and gives users better access to quality dermatological information. Consequently, the platform helps close typical information gaps in clinical decision-making and patient education, raising the usefulness and trustworthiness of the system for clinicians.

2. Materials and Methods

2.1. Data Preprocessing and Augmentation Strategies for Enhanced Model Performance

The ISIC 2019 dataset (https://www.kaggle.com/datasets/salviohexia/isic-2019-skin-lesion-images-for-classification (accessed on 2 January 2025)) consists mainly of dermoscopic images from lighter-skinned populations, which can introduce bias. This may limit the model's applicability to dark-skinned individuals, given differences in the appearance and contrast of lesions, and can impact reliability if such cases are not well represented in the training dataset. This is a common issue with most skin lesion datasets, and more datasets covering darker skin types should be made publicly available. However, the two major skin lesion classification datasets employed by most studies, i.e., HAM10000 and the ISIC collections, were drawn predominantly from lighter-skinned populations, and for classification reliability we had to employ a dataset containing sufficient, high-quality data.
The database consists of standardized dermoscopic images, which reduces the variation between various image acquisition setups and devices. Real-life cases, on the other hand, present various modalities and techniques, to which it cannot be assured that the dataset is fully representative, thus reducing the model’s robustness in real-life dermatology practice.
The ISIC 2019 dataset, derived from the 2017 and 2018 ISIC challenges, is an extensive dataset for skin lesion analysis, comprising a total of 25,331 images across nine diagnostic classes. For the purposes of this study, a specific subset of 1884 unique images was carefully selected, targeting three classes of high clinical significance: Melanocytic Nevus (NV), benign lesions that are typically harmless; Actinic Keratosis (AK), a precancerous condition that can lead to more serious disease; and Squamous Cell Carcinoma (SCC), a malignant condition that can arise from the above-stated AK (as illustrated in Figure 1). The subset contained 628 images per class, providing a well-balanced collection and allowing the study of skin lesions along a progression from benign to malignant states, which underlines the need for early treatment and accurate diagnosis in managing these lesions. The data were divided systematically into distinct training, validation, and test sets: 90% of the dataset was used for training the models, and the remaining data were split equally, 5% for validation and 5% for testing, to ensure rigorous performance evaluation. This strategy guaranteed class balance and efficient model training while limiting the chances of overfitting.
During the first step of preprocessing, a significant step taken was resizing all the images to a standard size of 256 × 256 pixels. The resizing was performed so that all the images would have standard sizes throughout the dataset, which would make it convenient to perform better in subsequent analysis. Additionally, in order to optimize the learning process, their pixel values were carefully normalized within a specific range of [0, 1]. This scaling technique was helpful in achieving stability improvement during training by successfully reducing the overall magnitude of the feature values being learned. Furthermore, a number of data augmentation techniques were strategically employed to significantly increase both the diversity and strength of the training and validation sets, thereby improving the model’s generalization ability. The augmentations used in this research involved a range of random rotations that could be up to 30 degrees, as well as vertical and horizontal shifts that could cover up to 50% of the entire image size. In addition, shear transformations were also applied, along with zooming features that could go up to 50%. Moreover, the process also included flipping the images horizontally, as shown in Figure 2. These various forms of augmentations simply added a level of variability in the imaging process, allowing for the capture of various orientations, differences in lesion sizes, and different angles on the images. To effectively deal with and prevent any blank spaces that were generated due to these various transformations, a nearest-neighbor filling approach was used. This was particularly used to maintain the integrity and preservation of the images during the augmentation process.
For the validation set, a data augmentation policy that was a bit less aggressive than the other policies was employed. This was to rightly balance adding diversity to the dataset while ensuring it remains representative of unseen data. To be more specific, rotation angles were set to a maximum of 30 degrees to prevent excessive distortion. Horizontal and vertical translations were also significantly restricted, with movement accounting for just 30% of the entire image size allowed to preserve the integrity of the dataset. Shear and zooming operations were also strictly controlled to ensure optimal results; each of these operations was specifically restricted to a maximum of 30%. This limitation was introduced as a strategic effort to maintain the overall integrity of the images while still allowing for some degree of moderate variability to be duplicated successfully. Horizontal flipping was also employed as a strategic mode of flipping asymmetric patterns correctly, ensuring that this operation was performed in a manner that did not enforce any visible distortion. All this caution was employed to ensure that the integrity of the images was never lost in any way during this process. The same nearest-neighbor filling method was once more utilized to answer and fill in appropriately any subsequent gaps that had occurred as a direct consequence of these augmentations, thereby supplying a high level of overall visual coherence within the images. On the other hand, the test set itself was completely left in its original form, except for the application of pixel rescaling, which was employed to significantly improve the overall quality of the images.
Table 1 illustrates the distribution of the dataset: 565 images for training, 32 for validation, and 31 for testing per class, totaling 628 images per class. Overall, this amounts to 1695 images for training, 96 for validation, and 93 for testing, for a total of 1884 images. The data are divided systematically to allow a balanced and complete evaluation, with the majority of images reserved for model training and smaller groups for validation and final testing.
The original data included 1884 images, with 628 images per class. To partially address data sparsity and improve generalization, the data were augmented roughly tenfold during training, yielding around 16,950 training samples. The augmentations were performed dynamically by on-the-fly data generators, without saving the augmented samples as individual image files. For the validation set, more modest augmentation was applied, yielding around 960 augmented validation samples generated at training time. A minimal sketch of these generators is given below.
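The following is a minimal sketch of the on-the-fly generators described above, using the Keras `ImageDataGenerator` API; the augmentation parameters follow the text, while the directory layout and batch size are illustrative assumptions.

```python
# Sketch of the on-the-fly augmentation pipeline (parameters from Section 2.1;
# directory names are placeholders for the train/val/test split).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (256, 256)
BATCH_SIZE = 32

# Aggressive augmentation for the training set.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=30,        # random rotations up to 30 degrees
    width_shift_range=0.5,    # horizontal shifts up to 50% of image width
    height_shift_range=0.5,   # vertical shifts up to 50% of image height
    shear_range=0.5,
    zoom_range=0.5,
    horizontal_flip=True,
    fill_mode="nearest",      # fill blank regions created by the transforms
)

# Milder augmentation for the validation set.
val_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,
    width_shift_range=0.3,
    height_shift_range=0.3,
    shear_range=0.3,
    zoom_range=0.3,
    horizontal_flip=True,
    fill_mode="nearest",
)

# The test set is only rescaled, never augmented.
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = train_datagen.flow_from_directory(
    "data/train", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical")
val_gen = val_datagen.flow_from_directory(
    "data/val", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical")
test_gen = test_datagen.flow_from_directory(
    "data/test", target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical", shuffle=False)
```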

2.2. Proposed Deep Learning Model for Skin Lesion Classification

The proposed model was selected for this study primarily because of its ability to seamlessly integrate various components. It utilizes DenseNet169 as its backbone architecture, which achieved an accuracy of 85%, the highest among all the models employed in this study. This model was chosen as the backbone because of its optimal balance between depth, parameter efficiency, and robust feature extraction capabilities. DenseNet169 belongs to the family of Dense Convolutional Networks, defined by a novel architecture that connects each layer to every other layer in a feed-forward fashion. This dense connectivity substantially enhances gradient flow during training, mitigates vanishing-gradient problems, and dramatically increases feature reuse throughout the network. As a result, DenseNet169 is particularly suited to applications that require careful and detailed feature extraction, such as the classification of dermoscopic images. DenseNet169 consists of a series of dense blocks interleaved with transition layers; the transition layers combine convolutional and pooling operations, reducing the spatial dimensions of the input while preserving the richness of the generated feature maps. With this design, DenseNet169 effectively captures intricate patterns in data with a relatively small number of parameters, making it particularly suitable for detecting the subtle texture, color, and shape variations present in dermoscopic images.
In this particular study, the last 10 layers of the DenseNet169 model were thoroughly fine-tuned so that they could successfully learn to accommodate the specific and challenging task of dermoscopic image classification. This was complemented by taking pre-trained weights from ImageNet, which allowed successful transfer learning to enhance the model’s performance. Additionally, the topmost layers of the backbone structure were replaced with a custom-designed classification head, specifically designed to further enhance the extracted features so that they are optimally tuned for this specific field of research.
The classification head began with a global average pooling layer to condense feature maps into a fixed-size vector. This was followed by fully connected layers designed to learn progressively more complex features. The first of these contained 256 neurons with ReLU activation and L2 regularization (penalty factor 0.01), applied to limit the risk of overfitting by controlling the magnitude of the model's weights. Two further layers followed, with 128 and 64 neurons, respectively, both using the ReLU activation function to detect patterns. A dropout layer with a 20% rate was added after the 128-neuron layer to enhance generalization; this technique disables some neurons during training, which stabilizes the model and reduces overfitting. Gaussian noise (std 0.001) was introduced before the final layer to improve robustness to image variability. The output layer had three neurons for the classes NV, AK, and SCC, with a softmax function producing class probabilities. Figure 3 shows the DenseNet169-based model, starting with a Conv2D layer (kernel 7 × 7, stride 2, 64 filters) followed by a 3 × 3 max pooling layer (stride 2). There are four dense blocks and three transition layers; dense blocks are made up of convolutional layers with dense connections, and transition layers use batch normalization, a 1 × 1 convolution, and 2 × 2 average pooling to reduce dimensions. The DenseNet backbone is followed by the modified classification head: GlobalAveragePooling2D, Dense (256) with L2 regularization (λ = 0.01), Dense (128) with ReLU, Dropout (rate = 0.2), Dense (64) with ReLU, GaussianNoise (stddev = 0.001), and Dense (3) with Softmax for classification into AK, NV, or SCC.
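The head and fine-tuning scheme described above can be expressed as the following Keras sketch; the layer sizes and hyperparameters follow the text and Figure 3, while the variable names and input handling are illustrative.

```python
# Sketch of the DenseNet169 backbone with the custom classification head.
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import DenseNet169

# Backbone pre-trained on ImageNet, without its original classifier.
backbone = DenseNet169(weights="imagenet", include_top=False,
                       input_shape=(256, 256, 3))

# Fine-tune only the last 10 layers; keep the rest frozen.
for layer in backbone.layers[:-10]:
    layer.trainable = False

# Custom classification head, in the order given in Figure 3.
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.GaussianNoise(0.001),            # active only during training
    layers.Dense(3, activation="softmax"),  # AK, NV, SCC probabilities
])
```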
The model was optimized with the Adam algorithm, which has proven effective for training neural networks. The learning rate was carefully set to 0.0001 to keep training efficient and smooth. Categorical cross-entropy was employed as the loss function, a suitable choice given the multiclass nature of the task, and accuracy was selected as the main metric for assessing overall performance. Training was carried out over 50 epochs using the augmented datasets, which help the model learn efficiently.
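Compilation and training with the reported settings might then look as follows; this is a sketch in which `train_gen` and `val_gen` are the generators described in Section 2.1.

```python
# Compile and train with the settings reported in the text.
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-4),  # small LR for smooth fine-tuning
    loss="categorical_crossentropy",     # multiclass objective
    metrics=["accuracy"],
)

history = model.fit(train_gen, validation_data=val_gen, epochs=50)
```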

2.3. Experimental Setup: Data Preprocessing, Model Training, and Evaluation Framework

The experimental setup commenced with a rigorous data preprocessing and augmentation procedure, a crucial step for strengthening the models and their ability to generalize to unseen data. The dataset was first normalized, with all pixel values rescaled to the range [0, 1]. A comprehensive set of aggressive augmentation techniques was then applied systematically to the training set to artificially increase its variability and diversity: random rotations of up to 30 degrees, horizontal and vertical shifts of up to 50% of the image size, shearing and zooming by a factor of 0.5, and horizontal flipping. To keep the images intact during these transformations, any new pixels generated were filled using the nearest-neighbor method. For the validation data, the same augmentation techniques were applied at reduced intensities so as to add diversity without compromising the set: random rotations of up to 30 degrees, shifts of up to 30%, shearing and zooming by a factor of 0.3, and horizontal flips, all introducing moderate variability without substantially changing the inherent properties of the data. Following preprocessing, all models were trained and tested under the same conditions, which was critical for an unbiased and fair comparison of their performance.
The models utilized in this study were Xception, DenseNet169, ResNet152V2, InceptionV3, MobileNetV2, EfficientNetV2 Small, and NASNetMobile, selected for their established effectiveness and efficiency in image classification tasks. Each model was initialized with weights pre-trained on the ImageNet dataset to take full advantage of transfer learning. For fine-tuning, the last 10 layers of each feature extraction module were unfrozen and set to be trainable, with the remaining layers left frozen to preserve their generalized feature extraction capability. The classification heads of all models were swapped out for a custom architecture: a global average pooling layer to reduce the spatial dimensions of the learned feature maps, followed by densely connected layers of 256 and 128 neurons, both applying the ReLU activation function to introduce non-linearity. To minimize the risk of overfitting, which can be detrimental to performance on new, unseen data, kernel regularization with a weight decay factor of 0.01 was applied to the first dense layer. In addition, a dropout rate of 20% was applied after the second dense layer so that some neurons would be randomly deactivated during training, preventing the model from relying too heavily on any one set of features. To further enhance regularization, a Gaussian noise layer with a standard deviation of 0.001 was introduced into the architecture; this simulates noise in the input space and therefore yields a more robust model. Lastly, the output layer consisted of three neurons, each committed to one of the three lesion classes, with a softmax activation function applied at this final stage to generate class probabilities. A sketch of this comparative setup is given below.
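As an illustration, each `keras.applications` backbone can be paired with the same custom head via a small helper; this is a sketch, and some backbones impose their own input-size or preprocessing conventions that a real run would need to respect.

```python
# Sketch of the comparative setup: seven backbones sharing one custom head.
from tensorflow.keras import layers, models, regularizers, applications

BACKBONES = {
    "Xception": applications.Xception,
    "DenseNet169": applications.DenseNet169,
    "ResNet152V2": applications.ResNet152V2,
    "InceptionV3": applications.InceptionV3,
    "MobileNetV2": applications.MobileNetV2,
    # EfficientNetV2 models bundle their own input rescaling by default.
    "EfficientNetV2S": applications.EfficientNetV2S,
    # NASNetMobile may require its default 224x224 input with ImageNet weights.
    "NASNetMobile": applications.NASNetMobile,
}

def build_candidate(backbone_fn):
    base = backbone_fn(weights="imagenet", include_top=False,
                       input_shape=(256, 256, 3))
    for layer in base.layers[:-10]:  # train only the last 10 backbone layers
        layer.trainable = False
    return models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.GaussianNoise(0.001),
        layers.Dense(3, activation="softmax"),
    ])

candidates = {name: build_candidate(fn) for name, fn in BACKBONES.items()}
```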
The models were trained using the Adam optimizer with a learning rate of 1 × 10⁻⁴, chosen to balance learning efficiency and model stability during training. The categorical cross-entropy loss function was used as the objective, an appropriate choice for the multiclass classification problem at hand. All models were trained for 50 epochs with a batch size of 32, selected so that the models were exposed to a diverse range of examples in each training step. During training, we closely monitored validation loss and accuracy after each epoch to track model performance and detect any hint of overfitting.
In addition, the model evaluation was executed in a routine and systematic process. The test set initially underwent a processing phase that involved the same normalization techniques applied to the training data. It should be noted that augmentation was not applied during the process. The rationale underlying this was the need to have the resultant predictions based solely on the original and actual image data. Upon the completion of the training process, the test set was then fed through the models. This was carried out with the objective of calculating a number of key metrics, including accuracy, precision, recall, and F1-score. These metrics all provide a complete and overall assessment of the predictive performance exhibited by the models across all classes of the study.
Confusion matrices were carefully created to give a visual outline of the distribution of predictions by the models. The matrices successfully outlined cases of both correct classifications and misclassifications, enabling a detailed examination of the prediction results. Moreover, classification reports were presented in the analysis, which highlighted critical performance measures in terms of precision, recall, and F1-scores for every class. This was given to provide a clearer and more detailed impression of how the models performed in handling different kinds of lesions. The training process was closely analyzed by plotting the loss and accuracy curves against the training and validation sets. This was a very useful exercise in tracking progress throughout the training process, determining if any overfitting had taken place, and gaining a greater insight into how the models were effectively learning from the data being presented (Figure 4).
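A sketch of this evaluation step is shown below; it assumes scikit-learn and matplotlib and reuses the `model`, `history`, and `test_gen` objects from the training sketches above.

```python
# Sketch of the evaluation described above: classification report,
# confusion matrix, and training curves.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)

y_prob = model.predict(test_gen)   # probabilities on the untouched test set
y_pred = np.argmax(y_prob, axis=1)
y_true = test_gen.classes          # ground truth; requires shuffle=False
class_names = list(test_gen.class_indices)

# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, target_names=class_names))

# Confusion matrix of correct classifications and misclassifications.
ConfusionMatrixDisplay(confusion_matrix(y_true, y_pred),
                       display_labels=class_names).plot()
plt.show()

# Training/validation loss and accuracy curves from the History object.
for metric in ("loss", "accuracy"):
    plt.figure()
    plt.plot(history.history[metric], label=f"train {metric}")
    plt.plot(history.history[f"val_{metric}"], label=f"val {metric}")
    plt.legend()
plt.show()
```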
To further support training efficiency and model reproducibility, all experiments were developed in a controlled environment on the TensorFlow and Keras backend frameworks. GPU acceleration was utilized to speed up the training process, which was particularly critical for deeper architectures such as DenseNet169 and ResNet152V2, whose large numbers of trainable parameters demand considerable computational resources. Weight initialization was performed through the He normal initialization scheme, which preserves variance across layers and stabilizes gradient flow in deep networks. Model checkpointing and early stopping callbacks were also implemented within the training loop to save the best-performing models according to validation loss; this reduced the risk of overfitting and prevented unnecessary computational cost from extended training. Lastly, random seeds were set for all libraries and frameworks utilized (e.g., NumPy, TensorFlow) to achieve deterministic behavior and ease reproducibility of outcomes across runs.
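The reproducibility measures described above might be set up as follows; this is a sketch in which the seed value and early-stopping patience are illustrative assumptions.

```python
# Sketch of the seeding and callback setup described above.
import random
import numpy as np
import tensorflow as tf

SEED = 42  # illustrative; any fixed value serves reproducibility
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

callbacks = [
    # Persist only the weights achieving the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint("best_model.keras",
                                       monitor="val_loss",
                                       save_best_only=True),
    # Halt training once validation loss stops improving (patience assumed).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# Custom dense layers can adopt He initialization via
# kernel_initializer="he_normal"; the callbacks above are passed to model.fit().
```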
The procedure was continuously repeated on all models to normalize and control the experimental environment throughout the experiment. Outcomes of these experiments were saved and plotted for the comparison of various models, assessed by terminal performance metrics, training efficacy, and stability across epochs.
An intuitive Gradio web application was created to let users easily upload skin lesion images and receive classification results. The application includes several interactive tools that support a more thorough analysis of the uploaded images. It leverages the proposed model to predict the three targeted classes of skin lesions (NV, SCC, and AK) and shows confidence scores for each class, helping users gauge the certainty of the model's classification. When an image is uploaded, the app first performs a crucial preprocessing step, resizing and normalizing the image to match the dimensions on which the model was trained, before generating predictions. In addition to returning classification results, users can observe how the model perceives images at various convolutional layers by selecting a layer from a dropdown menu. This selection generates feature maps that reveal the activations at that particular layer, visualized through clear and informative plots. Another distinguishing aspect of the app is a state-of-the-art summarization pipeline using Meta's BART-CNN-Large model from Hugging Face's transformers library, which helps users query and access relevant information on various skin diseases productively. The app further employs web scraping to retrieve important medical details directly from the Cleveland Clinic website, a dependable and authoritative source, applying BeautifulSoup to extract detailed information about symptoms, treatments, prevention, and other pertinent topics. Users can select a skin disease from the provided list and pose a question to receive concise responses generated by the app's summarization model (refer to Table 2). These features enrich the app for doctors and researchers alike, offering an easy-to-use interface and valuable information on dermatology and skin lesion classification. A condensed sketch of the interface is given below.
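The following is a condensed, illustrative sketch of such an interface; the saved-model path, layer selection logic, and component layout are assumptions rather than the exact production code.

```python
# Sketch of a Gradio interface: classification with confidence scores plus
# feature-map visualization for a user-selected convolutional layer.
import gradio as gr
import matplotlib.pyplot as plt
import tensorflow as tf

model = tf.keras.models.load_model("best_model.keras")  # path assumed
CLASS_NAMES = ["AK", "NV", "SCC"]
backbone = model.layers[0]  # the DenseNet169 base inside the Sequential
conv_layer_names = [l.name for l in backbone.layers if "conv" in l.name]

def preprocess(image):
    # Resize and normalize to match the training configuration.
    img = tf.image.resize(image, (256, 256)) / 255.0
    return tf.expand_dims(img, 0)

def classify(image):
    # Confidence score for each of the three lesion classes.
    probs = model.predict(preprocess(image))[0]
    return {c: float(p) for c, p in zip(CLASS_NAMES, probs)}

def feature_maps(image, layer_name):
    # Activations of the chosen convolutional layer, plotted as a 4x4 grid.
    extractor = tf.keras.Model(backbone.input,
                               backbone.get_layer(layer_name).output)
    acts = extractor.predict(preprocess(image))[0]
    fig, axes = plt.subplots(4, 4, figsize=(8, 8))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < acts.shape[-1]:
            ax.imshow(acts[..., i], cmap="viridis")
    return fig

with gr.Blocks() as demo:
    image = gr.Image(type="numpy", label="Dermoscopic image")
    label = gr.Label(num_top_classes=3, label="Prediction confidence")
    layer = gr.Dropdown(conv_layer_names, label="Convolutional layer")
    maps = gr.Plot(label="Feature maps")
    image.change(classify, image, label)
    layer.change(feature_maps, [image, layer], maps)

demo.launch()
```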

3. Results

DenseNet169 was the overall best-performing model in this study, achieving a classification accuracy of 85%. At the level of individual classes, it achieved a precision of 0.82, a recall of 0.74, and an F1-score of 0.78 for the AK class, demonstrating solid effectiveness there. On the NV class, it was highly reliable, with a precision of 0.94, a perfect recall of 1.00, and an F1-score of 0.97, indicating an excellent ability to classify benign instances accurately. The scores for the SCC class were also competitive, with a precision of 0.78, a recall of 0.81, and an F1-score of 0.79; while lower than for the NV class, this still represents effective identification.
Training and validation loss curves dropped consistently, indicating good learning, and the close agreement between validation and training losses showed the model's ability to generalize to new data. The training and validation accuracy curves showed rapid improvement early in training before reaching a plateau. The small difference between validation and training accuracy indicates that the model is robust and does not overfit; later fluctuations in validation accuracy are normal given the limited validation set. Table 3 contains the training and validation loss curves, accuracy curves, and confusion matrix plots for the best-performing model.
DenseNet169 outperformed the other models on the key metrics. Although ResNet152V2 achieved an accuracy of 78%, it performed poorly in SCC classification, as seen from its F1-score of 0.70. ResNet152V2 performed well on the NV class, with 0.88 precision and 0.94 recall, but did worse on the AK class, with lower scores (0.71 precision, 0.77 recall, and 0.74 F1-score) compared with DenseNet169. Unlike DenseNet169, ResNet152V2 had difficulty recognizing the subtle patterns of AK lesions.
InceptionV3 and MobileNetV2 both achieved 76% accuracy. Class-wise analysis showed salient differences: InceptionV3 had a slightly lower recall of 0.97 for the NV class compared with DenseNet169 and identified SCC poorly, with an F1-score of 0.69. MobileNetV2 performed well on the NV class, with a precision of 0.91, recall of 0.97, and F1-score of 0.94, but poorly on SCC, with an F1-score of 0.62, indicating less reliable detection of SCC lesions.
The Xception model achieved 83% accuracy, slightly lower than anticipated. On the AK class, it performed well, with a precision of 0.88 and an F1-score of 0.79, reflecting its capability to detect instances within this class. It was less consistent in the classification of SCC, with a precision of 0.83, a recall of 0.77, and an F1-score of 0.80. Although the Xception model performed well in general, it lacked the cross-class performance stability of the DenseNet169 model.
EfficientNetV2S had the lowest overall accuracy at merely 43% and proved to be the poorest performer among the models considered in this study. Its performance also varied widely across classes: it achieved a precision of only 0.37 for the AK class and a very low recall of 0.13 for the NV class. These inconsistencies rendered it unsuitable for the dermoscopic image classification task at hand. NASNetMobile offered only a marginal improvement, with an accuracy of 67%; despite this increment, it did not reach the performance levels essential for the intended evaluation, recording an F1-score of 0.56 for the SCC class and weak metrics on the other classes as well.
In this section, it is critical to explore and analyze the factors that significantly influence model performance, particularly in the classification of different skin lesion types: Nevus (NV), Squamous Cell Carcinoma (SCC), and Actinic Keratosis (AK). The models demonstrated strong performance on the NV class, which corresponds to benign lesions. This outcome is likely due to the relative ease of classifying benign lesions, as they typically exhibit well-defined and less ambiguous visual features. In contrast, distinguishing between AK and SCC proved to be considerably more challenging. These lesion types represent adjacent stages on the precancerous-to-cancerous spectrum and often share overlapping dermoscopic characteristics. The complexity and subtlety of these visual patterns make it difficult for the models to achieve clear separation between the two classes, even when overall classification metrics appear high. Furthermore, early-stage AK lesions can visually resemble benign nevi, which adds another layer of difficulty in achieving accurate classification. This phenotypic similarity introduces additional confusion, especially in borderline or early pathological cases. Consequently, the task of reliably differentiating between benign, precancerous, and malignant lesions remains one of the most challenging aspects of automated skin lesion classification.
The training dataset consisted of 628 instances per class, which, while balanced, can be regarded as small for training DL models. Although class imbalance was not a problem, the relatively small dataset size can affect the model's ability to generalize well, particularly for the more complex and heterogeneous lesions, such as AK and SCC. A substantially larger dataset would likely have yielded better insights and outcomes, since more data allows neural networks to learn from a broader base of diversified features, reducing the risk of overfitting that degrades model performance.
In addition, the dataset included a mix of images, such as images with certain markings and images recorded directly using confocal microscopes. To enhance the reliability of the data being utilized, the images with markings were initially removed from the dataset, and the black outer margins present in the confocal microscope images were meticulously trimmed to remove unwanted background that could interfere with the analysis. Notably, however, this preprocessing step caused an unexpected decline in the performance of the model being developed.
Therefore, after due consideration, it was decided to use the whole dataset in its raw state, without alterations. This allowed the model to learn from the full range of data provided, including both the annotated images and the untrimmed regions, and this complete training approach is expected to improve the model's capacity to generalize in actual clinical practice. The decision resulted in improved overall performance, which substantiates the hypothesis that, in medical image analysis, certain variations present in the data may harbor informative diagnostic hints that prove crucial for precise classification outcomes. This background also underscores the inherent challenges of obtaining high-quality, systematically consistent medical images: variability arising from diverse imaging technologies, image acquisition protocols, and annotation styles frequently makes it difficult to achieve uniformity and consistency in the datasets involved in such studies.
Finally, there was also an early exploration of using a GPT-type language model to generate in-depth medical information on skin lesions. However, it quickly became apparent that these models did not give results that were either accurate or consistent. This inconsistency can most likely be attributed to their general-purpose nature and their training on large internet-based datasets that are not necessarily authoritative or verified for medical content. This limitation motivated the integration of web scraping techniques to obtain accurate and reliable information directly from well-known and reputable medical websites, such as the Cleveland Clinic. Through systematic harvesting and distillation of information from these authoritative sources, the app ensures users receive accurate, evidence-based information on particular skin lesion ailments. This approach significantly enhances the practical applicability of the app in medical settings and addresses the limitations of general language models for delivering specialized medical information. A sketch of this retrieval-and-summarization pipeline is given below.
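The following sketch illustrates such a pipeline using BeautifulSoup and the Hugging Face `facebook/bart-large-cnn` checkpoint; the URL and page-parsing logic are illustrative placeholders, since real pages require structure-specific selectors.

```python
# Sketch of the scrape-then-summarize pipeline described above.
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def fetch_condition_text(url):
    # Pull paragraph text from a condition page; a real scraper would
    # target the symptom/treatment/prevention sections by page structure.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

def summarize(text, max_chars=3000):
    # BART has a limited input window, so long pages are truncated first.
    return summarizer(text[:max_chars], max_length=130, min_length=30,
                      do_sample=False)[0]["summary_text"]

# Hypothetical usage (URL left as a placeholder):
# print(summarize(fetch_condition_text(
#     "https://my.clevelandclinic.org/health/diseases/<condition>")))
```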

4. Error Analysis

To further improve understanding of the performance limitations of the proposed DenseNet169-based model, a thorough error analysis was performed, with emphasis on the misclassification patterns identified among the three lesion categories (NV, AK, and SCC).
The majority of misclassification errors occurred between AK and SCC, which are adjacent to one another on the precancerous-to-malignant spectrum. Confusion matrix analysis revealed that AK lesions misclassified as SCC accounted for a notable proportion of the false positives in the SCC class. SCC lesions misclassified as AK were largely early SCC lesions, which dermoscopically show features almost identical to advanced AK, as is readily apparent in the confusion matrices.
On the other hand, NV exhibited infrequent misclassification and demonstrated perfect recall (1.00), which suggests that the model is good at distinguishing benign lesions from precancerous and malignant ones. This result aligns with the well-defined morphological characteristics of NV and its lower intra-class variability.
The main causes of misclassification are inter-class feature overlap, poorly represented intra-class variability, image artifacts and non-lesion features, and lesions occupying a small fraction of the image. AK and SCC share dermoscopic characteristics such as erythema, hyperkeratosis, atypical vascular structures, and irregular skin textures, and DenseNet169, despite its strong feature propagation and reuse, failed to resolve such weak boundaries. While the dataset was numerically well balanced, it was homogeneous within the AK class, which may have encouraged the model to overfit prominent AK patterns and lose its ability to generalize. Hair occlusions, shadows, annotation markings, and peripheral vignetting interfered with feature extraction, particularly in the earlier layers; in the absence of specialized artifact removal, these features added noise. Lesions that took up small areas in the image reduced the signal-to-noise ratio, compromising feature representation after global pooling. A quantitative summary of the errors is given in Table 4.
To further enhance the diagnostic capabilities of dermoscopic image classification models and overcome difficulties like the confusion between AK and SCC, several methodological enhancements can be explored. One promising avenue is the use of spatial attention mechanisms such as the Convolutional Block Attention Module (CBAM) and Efficient Channel Attention (ECA); these modules allow the model to attend to lesion-relevant regions and suppress background noise, thereby facilitating discriminative feature extraction (a minimal sketch is given after this paragraph). Patient-level metadata such as age, sex, and lesion location, incorporated into multi-modal learning architectures, can also provide denser contextual representations and further improve classification results by aligning visual features with clinically relevant parameters. Preprocessing techniques likewise play an important role in improving model input quality: hair artifact removal, contrast enhancement with Contrast Limited Adaptive Histogram Equalization (CLAHE), and lesion segmentation all serve to highlight important features in dermoscopic images. Training with Hard Example Mining, which focuses learning on confusing or misclassified examples, can also greatly enhance model robustness and generalization. While the model showed very good overall performance, with excellent detection of benign lesions, the recurring confusion between AK and SCC suggests the need for better lesion localization, artifact suppression, and more discriminative feature learning.
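As an illustration of the CBAM idea, a simplified Keras sketch of the channel-then-spatial attention block follows; it is a schematic rendition of the mechanism, not the exact module from the original CBAM paper.

```python
# Schematic CBAM sketch: channel attention followed by spatial attention.
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=8):
    """Apply channel attention, then spatial attention, to feature maps x."""
    ch = x.shape[-1]

    # Channel attention: shared MLP over average- and max-pooled descriptors.
    shared_mlp = tf.keras.Sequential([
        layers.Dense(ch // reduction, activation="relu"),
        layers.Dense(ch),
    ])
    avg_desc = shared_mlp(layers.GlobalAveragePooling2D()(x))
    max_desc = shared_mlp(layers.GlobalMaxPooling2D()(x))
    channel_att = tf.sigmoid(avg_desc + max_desc)[:, None, None, :]
    x = x * channel_att

    # Spatial attention: 7x7 conv over channel-wise mean and max maps.
    avg_map = tf.reduce_mean(x, axis=-1, keepdims=True)
    max_map = tf.reduce_max(x, axis=-1, keepdims=True)
    spatial_att = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        tf.concat([avg_map, max_map], axis=-1))
    return x * spatial_att
```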

5. Conclusions

In conclusion, the DenseNet169-based model achieved 85% accuracy for skin lesion recognition across the NV, AK, and SCC classes. This performance supports the applicability of DL models in the clinical field, especially for the detection of skin cancer at an early stage. The model's ability to differentiate between NV, AK, and SCC is of considerable clinical usefulness, since it can aid dermatologists in making accurate diagnoses of skin disease.
Building on this model, the web-based application created with Gradio represents a significant, real-world step toward bringing artificial intelligence into everyday healthcare workflows. The application makes diagnostic support faster and more accessible for medical practitioners, while its feature visualization capability also benefits patients by improving user experience and efficacy. By letting users directly examine how the model analyzes images through different convolutional layers, the app provides a more nuanced appreciation of its operation and a greater degree of transparency regarding its predictions. The application's information retrieval feature is likewise highly functional: users at any level of medical expertise can learn about various aspects of skin lesions, including their nature, associated symptoms, potential treatment methods, and other relevant attributes, all made readily available within the application. This information is presented concisely and clearly alongside the classification output, so users can more easily understand their skin lesions and the associated conditions. As a result, the application encourages users to have well-informed dialogues with their healthcare providers, raises awareness of the importance of maintaining skin health, and highlights the value of recognizing potential problems so that they can be identified and addressed at an early stage.
The software also generates confidence scores, which are of considerable benefit to dermatologists when assessing the reliability of its predictions. This allows them to concentrate promptly on the most severe cases requiring urgent attention and treatment, while also increasing the overall precision and dependability of their diagnoses. Beyond clinical use, the technology has great potential to benefit marginalized or rural populations where access to specialized dermatological care is severely restricted. Using mobile imaging devices, patients in these groups could receive initial checkups and advice on their skin issues, helping to bridge existing health gaps and enabling earlier intervention for those requiring essential care.
To ensure clinical robustness and generalizability, formal collaborations with academic hospitals and dermatology departments are planned for the external validation of the proposed model. Through these collaborations, diverse, real-world dermoscopic datasets are expected to be acquired, encompassing a range of imaging conditions, lesion types, and Fitzpatrick skin tones. Validation will be conducted in three phases: retrospective benchmarking on hospital-curated datasets, stratified performance evaluation across demographic and phenotypic subgroups, and prospective deployment within clinical workflows to assess real-time diagnostic utility. Model performance will be assessed using ROC-AUC, sensitivity, specificity, and Expected Calibration Error (ECE). The results of these evaluations will be used to guide domain adaptation and model refinement through techniques such as transfer learning and continual learning, with the goal of enhancing robustness and promoting fairness across diverse populations.
In order to include multi-ethnic and multi-tonal datasets in the future, a strategic expansion plan has been devised to improve not only model generalizability but also fairness across various skin phenotypes. Partnerships are being sought with academic hospitals and dermatology clinics to enable the collection of dermoscopic images from underrepresented groups. These datasets will be generated under standard imaging protocols and ethical frameworks, with high-fidelity inputs and detailed metadata (e.g., Fitzpatrick skin type, anatomical site, age, sex). At the same time, mobile health (mHealth) devices equipped with dermoscopy-grade cameras are envisioned to be deployed in low-resource clinical environments, with real-time image capture and geospatial and demographic tagging. Dataset growth will also be enabled via crowdsourcing pipelines, using global dermatology networks and contributions from competitive benchmarks like the ISIC archive, with expert-validated annotations. The web application will also be expanded with a HIPAA-compliant clinician-facing submission portal, with consent management and automated quality control for incoming images. To address domain shift and promote fairness in model performance, domain adaptation techniques, such as adversarial training, instance reweighting, and batch normalization alignment, will be employed during fine-tuning. Minority domain instances (e.g., darker skin tones) will be oversampled or explicitly reweighted during training to address representation imbalance. Performance stratification will be performed across Fitzpatrick types I–VI based on subgroup-specific metrics: AUC, sensitivity, specificity, F1-score, and calibration error. These initiatives will yield a stronger and demographically more diverse diagnostic model that will increase the clinical validity of automatic skin lesion classification worldwide.
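The reweighting step described above could be realized, for example, with inverse-frequency sample weights over Fitzpatrick types, as in the following sketch; the per-sample Fitzpatrick metadata field and the balanced weighting scheme are illustrative assumptions.

```python
# Illustrative inverse-frequency reweighting over Fitzpatrick types, so that
# underrepresented skin tones contribute proportionally more to the loss.
import numpy as np

def fitzpatrick_sample_weights(fitz_types):
    """fitz_types: array-like of Fitzpatrick labels (e.g., 1..6), one per sample."""
    fitz_types = np.asarray(fitz_types)
    types, counts = np.unique(fitz_types, return_counts=True)
    # Balanced weights: N / (num_types * count_of_type); mean weight is 1.
    w = {t: len(fitz_types) / (len(types) * c) for t, c in zip(types, counts)}
    return np.array([w[t] for t in fitz_types])

# In Keras, such weights can be supplied via model.fit(..., sample_weight=weights).
```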

Author Contributions

Conceptualization, S.A.; methodology, S.A.; formal analysis, S.A.; investigation, S.A.; resources, S.A.; data curation, S.A.; writing—original draft preparation, S.A., P.D. and I.B.; writing—review and editing, S.A., P.D. and I.B.; visualization, S.A.; supervision, P.D. and I.B.; project administration, P.D. and I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Calderón, C.; Sanchez, K.; Castillo, S.; Arguello, H. BILSK: A Bilinear Convolutional Neural Network Approach for Skin Lesion Classification. Comput. Methods Programs Biomed. Update 2021, 1, 100036.
2. Mehmood, A.; Gulzar, Y.; Ilyas, Q.M.; Jabbari, A.; Ahmad, M.; Iqbal, S. SBXception: A Shallower and Broader Xception Architecture for Efficient Classification of Skin Lesions. Cancers 2023, 15, 3604.
3. Pathania, R.; Behki, P. Skin Cancer Detection Using Deep Learning. In Proceedings of the IEEE 2024 Sixth International Conference on Computational Intelligence and Communication Technologies (CCICT), Sonepat, India, 19–20 April 2024; pp. 568–575.
4. Singh, R.K.; Gorantla, R.; Allada, S.G.R.; Narra, P. SkiNet: A Deep Learning Framework for Skin Lesion Diagnosis with Uncertainty Estimation and Explainability. PLoS ONE 2022, 17, e0276836.
5. Rahman, Z.; Hossain, S.; Islam, R.; Hasan, M.; Hridhee, R.A. An Approach for Multiclass Skin Lesion Classification Based on Ensemble Learning. Inform. Med. Unlocked 2021, 25, 100659.
6. Mporas, I.; Perikos, I.; Paraskevas, M. Color Models for Skin Lesion Classification from Dermatoscopic Images. In Advances in Integrations of Intelligent Methods; Hatzilygeroudis, I., Perikos, I., Grivokostopoulou, F., Eds.; Smart Innovation, Systems and Technologies; Springer: Singapore, 2020; Volume 170, pp. 85–98. ISBN 9789811519178.
7. Hoang, L.; Lee, S.-H.; Lee, E.-J.; Kwon, K.-R. Multiclass Skin Lesion Classification Using a Novel Lightweight Deep Learning Framework for Smart Healthcare. Appl. Sci. 2022, 12, 2677.
8. Saba, T.; Khan, M.A.; Rehman, A.; Marie-Sainte, S.L. Region Extraction and Classification of Skin Cancer: A Heterogeneous Framework of Deep CNN Features Fusion and Reduction. J. Med. Syst. 2019, 43, 289.
9. Chaturvedi, S.S.; Tembhurne, J.V.; Diwan, T. A Multi-Class Skin Cancer Classification Using Deep Convolutional Neural Networks. Multimed. Tools Appl. 2020, 79, 28477–28498.
10. Maqsood, S.; Damaševičius, R. Multiclass Skin Lesion Localization and Classification Using Deep Learning Based Features Fusion and Selection Framework for Smart Healthcare. Neural Netw. 2023, 160, 238–258.
11. Salma, W.; Eltrass, A.S. Automated Deep Learning Approach for Classification of Malignant Melanoma and Benign Skin Lesions. Multimed. Tools Appl. 2022, 81, 32643–32660.
12. Aksoy, S. Multi-Input Melanoma Classification Using MobileNet-V3-Large Architecture. J. Autom. Mob. Robot. Intell. Syst. 2025, 19, 1–12.
13. Ozdemir, B.; Pacal, I. A Robust Deep Learning Framework for Multiclass Skin Cancer Classification. Sci. Rep. 2025, 15, 4938.
14. Shakya, M.; Patel, R.; Joshi, S. A Comprehensive Analysis of Deep Learning and Transfer Learning Techniques for Skin Cancer Classification. Sci. Rep. 2025, 15, 4633.
15. Chatterjee, S.; Dey, D.; Munshi, S.; Gorai, S. Extraction of Features from Cross Correlation in Space and Frequency Domains for Classification of Skin Lesions. Biomed. Signal Process. Control 2019, 53, 101581.
16. Shetty, B.; Fernandes, R.; Rodrigues, A.P.; Chengoden, R.; Bhattacharya, S.; Lakshmanna, K. Skin Lesion Classification of Dermoscopic Images Using Machine Learning and Convolutional Neural Network. Sci. Rep. 2022, 12, 18134.
17. Fraiwan, M.; Faouri, E. On the Automatic Detection and Classification of Skin Cancer Using Deep Transfer Learning. Sensors 2022, 22, 4963.
18. Jain, S.; Singhania, U.; Tripathy, B.; Nasr, E.A.; Aboudaif, M.K.; Kamrani, A.K. Deep Learning-Based Transfer Learning for Classification of Skin Cancer. Sensors 2021, 21, 8142.
19. Huang, H.; Hsu, B.W.; Lee, C.; Tseng, V.S. Development of a Light-Weight Deep Learning Model for Cloud Applications and Remote Diagnosis of Skin Cancers. J. Dermatol. 2021, 48, 310–316.
20. Tajerian, A.; Kazemian, M.; Tajerian, M.; Akhavan Malayeri, A. Design and Validation of a New Machine-Learning-Based Diagnostic Tool for the Differentiation of Dermatoscopic Skin Cancer Images. PLoS ONE 2023, 18, e0284437.
21. Ali, K.; Shaikh, Z.A.; Khan, A.A.; Laghari, A.A. Multiclass Skin Cancer Classification Using EfficientNets—A First Step Towards Preventing Skin Cancer. Neurosci. Inform. 2022, 2, 100034.
22. Srinivasu, P.N.; SivaSai, J.G.; Ijaz, M.F.; Bhoi, A.K.; Kim, W.; Kang, J.J. Classification of Skin Disease Using Deep Learning Neural Networks with MobileNet V2 and LSTM. Sensors 2021, 21, 2852.
23. Salamaa, W.M.; Aly, M.H. Deep Learning Design for Benign and Malignant Classification of Skin Lesions: A New Approach. Multimed. Tools Appl. 2021, 80, 26795–26811.
24. Himel, G.M.S.; Islam, M.; Al-Aff, K.A.; Karim, S.I.; Sikder, K.U. Skin Cancer Segmentation and Classification Using Vision Transformer for Automatic Analysis in Dermatoscopy-Based Noninvasive Digital System. Int. J. Biomed. Imaging 2024, 2024, 3022192.
25. Mehr, R.A.; Ameri, A. Skin Cancer Detection Based on Deep Learning. J. Biomed. Phys. Eng. 2022, 12.
26. Chen, J.; Jiang, Q.; Ai, Z.; Wei, Q.; Xu, S.; Hao, B.; Lu, Y.; Huang, X.; Chen, L. Pigmented Skin Disease Classification via Deep Learning with an Attention Mechanism. Appl. Soft Comput. 2025, 170, 112571.
27. Hanum, S.A.; Dey, A.; Kabir, M.A. An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types. arXiv 2025, arXiv:2501.05991.
28. Georgiadis, P.; Gkouvrikos, E.V.; Vrochidou, E.; Kalampokas, T.; Papakostas, G.A. Building Better Deep Learning Models Through Dataset Fusion: A Case Study in Skin Cancer Classification with Hyperdatasets. Diagnostics 2025, 15, 352.
29. Ramamurthy, K.; Thayumanaswamy, I.; Radhakrishnan, M.; Won, D.; Lingaswamy, S. Integration of Localized, Contextual, and Hierarchical Features in Deep Learning for Improved Skin Lesion Classification. Diagnostics 2024, 14, 1338.
30. Aksoy, S.; Demircioglu, P.; Bogrekci, I. Enhancing Melanoma Diagnosis with Advanced Deep Learning Models Focusing on Vision Transformer, Swin Transformer, and ConvNeXt. Dermatopathology 2024, 11, 239–252.
31. Ghosh, M.; Maiti, A.K.; Sarkar, P.; Jana, A.; Roy, A. Skin Cancer Detection Using a Deep-Learning Based Framework. In Real-World Applications and Implementations of IoT; Acharyya, A., Dey, P., Biswas, S., Eds.; Studies in Smart Technologies; Springer Nature: Singapore, 2025; pp. 107–119. ISBN 978-981-9786-26-8.
32. World Health Organization. Skin Cancer: Key Facts. 2020. Available online: https://www.who.int/news-room/q-a-detail/skin-cancer (accessed on 11 January 2024).
33. Ma, X.; Shan, J.; Ning, F.; Li, W.; Li, H. EFFNet: A Skin Cancer Classification Model Based on Feature Fusion and Random Forests. PLoS ONE 2023, 18, e0293266.
34. Manole, I.; Butacu, A.-I.; Bejan, R.N.; Tiplica, G.-S. Enhancing Dermatological Diagnostics with EfficientNet: A Deep Learning Approach. Bioengineering 2024, 11, 810.
35. Skin Cancer Information. Available online: https://www.skincancer.org/skin-cancer-information/skin-cancer-facts/ (accessed on 11 January 2024).
36. Leiter, U.; Keim, U.; Garbe, C. Epidemiology of Skin Cancer: Update 2019. In Sunlight, Vitamin D and Skin Cancer; Reichrath, J., Ed.; Advances in Experimental Medicine and Biology; Springer International Publishing: Cham, Switzerland, 2020; Volume 1268, pp. 123–139. ISBN 978-3-030-46226-0.
37. Cancer Facts & Statistics. Available online: https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts-figures/cancer-facts-figures-2022.html (accessed on 2 January 2025).
Figure 1. Overview of different skin lesion classes with visual examples.
Figure 2. Examples of augmented skin lesion images for enhanced model training.
Figure 3. Deep learning architecture for skin lesion classification.
Figure 4. Overview of data preprocessing, training, and evaluation pipeline.
Table 1. Distribution of Training, Validation, and Test Sets for Skin Lesion Classification.

Classes | Train | Validation | Test | Total
NV      | 565   | 32         | 31   | 628
SCC     | 565   | 32         | 31   | 628
AK      | 565   | 32         | 31   | 628
Total   | 1695  | 96         | 93   | 1887
Table 2. Web application for automated skin lesion classification.

Classes | Web App Predictions
AK      | [web app prediction screenshot]
NV      | [web app prediction screenshot]
SCC     | [web app prediction screenshot]
Table 3. Performance evaluation results of the skin lesion classification models. For each model (DenseNet169, Xception, ResNet152V2, InceptionV3, EfficientNetV2S, MobileNetV2, and NASNetMobile), the table presents the training and validation (T&V) loss curve, the T&V accuracy curve, and the confusion matrix plot. [Plots omitted.]
Table 4. The quantitative summary of errors.

Misclassification Direction | Frequency (%) | Notable Observations
AK → SCC                    | ~18%          | Severe AK resembling early SCC
SCC → AK                    | ~15%          | Early SCC lacking aggressive traits
NV → AK or SCC              | <3%           | Pigmentation artifacts or poor contrast
