1. Introduction
The skin, the largest organ in the human body, performs several functions, including regulating body temperature, sensing environmental cues, and protecting internal organs. Skin diseases comprise a broad spectrum of disorders that affect the skin and its appendages, encompassing both benign and malignant skin lesions. In particular, two commonly observed skin lesions are malignant melanoma (MM) and disfiguring dermatosis (DD).
DD, such as dermatitis, melasma, and acne, can lead to skin inflammation, pigmentation alterations, and texture changes, significantly impacting an individual's physical appearance and quality of life [1,2,3,4]. MM, a highly aggressive skin cancer, is prone to metastasis and recurrence, often spreading to vital organs such as the lungs, liver, and bones, with fatal consequences [5,6]. Early and accurate diagnosis is crucial for the effective management and treatment of these conditions. VISIA multi-modal imaging technology has emerged as a valuable tool for identifying, diagnosing, and evaluating the severity of DD. This cutting-edge system integrates various imaging modalities, including UV, visible, and near-infrared light, to capture high-quality skin images [7]. This technology helps dermatologists analyze skin conditions, barrier function, and pigmentation. MM can be detected early using dermoscopic and clinical imaging, but integrating and interpreting multi-modal data remains challenging [8]. DD and MM often exhibit complex skin features such as UV spots, brown spots, red marks, porphyrins, uneven coloration, and irregular borders. Traditional methods may lack accuracy, and physician subjectivity can influence diagnosis. Combining advanced imaging with deep learning allows for more precise and thorough skin analysis [9]. Moreover, this combination assists clinicians in recognizing subtle variations in skin characteristics, which are crucial for the identification and diagnosis of dermatological conditions [10]. Deep learning enables multi-modal imaging systems to analyze extensive data, offering precise insights for personalized treatment plans and clinical follow-up for skin lesions.
Previous approaches to skin lesion image recognition relied on conventional image processing and computer vision techniques, focusing on manually extracted features such as color, texture, and shape obtained with methods like the Hough transform and edge detection [11,12,13]. These features were then classified using classifiers such as support vector machines and k-nearest neighbors [14]. Some studies used geometric models based on morphological operations for lesions with distinct features [12]. Nonetheless, these methods were limited by their dependence on hand-crafted features and classifiers, making them prone to noise and data imbalance. The advent of deep learning techniques, particularly the rise of convolutional neural networks (CNNs), has markedly enhanced the accuracy and performance of skin disease image recognition [15,16]. CNNs can extract features from clinical and dermoscopic images, enabling the high-accuracy classification of conditions such as acne, eczema, psoriasis, and skin tumors [16,17]. Pre-trained models such as VGG and ResNet serve as feature extractors, reducing training time and data requirements while improving accuracy [18]. Transfer learning further enhances model performance by leveraging knowledge from other medical imaging tasks, addressing data scarcity and improving generalization. Generative adversarial networks (GANs) have become a useful tool for data augmentation in skin disease recognition [19]. The attention mechanism, inspired by human visual processes, improves recognition accuracy by enabling models to focus on the most relevant regions within skin images, such as lesion areas, while suppressing irrelevant background noise [20,21,22]. This focused attention on lesion regions improves the model's ability to distinguish skin disorders such as acne, eczema, psoriasis, and skin tumors, enhancing classification accuracy. In multi-modal recognition, deep learning and vision transformers integrate skin images with patient metadata, further boosting diagnostic precision [23,24].
However, current approaches to the early detection of skin lesions predominantly rely on the analysis of individual images [16,24], which constrains the holistic examination of skin conditions. Our research addresses this limitation by employing multi-modal imaging. Moreover, the prevailing trend in multi-modal methods centers on the concatenation and fusion of features from distinct diagnostic modalities, often overlooking the intricate interplay between these modalities and the critically important exchange of multi-scale feature information [9,20,24]. To address this challenge, our study introduces a deep multi-modal information-switching network tailored for the end-to-end recognition of skin lesions. It harnesses deep convolutional neural networks to extract nuanced, lesion-specific features from various skin image modalities. The network further augments the exchange of information between features of different modalities across multiple scales, improving both interpretability and recognition precision in multi-modal skin imaging for skin lesions.
Figure 1 illustrates the distinction between our proposed method and traditional multi-modal classification models. Unlike conventional approaches that directly fuse the features extracted by the neural network, our method employs an information-switching module to exchange information between feature maps of different modalities at various scales. This enables the recognition model to better consider the relationships between modalities.
In summary, our study aims to leverage multi-modal images to enable a more comprehensive analysis of skin conditions and enhance the accuracy of identifying skin lesions. For this purpose, we utilize a deep learning analysis approach that integrates multi-modal influences to automatically identify and classify various skin lesions. This article proposes a deep multi-modal information-switching network, known as MDSIS-Net. MDSIS-Net introduces an innovative approach that uses transfer learning within a multi-scale fully shared convolutional neural network to extract skin lesion features across different scales and modalities. Within MDSIS-Net, a novel multi-scale information-switching structure is developed to generate learnable parameters for facilitating the exchange of deep and shallow features between different modalities. The primary contributions of this work are as follows:
1. This research proposes MDSIS-Net, an end-to-end deep multi-modal information-switching network for skin lesion recognition, utilizing deep convolutional neural networks to extract intra-modality features and enhance multi-scale inter-modality interactions, improving interpretability and accuracy for DD and MM diagnosis.
2. We propose a novel multi-scale information-switching structure within a multi-modal skin lesion recognition framework, which generates learnable parameters to facilitate cross-modal deep and shallow feature exchange with adaptive layer-wise weighting. Our MDSIS-Net improves the inter-modality association of deep and shallow features within each modality, optimizing the distribution of distinct features related to different diseases across diverse imaging modalities.
3. The proposed model is trained and validated on real-world clinical data for DD and the public Derm7pt dataset for MM, demonstrating superior performance over state-of-the-art methods. Interpretability analysis generates modality-specific heatmaps, highlighting distinctive features of DD and MM to enhance model transparency and reliability. These visualizations provide dermatologists with clear insights into lesion characteristics, facilitating more informed and effective treatment decisions.
2. Review of the Previous Research and Literature
Traditional machine learning algorithms and morphological analysis were the main techniques used in the early stages of skin lesion identification. To help dermatologists diagnose skin conditions, Moldovanu et al. [11] offered a skin lesion classification approach that integrates surface fractal dimensionality with color-clustering statistics using two classifiers, k-nearest neighbors and neural networks. Chatterjee et al. [12] suggested classifying skin lesion types using fractal-based feature extraction, morphological preprocessing, and recursive feature elimination to facilitate the more precise diagnosis and treatment of skin lesions. Ranjan et al. [13] introduced a machine learning model to measure and score the severity of radiation dermatitis while also analyzing its mathematical characteristics, including the shape, size, and color of erythema. As computer vision technology continues to advance, an increasing number of deep learning methods are being applied to medical disease recognition. Pan et al. [25] proposed an Ensemble-3DCNN model to analyze brain magnetic resonance imaging (MRI) and confirm the regularity of the pathological progression of Alzheimer's disease, providing robust support for early diagnosis, disease monitoring, and the development of treatment strategies. Javadi Moghaddam et al. [26] introduced an adapted DenseNet-121 framework for the precise identification of COVID-19 from X-ray images, offering a novel technical tool for clinical diagnosis. Noorbakhsh et al. [27] employed CNNs for cross-category analysis to uncover conserved spatial behavior in tumor histology images, processing and analyzing various types of tumor histology images with deep learning to reveal their similarities and differences.
Many deep learning models have been developed and shown to be effective for the early identification of disease. Thieme et al. [28] introduced a deep convolutional neural network (MPXV-CNN) for identifying characteristic skin lesions caused by the monkeypox virus, achieving a sensitivity of 89% in a prospective cohort. Anand et al. [16] introduced a fusion model that integrates the U-Net and CNN models to accurately identify skin lesions; simulated and analyzed on the HAM10000 dataset of 10,015 dermoscopic images, it achieved an accuracy of 97.96%. Gomathi et al. [29] introduced a novel dual optimization approach based on a deep learning network for skin cancer detection, achieving an accuracy of 98.76% on the HAM10000 dataset. Because training datasets for disease recognition are often inadequate in real-world application scenarios, transfer learning is frequently used to address this issue. Mahbod et al. [30] proposed and evaluated a multi-scale multi-CNN fusion approach for skin lesion classification based on pre-trained CNNs and transfer learning, achieving an accuracy of 86.2% on the ISIC 2018 dataset. Karri et al. [31] studied a two-phase cross-domain transfer learning approach involving both model-level and data-level transfer; fine-tuning on the MoleMap and ImageNet datasets, they achieved a Dice Similarity Coefficient (DSC) of 94.63% and 99.12% accuracy on the HAM10000 dataset, demonstrating the effectiveness of transferring knowledge across domains. Image augmentation techniques have also received attention as a means of increasing dataset diversity and thereby enhancing recognition precision. Eduardo et al. [32] introduced a robust progressive-growing adversarial network based on residual learning to facilitate the training of deep networks; the architecture can generate realistic synthetic 512 × 512 skin images even when small dermoscopic and non-dermoscopic skin image datasets are used as the problem domain. With the rapid evolution of vision transformers, their use in medical image recognition has increased notably in recent years, representing a significant step forward in applying advanced transformer models to the analysis and understanding of medical imaging data. Xin et al. [21] presented a skin cancer identification method based on a multi-scale vision transformer, further optimized with contrastive learning, which achieved an impressive 94.3% accuracy on the HAM10000 dataset. He et al. [33] introduced a Fully Transformer Network (FTN) to extract long-range contextual information for skin lesion analysis and validated its effectiveness and efficiency on the ISIC 2018 dataset. Zhang et al. [24] developed a series of dual-branch hierarchical multi-modal transformer (HMT) blocks to systematically integrate data from multiple imaging modalities, achieving a diagnostic accuracy of 80.03% and an average accuracy of 77.99% on the Derm7pt dataset. In the realm of multi-modal skin disease identification, numerous studies have employed deep learning and vision transformer methods to fuse characteristics derived from skin images and patient metadata. Omeroglu et al. [23] proposed a multi-modal deep learning framework for skin lesion classification that integrates features of clinical images, dermoscopic images, and patient metadata; the framework employs three branches to extract features in a hybrid manner, achieving an average accuracy of 83.04% on the seven-point criteria evaluation dataset, a 2.14% improvement over existing methods. He et al. [20] designed a cross-attention (CA) module that enables collaboration between the dermoscopic and clinical image modalities through a cross-modal attention mechanism, enhancing feature representation; on the seven-point criteria evaluation dataset, it achieved an average accuracy of 76.8%, superior to state-of-the-art methods.
This paper seeks to improve the recognition of multi-modal skin images by examining the effects of diverse deep learning networks. This study also delves into the use of transfer learning methods to enhance feature extraction from images with varying modalities. Furthermore, to expand the training dataset, a variety of data augmentation techniques are employed. This comprehensive approach aims to advance the comprehension and utilization of deep learning in multi-modal skin image analysis.
3. Methods
Figure 2 illustrates the holistic architecture of the proposed MDSIS-Net (Figure 2a). The network comprises several key modules: the image preprocessing module (Figure 2a), the multi-modal intra-feature extraction module (Figure 2a), the multi-modal information-switching module (ISM) (Figure 2b), and the feature aggregation module (Figure 2a). VISIA imaging is used to collect five modal images of patients with DD as input images; these modalities are spots, red marks, UV spots, porphyrins, and brown spots. MM images consist of two modalities: clinical images and dermoscopic images. The image preprocessing module refines the input data through normalization and augmentation to guarantee optimal data quality and compatibility with the network. The multi-modal intra-feature extraction module extracts meaningful features from the different modalities in the input data; by leveraging multi-modal information, it captures diverse and complementary cues, enhancing the network's discriminative capabilities. To achieve this, we utilize the EfficientNetV2 block as the fundamental network module for multi-modal intra-feature extraction [34]. It consists of eight blocks, each producing feature maps at a different scale, and employs shared parameters for handling the multi-modal inputs. To enhance integration and information exchange across modalities, a multi-modal ISM is integrated. This module strategically selects and merges the most informative features from each modality, enabling effective information fusion and maximizing the synergistic advantages of multiple modalities; it also serves as an interactive platform through which features of different modalities collaborate and exchange information. The ISM facilitates information exchange among the eight scales of feature maps extracted by EfficientNetV2, ensuring seamless communication and interaction between feature maps at various scales and promoting effective information fusion and integration. Finally, the feature aggregation module consolidates the extracted features from the different modalities into a unified representation. This comprehensive architecture enables the network to tackle the complexities of the given task and enhances its predictive capabilities. The model is validated on the DD and MM datasets.
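To make the data flow concrete, the following PyTorch-style sketch illustrates how shared backbone stages and per-stage information switching could interleave in a forward pass. The class and argument names (e.g., MDSISNetSketch, switch_modules) are illustrative placeholders under our own assumptions, not the released implementation.
```python
import torch
import torch.nn as nn

class MDSISNetSketch(nn.Module):
    """Illustrative skeleton: shared backbone stages with per-stage information switching."""
    def __init__(self, stages, switch_modules, head):
        super().__init__()
        self.stages = nn.ModuleList(stages)          # eight EfficientNetV2-style stages, weights shared across modalities
        self.switch = nn.ModuleList(switch_modules)  # one information-switching module per stage
        self.head = head                             # feature aggregation and classification head

    def forward(self, modalities):                   # modalities: list of (B, 3, H, W) tensors, one per modality
        feats = list(modalities)
        for stage, ism in zip(self.stages, self.switch):
            feats = [stage(f) for f in feats]        # intra-modality feature extraction with shared parameters
            feats = ism(feats)                       # inter-modality information exchange at this scale
        return self.head(feats)                      # aggregate the exchanged features and classify
```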
Algorithm 1 presents an illustration of the workflow for our proposed MDSIS-Net model.
Algorithm 1 The pipeline for our proposed MDSIS-Net model
Input: Multi-modal image datasets of MM and DD
Output: Predicted classes and Grad-CAM
Splitting the dataset: Every dataset is separated into training, validation, and testing sets.
1. Training phase:
  Hyperparameters:
    Image size: (384, 384, 3)
    Number of input modalities: 5 and 2
    Size of training batch: 32
    Initial learning rate: 0.05
    Decay type: CosineAnnealingLR
    Optimizer selection: SGD
  Training runtime:
    for epoch in range(begin_epoch, end_epoch):
      Multi-modal image normalization
      Multi-modal image augmentation
      Multi-modal intra-feature extraction
      Multi-modal information switching
      Feature aggregation and classification
      Predict the category and obtain Grad-CAM
      Calculate the weighted cross-entropy loss
      Gradient update and backpropagation
2. Testing
  Read the image pixels
  Data normalization
  Feed the best training model
  Obtain the predicted mask and feature map
  Compute the metrics: mAP, accuracy, precision, recall, and f1-score
3. Inference runtime
  Obtain an image
  Data normalization
  Feed the best model
  Obtain the predicted class and feature maps
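A minimal PyTorch sketch of the training runtime in Algorithm 1 is given below. It assumes a model, a multi-modal train_loader yielding a list of modality tensors plus labels, and the hyperparameters listed above; it is an illustration of the pipeline rather than the exact implementation.
```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, class_weights, epochs, device="cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))  # weighted cross-entropy loss
    optimizer = SGD(model.parameters(), lr=0.05, momentum=0.9)        # initial learning rate 0.05
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)            # cosine annealing decay

    for epoch in range(epochs):
        model.train()
        for modalities, labels in train_loader:   # normalization/augmentation are applied inside the loader
            modalities = [m.to(device) for m in modalities]
            labels = labels.to(device)
            logits = model(modalities)            # intra-feature extraction + information switching + aggregation
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                       # backpropagation
            optimizer.step()                      # gradient update
        scheduler.step()
```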
3.1. Multi-Modal Skin Lesion Dataset
Two datasets are used to validate the effectiveness of our proposed MDSIS-Net model, one derived from the publicly available dataset Derm7pt for multi-modal MM recognition, and the other derived from clinical VISIA images for DD identification. A comprehensive dataset has been collected, which includes 2005 samples totaling 10,025 VISIA images with five distinct modalities: spots, red marks, UV spots, porphyrins, and brown spots. This study was approved by the Ethics Committee of the First Affiliated Hospital of Ningbo University on 20 December 2023 (approval No. 2023R-178RS). All participants were informed about the right to withdraw from the study at any time, following the ethical standards of the Declaration of Helsinki, revised in 2013. The spot modality utilizes standard white light imaging to capture visible skin surface pigmentation. Red marks can reflect the condition of capillaries. UV spots are captured using 365 nm ultraviolet light, reflecting potential pigmentation beneath the epidermis, which correlates positively with skin photoaging. Porphyrins exhibit fluorescence, particularly in the T-zone, and are metabolic byproducts of bacteria residing in the follicular openings. Brown spots represent deeper, more latent pigmentation than UV spots. These images have been categorized into dermatitis, melasma, and acne.
Figure 3 demonstrates a representative example of the VISIA multi-modal dataset. The clinical labels for the VISIA dataset are provided by experienced dermatologists, who assign diagnostic labels based on the lesion responses observed across different modalities. The Derm7pt dataset is used for the study of the diagnosis and classification of MM and non-melanoma skin cancer (NMSC). It consists of both clinical and dermoscopic images. Clinical images capture the macroscopic lesion characteristics of MM patients, while dermoscopic images provide a magnified, microscopic view of the lesion features. This dataset contains 1011 samples, totaling 2022 two-modal images, with 252 cases of MM and 759 cases of NMSC. The images in this dataset are annotated according to the seven-point checklist criteria.
Figure 4 demonstrates a representative example of the Derm7pt multi-modal dataset. Both datasets have been randomly split into training, validation, and test sets, with the respective allocations being 70%, 15%, and 15%, to facilitate model development and evaluation. Subsequently, the training set will be employed for deep learning model training, the validation set for fine-tuning hyperparameters, and the test set for rigorously validating the model’s accuracy and generalization.
Table 1 lists the detailed data distribution and purpose for the DD and MM datasets.
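The 70%/15%/15% split described above can be reproduced along the lines of the following sketch; the two-step stratified split and the list-of-samples format are assumptions made for illustration.
```python
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=42):
    """Split samples into 70% train, 15% validation, 15% test (stratified by class here for illustration)."""
    train_x, rest_x, train_y, rest_y = train_test_split(
        samples, labels, test_size=0.30, stratify=labels, random_state=seed)
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```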
3.2. Multi-Modal Image Preprocessing
In multi-modal image recognition, augmentation and normalization play crucial roles. Normalization is essential for standardizing the image data from various modalities to a specific range of values, ensuring consistency in the data representation. Additionally, augmentation is valuable for increasing the diversity of the training data, thereby enhancing the model’s ability to generalize to a wide range of input variations.
3.2.1. Multi-Modal Image Normalization
The dermatological dataset studied in this paper consists of multiple modalities, each with distinct pixel value ranges. Therefore, normalizing each modality is crucial to standardize the data and ensure fair comparison across modalities. This normalization is essential to facilitate subsequent feature extraction and to guarantee that the significance of each modality is appropriately considered, rather than favoring the modality with larger pixel values. It allows for a more accurate and comprehensive analysis of the dataset, leading to better insights and potential improvements in skin disease diagnosis and treatment. Numerous deep learning-based normalization methods are utilized for image recognition, each offering distinct benefits and applications; robust scaling, Z-score normalization, image-specific normalization, and min–max scaling are a few popular examples. Z-score normalization is particularly beneficial in standardizing the scales of features, reducing the effects of varying ranges across features and improving the model's capacity for learning and precise prediction. It also helps mitigate the impact of outliers and extreme values, making it a robust normalization technique suitable for a wide range of image recognition tasks. This technique subtracts each channel's mean from the raw pixel values and divides by the channel's standard deviation, bringing every modality to zero mean and unit variance and minimizing the impact of varying pixel values across different skin conditions. In this study, we utilize the following formulation for the Z-score normalization of each modality.
$x_i$ represents the raw values of each modality, comprising the red, green, and blue channels, where $i$ is the index over all modalities. $\mu_i$ represents the mean of the raw data and $\sigma_i$ indicates the standard deviation of the input data, both computed per channel. $\hat{x}_i$ represents the values after normalization. The formula representing $\hat{x}_i$ is as follows:
$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}.$$
Here are the formulas for $\mu_i$ and $\sigma_i$:
$$\mu_i = \frac{1}{N_i}\sum_{k=1}^{N_i} x_{i,k}, \qquad \sigma_i = \sqrt{\frac{1}{N_i}\sum_{k=1}^{N_i}\left(x_{i,k}-\mu_i\right)^2},$$
where $N_i$ is the number of pixel values in modality $i$.
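A per-modality Z-score normalization consistent with the formulas above could be implemented as sketched below; the channel-wise statistics are assumed to be precomputed from the training images of each modality.
```python
import numpy as np

def zscore_normalize(image, mean, std, eps=1e-8):
    """Normalize one modality image (H, W, 3) using per-channel mean/std from its training set."""
    image = image.astype(np.float32)
    return (image - mean) / (std + eps)   # zero mean, unit variance per channel

def normalize_modalities(images, stats):
    """images: dict modality -> array; stats: dict modality -> (mean, std), each an array of shape (3,)."""
    return {m: zscore_normalize(img, *stats[m]) for m, img in images.items()}
```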
3.2.2. Multi-Modal Image Augmentation
Image augmentation is crucial for deep learning training and for the recognition of skin lesions in multi-modal images. Skin lesion features may be affected by factors such as lighting, color, and contrast, so augmentation is needed to clarify and highlight these features. By increasing contrast, reducing noise, and adjusting color balance, image augmentation improves the clarity and distinctiveness of skin disease features, helping deep learning algorithms to recognize and understand them. Moreover, augmentation promotes consistency across images, facilitating comparison and matching between images of different modalities and thereby improving the generalization ability and recognition accuracy of deep learning algorithms. It also mitigates class imbalance and enhances data diversity.
Table 2 enumerates the image augmentation techniques applied in this paper and offers a thorough explanation of the associated parameters.
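As an illustration of this kind of augmentation pipeline, the sketch below uses torchvision transforms; the specific operations and parameter values shown are placeholders, since the settings actually used are those enumerated in Table 2.
```python
from torchvision import transforms

# Illustrative augmentation pipeline (operations and parameters are placeholders, not the Table 2 values)
train_transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```
In a multi-modal setting, the same random spatial transform would typically be applied to all modalities of a sample so that lesion regions remain aligned across modalities.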
3.3. Multi-Modal Intra-Feature Extraction
This study utilizes EfficientNetV2, an improved feature extraction network based on EfficientNet, as the backbone for multi-modal feature extraction. By adjusting parameters such as depth, width, and input image resolution, EfficientNetV2 enhances network performance. It integrates neural architecture search with a compound model-scaling method to construct the classification network: by selecting optimal compound coefficients, the network's depth, width, and input image resolution are proportionally expanded across the three dimensions, and the search for optimal coefficients aims to maximize recognition accuracy. Dynamically balancing these three dimensions effectively reduces the number of parameters and the complexity of model training, leading to a significant improvement in model performance; this approach yields better results than single-dimension scaling and also contributes to faster training. The network primarily consists of a 3 × 3 convolution, stacked Fused-MBConv and MBConv blocks, and a 1 × 1 convolution module. The network structure is detailed in Figure S19 in the Supplementary Material; it consists of eight stages that generate feature maps at distinct scales. These eight multi-scale feature maps are passed to the subsequent ISM to extract intricate relationships between the different modalities. In this research, transfer learning is employed to initialize the network with parameters pre-trained on the ImageNet-1K dataset, and shared parameters are used to handle the multi-modal inputs, resulting in a coherent and compact representation.
Table 3 presents the operation type, kernel size, stride size, expand ratio, output channel size, and layer size for each block. The expand ratio represents the amplification factor for the middle channel in each module.
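The sketch below shows how an ImageNet-1K pre-trained EfficientNetV2 backbone with weight sharing across modalities could be set up in PyTorch. It uses torchvision's efficientnet_v2_s purely as an example; its stage boundaries and channel sizes may differ from the eight blocks specified in Table 3.
```python
import torch
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

# Transfer learning: initialize the backbone with ImageNet-1K pre-trained parameters
weights = EfficientNet_V2_S_Weights.IMAGENET1K_V1
backbone = efficientnet_v2_s(weights=weights).features   # sequential stages: stem, MBConv/Fused-MBConv blocks, 1x1 conv

# Weight sharing across modalities: the same module instance processes every modality
spots = torch.randn(2, 3, 384, 384)       # dummy batch for one modality
uv_spots = torch.randn(2, 3, 384, 384)    # dummy batch for another modality
shared_features = [backbone(x) for x in (spots, uv_spots)]   # identical parameters applied to both modalities
```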
3.4. Multi-Modal Information-Switching Module
When conducting deep learning-based skin lesion recognition, the incorporation of different modality images provides a more comprehensive and diverse set of information, thereby aiding in the precise identification of various types of skin diseases. The five VISIA modality images consist of spots, red marks, UV spots, porphyrins, and brown spots. Spot images depict small spots on the skin, red mark images illustrate areas with red patches, UV spot images capture spots induced by UV rays, porphyrin images highlight the presence of porphyrins in the skin, and brown spot images showcase brown patchy areas on the skin. In the case of MM, dermoscopic images allow for the visualization of its microscopic lesions, such as those in the vascular region, and clinical images allow for the observation of the features of MM at the macroscopic level, such as contours and overall color. The two modalities complement each other. By harnessing the power of multi-modal information exchange, we effectively communicate and integrate diverse features to accurately classify and differentiate among different skin diseases.
Figure S21 in the Supplementary Material demonstrates the specific framework of the ISM. Each modality passes through the feature extraction network to obtain eight stages of feature maps, and the feature maps at each stage pass through the ISM for information exchange, where $i$ denotes the index of the modalities, $n$ indicates the total number of modalities, and $s$ represents the stage of feature extraction.
Given an input image $X$ consisting of $n$ modalities $\{X_1, X_2, \dots, X_n\}$, we begin by feeding each modality into the basic network block $B_s$ to generate the multi-modal features $F_i^s$, where $s$ represents the stage index of feature extraction. The formula for multi-modal feature extraction is expressed as follows:
$$F_i^s = B_s\!\left(F_i^{s-1}\right), \quad F_i^0 = X_i.$$
Next, the feature maps $F_i^s$ are fed into the ISM for feature exchange. The ISM employs Multi-head Self-Attention (MSA) across the features of the different modalities: for each spatial location, the features from the different modalities are treated as a sequence that requires information exchange. Specifically, the feature maps $F_1^s, \dots, F_n^s$ are stacked together and transformed into a token sequence $Z^s$ using Permute and Reshape operations to meet the input format required by MSA. The ISM initially employs three separate fully connected (FC) layers to map this sequence to the queries (Q), keys (K), and values (V) of the $j$-th attention head as follows:
$$Q_j = Z^s W_j^{Q}, \quad K_j = Z^s W_j^{K}, \quad V_j = Z^s W_j^{V}.$$
Subsequently, we compute the attention for each head over the modalities as follows:
$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}}\right) V_j,$$
where $d_k$ represents the dimension of each head. Through the concatenation of the attention outputs from the $m$ heads and the utilization of an FC layer, the output of the MSA is obtained, which has the same dimension as the input sequence. Permute and Reshape operations, followed by a Split operation, are then applied to recover the information-exchanged features $\tilde{F}_i^s$ of the individual modalities:
$$\tilde{F}_1^s, \dots, \tilde{F}_n^s = \mathrm{Split}\!\left(\mathrm{Reshape}\!\left(\mathrm{FC}\!\left(\mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_m\right)\right)\right)\right).$$
Finally, the input multi-modal feature maps and the information-exchanged multi-modal feature maps are summed element-wise, and the result serves as the output $O_i^s$ of the ISM module for subsequent feature aggregation and classification:
$$O_i^s = F_i^s + \tilde{F}_i^s.$$
The skin disease feature images of different modalities undergo a comprehensive process of information exchange, encompassing eight stages. This progressive information exchange facilitates the extraction of intricate feature relationships between the diverse modalities to a greater extent. The outcome of this process is a final feature map, which serves as a vital resource for the subsequent steps of feature fusion and classification.
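A compact PyTorch sketch of an information-switching block of this kind is given below. It treats the per-modality features at each spatial position as a short sequence and exchanges them with multi-head self-attention; it is our interpretation of the equations above rather than the released implementation.
```python
import torch
import torch.nn as nn

class InformationSwitching(nn.Module):
    """Exchange features across modalities at one stage via multi-head self-attention."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # channels must be divisible by num_heads
        self.msa = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)

    def forward(self, feats):                      # feats: list of n tensors, each (B, C, H, W)
        x = torch.stack(feats, dim=1)              # (B, n, C, H, W)
        b, n, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, n, c)   # one n-token sequence per spatial location
        switched, _ = self.msa(seq, seq, seq)      # queries, keys, and values from the same modality sequence
        switched = switched.reshape(b, h, w, n, c).permute(0, 3, 4, 1, 2)  # back to (B, n, C, H, W)
        out = x + switched                         # residual sum of original and exchanged features
        return [out[:, i] for i in range(n)]
```
Because every spatial position becomes a separate sequence, this exchange is lightweight per token but memory-intensive at early, high-resolution stages, which is consistent with the computational-cost limitation discussed in Section 5.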
3.5. Feature Aggregation and Classification
In this study, a merging technique is employed to fuse the feature maps obtained from the information exchange of different modalities. This fused feature map is then utilized for the subsequent classification of skin lesions.
Figure S22 in the Supplementary Material illustrates the specific process of feature fusion and classification. A 1 × 1 convolutional layer is employed to reduce the dimensionality of the feature maps obtained from the information exchange. Simultaneously, global average pooling (GAP) is used to compute the average of the two-dimensional feature maps along each channel. Consequently, one-dimensional vectors are derived. The feature vectors extracted from multiple modalities are fused using concatenation. Subsequently, the fused vector is processed by an FC layer and softmax activation to obtain the predicted probabilities for skin lesion classification.
As shown in Equations (13) and (14), each modality feature map $O_i$ obtained from the information exchange is individually processed through a 1 × 1 convolutional layer and GAP. The resulting vectors are then concatenated across all modalities to obtain a one-dimensional feature vector $v$:
$$v = \mathrm{Concat}\!\left(\mathrm{GAP}\!\left(\mathrm{Conv}_{1\times 1}\!\left(O_1\right)\right), \dots, \mathrm{GAP}\!\left(\mathrm{Conv}_{1\times 1}\!\left(O_n\right)\right)\right).$$
This feature vector is then fed into an FC layer followed by softmax activation to generate the predicted probability $\hat{y}$ for skin disease classification:
$$\hat{y} = \mathrm{softmax}\!\left(\mathrm{FC}\!\left(v\right)\right).$$
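A minimal head matching this aggregation step might look as follows; the channel sizes and the use of a single 1 × 1 convolution shared across modalities are simplifying assumptions made for brevity, and per-modality layers are equally plausible.
```python
import torch
import torch.nn as nn

class AggregationHead(nn.Module):
    """1x1 convolution + global average pooling per modality, concatenation, FC classifier."""
    def __init__(self, in_channels, reduced_channels, num_modalities, num_classes):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced_channels, kernel_size=1)  # dimensionality reduction
        self.pool = nn.AdaptiveAvgPool2d(1)                                    # global average pooling
        self.fc = nn.Linear(reduced_channels * num_modalities, num_classes)

    def forward(self, modality_feats):              # list of (B, C, H, W) tensors from the last ISM stage
        vecs = [self.pool(self.reduce(f)).flatten(1) for f in modality_feats]
        fused = torch.cat(vecs, dim=1)              # one-dimensional fused feature vector per sample
        return self.fc(fused)                       # logits; softmax is applied in the loss or at inference
```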
During the training phase of this research, Weighted Cross-Entropy Loss (WCEL) is employed as the loss function for the multi-modal skin disease classification model. WCEL is a technique used to address class imbalance or the varying importance of different classes. The mathematical formula for WCEL is as follows:
$$\mathcal{L}_{\mathrm{WCEL}} = -\sum_{c=1}^{C} w_c\, y_c \log\!\left(\hat{y}_c\right).$$
The weight assigned to class $c$ is denoted by $w_c$, while $y_c$ represents the true label for class $c$ (0 or 1) and $\hat{y}_c$ signifies the predicted probability of class $c$. The weights can be determined based on the class distribution in the dataset or manually defined to emphasize or de-emphasize specific classes. The loss is calculated for each class and then aggregated to obtain the overall loss value, which, in turn, is utilized to optimize the model during the training process.
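In PyTorch, this weighting can be realized directly with the weight argument of nn.CrossEntropyLoss. The inverse-frequency weighting and the per-class counts shown below are one common choice used for illustration, not necessarily the scheme adopted in this work.
```python
import torch
import torch.nn as nn

def make_weighted_ce(class_counts):
    """Weighted cross-entropy with inverse-frequency class weights (one common weighting choice)."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)     # rarer classes receive larger weights
    return nn.CrossEntropyLoss(weight=weights)

criterion = make_weighted_ce([1200, 500, 305])          # hypothetical per-class sample counts
```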
3.6. Experimental Settings
In this research, a deep learning platform for multi-modal skin lesion identification is constructed using PyTorch V2.0.1. The experimental setup consists of two GPUs with a memory capacity of 12 GB each and an Intel(R) i7 CPU. The network is trained using images resized to 384 × 384 pixels, and the Stochastic Gradient Descent (SGD) optimizer is employed. The initial learning rate is set to 0.05, and training is performed with a batch size of 32. This hardware and software configuration enables the efficient processing and optimization of the model for accurate skin lesion recognition. The detailed model configuration is shown in Table 4.
4. Evaluation Metrics and Results
Five performance criteria are used in this study to assess our multi-modal skin disease recognition model: f1-score, mean Average Precision (mAP), recall, accuracy, and precision. These measurements offer a thorough grasp of the model's performance in correctly categorizing skin conditions. Precision indicates the proportion of correctly classified positive samples out of all samples predicted as positive. It is calculated as the ratio of true positives (TP) to the sum of true positives and false positives (FP):
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Recall, sometimes referred to as sensitivity or the true positive rate, gauges the model's capacity to accurately identify positive samples among all real positive samples. It is computed as the ratio of true positives (TP) to the sum of true positives and false negatives (FN):
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Accuracy evaluates how accurate the model's predictions are overall by computing the ratio of correctly classified samples to all samples, where TN refers to the true negatives, i.e., the number of samples correctly predicted as negative:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The f1-score is a balanced evaluation metric that takes both precision and recall into account. It is calculated as the harmonic mean of precision and recall:
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
The mAP evaluates the model's performance in multi-class classification problems. It computes the Average Precision for each class and then takes the mean across all classes, serving as a robust measurement of the model's classification accuracy across multiple classes:
$$\mathrm{mAP} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{AP}_c,$$
where $\mathrm{AP}_c$ represents the Average Precision for class $c$ and $N$ is the total number of classes. These evaluation indicators of precision, recall, accuracy, f1-score, and mAP collectively provide insights into the model's performance in terms of prediction correctness, sensitivity, overall accuracy, and classification accuracy across multiple classes.
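These metrics can be computed with scikit-learn as sketched below; the macro averaging shown is an assumption about how per-class scores are aggregated.
```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score)

def evaluate(y_true, y_pred, y_prob, num_classes):
    """y_true/y_pred: class indices; y_prob: (N, num_classes) predicted probabilities."""
    y_onehot = np.eye(num_classes)[y_true]          # one-hot ground truth for mAP
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
        "mAP":       average_precision_score(y_onehot, y_prob, average="macro"),
    }
```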
5. Discussion
The current state of research in multi-modal skin lesion recognition has shown promising results, particularly with the integration of deep learning techniques [20,24,35]. Recent studies have highlighted the potential of combining various imaging modalities, such as dermoscopy, clinical images, and reflectance confocal microscopy, to improve the diagnostic accuracy of skin lesions [41]. However, challenges remain in effectively integrating these diverse modalities due to differences in data representation, resolution, and feature extraction requirements [42].
The proposed MDSIS-Net model addresses the challenges of integrating diverse imaging modalities for accurate multi-modal skin lesion recognition by employing a deep multi-modal information-switching network that captures the intricate interplay between these modalities and enables the critical exchange of multi-scale feature information, thereby advancing the examination of skin ailments through end-to-end recognition. A novel multi-scale information-switching structure is developed within MDSIS-Net to facilitate the exchange of deep and shallow features between different modalities. This structure automatically adjusts the weight of information exchange at different feature layers, while also enhancing the inter-modality association of deep and shallow features within individual modalities, thereby optimizing the distribution of the distinctive features of different diseases across diverse imaging modalities. MDSIS-Net leverages deep convolutional neural networks to extract fine-grained features of skin lesions from various skin image modalities, and it enhances the information exchange between different modality features at various scales, improving the interpretability and recognition accuracy of skin lesions in multi-modal skin imaging. The experiments are based on real clinical data and the public Derm7pt dataset. For DD, VISIA equipment is used to perform multi-modal imaging of patients, with each patient's images covering five modalities: spots, red marks, UV spots, porphyrins, and brown spots. The melanoma dataset includes both clinical and dermoscopic images. Our proposed model achieves an mAP of 0.967, accuracy of 0.960, precision of 0.935, recall of 0.960, and f1-score of 0.947 on the DD dataset, all surpassing the performance of the current best models. Furthermore, on the MM dataset, our proposed model achieves an mAP of 0.877, accuracy of 0.907, precision of 0.911, recall of 0.815, and f1-score of 0.851, all of which exceed the performance of the current best models. We apply Grad-CAM for model interpretability analysis, and the different modalities show hotspots that reflect the corresponding lesion features in each modality, similar to the regions clinicians focus on during diagnosis. Additionally, the t-SNE visualization and confusion matrix show that our model has excellent discriminability and feature-distinction ability across multiple classes.
This study has several limitations. First, as the number of modalities increases, the multi-modal network's computational complexity rises as well, requiring significant time and resources, which may limit its applicability in resource-constrained environments. Initial tests using Automatic Mixed Precision (AMP) [43] training, which switches from float32 to float16 precision, show promise for optimization, as they cut memory consumption in half with only a 1% accuracy loss. Second, Grad-CAM-based heatmap visualization, which performs unsupervised lesion region segmentation [44], may misidentify regions due to similarities in color, texture, or patterns between healthy and affected areas. Third, since image-based clinical diagnosis places high demands on image quality, the model's performance is sensitive to factors such as blurring, noise, and size variations, which can affect results. While the model supports DD and MM tasks, its generalizability to other skin diseases requires further validation.
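The AMP optimization mentioned above can be enabled in PyTorch with torch.cuda.amp, as sketched below; this is a generic illustration of mixed-precision training rather than the exact configuration used in our tests.
```python
import torch

scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid float16 gradient underflow

def train_step(model, modalities, labels, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # run the forward pass in mixed float16/float32 precision
        logits = model(modalities)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()           # backward on the scaled loss
    scaler.step(optimizer)                  # unscale gradients and update parameters
    scaler.update()
    return loss.item()
```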