1. Introduction
After alternating between periods of great enthusiasm and setback [1], AI has found its place as a critical component of growth in a variety of applications [2]. These applications, ranging from diagnostic decision support in healthcare and safety-critical systems in autonomous vehicles to long-term financial investment planning, all benefit from these breakthroughs [3].
AI is capable of analyzing complex data and exploiting non-intuitive approaches to derive meaningful relationships [4]. Healthcare applications based on AI are used in early detection, diagnosis, and treatment, as well as outcome prediction and prognosis evaluation [5]. The main barrier standing in the way of AI applications is their lack of transparency and their black-box nature, which cannot be explained directly [6]. The black-box nature of AI systems can be explained as follows: when an AI model learns and produces an output, it processes the data and deciphers the processed information immediately instead of storing what it has learned as a clear digital memory [7]. This is why an explainable and understandable glass-box approach should be taken to enable transparent, trustable, and re-traceable AI applications [8]. Chronic wound management, an important field in healthcare, also requires explainable AI models. In this study, AI techniques are applied to the classification of chronic wounds, i.e., diabetic ulcers, lymphovascular wounds, surgical wounds, and pressure injuries.
The term Explainable Artificial Intelligence (XAI) was coined to provide transparency and guided inference in understanding the decision-making processes of AI systems [9]. The study in [10] provides a comprehensive review of XAI in terms of concepts, taxonomies, opportunities, and challenges, as well as a discussion on adopting XAI techniques for image processing. The study in [11] summarizes recent developments in XAI and its connection with artificial general intelligence, and identifies trust-related problems of AI applications. The study in [12] examines the state of AI-based FDA-approved medical devices and algorithms. Although millions of dollars funded medical AI research in 2019, only ten (10) medical devices had been approved by the FDA. The authors in [13] present a comparative analysis of approved AI and ML medical devices. The approved devices are used mainly in radiology, and a few qualify as high-risk devices. Acceptance of AI is still low among medical practitioners owing to various matters related to trustworthiness and reliability [14]. The authors in [15] identified nuances, challenges, and requirements for the design of interpretable and explainable machine learning models and systems in healthcare and described how to choose the right interpretable machine learning algorithm. Conventional black-box AI systems are turned into glass-box systems with the help of XAI techniques, which provide data about the intermediate steps of the inference process [16,17]. An example is a computer-aided diagnosis system that not only outputs a prediction but also shows where it looked during the decision-making process by overlaying a heat map on an X-ray image. The study in [18] presents the Grad-CAM technique, which utilizes the gradients flowing into a convolutional layer to generate a highlighted localization map. Grad-CAM relies on the convolutional layers, whereas our proposed method identifies the most effective features by perturbing the input and observing the effect on classification. The authors in [19] presented classification tasks using LIME (Local Interpretable Model-Agnostic Explanations) to explain predictions of Deep Learning (DL) models, making these complex models partly understandable.
In [20], the authors proposed a classification technique that combines a Genetic Algorithm (GA) with an Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict heart attacks through XAI at satisfactory rates. The authors in [21] developed an assisted and incremental medical diagnosis system using XAI, which allows interaction between the physician (i.e., the human agent) and the AI agent. The authors in [22] investigated the problem of explainability of AI in the medical domain, where wrong system decisions can be very harmful, and proposed two approaches to explain predictions of deep learning models: (i) computing the sensitivity of the prediction with respect to changes in the input, and (ii) decomposing the decision in terms of the input variables. The authors in [23] investigated how to increase trust in computer vision through XAI and how to implement XAI to better understand AI in a critical area such as disease detection.
This paper presents a highly transparent and explainable artificial intelligence tool for the classification of chronic wounds, i.e., diabetic ulcer, lymphovascular, surgical, and pressure injury wounds. The objectives of the study are to:
Build a wound type classification model using deep learning and transfer learning methods.
Showcase an approach to make common AI models more transparent and explainable in order to understand the results and gain trust in the AI model.
Utilize readily available AI neural networks to show that more transparency or explainability can be introduced to a variety of commonly available models, such as transfer learning.
Apply XAI methods to convert complex black-box AI systems into more understandable glass-box AI systems that provide a look into the internal decision-making mechanics, giving the user the ability to follow the reasoning behind the AI model's predictions.
Provide insights into the complex decision-making processes of an AI system in the field of healthcare applications, especially chronic wound type classification.
3. Data Collection, Pre-processing, Environment, and Validation
This section discusses data collection, data pre-processing, and the test environment. Details about the dataset are given in the data collection section. Forming a ground truth for classification and the environment the model runs on are explained in the data pre-processing and environment sections, respectively.
3.1. Data Collection
The chronic wound data repository, which includes diabetic, lymphovascular, pressure injury, and surgical wound types, was collected from the eKare Inc. data repository and anonymized for patient privacy [41]. eKare Inc. specializes in wound management, with its services used by many hospitals and wound clinics for patient/wound management. A total of 8690 wound images were chosen by an MD specializing in wound care to represent the aforementioned wound types. The dataset comprises 1811 diabetic, 2934 lymphovascular, 2299 pressure injury, and 1646 surgical wound images.
The proposed model uses wound images to predict wound etiology utilizing transfer learning, data augmentation, and deep neural networks (DNN).
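For orientation, a minimal sketch of such a classifier in Keras is given below, assuming a VGG16 backbone (the backbone named in Section 5) with a small fully connected head; the head width and dropout rate are illustrative assumptions rather than the authors' exact configuration.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

NUM_CLASSES = 4  # diabetic, lymphovascular, pressure injury, surgical

# Pre-trained convolutional base (transfer learning); the original
# ImageNet classification head is dropped and replaced below.
base = VGG16(weights="imagenet", include_top=False,
             input_tensor=Input(shape=(224, 224, 3)))
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)   # head width is an assumption
x = Dropout(0.5)(x)                    # assumed regularization
outputs = Dense(NUM_CLASSES, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)
```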
3.2. Data Pre-Processing
The dataset was reviewed by a trained MD to ensure the correct classification of the underlying chronic wound etiology; this validated classification serves as the clinical ground truth. Wound images are hand-labeled accordingly for wound type classification.
The distribution of the dataset is uneven, as the dataset is fine-tuned for a correct representation of chronic wound classes. Data augmentation techniques such as mirroring, rotation, and horizontal flipping are used to increase the dataset size and maintain class balance. The dataset, 8690 images in total, was split into training and test sets of 6520 and 2170 images, respectively. The collected data was pre-processed to increase data quality, including formatting, rescaling, and normalization of the images. Images were scaled to 224 × 224 pixels and normalized for a faster learning process.
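A minimal sketch of this pre-processing and augmentation pipeline, assuming the Keras ImageDataGenerator API, is shown below; the rotation range, batch size, and directory layout are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,     # normalize pixel values to [0, 1]
    rotation_range=20,     # random rotation (assumed range)
    horizontal_flip=True,  # horizontal flip / mirroring
    vertical_flip=True,    # mirroring along the other axis
)
test_gen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation for test data

# Hypothetical directory layout with one subfolder per wound type.
train_flow = train_gen.flow_from_directory(
    "wounds/train", target_size=(224, 224),
    batch_size=32, class_mode="categorical")
```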
3.3. Environment
The proposed model was implemented using the Keras deep learning framework with Python version 3.6. We ran our model on a workstation with an Intel® Core™ i7-8700X CPU @ 3.20 GHz, 32 GB of memory, and an NVIDIA GeForce GTX 1080 GPU with 8 GB dedicated and 16 GB shared memory. We trained the model for 1000 epochs: a 250-epoch warm-up training only the fully connected (FC) layers, followed by an additional 750 epochs training the FC layers together with the final set of convolutional layers. The total training of the model took around 8 h. We used a constant learning rate of 0.001 with the RMSprop optimizer for training.
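The two-phase schedule can be expressed as in the sketch below, which reuses `base`, `model`, and `train_flow` from the earlier sketches; unfreezing only VGG16's `block5` layers in the second phase is an assumption about what "the final set of convolutional layers" means.

```python
from tensorflow.keras.optimizers import RMSprop

# Phase 1 (warm-up, 250 epochs): freeze the convolutional base and
# train only the fully connected head.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=250)

# Phase 2 (fine-tuning, 750 epochs): additionally unfreeze the final
# convolutional block (assumed to be VGG16's block5) and retrain.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=750)
```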
3.4. Validation
Validation was done using the confusion matrix shown in Table 1. Precision gives the ratio of correctly classified wound types over total positive wound type predictions. Recall is a measure of how many of the positive wounds are correctly classified; this metric checks predictions from the perspective of the true labels. A high recall value corresponds to the identification of more true positives and, therefore, fewer incorrectly classified samples. Interestingly, both of these metrics could be high, yet the model could still underperform. This is why a third metric is used to characterize model performance: the F1-score is a hybrid measurement that brings together both precision and recall for a better evaluation.
Performance measures are given in Equations (1)–(3) below, where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
F1-score = 2 × (Precision × Recall) / (Precision + Recall) (3)
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are also used as performance measures and are shown in Figure 3. Higher AUC values indicate a stronger classification capability of the proposed model. The Y-axis of the ROC curve is recall (the true positive rate), and the X-axis is the false positive rate (FPR), given in Equation (4) below, where TN denotes true negatives.
FPR = FP / (FP + TN) (4)
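For concreteness, the scikit-learn calls below compute the same measures; `y_true` and `y_score` stand in for the test labels and the model's softmax outputs, and the toy values exist only to make the snippet runnable.

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)

# Placeholders for the test-set labels (0-3) and softmax probabilities;
# in practice y_score would come from model.predict on the test images.
y_true = np.array([0, 1, 2, 3, 1, 2])
y_score = np.random.default_rng(0).dirichlet(np.ones(4), size=6)

y_pred = np.argmax(y_score, axis=1)          # predicted class indices
print(confusion_matrix(y_true, y_pred))      # counterpart of Table 1
print(classification_report(                 # Equations (1)-(3), per class
    y_true, y_pred, labels=[0, 1, 2, 3],
    target_names=["diabetic", "lymphovascular",
                  "pressure injury", "surgical"]))
print(roc_auc_score(y_true, y_score,         # one-vs-rest AUC, cf. Figure 3
                    multi_class="ovr"))
```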
4. Implementation of Transfer Learning and XAI Approaches on Wound Classification
The objective of this paper is to explore and apply XAI methods to chronic wound classification to expand knowledge about the opaque "black-box" structure of machine learning models. The test dataset comprised 25% of the data, while the remaining 75% was used as training data. Data augmentation techniques, such as mirroring, rotation, and horizontal flipping, are used to avoid overfitting and to enlarge the dataset for better training performance. Test data is indexed for generalization of the model and proper comparison. Transfer learning is realized in two steps: first, a warm-up phase, and second, a fine-tuning phase. Using transfer learning, this study provided satisfying results according to the performance metrics, namely F1-score, recall, and precision (derived from the confusion matrix). Precision, recall, and F1-scores of each wound type, and their averages, are compared in Table 2.
Higher precision values for the lymphovascular, surgical, and pressure injury wound types indicate that the model performed very well with these wound types, whereas pressure injuries were harder to diagnose (low recall score for pressure injury wounds). This means that some pressure injury wounds are not learned, or are similar to another wound type and misclassified by the model. Lymphovascular wounds have one of the highest recall scores among all wound types, which shows that the proposed method is capable of diagnosing lymphovascular wounds. The F1-score for lymphovascular wounds is high, while that for pressure injury is low. Surgical wounds have fair precision and F1-scores but a low recall score; hence, our model is likely to classify a surgical wound as diabetic. The recall of the diabetic wound type is quite high, yet it has one of the lowest F1-scores, which is a result of low precision. The ROC curve and AUC results are depicted in Figure 3. Lymphovascular and surgical wounds have the highest AUC values, whereas diabetic and pressure injury wounds suffer from low precision (diabetic) and low recall (pressure injury).
As AI-based products provide efficiency and automation, AI has become very popular in low-risk fields, such as agriculture, customer services, and manufacturing. However, applications of AI remain limited in high-risk domains such as healthcare, as trust is critical in medical practice [14]. Reliability concerns of patients and medical practitioners, as well as regulations, hinder the adoption of AI-based systems [12]. Understanding the rationale behind model predictions would certainly help users decide when to trust them and when not to.
A deep neural network using the transfer learning technique was trained on chronic wound images to predict the wound type. Accurate wound type designation helps a clinician classify the wound, which serves to better steer the treatment approach. The prediction of the image classifier is then explained by an "explainer" that points to the visual features of the image that are most important to the model. With this information about the model's rationale, the clinician can decide whether or not to trust the model. Model outputs include an understandable qualitative link between inputs and predictions, which is an essential part of the explainability aspect [42]. The rich model feature set is too numerous and difficult to interpret directly, yet by facilitating a guided qualitative approach, human reasoning can be augmented with additional model data [43]. Another significant property that a reliable explainer should have is local faithfulness. Local faithfulness is achieved by characterizing the response of a local function over a range of adjacent inputs [44].
In this study, the DNN model with transfer learning and an extended XAI technique is used to provide explainability and transparency for wound image classifiers by visually indicating which image regions drive the estimate for a particular class. The proposed model forms a hybrid XAI framework through the combined use of LIME and heatmap proposals. The LIME architecture using superpixels is implemented similarly to the study in [42]. LIME provides a set of correlated and connected pixels, which are used as input to the heatmap method. The proposed model provides focus for the classification task through the use of a heatmap. Medical practitioners often conceptualize the clinical problem based on knowledge acquired in medical school as well as clinical experience. The heatmap approach is a fairly naïve method of drawing focus to different image regions based on the model. The basic intuition behind the heatmap is that, by drawing focus to certain image regions, practitioners will narrow their attention to regions where the heatmap data correlates with their medical intuition. Warmer colors indicate the more critical areas of the wound in the importance map.
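A sketch of this LIME step, assuming the open-source `lime` package, is given below; `wound_image` is a hypothetical placeholder for one pre-processed test image, the sample count follows the library default rather than the paper, and the segment-to-heatmap mapping follows the pattern from the lime documentation.

```python
import numpy as np
from lime import lime_image

# `wound_image`: HxWx3 float array in [0, 1]; `model`: the trained
# classifier from the earlier sketches.
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    wound_image, model.predict, top_labels=4, num_samples=1000)

# Map each superpixel's LIME weight back onto the image to build the
# importance heatmap; higher values mark regions driving the prediction.
top_class = explanation.top_labels[0]
weights = dict(explanation.local_exp[top_class])
heatmap = np.vectorize(weights.get)(explanation.segments)
```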
The proposed model classifies a chronic wound as a lymphovascular wound with a probability of 99.9%, as shown in Figure 4. Figure 4b highlights the model's focus area for the classification task in the wound image, with an importance map as an explanation. Figure 5, Figure 6, Figure 7 and Figure 8 show images of diabetic, lymphovascular, pressure injury, and surgical wounds. Each wound type has a respective heatmap highlighting the focus area that leads the model to choose the proper wound type. The diabetic wound is correctly predicted at 95.36% (pressure injury: 4.07%, lymphovascular: 0.01%, surgical: 0.56%) and the lymphovascular wound is predicted at 100% (diabetic: 0%, pressure injury: 0%, surgical: 0%) in Figure 5 and Figure 6, respectively. The lower diabetic wound classification probability could be increased with additional data to amplify feature extraction of diabetic wounds during training.
Probabilities of wound classification are very high for Figure 7, i.e., a pressure injury wound at 100% (lymphovascular: 0%, surgical: 0%, diabetic: 0%), and for Figure 8, i.e., a surgical wound at 99.91% (diabetic: 0.05%, pressure injury: 0.03%, lymphovascular: 0.01%).
Figure 5a,b show explanations of the most important features that contribute to the prediction. Like Figure 5a,b, Figure 6a,b show explanations and map the features with the highest contribution to the prediction for the lymphovascular classification. Both figures provide insights as to why the wound type was predicted to be diabetic or lymphovascular. The focus for the diabetic wound includes the surrounding wound tissues and toes, with the shape of the ulcer and its proximity to the toes as the explanation of the diabetic foot ulcer.
The lymphovascular wound, as seen in Figure 6a, is explained with a focus on deeper damaged tissue. This kind of explanation enhances trust in the wound classifier and helps caregivers make a decision and support it with a visual explanation.
The pressure injury wound explainer focuses on the wounded area and indicates the correct placement of the wound, as shown in Figure 7b. In Figure 8, a surgical wound image is explained by its scar pattern and the shape of the wound. The explainer identifies the scar of the wound as the strongest feature, and the wound area is highlighted by the proposed model with an importance map.
The proposed method explains diabetic wounds with respect to wound tissue and ulcer location. Diabetic ulcers mostly occur under the foot and follow a similar pattern. A different diabetic wound, occurring just below the ankle, is shown in Figure 9 and is misclassified as a lymphovascular wound. This kind of ulcer is hard to differentiate from lymphovascular wounds because of its location, as lymphovascular wounds frequently occur at the ankle. Misclassification of a diabetic wound can also be the result of a large wound area, as lymphovascular wounds typically cover larger areas than diabetic ulcers.
Lymphovascular wounds are detected with high probability. There is a slightly lower probability for the lymphovascular wound in Figure 10. The spread of the wound forms a line that looks like a surgical wound's scar, and the darker part of the wound also resembles a diabetic ulcer. This is why the proposed model assigns about a 7% probability to each of these wound types. Nonetheless, the proposed method highlights the important area for the lymphovascular wound correctly.
It is assumed that the pressure injury wound in Figure 11 is misclassified due to the size and shape of the wound area. A pressure injury typically has a large wound area with surrounding damaged skin. As shown in Figure 11, the wound occurs under the foot, which is a common diabetic wound area, and the wound area is also smaller in comparison to regular pressure injury wounds. These are the reasons why the proposed model misclassified this pressure injury wound image.
Figure 12 depicts a surgical wound, which is correctly classified with a probability of 63.4%. This surgical wound might be the result of a previous pressure injury that covered a larger area; the vast spread of the wound leads the model to this conclusion. In addition, the model is confused by the edge of the white cloth, which causes a larger highlighted area. The darker and deeper wound in the middle might be the reason for the high diabetic wound percentage. On the other hand, surgical wounds tend to take longer to heal and may convert to diabetic ulcers in diabetic patients. Model classification performance could be increased by collecting more data, as this would strengthen the extraction of wound features in the training phase.
5. Results and Discussion
The proposed model extracts features with convolutional networks from a pre-trained VGG16 network. The use of transfer learning accelerates training and produces efficient results, as shown in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8. Performance metric evaluation of the model on diabetic wounds (with a precision of 0.85, recall of 1.00, and F1-score of 0.92) indicates that the model has limitations with feature identification for this wound type. This is especially evident with sparse datasets. Surgical wounds have a fair performance on the evaluation metrics, with precision, recall, and F1-scores of 1.00, 0.91, and 0.95, respectively. Precision, recall, and F1-scores of lymphovascular wounds are 0.95, 0.98, and 0.96, respectively. The pressure injury wound type has one of the highest precisions, 1.00, a low recall score, 0.86, and an F1-score of 0.92. Surgical and pressure injury wounds have good precision and low recall scores. The low recall score of pressure injury wounds is an indicator that the proposed model has some difficulty in learning the features of pressure injury wounds. The proposed model has an average precision of 0.95, recall of 0.94, and F1-score of 0.94. The ROC curve and the AUC provide a visualization of the performance of the model on the classification task. The performance of the model could be improved with a larger training dataset [45] and by fine-tuning the hyperparameters [46].
The second part of the model specializes in explaining why the model gives a specific output, using a hybrid structure that extends the LIME technique with a heatmap model. The heatmap is used as a tool to draw focus to image regions, with the intuition that practitioners will take less time under such guidance. The explainer of the proposed model is successful, while the classification part of the hybrid model could be further improved with additional data (a common problem in data-hungry deep learning models). The explainer provides visual cues through a heatmap overlaid on wound images to indicate the image regions identified by the AI model.
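For illustration, the heatmap from the earlier LIME sketch could be overlaid with matplotlib as follows; the colormap and transparency are presentation choices, not values taken from the paper.

```python
import matplotlib.pyplot as plt

# Overlay the LIME-derived importance heatmap on the wound image.
plt.imshow(wound_image)
plt.imshow(heatmap, cmap="jet", alpha=0.5)  # warmer colors = more important
plt.colorbar(label="feature importance")
plt.axis("off")
plt.show()
```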
A clinician may eliminate certain wound types from consideration based on the location of the wound. For example, in the case of a plantar foot ulcer, a doctor will likely eliminate sacral pressure injury from the possible wound type list. This is why wound location is important, and an explanation of a wound type should also indicate location information for a complete understanding. The diabetic wound type is explained via the corresponding deeper and darker damaged tissue size and its location on the toes. These features are stressed and shown in Figure 5. Lymphovascular wound features are highlighted and shown in Figure 6, where the size and texture of the damaged tissue are essential indicators. The explanation of the lymphovascular wound type is unexpected: its focus is on the border of the lesion and the adjacent areas instead of the whole lesion. This is another case whereby deep learning utilizes a non-intuitive search space that provides important information. Pressure injury wounds are explained via wound tissue and the surrounding wound area, as seen in Figure 7. Pressure injury wounds often have a surrounding region of newly healed or damaged skin immediately adjacent to the larger wound. A surgical wound has more straightforward features to explain, such as postoperative scars and stitches.
Observations deduced from the results of the proposed model are summarized below:
Observation 1: AI applications with XAI have high potential for improving explainability and transparency in high-risk industries, such as healthcare, where trust is key.
Observation 2: Limitations in the classification task carry over to the explanation part of the model.
Observation 3: The list of possible wound types is narrowed significantly based on wound location.
Observation 4: The explainer takes a different approach for each class, yet it uses a qualitative method to explain decisions.
Observation 5: Qualitative methods may explain AI models better to non-subject experts as model parameters and inputs alone are too numerous to be meaningful to non-experts.
Observation 6: Given hardships in understanding quantitative methods, human reasoning can be augmented through qualitative methods.
Observation 7: XAI has great potential to improve overall model performance by analyzing the effect and importance of features.
Observation 8: Non-expert users are often able to intuitively grasp the rationale behind class decisions made by the model.
Observation 9: AI decision-making processes might be unanticipated, yet they can provide insights and improve how we handle certain tasks through a bottom-up approach.
6. Conclusions
This paper presents a use case of wound type classification in the healthcare domain using an explainable artificial intelligence model. The proposed model is used to augment decision-making through clinician guidance. Moreover, the proposed method reveals the underlying reason for a particular output by analyzing the relationship between input and output. This study intends to showcase an approach to make common AI models more transparent and explainable in order to understand the results and gain trust in the AI model. By utilizing readily available AI neural networks, it is shown that more transparency and explainability can be introduced to a variety of commonly available models, such as those based on transfer learning.
A DNN using the transfer learning technique is utilized to predict the classification of four wound types: diabetic, lymphovascular, pressure injury, and surgical. The model accepts an image as input and predicts the etiology of a chronic wound as output. We discussed that trust is crucial for effective human interaction with machine learning systems and that explaining individual predictions is important in assessing trust. We used the XAI techniques identified here in a healthcare application to faithfully explain predictions of wound type classifications in an interpretable manner through the use of heatmaps. The proposed model extends the LIME technique with a heatmap method for better explainability. XAI techniques allow AI systems to cooperate with non-expert end-users: the AI and the end-user give each other feedback to arrive at a decision together, guiding a human, e.g., a researcher or caregiver, during a classification task. The system can also explain how a decision was made, tracing back through the inner workings of the AI system. Transparency is crucial in developing caregiver confidence and improving wound treatment.
This study demonstrated that explanations are useful for wound type classification in the healthcare domain when assessing trust, and for developing new approaches to wound classification and prediction insights. The proposed hybrid model performs well on both chronic wound classification and explanation tasks. Collecting additional data will further increase classification performance. Interpretation of the results obtained from the XAI module provides satisfactory information about the chosen wound type. The application of other XAI techniques, such as Taylor decomposition, Grad-CAM, and sensitivity analysis, would enhance the overall trustworthiness of the model as well.
It is expected that this work will benefit researchers and caregivers working in the chronic wound management field by providing insights into the potential and availability of XAI in healthcare applications.