1. Introduction
The advancement of computer vision has led to a significant focus on medical image segmentation as a crucial area of research. The demand for computer-aided diagnosis (CAD) systems has increased, aiming to support healthcare professionals with accurate interpretations of medical images and to improve the efficiency of healthcare workflows. Automated image segmentation is crucial in the analysis of medical images, serving as a fundamental component of accurate and reliable diagnoses. Skin cancer, a highly prevalent and potentially life-threatening form of cancer, primarily affects the different layers of human skin, namely the dermis, epidermis [1], and hypodermis. These tissues collectively make up the structure of the skin and are involved in various physiological functions. Melanoma, which arises from the abnormal proliferation of melanocytes within the epidermis, represents the most severe form of skin cancer [2]. With a mortality rate of 1.62%, it is regarded as the deadliest variant among skin cancers. The timely and accurate detection of melanoma is crucial for effective treatment and improved patient outcomes. An estimated 96,480 new cases of melanoma were diagnosed in 2019, making it a significant public health concern [3,4]. According to the World Health Organization (WHO), non-melanoma skin cancers account for a considerable number of cases worldwide, with an estimated incidence of 2 to 3 million annually [3]. This highlights the substantial burden of non-melanoma skin cancers and emphasizes the need for effective prevention and treatment strategies to address this significant global health issue.
The early detection of melanoma is paramount for successful treatment outcomes, as studies have demonstrated that timely diagnosis can significantly improve the five-year relative survival rate, reaching up to 92% [3]. In this regard, the precise and early identification of skin lesions plays a crucial role in computerized systems aimed at the accurate diagnosis of skin cancer, as depicted in Figure 1. In the past decade, considerable attention has been devoted to the development of automated skin cancer detection methods covering various skin conditions, including the three primary forms of abnormal skin cell growth: basal cell carcinoma, squamous cell carcinoma, and melanoma. These efforts advance the field of dermatology and improve the effectiveness of early intervention strategies.
The accurate segmentation of skin cancer in medical images poses several challenges, including low contrast, texture variations, variations in positioning, color variations, and differences in lesion sizes. Moreover, the presence of artifacts, such as air bubbles, hair, dark borders, measurement marks, blood vessels, and uneven color illumination, further complicates the task of segmenting lesions. Together, these factors make the accurate delineation of skin cancer lesions a demanding task.
In this paper, we propose a novel self-supervised clustering network for skin lesion segmentation that improves the feature representation of each pixel and iteratively learns the optimal cluster size. The proposed approach utilizes a context-based consistency loss function, which helps effectively segment image regions with unclear boundaries and noise. This loss function enforces all pixels belonging to a cluster to be spatially close to the cluster center, leading to improved clustering performance. To evaluate the effectiveness of our proposed approach, we conducted experiments on a public skin lesion segmentation dataset, PH2, and compared our approach to other unsupervised and self-supervised clustering methods. The experimental results demonstrate that our proposed approach achieves state-of-the-art segmentation accuracy, indicating its potential for practical use in skin lesion segmentation tasks. Our contributions can be summarized as follows:
Hybrid CNN/Transformer Architecture: We introduce a unique architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. This synergy enables our model to effectively capture both local and global contextual information, enhancing skin lesion boundary detection.
Precise Boundary Detection: Our method excels in accurately identifying skin lesion boundaries, even in scenarios with complex object background overlaps. This capability is crucial for ensuring precise diagnosis and treatment planning.
Self-Supervised Learning: Unlike traditional supervised methods requiring pixel-level annotations, our approach operates in a self-supervised manner. This significantly reduces the annotation burden, making it more practical and cost-effective.
Boundary and Contextual Attention Modules: We introduce innovative boundary and contextual attention modules that improve the accuracy of pixel clustering and subsequent segmentation. These modules enhance the model’s ability to capture fine-grained details.
State-of-the-Art Performance: Through comprehensive evaluations on skin lesion segmentation datasets, we demonstrate that our proposed method achieves state-of-the-art performance. Our model outperforms existing approaches in terms of both quantitative metrics and qualitative segmentation results.
2. Related Work
Supervised deep learning methods applied to images have demonstrated their potential in extracting image features to support various image analysis tasks. Many of the techniques that have shown success in logistics applications, such as image segmentation for parcel [5,6] and packaging recognition [7,8], have the potential to be applied in the medical image domain as well. These techniques rely on abundant labeled data, which is costly to obtain [9], yet they have demonstrated strong efficacy and promise [10,11]. However, the use of these methods for skin cancer segmentation faces challenges due to limited labeled data, costly and time-consuming manual delineation by imaging experts, variability among experts, and the complexity of the images, which can have diverse appearances based on factors like skin lesion type in dermoscopic images. To address these challenges, researchers have employed various techniques, such as transferring deep learning knowledge between diverse domains, fine-tuning models using limited labeled data (domain adaptation), unsupervised feature learning using algorithms like sparse coding [12] and auto-encoders [13], and self-supervised learning [14].
Self-supervised learning, a form of unsupervised learning, shows promise in computer vision and medical image segmentation tasks [15,16]. Notably, in self-supervised learning, the data itself generates supervisory signals for feature learning, offering a unique approach to address the scarcity of labeled data while extracting informative features [17]. For instance, Zhuang et al. [18] employed self-supervised learning techniques to train a Convolutional Neural Network for brain tumor segmentation using 3D CT scan images. By leveraging self-supervised learning, they were able to enhance the network’s performance without the need for manual annotations or labeled data. Likewise, Tajbakhsh et al. [19] adopted self-supervised learning methods to generate supervisory signals for lung lobe segmentation in CT scans. Their approach involved predicting color, rotation, and noise within the images. Through self-supervised learning, they effectively tackled the challenge of limited labeled data, resulting in improved accuracy in lung lobe segmentation tasks.
Another method for generating supervisory signals involves the use of image clustering [20,21]. Caron et al. [20] presented a technique known as DeepCluster, which offers an iterative approach to enhance both the feature representation and the clustering assignment of image features during the training of Convolutional Neural Networks (CNNs). In a similar vein, Ji et al. [21] introduced Invariant Information Clustering (IIC), a method that aims to improve clustering performance by maximizing the mutual information between image patches that have undergone spatial transformations. In the field of medical imaging, ref. [22] employed the k-means clustering technique to group pixels with similar characteristics in micro-CT images. By utilizing the cluster labels, they were able to learn the feature representation of these pixels. However, it is important to acknowledge that self-supervised clustering methods have certain limitations. These include the manual selection of the cluster size (e.g., determining the value of k in k-means) and difficulties associated with intricate regions that have fuzzy boundaries, diverse shapes, artifacts, and noise.
Karimi et al. [23] introduced a novel dual-branch transformer network to incorporate global contextual dependencies while preserving local information at various scales. Their self-supervised learning approach takes into account semantic dependencies between scales, generating a supervisory signal to ensure inter-scale consistency and a spatial stability loss within each scale for self-supervised content clustering. To further enhance the clustering process, the authors introduced an additional cross-entropy loss function on the clustering score map, enabling the effective modeling of each cluster distribution and improving the decision boundary between clusters. Through iterative training, the algorithm learns to assign pixels to semantically related clusters, resulting in the generation of a segmentation map.
In comparing the self-supervised approaches discussed above, it becomes evident that each method offers unique strengths and limitations. Self-supervised clustering methods, such as DeepCluster [20] and IIC [21], have shown their ability to enhance feature representation and clustering assignment iteratively during training. These approaches leverage inherent data structures to improve segmentation accuracy, yet they may struggle with intricate regions characterized by fuzzy boundaries and diverse shapes. On the other hand, the utilization of k-means clustering [22] for feature learning demonstrates simplicity and effectiveness in certain contexts while facing similar challenges with complex boundaries. Notably, the recent dual-branch transformer network of Karimi et al. [23] showcases a novel avenue for self-supervised content clustering. This approach combines global contextual dependencies and local information, resulting in inter-scale consistency and accurate spatial stability, thereby addressing the limitations of earlier techniques. While these methods show promise, further exploration is warranted to overcome the challenges posed by complex object-background overlap and noisy annotations in the context of skin lesion segmentation.
3. Proposed Method
The proposed method is illustrated in Figure 1, providing an overview of our approach. Initially, the input image $X \in \mathbb{R}^{H \times W \times D}$, where $H \times W$ indicates the spatial dimensions and $D$ the number of color channels, undergoes a two-pronged processing pipeline. The Convolutional Neural Network (CNN) component captures local feature descriptions, focusing on the extraction of fine-grained details. Simultaneously, the Transformer model operates on the input image, modeling global representations and long-range dependencies. This dual pathway enables the incorporation of both local semantic information and global contextual understanding. To efficiently combine local and global representations, as well as pixel-level information (RGB color), we propose a novel context attention strategy. The context attention module calculates the feature correlation matrix by leveraging the interdependencies between local and global features. Moreover, by conditioning the feature space using pixel color information, it incorporates prior knowledge to construct the normalized feature set $F_c$ for content clustering. In essence, this fusion process enables the integration of pixel prior information with local and global cues, resulting in a comprehensive representation that captures fine details and contextual information effectively.
Subsequently, the final segmentation map is created by selecting the channel-wise dimension with the highest response value in $F_c$, employing the argmax function. This step maps each pixel to its respective cluster label without the need to manually label each pixel, which in turn facilitates the generation of a segmentation map that assigns semantic labels to image regions. Training of the network parameters, denoted as $\theta = (\theta_{cnn}, \theta_{trans})$, is accomplished through the minimization of the cross-entropy loss. This loss function measures the discrepancy between the generated score map $S$ and the cluster labels $\hat{y}$ assigned to each pixel, encouraging the network to produce accurate and meaningful segmentations:
$$ \mathcal{L}_{ce} = -\sum_{i=1}^{H \times W} \sum_{k=1}^{K} \hat{y}_{i,k} \log S_{i,k}, \tag{1} $$
where $K$ denotes the number of clusters and $\hat{y}_{i,k}$ is the one-hot pseudo-label obtained from the argmax operation.
To further enhance the clustering assignments and capture spatial relationships within the image, we introduce two additional components. Firstly, the spatial consistency loss emphasizes the importance of spatial proximity between pixels, reinforcing the understanding of local connections. This loss encourages the network to group together visually similar pixels that are in close spatial proximity, promoting spatially coherent segmentations. Secondly, the object-level interaction module (modeled by boundary representation) enables the network to separate different clusters and encourages the merging of semantically similar clusters. By incorporating these components, the clustering assignments are refined, leading to improved segmentation results. The proposed architecture adopts an iterative learning process, allowing for the continuous improvement of both the feature representation and clustering assignment of each pixel. In the next subsections, we will elaborate on each part of the network in more detail.
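To make the clustering objective concrete, the following is a minimal PyTorch-style sketch of the pseudo-labeling step behind Equation (1); the function name and the (B, K, H, W) score-map layout are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def clustering_ce_loss(scores: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (1): the argmax over the cluster dimension of the
    score map S (shape (B, K, H, W)) serves as per-pixel pseudo-labels,
    and cross-entropy sharpens the network's own cluster assignments."""
    pseudo_labels = scores.argmax(dim=1)         # (B, H, W) cluster indices
    return F.cross_entropy(scores, pseudo_labels)
```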
3.1. Local Features
Our proposed method, outlined in Figure 1, incorporates two encoding streams to effectively capture both local and object-level contextual representations. In the first pathway, we employ a shallow CNN network to extract local representations. Given an input image $X$, our CNN encoder $E_{cnn}$ utilizes a series of convolutional blocks to model pixel-level contextual representations $F$. Next, we follow the same strategy suggested in [24] to compute the channel-wise feature normalization weights as follows:
$$ w = \sigma\left(W_2\,\delta\left(W_1\,\mathrm{GAP}(F)\right)\right) \tag{2} $$
Here, $\mathrm{GAP}$ represents the global average pooling operation applied to the CNN features ($F$), while $W_1$ and $W_2$ denote the learnable parameters. $\delta$ and $\sigma$ refer to the ReLU and Sigmoid activation functions, respectively. The normalized features, denoted as $\hat{F}$, are obtained by element-wise multiplication of the weights and the original features:
$$ \hat{F} = w \odot F \tag{3} $$
However, the inherent locality of the convolutional operation limits its ability to fully capture object-level interactions. To address this limitation and incorporate object-level representations, we introduce an additional component to learn the boundary heatmap. This is achieved through the application of a kernel convolutional operation, denoted as $\mathcal{K}_b$, which operates on the output of the CNN encoder $E_{cnn}$. The resulting boundary heatmap, denoted as $B$, serves as a surrogate signal for modeling regional contextual dependencies. By encoding the boundaries between regions, it facilitates the representation of object-level interactions and enables the network to capture spatial relationships and contextual dependencies at a larger scale. This integration of the boundary heatmap within the encoding process enhances the overall understanding of the image, contributing to more comprehensive and contextually informed representations. By combining the normalized CNN features with the boundary heatmap, our proposed method generates the local feature set $\hat{F}_b$, thereby capturing comprehensive contextual representations that encompass both pixel-level and object-level information. This enables the network to better model the intricate interactions and dependencies within the image, leading to improved performance in tasks requiring detailed contextual understanding.
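To illustrate this local pathway, here is a minimal PyTorch sketch combining a shallow encoder, the channel re-weighting of Equations (2) and (3), and a boundary head; the layer widths, reduction ratio, and single-channel boundary heatmap are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Shallow CNN branch: pixel-level features F, channel re-weighting
    (Eqs. 2-3), and a boundary heatmap head. Layer sizes are
    illustrative assumptions."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Channel weights: w = sigmoid(W2 · relu(W1 · GAP(F)))
        self.w1 = nn.Linear(feat_ch, feat_ch // 4)
        self.w2 = nn.Linear(feat_ch // 4, feat_ch)
        # 1x1 convolution producing a single-channel boundary heatmap B.
        self.boundary = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, x):
        f = self.encoder(x)                                 # F: (B, C, H, W)
        gap = f.mean(dim=(2, 3))                            # GAP(F): (B, C)
        w = torch.sigmoid(self.w2(torch.relu(self.w1(gap))))
        f_hat = f * w[:, :, None, None]                     # F_hat = w ⊙ F
        b = torch.sigmoid(self.boundary(f))                 # boundary heatmap B
        return f_hat, b
```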
3.2. Long-Range Dependency
The Vision Transformer (ViT) is a pioneering architecture that has gained significant attention in the field of computer vision [25]. It represents a departure from traditional Convolutional Neural Networks (CNNs) by relying on the principles of self-attention mechanisms commonly used in natural language processing tasks. In the context of image analysis, ViT divides an input image into fixed-size patches, which are then linearly embedded and processed using multi-head self-attention mechanisms. This approach enables ViT to capture long-range dependencies between image patches and model intricate relationships within the data. The resulting representations are subsequently fed through fully connected layers to perform classification or regression tasks. ViT’s unique ability to effectively capture global context information across image patches has led to its success in tasks such as image classification and object detection [25]. In our proposed self-supervised skin lesion segmentation method, we employ ViT as a foundational component within our dual-branch network, leveraging its capacity to capture both local and global contextual dependencies for the accurate clustering and segmentation of skin lesions. Hence, in the second path, we incorporate a Vision Transformer module to capture the global contextual representation. However, the standard self-attention mechanism, as expressed in Equation (4), poses computational challenges when dealing with high-resolution images due to its quadratic computational complexity ($\mathcal{O}(n^2)$):
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{4} $$
In this equation, $Q$, $K$, and $V$ represent the query, key, and value vectors, respectively, while $d$ denotes the embedding dimension.
Shen et al. [26] introduced the concept of efficient attention, which capitalizes on the observation that regular self-attention generates a redundant context matrix. To address this inefficiency, they devised an alternative approach for computing self-attention, as described in Equation (5), which streamlines the self-attention process and improves computational efficiency:
$$ E(Q, K, V) = \rho_q(Q)\left(\rho_k(K)^{\top} V\right), \tag{5} $$
where $\rho_q$ and $\rho_k$ denote the normalization functions for the queries and keys, respectively. Research conducted by Shen et al. [26] has demonstrated that efficient attention can yield equivalent results to dot-product attention by applying softmax normalization functions ($\rho_q$ and $\rho_k$). In the efficient attention mechanism, the normalization of keys and queries is performed prior to the multiplication of keys and values. The resulting global context vectors are then multiplied by the queries to generate the new representation. Unlike dot-product attention, which calculates pairwise similarities between data points, efficient attention represents the keys as attention maps ($d_k$ maps denoted as $k_j^{\top}$, with $j$ indicating position $j$ in the input feature). These global attention maps capture the semantic aspects of the entire input feature rather than focusing on positional similarities. This change in order significantly reduces the computational complexity of the attention mechanism while maintaining a high level of representational power. The memory complexity of efficient attention is determined by the term $\mathcal{O}(dn + d^2)$, reflecting the influence of the embedding dimension $d$ and the number of data points $n$. Additionally, the computational complexity is proportional to $\mathcal{O}(d^2 n)$ in scenarios where $d_v = d$ and $d_k = d/2$, a common configuration. In our proposed structure, we leverage an efficient attention block to construct our Transformer-based encoder module $E_{trans}$ to capture the spatial importance of the input feature map, enabling more effective information processing and representation learning.
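For clarity, a compact sketch of such an efficient attention block, following the formulation of Shen et al. [26] in Equation (5); the single-head linear projections are a simplifying assumption.

```python
import torch
import torch.nn as nn

class EfficientAttention(nn.Module):
    """Efficient attention in the spirit of Shen et al. [26]: softmax-
    normalize Q and K separately, aggregate K^T V into a small global
    context matrix, then project it back through Q. Single-head sketch."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, N, D) token features
        q = self.q(x).softmax(dim=-1)      # rho_q: softmax over channels
        k = self.k(x).softmax(dim=1)       # rho_k: softmax over positions
        v = self.v(x)
        context = k.transpose(1, 2) @ v    # (B, D, D) global context
        return q @ context                 # (B, N, D), linear in N
```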
3.3. Contextual Attention Module
In order to enhance the embedding vector’s understanding of the entire image, we introduce a contextual attention (CA) module that incorporates a self-attention mechanism. The CA module, depicted in Figure 2, facilitates communication between pixels based on their similarities. These similarities are determined using three types of features: (1) high-level embeddings extracted from CNN features, (2) color features extracted directly from the raw image, and (3) global contextual features obtained from a Vision Transformer. To this end, we first fuse the CNN ($\hat{F}_b$) and Transformer ($F_{trans}$) features by employing a $1 \times 1$ convolution followed by a non-linear activation function. Then, to quantify the similarity between pixels, we take into account the relationship between the features and colors. Drawing inspiration from the self-attention mechanism in the Transformer model, we utilize the correlation matrix to calculate the similarity between the color and Vision Transformer (ViT) features:
$$ F_c = \mathrm{softmax}\!\left(\frac{F C^{\top}}{\sqrt{d}}\right) F \tag{6} $$
Here, $F$ represents the fused CNN/Transformer feature information, $C$ indicates the color features, and $F_c$ represents the combined color and network feature representation. This formulation allows us to capture the relationships between pixels based on color and global context. The overall process is also visualized in Figure 2. By integrating the CA module into our framework, we enable the embedding vectors to capture the contextual information of the entire image. The self-attention mechanism facilitates communication and interaction between pixels, promoting a more comprehensive understanding of the image structure.
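The following sketch illustrates one plausible realization of the CA module under the assumptions above (a $1 \times 1$ fusion convolution and a learned projection lifting RGB colors into the feature space); the projection layout and dimensions are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """CA module sketch: fuse CNN/Transformer features with a 1x1
    convolution, project raw pixel colors into the same space, and
    correlate the two as in Eq. (6)."""

    def __init__(self, cnn_ch, trans_ch, dim):
        super().__init__()
        self.dim = dim
        self.fuse = nn.Sequential(
            nn.Conv2d(cnn_ch + trans_ch, dim, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.color_proj = nn.Linear(3, dim)   # lift RGB into feature space

    def forward(self, f_cnn, f_trans, image):
        b, _, h, w = f_cnn.shape
        f = self.fuse(torch.cat([f_cnn, f_trans], dim=1))      # (B, D, H, W)
        f = f.flatten(2).transpose(1, 2)                       # (B, HW, D)
        c = self.color_proj(image.flatten(2).transpose(1, 2))  # (B, HW, D)
        # Correlation between fused features and pixel colors (Eq. 6);
        # at full resolution one would use the efficient form of Sec. 3.2.
        attn = torch.softmax(f @ c.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        f_c = attn @ f                                         # (B, HW, D)
        return f_c.transpose(1, 2).view(b, self.dim, h, w)
```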
3.4. Spatial Consistency
Self-supervised image clustering faces the challenge of effectively guiding the feature extractor to embed locally related pixels into the same cluster. To address this challenge, we incorporate an auxiliary spatial consistency loss, denoted as $\mathcal{L}_{sp}$, during the training process. The spatial consistency loss plays a crucial role in promoting the spatial coherence of clustered pixels by minimizing the feature discrepancy within local regions. Specifically, we apply a convolutional layer with a kernel size of $k \times k$ to the score map $S$, resulting in a localized representation denoted as $\tilde{S}$. By minimizing the L1 distance between the original score map $S$ and the localized version $\tilde{S}$, we encourage spatial consistency and reduce variations within adjacent areas. The $\mathcal{L}_{sp}$ can be expressed as follows:
$$ \mathcal{L}_{sp} = \sum_{(i,j)} \left| S_{(i,j)} - \tilde{S}_{(i,j)} \right| \tag{7} $$
Here, $S_{(i,j)}$ represents the score at position $(i,j)$ in the original score map $S$, and $\tilde{S}_{(i,j)}$ represents the corresponding score in the localized version $\tilde{S}$. The L1 distance measures the absolute difference between the corresponding elements of $S$ and $\tilde{S}$, summed over all positions in the score maps. By enforcing the spatial proximity of pixels within the same region, this loss encourages the feature extractor to capture local dependencies and preserve spatial relationships in the embedded feature space. Consequently, it enhances the ability of the network to group together visually similar pixels and effectively segment objects in dense prediction tasks.
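A minimal sketch of this loss follows, assuming the localized map is produced by a fixed $k \times k$ averaging kernel (a fixed-weight stand-in for the convolutional layer described above) and averaging rather than summing over positions for scale stability:

```python
import torch
import torch.nn.functional as F

def spatial_consistency_loss(scores: torch.Tensor, kernel_size: int = 3):
    """Sketch of Eq. (7): smooth the score map S with a k x k window to
    obtain the localized version, then take the L1 distance between the
    two. The fixed averaging kernel is an assumption."""
    s_local = F.avg_pool2d(scores, kernel_size, stride=1,
                           padding=kernel_size // 2)
    return (scores - s_local).abs().mean()
```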
3.5. Joint Objective
In our training process, we employ a final loss function consisting of two distinct terms:
$$ \mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{sp} \tag{8} $$
The first term represents the cross-entropy loss, which ensures confidence in the network’s predictions by comparing them with the maximum index. This term also enables the network to learn the distribution of each cluster. The second term enforces spatial consistency within each image region, reducing local variation and facilitating the smooth merging of neighboring clusters.
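Putting the pieces together, a single self-supervised training iteration under this joint objective might look as follows, reusing the loss sketches above; `model` is assumed to map an image batch to the per-pixel score map $S$.

```python
import torch

def train_step(model, image, optimizer):
    """One iteration of the joint objective (Eq. 8): cross-entropy on
    argmax pseudo-labels plus the spatial consistency term."""
    scores = model(image)                       # S: (B, K, H, W)
    loss = clustering_ce_loss(scores) + spatial_consistency_loss(scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```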