1. Introduction
The advancement of computer vision has led to a significant focus on medical image segmentation as a crucial area of research. The demand for computer-aided diagnosis (CAD) systems has increased, aiming to support healthcare professionals with accurate interpretations of medical images and to improve the efficiency of healthcare workflows. Automated image segmentation is crucial in the analysis of medical images, serving as a fundamental component of accurate and reliable diagnoses. Skin cancer, a highly prevalent and potentially life-threatening form of cancer, primarily affects the different layers of human skin, namely the dermis, epidermis [1], and hypodermis. These tissues collectively make up the structure of the skin and are involved in various physiological functions. Melanoma, which arises from the abnormal proliferation of melanocytes within the epidermis, represents the most severe form of skin cancer [2]. With a mortality rate of 1.62%, it is regarded as the deadliest variant among skin cancers. The timely and accurate detection of melanoma is crucial for effective treatment and improved patient outcomes. An estimated 96,480 new cases of melanoma were diagnosed in 2019, making it a significant public health concern [3,4]. According to the World Health Organization (WHO), non-melanoma skin cancers account for a considerable number of cases worldwide, with an estimated incidence of 2 to 3 million annually [3]. This highlights the substantial burden of non-melanoma skin cancers and emphasizes the need for effective prevention and treatment strategies to address this significant global health issue.
The early detection of melanoma is paramount for successful treatment outcomes, as studies have demonstrated that timely diagnosis can significantly improve the five-year relative survival rate, reaching up to 92% [3]. In this regard, the precise and early identification of skin lesions plays a crucial role in computerized systems aimed at the accurate diagnosis of skin cancer, as depicted in Figure 1. In the past decade, considerable attention has been devoted to the development of automated skin cancer detection methods covering various skin conditions, including the three primary forms of abnormal skin cell growth: basal cell carcinoma, squamous cell carcinoma, and melanoma. These efforts advance the field of dermatology and improve the effectiveness of early intervention strategies.
The accurate segmentation of skin cancer in medical images poses several challenges, including low contrast, texture variations, variations in positioning, color variations, and differences in lesion sizes. Moreover, the presence of artifacts, such as air bubbles, hair, dark borders, measurement marks, blood vessels, and uneven color illumination, further complicates the task of segmenting lesions. Together, these factors make the accurate delineation of skin cancer lesions a demanding task.
In this paper, we propose a novel self-supervised clustering network for skin lesion segmentation that improves the feature representation of each pixel and iteratively learns the optimal cluster size. The proposed approach utilizes a context-based consistency loss function, which helps effectively segment image regions with unclear boundaries and noise. This loss function enforces all pixels belonging to a cluster to be spatially close to the cluster center, leading to improved clustering performance. To evaluate the effectiveness of our proposed approach, we conducted experiments on a public skin lesion segmentation dataset, PH2, and compared our approach to other unsupervised and self-supervised clustering methods. The experimental results demonstrate that our proposed approach achieves state-of-the-art segmentation accuracy, indicating its potential for practical use in skin lesion segmentation tasks. Our contributions can be summarized as follows:
Hybrid CNN/Transformer Architecture: We introduce a unique architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers. This synergy enables our model to effectively capture both local and global contextual information, enhancing skin lesion boundary detection.
Precise Boundary Detection: Our method excels in accurately identifying skin lesion boundaries, even in scenarios with complex object background overlaps. This capability is crucial for ensuring precise diagnosis and treatment planning.
Self-Supervised Learning: Unlike traditional supervised methods requiring pixel-level annotations, our approach operates in a self-supervised manner. This significantly reduces the annotation burden, making it more practical and cost-effective.
Boundary and Contextual Attention Modules: We introduce innovative boundary and contextual attention modules that improve the accuracy of pixel clustering and subsequent segmentation. These modules enhance the model’s ability to capture fine-grained details.
State-of-the-Art Performance: Through comprehensive evaluations on skin lesion segmentation datasets, we demonstrate that our proposed method achieves state-of-the-art performance. Our model outperforms existing approaches in terms of both quantitative metrics and qualitative segmentation results.
2. Related Work
Supervised deep learning methods applied to images have demonstrated their potential in extracting image features to support various image analysis tasks. Many of the techniques that have shown success in logistics applications, such as image segmentation for parcel [5,6] and packaging recognition [7,8], have the potential to be applied in the medical image domain as well. These techniques rely on abundant labeled data, which is costly to obtain [9], yet they have demonstrated strong efficacy and promise [10,11]. However, the use of these methods for skin cancer segmentation faces challenges due to limited labeled data, costly and time-consuming manual delineation by imaging experts, variability among experts, and the complexity of the images, which can have diverse appearances based on factors like skin lesion type in dermoscopic images. To address these challenges, researchers have employed various techniques, such as transferring deep learning knowledge between diverse domains, fine-tuning models using limited labeled data (domain adaptation), unsupervised feature learning using algorithms like sparse coding [12] and auto-encoders [13], and self-supervised learning [14].
Self-supervised learning, a form of unsupervised learning, shows promise in computer vision and medical image segmentation tasks [15,16]. Notably, in self-supervised learning, the data itself generates supervisory signals for feature learning, offering a unique approach to address the scarcity of labeled data while extracting informative features [17]. For instance, Zhuang et al. [18] employed self-supervised learning techniques to train a Convolutional Neural Network for brain tumor segmentation using 3D CT scan images. By leveraging self-supervised learning, they were able to enhance the network’s performance without the need for manual annotations or labeled data. Likewise, Tajbakhsh et al. [19] adopted self-supervised learning methods to generate supervisory signals for lung lobe segmentation in CT scans. Their approach involved predicting color, rotation, and noise within the images. Through self-supervised learning, they effectively tackled the challenge of limited labeled data, resulting in improved accuracy in lung lobe segmentation tasks.
Another method for generating supervisory signals involves the use of image clustering [20,21]. Caron et al. [20] presented a technique known as DeepCluster, which offers an iterative approach to enhance both the feature representation and the clustering assignment of image features during the training of Convolutional Neural Networks (CNNs). In a similar vein, Ji et al. [21] introduced Invariant Information Clustering (IIC), a method that aims to improve clustering performance by maximizing the mutual information between image patches that have undergone spatial transformations. In the field of medical imaging, ref. [22] employed the k-means clustering technique to group pixels with similar characteristics in micro-CT images. By utilizing the cluster labels, they were able to learn the feature representation of these pixels. However, it is important to acknowledge that self-supervised clustering methods have certain limitations. These include the manual selection of the cluster size (e.g., determining the value of k in k-means) and difficulties associated with intricate regions that have fuzzy boundaries, diverse shapes, artifacts, and noise.
Karimi et al. [23] introduced a novel dual-branch transformer network to incorporate global contextual dependencies while preserving local information at various scales. Their self-supervised learning approach takes into account semantic dependencies between scales, generating a supervisory signal to ensure inter-scale consistency and a spatial stability loss within each scale for self-supervised content clustering. To further enhance the clustering process, the authors introduced an additional cross-entropy loss function on the clustering score map, enabling the effective modeling of each cluster distribution and improving the decision boundary between clusters. Through iterative training, the algorithm learns to assign pixels to semantically related clusters, resulting in the generation of a segmentation map.
In comparing the self-supervised approaches discussed above, it becomes evident that each method offers unique strengths and limitations. Self-supervised clustering methods, such as DeepCluster [20] and IIC [21], have shown their ability to enhance feature representation and clustering assignment iteratively during training. These approaches leverage inherent data structures to improve segmentation accuracy, yet they may struggle with intricate regions characterized by fuzzy boundaries and diverse shapes. On the other hand, the utilization of k-means clustering [22] for feature learning demonstrates simplicity and effectiveness in certain contexts while facing similar challenges with complex boundaries. Notably, the recent dual-branch transformer network of Karimi et al. [23] showcases a novel avenue for self-supervised content clustering. This approach combines global contextual dependencies and local information, resulting in inter-scale consistency and accurate spatial stability, thereby addressing the limitations of earlier techniques. While these methods show promise, further exploration is warranted to overcome the challenges posed by complex object-background overlap and noisy annotations in the context of skin lesion segmentation.
3. Proposed Method
The proposed method is illustrated in Figure 1, providing an overview of our approach. Initially, the input image $X \in \mathbb{R}^{H \times W \times D}$, where $H \times W$ indicates the spatial dimensions and $D$ the number of color channels, undergoes a two-pronged processing pipeline. The Convolutional Neural Network (CNN) component captures local feature descriptions, focusing on the extraction of fine-grained details. Simultaneously, the Transformer model operates on the input image, modeling global representations and long-range dependencies. This dual pathway enables the incorporation of both local semantic information and global contextual understanding. To efficiently combine local and global representations, as well as pixel-level information (RGB color), we propose a novel context attention strategy. The context attention module calculates the feature correlation matrix by leveraging the interdependencies between local and global features. Moreover, by conditioning the feature space using pixel color information, it incorporates prior knowledge to construct the normalized feature set $F_c$ for content clustering. In essence, this fusion process enables the integration of pixel prior information with local and global cues, resulting in a comprehensive representation that captures fine details and contextual information effectively.
Subsequently, the final segmentation map is created by selecting the channel-wise dimension with the highest response value in $F_c$, employing the argmax function. This step maps each pixel to its respective cluster label without the need to manually label each pixel, which in turn facilitates the generation of a segmentation map that assigns semantic labels to image regions. Training of the network parameters, denoted as $\theta = (\theta_{cnn}, \theta_{trans})$, is accomplished through the minimization of the cross-entropy loss. This loss function measures the discrepancy between the generated score map $S$ and the cluster labels $\hat{y}$ assigned to each pixel, encouraging the network to produce accurate and meaningful segmentations:
$$ \mathcal{L}_{ce} = -\sum_{i=1}^{H \times W} \sum_{k=1}^{K} \hat{y}_{i,k} \log S_{i,k}, \tag{1} $$
where $K$ denotes the number of clusters and $\hat{y}_{i,k}$ is the one-hot pseudo-label obtained from the argmax operation.
To further enhance the clustering assignments and capture spatial relationships within the image, we introduce two additional components. Firstly, the spatial consistency loss emphasizes the importance of spatial proximity between pixels, reinforcing the understanding of local connections. This loss encourages the network to group together visually similar pixels that are in close spatial proximity, promoting spatially coherent segmentations. Secondly, the object-level interaction module (modeled by boundary representation) enables the network to separate different clusters and encourages the merging of semantically similar clusters. By incorporating these components, the clustering assignments are refined, leading to improved segmentation results. The proposed architecture adopts an iterative learning process, allowing for the continuous improvement of both the feature representation and clustering assignment of each pixel. In the next subsections, we will elaborate on each part of the network in more detail.
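To make the clustering objective concrete, the following is a minimal PyTorch-style sketch of the pseudo-labeling step behind Equation (1); the function name and the (B, K, H, W) score-map layout are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def clustering_ce_loss(scores: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (1): the argmax over the cluster dimension of the
    score map S (shape (B, K, H, W)) serves as per-pixel pseudo-labels,
    and cross-entropy sharpens the network's own cluster assignments."""
    pseudo_labels = scores.argmax(dim=1)         # (B, H, W) cluster indices
    return F.cross_entropy(scores, pseudo_labels)
```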
3.1. Local Features
Our proposed method, outlined in Figure 1, incorporates two encoding streams to effectively capture both local and object-level contextual representations. In the first pathway, we employ a shallow CNN network to extract local representations. Given an input image $X$, our CNN encoder $E_{cnn}$ utilizes a series of convolutional blocks to model pixel-level contextual representations $F$. Next, we follow the same strategy suggested in [24] to compute the channel-wise feature normalization weights as follows:
$$ w = \sigma\left(W_2\,\delta\left(W_1\,\mathrm{GAP}(F)\right)\right) \tag{2} $$
Here, $\mathrm{GAP}$ represents the global average pooling operation applied to the CNN features ($F$), while $W_1$ and $W_2$ denote the learnable parameters. $\delta$ and $\sigma$ refer to the ReLU and Sigmoid activation functions, respectively. The normalized features, denoted as $\hat{F}$, are obtained by element-wise multiplication of the weights and the original features:
$$ \hat{F} = w \odot F \tag{3} $$
However, the inherent locality of the convolutional operation limits its ability to fully capture object-level interactions. To address this limitation and incorporate object-level representations, we introduce an additional component to learn the boundary heatmap. This is achieved through the application of a kernel convolutional operation, denoted as $\mathcal{K}_b$, which operates on the output of the CNN encoder $E_{cnn}$. The resulting boundary heatmap, denoted as $B$, serves as a surrogate signal for modeling regional contextual dependencies. By encoding the boundaries between regions, it facilitates the representation of object-level interactions and enables the network to capture spatial relationships and contextual dependencies at a larger scale. This integration of the boundary heatmap within the encoding process enhances the overall understanding of the image, contributing to more comprehensive and contextually informed representations. By combining the normalized CNN features with the boundary heatmap, our proposed method generates the local feature set $\hat{F}_b$, thereby capturing comprehensive contextual representations that encompass both pixel-level and object-level information. This enables the network to better model the intricate interactions and dependencies within the image, leading to improved performance in tasks requiring detailed contextual understanding.
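To illustrate this local pathway, here is a minimal PyTorch sketch combining a shallow encoder, the channel re-weighting of Equations (2) and (3), and a boundary head; the layer widths, reduction ratio, and single-channel boundary heatmap are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Shallow CNN branch: pixel-level features F, channel re-weighting
    (Eqs. 2-3), and a boundary heatmap head. Layer sizes are
    illustrative assumptions."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Channel weights: w = sigmoid(W2 · relu(W1 · GAP(F)))
        self.w1 = nn.Linear(feat_ch, feat_ch // 4)
        self.w2 = nn.Linear(feat_ch // 4, feat_ch)
        # 1x1 convolution producing a single-channel boundary heatmap B.
        self.boundary = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, x):
        f = self.encoder(x)                                 # F: (B, C, H, W)
        gap = f.mean(dim=(2, 3))                            # GAP(F): (B, C)
        w = torch.sigmoid(self.w2(torch.relu(self.w1(gap))))
        f_hat = f * w[:, :, None, None]                     # F_hat = w ⊙ F
        b = torch.sigmoid(self.boundary(f))                 # boundary heatmap B
        return f_hat, b
```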
3.2. Long-Range Dependency
The Vision Transformer (ViT) is a pioneering architecture that has gained significant attention in the field of computer vision [25]. It represents a departure from traditional Convolutional Neural Networks (CNNs) by relying on the principles of self-attention mechanisms commonly used in natural language processing tasks. In the context of image analysis, ViT divides an input image into fixed-size patches, which are then linearly embedded and processed using multi-head self-attention mechanisms. This approach enables ViT to capture long-range dependencies between image patches and model intricate relationships within the data. The resulting representations are subsequently fed through fully connected layers to perform classification or regression tasks. ViT’s unique ability to effectively capture global context information across image patches has led to its success in tasks such as image classification and object detection [25]. In our proposed self-supervised skin lesion segmentation method, we employ ViT as a foundational component within our dual-branch network, leveraging its capacity to capture both local and global contextual dependencies for the accurate clustering and segmentation of skin lesions. Hence, in the second path, we incorporate a Vision Transformer module to capture the global contextual representation. However, the standard self-attention mechanism, as expressed in Equation (4), poses computational challenges when dealing with high-resolution images due to its quadratic computational complexity ($\mathcal{O}(n^2)$):
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{4} $$
In this equation, $Q$, $K$, and $V$ represent the query, key, and value vectors, respectively, while $d$ denotes the embedding dimension.
Shen et al. [26] introduced the concept of efficient attention, which capitalizes on the observation that regular self-attention generates a redundant context matrix. To address this inefficiency, they devised an alternative approach for computing self-attention, as described in Equation (5), which streamlines the self-attention process and improves computational efficiency:
$$ E(Q, K, V) = \rho_q(Q)\left(\rho_k(K)^{\top} V\right), \tag{5} $$
where $\rho_q$ and $\rho_k$ denote the normalization functions for the queries and keys, respectively. Research conducted by Shen et al. [26] has demonstrated that efficient attention can yield equivalent results to dot-product attention by applying softmax normalization functions ($\rho_q$ and $\rho_k$). In the efficient attention mechanism, the normalization of keys and queries is performed prior to the multiplication of keys and values. The resulting global context vectors are then multiplied by the queries to generate the new representation. Unlike dot-product attention, which calculates pairwise similarities between data points, efficient attention represents the keys as attention maps ($d_k$ maps denoted as $k_j^{\top}$, with $j$ indicating position $j$ in the input feature). These global attention maps capture the semantic aspects of the entire input feature rather than focusing on positional similarities. This change in order significantly reduces the computational complexity of the attention mechanism while maintaining a high level of representational power. The memory complexity of efficient attention is determined by the term $\mathcal{O}(dn + d^2)$, reflecting the influence of the embedding dimension $d$ and the number of data points $n$. Additionally, the computational complexity is proportional to $\mathcal{O}(d^2 n)$ in scenarios where $d_v = d$ and $d_k = d/2$, a common configuration. In our proposed structure, we leverage an efficient attention block to construct our Transformer-based encoder module $E_{trans}$ to capture the spatial importance of the input feature map, enabling more effective information processing and representation learning.
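For clarity, a compact sketch of such an efficient attention block, following the formulation of Shen et al. [26] in Equation (5); the single-head linear projections are a simplifying assumption.

```python
import torch
import torch.nn as nn

class EfficientAttention(nn.Module):
    """Efficient attention in the spirit of Shen et al. [26]: softmax-
    normalize Q and K separately, aggregate K^T V into a small global
    context matrix, then project it back through Q. Single-head sketch."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, N, D) token features
        q = self.q(x).softmax(dim=-1)      # rho_q: softmax over channels
        k = self.k(x).softmax(dim=1)       # rho_k: softmax over positions
        v = self.v(x)
        context = k.transpose(1, 2) @ v    # (B, D, D) global context
        return q @ context                 # (B, N, D), linear in N
```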
3.3. Contextual Attention Module
In order to enhance the embedding vector’s understanding of the entire image, we introduce a contextual attention (CA) module that incorporates a self-attention mechanism. The CA module, depicted in Figure 2, facilitates communication between pixels based on their similarities. These similarities are determined using three types of features: (1) high-level embeddings extracted from CNN features, (2) color features extracted directly from the raw image, and (3) global contextual features obtained from a Vision Transformer. To this end, we first fuse the CNN ($\hat{F}_b$) and Transformer ($F_{trans}$) features by employing a $1 \times 1$ convolution followed by a non-linear activation function. Then, to quantify the similarity between pixels, we take into account the relationship between the features and colors. Drawing inspiration from the self-attention mechanism in the Transformer model, we utilize the correlation matrix to calculate the similarity between the color and Vision Transformer (ViT) features:
$$ F_c = \mathrm{softmax}\!\left(\frac{F C^{\top}}{\sqrt{d}}\right) F \tag{6} $$
Here, $F$ represents the fused CNN/Transformer feature information, $C$ indicates the color features, and $F_c$ represents the combined color and network feature representation. This formulation allows us to capture the relationships between pixels based on color and global context. The overall process is also visualized in Figure 2. By integrating the CA module into our framework, we enable the embedding vectors to capture the contextual information of the entire image. The self-attention mechanism facilitates communication and interaction between pixels, promoting a more comprehensive understanding of the image structure.
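The following sketch illustrates one plausible realization of the CA module under the assumptions above (a $1 \times 1$ fusion convolution and a learned projection lifting RGB colors into the feature space); the projection layout and dimensions are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """CA module sketch: fuse CNN/Transformer features with a 1x1
    convolution, project raw pixel colors into the same space, and
    correlate the two as in Eq. (6)."""

    def __init__(self, cnn_ch, trans_ch, dim):
        super().__init__()
        self.dim = dim
        self.fuse = nn.Sequential(
            nn.Conv2d(cnn_ch + trans_ch, dim, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.color_proj = nn.Linear(3, dim)   # lift RGB into feature space

    def forward(self, f_cnn, f_trans, image):
        b, _, h, w = f_cnn.shape
        f = self.fuse(torch.cat([f_cnn, f_trans], dim=1))      # (B, D, H, W)
        f = f.flatten(2).transpose(1, 2)                       # (B, HW, D)
        c = self.color_proj(image.flatten(2).transpose(1, 2))  # (B, HW, D)
        # Correlation between fused features and pixel colors (Eq. 6);
        # at full resolution one would use the efficient form of Sec. 3.2.
        attn = torch.softmax(f @ c.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        f_c = attn @ f                                         # (B, HW, D)
        return f_c.transpose(1, 2).view(b, self.dim, h, w)
```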
3.4. Spatial Consistency
Self-supervised image clustering faces the challenge of effectively guiding the feature extractor to embed locally related pixels into the same cluster. To address this challenge, we incorporate an auxiliary spatial consistency loss, denoted as $\mathcal{L}_{sp}$, during the training process. The spatial consistency loss plays a crucial role in promoting the spatial coherence of clustered pixels by minimizing the feature discrepancy within local regions. Specifically, we apply a convolutional layer with a kernel size of $k \times k$ to the score map $S$, resulting in a localized representation denoted as $\tilde{S}$. By minimizing the L1 distance between the original score map $S$ and the localized version $\tilde{S}$, we encourage spatial consistency and reduce variations within adjacent areas. The $\mathcal{L}_{sp}$ can be expressed as follows:
$$ \mathcal{L}_{sp} = \sum_{(i,j)} \left| S_{(i,j)} - \tilde{S}_{(i,j)} \right| \tag{7} $$
Here, $S_{(i,j)}$ represents the score at position $(i,j)$ in the original score map $S$, and $\tilde{S}_{(i,j)}$ represents the corresponding score in the localized version $\tilde{S}$. The L1 distance measures the absolute difference between the corresponding elements of $S$ and $\tilde{S}$, summed over all positions in the score maps. By enforcing the spatial proximity of pixels within the same region, this loss encourages the feature extractor to capture local dependencies and preserve spatial relationships in the embedded feature space. Consequently, it enhances the ability of the network to group together visually similar pixels and effectively segment objects in dense prediction tasks.
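A minimal sketch of this loss follows, assuming the localized map is produced by a fixed $k \times k$ averaging kernel (a fixed-weight stand-in for the convolutional layer described above) and averaging rather than summing over positions for scale stability:

```python
import torch
import torch.nn.functional as F

def spatial_consistency_loss(scores: torch.Tensor, kernel_size: int = 3):
    """Sketch of Eq. (7): smooth the score map S with a k x k window to
    obtain the localized version, then take the L1 distance between the
    two. The fixed averaging kernel is an assumption."""
    s_local = F.avg_pool2d(scores, kernel_size, stride=1,
                           padding=kernel_size // 2)
    return (scores - s_local).abs().mean()
```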
3.5. Joint Objective
In our training process, we employ a final loss function consisting of two distinct terms:
$$ \mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{sp} \tag{8} $$
The first term represents the cross-entropy loss, which ensures confidence in the network’s predictions by comparing them with the maximum index. This term also enables the network to learn the distribution of each cluster. The second term enforces spatial consistency within each image region, reducing local variation and facilitating the smooth merging of neighboring clusters.
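Putting the pieces together, a single self-supervised training iteration under this joint objective might look as follows, reusing the loss sketches above; `model` is assumed to map an image batch to the per-pixel score map $S$.

```python
import torch

def train_step(model, image, optimizer):
    """One iteration of the joint objective (Eq. 8): cross-entropy on
    argmax pseudo-labels plus the spatial consistency term."""
    scores = model(image)                       # S: (B, K, H, W)
    loss = clustering_ce_loss(scores) + spatial_consistency_loss(scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```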