Article

Skin Lesion Segmentation through Generative Adversarial Networks with Global and Local Semantic Feature Awareness

College of Artificial Intelligence, Taiyuan University of Technology, Jinzhong 030600, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3853; https://doi.org/10.3390/electronics13193853
Submission received: 18 July 2024 / Revised: 4 September 2024 / Accepted: 25 September 2024 / Published: 28 September 2024
(This article belongs to the Section Bioelectronics)

Abstract

The accurate segmentation of skin lesions plays an important role in the diagnosis and treatment of skin cancers. However, skin lesion areas are rich in details and local features, including the appearance, size, shape, texture, etc., which pose challenges for the accurate localization and segmentation of the target area. Unfortunately, the consecutive pooling and stride convolutional operations in existing convolutional neural network (CNN)-based solutions lead to the loss of some spatial information and thus constrain the accuracy of lesion region segmentation. In addition, using only the traditional loss function in CNN cannot ensure that the model is adequately trained. In this study, a generative adversarial network is proposed, with global and local semantic feature awareness (GLSFA-GAN) for skin lesion segmentation based on adversarial training. Specifically, in the generator, a multi-scale localized feature fusion module and an effective channel-attention module are designed to acquire the multi-scale local detailed information of the skin lesion area. In addition, a global context extraction module in the bottleneck between the encoder and decoder of the generator is used to capture more global semantic features and spatial information about the lesion. After that, we use an adversarial training strategy to make the discriminator discern the generated labels and the segmentation prediction maps, which assists the generator in yielding more accurate segmentation maps. Our proposed model was trained and validated on three public skin lesion challenge datasets involving the ISIC2017, ISIC2018, and HAM10000, and the experimental results confirm that our proposed method provides a superior segmentation performance and outperforms several comparative methods.

1. Introduction

Skin cancers, as one of the most prevalent forms of cancer today, were diagnosed in approximately 104,930 people in the United States in 2023 alone [1]. Specifically, skin cancer is a universal term used to describe any malignant lesion of the skin, and this grouping can be further divided into the two principal and most commonly occurring groups of tumors: keratinocyte cancers and melanomas [2]. Among them, melanomas are especially dangerous [3], and their occurrence frequency is closely related to the patient’s constitutive skin color and their geographical zone of origin. The incidence of skin cancer in white populations has been rising rapidly, with an estimated annual increase of approximately 3–7% over the past decades. Over the past 70 years, changes in outdoor activities and exposure to sunlight have been the main factors behind the increasing occurrence of melanoma [4]. However, in recent years, some research studies have shown that the rate of invasive melanoma is decreasing in younger populations, alongside increased protection of the skin and early detection either via self-assessment or clinical skin examinations [5]. If melanoma is accurately diagnosed and appropriately treated at an early clinical stage, the prognosis is good, and survival rates can be improved [6].
The clinical methods for early diagnosis involve doctors examining the skin region of interest. Firstly, the doctor observes the transformation that has taken place in the skin, such as in its appearance, shape, and color. Then, the diagnostic technique of employing a dermatoscope [7] may be introduced to allow the improved triage of lesions requiring biopsy compared with those assessed with the naked eye alone [8,9]. The use of the dermatoscope introduces further advances in the process of cancer diagnosis. This is because, in the examination process, the assessment undertaken by human eye alone is frequently subject to errors due to blurring, low contrast, and the illumination of the images, particularly if provided in a dataset. Figure 1 illustrates some challenging images for the dataset segmentation of skin lesions, such as hair (Figure 1a), bubble artifacts (Figure 1c), varied size (Figure 1b,e), irregular boundaries (Figure 1c,d), and low contrast between the skin lesions and the normal background (Figure 1b,c) [10]. Following these stages of visual assessment, a biopsy may be obtained via excision from the skin lesion and the results will be referred to a pathologist or dermatologist to confirm whether or not it is cancer. Therefore, the segmentation accuracy of lesions is intricately associated with the expertise and experience of the medical expert [11]. On the contrary, computer-aided diagnosis (CAD) affords an objective and quantitative evaluation of skin lesions compared to visual assessment. CAD results can be reproduced via the elimination of the inter- and intra-observer variability which are unavoidably present in the manual and visual examinations of dermatologists. The segmentation accuracy is extremely significant because the produced segmentation bias will greatly influence the subsequent processes of the diagnosis system [12]. Medical image segmentation algorithms can be employed more effectively to analyze and identify the development of lesions and disease severity, which is a fundamental application area for CAD. Furthermore, this system could also reduce the checking time for analyzing and providing a diagnosis from medical data. Therefore, the design of an automatic and accurate skin lesion segmentation method is extremely important [13].
The traditional segmentation methods for skin lesions require the manual design of features and image preprocessing, which is time-consuming and inefficient. Therefore, researchers have used deep learning methods, such as convolutional neural networks (CNNs) for skin lesion segmentation [14]. In medical images, CNNs can automatically learn the features that provide the most discrimination based on large datasets, which have a superior feature representation ability compared to traditional methods [15]. As a result, CNN-based methods are rapidly gaining a significant foothold in the field of medical image segmentation. U-Net is one of the well-known CNNs and has a U-shaped encoder–decoder structure. In this structure, the encoder is used to extract image features by gradually downsampling; meanwhile, the decoder is often applied to restore extracted features by upsampling to the size of the original image and yielding the final segmentation masks [16]. When the famous fully convolutional network (FCN) was proposed, it was trained in an end-to-end, pixel-to-pixel fashion on semantic segmentation, exceeding the state-of-the-art methods without employing any additional machinery [17]. Afterwards, U-Net was designed to modify and extend the FCN, so that U-Net could yield more precise segmentation results with very few training images [18]. Subsequently, various versions of segmentation networks based on U-Net have been proposed, including ResU-Net [19], U-Net++ [20], Attention U-Net [21], V-Net [22], and Unet3+ [23]. However, a common problem with U-Net and its variants is that the continuous pooling and spanning convolution required to learn increasingly abstract feature representations reduces the feature resolution. Although this operation is valuable for classification or object detection tasks, it is often ineffective in dense prediction tasks such as segmentation, which usually require detailed spatial feature information. Intuitively, maintaining a high-resolution feature map in the middle layer can improve the segmentation accuracy. However, it is not conducive to accelerating network training and reducing the difficulty of network optimization due to the increase in the feature map size. Therefore, a compromise needs to be found for accelerating training and maintaining high resolution [24].
Although a significant number of studies in the literature have been presented, essentially, they belong to supervised learning methods, which usually require thousands of pieces of training data to train a robust model, in particular, when the networks become deeper. Unfortunately, it is very challenging to gather abundant training data for medical image analysis because of the diversity of medical images and the high annotation cost [25]. In recent years, the generative adversarial network (GAN)-based unsupervised generation framework has been widely studied in medical image processing [26]. With their ability to mimic data distributions and synthesize images at yet unprecedented levels of realism, GANs have created new ways of bridging the gap between supervised learning and image generation. Specifically, GANs can overcome the scarcity of labeled data and class imbalance and are powerful in extracting semantically meaningful features from images that traditional pixel-wise losses fail to grasp [27]. The DCGAN architecture is fairly stable to train, which is a widely used and well-engineered convolutional GAN. Its architecture is carefully constructed utilizing leaky ReLU activations to avoid sparse gradients and designing a specific weight initialization to permit robust training [28]. The convolutional neural network-based generator used in the current GAN (Generative Adversarial Network) architecture gradually reduces the local spatial feature information as the number of convolutional layers increases during the feature extraction process, and the final extracted features focus more on the overall content and global semantic features of the image. Although the widely used skip-connection mechanism can alleviate the problem of local information loss to a certain extent, this connection method often adopts the direct stacking or simple addition of shallow and deep features, which lacks the fine screening and fusion of information, and thus may mix unnecessary noise or redundant information, which adversely affects the overall performance of the model.
To overcome these challenges, we propose a generative adversarial network with global and local semantic feature awareness (GLSFA-GAN) for skin lesion segmentation. Specifically, the GLSFA-GAN comprises a generator (GLSFA-Net) and a discriminator. In addition to the encoder and decoder included in the GLSFA-Net, a designed multi-scale localized feature fusion (MSFF) module and an introduced effective channel attention (ECA) module are utilized to learn local semantic features. The MSFF employs a spatial attention mechanism to dynamically adjust the feature weights of different regions based on the spatial location information in the image to better focus on the target lesion region. The output of the MSFF module then enters the ECA module to obtain the multi-scale localized feature information. In order to compensate for the possible loss of local, fine-grained details due to the continuous convolution and downsampling operations, the output local information features are concatenated with the features in the decoder upsampling process. In addition, a global context extraction module (GCEM) is used to extract more global semantic features and spatial context information. After that, an adversarial training strategy is used to make the discriminator discern the labels and the generated segmentation prediction maps, which then prompts the generator network to yield more accurate segmentation masks. It is worth noting that the discriminator is not used during the testing phase. Extensive experiments conducted on the ISIC2017, ISIC2018, and HAM10000 skin lesion datasets confirmed that our proposed segmentation method achieves superior performance and outperforms several competing methods.
In summary, the main contributions of this study are outlined as follows:
  • The multi-scale localized feature fusion module is designed in this work to fuse different scales of features. The generator acquires the multi-scale local detailed information of the skin lesion area through skip connections with local feature information modules, thus preserving the rich details and local features of the target area.
  • The encoder loses some of the information during the downsampling process due to the constant downsampling as well as the convolution operation. For this reason, this study proposes the global context extraction module to capture more global semantic features as well as spatial information, thereby enabling the segmentation network to achieve the accurate localization of the target region.
  • GLSFA-GAN involves a generator GLSFA-Net (the segmentation network) and a discriminator. An adversarial training strategy is used to make the discriminator discriminate between the generated labels as well as the segmentation prediction maps, prompting the generator to yield more accurate segmentation results.
The remainder of this work is outlined as follows:
Section 2 provides a brief review of some of the works related to the segmentation of skin lesions and generative adversarial networks. Section 3 presents the overall architecture of the proposed network. Section 4 provides the materials and experimental settings. The experimental results are shown in Section 5, and the discussion and conclusion of this work are provided in Section 6 and Section 7.

2. Related Work

2.1. Segmentation of Skin Lesion

Traditionally, medical image segmentation was mainly conducted based on either classical image processing methods or machine learning methods [29], for example, thresholding [30], active contours [31], region growth [32], clustering [33], support vector machines [34], etc. However, with datasets that are larger and more complex, it is difficult for these methods to produce satisfactory results. In contrast, deep learning combines feature extraction and task-specific decisions seamlessly into a universal framework to complete the classification task [29]. In recent years, the use of CNNs has promoted the development of diagnosis technology based on medical images [35]. For example, the CENet proposed by Gu et al. [24] utilized dense atrous convolution and residual multi-kernel pooling blocks, which could capture rich semantic feature information. In [36], CPFNet combined two pyramidal modules to fuse the global/multi-scale context information. MLP-CNN harvests the complementary results acquired from the CNN based on deep spatial feature representation and from the multi-layer perception (MLP) based on spectral discrimination [37]. Sushma et al. proposed an encoder–decoder-based U-shaped CNN variant with an attention aggregation-based pyramid feature clustering module to detect breast lesion regions [38]. The authors of [39] proposed a channel and space compound attention convolutional neural network, which employed a double squeeze-and-excitation block in the bottleneck layer to enhance the feature extraction and obtain more high-level semantic features. In this study, we propose the integration of a multi-scale localized feature fusion module with a channel attention module to capture the multi-scale local details of the lesion area. In addition, the global context extraction module is proposed in this study to extract rich feature representations and more comprehensive contextual information.

2.2. Generative Adversarial Networks

At the outset, Goodfellow et al. proposed generative adversarial networks (GANs) for assessing generative models via an adversarial training process, in which they simultaneously train a generative model that captures the data distribution, and a discriminative model that assesses the probability that the sample came from the training data rather than from the generator [40]. Employing GANs is another way of improving medical image segmentation and obtaining more specific results. At the moment, GANs are a hot research subject. GANs significantly increase the quality of medical image segmentation through their excellent synthesizing capacity and potential to extract and distribute data [35]. CycleGAN has been proposed with an objective that contains two functions: first, the use of adversarial losses for matching the distribution of generated images to the data distribution in the target domain and, second, the use of a cycle consistency loss to prevent the learned mappings from contradicting each other [41]. Although noise-based GANs, such as LAPGAN, DCGAN, and PGAN, can generate more diverse samples, they tend to suffer from lower quality. In addition, some image-to-image translation GANs, such as pix2pix and pix2pixHD, learn to produce new samples from a semantic segmentation map. However, they have much less freedom in the number of available samples [35]. GANs can also provide enhanced segmentation models since the adversarial training encourages high-order consistency in predicted segmentation through the implicit examination of the joint distribution of class labels and ground-truth segmentation masks [29]. Bi et al. built upon the GAN with a novel, stacked adversarial learning architecture so skin lesion features could be iteratively learned in a class-specific manner. The outputs from this method were then added to the existing FCN training data, thus increasing the overall feature diversity [42]. Han et al. proposed a 3D multi-conditional GAN to generate realistic/diverse 32 × 32 × 32 nodules placed naturally on lung computed tomography images to boost the sensitivity in 3D object detection, which adopted two discriminators for conditioning [43]. Bansal et al. proposed a novel deep learning-based Hexa-GAN model, which can significantly distinguish the hair area from other regions by applying a hexagonal convolution operation [44]. A novel dermatological color constancy generative adversarial network algorithm has been designed through formulating the color constancy task as an image-to-image translation problem [45]. The undifferentiated combination of information through simple skip connections may allow noise from lesion regions to be introduced into the decoder part, affecting the accurate classification of the pixels. To summarize, the issue of local feature information reduction when passing feature information needs to be addressed in the intra-generator transfer process, as well as dealing with how to integrate global and local semantic information to improve the accuracy and stability of the generator segmentation network.

3. Methods

3.1. Overall Architecture of the Proposed Model

The overall framework of the designed skin lesion segmentation algorithm containing adversarial training (called GLSFA-GAN) is shown in Figure 2. Specifically, GLSFA-GAN comprises a generator and a discriminator. In the generator, a multi-scale localized feature fusion module and an effective channel-attention module are proposed for extracting the local semantic features. In addition, a global context extraction module is used in the bottleneck of the encoder–decoder of the generator for capturing more global semantic features and spatial context information. The above three modules connect the encoder to the decoder to generate the segmentation prediction map. Subsequently, the discriminator is utilized to discern the labels as well as the segmentation prediction maps, which encourages the generator to yield more precise segmentation results.
In the proposed architecture, the generator captures the multi-scale local detailed information of the skin lesion area through skip connections with MSFF and ECA to preserve the rich details and local features of the target region. Meanwhile, the designed GCEM is utilized to acquire more global semantic features as well as spatial information to enable the segmentation network to achieve an accurate localization of the target region. Incorporating adversarial training creates constraints on the generator segmentation network, which in turn improves its capability. The GAN prompts the generator to learn richer feature representations of skin lesion images via the game process between the generator network and the discriminator partner, which assists in more accurately performing the skin lesion segmentation task.

3.2. Generator GLSFA-Net

In the current convolutional neural network-based generator, as the features of the input image are gradually extracted and the spatial resolution is reduced during the feature information transfer process, the local information is progressively lost while the global semantic information is extracted. In addition, the simple skip connection cannot capture the local detailed information at various scales and ignores the significance of the global feature information, which may then lead to the noise of the lesion region being introduced into the decoder part, affecting the accurate classification of pixels. Therefore, in order to improve the accuracy and stability of the generator segmentation network, it is necessary to solve the problem of the reduction in local feature information in the process of transmitting feature information as well as the problem of how to integrate the local information with the global information.
To address the above limitations, in this study, an MSFF module and an ECA module are proposed for capturing the local details at various scales in order to seize the multi-scale, local detailed information of the lesion area. Meanwhile, the GCEM is proposed for the extraction of rich feature representations and more comprehensive contextual information. Figure 3 shows the designed generator structure presented in this study. After the convolutional processing of the encoder, the feature information of the image is imported into the MSFF module, which is designed to capture the local details of the target region at different scales. Next, the focus of the segmentation network on the target lesion area can be further enhanced by delivering the features containing localized information to the ECA module. In order to compensate for the possible loss of local, fine-grained details due to the successive convolution and downsampling operations, the output local information features are concatenated with the features in the decoder upsampling process. In order to obtain multi-scale local detailed information, the feature information containing local details output from the ECA module needs to be propagated at varying scales. The multi-scale local feature details are fed into the GCEM together with the encoder output features to obtain rich feature representations and more comprehensive contextual information. Through the combination of the detailed information of the multi-scale local features from the encoder part, the global information of the lesion area is obtained, which in turn improves the performance of the segmentation network.

3.2.1. Multi-Scale Local Feature Fusion Module

The constraints on the shallow encoder feature extraction capability will gradually dilute the feature information of the target area. In order to be able to obtain effective multi-scale local, detailed information on the lesion area, in this study, we borrowed the idea of CPF-Net [36] and fused different scale features via the addition of a multi-scale, local feature fusion (MSFF) block. Using the spatial attention mechanism, suitable scale features are dynamically selected and fused using self-learning. After the convolutional processing of the encoder, the feature information of the image is imported into the MSFF module, alongside the simultaneous input of the feature information of the upper layer after the ECA module to the MSFF module, to obtain the local feature details at different scales. Since the top layer of the encoder has only one input when entering the MSFF module, this input is sampled at 1/2 of its original size and then fed into the MSFF block together with the original input. The detailed operational process of the MSFF module is displayed in Figure 4.
First, the features $F_A$ and $F_B$ of different scales are concatenated, and then a 3 × 3 convolution and a 5 × 5 convolution are performed on the concatenated features to generate the new features $F_A$ and $F_B$. Then, the spatial attention mechanism is utilized by applying different weight maps to the features of different receptive fields to more accurately capture the localized feature information. Specifically, $F_A$ and $F_B$ are spliced and a 1 × 1 convolution is performed to generate two pixel-level feature maps $MAP_A$ and $MAP_B$. Then, the pixel-wise attention maps $W_A$ and $W_B$ are generated by a softmax operator on the spatial-wise values, as follows:
$W_A = \frac{e^{MAP_A}}{e^{MAP_A} + e^{MAP_B}}$
$W_B = \frac{e^{MAP_B}}{e^{MAP_A} + e^{MAP_B}}$
The weighted attention maps are multiplied element-wise with the input features to obtain the fused feature map $F_{fusion}$, as follows:
$F_{fusion} = W_A \odot F_A + W_B \odot F_B$
The MSFF module employs the spatial attention mechanism to dynamically adjust the feature weights of different regions according to the spatial location information in the image to improve its attention to the target lesion area. In this process, the spatial attention mechanism does not directly remove or modify the local feature information in the original image; rather, it achieves a weighted combination of features through adjustments to the feature weights in various regions to obtain a better performance. The local feature fusion at diverse scales enables the model to retain the local feature information of the image during its processing and this can help the block improve its utilization of this local feature information for segmentation tasks.
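As an illustration of this fusion step, the following PyTorch sketch shows one way the weighted combination could be implemented; the module name MSFFSketch, the channel argument, and the assumption that both inputs have already been resampled to the same spatial size are ours and do not come from the authors' code.

```python
import torch
import torch.nn as nn


class MSFFSketch(nn.Module):
    """Minimal sketch of the multi-scale local feature fusion step (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        # Two parallel convolutions with different receptive fields (3x3 and 5x5).
        self.conv3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(2 * channels, channels, kernel_size=5, padding=2)
        # 1x1 convolution producing the two pixel-level maps MAP_A and MAP_B.
        self.score = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([feat_a, feat_b], dim=1)          # concatenate the two scales
        f_a = self.conv3(cat)                             # new feature from the 3x3 branch
        f_b = self.conv5(cat)                             # new feature from the 5x5 branch
        maps = self.score(torch.cat([f_a, f_b], dim=1))   # two pixel-wise score maps
        weights = torch.softmax(maps, dim=1)              # W_A, W_B (sum to 1 per pixel)
        w_a, w_b = weights[:, :1], weights[:, 1:]
        return w_a * f_a + w_b * f_b                      # F_fusion


# Example: fuse two 64-channel feature maps of the same spatial size.
msff = MSFFSketch(channels=64)
fused = msff(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```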

3.2.2. Efficient Channel Attention Module

Simple skip connections cannot capture the local details at diverse scales, and this may allow the noise in the lesion area to be introduced into the decoder part, affecting the accurate classification of the pixels. Since medical images usually have blurred boundaries, a single application of the attention mechanism may be ineffective in enhancing the features and eliminating all the noise in the small lesion areas. In this study, we continuously augmented the attention of the model to the lesion region through feeding the multi-scale local feature information obtained from the MSFF module into the ECA module to more effectively capture and utilize the local features of the skin lesion. In order to alleviate the loss of local fine-grained details that may be caused by the continuous convolution and downsampling operations, the output local information features need to be spliced with the decoder sampling process. In Figure 5, the operation of the ECA block is illustrated.
The ECA block first performs global average pooling (GAP) on the feature map $F_{fusion}$ obtained from the MSFF module. Then, a one-dimensional convolution operation is performed, where the kernel size $k$ represents the coverage of the local cross-channel interaction, i.e., the number of neighboring channels considered by each channel during the interaction. After this convolution operation, the weight of each channel of the input feature layer is obtained by mapping the channel values to the interval $(0, 1)$ via the sigmoid activation function. After obtaining these weights, the channel weights are then multiplied with the original input feature layer. In summary, after the global average pooling of the channels without dimensionality reduction, the ECA block achieves a cross-channel information interaction through the application of a one-dimensional convolution; meanwhile, the size $k$ of the convolution kernel is determined by an adaptive function.
The whole process is expressed in the following equation:
$F_{ECA} = \sigma(\mathrm{Conv}(\mathrm{GAP}(F_{fusion}))) \cdot F_{fusion}$
Regarding the choice of the convolution kernel size, the article [46] argues that the coverage of the interaction (i.e., the kernel size $k$ of the 1D convolution) is proportional to the channel dimension $C$, and $C$ is usually set to a power of 2. Thus, [46] relates $k$ and $C$ by extending the linear function $\gamma k - b$ to an exponential form:
$C = \phi(k) = 2^{(\gamma k - b)}$
So, given the channel dimension $C$ and setting both $\gamma$ and $b$ to 1, the convolution kernel size $k$ can be adaptively determined according to [46], as follows:
$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$
With the efficient channel attention mechanism, the model can adaptively weight the features of each channel; in regions where local features are more important, the attention mechanism can enhance the expression of these local features, leading to more accurate segmentation at the pixel level.
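Since the ECA operation follows the formulation of [46], a minimal sketch can be written directly from the equations above; the class name and the default $\gamma = b = 1$ (taken from the text) are the only choices made here, and this is not the authors' released implementation.

```python
import math

import torch
import torch.nn as nn


class ECASketch(nn.Module):
    """Sketch of efficient channel attention with an adaptively sized 1D kernel."""

    def __init__(self, channels: int, gamma: int = 1, b: int = 1):
        super().__init__()
        # k = |log2(C)/gamma + b/gamma| rounded to the nearest odd value.
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)                 # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                                    # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))          # 1D conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))     # channel weights in (0, 1)
        return x * y.expand_as(x)                               # reweight F_fusion channel-wise


reweighted = ECASketch(channels=64)(torch.randn(1, 64, 128, 128))
```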

3.2.3. Global Context Information Extraction Module

The encoder loses some of the information during the downsampling process due to the constant downsampling as well as the convolution operation. For this reason, in this paper, the GCEM was designed.
The GCEM is illustrated in Figure 6. Through the combination of multi-scale dilated convolution [47] as well as sampling and aggregating the feature map at different spatial scales, the input feature $X_0$ is fed into distinct branches in the first stage, each of which reduces the number of feature channels with a 1 × 1 convolution. Subsequently, a set of 1 × n and n × 1 asymmetric convolutions is added to substitute the n × n convolution layer, and in the latter half of this stage, the dilated convolution operation is adopted to effectively enlarge the receptive field. Through the further combination of dilated convolutions with different expansion rates, the model effectively extracts rich feature expressions so that its localization of the target lesion area can be enhanced.
For the new feature $X_1$ generated after the first stage, the contextual information is extracted in the second stage using four pooling layers of different sizes. In order to reduce the amount of computation and the weight dimensions, the number of channels of the four output features is reduced using 1 × 1 convolutions, and, according to the size of the feature maps, the low-dimensional feature maps are upsampled using bilinear interpolation for output. Multiple effective fields of view are utilized to recognize skin lesion areas of diverse sizes. Finally, through combination with the multi-scale local detailed information from the encoder part, the global information of the lesion region is obtained, so as to realize the accurate localization of the lesion target region. This in turn promotes the segmentation capability of the network.
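A simplified sketch of such a two-stage module is given below; the number of branches, the dilation rates, the pooling sizes, and the channel reduction factor are all assumptions for illustration, not the exact configuration of the GCEM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCEMSketch(nn.Module):
    """Simplified sketch of a global context extraction module (assumed configuration)."""

    def __init__(self, channels: int, dilations=(1, 2, 4), pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        mid = channels // 4
        # Stage 1: per-branch 1x1 reduction, asymmetric 1xn / nx1 convs, then a dilated conv.
        self.branches = nn.ModuleList()
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),
                nn.Conv2d(mid, mid, kernel_size=(1, 3), padding=(0, 1)),
                nn.Conv2d(mid, mid, kernel_size=(3, 1), padding=(1, 0)),
                nn.Conv2d(mid, mid, kernel_size=3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            ))
        self.fuse1 = nn.Conv2d(mid * len(dilations), channels, kernel_size=1)
        # Stage 2: pyramid pooling at several spatial scales, each reduced to 1 channel.
        self.pool_sizes = pool_sizes
        self.pool_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in pool_sizes]
        )
        self.fuse2 = nn.Conv2d(channels + len(pool_sizes), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        x1 = self.fuse1(torch.cat([b(x) for b in self.branches], dim=1))  # stage-1 feature X1
        pooled = [
            F.interpolate(conv(F.adaptive_avg_pool2d(x1, s)), size=(h, w),
                          mode="bilinear", align_corners=False)
            for conv, s in zip(self.pool_convs, self.pool_sizes)
        ]
        return self.fuse2(torch.cat([x1, *pooled], dim=1))                # global context feature


out = GCEMSketch(channels=512)(torch.randn(1, 512, 32, 32))
```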

3.3. Discriminator Module

The predicted segmentation masks produced by the generator segmentation network are fed into the discriminator together with the ground truth labels. The structure of the discriminator is shown in Figure 7; it consists of five convolutional layers, each with a 3 × 3 convolutional kernel, a stride of 2, and a ReLU activation function, and the numbers of channels in the five layers are 32, 64, 128, 256, and 512, respectively. Finally, the input features are vectorized through the fully connected layer, and the sigmoid activation function outputs the probability that the input is a real label rather than a predicted segmentation map, thus determining the authenticity or falsehood of the input.
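The layer configuration described above can be sketched as follows; whether the discriminator receives only the mask or the mask concatenated with the image is not specified, so a single-channel mask input and a 512 × 512 input size are assumed here.

```python
import torch
import torch.nn as nn


class DiscriminatorSketch(nn.Module):
    """Sketch of the five-layer discriminator described in the text (assumed 512x512 input)."""

    def __init__(self, in_channels: int = 1):
        super().__init__()
        layers, prev = [], in_channels
        for ch in (32, 64, 128, 256, 512):            # five stride-2 convolutional layers
            layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            prev = ch
        self.features = nn.Sequential(*layers)
        # After five stride-2 layers, a 512x512 mask becomes a 16x16 feature map.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 16 * 16, 1),
            nn.Sigmoid(),                             # probability that the input is a real label
        )

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(mask))


real_or_fake = DiscriminatorSketch()(torch.randn(2, 1, 512, 512))  # shape (2, 1)
```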
For generative adversarial networks, the generator is responsible for generating the segmentation results; meanwhile, the discriminator subsequently evaluates the veracity of those segmentation results. The existence of the discriminator is actually a further restriction on the generator (segmentation network), and it acts as a guide for the generator to produce more accurate segmented images. Through the adversarial training between the generator and the discriminator, the generator can gradually produce more accurate segmentation images, and the discriminator can successively improve its ability to distinguish labels and predict segmentation images. This process of competition and collaboration allows the model to learn the complex distribution of real data and generate high-quality segmentation results. It is worth noting that the discriminator is not employed during the testing phase.

3.4. The Loss Function

This subsection will outline the loss function for our designed model. Specifically, the total loss function consists of the generator loss and the adversarial loss.
For the generator loss, we used two losses to train the network. First, let $x_i$ be the original image and $x_t$ the real label. For the generator, the binary cross entropy (BCE) loss is used, as it represents the process of generating segmentation masks, based on the following formula:
$L_{BCE} = -\mathbb{E}_{x_i, x_t}\left[ x_t \log\left(G_{\theta_G}(x_i)\right) + \left(1 - x_t\right) \log\left(1 - G_{\theta_G}(x_i)\right) \right].$
Second, in order to strengthen the boundary accuracy of the segmentation results, in this study, a boundary loss is introduced for the boundary region of the generator segmentation network; therefore, when calculating this loss, only the boundary region pixels are considered, and they are weighted according to their gradient in the real segmentation image. This allows the model to pay more attention to the accuracy of the segmentation boundary region, penalizing errors occurring at the segmentation boundary and thus improving the accuracy of the segmentation results. The boundary region is defined by the boundary gradient of the label, $\nabla x_t$, and the objective function is shown below:
$L_{boundary} = -\mathbb{E}_{x_i, x_t}\left[ x_t \log\left(G_{\theta_G}(x_i)\right) \nabla x_t \right].$
For the adversarial training, the loss of the generator G is integrated with that of the discriminator D. Through the min-max game of the adversarial network during training, the interaction between the generator and the discriminator makes the generated results progressively closer to the real samples. The input to the discriminative branch consists of the generated segmentation mask and the real ground truth, with the following objective function:
$L_D(\theta_G, \theta_D) = \mathbb{E}_{x_t \sim P_{data}(x_t)}\left[\log D_{\theta_D}(x_t)\right] + \mathbb{E}_{x_i \sim P_{data}(x_i)}\left[\log\left(1 - D_{\theta_D}\left(G_{\theta_G}(x_i)\right)\right)\right].$
Through the incorporation of the segmentation loss into the loss of the GAN, the segmentation performance of the generator can be enhanced. Therefore, the total loss of the final GAN model is the sum of the loss functions provided above, in the following form:
$L_{GAN}(\theta_G, \theta_D) = L_{BCE}(\theta_G) + L_{boundary}(\theta_G) + L_D(\theta_G, \theta_D).$
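For illustration, the loss terms above can be sketched in PyTorch as follows; how the boundary weight map is obtained from the label gradient is an assumption (e.g., a precomputed gradient-magnitude map), and these functions are a sketch of the objectives rather than the authors' training code.

```python
import torch
import torch.nn.functional as F


def generator_loss(pred, target, boundary_weight):
    """Sketch of the generator objective: BCE plus a boundary-weighted log term.

    pred is the sigmoid output of the generator, target the ground-truth mask, and
    boundary_weight a map derived from the gradient of the label (how it is computed,
    e.g. with a Sobel filter, is our assumption).
    """
    eps = 1e-7
    l_bce = F.binary_cross_entropy(pred, target)
    l_boundary = -(boundary_weight * target * torch.log(pred + eps)).mean()
    return l_bce + l_boundary


def discriminator_loss(d_real, d_fake):
    """Adversarial term: log D(x_t) + log(1 - D(G(x_i))), negated for minimization."""
    eps = 1e-7
    return -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()


# Toy example with random tensors standing in for a prediction, its label, and a gradient map.
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
grad_weight = torch.rand(2, 1, 64, 64)
total_g = generator_loss(pred, target, grad_weight)
total_d = discriminator_loss(torch.rand(2, 1), torch.rand(2, 1))
```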

4. Datasets and Implemented Setting

4.1. Dataset Descriptions

In our study, three publicly available skin lesion datasets, ISIC 2017, ISIC 2018, and HAM10000, were used to validate our proposed model and experimental analyses.
ISIC 2017: The images in the ISIC 2017 dataset cover a wide range of types of skin lesion, including melanoma, seborrheic keratosis, nevus, etc. There are a total of 2750 images in RGB format in the ISIC 2017 dataset, of which 2000 are used for training (18.7% for melanomas, 12.7% for seborrheic keratoses, 68.6% for nevi), with sizes ranging from 540 × 722 to 4499 × 6748 pixels. In this study, the original training set was redivided into training, validation, and test sets in the ratio of 8:1:1, thus containing 1600, 196, and 204 skin lesion images, respectively.
ISIC 2018: For the first time, the ISIC 2018 dataset provides a separate dataset for the task of segmenting the marginal regions of skin diseases, which has become a major benchmark for evaluating segmentation algorithms on skin lesion images. The ISIC 2018 dataset contains 2594 images of skin lesions, with 8% seborrheic keratoses, 20% melanomas, and 72% nevi, in RGB format. The resolution of this dataset ranges from 540 × 576 to 4499 × 6748 pixels. In this study, the original training set was split into three sets, new training, validation, and testing sets, in the ratio of 8:1:1, containing 2078, 260, and 256 images, respectively.
HAM10000: The HAM10000 dataset is a large-scale dataset collection of human dermoscopic images, mostly used for the detection and categorization of skin cancers. This dataset contains 10015 dermatoscope images with a size of 450 × 600 pixels. The dermatoscopic images in the dataset encompass seven different skin lesion types.
The ISIC2017, ISIC2018, and HAM10000 datasets play a key role as the three important datasets in the field of skin image analysis. These three datasets provide researchers with rich data resources, thus facilitating the research and development of diagnostic algorithms for skin lesions. In utilizing these datasets, researchers are able to perform a variety of skin image analysis tasks, including lesion detection and lesion classification, which provide powerful aids for clinical medical diagnosis. In this study, these analytical tasks are applied to the field of lesion segmentation.
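As a small illustration of the 8:1:1 protocol used for ISIC 2017 and ISIC 2018, the following sketch splits a list of image identifiers; the shuffling and the fixed seed are assumptions, since the paper only states the ratio.

```python
import random


def split_8_1_1(image_ids, seed=0):
    """Split a list of image ids into training/validation/test subsets in an 8:1:1 ratio."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle (assumed)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Example: 2000 ISIC 2017 training images -> 1600 / 200 / 200 identifiers.
train_ids, val_ids, test_ids = split_8_1_1([f"img_{i:04d}" for i in range(2000)])
```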

4.2. Implemented Settings

All of the experiments were conducted on a Linux server with an NVIDIA V100 Tensor Core GPU, an 8-core CPU, and 16 GB RAM on the PyTorch 1.10.2 platform. In this study, we used SGD as the optimizer, trained for 100 epochs with a batch size of 2, and set the learning rate to 0.001 for training all of the competing methods. The images of the ISIC 2017, ISIC 2018, and HAM10000 datasets were resized to 512 × 512 for training. The experiments were based on the same training, validation, and testing datasets for all methods. To guarantee the fairness of the experiments, a ratio of 8:1:1 was utilized to split each dataset into training, validation, and testing sets for the comparison experiments. In addition, the proposed model in this study employed early stopping to avoid over-fitting.
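The training configuration stated above can be summarized in a short sketch; the generator and discriminator below are trivial stand-ins for the real networks, and the early-stopping patience value is an assumption, since only the hyper-parameters mentioned in the text (SGD, learning rate 0.001, batch size 2, 100 epochs, 512 × 512 inputs) are taken from the paper.

```python
import torch
import torch.nn as nn

# Stand-in modules only; the real GLSFA-Net generator and the discriminator of
# Section 3 would replace them.
generator = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1), nn.Sigmoid())
discriminator = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1), nn.Sigmoid(),
)

# Hyper-parameters from the text: SGD optimizer, learning rate 0.001, 100 epochs,
# batch size 2, 512 x 512 inputs.
g_opt = torch.optim.SGD(generator.parameters(), lr=1e-3)
d_opt = torch.optim.SGD(discriminator.parameters(), lr=1e-3)
num_epochs, batch_size, image_size = 100, 2, 512

# Early stopping bookkeeping; the patience value of 10 epochs is assumed.
best_val_miou, patience, epochs_without_improvement = 0.0, 10, 0
```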

4.3. Evaluation Criterion

In this work, the mean intersection over union (mIoU), accuracy, recall, precision, and specificity were employed to assess the segmentation performance of our proposed model and the competing models.
The mIoU metric is the main evaluation index in image segmentation for the quantization of the consistency between the ground truth and the model predictions. A higher mIoU metric represents a more accurate segmentation of skin lesions, and its mathematical representation is as follows:
$mIoU = \frac{TP}{TP + FP + FN}.$
Accuracy (ACC) assesses the accuracy of the model’s categorization at the pixel level. The ACC is formulated as the ratio of the correctly classified numbers of pixels to the total number of image pixels. The ACC is utilized as a universal metric for estimating the performance of segmentation and its mathematical representation is as follows:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$
Precision denotes the percentage of properly classified skin lesion areas out of the whole lesion area during segmentation. Generally, the higher the precision is, the more precise the model is in forecasting positive regions. Its mathematical representation is provided as follows:
$Precision = \frac{TP}{TP + FP}.$
Recall indicates the percentage of skin lesions correctly identified during segmentation, and its mathematical representation is as follows:
$Recall = \frac{TP}{TP + FN}.$
Specificity indicates the percentage of normal skin or non-diseased areas correctly identified during segmentation. Its mathematical representation is as follows:
$Specificity = \frac{TN}{TN + FP}.$
In the above metrics, TP (true positive) means the number of properly segmented skin lesion pixels, FP (false positive) is the number of background pixels incorrectly predicted as skin lesion pixels, TN (true negative) represents the number of properly segmented background pixels, and FN (false negative) indicates the number of incorrectly predicted skin lesion pixels as background pixels.
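Given binary prediction and ground-truth masks, the five metrics can be computed directly from the pixel counts defined above; the sketch below reports per-image values, and averaging the IoU over the test images to obtain the mIoU is an assumption about the evaluation protocol.

```python
import numpy as np


def segmentation_metrics(pred_mask: np.ndarray, true_mask: np.ndarray) -> dict:
    """Compute the pixel-level metrics used in the paper from two binary masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    tp = np.logical_and(pred, true).sum()       # lesion pixels predicted as lesion
    fp = np.logical_and(pred, ~true).sum()      # background pixels predicted as lesion
    fn = np.logical_and(~pred, true).sum()      # lesion pixels predicted as background
    tn = np.logical_and(~pred, ~true).sum()     # background pixels predicted as background
    eps = 1e-7                                  # avoid division by zero
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
        "Recall": tp / (tp + fn + eps),
        "Specificity": tn / (tn + fp + eps),
    }


# Example: compare a full-foreground prediction with a diagonal ground-truth mask.
print(segmentation_metrics(np.ones((4, 4)), np.eye(4)))
```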

5. Experimental Results

In this section, we conduct quantitative and qualitative comparisons of the proposed model with other state-of-the-art methods. The experimental results of the comparison are rendered as an average result over several rounds of testing.

5.1. Comparison to the State-of-the-Art Models

To better illustrate the advantages of the proposed method, we compared it with the following popular medical image segmentation methods, including four methods based on conventional convolutional neural networks and four methods designed for skin segmentation. We briefly introduce these models here, and the details can be found in the references.
The four widely used CNN-based segmentation methods are U-Net [18], U-Net++ [20], Att U-Net [21], and CE-Net [24]. U-Net [18] is considered a baseline model for biomedical image segmentation tasks, and it employs an encoder–decoder architecture that combines the feature maps in the encoder with the feature maps in the decoder through skip connections to retain more spatial information and improve localization accuracy. U-Net++ [20] is an enhanced version of the U-Net architecture, known for its iterative feature aggregation and enhanced skip connections. Attention U-Net (Att U-Net) [21] utilizes a novel attention gate module to highlight significant features to improve medical image segmentation performance. The context encoder network (CE-Net) [24] designs dense atrous convolutional modules and residual multi-kernel pooling blocks to extract contextual information.
The four methods for skin segmentation are CPF Net [36], Double U-Net [48], DAGAN [25], and DCGAN [49]. CPF Net [36] and Double U-Net [48] are two CNN-based methods for the segmentation of skin lesion images. The context pyramid fusion network (CPF Net) [36] is an FCN based on encoder–decoder architecture that is embedded with global pyramid guidance and scale-aware pyramid fusion modules for application in skin lesion segmentation tasks. Double U-Net [48] uses a combination of two U-Net network structures stacked on top of each other to improve the U-Net performance in segmentation tasks. DAGAN [25] and DCGAN [49] are two GAN-based methods for segmenting skin lesion images. The dual adversarial generative adversarial network (DAGAN) [25] is a proposed GAN model that utilizes dense dilated convolution to construct the generator and introduces a dual discriminator to improve the recognition. The deep convolutional generative adversarial network (DCGAN) [49] is a GAN that replaces the deterministic spatial pooling operator function with stepwise convolution.
The following subsections provide the results of our detailed comparison experiments.

5.1.1. Qualitative Visual Comparison

The proposed GLSFA-GAN is compared with superior models, including convolutional neural networks as well as generative adversarial networks, as shown in Figure 8. In this study, the visual comparison results are qualitatively illustrated for several state-of-the-art models based on the ISIC 2017, ISIC 2018, and HAM10000 datasets. The first two columns of the figure provide the input images and the corresponding labels of the lesion regions; the following columns present the respective segmentation results of four state-of-the-art convolutional neural network-based image segmentation methods, U-Net [18], Att U-Net [21], CE-Net [24], and CPF Net [36], as well as one adversarial training-based skin lesion segmentation method, DAGAN [25]; and the last column displays the segmentation results of the model proposed in this study.
The majority of the state-of-the-art methods often fail to achieve accurate segmentation when faced with the pixels of lesion areas either with large size differences or containing complex noise effects, thus making it impossible to effectively segment the lesion portion of the target region. As shown in Figure 8, the images in the fourth and fifth rows have blurring of the boundaries, and GLSFA-GAN achieves the accurate localization of the lesion region when compared to other competing models. We can see from the third and fifth rows that the images have background noise, such as the black edge effect and hair occlusion, and our model can still accurately segment the lesion regions. Furthermore, our model demonstrates superior performance in handling edge details, resulting in smooth and continuous edges devoid of jaggedness or irregularity. As illustrated in the first, fourth, fifth, and sixth rows of Figure 8, irrespective of the lesion size, U-Net, Att U-Net, and CPF Net tend to produce fragmented or incomplete edges. In contrast, our method demonstrates the capacity to maintain coherent and accurate segmentation. This is because the generator in the GLSFA-GAN model is designed to extract and aggregate both local and global information from the image. The MSFF module is responsible for capturing the multi-scale local detailed information of the lesion region, which solves the problem of weakening the encoder’s partial feature extraction. At the same time, the GCEM effectively extracts rich feature expressions and more comprehensive contextual information through the combination of multi-scale dilated convolution as well as sampling and aggregating the feature maps at different spatial scales. With the combination of the multi-scale local detailed information in the encoder part, the global information of the lesion region is obtained, which then facilitates the generator network to yield more accurate masks.
Overall, the visualization of the segmentation results illustrates the efficacy and superiority of the proposed GLSFA-GAN model for the segmentation of skin lesions. The GLSFA-GAN model achieves competing segmentation results when compared to the other medical image segmentation methods, even in the presence of lesion target regions with large size differences or a lot of noise effects in the input image.

5.1.2. Performance Comparison of the ISIC 2017 Dataset

Table 1 shows the results of a quantitative analysis comparison with other competing methods using the ISIC 2017 dataset. GLSFA-GAN improves the mIoU and accuracy by 5.96% and 3.07%, respectively, compared to the base segmentation network U-Net used in this work. Our model was also compared with DAGAN, an advanced generative adversarial network segmentation model containing a two-branch discriminator, and it improved the respective five evaluation metrics by 2.74%, 1.63%, 0.64%, 1.67%, and 0.71%. In addition, the proposed model improves the main metrics of mIoU and accuracy by up to 2.65% and 1.36%, respectively, when compared to CPF Net, an advanced convolutional neural network with higher evaluation metrics.
Compared with other state-of-the-art models, the model proposed in this study improves on all the metrics, achieving values of 79.87%, 95.14%, 90.37%, 86.21%, and 97.06% for the mIoU, accuracy, precision, recall, and specificity, respectively. These significant evaluation metrics are attributed to the network structure designed in this study, in which the multi-scale local feature fusion (MSFF) module captures the multi-scale local detailed information of the lesion region by capturing local details at different scales, effectively solving the problem of feature extraction gradually weakening in the encoder stage and thus improving the accuracy of the segmentation network.

5.1.3. Performance Comparison on the ISIC 2018 Dataset

Table 2 shows the results of a quantitative analysis comparison with the other state-of-the-art methods using the ISIC 2018 dataset. Compared with the advanced medical image segmentation network U-Net++, the proposed model improves the respective five evaluation metrics by 5.56%, 1.37%, 2.04%, 0.90%, and 0.41%. Compared with the advanced medical image segmentation model Att U-Net, the main metrics of mIoU and accuracy are improved by 4.87% and 1.40%, respectively. Our model was also compared with DAGAN, an advanced generative adversarial network segmentation model containing a two-branch discriminator, and it improved the mIoU, accuracy, and precision metrics by 2.82%, 0.11%, and 0.25%, respectively. In addition, compared with CPF Net, an advanced convolutional neural network with higher evaluation metrics, the model proposed in this study improves the main metrics of mIoU and accuracy by 3.74% and 0.71%, respectively.
Compared with the other state-of-the-art models, the model presented in this study reached the optimum in the mIoU, accuracy, and precision metrics, which were 86.79%, 96.84%, and 91.97%, respectively. Meanwhile, the recall and specificity metrics reached 91.56% and 97.63%, respectively, which are very competitive with other competing models. In this study, the GCEM effectively extracts rich feature representations as well as more comprehensive contextual information through the combination of multi-scale dilated convolution as well as the sampling and aggregation of feature maps at different spatial scales. Combined with the multi-scale local detailed information from the encoder part, the global information of the lesion region is obtained, which facilitates the segmentation network in generating more accurate segmentation prediction maps.

5.1.4. Performance Comparison on HAM10000 Dataset

As shown in Table 3, our proposed method was compared with other superior CNN-based methods, such as U-Net, U-Net++, Att U-Net, Double U-Net, and CPF Net, as well as with the GAN-based model DCGAN. We achieved significant improvements in mIoU, precision, accuracy, recall, and specificity, with improvements of 2.86%, 1.40%, 1.44%, 1.78%, and 1.24%, respectively, when compared to the commonly used medical image segmentation network U-Net++. In the official evaluation metric mIoU, our proposed model improved by 2.17% over Att U-Net. When compared to the GAN model DCGAN, our mIoU metric improved by 9.74%. Additionally, our model reached 97.11% specificity and 95.79% accuracy, outperforming the other competing methods. Although the recall of our model was 90.25%, which is lower than that of U-Net and CPF Net, it was still competitive with the other competing models.

5.2. Ablation Research

An extended ablation study regarding the methodology of this study was conducted and is presented in this section. The step-by-step ablation experiments using the ISIC 2018 dataset were conducted to evaluate the importance of the multi-scale local feature fusion module MSFF, the efficient channel attention module ECA, the global context extraction module GCEM, and the adversarial training module.
Table 4 presents the quantitative results of the segmentation performance analysis for each module. Model I is the baseline network, with U-Net used as the segmentation network in this study. The ablative variant Model II introduces the multi-scale local feature fusion module MSFF on the basis of the baseline network, and the change in the metrics shows that it outperforms the baseline network on all the evaluated metrics. Since the MSFF module can capture multi-scale local detailed information about the lesion area, these data demonstrate the effectiveness of the MSFF module in enhancing the performance of the segmentation network. Adding the efficient channel attention module ECA to the ablation variant Model III on the basis of Model II resulted in improved values for mIoU, accuracy, precision, recall, and specificity. The ECA module is introduced to continuously enhance the model’s attention to the lesion region and to more effectively capture and utilize the local features of the lesion area, improving the segmentation capability of the generator segmentation network. The ablation variant Model IV integrates the global context information extraction module on the basis of Model II, which significantly enhances the evaluation indexes. Specifically, the mIoU, accuracy, precision, recall, and specificity values are improved by 2.84%, 0.82%, 1.33%, 0.50%, and 1.11%, respectively. This demonstrates that the global context extraction module is very effective in enhancing the performance of segmentation networks through the combination of multi-scale dilated convolution as well as the sampling and aggregation of feature maps at different spatial scales. In addition, when the Model IV-based variant was combined again with the ECA module, enhancements were also achieved in various metrics. Notably, the best performance is achieved through the combination of all the modules, which improves the mIoU, accuracy, precision, recall, and specificity by 6.54%, 1.63%, 3.64%, 0.94%, and 1.28%, respectively, compared to the baseline network.

6. Discussion

In this study, we propose an adversarial training-based generative adversarial network model called GLSFA-GAN for the task of skin lesion image segmentation. In the experimental part of this paper, we demonstrate the superiority of GLSFA-GAN over mainstream methods on three publicly available datasets.
Furthermore, two key metrics were employed for a comprehensive assessment of the various methods: parameter volume and floating-point operations (FLOPs). The input size for all methods was standardized to 512 × 512. As illustrated in Figure 9, the parameter counts and GFLOPs of the competing models were evaluated on the ISIC2018 test set, with the mIoU metric considered. As illustrated in Figure 9a, our proposed method compares favorably with the majority of advanced models in terms of parameters, with the exception of U-Net and U-Net++, while its primary indicator, mIoU, is significantly superior to that of all the advanced networks. As illustrated in Figure 9b, our proposed method exhibits not only lower GFLOPs but also a superior mIoU indicator compared to the other models. The proposed method thus exhibits superior performance compared to other advanced methods while maintaining low computational complexity and a moderate parameter count.
Although the proposed method not only performs well on the three publicly available datasets but also has some advantages in terms of computational complexity and number of parameters, there are still some limitations. Firstly, the skin images used in our study do not cover all skin tones, which is attributed to the nature of the publicly available datasets used, ISIC2017, ISIC2018, and HAM10000, which mainly feature images of skin lesions in light-skinned individuals. However, in practical applications, it is important to ensure fair and inclusive results for different skin colors. In future work, we will actively seek or construct more comprehensive datasets to address such skin color differences. Second, while a good generator can produce better segmentation results, a good discriminator can also encourage the generator to produce better results, making the min-max game a mutually reinforcing competition that improves the segmentation results. In future research, we will also consider improving the structure of the discriminator in generative adversarial networks.

7. Conclusions

In this study, a generative adversarial network model based on adversarial training, called GLSFA-GAN, is proposed to address the difficulties and challenges in the task of skin lesion image segmentation. Our proposed model combines a local feature extraction module with fused attention and a global context extraction module to efficiently perceive and utilize the rich image details and local features. The MSFF module is capable of capturing multi-scale local detailed information about the lesion region; the module also solves the issue of gradually weakening feature extraction in the encoder part and effectively improves the perception of the local features. The GCEM effectively extracts rich feature representations as well as more comprehensive contextual information through the combination of multi-scale dilated convolution as well as sampling and aggregating the feature maps at different spatial scales. Through the combination with the multi-scale local detailed information in the encoder part, we obtained the global information of the lesion area, facilitating the segmentation network in producing more accurate segmentation prediction maps. Quantitative and qualitative analyses on three public skin lesion image datasets, ISIC2017, ISIC2018, and HAM10000, confirmed the effectiveness of the individual modules of the proposed model and demonstrated that the proposed model is highly competitive when compared to state-of-the-art methods, achieving excellent performance.

Author Contributions

Data curation, methodology, and writing—original draft, R.Z.; investigation, J.Z. and R.Z.; data curation, J.Z.; writing—review and editing, Y.W. and R.Z.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61901292, and the Fundamental Research Program of Shanxi Province, China, grant numbers 201901D211080 and 202303021211082.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Siegel, R.L.; Miller, K.D.; Wagle, N.S.; Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 2023, 73, 17–48. [Google Scholar] [CrossRef] [PubMed]
  2. Cullen, J.K.; Simmons, J.L.; Parsons, P.G.; Boyle, G.M. Topical treatments for skin cancer. Adv. Drug Deliv. Rev. 2020, 153, 54–64. [Google Scholar] [CrossRef] [PubMed]
  3. Ding, H.; Huang, Q.; Alkhayyat, A. A computer aided system for skin cancer detection based on Developed version of the Archimedes Optimization algorithm. Biomed. Signal Process. Control 2024, 90, 105870. [Google Scholar] [CrossRef]
  4. Leiter, U.; Keim, U.; Garbe, C. Epidemiology of skin cancer: Update 2019. Adv. Exp. Med. Biol. 2020, 1268, 123–139. [Google Scholar] [PubMed]
  5. Blazek, K.; Furestad, E.; Ryan, D.; Damian, D.; Fernandez-Penas, P.; Tong, S. The impact of skin cancer prevention efforts in New South Wales, Australia: Generational trends in melanoma incidence and mortality. Cancer Epidemiol. 2022, 81, 102263. [Google Scholar] [CrossRef]
  6. Gershenwald, J.E.; Scolyer, R.A.; Hess, K.R.; Sondak, V.K.; Long, G.V.; Ross, M.I.; Lazar, A.J.; Faries, M.B.; Kirkwood, J.M.; McArthur, G.A. Melanoma staging: Evidence-based changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J. Clin. 2017, 67, 472–492. [Google Scholar] [CrossRef]
  7. Lallas, A.; Apalla, Z.; Lazaridou, E.; Ioannides, D. Dermoscopy. In Imaging in Dermatology; Elsevier: Amsterdam, The Netherlands, 2016; pp. 13–28. [Google Scholar]
  8. Apalla, Z.; Lallas, A.; Argenziano, G.; Ricci, C.; Piana, S.; Moscarella, E.; Longo, C.; Zalaudek, I. The light and the dark of dermatoscopy in the early diagnosis of melanoma: Facts and controversies. Clin. Dermatol. 2013, 31, 671–676. [Google Scholar] [CrossRef]
  9. Warsi, M.F.; Chauhan, U.; Gupta, S.N.; Tiwari, P. A comparative analysis of melanoma detection methods based on computer aided diagnose system. Mater. Today Proc. 2022, 57, 1962–1968. [Google Scholar] [CrossRef]
  10. Dayananda, C.; Yamanakkanavar, N.; Nguyen, T.; Lee, B. AMCC-Net: An asymmetric multi-cross convolution for skin lesion segmentation on dermoscopic images. Eng. Appl. Artif. Intell. 2023, 122, 106154. [Google Scholar] [CrossRef]
  11. Qin, H.; Deng, Z.; Shu, L.; Yin, Y.; Li, J.; Zhou, L.; Zeng, H.; Liang, Q. Portable Skin Lesion Segmentation System with Accurate Lesion Localization Based on Weakly Supervised Learning. Electronics 2023, 12, 3732. [Google Scholar] [CrossRef]
  12. Garnavi, R.; Aldeen, M.; Celebi, M.E.; Varigos, G.; Finch, S. Border detection in dermoscopy images using hybrid thresholding on optimized color channels. Comput. Med. Imaging Graph. 2011, 35, 105–115. [Google Scholar] [CrossRef] [PubMed]
  13. Xu, J.; Wang, X.; Wang, W.; Huang, W. PHCU-Net: A parallel hierarchical cascade U-Net for skin lesion segmentation. Biomed. Signal Process. Control 2023, 86, 105262. [Google Scholar] [CrossRef]
  14. Feng, K.; Ren, L.; Wang, G.; Wang, H.; Li, Y. SLT-Net: A codec network for skin lesion segmentation. Comput. Biol. Med. 2022, 148, 105942. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, W.; Zhang, R.; Zhang, Y.; Bao, F.; Lv, H.; Li, L.; Zhang, C. Pact-Net: Parallel CNNs and Transformers for medical image segmentation. Comput. Methods Programs Biomed. 2023, 242, 107782. [Google Scholar] [CrossRef] [PubMed]
  16. Song, Z.; Luo, W.; Shi, Q. Res-CDD-net: A network with multi-scale attention and optimized decoding path for skin lesion segmentation. Electronics 2022, 11, 2672. [Google Scholar] [CrossRef]
  17. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  19. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  20. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  21. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  22. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  23. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  24. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef]
  25. Lei, B.; Xia, Z.; Jiang, F.; Jiang, X.; Ge, Z.; Xu, Y.; Qin, J.; Chen, S.; Wang, T.; Wang, S. Skin lesion segmentation via generative adversarial networks with dual discriminators. Med. Image Anal. 2020, 64, 101716. [Google Scholar] [CrossRef]
  26. Zhang, J.; Yu, L.; Chen, D.; Pan, W.; Shi, C.; Niu, Y.; Yao, X.; Xu, X.; Cheng, Y. Dense GAN and multi-layer attention based lesion segmentation method for COVID-19 CT images. Biomed. Signal Process. Control 2021, 69, 102901. [Google Scholar] [CrossRef]
  27. Kazeminia, S.; Baur, C.; Kuijper, A.; van Ginneken, B.; Navab, N.; Albarqouni, S.; Mukhopadhyay, A. GANs for medical image analysis. Artif. Intell. Med. 2020, 109, 101938. [Google Scholar] [CrossRef] [PubMed]
  28. Baur, C.; Albarqouni, S.; Navab, N. Generating highly realistic images of skin lesions with GANs. In Proceedings of the OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis: First International Workshop, OR 2.0 2018, 5th International Workshop, CARE 2018, 7th International Workshop, CLIP 2018, Third International Workshop, ISIC 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16–20 September 2018; Proceedings 5. pp. 260–267. [Google Scholar]
  29. Mirikharaji, Z.; Abhishek, K.; Bissoto, A.; Barata, C.; Avila, S.; Valle, E.; Celebi, M.E.; Hamarneh, G. A survey on deep learning for skin lesion segmentation. Med. Image Anal. 2023, 88, 102863. [Google Scholar] [CrossRef] [PubMed]
  30. Green, A.; Martin, N.; Pfitzner, J.; O’Rourke, M.; Knight, N. Computer image analysis in the diagnosis of melanoma. Melanoma Res. 1994, 31, 958–964. [Google Scholar] [CrossRef] [PubMed]
  31. Erkol, B.; Moss, R.H.; Joe Stanley, R.; Stoecker, W.V.; Hvatum, E. Automatic lesion boundary detection in dermoscopy images using gradient vector flow snakes. Skin Res. Technol. 2005, 11, 17–26. [Google Scholar] [CrossRef] [PubMed]
  32. Emre Celebi, M.; Alp Aslandogan, Y.; Stoecker, W.V.; Iyatomi, H.; Oka, H.; Chen, X. Unsupervised border detection in dermoscopy images. Skin Res. Technol. 2007, 13, 454–462. [Google Scholar] [CrossRef]
  33. Gomez, D.D.; Butakoff, C.; Ersboll, B.K.; Stoecker, W. Independent histogram pursuit for segmentation of skin lesions. IEEE Trans. Biomed. Eng. 2007, 55, 157–161. [Google Scholar] [CrossRef]
  34. Zortea, M.; Skrøvseth, S.O.; Schopf, T.R.; Kirchesch, H.M.; Godtliebsen, F. Automatic segmentation of dermoscopic images by iterative classification. Int. J. Biomed. Imaging 2011, 2011, 972648. [Google Scholar] [CrossRef]
  35. Han, Q.; Qian, X.; Xu, H.; Wu, K.; Meng, L.; Qiu, Z.; Weng, T.; Zhou, B.; Gao, X. DM-CNN: Dynamic Multi-scale Convolutional Neural Network with uncertainty quantification for medical image classification. Comput. Biol. Med. 2024, 168, 107758. [Google Scholar] [CrossRef]
  36. Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018. [Google Scholar] [CrossRef]
  37. Zhang, C.; Pan, X.; Li, H.; Gardiner, A.; Sargent, I.; Hare, J.; Atkinson, P.M. A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification. ISPRS J. Photogramm. Remote Sens. 2018, 140, 133–144. [Google Scholar] [CrossRef]
  38. Sushma, B.; Pulikala, A. AAPFC-BUSnet: Hierarchical encoder–decoder based CNN with attention aggregation pyramid feature clustering for breast ultrasound image lesion segmentation. Biomed. Signal Process. Control 2024, 91, 105969. [Google Scholar]
  39. Shu, X.; Wang, J.; Zhang, A.; Shi, J.; Wu, X.-J. CSCA U-Net: A channel and space compound attention CNN for medical image segmentation. Artif. Intell. Med. 2024, 150, 102800. [Google Scholar] [CrossRef] [PubMed]
  40. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. arXiv 2014, arXiv:1406.2661. [Google Scholar]
  41. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  42. Bi, L.; Feng, D.; Fulham, M.; Kim, J. Improving skin lesion segmentation via stacked adversarial learning. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 1100–1103. [Google Scholar]
  43. Han, C.; Kitamura, Y.; Kudo, A.; Ichinose, A.; Rundo, L.; Furukawa, Y.; Umemoto, K.; Li, Y.; Nakayama, H. Synthesizing diverse lung nodules wherever massively: 3D multi-conditional GAN-based CT image augmentation for object detection. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 729–737. [Google Scholar]
  44. Bansal, N.; Sridhar, S. Hexa-gan: Skin lesion image inpainting via hexagonal sampling based generative adversarial network. Biomed. Signal Process. Control 2024, 89, 105603. [Google Scholar] [CrossRef]
  45. Salvi, M.; Branciforti, F.; Veronese, F.; Zavattaro, E.; Tarantino, V.; Savoia, P.; Meiburger, K.M. DermoCC-GAN: A new approach for standardizing dermatological images using generative adversarial networks. Comput. Methods Programs Biomed. 2022, 225, 107040. [Google Scholar] [CrossRef]
  46. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  47. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  48. Jha, D.; Riegler, M.A.; Johansen, D.; Halvorsen, P.; Johansen, H.D. Doubleu-net: A deep convolutional neural network for medical image segmentation. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 558–564. [Google Scholar]
  49. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
Figure 1. The interference factors that challenge skin lesion segmentation.
Figure 2. An overview of the GLSFA-GAN model framework. GLSFA-GAN consists of a generator GLSFA-Net and a discriminator. Skin images are first fed into the encoder of the generator to reduce resolution and extract the multi-scale features. The high-level features with the lowest resolution are fed into the GCEM to capture the global semantic features and spatial context information. Meanwhile, the multi-scale features of the encoder are sequentially fed into the MSFF and ECA modules to learn the local semantic features and are concatenated with the corresponding resolution features of the decoder to generate segmentation prediction maps. Finally, the labels are fed into the discriminator along with the segmentation prediction map for learning. The discriminator is not used in the testing phase.
Figure 3. The overall pipeline of the generator GLSFA-Net. GLSFA-Net consists of an encoder, a decoder, and a connectivity component. Among them, the encoder is responsible for capturing the high-level semantic features of the image, and the decoder is responsible for recovering the spatial information from the extracted features. Meanwhile, the MSFF and ECA modules connecting the encoder and decoder are responsible for learning local features, and the GCEM is responsible for learning global features.
Figure 4. The operation pipeline of the MSFF module. Semantic features from different scales are first concatenated; features under different receptive fields are then obtained by convolution, and a spatial attention mechanism produces a separate weight map for each receptive field.
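To make the caption above more concrete, the following is a minimal PyTorch sketch of a multi-scale fusion block with per-branch spatial attention; the kernel sizes, channel widths, and class names (`MSFFSketch`, `SpatialAttention`) are illustrative assumptions and do not reproduce the exact MSFF design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial attention: a weight map from channel-wise avg- and max-pooling."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class MSFFSketch(nn.Module):
    """Illustrative multi-scale fusion: concatenate two scales, extract features
    under different receptive fields, and reweight each branch spatially."""
    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.branch3 = nn.Conv2d(c_low + c_high, c_out, 3, padding=1)
        self.branch5 = nn.Conv2d(c_low + c_high, c_out, 5, padding=2)
        self.sa3, self.sa5 = SpatialAttention(), SpatialAttention()
        self.fuse = nn.Conv2d(2 * c_out, c_out, 1)

    def forward(self, f_low, f_high):
        # Upsample the coarser feature map to the finer resolution before fusing.
        f_high = F.interpolate(f_high, size=f_low.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([f_low, f_high], dim=1)
        b3, b5 = self.branch3(x), self.branch5(x)
        out = torch.cat([b3 * self.sa3(b3), b5 * self.sa5(b5)], dim=1)
        return self.fuse(out)
```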
Figure 5. The structure of the ECA module. ECA obtains the aggregated features by global average pooling and then generates the channel weights by performing a one-dimensional convolution of size k, where k is adaptively determined by a mapping of the channel dimension C.
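Because the ECA mechanism summarized in the caption follows Wang et al. [46], a compact PyTorch sketch is given below; the gamma and b constants are the commonly used defaults from ECA-Net and are assumptions with respect to this paper's implementation.

```python
import math
import torch
import torch.nn as nn

class ECASketch(nn.Module):
    """Efficient channel attention: global average pooling, a 1-D convolution of
    adaptive kernel size k over the channel dimension, and a sigmoid gate."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1          # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                  # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))             # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1))      # 1-D conv over channels -> (N, 1, C)
        w = torch.sigmoid(y).squeeze(1).unsqueeze(-1).unsqueeze(-1)
        return x * w                       # channel-wise reweighting
```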
Figure 6. The structure of the GCEM. GCEM captures information at different spatial scales through a series of convolution operations, then uses multiple pooling layers to extract contextual information, and finally upsamples each branch so that the output dimensions are consistent.
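As an illustration of the kind of global-context aggregation described in the caption, the sketch below combines parallel dilated convolutions with multi-scale pooling branches and upsampling; the dilation rates, pooling sizes, and the `GCEMSketch` name are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCEMSketch(nn.Module):
    """Illustrative global-context extractor: parallel dilated convolutions enlarge
    the receptive field, pooling branches summarize different spatial scales, and
    all branches are upsampled back to the input resolution and fused."""
    def __init__(self, c_in, c_out, dilations=(1, 2, 4), pool_sizes=(1, 2, 4)):
        super().__init__()
        self.dilated = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=d, dilation=d) for d in dilations
        )
        self.pools = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(c_in, c_out, 1))
            for s in pool_sizes
        )
        self.fuse = nn.Conv2d(c_out * (len(dilations) + len(pool_sizes)), c_out, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [conv(x) for conv in self.dilated]
        feats += [F.interpolate(p(x), size=(h, w), mode="bilinear", align_corners=False)
                  for p in self.pools]
        return self.fuse(torch.cat(feats, dim=1))
```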
Figure 7. The structure of the discriminator module.
Figure 8. The qualitative comparison results of the proposed model to other state-of-the-art methods on three different datasets.
Figure 9. Comparative analysis of (a) the parameter counts and (b) the GFLOPs of the competing models against the mIoU metric.
Table 1. The performance comparison of various competing models on the ISIC 2017 dataset (%).

Method            mIoU     Accuracy   Precision   Recall   Specificity
DCGAN [49]        63.41    88.39      84.21       76.14    93.58
U-Net [18]        73.91    92.07      88.49       83.83    95.36
U-Net++ [20]      68.12    90.12      86.46       76.29    95.17
Att U-Net [21]    74.28    93.04      91.52       83.97    95.69
CE Net [24]       75.55    93.18      89.04       84.18    97.04
CPF Net [36]      77.22    93.78      89.61       84.86    96.22
DAGAN [25]        77.13    93.51      89.73       84.54    96.35
Ours              79.87    95.14      90.37       86.21    97.06

Notes: Bold in the table is the maximum of each evaluation criterion.
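For clarity about how the values in Tables 1–3 are defined, the following sketch computes the five reported metrics from a binary prediction map and its ground-truth mask; treating mIoU as the mean of the foreground and background IoU and binarizing at 0.5 are assumptions about the evaluation protocol.

```python
import numpy as np

def segmentation_metrics(pred, gt, threshold=0.5):
    """Binary segmentation metrics from a prediction map and a ground-truth mask."""
    p = np.asarray(pred) >= threshold      # binarize the probability map (assumed threshold)
    g = np.asarray(gt) >= 0.5              # binarize the label mask
    tp = np.logical_and(p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    eps = 1e-8                             # guards against empty masks
    iou_fg = tp / (tp + fp + fn + eps)
    iou_bg = tn / (tn + fp + fn + eps)
    return {
        "mIoU":        (iou_fg + iou_bg) / 2,
        "Accuracy":    (tp + tn) / (tp + tn + fp + fn + eps),
        "Precision":   tp / (tp + fp + eps),
        "Recall":      tp / (tp + fn + eps),
        "Specificity": tn / (tn + fp + eps),
    }
```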
Table 2. The comparison of various competing models on the ISIC 2018 dataset (%).

Method            mIoU     Accuracy   Precision   Recall   Specificity
DCGAN [49]        71.71    89.53      81.89       86.32    90.37
U-Net [18]        80.25    95.21      88.33       90.62    96.35
U-Net++ [20]      81.23    95.47      89.93       90.66    97.22
Att U-Net [21]    81.92    95.44      91.52       89.07    97.64
CE Net [24]       82.11    95.79      86.43       89.28    97.52
CPF Net [36]      83.05    96.13      86.32       90.05    97.77
DAGAN [25]        83.97    96.73      91.72       92.17    97.74
Ours              86.79    96.84      91.97       91.56    97.63

Notes: Bold in the table is the maximum of each evaluation criterion.
Table 3. The performance comparison of various competing models on the HAM10000 dataset (%).

Method              mIoU     Accuracy   Precision   Recall   Specificity
DCGAN [49]          78.89    91.95      83.49       83.29    92.36
U-Net [18]          83.34    94.78      86.25       90.62    95.43
U-Net++ [20]        85.77    94.35      89.93       88.47    95.87
Att U-Net [21]      86.46    94.93      90.67       89.35    96.35
Double U-Net [48]   86.64    94.47      91.26       88.32    96.79
CPF Net [36]        87.39    95.67      89.99       92.44    96.82
DAGAN [25]          87.74    95.33      91.14       89.73    97.06
Ours                88.63    95.79      91.33       90.25    97.11

Notes: Bold in the table is the maximum of each evaluation criterion.
Table 4. The ablation study of the GLSFA-GAN architecture (%).

Model   mIoU     Acc      Pre      Rec      Spe
I       80.25    95.21    88.33    90.62    96.35
II      82.43    95.57    89.14    90.71    96.72
III     83.58    95.84    90.41    90.92    97.11
IV      85.27    96.39    90.47    91.21    97.46
V       86.11    96.47    91.44    91.44    97.57
VI      86.79    96.84    91.56    91.56    97.63

Notes: Models I–VI denote different combinations of the baseline generator, the MSFF, ECA, and GCEM modules, and the discriminator.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
