1. Introduction
Skin cancer is the most common malignant tumor in humans and includes squamous cell carcinoma, basal cell carcinoma, malignant melanoma and malignant lymphoma, among which malignant melanoma is a skin disease with a high mortality rate [1]. If melanoma is found and treated early, the cure rate is as high as 95%; however, the cure rate of melanoma detected at a late stage is extremely low, with a mortality rate as high as 85%. Early diagnosis and treatment of skin diseases is therefore essential [2]. Nowadays, many doctors determine a patient's lesion area by naked-eye observation or dermoscopic examination. However, because hair and blood vessels interfere with diagnosis, even experienced doctors find it difficult to segment the lesion area accurately. It is therefore very valuable to introduce a computer-aided diagnosis (CAD) system to segment skin lesions, which is of great significance for clinical evaluation and diagnosis.
Skin lesion segmentation methods fall into traditional methods and deep learning methods. Traditional methods operate on the raw information of skin lesion images and include threshold segmentation [3], region segmentation [4], edge segmentation [5] and support vector machine segmentation [6]. In contrast, deep learning based skin lesion segmentation models can adaptively learn image features without manual intervention and thus produce accurate segmentations. These methods outperform traditional methods and have become mainstream in skin lesion segmentation. The most common architecture in skin lesion segmentation models is the encoder-decoder structure: the encoder extracts features from the input image, while the decoder restores the extracted feature map to the original image size and outputs the final segmentation map. To improve segmentation performance, many researchers refine the encoder-decoder structure and introduce effective mechanisms, such as attention and residual structures, to enhance the learning capacity of the network.
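The encoder-decoder flow described above can be illustrated with a minimal toy sketch (purely illustrative, not the network proposed in this paper): pooling in the encoder halves the spatial resolution at each stage, and the decoder upsamples the result back to the original size.

```python
import numpy as np

def max_pool2x2(x):
    """Encoder step: downsample a (H, W) map by taking 2x2 block maxima."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    """Decoder step: nearest-neighbor upsampling, each pixel becomes a 2x2 block."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

img = np.random.rand(64, 64)          # toy input "image"
enc = max_pool2x2(max_pool2x2(img))   # encoder: 64x64 -> 32x32 -> 16x16
dec = upsample2x(upsample2x(enc))     # decoder: 16x16 -> 32x32 -> 64x64
print(enc.shape, dec.shape)           # (16, 16) (64, 64)
```

Real encoders interleave learned convolutions with the pooling, but the shape bookkeeping is the same: the bottleneck holds a coarse, context-rich map that the decoder must expand back to per-pixel predictions.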
Zhou et al. [7] proposed the U-Net++ network in 2018, which introduced dense connections and additional skip connection paths to compensate for the information gap between encoder and decoder. Huang et al. [8] proposed the U-Net3+ network in 2020; to address U-Net++'s insufficient extraction of multi-scale information, U-Net3+ uses full-scale skip connections and deep supervision. Alom et al. [9] proposed the R2U-Net architecture, which integrates the structures of U-Net, ResNet [10] and RCNN [11] and achieved good results in many medical image segmentation tasks, such as blood vessels, lungs and retina. Jin et al. [12] proposed the Residual Attention Perception Network, which added an attention mechanism to U-Net for the first time and used it to fuse low-level and deep feature information, extracting the contextual information of the feature map and improving segmentation. Sarker et al. [13] proposed a new segmentation network using dilated residuals and pyramid pooling: the encoder uses a dilated residual network to extract image feature information, and the decoder uses pyramid pooling to reconstruct the image information and obtain the segmented image.
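The skip connections central to U-Net and its variants above amount to channel-wise concatenation of an encoder feature map with the decoder map of the same spatial size; a minimal sketch (illustrative only, with toy channel counts):

```python
import numpy as np

def skip_concat(decoder_feat, encoder_feat):
    """U-Net style skip connection: concatenate the upsampled decoder map
    with the encoder map of matching spatial size along the channel axis,
    so fine spatial detail lost to pooling can be recovered."""
    return np.concatenate([decoder_feat, encoder_feat], axis=0)  # axis 0 = channels

dec = np.zeros((8, 32, 32))   # 8 channels from the decoder path
enc = np.ones((8, 32, 32))    # 8 channels from the matching encoder level
merged = skip_concat(dec, enc)
print(merged.shape)           # (16, 32, 32)
```

U-Net++ and U-Net3+ differ mainly in *which* encoder/decoder levels feed such concatenations (nested dense paths vs. full-scale connections), not in the operation itself.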
Although the above methods can effectively segment the lesion region, some challenges remain, such as small lesions, lesion regions similar to the background, and fuzzy edges, which can easily cause the loss of local features of the lesion region during segmentation and lead to poor results.
Figure 1 shows the poor segmentation produced by some networks on skin lesion images with blurred edges and background-like lesion areas. Owing to insufficient feature extraction from the input image, these networks cannot extract deep image information, which leads to the loss of feature information and inaccurate segmentation.
In view of the above problems, this paper proposes an efficient skin lesion segmentation method based on a multi-level split receptive field and attention. First, the skin lesion images are preprocessed to remove artifacts such as hair and blood vessels and reduce their interference with lesion segmentation. Second, a depth feature extraction module and a multi-level split receptive field module are used to extract the global context and local information of skin lesion images. Then, a hybrid pooling module is used in the bottleneck layer to aggregate long-range and short-range dependencies in the image and to fuse the global and local information of the feature map; this cascade reduces the loss of feature map information and enhances information transfer between pixels. Finally, reverse residual external attention is introduced into the decoder to strengthen the information association between samples, capture the features of the whole dataset, and recover feature map information. The main contributions of this paper are as follows.
1. We propose an efficient skin lesion segmentation method based on a multi-level split receptive field and attention, which can segment lesion images efficiently;
2. A new depth feature extraction module and a multi-level split receptive field module are proposed to replace traditional convolutional feature extraction, extracting the shallow information of the image more accurately and enhancing the learning ability of the network;
3. A hybrid pooling module is introduced to better integrate contextual information and local information;
4. A reverse residual external attention module is proposed to strengthen the connections across the dataset and improve the segmentation effect;
5. Extensive experiments were carried out on the ISBI2017 and ISIC2018 datasets. The results show that the model can effectively segment the lesion area.
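The intent of the hybrid pooling module (contribution 3) can be sketched in miniature. The details of the module proposed here are not reproduced in this sketch; it only illustrates the general idea of fusing a local max-pooled map (short-range detail) with a global pooled statistic (long-range context):

```python
import numpy as np

def hybrid_pool(x, k=2):
    """Toy fusion of local and global pooling: local kxk block maxima carry
    short-range structure, the global mean carries long-range context, and
    the two are fused by broadcast addition. Illustrative only; not the
    paper's actual hybrid pooling module."""
    h, w = x.shape
    local = x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))
    global_ctx = x.mean()          # one scalar summarizing the whole map
    return local + global_ctx      # every local response sees global context

feat = np.arange(16.0).reshape(4, 4)
fused = hybrid_pool(feat)
print(fused.shape)  # (2, 2)
```

The point of any such fusion is that each output position depends on both its neighborhood and the whole feature map, which is what "aggregating long- and short-range dependencies" means operationally.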
2. Related Works
Compared with deep learning, traditional machine learning segmentation is often cumbersome to implement: during the early feature selection process, doctors must intervene manually based on their prior knowledge and professional experience, and external factors can introduce errors into their judgment. As a pixel classification approach, a deep learning model no longer requires doctors to design features by hand; through supervised learning with a suitable loss function and gradient descent algorithm, the network actively learns image features. Because there is little manual intervention, segmentation results obtained with deep learning are more objective and reliable.
Nowadays, deep learning networks are widely used in medical image processing. Convolutional neural networks are the most representative and can be divided into two categories: traditional convolutional neural networks and fully convolutional neural networks.
A traditional convolutional neural network divides the input image into several patches and then predicts, for each patch, whether it lies inside or outside the target. Codella et al. [14] combined sparse coding, support vector machines and a convolutional neural network to achieve accurate recognition of melanoma. However, a traditional CNN requires a fixed input size, different datasets need separate pre- and post-processing, and training generates a large number of parameters that occupy substantial memory. A fully convolutional neural network replaces the final fully connected layer with a convolutional layer, so the network can accept inputs of any size with reduced complexity. In 2015, Long et al. [15] first proposed a pixel-level semantic segmentation technique that allows arbitrary input sizes, called the Fully Convolutional Network (FCN). In the same year, Badrinarayanan et al. [16] proposed SegNet. Both are encoder-decoder segmentation methods: the encoder generates a low-resolution feature map, the decoder upsamples it to restore resolution, and a Softmax classifier finally predicts the segmentation. Also in 2015, Chen et al. [17] proposed DeepLab, followed by three subsequent versions. To address the neglect of small objects and the multi-scale problem in convolutional networks, the DeepLab series successively swapped in different pre-trained models, introduced dilated (atrous) convolution and ASPP layers, and finally combined them organically. Ronneberger et al. [18] proposed a new convolutional segmentation network called U-Net in 2015, a modification and extension of the FCN structure. The whole network has a U-shaped symmetric structure, with an encoder on the left containing four convolutional stages and a decoder on the right containing four corresponding upsampling stages; the feature map of each convolutional stage is passed across through a skip connection. The U-Net structure is now widely used in medical image lesion segmentation tasks.
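The dilated (atrous) convolution used by the DeepLab series enlarges the receptive field without adding weights by spacing the kernel taps apart. A minimal 1-D sketch (illustrative, not DeepLab's implementation):

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """1-D dilated (atrous) convolution: taps are spaced d apart, so a
    kernel of len(w) weights covers a receptive field of d*(len(w)-1)+1
    input samples."""
    k = len(w)
    span = d * (k - 1) + 1
    return np.array([sum(w[j] * x[i + j * d] for j in range(k))
                     for i in range(len(x) - span + 1)])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, 1))  # ordinary convolution, receptive field 3
print(dilated_conv1d(x, w, 2))  # dilation 2, receptive field 5, same 3 weights
```

Stacking layers with increasing dilation rates, as ASPP does in parallel branches, lets a network see multiple scales at once, which is exactly the multi-scale problem DeepLab targets.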
In recent years, deep learning has been applied to the field of skin lesion segmentation. In 2017, U-Net variants and U-Net encoders with pre-training appeared frequently. B. S. Lin et al. [19] compared two U-Net-based skin lesion segmentation methods, one using histogram equalization and one using C-means clustering. Y. Yuan et al. [20] used the ISBI2017 dataset to propose a skin lesion segmentation method that trains a deep convolution-deconvolution neural network (CDNN) on different color spaces of dermoscopy images. N. C. Codella et al. [21] built an integrated system combining traditional machine learning methods with deep learning methods.
Schlemper et al. [22] proposed the Attention Gate (AG), which integrates an attention mechanism into U-Net. Through attention learning, it automatically focuses on target structures in different areas, highlights salient target regions, and reduces the influence of irrelevant background regions of the input image on feature extraction. Li Haixiang et al. [23] designed a dense deconvolution network based on encoding and decoding modules.
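The attention gate of Schlemper et al. combines skip-connection features with a coarse gating signal, squeezes the result to a per-position coefficient in (0, 1), and rescales the skip features with it. A scalar toy sketch in that spirit (the weights here are arbitrary numbers, not learned parameters, and the real AG uses 1x1 convolutions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, wx, wg, psi):
    """Additive attention in the spirit of the AG: mix skip features x with
    gating signal g, squash to attention coefficients in (0, 1), and use
    them to suppress irrelevant positions of x. Toy scalar weights only."""
    a = sigmoid(psi * np.tanh(wx * x + wg * g))  # per-position attention coefficients
    return a * x                                 # rescaled skip features

x = np.array([0.1, 2.0, 0.05])   # skip-connection features
g = np.array([0.0, 1.5, 0.0])    # coarse gating signal from a deeper layer
out = attention_gate(x, g, wx=1.0, wg=1.0, psi=4.0)
```

Positions where the gating signal agrees with the skip features keep most of their magnitude, while background positions are attenuated, which is how the AG "highlights salient target regions."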
Generative adversarial networks began to play a role in medical imaging in 2019. Bi et al. [24] proposed an automatic skin lesion segmentation method based on adversarial learning, which uses data augmentation such as rotation, masking and cropping to expand the data; the features extracted by the convolution operations in the encoder and decoder are then deeply fused, improving the segmentation performance of the network to a certain extent. L. Canalini et al. [25] designed a codec architecture with multiple pre-trained models as feature extractors. They explored multiple pre-trained models to initialize the feature extractors without using bias-inducing datasets, adopted an encoder-decoder segmentation structure to exploit each pre-trained feature extractor, and also generated additional training data with a generative adversarial network (GAN).
Semi-supervised learning expanded gradually from 2019. Its purpose is to greatly alleviate the shortage of large-scale labeled data by allowing the model to exploit the abundant unlabeled data available. The mutually guided semi-supervised skin detection network designed by He Yaying et al. [26] and the semi-supervised iterative self-learning of R. Dupre et al. [27] improved dataset capacity and model accuracy. In 2021, Wu et al. [28] proposed a skin lesion segmentation method built on a new, efficient adaptive dual attention module, which integrates two global context modeling mechanisms and can extract more comprehensive and discriminative features to identify the boundaries of skin lesions. Hritam et al. [29] proposed a multifocal segmentation network for skin lesions in 2022: the final segmentation mask is computed from feature maps at different scales, and a Res2Net convolutional backbone supplies the deep features used in a parallel partial decoder module to obtain a global map of the segmentation mask.
5. Discussion
In this paper, a skin lesion segmentation network based on a multi-level split receptive field and attention is proposed, which addresses the low contrast of skin lesion images and the similarity between lesion areas and background. The network adopts a U-shaped encoder-decoder structure, uses a depth feature extraction layer and a multi-level split receptive field module to extract feature map information and capture global context, and fuses global and local information through a hybrid pooling module to build long- and short-range dependencies. We also introduce a reverse residual attention block into the decoder to better process the feature information of the image. Experiments show that the proposed network segments lesion images more accurately, providing important assistance for clinical diagnosis and treatment: doctors can readily obtain a patient's diagnosis from the segmentation result.
Although our approach can accurately segment images with low contrast and background-like colors, segmentation of images with blurred edges and small lesions is not yet accurate enough and easily loses feature information. In future work, we will strengthen image feature extraction, optimize the computation to improve training speed, improve robustness to noise, and repair the missing and fuzzy segmentations caused by hard-to-identify lesion areas, so as to improve the skin lesion segmentation network and provide doctors with better medical technology. In addition, a lightweight network is a focus of future research: a lightweight model eases application deployment, reduces the memory footprint, and improves efficiency and user experience.
6. Conclusions
To address challenges in skin lesion images such as the similarity between lesion region and background and hair occlusion, a skin lesion segmentation method based on a multi-level split receptive field and attention is proposed. In the encoding part, the depth feature extraction module and the MS-RFB module improve the extraction of global context from feature maps and enrich feature information at different scales. The hybrid pooling module establishes long- and short-range dependencies through various forms of convolution. In addition, the reverse residual external attention module introduced into the decoder strengthens connections across the dataset, captures dataset-wide features, and improves the generalization ability of the model. Experimental results show that the algorithm improves the segmentation of skin lesion images, especially those whose lesion areas resemble the background; edge detail handling is also improved, and the overall metrics are superior to those of other algorithms, helping to further improve the accuracy and efficiency of computer-aided skin lesion diagnosis. Compared with classical networks, our model improves on all metrics. This model is therefore helpful in improving the efficiency of computer-aided diagnosis of skin diseases and provides a reference for future research.