A Segmentation Algorithm of Colonoscopy Images Based on Multi-Scale Feature Fusion

Yu, Jing; Li, Zhengping; Xu, Chao; Feng, Bo

doi:10.3390/electronics11162501


by Jing Yu ¹,², Zhengping Li ¹,²,*, Chao Xu ¹,² and Bo Feng ¹,²

¹ School of Integrated Circuits, Anhui University, Hefei 230601, China

² AnHui Engineering Laboratory of Agro-Ecological Big Data, Hefei 230601, China

* Author to whom correspondence should be addressed.

Electronics 2022, 11(16), 2501; https://doi.org/10.3390/electronics11162501

Submission received: 13 July 2022 / Revised: 4 August 2022 / Accepted: 9 August 2022 / Published: 11 August 2022


Abstract:

Colorectal cancer is a common malignant tumor that is primarily caused by the cancerization of adenomatous polyps. Segmenting polyps in computer-assisted enteroscopy images helps doctors diagnose and treat the disease accurately. In this study, a segmentation algorithm for colonoscopy images based on multi-scale feature fusion is proposed. The algorithm adopts ResNet50 as the backbone network to extract features. The shallow features are processed by a cross extraction module, which increases the receptive field while retaining texture information, and the processed shallow features are then fused with the deep features at different proportions by a multi-proportion fusion module. The proposed algorithm is capable of suppressing redundant information, removing background noise, and sharpening boundaries while acquiring considerable semantic information. In experiments on the public Kvasir-SEG dataset of intestinal polyps, the mean Dice coefficient and mean intersection over union reached 0.9192 and 0.8873, respectively, better than those of existing mainstream algorithms. The results verify the effectiveness of the proposed network and provide a reference for deep learning in the processing and analysis of intestinal polyp images.

Keywords:

colorectal cancer; polyp segmentation; multi-scale feature fusion; cross extraction module; multi-proportion fusion module

1. Introduction

Colorectal cancer is a common malignant tumor comprising colon and rectal cancer. The statistics of the World Health Organization in 2018 suggest that the incidence and mortality of colorectal cancer ranked third worldwide, the incidence rate was 10% of the annual new tumor cases worldwide, and the mortality rate accounted for 9.4% of the global annual tumor-associated deaths [1]. Changes in living habits and dietary structure have led to the rising incidence and mortality of colorectal cancer over the past few years [2]. Accordingly, the effective prevention of colorectal cancer is an urgent problem.

Colorectal polyps are significantly related to colorectal cancer, and removing intestinal polyps can reduce the incidence of colorectal cancer [3]. Doctors are required to analyze polyps before formulating a plan to remove intestinal polyps. Even if doctors have sufficient clinical experience, there is still a missed detection rate of 25% in video examinations [4]. Existing studies have found that for every 1% increase in the detection rate of polyps, the incidence of colorectal cancer will decrease by 3%, and increasing the detection rate of polyps is beneficial in reducing the incidence of colorectal cancer [5]. With the extensive application of computer-aided medical image processing, the polyp images segmented by medical experts can be used as real values to train neural networks. The automatic segmentation of intestinal polyps through the neural network can increase the efficiency of segmentation and decrease the missed detection rate of polyps, which is beneficial to the prevention and treatment of colorectal cancer.

Conventional polyp segmentation methods are based on low-level features. For example, Mamonov et al. [6] proposed to segment polyp images for principal component tracking. However, conventional polyp segmentation methods consider only a single feature and cannot organically combine the features of the polyp region, resulting in unsatisfactory segmentation results. Compared with conventional methods, deep learning can extract deeper features and increase the accuracy of image segmentation. Akbari et al. [7] used FCN [8] to segment intestinal polyps, and the segmentation results were greatly improved compared with traditional methods. Zhao et al. [9] proposed improving the accuracy of polyp segmentation by combining global and local information. Encoder-decoder models (e.g., U-Net [10] and UNet++ [11]) have been widely used in medical image processing over the past few years. Jha et al. [12] proposed adding SE [13], ASPP [14], and CBAM [15] to the model, which can significantly improve network performance. Feng et al. [16] proposed using two pyramid modules to fuse global information. Kang et al. [17] used the Mask R-CNN [18] model to segment polyps. Hemin et al. [19] proposed using different CNNs as feature extractors in Mask R-CNN [18] to further improve the segmentation performance of the network. Fan et al. [20] used a reverse attention module to remove background noise and obtain accurate and complete polyp boundaries. Dong et al. [21] proposed Polyp-PVT, which carries out intestinal polyp segmentation based on the Transformer architecture [22]. They used a cascaded fusion module and a similarity aggregation module to organically combine the high-level and low-level features of polyps and improve the expressive ability of the features. Lou et al. [23] proposed a parallel partial decoder to aggregate high-level features and obtain rough polyp positions, and repeatedly used the reverse attention module to establish the relationship between regions and boundaries. Zhang et al. [24] captured global and detailed information by using a Transformer and a CNN together and fused multi-level features with the BiFusion module. Srivastava et al. [25] proposed the MSRF network with a gated shape stream to compute multi-scale features and used DSDF blocks to fuse them effectively. Srivastava et al. [26] designed cross multi-scale attention and multi-scale feature selection modules to perform multi-scale fusion across all resolutions. Tan et al. [27] proposed fusing high-level semantic information into the shallow layers and using learned spatial attention blocks to enhance the original feature map. Wang et al. [28] proposed using a Transformer as the encoder and designed a local emphasis module to extract key local information and a stepwise feature aggregation module that uses linear layers to fuse features of different scales step by step.

Deep learning has been extensively used in the segmentation of intestinal polyps. The accuracy has been significantly improved compared to conventional methods, but due to the unique characteristics of intestinal datasets, there are still several problems to be solved. Intestinal polyps change significantly in size and shape, and the existing methods are better for large polyps, while smaller polyps are difficult to segment. The polyp area is similar in color and texture to the background, thus resulting in an inaccurate edge of the segmented polyp. There are multiple polyps in the same image, and the positions are close, thus leading to the situation that multiple polyps are segmented into one polyp. To solve the above problems, a segmentation algorithm of colonoscopy images based on multi-scale feature fusion is proposed in this study. Table 1 lists the strengths and weaknesses of common models and models proposed in this study. Compared to existing methods, the contributions of this study are as follows:

(1) A cross extraction module is designed to further process the shallow features produced by the ResNet50 backbone network. It expands the receptive field and extracts multi-scale detailed information, which helps capture polyps of different sizes and shapes.

(2) A multi-proportion fusion module is proposed, which adjusts the weights of different features by the number of inputs and fuses the processed deep and shallow features in different proportions. As a result, the polyp region can be strengthened, the segmentation accuracy of the edge can be increased, and the network can perform better in complex environments.

In the second part, the method proposed in this study is introduced in detail. In the third part, experimental details, analysis of experimental results, and subjective and objective analysis are introduced. The fourth part is the conclusion of this study and future steps.

2. Methods

2.1. Proposed Network Structure

The proposed algorithm fully considers the characteristics of the intestinal polyp dataset. Based on the ResNet50 backbone network, the cross extraction module (CEM) is adopted to improve shallow feature extraction, and multi-scale features are fused by the multi-proportion fusion module (MPFM) to decrease the number of network parameters. Figure 1 shows the network structure. Given an input image, feature maps X1, X2, X3, and X4 at four scales, from low-level to high-level, are first extracted by the ResNet50 backbone. The low-level feature map X1 is passed through CEM to extract shallow features; its number of channels is then adjusted by a convolutional block, and it is up-sampled to obtain A. The high-level feature maps X2, X3, and X4 are each passed through a convolutional block that adjusts the number of channels, giving B, C, and D. Subsequently, MPFM performs multi-scale fusion of the feature maps A, B, C, and D and outputs two predictions: P2, which fuses the deep features only, and P1, which fuses the deep and shallow features. P1 is used as the final prediction.

2.2. Network Backbone Extraction

The recent widespread use of deep convolutional networks suggests that network depth plays a positive role in image segmentation. However, as the network depth keeps increasing, a degradation phenomenon appears after the accuracy reaches saturation. He et al. [30] proposed a deep residual learning framework, which introduces an identity shortcut connection that skips one or more layers and passes the input directly to the output. As a result, the network only has to learn the difference between input and output, which simplifies the learning goal, reduces the learning difficulty, and allows deeper networks than before. Considering the characteristics of the intestinal polyp dataset, this study uses ResNet50 as the backbone network to extract image features and outputs feature maps X1, X2, X3, and X4 at four scales from low-level to high-level. Specifically, X1 contains rich detailed information (e.g., texture and color), while X2, X3, and X4 contain rich semantic information.
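The shortcut idea can be illustrated with a minimal sketch in plain Python. The names `residual_block` and `zero_branch` and the list-based vectors are illustrative assumptions, not part of ResNet50; a real residual block applies convolutions to tensors.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return f(x) + x element-wise, where f is the learned residual branch."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input size")
    return [a + b for a, b in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    # A branch that has learned nothing: the block reduces to the identity.
    return [0.0 for _ in x]


if __name__ == "__main__":
    print(residual_block(zero_branch, [1.0, -2.0, 3.0]))
```

When the branch outputs zeros, the block returns its input unchanged, which is why stacking such blocks does not have to degrade the signal as depth grows.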

2.3. Cross Extraction Module

The receptive field of the shallow network is relatively small, and the overlap between receptive fields is also relatively small. Although the extracted features are close to the input and contain a considerable amount of detailed information (e.g., texture and color), the small receptive field prevents global feature information from being integrated effectively. For the intestinal polyp dataset, the polyp tissue is highly similar to the intestinal background, making it difficult to extract polyp features. Thus, a feature extraction module that increases the receptive field and has strong segmentation capability is required.

Ramachandran et al. [31] proposed that SiLU can be regarded as a smooth function between the linear function and ReLU. Its output for negative inputs is non-zero, which avoids the situation in ReLU where some neurons may never be activated. Accordingly, this study replaces the commonly used ReLU activation function with SiLU to enhance the network’s learning ability. He et al. [14] proposed that the ASPP module can obtain receptive fields of different sizes by aggregating the context information of different regions, with dilation rates of r = (6, 12, 18, 24) and a convolution kernel size of 3 × 3. The ASPP structure has achieved good results in semantic segmentation, but the dilation rate of r = 24 is unsuitable for directly extracting shallow features. Thus, this study removes the dilated convolution branch with a dilation rate of r = 24 from the ASPP structure, reducing the number of branches from four to three.
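The difference between the two activation functions on negative inputs can be checked numerically. The following is a small sketch using only the standard library; the function names are illustrative, and SiLU is taken in its usual form x · sigmoid(x).

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for every negative input."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x), written so that exp() never overflows."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  silu={silu(x):+.4f}")
```

For x = -1, ReLU returns 0 while SiLU returns about -0.2689, so a small gradient still flows for negative pre-activations.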

A cross extraction module is presented, comprising two main parts. The first part is a convolution block composed of a 3 × 3 convolution, a SiLU activation function, and a 1 × 1 convolution. The 3 × 3 convolution extracts features, the SiLU activation function enhances the learning ability of the network, and the 1 × 1 convolution adjusts the number of channels. The second part is the optimized ASPP structure, in which three parallel dilated convolutions with different dilation rates extract features at different scales. The dilated convolutions with different dilation rates help overcome the relatively small receptive field of the shallow network, so that more details can be captured. In addition, the computation of the second part is relatively simple. Lastly, the results of the first and second parts are concatenated and output. Figure 2 presents the structure of the cross extraction module.
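To make the receptive-field argument concrete, the span covered by a single dilated kernel follows the standard relation k + (k − 1)(r − 1). The sketch below evaluates it for the three dilation rates kept in the module and for the removed rate r = 24; the function name is an illustrative assumption.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


if __name__ == "__main__":
    for rate in (6, 12, 18, 24):
        span = effective_kernel_size(3, rate)
        print(f"r={rate:2d}: a 3 x 3 kernel spans {span} x {span} positions")
```

A 3 × 3 kernel therefore spans 13, 25, and 37 positions for r = 6, 12, and 18, and 49 positions for r = 24, while the number of weights per kernel stays at nine.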

2.4. Multi-Proportion Fusion Module

Intestinal images have simple semantics and a fixed structure, so both their deep and shallow features are significant. The shallow features retain considerable detail because of their limited receptive field. They have clear boundaries, which is critical to generating an accurate polyp segmentation map. Deep features have rough boundaries because of repeated down-sampling. Although much detailed information is lost, they still contain consistent semantics and a clear background. To segment polyps accurately, more focus should be placed on the features extracted from the deep network.

Dong et al. [21] proposed a cascaded fusion module and a similarity aggregation module to fuse deep and shallow features. They achieved good results in the fusion of deep features, but the differences between deep and shallow features were ignored. In this study, a multi-proportion fusion module is proposed to fuse deep and shallow features. It operates in three parts on the processed first-, second-, third-, and fourth-layer feature maps A, B, C, and D. In the first part, D1, D2, D3, and D4 are obtained from the feature map D by up-sampling and a convolution layer; D2 is multiplied by C and then concatenated with D1, and the concatenated features are smoothed by a convolution layer to generate the feature map C3 of the first part. In the second part, C1 and C2 are obtained from the feature map C by up-sampling and a convolution layer; the feature map B2 of the second part is generated by multiplying C1, D3, and B, and B2 is concatenated with C3. The fusion result T2 of the deep features is then obtained through a 3 × 3 convolution. In the third part, B1 is obtained from the feature map B by up-sampling and a convolution layer, and the feature map A2 of the third part is obtained by multiplying B1 with C2 and D4. Next, A2 is concatenated with the result B3 of deep feature fusion. A 3 × 3 convolution then smooths the concatenated features and adjusts the number of channels to give the final output T1 of deep and shallow feature fusion. Figure 3 illustrates the structure of the multi-proportion fusion module.

MPFM follows the strategy of multi-proportion fusion, which increases the proportion of deep features by inputting the fourth layer features four times, the third layer features three times, the second layer features two times, and the first layer features one time. MPFM is capable of obtaining clear semantics, fusing shallow features, strengthening polyp areas, and obtaining accurate segmentation results.
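The 4:3:2:1 proportion can be checked by tallying which source feature map each operand of the three parts comes from. This is only a bookkeeping sketch, not the fusion itself: the dictionary layout and function name are illustrative, and the presence of A among the operands of the third part is an assumption inferred from the statement that the first-layer features are input once.

```python
from collections import Counter
from typing import Dict, List

# Source feature map of every operand consumed by each part of the module.
# "A" in part 3 is an assumption inferred from the stated 4:3:2:1 proportion.
MPFM_PARTS: Dict[str, List[str]] = {
    "part 1": ["D", "D", "C"],       # D1, D2 and C
    "part 2": ["C", "D", "B"],       # C1, D3 and B
    "part 3": ["B", "C", "D", "A"],  # B1, C2, D4 and (assumed) A
}


def input_counts(parts: Dict[str, List[str]]) -> Counter:
    """Count how many times each layer's features enter the fusion."""
    counts: Counter = Counter()
    for operands in parts.values():
        counts.update(operands)
    return counts


if __name__ == "__main__":
    for name, count in sorted(input_counts(MPFM_PARTS).items(), reverse=True):
        print(f"{name}: used {count} time(s)")
```

Under these assumptions the tally gives D four times, C three times, B twice, and A once, matching the described strategy of weighting deeper features more heavily.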

3. Experiment

3.1. Data Preprocessing

In the intestinal polyp segmentation task, the common polyp datasets Kvasir-SEG [32] and CVC-ClinicDB [33] were each divided into a training set, a validation set, and a test set at a ratio of 8:1:1. Training a neural network requires many images, but Kvasir-SEG contains only 1000 polyp images and CVC-ClinicDB only 612, a relatively small amount of data. Therefore, the 800 images of the Kvasir-SEG training set and the 490 images of the CVC-ClinicDB training set were augmented through vertical flipping, horizontal flipping, 90-degree clockwise rotation, translation, brightness adjustment, and Gaussian blurring. In total, 8800 Kvasir-SEG training images and 5390 CVC-ClinicDB training images were obtained.
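The reported counts can be cross-checked with a few lines of arithmetic. The sketch below assumes the training share of the 8:1:1 split is rounded to the nearest integer; the helper names are illustrative, and the resulting factor of 11 is derived from the reported totals rather than stated in the text.

```python
def train_size(total_images: int, train_ratio: float = 0.8) -> int:
    """Training-set size for a given share, rounded to the nearest image."""
    return round(total_images * train_ratio)


def augmentation_factor(augmented_images: int, original_images: int) -> float:
    """How many training images exist per original training image."""
    return augmented_images / original_images


if __name__ == "__main__":
    for name, total, augmented in (
        ("Kvasir-SEG", 1000, 8800),
        ("CVC-ClinicDB", 612, 5390),
    ):
        n_train = train_size(total)
        factor = augmentation_factor(augmented, n_train)
        print(f"{name}: {n_train} training images, factor {factor:g}")
```

Both datasets give the same factor, so the augmentation pipeline appears to have been applied identically to each; how the factor decomposes across the six listed operations is not specified.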

3.2. Experimental Details

The experiments in this study were based on PyTorch, a deep learning framework. The development environment was PyCharm. The CPU was an Intel Core i7-6700HQ, and the GPU was an NVIDIA GeForce GTX 1080Ti with 8 GB of video memory. CUDA version 10.0 was used. The operating system was Windows, and the programming language was Python 3.7 from Anaconda3. Considering the different sizes of the polyp images, this study adopted a multi-scale strategy at the training stage [34]. ResNet50, which has been widely used in segmentation, was employed as the backbone network, and the input image size was adjusted to 352 × 352. Adam served as the optimizer for updating the network parameters. Figure 4 shows how the loss of the model changes during training with different initial learning rates. As Figure 4 shows, the initial learning rate should be set to 1 × 10⁻⁴. The batch size was set to 8, and the number of training epochs was set to 100.

3.3. Loss Function

Dong et al. [21] proposed a main loss and an auxiliary loss to supervise the final result and the intermediate output of the network, respectively. Qin et al. [35] proposed combining a weighted Intersection over Union (IOU) loss [36], a binary cross entropy (BCE) loss [37], and a structural similarity (SSIM) loss [38]. In this study, the two methods were used together. The main loss was calculated from the final output of MPFM and the ground truth, and the auxiliary loss was calculated from the deep-feature fusion output of MPFM and the ground truth. The IOU loss and BCE loss were combined to assign weights to different pixels: pixels that are difficult to segment were assigned larger weights, while those that are easy to segment were assigned smaller weights. Meanwhile, the SSIM loss was used to increase the boundary weight, which helped to segment accurate polyp boundaries. The loss function is written as follows:

L = L_{main} + L_{anc}    (1)

L_{main} = L_{IOU}^{w}(P1, G) + L_{BCE}^{w}(P1, G) + L_{SSIM}^{w}(P1, G)    (2)

L_{anc} = L_{IOU}^{w}(P2, G) + L_{BCE}^{w}(P2, G) + L_{SSIM}^{w}(P2, G)    (3)

where L_{main} denotes the main loss; L_{anc} is the auxiliary loss; L_{IOU}^{w}() is the weighted IOU loss; L_{BCE}^{w}() is the weighted BCE loss; L_{SSIM}^{w}() is the weighted SSIM loss; P1 is the final output result; P2 is the result of deep feature fusion; and G is the ground truth.
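The structure of Equations (1)-(3) can be sketched in plain Python under simplifying assumptions: the per-pixel weighting and the SSIM term are omitted, the IOU term is the common soft (differentiable) form, and predictions are flat lists of probabilities. All function names are illustrative; this is not the training code of the study.

```python
import math
from typing import Sequence

EPS = 1e-7  # keeps log() finite and avoids division by zero


def bce_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean binary cross entropy; pred holds per-pixel probabilities."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def soft_iou_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """1 - soft IoU, with products standing in for set intersection."""
    inter = sum(p * g for p, g in zip(pred, target))
    union = sum(p + g - p * g for p, g in zip(pred, target))
    return 1.0 - (inter + EPS) / (union + EPS)


def supervision_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Unweighted IOU + BCE for one prediction (SSIM term omitted)."""
    return soft_iou_loss(pred, target) + bce_loss(pred, target)


def total_loss(p1: Sequence[float], p2: Sequence[float],
               target: Sequence[float]) -> float:
    """L = L_main(P1, G) + L_anc(P2, G), following Equation (1)."""
    return supervision_loss(p1, target) + supervision_loss(p2, target)


if __name__ == "__main__":
    g = [1.0, 0.0, 1.0, 0.0]
    print(total_loss([0.9, 0.1, 0.8, 0.2], [0.7, 0.3, 0.6, 0.4], g))
```

Both predictions are compared with the same ground truth G, so the auxiliary term pushes the deep-feature fusion output toward the mask as well as the final output.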

3.4. Experimental Metrics

To evaluate the effect of the proposed algorithm on intestinal polyp segmentation, five indexes were used for quantitative evaluation: Recall, Precision, mean Intersection over Union (mIoU), mean Dice coefficient (mDice) [39], and Accuracy. Recall is the proportion of actual polyp pixels that the algorithm correctly identifies as polyp. Precision is the proportion of pixels judged by the algorithm to be polyp that are actually polyp. mIoU and mDice measure the degree of overlap between the segmentation results and the ground truth. Accuracy is the proportion of all pixels that the algorithm classifies correctly. All indicator values lie in [0, 1]; the closer they are to 1, the better the segmentation. Their formulas are written as follows:

Recall = \frac{TP}{TP + FN}    (4)

Precision = \frac{TP}{TP + FP}    (5)

mIoU = \frac{TP}{FP + FN + TP}    (6)

mDice = \frac{2 TP}{2 TP + FP + FN}    (7)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (8)

where TP (True Positive) means that the pixel on the real image is a polyp, and the pixel on the predicted image is also a polyp. TN (True Negative) means that the pixels on the real image are non-polyps and the pixels on the predicted image are also non-polyps. FP (False Positive) means that the pixel on the real image is a non-polyp and the pixel on the predicted image is a polyp. FN (False Negative) suggests that the pixel on the real image is a polyp and the pixel on the predicted image is a non-polyp.
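Equations (4)-(8) can be evaluated for a single image with a short sketch. Masks are assumed to be flat sequences of 0/1 values, the function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the text. The "mean" variants would then average the per-image IoU and Dice values over the test set, which is also an assumption about how the means are taken.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int],
                     truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for two binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = fn = tn = 0
    for p, g in zip(pred, truth):
        if p and g:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator: float, denominator: float) -> float:
    # Convention assumed here: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred: Sequence[int],
                         truth: Sequence[int]) -> Dict[str, float]:
    """Per-image Recall, Precision, IoU, Dice, and Accuracy."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "recall": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "iou": _ratio(tp, tp + fp + fn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
    }


if __name__ == "__main__":
    predicted = [1, 1, 1, 0, 0, 0, 0, 0]
    ground_truth = [1, 1, 0, 1, 0, 0, 0, 0]
    for name, value in segmentation_metrics(predicted, ground_truth).items():
        print(f"{name}: {value:.4f}")
```

In this example TP = 2, FP = 1, FN = 1, and TN = 4, giving Recall and Precision of 2/3, an IoU of 0.5, a Dice coefficient of 2/3, and an Accuracy of 0.75.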

3.5. Experimental Results

To verify the effectiveness of the proposed algorithm for intestinal polyp segmentation, this study selects five commonly used medical segmentation indexes for quantitative analysis and compares the results with those of Unet [10], Unet++ [11], ResUnet [29], ResUnet++ [12], PSPNet [9], Mask R-CNN [18], PraNet [20], Polyp-PVT [21], CaraNet [23], and SSFormer-L [28]. Table 2 lists the index results of the different algorithms on the test set. All indicators of the proposed algorithm were significantly higher than those of the compared algorithms. For the accuracy and mean Intersection-over-Union on the test set, ResUnet and SSFormer-L achieved the lowest and highest values, respectively, among the compared algorithms. The accuracy of the proposed algorithm was 5.75% and 0.75% higher than that of ResUnet and SSFormer-L, respectively, and the mean Intersection-over-Union was 15.04% and 1.77% higher, thus verifying the effectiveness of the proposed algorithm on intestinal polyps.

Figure 5 compares the prediction results of the different algorithms. In the first row of examples, Unet, Unet++, ResUnet, ResUnet++, and PSPNet could not segment the small polyps or segmented them only partially, and PraNet and Polyp-PVT misjudged raised tissue as polyps, whereas the proposed algorithm segmented the small polyps accurately. In the second and seventh rows, the polyp was similar to the background area: Unet, Unet++, ResUnet, and ResUnet++ predicted only part of the polyp area, and PSPNet, PraNet, and Polyp-PVT segmented the polyp edges inaccurately, while the proposed algorithm accurately predicted the position and border of the polyps. In the third and fifth rows, the images contained interfering raised tissue (e.g., folds). The comparison algorithms (e.g., Unet and Unet++) misjudged the folds as polyps, while the proposed algorithm identified the polyp area. In the fourth row, there were multiple polyps in the same image, and ResUnet segmented only some of them. Unet, Unet++, ResUnet++, PSPNet, PraNet, and Polyp-PVT were ineffective in segmenting polyps with inconspicuous features. The proposed algorithm segmented all polyps, and the edges were basically accurate. In the sixth row, ResUnet and PSPNet segmented only a small part of the polyp. The other comparison models segmented the polyp but with imprecise edges. The proposed algorithm segmented the polyp completely.

Table 2 and Figure 5 show the advantages of the proposed model over the comparison model. As revealed by both quantitative and qualitative results, the proposed algorithm can increase the accuracy of intestinal polyp image segmentation to a certain extent and shows significant advantages in the location and contour of intestinal polyp segmentation. Thus, the proposed algorithm provides a reference for deep learning in the processing and analysis of intestinal polyp images. The results of mIoU and mDice are presented in Figure 6, where the proposed model steadily improves on Kvasir-SEG and CVC-ClinicDB.

3.6. Ablation Experiment

3.6.1. Effect of Activation Function and Batch Normalization (BN) on Segmentation Results

To verify the effectiveness of the convolution block structure introduced in CEM, this study makes two comparisons: (1) the effect of different activation functions in the CEM convolution block on intestinal polyp segmentation, using the ReLU and SiLU activation functions; and (2) the effect of BN in the CEM convolution block on intestinal polyp segmentation. The effect of BN depends on the batch size: if the batch is small, the batch statistics cannot represent the feature distribution, whereas a large batch requires more memory and makes it difficult to update the parameters. Table 3 compares the experimental results. Compared with using the ReLU activation function with BN, using the SiLU activation function without BN gives the model higher performance, with Accuracy, mIoU, and mDice increasing by 0.5%, 1.04%, and 1.84%, respectively.

3.6.2. Effect of CEM on Segmentation Results

To verify the effectiveness of the proposed cross extraction module, the model without the module, denoted “N/CEM”, was used as the benchmark network for intestinal polyp segmentation. The experimental results are compared in Table 4. The ASPP module or the CBAM module was added to the benchmark network for comparison with the CEM module. After the ASPP module was added, the mDice decreased by 0.6%. After the CBAM module was added, the mDice increased by 0.05%. After the CEM module was used, the mDice increased by 1.03%. Thus, the CBAM module and the CEM module had positive effects on the segmentation of intestinal polyps, and the mDice of the proposed CEM module was 1.63% and 0.98% higher than that of the ASPP and CBAM modules, respectively. The proposed method also achieved the best Accuracy, mIoU, and Recall.

Figure 7 is a comparison chart of the segmentation results of intestinal polyps by N/CEM, ASPP, CBAM, and the proposed algorithm. Compared with the true value, N/CEM has a better effect on the segmentation of small polyps, whereas the edge contour of polyps is not clear, which is more significant when segmenting large polyps, as revealed by the second row. The ASPP network has the problem of redundant or missing segmentation for both large and small polyp segmentation, as revealed by the third row. The CBAM network can solve the problem that the baseline network has no clear polyp outline, whereas there is still the phenomenon of redundant or missing segmentation. The proposed algorithm can better predict the position and contour of polyps and effectively solve the problem that polyps of different shapes and sizes cannot be accurately segmented. In brief, multi-scale feature extraction for shallow features can help the network accurately segment the contours of polyps so that segmentation accuracy can be further increased on the original basis.

3.6.3. The Effect of MPFM on Segmentation Results

To verify the effectiveness of the proposed MPFM module, the model without the MPFM module, denoted “N/MPFM”, was used as the benchmark network in the experiment. The GPG module or the CFM module was added to the benchmark network for comparison with the MPFM module. The experimental results are compared in Table 5. After the GPG module was added, the mDice increased by 0.58%. After the CFM module was added, the mDice increased by 1.56%. After the MPFM module was used, the mDice increased by 2.35%. This verifies that all three feature fusion modules positively affect the segmentation of intestinal polyps. Meanwhile, the mDice of the proposed MPFM module was 1.77% and 0.79% higher than that of the GPG and CFM modules, respectively. The proposed method also achieved the best Accuracy, mIoU, and Recall.

Figure 8 compares the segmentation results of intestinal polyps by N/MPFM, GPG, CFM, and the proposed algorithm. In the first row, the polyp had an irregular edge, and N/MPFM, GPG, and CFM segmented the edge inaccurately, whereas the irregular edge was accurately segmented after adding the MPFM module. The examples in the second row contained interfering raised tissue, and N/MPFM, GPG, and CFM segmented normal intestinal regions as polyps. After adding the MPFM module, the polyps were accurately segmented. In the third row, the polyps were very similar to the background. N/MPFM and CFM could not determine the polyp area, and the edge segmentation by GPG was inaccurate, whereas the MPFM module segmented the polyps accurately. Thus, the proposed algorithm better predicts the position and contour of polyps and effectively reduces over- and under-segmentation of polyp regions. In brief, a multi-proportion fusion of deep and shallow features helps the network accurately segment the position and contour of polyps, so that segmentation accuracy is further increased over the baseline.

4. Conclusions

Image segmentation for enteroscopy is a critical step toward accurate visualization, diagnosis, early treatment, and surgical planning of intestinal polyps. In this study, an image segmentation method for colonoscopy based on multi-scale feature fusion is proposed. In this model, ResNet50 is used as the backbone network for feature extraction. The shallow features are further processed by the cross extraction module, and the processed deep and shallow features are fused using the proposed multi-proportion fusion module. Moreover, the result of deep feature fusion is used to compute an auxiliary loss that supervises network training. The model makes full use of the shallow features and emphasizes the deep features, thus enhancing the network’s ability to segment polyps. It performs well in the experiments on the Kvasir-SEG intestinal polyp dataset and helps computer-aided systems to be better applied to medical diagnosis. The model can be applied not only to the segmentation of intestinal polyps but also extended to other biomedical images, natural images, and other pixel classification tasks.

In the experiments, the image data received little preprocessing. In future work, the data will be processed more carefully (e.g., by removing bright spots) to improve the model’s ability to segment intestinal polyps. Furthermore, the model will be applied to the ISIC-2018 dataset, the VOC dataset, and other image datasets to verify and enhance its generalization ability.

Author Contributions

Methodology, J.Y.; software, Z.L.; validation, C.X. and B.F.; data curation, C.X. and B.F.; writing—original draft preparation, J.Y.; writing—review and editing, Z.L.; visualization, J.Y.; supervision, Z.L.; project administration, Z.L.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2019YFC0117800).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: Kvasir-SEG [https://datasets.simula.no/kvasir-seg/, accessed on 20 May 2022] and CVC-ClinicDB [https://polyp.grand-challenge.org/CVCClinicDB/, accessed on 20 May 2022].

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
2. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249.
3. Winawer, S.J.; Zauber, A.G.; Ho, M.N.; O’Brien, M.J.; Gottlieb, L.S.; Sternberg, S.S.; Waye, J.D.; Schapiro, M.; Bond, J.H.; Panish, J.F.; et al. Prevention of Colorectal Cancer by Colonoscopic Polypectomy. N. Engl. J. Med. 1993, 329, 1977–1981.
4. Leufkens, A.M.; van Oijen, M.G.H.; Vleggaar, F.P.; Siersema, P.D. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 2012, 44, 470–475.
5. Dawwas, M.F. Adenoma Detection Rate and Risk of Colorectal Cancer and Death. N. Engl. J. Med. 2014, 370, 2539–2541.
6. Mamonov, A.V.; Figueiredo, I.N.; Figueiredo, P.N.; Tsai, Y.-H.R. Automated Polyp Detection in Colon Capsule Endoscopy. IEEE Trans. Med. Imaging 2014, 33, 1488–1502.
7. Akbari, M.; Mohrekesh, M.; Nasr-Esfahani, E.; Soroushmehr, S.M.R.; Karimi, N.; Samavi, S.; Najarian, K. Polyp Segmentation in Colonoscopy Images Using Fully Convolutional Network. In Proceedings of the 40th Annual International Conference of the IEEE-Engineering-in-Medicine-and-Biology-Society (EMBC), Honolulu, HI, USA, 18–21 July 2018.
8. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; Volume 39, pp. 640–651.
9. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015.
11. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11.
12. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; de Lange, T.; Halvorsen, P.; Johansen, H.D. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the 21st IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019.
13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2020; Volume 42, pp. 2011–2023.
14. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
16. Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context Pyramid Fusion Network for Medical Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018.
17. Kang, J.; Gwak, J. Ensemble of Instance Segmentation Models for Polyp Segmentation in Colonoscopy Images. IEEE Access 2019, 7, 26440–26447.
18. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
19. Qadir, H.A.; Shin, Y.; Solhusvik, J.; Bergsland, J.; Aabakken, L.; Balasingham, I. Polyp Detection and Segmentation using Mask R-CNN: Does a Deeper Feature Extractor CNN Always Perform Better? In Proceedings of the 13th International Symposium on Medical Information and Communication Technology (ISMICT), Oslo, Norway, 8–10 May 2019.
20. Fan, D.-P.; Ji, G.-P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. PraNet: Parallel Reverse Attention Network for Polyp Segmentation. In Proceedings of the 2020 International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 263–273.
21. Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers. arXiv 2021, arXiv:2108.06932.
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
23. Lou, A.; Guan, S.; Ko, H.; Loew, M.H. CaraNet: Context axial reverse attention network for segmentation of small medical objects. In Proceedings of the SPIE Medical Imaging 2022: Image Processing, San Diego, CA, USA, 20 February–28 March 2022.
24. Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention 2021, Strasbourg, France, 27 September–1 October 2021; pp. 14–24.
25. Srivastava, A.; Jha, D.; Chanda, S.; Pal, U.; Johansen, H.; Johansen, D.; Riegler, M.; Ali, S.; Halvorsen, P. MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation. IEEE J. Biomed. Health Inform. 2021, 26, 2252–2263.
26. Srivastava, A.; Chanda, S.; Jha, D.; Pal, U.; Ali, S. GMSRF-Net: An improved generalizability with global multi-scale residual fusion network for polyp segmentation. arXiv 2021, arXiv:2111.10614.
27. Jiang, D.; Sun, B.; Su, S.; Zuo, Z.; Wu, P.; Tan, X. FASSD: A Feature Fusion and Spatial Attention-Based Single Shot Detector for Small Object Detection. Electronics 2020, 9, 1536.
28. Wang, J.; Huang, Q.; Tang, F.; Meng, J.; Su, J.; Song, S. Stepwise Feature Fusion: Local Guides Global. arXiv 2022, arXiv:2203.03635.
29. Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
31. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941.
32. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; Lange, T.D.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A Segmented Polyp Dataset. In Proceedings of the 26th International Conference on MultiMedia Modeling (MMM), Daejeon, Korea, 5–8 January 2020.
33. Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 2015, 43, 99–111.
34. Huang, C.H.; Wu, H.Y.; Lin, Y.L. HarDNet-MSEG: A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS. arXiv 2021, arXiv:2101.07172.
35. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-Aware Salient Object Detection. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019.
36. Mattyus, G.; Luo, W.; Urtasun, R. DeepRoadMapper: Extracting Road Topology from Aerial Images. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
37. De Boer, P.T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A Tutorial on the Cross-Entropy Method. Ann. Oper. Res. 2005, 134, 19–67.
38. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003.
39. Graham, S.; Vu, Q.D.; Raza, S.E.A.; Azam, A.; Tsang, Y.W.; Kwak, J.T.; Rajpoot, N. Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 2019, 58, 101563.

Figure 1. Structure diagram of the network.

Figure 2. Structure diagram of cross extraction module.

Figure 3. Structure diagram of multi-proportion fusion module.

Figure 4. Loss curves of the model over the training epochs. (a) learning rate is 1 × 10⁻³. (b) learning rate is 1 × 10⁻⁴. (c) learning rate is 1 × 10⁻⁵.

Figure 5. Comparison of qualitative results from the Kvasir-SEG dataset. (a) image. (b) mask. (c) Unet. (d) ResUNet. (e) UNet++. (f) ResUNet++. (g) PSPNet. (h) Mask R-CNN. (i) PraNet. (j) Polyp-PVT. (k) CaraNet. (l) SSFormer-L. (m) ours.

Figure 6. Box plots used to evaluate generalization ability. (a) mDice results of the proposed model and the comparison methods on Kvasir-SEG and CVC-ClinicDB. (b) mIoU results of the proposed model and the comparison methods on Kvasir-SEG and CVC-ClinicDB.

Figure 7. Comparison of segmentation results with and without CEM from the Kvasir-SEG dataset. The image, the mask, and the segmentation results of N/CEM, ASPP, CBAM, and the proposed model are presented from left to right.

Figure 8. Comparison of segmentation results with and without MPFM from the Kvasir-SEG dataset. The image, the mask, the segmentation results of N/MPFM, GPG, CFM, and the proposed model are presented from left to right.

Table 1. Strengths and weaknesses of common models and the model proposed in this study.

Model	Strength	Weakness
Unet [10], Unet++ [11], ResUnet [29], ResUnet++ [12]	These four models can use fewer training sets for end-to-end training, fully use context information, and have good output results.	The detailed information is lost during sampling under these four models, and there are numerous repeated operations in the training process.
PSPNet [9]	The PPM module is used in this model to aggregate global context information.	Considerable detail information is lost in the sampling process of this model, leading to the imprecise edge of the segmentation result.
Mask R-CNN [18]	The model is segmented based on target detection and can achieve high accuracy.	The model needs to generate the region of interest first, then classify the object and return the bounding box, which is usually slow.
PraNet [20], CaraNet [23]	In this model, advanced features are used to capture the rough position of polyp tissue, and the reverse attention module mines the edge information to obtain accurate segmentation results.	This model mainly focuses on edge information and ignores context information at different scales.
Polyp-PVT [21], SSFormer-L [28]	The model uses a transformer as the encoder to have the whole image sensing range and fully uses the global context information.	The model does not acquire enough local information, which affects the final segmentation result.
Ours	The proposed model fully considers multi-scale context information and uses feature fusion modules for different proportions of fusion.	The model has modules to deal with shallow features, which may require more computing resources.

Table 2. Index results of different algorithms on the test set.

Model	Accuracy	Recall	Precision	mIoU	mDice
Unet	0.9341	0.9141	0.7709	0.8027	0.8179
ResUnet	0.9074	0.8653	0.7095	0.7369	0.7877
Unet++	0.9404	0.9171	0.7998	0.8165	0.8211
ResUnet++	0.9389	0.9181	0.7976	0.8006	0.8132
PSPNet	0.9346	0.9407	0.7478	0.8161	0.8091
Mask R-CNN	0.9289	0.8794	0.7984	0.7899	0.7962
PraNet	0.9431	0.9389	0.8583	0.8568	0.8981
Polyp-PVT	0.9441	0.9387	0.8541	0.8621	0.9171
CaraNet	0.9562	0.9326	0.8614	0.8627	0.9163
SSFormer-L	0.9574	0.9384	0.8662	0.8696	0.9156
Ours	0.9649	0.9401	0.9011	0.8873	0.9192
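
For readers who want to reproduce the columns of Table 2, the following sketch computes the five metrics for a single binary prediction from its confusion counts, using the standard definitions. It assumes that mIoU and mDice are obtained by averaging the per-image IoU and Dice over the test set, which may differ in detail from the protocol used in this study. The function names are hypothetical.

```python
def confusion_counts(pred, mask):
    """Count true/false positives and negatives for flat 0/1 sequences."""
    tp = fp = fn = tn = 0
    for p, m in zip(pred, mask):
        if p and m:
            tp += 1
        elif p and not m:
            fp += 1
        elif not p and m:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, mask):
    """Accuracy, recall, precision, IoU, and Dice for one image."""
    tp, fp, fn, tn = confusion_counts(pred, mask)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty denominators

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


def mean_metric(per_image_metrics, name):
    """Average one metric over a list of per-image metric dictionaries."""
    return sum(m[name] for m in per_image_metrics) / len(per_image_metrics)


# Toy example: eight pixels, flattened prediction and ground-truth mask.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
mask = [1, 1, 0, 1, 0, 0, 0, 0]
print(segmentation_metrics(pred, mask))
```

In this toy case there are two true positives, one false positive, one false negative, and four true negatives, so the IoU is 2/4 and the Dice coefficient is 4/6.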

Table 3. Comparison before and after using different activation functions and Batch Normalization (BN).

Model	Activation	BN	Accuracy	Recall	Precision	mIoU	mDice
CEM	NO	NO	0.9606	0.9288	0.8936	0.8743	0.9003
	NO	YES	0.9619	0.9279	0.9225	0.8775	0.9076
	ReLU	NO	0.9615	0.9318	0.9014	0.8779	0.9075
	ReLU	YES	0.9599	0.9331	0.8951	0.8769	0.9008
	SiLU	NO	0.9649	0.9401	0.9011	0.8873	0.9192
	SiLU	YES	0.9605	0.9287	0.8963	0.8746	0.9011
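
The two activation functions compared in Table 3 can be written out as a short sketch. SiLU is defined as x · sigmoid(x), and ReLU as max(0, x). The code below uses plain Python scalars for clarity and is not the implementation used in the experiments.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    """ReLU: passes positive inputs, zeroes out negative ones."""
    return max(0.0, x)


def silu(x):
    """SiLU: the input scaled by its own sigmoid."""
    return x * sigmoid(x)


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, relu(value), silu(value))
```

Unlike ReLU, SiLU is smooth and returns small negative values for negative inputs instead of exactly zero.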

Table 4. Comparison of experimental results with or without CEM.

Model	Accuracy	Recall	Precision	mIoU	mDice
N/CEM	0.9612	0.9308	0.9075	0.8784	0.9089
ASPP	0.9612	0.9278	0.9119	0.8777	0.9029
CBAM	0.9612	0.9336	0.8949	0.8785	0.9094
Ours	0.9649	0.9401	0.9011	0.8873	0.9192

Table 5. Comparison of experimental results with or without MPFM.

Model	Accuracy	Recall	Precision	mIoU	mDice
N/MPFM	0.9596	0.9256	0.8966	0.8724	0.8957
GPG	0.9592	0.9304	0.8964	0.8734	0.9015
CFM	0.9614	0.9366	0.8935	0.8809	0.9113
Ours	0.9649	0.9401	0.9011	0.8873	0.9192

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yu, J.; Li, Z.; Xu, C.; Feng, B. A Segmentation Algorithm of Colonoscopy Images Based on Multi-Scale Feature Fusion. Electronics 2022, 11, 2501. https://doi.org/10.3390/electronics11162501

