Article

BMFA-Net: Boundary Constraint Multi-Level Feature Aggregation Framework for Precise Polyp Segmentation

1 Department of Information Science and Engineering, University of Jinan, Jinan 250022, China
2 Shandong Provincial Key Laboratory of Network-Based Intelligent Computing, University of Jinan, Jinan 250022, China
3 Department of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400054, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4063; https://doi.org/10.3390/app14104063
Submission received: 12 March 2024 / Revised: 7 May 2024 / Accepted: 7 May 2024 / Published: 10 May 2024

Abstract

Colorectal cancer is one of the three most common cancers worldwide. Polyps are complex and have unclear boundaries, often leading to inaccurate boundary segmentation and missed detections. To address these challenges, we propose a boundary constraint multi-level feature aggregation framework called BMFA-Net to precisely segment polyps. The framework comprises four key modules. First, the parallel partial decoder is introduced to aggregate high-level features within the network to generate a globally informative semantic map serving as the initial guidance region for reverse erasing the foreground. Second, we propose an efficient atrous convolution attention module to effectively aggregate local and global contextual information over multiple levels. Additionally, a multi-level feature aggregation mechanism is designed and placed among the efficient atrous convolution attention modules to enable the network to capture a large amount of semantic structure while preserving intricate details. Finally, a boundary constraint reverse attention module is proposed to perform the boundary constraint while removing the foreground to improve the quality of boundary segmentation. Extensive experiments demonstrated the superiority and versatility of our framework compared with state-of-the-art methods; specifically, it achieved a mean Dice score of 0.922 on the CVC-300 dataset.

1. Introduction

Colorectal cancer is a prevalent malignant tumor of the digestive tract, primarily occurring at the junction of the rectum and colon; colon cancer and rectal cancer are therefore often referred to collectively as colorectal cancer. In China, colorectal cancer ranks second in the number of cases and fourth in fatalities among all malignant tumors. The burden of colorectal cancer in China is substantial, with new cases accounting for approximately 30% of global cases, and East Asia contributing more than 75% of these cases. Despite these alarming statistics, the situation for its prevention and control remains challenging, as reported by [1]. The high incidence and mortality rates of colorectal cancer place immense demands and stress on medical professionals, often leading to a decline in their well-being and an increased risk of errors during diagnosis and surgery. Machine-assisted diagnosis and treatment could therefore offer several benefits: it can enhance diagnostic accuracy, improve patient survival rates, and significantly alleviate the pressure on medical practitioners. Furthermore, machine-assisted diagnosis can address the shortage of expert personnel, particularly in larger hospitals, resulting in a more efficient healthcare system.
In clinical practice, colonoscopy serves as a valuable tool for providing crucial information regarding the location and appearance of colorectal polyps. However, determining the condition of colorectal polyps based solely on a colonoscopy can be challenging. Our analysis of related polyp datasets revealed three challenges. First, each pathological polyp encompasses multiple categories, and even within the same polyp type there is no uniform appearance; this lack of uniform texture information greatly increases the difficulty of segmentation and can easily lead to false detections. Second, multiple polyps may be present in the same colonoscopy image; in this case, the segmentation network often focuses on only one of the polyps, resulting in missed detections. Finally, owing to the tissue-based nature of polyps, there is often no clear boundary between a polyp and the surrounding mucosal tissue, which is a major challenge that most segmentation networks do not handle well.
To tackle these challenges, this paper introduces a novel deep neural network designed for precise polyp segmentation, named the boundary constraint multi-level feature aggregation framework (BMFA-Net). Our research was grounded in three key considerations. First, Geirhos et al. [2] pointed out that convolutional neural networks pay more attention to texture extraction; we therefore designed a five-layer convolutional network with Res2Net as the basic skeleton. In contrast to CaraNet [3], which relies on single-level feature utilization, we drew inspiration from CSPNet [4], whose findings reveal that multi-level feature aggregation yields features with superior discrimination and robustness; this enhances classification, detection, and identification performance while acquiring more accurate and dependable spatial information, which in turn improves overall model accuracy. We therefore designed the framework to aggregate the information of the previous level with that of the following level, so that the network learns information across multiple levels and attends to both global and local information, improving the overall accuracy of the model. Second, unlike the channel-wise feature pyramid (CFP) module (CFPNet [5]) adopted by CaraNet, we exploited atrous convolution operators to extend the receptive fields. This extension allows global semantic information to be integrated without sacrificing resolution or coverage. Moreover, we designed atrous convolution layers of different scales as a parallel multi-branch structure followed by attention mechanisms, which enables the network to learn joint features of the local and surrounding context from a multi-scale perspective and to refine them with global context. This significantly improves the accuracy of semantic segmentation and greatly reduces the probability of missed detections in the special case of multiple polyps in one image. Finally, in convolutional networks, richer edge position information resides in the lower levels and richer semantic information in the higher levels. We multiplied the second layer with the fifth layer to highlight the polyp area, multiplied the second layer with the reversed global semantic map to highlight the non-polyp area, and so on. This allows the network to perform boundary constraints while reverse-erasing the foreground target, effectively addressing the problem of unclear boundary segmentation. In summary, this paper's contributions can be distilled into three main aspects:
(1) The parallel partial decoder (PPD) is used to leverage high-level semantic global feature maps and generate a globally informative semantic map. This map serves as the initial guidance region for the boundary constraint reverse attention (BCRA) module, which preserves more low-frequency details and obtains better segmentation results.
(2) We propose the efficient atrous convolution attention (EACA) module. On the one hand, it learns residual features from side outputs, enhancing the spatial representation; on the other hand, the convolutions with different dilation rates can capture both the local pixel context and surrounding context information. Additionally, a multi-level feature aggregation mechanism was designed to combine the feature output from the EACA module with deeper layers, effectively merging the intricate details found in low-level features with the abundant semantic information present in high-level features.
(3) We designed the boundary constraint reverse attention (BCRA) module, which enables the network to generate polyp masks on shallow feature maps while using deep location information as a boundary guide. This helps the network perform boundary constraints while reverse-erasing the foreground targets, thus enhancing polyp boundary features and suppressing non-polyp boundary features.
Thus, we designed the BMFA-Net network, which is a boundary constraint multi-level feature aggregation framework for precise polyp segmentation and has better segmentation performance compared with existing semantic segmentation networks.

2. Related Work

Colonoscopy allows doctors to detect and treat the disease early, significantly improving patient survival. However, it is challenging to determine the presence of colorectal polyps based on a colonoscopy alone. First, each pathological polyp contains multiple categories; for example, common polyps alone include ulcerative, prominent, and invasive variants. Second, even within the same type of polyp, there is no uniform appearance. Finally, the frequently indistinct boundaries between polyps and the surrounding mucosal tissue can lead to both under-segmentation and over-segmentation at the edges of the segmentation results; this lack of stark contrast poses a challenge to segmentation methods.
In computer vision tasks, the network responsible for extracting image features is often called the backbone network. AlexNet [6], using a convolutional neural network, won first place in the ImageNet image classification task at the time. As deeper networks were explored, ResNet [7], proposed by Kaiming He in 2015, became one of the most widely used backbone networks in recent years. Owing to ResNet's excellent performance, several networks built on its architecture have emerged, including ResNeXt [8], ResNeSt [9], and Res2Net [10]. Several recent studies ([11,12,13,14]) highlighted the importance of contextual information in improving a model's ability to predict high-quality segmentation results. Dilation8 [15] performs multi-scale context aggregation with stacked dilated convolution layers. DeepLab-v3 [16] uses atrous spatial pyramid pooling (ASPP) to capture useful contextual information across multiple scales. Subsequently, DenseASPP [17] was introduced, which densely connects a set of atrous convolution layers to generate multi-scale features.
In segmentation methods for intestinal polyps, it is common to train a classifier to distinguish between polyps and the surrounding tissue. However, these models often have a high error rate, mainly because of the obvious intra-class differences and relatively weak inter-class differences between polyp areas and their visually similar environment. For example, Brandao et al. [18] used a pre-trained fully convolutional network (FCN) to identify and segment polyps, and Akbari et al. [19] used an improved FCN to improve the accuracy of polyp segmentation. Inspired by the successful application of U-Net [20] in biomedical image segmentation, U-Net++ [21] and ResUNet++ [22] were applied to the polyp segmentation task and achieved good results. These methods focus on segmenting the entire polyp region but ignore the regional boundary constraint, which is key to improving segmentation performance. Psi-Net [23] utilizes both region and boundary information in polyp segmentation but does not fully capture the relationship between region and boundary. In addition, Fang et al. [24] proposed a selective feature aggregation network with area-boundary constraints to segment polyps. PraNet [25] uses parallel reverse attention to accurately outline polyps and achieves good segmentation results. CaraNet [3] introduced a channel-wise feature pyramid on top of PraNet so that the network can learn contextual information and improve segmentation accuracy. BoxPolyp [26] uses box annotations to alleviate the over-fitting issue of previous polyp segmentation models and generates fine-grained polyp areas through an iteratively boosted segmentation model. RFPA [27] leverages feature propagation enhancement and feature aggregation enhancement modules for more efficient feature fusion and multi-scale feature propagation. CFA-Net [28] exploits hierarchical semantic information from cross-level features, which can characterize cross-level and multi-scale information to handle the scale variations of polyps.

3. Materials and Methods

In this work, we designed the BMFA-Net network, which consists of a parallel partial decoder, four EACA modules, a multi-level feature aggregation mechanism, and three boundary constraint reverse attention modules, as shown in Figure 1. Res2Net is used as the basic skeleton, and a parallel partial decoder (PPD) generates a high-level semantic global map that serves as the initial guidance region for reverse foreground erasure. Efficient atrous convolution attention (EACA) modules are then added under the second to fifth convolutional layers to aggregate local information and global context information. The output of the second layer, the prediction from the next deeper layer, and the output of the current layer are fed into the boundary constraint reverse attention (BCRA) module, so that the network performs boundary constraints while erasing the foreground target in reverse, finally yielding the prediction map. In this section, we detail the various functional modules of the network.

3.1. Parallel Partial Decoder

At present, widely used medical image segmentation networks typically rely on U-Net or U-Net-style architectures, which densely integrate both shallow and deep features. Recent research [29] highlighted that, compared with low-level features, high-level features exhibit relatively lower granularity but are remarkably rich in semantic information. Since we intended to employ the reverse attention module for subsequent reverse erasure, we required a feature map with abundant semantic information to serve as the initial guidance region for reverse foreground erasure.
Consequently, we introduce the parallel partial decoder (PPD) to amalgamate the high-level features within the backbone network, thereby producing a globally informative semantic map. More specifically, taking an image I of size h × w as the input, five levels of features f_i, i = 1, 2, ..., 5, with resolution [h/2^{i−1}, w/2^{i−1}], can be extracted from the backbone network. As depicted in Figure 2, we employed the decoder component pd(·) [30] to aggregate the deep features through parallel connections, specifically f_i, i = 3, 4, 5, to derive joint features characterized by abundant semantic information. Computing the joint feature PD = pd(f_3, f_4, f_5) yields the global map S_g, which serves as the initial guidance region for subsequent reverse erasure in the reverse attention module.
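The following PyTorch sketch illustrates one way such a parallel aggregation could be wired: each of f_3, f_4, and f_5 is reduced with a 1 × 1 convolution, up-sampled to a common resolution, concatenated, and fused into a one-channel map. The channel widths and fusion layers are illustrative assumptions, not the exact pd(·) of [30].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPDSketch(nn.Module):
    """Minimal sketch of a parallel partial decoder: it aggregates only the
    high-level features f3, f4, f5 into a one-channel global map S_g."""
    def __init__(self, in_channels=(512, 1024, 2048), mid=32):  # assumed backbone channel widths
        super().__init__()
        # 1x1 convolutions reduce every level to a common channel width
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),  # one-channel global semantic map S_g
        )

    def forward(self, f3, f4, f5):
        target = f3.shape[2:]  # bring every level to the resolution of f3
        feats = [F.interpolate(r(f), size=target, mode="bilinear", align_corners=False)
                 for r, f in zip(self.reduce, (f3, f4, f5))]
        return self.fuse(torch.cat(feats, dim=1))
```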

3.2. Efficient Atrous Convolution Attention Module

A traditional convolutional layer often incorporates down-sampling operations, which effectively expand the receptive field but decrease the spatial size of the feature maps. This can result in the loss of spatial hierarchical information, which is detrimental when reconstructing small objects such as polyps. The atrous convolution proposed in [16,31] effectively addresses this issue by inserting gaps between the convolution kernel elements to increase the receptive field. The key parameter, the dilation rate, indicates that (dilation rate − 1) zeros are inserted between adjacent kernel elements. To capture the image context at multiple scales, atrous convolutions with various dilation rates are applied in parallel to a given input, the resulting outputs are concatenated along the channel dimension, and a 1 × 1 convolution is then used to reduce the number of channels to the desired value.
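As a minimal PyTorch illustration of this parallel-dilation pattern (the channel counts and spatial size are only examples), note that a 3 × 3 convolution with dilation d and padding d keeps the spatial size while enlarging the receptive field:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 88, 88)  # e.g., a feature map at 352/4 resolution
branches = [nn.Conv2d(64, 64, kernel_size=3, dilation=d, padding=d) for d in (1, 2, 3)]
out = torch.cat([b(x) for b in branches], dim=1)  # concatenate along the channel dimension
out = nn.Conv2d(3 * 64, 64, kernel_size=1)(out)   # 1x1 convolution restores the channel count
print(out.shape)                                   # torch.Size([1, 64, 88, 88])
```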
EACA module: Figure 3 shows the architecture of this module, which consists of three parts. (1) Atrous spatial pyramid: one 3 × 3 standard convolution (inside the yellow dashed box) acts as the local feature extractor, and three 3 × 3 atrous convolutions with dilation rates of 1, 2, and 3 (inside the red dashed box) enlarge the receptive field and act as the surrounding context extractor. (2) Joint feature extractor: the feature maps output by the atrous spatial pyramid are concatenated and passed through batch normalization (BN) and a parametric rectified linear unit (PReLU), which speeds up convergence and alleviates, to some extent, the potential vanishing-gradient problem in deep networks; this corresponds to the blue dashed box. (3) Global feature extractor: global average pooling first produces a weighting vector that fine-tunes the joint features channel by channel, emphasizing useful elements and compressing useless ones; two layers of fast one-dimensional convolution then generate channel attention so that the global context is further extracted without dimension reduction; finally, a scale layer following BN scales and shifts the normalized parameters, re-weights and corrects the joint features, and retains the valuable features to improve the feature representation ability; this corresponds to the purple dashed box.
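A hedged sketch of how the three parts could be assembled is given below. The channel widths, the exact placement of the scale layer, and the ECA-style 1-D convolutions are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EACASketch(nn.Module):
    """Sketch of the EACA module: a local branch, three dilated context branches,
    a BN/PReLU joint-feature stage, and a global branch built from global average
    pooling followed by two fast 1-D convolutions (ECA-style channel attention)."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)  # local feature extractor
        self.context = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, dilation=d, padding=d) for d in (1, 2, 3)
        ])
        self.joint = nn.Sequential(nn.Conv2d(4 * channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.PReLU())
        self.eca = nn.Sequential(nn.Conv1d(1, 1, 3, padding=1), nn.Conv1d(1, 1, 3, padding=1))

    def forward(self, x):
        feats = [self.local(x)] + [c(x) for c in self.context]
        joint = self.joint(torch.cat(feats, dim=1))            # joint feature extractor
        w = F.adaptive_avg_pool2d(joint, 1)                    # B x C x 1 x 1 channel descriptor
        w = self.eca(w.squeeze(-1).transpose(1, 2))            # 1-D convs over the channel axis
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))     # back to B x C x 1 x 1
        return joint * w                                       # re-weight the joint features
```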
MFA mechanism: In traditional target detection, typically only the last layer's output features are used for object classification and localization. This can compromise detection, especially for small target objects, which is commonly referred to as the multi-scale problem. It is therefore necessary to gather information at various scales for tasks such as fine-grained classification and semantic segmentation. Hence, we propose a multi-level feature aggregation (MFA) mechanism to consolidate multi-scale information across different levels. As depicted in Figure 1, the feature f_i output by the convolution of the current layer and the output f_{i−1} of the upper layer's EACA module are concatenated to obtain a joint feature, which is then fed into the EACA module of the current layer. The MFA mechanism incurs minimal parameter overhead while transferring detailed information from lower levels to the top layer, thereby augmenting the high-level details and enhancing the ability to detect multi-scale targets. Remarkably, the computational and memory overhead incurred by concatenation is nearly negligible.
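Functionally, the aggregation amounts to a resize-and-concatenate step before each EACA module; the sketch below assumes bilinear resizing to match the spatial sizes of the two levels.

```python
import torch
import torch.nn.functional as F

def mfa_concat(f_i, f_prev_eaca):
    """MFA sketch: resize the previous level's EACA output to the current level's
    resolution and concatenate along channels; the result feeds this level's EACA."""
    f_prev = F.interpolate(f_prev_eaca, size=f_i.shape[2:], mode="bilinear", align_corners=False)
    return torch.cat([f_i, f_prev], dim=1)
```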

3.3. Boundary Constraint Reverse Attention Module

In the realm of machine vision, most existing deep saliency models are fine-tuned from image classification networks and often struggle to effectively capture residual details. However, reverse attention [32] highlights the potential of the reverse attention module in guiding side-output residual learning in a top-down fashion. This guidance swiftly directs the network's focus toward previously undetected regions for residual capture, ultimately leading to improved performance. On this basis, the spatial resolution of the deeper features is low, which may lead to rough boundaries if polyp masks are generated on these layers. Studies such as Inf-Net [33], ET-Net [34], and EGNet [35] all demonstrated that integrating edge and positional information can lead to clearer boundaries in segmentation results. Inspired by BCNet [36], we chose to generate polyp masks on the shallow feature maps, using the positional information extracted from the deeper layers as a bilateral guide, which helps to enhance the polyp boundary features and suppress the non-polyp boundary features.
Hence, we propose the boundary constraint reverse attention (BCRA) module. As stated before, the global map generated by the parallel partial decoder is employed as the initial anchor region, as shown in Figure 4. First, the lower-branch feature S_{i+1} is passed through the sigmoid and reverse operations and then multiplied with f_2 to obtain the foreground feature B_s. Next, the upper-branch feature f_i is directly multiplied with f_2 after normalization over the width and height to obtain the background feature B_f, and B_s is summed with B_f to obtain the boundary constraint feature B_i. Finally, after down-sampling, B_i is added to S_{i+1} to obtain the output prediction map S_i. This progressive top-down reverse erasure of foreground information continues until the third layer outputs S_3; applying the sigmoid to S_3 yields the global prediction, as shown in Figure 1.
B_f = c(U(f_i) ⊙ f_2),  B_s = c((⊖ε(U(S_{i+1}))) ⊙ f_2)
B_i = B_f + B_s
where c(·) indicates a 3 × 3 convolution followed by BN and ReLU, U(·) indicates the double up-sampling operation, ε indicates the sigmoid operation, ⊖ indicates the reverse operation of subtracting the input from the all-ones matrix E, and ⊙ indicates element-wise multiplication.
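The sketch below follows the two equations directly; the 1 × 1 channel-alignment convolution and the way the final one-channel map is produced and resized are assumptions added to make the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCRASketch(nn.Module):
    """Sketch of the BCRA computation: direct branch B_f, reverse-attention branch B_s,
    their sum B_i, and a refined prediction map; c(.) is 3x3 conv + BN + ReLU."""
    def __init__(self, ch_i, ch_2):
        super().__init__()
        self.align = nn.Conv2d(ch_i, ch_2, 1)  # assumed channel alignment for f_i
        self.c_f = nn.Sequential(nn.Conv2d(ch_2, ch_2, 3, padding=1),
                                 nn.BatchNorm2d(ch_2), nn.ReLU(inplace=True))
        self.c_s = nn.Sequential(nn.Conv2d(ch_2, ch_2, 3, padding=1),
                                 nn.BatchNorm2d(ch_2), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(ch_2, 1, 1)

    def forward(self, f_i, f_2, s_next):
        def up(t):  # resize to the spatial size of f_2
            return F.interpolate(t, size=f_2.shape[2:], mode="bilinear", align_corners=False)
        b_f = self.c_f(up(self.align(f_i)) * f_2)     # B_f = c(U(f_i) (.) f_2)
        rev = 1.0 - torch.sigmoid(up(s_next))         # reverse operation: E - sigmoid(S_{i+1})
        b_s = self.c_s(rev * f_2)                     # B_s = c((reverse of S_{i+1}) (.) f_2)
        b_i = b_f + b_s                               # boundary constraint feature B_i
        s_i = F.interpolate(self.out(b_i), size=s_next.shape[2:],
                            mode="bilinear", align_corners=False) + s_next
        return s_i                                    # refined prediction map S_i
```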

3.4. Loss Function

This paper combines two loss functions to evaluate the segmentation effect: the weighted IoU loss (L_IoU^w) and the weighted binary cross-entropy loss (L_BCE^w). The total loss function is the sum of the edge loss function and the ordinary (segmentation) loss function.
The edge loss function uses the standard binary cross-entropy loss to measure the discrepancy between the predicted edge map S_e and the edge ground-truth map G_e:
L_edg = −∑_{x=1}^{w} ∑_{y=1}^{h} [G_e(x, y) log S_e(x, y) + (1 − G_e(x, y)) log(1 − S_e(x, y))]
where (x, y) are the coordinates of each pixel in the predicted edge map S_e and the edge ground-truth map G_e, and w and h represent the width and height of the corresponding feature map, respectively. The ordinary loss function is defined as
L_seg = L_IoU^w + L_BCE^w
where L_IoU^w represents the weighted IoU loss based on global and local (pixel-level) constraints, which is important for increasing the weight of difficult sample pixels, and L_BCE^w represents the weighted binary cross-entropy loss, which, compared with the standard BCE loss, pays more attention to hard sample pixels rather than assigning equal weight to all pixels. The total loss function is defined as
L_total = L_edg + L_seg(G_T, S_g^{up}) + ∑_{i=3}^{5} L_seg(G_T, S_i^{up})
Here, a deep supervision strategy is applied to the three lateral outputs (i.e., S_3, S_4, and S_5) and the global map S_g to alleviate gradient vanishing and slow convergence during training. Each map (denoted with the superscript up) is up-sampled to the same size as the ground truth G_T.
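A minimal sketch of how these losses could be computed is shown below. The weighting scheme for L_IoU^w and L_BCE^w (a 31 × 31 average-pooling weight map) is borrowed from PraNet and is an assumption here, as is the use of logits in the edge loss.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Sketch of L_seg = weighted IoU loss + weighted BCE loss; pred holds logits."""
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(edge_pred, edge_gt, s_g, side_outputs, gt):
    """L_total = L_edg + L_seg(GT, S_g) + sum_i L_seg(GT, S_i); every map is assumed
    to have been up-sampled to the ground-truth size before this call."""
    l_edg = F.binary_cross_entropy_with_logits(edge_pred, edge_gt)
    return l_edg + structure_loss(s_g, gt) + sum(structure_loss(s, gt) for s in side_outputs)
```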

4. Experiment and Results

To assess the segmentation performance of BMFA-Net, we carried out experiments across five datasets and evaluated the outcomes using six distinct metrics. We conducted separate analyses to gauge the model’s learning and generalization capabilities. Furthermore, we performed a qualitative analysis of the results.

4.1. Dataset Descriptions

We conducted abundant experiments on five polyp segmentation datasets, as shown in Table 1. The first four were standard review datasets, and the last one was the largest publicly available challenging dataset in recent times. The training set consisted of 900 images selected from the subclass of Kvasir [38] and 550 images from CVC-ClinicDB [39], totaling 1450 samples. The testing set included samples from all five datasets:
  • The complete ETIS-LaribPolypDB [40] dataset consists of 196 samples. The main characteristic of polyps in the ETIS-LaribPolypDB dataset is their small size, resulting in a notable imbalance between polyps and the background.
  • The remaining 62 images from CVC-ClinicDB. In the CVC-ClinicDB dataset, polyps vary in color, size, position, and texture.
  • The entire CVC-ColonDB [41] dataset includes 380 samples. In the CVC-ColonDB dataset, polyps are frequently occluded and indistinct.
  • Sixty images from the CVC-T dataset (CVC-T is derived from the 912 images of EndoScene [42] after excluding the 612 images that originate from CVC-ClinicDB). In the CVC-T dataset, polyps appear uniform and smooth, yet they display a significant similarity in color, texture, and other characteristics to the surrounding normal tissues.
  • The rest of the 100 images came from Kvasir. In the Kvasir dataset, polyps are large but irregularly shaped.
Table 1. Summary of published data on intestinal polyps.

Dataset | Marked Content | Modal | Size | Test/Train
ETIS-LaribPolypDB | Primary lesion of colon cancer. | CT | 196 | 196/0
CVC-ClinicDB | A total of 612 public images were cut out from 31 colonoscopy images. | CT | 612 | 62/550
CVC-ColonDB | Contains 380 images of polyps from 15 short colon video sequences. | CT | 380 | 380/0
EndoScene | A total of 912 images of polyps combined by CVC-ClinicDB and CVC-300, and 60 images of CVC-300 were selected for testing. | CT | 912 | 60/0
Kvasir | Selected 1000 polyp images from the subcategory. | CT | 1000 | 100/900
It is worth noting that the testing set was divided into two parts: one for evaluating the model’s learning ability and the other for assessing its generalization ability. The learning ability was evaluated using the remaining images from the training set, while the generalization ability was assessed using data that was not seen during training.

4.2. Evaluation Criteria

Six metrics were employed to quantitatively evaluate the results, including the mean Dice similarity coefficient and the mean intersection over union (IoU), both of which operate at the region level. The weighted Dice metric (F_β^w) was used to rectify the issue of "equally important defects" in the Dice calculation. The mean absolute error (MAE) measures the pixel-level discrepancy between the predicted map and the ground truth, making it a valuable tool for assessing pixel-level accuracy. Additionally, the enhanced alignment metric (E_φ^max) was utilized to assess both pixel-level and global-level similarities. As F_β^w and MAE are pixel-level evaluation metrics that overlook the similarity of the target structures, we also incorporated the structural index (S_α) into our evaluation; S_α aligns more closely with the human visual system and evaluates the similarity between the predictions and the ground-truth maps.
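For reference, a minimal sketch of the region-level metrics (Dice, IoU, and MAE) for a single prediction is given below; the structure measure S_α, E_φ^max, and F_β^w follow their original definitions and are omitted here.

```python
import numpy as np

def dice_iou_mae(pred, gt, eps=1e-8):
    """Dice, IoU, and MAE for one soft prediction map and one binary ground truth;
    dataset-level mDice/mIoU are the averages of these values over all test images."""
    pred_bin = (pred >= 0.5).astype(np.float64)
    gt = gt.astype(np.float64)
    inter = (pred_bin * gt).sum()
    dice = (2 * inter + eps) / (pred_bin.sum() + gt.sum() + eps)
    iou = (inter + eps) / (pred_bin.sum() + gt.sum() - inter + eps)
    mae = np.abs(pred - gt).mean()  # MAE is computed on the soft map
    return dice, iou, mae
```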
Simultaneously, this study considered eleven comparative models from the realm of intestinal polyp segmentation, namely, U-Net [20], U-Net++ [21], PraNet [25], SANet [43], SFA [24], ACSNet [44], SAM [45], CaraNet [3], RFPA [27], CFA-Net [28], and BoxPolyp [26]. To ensure a fair comparison, we executed open-source code using the same dataset and experimental environment.

4.3. Experimental Setup

BMFA-Net was implemented using the PyTorch framework. A multi-scale training strategy was used, considering the differences in the size of the polyp images. The hyperparameters were as follows: the Adam optimizer was used to update the network parameters, the learning rate was set to 1 × 10−4, and the weight decay was also set to 1 × 10−4. The input images were resized to 352 × 352, and the network was trained for 150 epochs with a mini-batch size of 16. For testing, the images were simply resized to 352 × 352 without any post-processing optimization strategy.
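The following sketch summarizes this training configuration. Only the stated hyperparameters (Adam, learning rate 1 × 10−4, weight decay 1 × 10−4, 352 × 352 inputs, 150 epochs, mini-batch size 16) come from the text; BMFANet, train_loader, compute_total_loss, and the multi-scale rates are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

model = BMFANet().cuda()                  # hypothetical model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

for epoch in range(150):                  # 150 epochs; the loader yields mini-batches of 16
    for images, masks in train_loader:    # images already resized to 352 x 352
        for rate in (0.75, 1.0, 1.25):    # assumed multi-scale training rates
            size = int(352 * rate)
            imgs = F.interpolate(images.cuda(), size=size, mode="bilinear", align_corners=False)
            gts = F.interpolate(masks.cuda(), size=size, mode="bilinear", align_corners=False)
            loss = compute_total_loss(model(imgs), gts)  # L_total from Section 3.4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```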

4.4. Model Learning Ability Analysis

Setting: In this section, we assess the learning capabilities of the proposed model using two frequently employed datasets, namely, CVC-ClinicDB and Kvasir. As illustrated in Table 1, CVC-ClinicDB comprises 612 open images extracted from 31 colonoscopy images, while 1000 polyp images were selected from the Kvasir dataset. For our experimentation, 90% of these images were allocated for training purposes, with the remaining 10% reserved for the test set.
Results: As depicted in Table 2 and Table 3, the BMFA-Net model showcased impressive learning capabilities. On the Kvasir dataset, its mean Dice score was 2.5% higher than that of the classical PraNet model. On the CVC-ClinicDB dataset, the BMFA-Net model surpassed the best-performing CFA-Net model by 0.1% and exceeded PraNet by 3.5% in terms of the mean Dice score. These comparative results underscore the continued effectiveness of the BMFA-Net model design introduced in this paper and collectively demonstrate its outstanding learning capability, affirming the viability of its design.

4.5. Model Generalization Ability Analysis

Setting: In this section, we evaluate the model's generalization capabilities using three distinct datasets, namely, ETIS-LaribPolypDB, CVC-ColonDB, and CVC-300. As shown in Table 1, all three datasets, comprising a total of 636 images, were exclusively employed as the test set. Notably, unlike CVC-ClinicDB and Kvasir, none of these images were seen by the model during training.
Results: As illustrated in Table 2, Table 4 and Table 5, the BMFA-Net model not only surpassed the existing methods but also demonstrated remarkable generalization capabilities. On CVC-ColonDB, it outperformed the RFPA model and the classical CaraNet by 0.4% and 7.6%, respectively. Similarly, on ETIS-LaribPolypDB, it surpassed RFPA and CaraNet by margins of 0.3% and 8.5%, respectively. Additionally, on CVC-300, it exhibited superior performance, where it was 2.0% and 1.7% better than CaraNet and RFPA, respectively.
Furthermore, for a more intuitive visualization of the model comparison, we provide line graphs depicting the mean Dice scores of the compared models on the CVC-300 dataset in Figure 5. A likely explanation is that the distinctive feature of CVC-300 is that it presents smooth, single targets.
However, polyps in the CVC-300 dataset closely resemble the surrounding normal tissues in terms of texture, color, and other characteristics, making them challenging to differentiate. Nonetheless, our proposed BMFA-Net excelled at extracting features at multiple scales and levels, providing superior classification performance compared with a single classification network. The parallel multi-branch structure gives the network a wider range of receptive fields, enabling it to capture broader contextual information and more detailed information. The boundary constraint reverse attention module effectively constrains the polyp boundary to enhance the segmentation results. All these data suggest that the BMFA-Net model can be easily extended to multi-color data with different domain distributions.

4.6. Qualitative Analysis

Figure 6 and Figure 7 depict the visualization results of the BMFA-Net model in comparison with other models. For each dataset, we chose a representative image that exhibited various polyp features. These visual outcomes served to demonstrate that the BMFA-Net model excelled at handling diverse environmental conditions, maintaining its robust recognition and segmentation capabilities, even in the presence of various forms of interference. Notably, the predicted edges closely aligned with the ground truth, further highlighting the model’s accuracy and effectiveness.

5. Ablation Experiment

For the BMFA-Net model, we conducted four sets of ablation experiments. The experiments utilized Res2Net + RA + PPD as the foundational framework and examined the dilation selection of the dilated convolutions together with three modules: the EACA module, the multi-level feature aggregation mechanism, and the boundary constraint reverse attention module, providing a comprehensive assessment of the efficacy of each component. The parameters and datasets used in these ablation experiments were aligned with the model's settings, ensuring the robustness of the comparative analyses. Quantitative evaluations were performed using the mean Dice metric, and the impact of each module on the segmentation performance was demonstrated on the CVC-300 and CVC-ColonDB datasets.
The dilation selection of dilated convolutions: Because atrous convolution samples the input in a checkerboard-like pattern, it can suffer from the gridding effect. Additionally, employing large dilation rates may yield favorable segmentation results for larger objects but is less suitable for smaller targets such as colorectal polyps. The selection of the dilation rates is therefore crucial. In our experiments, we tried several different dilation-rate settings, and the results are presented in Table 6. The experimentation confirmed that a larger dilation rate is not necessarily better; for small medical imaging targets like polyps, a very large receptive field is not required. Hence, following the principle that the dilation values in stacked convolutions should not have a common divisor greater than one and should be arranged in a zigzag manner, we conducted repeated experiments. The best segmentation accuracy was achieved when the dilation values were set to 1, 2, and 3, at which point the model could better learn multi-scale contextual information while preventing the gridding effect.
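The zigzag principle can be checked with a small one-dimensional experiment: stacking kernel-size-3 convolutions with dilations 1, 2, 3 lets the centre output see a contiguous neighbourhood, whereas dilation rates that share a common divisor greater than one leave gaps. The window size and the second dilation setting below are illustrative.

```python
import numpy as np

def coverage(dilations, size=64):
    """Mark which 1-D input positions can influence the centre output after stacking
    kernel-size-3 dilated convolutions with the given rates (0 offset kept each layer)."""
    reach = np.zeros(size, dtype=bool)
    reach[size // 2] = True
    for d in dilations:
        new = reach.copy()
        for i in np.where(reach)[0]:
            for off in (-d, d):
                if 0 <= i + off < size:
                    new[i + off] = True
        reach = new
    return reach

for dils in ((1, 2, 3), (2, 4, 8)):
    window = coverage(dils)[32 - 8:32 + 9].astype(int)
    print(dils, window)
# (1, 2, 3) yields a solid run of ones around the centre (no gaps);
# (2, 4, 8) reaches only even offsets, leaving zeros in between - the gridding effect.
```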
Ablation experiments of the three key modules: The experimental results are shown in Table 7, clearly illustrating the influence of each module. The efficient atrous convolution attention (EACA) module improved the model's segmentation accuracy by 3.2–6.3%. The experiments showed that the EACA module could not only explore a large amount of contextual information from multiple receptive fields, better injecting semantic information into the network, but also capture the significant dependencies between objects through the attention mechanism, thus increasing the diversity of feature extraction while reducing the computational cost. The EACA module is plug-and-play and improves the accuracy of network segmentation.
The multi-level feature aggregation (MFA) mechanism resulted in a 1.5–2.8% improvement. These findings highlight the complementary role of rich detail information from low-level features in enhancing the semantic information of high-level features. The boundary constraint reverse attention (BCRA) module boosted the segmentation accuracy by a further 0.3–3.6%. The experimental results confirmed that edge supervision in the superficial layers, using the position information extracted from the deep layers as bilateral guidance, could suppress non-polyp boundary features while enhancing polyp boundary features, thus improving the segmentation accuracy.
As shown in Figure 8, after removing the EACA module, the segmentation effect was significantly reduced, which indicates that the EACA module was the key module of BMFA-Net and played a positive role in achieving better segmentation performance. Elimination of the BCRA module caused blurred boundaries of the polyp segmentation, suggesting that the boundary constraint ability of the BCRA module made an indispensable contribution to the enhanced segmentation performance. In summary, the design of the BMFA-Net network structure was advanced and effective.

6. Discussion

The experimental data from the comparative trials between BMFA-Net and other segmentation models demonstrated the effectiveness of the BMFA-Net design. They confirmed that the multi-level feature aggregation (MFA) mechanism and the efficient atrous convolution attention (EACA) module, compared with the single-branch CaraNet, better capture local information and surrounding contextual information across different receptive fields, effectively reducing both the missed-detection and false-detection rates of polyps. The ablation studies consistently highlighted the boundary constraint reverse attention (BCRA) module as a key component of BMFA-Net, emphasizing its role in focusing on edge information and enhancing the segmentation performance. These findings align closely with those of BCNet, yet it should be noted that BCNet applies boundary constraints to joint feature maps using shallow features in an attempt to improve the quality of boundary segmentation. Joint feature maps, however, are susceptible to edge information loss because of the repeated joint operations on deep layers, and directly incorporating shallow features, which often contain abundant noise, may not sufficiently mitigate this loss. Conversely, during segmentation, BMFA-Net employs bilateral constraints on deep features and reverse attention features using shallow features. This approach not only effectively compensates for the edge information loss caused by zero padding at the edges of the convolution kernels in the EACA module but also addresses the under-segmentation and over-segmentation of the results at the edges during the reverse erasure of the foreground in the reverse attention module.
Extensive experiments showed that the proposed BMFA-Net model achieved better performance than several state-of-the-art methods on five challenging datasets, especially on the CVC-300 dataset. Nevertheless, it still has certain limitations. First, the current model encounters specific failure scenarios, as shown in Figure 9. These limitations mainly arise when an image contains multiple polyps or when polyp regions are difficult even for the human eye to distinguish from normal tissue; in such cases, the model may miss detections. It is also worth noting that the model proposed in this paper focuses on polyp segmentation, and the experiments were not extended to other medical objects. Therefore, the next research stage is to apply the innovative method proposed in this paper to tumor segmentation in different anatomical regions of the human body.

7. Conclusions

This paper introduces a novel boundary constraint multi-level feature aggregation framework for polyp segmentation named BMFA-Net. We utilized parallel partial decoders to aggregate high-level features, generating a global mapping as a guidance region for reverse foreground erasure. Furthermore, we propose the efficient atrous convolution attention (EACA) module, which extracts contextual information from different receptive fields and learns residual features from lateral outputs to enhance the resolution and reduce polyp omission rates. Additionally, we designed a multi-level feature aggregation (MFA) mechanism to integrate feature information across different layers, compensating for the lack of detailed features in semantic information. Moreover, we propose the boundary constraint reverse attention (BCRA) module to erase the foreground layer by layer while performing boundary constraints. Extensive experimental results demonstrate that our model outperformed existing methods on challenging datasets.

Author Contributions

Conceptualization, Q.L. and J.Z.; methodology, Q.L.; software, Q.L.; validation, Q.L., T.Z. and J.Z.; formal analysis, P.M.M.; investigation, P.M.M.; resources, J.Z.; data curation, Q.L. and P.M.M.; writing—original draft preparation, Q.L.; writing—review and editing, T.Z., Q.L. and J.Z.; visualization, Q.L.; supervision, T.Z. and J.Z.; project administration, J.Z.; funding acquisition, J.Z. All authors read and agreed to the published version of this manuscript.

Funding

This research was funded by (1) the National Natural Science Foundation of China under Grant No. 52171310 (Tianchi Zhang, 2022–2025) and (2) the National Natural Science Foundation of China (Youth) under Grant No. 52001039 (Jing Zhang, 2021–2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sources are cited within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BMFA-Net: Boundary constraint multi-level feature aggregation framework for precise polyp segmentation
EACA: Efficient atrous convolution attention module
PPD: Parallel partial decoder
BCRA: Boundary constraint reverse attention module
MFA: Multi-level feature aggregation mechanism

References

  1. Chen, Q.; Liu, Z.; Cheng, L.; Song, G.; Sun, X.; Zheng, R.; Zhang, S.; Chen, W. Analysis of colorectal cancer incidence and mortality in China from 2003 to 2007. China Oncol. 2012, 21, 179–182. [Google Scholar]
  2. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231. [Google Scholar]
  3. Lou, A.; Guan, S.; Loew, M. CaraNet: Context axial reverse attention network for segmentation of small medical objects. J. Med. Imaging 2023, 10, 014005. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, C.; Liao, H.M.; Yeh, I.; Wu, Y.; Chen, P.; Hsieh, J. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  5. Lou, A.; Loew, M.H. CFPNET: Channel-Wise Feature Pyramid for Real-Time Semantic Segmentation. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1894–1898. [Google Scholar]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  8. Xie, S.; Girshick, R.B.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  9. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.W.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 2735–2745. [Google Scholar]
  10. Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P.H. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  11. Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2393–2402. [Google Scholar]
  12. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a Discriminative Feature Network for Semantic Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1857–1866. [Google Scholar]
  13. Zhang, H.; Dana, K.J.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context Encoding for Semantic Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7151–7160. [Google Scholar]
  14. Zhang, R.; Zheng, Y.; Poon, C.C.; Shen, D.; Lau, J.Y. Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker. Pattern Recognit. 2018, 83, 209–219. [Google Scholar] [CrossRef] [PubMed]
  15. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  16. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  17. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
  18. Brandao, P.; Mazomenos, E.B.; Ciuti, G.; Caliò, R.; Bianchi, F.; Menciassi, A.; Dario, P.; Koulaouzidis, A.; Arezzo, A.; Stoyanov, D. Fully convolutional neural networks for polyp segmentation in colonoscopy. In Medical Imaging; SPIE: Bellingham, WA, USA, 2017. [Google Scholar]
  19. Akbari, M.; Mohrekesh, M.; Nasr-Esfahani, E.; Soroushmehr, S.M.; Karimi, N.; Samavi, S.; Najarian, K. Polyp Segmentation in Colonoscopy Images Using Fully Convolutional Network. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 69–72. [Google Scholar]
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  21. Zhou, Z.; Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. [Google Scholar]
  22. Jha, D.; Smedsrud, P.H.; Riegler, M.; Johansen, D.; Lange, T.D.; Halvorsen, P.; Johansen, H.D. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–2255. [Google Scholar]
  23. Murugesan, B.; Sarveswaran, K.; Shankaranarayana, S.M.; Ram, K.; Sivaprakasam, M. Psi-Net: Shape and boundary aware joint multi-task deep network for medical image segmentation. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; pp. 7223–7226. [Google Scholar]
  24. Fang, Y.; Chen, C.; Yuan, Y.; Tong, R.K. Selective Feature Aggregation Network with Area-Boundary Constraints for Polyp Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019. [Google Scholar]
  25. Fan, D.; Ji, G.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. PraNet: Parallel Reverse Attention Network for Polyp Segmentation. arXiv 2020, arXiv:2006.11392. [Google Scholar]
  26. Wei, J.; Hu, Y.; Li, G.; Cui, S.; Kevin Zhou, S.; Li, Z. BoxPolyp: Boost Generalized Polyp Segmentation Using Extra Coarse Bounding Box Annotations. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2022; MICCAI 2022; Lecture Notes in Computer Science; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer: Cham, Switzerland, 2022; Volume 13433. [Google Scholar] [CrossRef]
  27. Su, Y.; Shen, Y.; Ye, J.; He, J.; Cheng, J. Revisiting Feature Propagation and Aggregation in Polyp Segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; MICCAI 2023; Lecture Notes in Computer Science; Greenspan, H., Ed.; Springer: Cham, Switzerland, 2023; Volume 14224. [Google Scholar] [CrossRef]
  28. Zhou, T.; Zhou, Y.; He, K.; Gong, C.; Yang, J.; Fu, H.; Shen, D. Cross-level Feature Aggregation Network for Polyp Segmentation. Pattern Recognit. 2023, 140, 109555. [Google Scholar] [CrossRef]
  29. Wu, Z.; Su, L.; Huang, Q. Cascaded Partial Decoder for Fast and Accurate Salient Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3902–3911. [Google Scholar]
  30. Wu, Z.; Su, L.; Huang, Q. Stacked Cross Refinement Network for Edge-Aware Salient Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7263–7272. [Google Scholar]
  31. Wu, T.; Tang, S.; Zhang, R.; Zhang, Y. CGNet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2018, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  32. Chen, S.; Tan, X.; Wang, B.; Hu, X. Reverse Attention for Salient Object Detection. arXiv 2018, arXiv:1807.09940. [Google Scholar]
  33. Fan, D.; Zhou, T.; Ji, G.; Zhou, Y.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Inf-Net: Automatic COVID-19 Lung Infection Segmentation From CT Images. IEEE Trans. Med. Imaging 2020, 39, 2626–2637. [Google Scholar] [CrossRef] [PubMed]
  34. Zhang, Z.; Fu, H.; Dai, H.; Shen, J.; Pang, Y.; Shao, L. ET-Net: A Generic Edge-aTtention Guidance Network for Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019. [Google Scholar]
  35. Zhao, J.; Liu, J.; Fan, D.; Cao, Y.; Yang, J.; Cheng, M. EGNet: Edge Guidance Network for Salient Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8778–8787. [Google Scholar]
  36. Yue, G.; Han, W.; Jiang, B.; Zhou, T.; Cong, R.; Wang, T. Boundary Constraint Network With Cross Layer Feature Integration for Polyp Segmentation. IEEE J. Biomed. Health Inform. 2022, 26, 4090–4099. [Google Scholar] [CrossRef]
  37. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial Attention in Multidimensional Transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
  38. Jha, D.; Smedsrud, P.H.; Riegler, M.; Halvorsen, P.; Lange, T.D.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A Segmented Polyp Dataset. arXiv 2019, arXiv:1911.07069. [Google Scholar]
  39. Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Miguel, C.R.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med Imaging Graph. Off. J. Comput. Med Imaging Soc. 2015, 43, 99–111. [Google Scholar] [CrossRef]
  40. Silva, J.; Histace, A.; Romain, O.; Dray, X.; Granado, B. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg. 2014, 9, 283–293. [Google Scholar] [CrossRef]
  41. Tajbakhsh, N.; Gurudu, S.R.; Liang, J. Automated Polyp Detection in Colonoscopy Videos Using Shape and Context Information. IEEE Trans. Med. Imaging 2016, 35, 630–644. [Google Scholar] [CrossRef]
  42. Vázquez, D.; Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; López, A.M.; Romero, A.; Drozdzal, M.; Courville, A.C. A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images. J. Healthc. Eng. 2017, 2017, 037190. [Google Scholar] [CrossRef]
  43. Wei, J.; Hu, Y.; Zhang, R.; Li, Z.; Zhou, S.; Cui, S. Shallow Attention Network for Polyp Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021. [Google Scholar]
  44. Zhang, R.; Li, G.; Li, Z.; Cui, S.; Qian, D.; Yu, Y. Adaptive Context Selection for Polyp Segmentation. arXiv 2020, arXiv:2301.04799. [Google Scholar]
  45. Zhou, T.; Zhang, Y.; Zhou, Y.; Wu, Y.; Gong, C. Can SAM Segment Polyps? arXiv 2023, arXiv:2304.07583. [Google Scholar]
Figure 1. The framework of our method. It consists of a parallel partial decoder, four EACA modules, a multi-level feature aggregation mechanism, and three boundary constraint reverse attention modules. The parallel partial decoder (PPD) generates high-level semantic global maps. The efficient atrous convolution attention (EACA) module aggregates contextual information. The multi-level feature aggregation mechanism obtains multi-level information. The boundary constraint reverse attention (BCRA) module enables the network to generate polyp masks on shallow feature maps and use deep location information as a boundary guide. G_e represents the edge map extracted from the ground truth (G_T) using algorithms. L_edg represents the deep supervision applied to the predicted edge map S_e.
Figure 2. Parallel partial decoder structure.
Figure 3. Efficient atrous convolution attention module structure. The yellow dashed box denotes a standard convolution with a kernel size of 3, which is employed for local feature extraction. The red dashed box utilizes atrous convolutions to extract contextual features from the surrounding area. The blue dashed box concatenates the local features and the surrounding context features. Within the purple dashed box resides the global feature extractor.
Figure 4. Boundary constraint reverse attention module structure. f_i represents the output of the third, fourth, and fifth layers after passing through the EACA; f_2 represents the output of the second layer after passing through the EACA; width and height represent the axial attention [37].
Figure 5. Line plot of the generalization ability of the nine models on the CVC-300 dataset.
Figure 6. Visualization results with the current models. As we can see, the proposed model could accurately locate and segment the polyps, regardless of size.
Figure 7. Visualization results with the current models.
Figure 8. Visualization results of the ablation experiment.
Figure 9. Some failed cases of BMFA-Net.
Table 2. Segmentation results of the recent models on five datasets (mDice / mIoU).

Method | CVC-ClinicDB | Kvasir | CVC-ColonDB | ETIS-LaribPolypDB | CVC-300
CaraNet (SPIE MI'22) | 0.921 / 0.876 | 0.913 / 0.859 | 0.775 / 0.700 | 0.740 / 0.660 | 0.902 / 0.836
BoxPolyp (MICCAI'22) | 0.904 / 0.849 | 0.910 / 0.857 | 0.820 / 0.741 | 0.829 / 0.742 | 0.903 / 0.835
RFPA (MICCAI'23) | 0.931 / 0.885 | 0.928 / 0.880 | 0.837 / 0.759 | 0.822 / 0.746 | 0.905 / 0.839
CFA-Net (PR'23) | 0.933 / 0.883 | 0.915 / 0.861 | 0.743 / 0.665 | 0.732 / 0.655 | 0.893 / 0.827
Ours | 0.934 / 0.887 | 0.923 / 0.875 | 0.841 / 0.764 | 0.825 / 0.744 | 0.922 / 0.843
Table 3. Quantitative results of test datasets from CVC-ClinicDB and Kvasir.

Kvasir:
Model | Mean Dice | Mean IoU | F_β^w | S_α | E_φ^max | MAE
U-Net | 0.818 | 0.746 | 0.794 | 0.858 | 0.893 | 0.055
U-Net++ | 0.821 | 0.743 | 0.808 | 0.862 | 0.910 | 0.048
ACSNet | 0.898 | 0.838 | 0.882 | 0.920 | 0.952 | 0.032
SAM | 0.782 | 0.710 | 0.773 | 0.832 | 0.836 | 0.061
EU-Net | 0.908 | 0.854 | 0.893 | 0.917 | 0.954 | 0.028
PraNet | 0.898 | 0.840 | 0.885 | 0.915 | 0.948 | 0.030
SANet | 0.904 | 0.847 | 0.892 | 0.915 | 0.953 | 0.028
Ours | 0.923 | 0.875 | 0.915 | 0.933 | 0.972 | 0.021

CVC-ClinicDB:
Model | Mean Dice | Mean IoU | F_β^w | S_α | E_φ^max | MAE
U-Net | 0.823 | 0.755 | 0.811 | 0.889 | 0.954 | 0.019
U-Net++ | 0.794 | 0.729 | 0.785 | 0.873 | 0.931 | 0.022
ACSNet | 0.882 | 0.826 | 0.873 | 0.927 | 0.959 | 0.011
SAM | 0.579 | 0.526 | 0.563 | 0.744 | 0.685 | 0.057
EU-Net | 0.902 | 0.846 | 0.891 | 0.936 | 0.965 | 0.011
PraNet | 0.899 | 0.849 | 0.896 | 0.936 | 0.979 | 0.009
SANet | 0.916 | 0.859 | 0.909 | 0.939 | 0.976 | 0.012
Ours | 0.934 | 0.887 | 0.942 | 0.959 | 0.994 | 0.005
Table 4. Quantitative results of test datasets from CVC-ColonDB and CVC-300.

CVC-ColonDB:
Model | Mean Dice | Mean IoU | F_β^w | S_α | E_φ^max | MAE
U-Net | 0.512 | 0.444 | 0.498 | 0.712 | 0.776 | 0.061
U-Net++ | 0.483 | 0.410 | 0.467 | 0.691 | 0.760 | 0.064
ACSNet | 0.716 | 0.649 | 0.697 | 0.829 | 0.851 | 0.039
SAM | 0.468 | 0.422 | 0.463 | 0.690 | 0.608 | 0.054
EU-Net | 0.756 | 0.681 | 0.730 | 0.831 | 0.872 | 0.045
PraNet | 0.709 | 0.640 | 0.696 | 0.819 | 0.869 | 0.045
SANet | 0.753 | 0.670 | 0.726 | 0.837 | 0.878 | 0.043
Ours | 0.841 | 0.764 | 0.799 | 0.913 | 0.917 | 0.037

CVC-300:
Model | Mean Dice | Mean IoU | F_β^w | S_α | E_φ^max | MAE
U-Net | 0.710 | 0.627 | 0.684 | 0.843 | 0.876 | 0.022
U-Net++ | 0.707 | 0.624 | 0.687 | 0.839 | 0.898 | 0.018
ACSNet | 0.863 | 0.787 | 0.825 | 0.923 | 0.968 | 0.013
SAM | 0.726 | 0.676 | 0.729 | 0.849 | 0.826 | 0.020
EU-Net | 0.837 | 0.765 | 0.805 | 0.904 | 0.933 | 0.015
PraNet | 0.871 | 0.797 | 0.843 | 0.925 | 0.972 | 0.010
SANet | 0.888 | 0.815 | 0.859 | 0.928 | 0.972 | 0.008
Ours | 0.922 | 0.843 | 0.887 | 0.951 | 0.990 | 0.004
Table 5. Quantitative results of test datasets from ETIS-LaribPolypDB.

Model | Mean Dice | Mean IoU | F_β^w | S_α | E_φ^max | MAE
U-Net | 0.398 | 0.335 | 0.366 | 0.684 | 0.740 | 0.036
U-Net++ | 0.401 | 0.344 | 0.390 | 0.683 | 0.776 | 0.035
ACSNet | 0.578 | 0.509 | 0.530 | 0.754 | 0.764 | 0.059
SAM | 0.551 | 0.507 | 0.544 | 0.751 | 0.687 | 0.030
EU-Net | 0.687 | 0.609 | 0.636 | 0.793 | 0.841 | 0.067
PraNet | 0.628 | 0.567 | 0.600 | 0.794 | 0.841 | 0.031
SANet | 0.750 | 0.654 | 0.685 | 0.849 | 0.897 | 0.015
Ours | 0.825 | 0.744 | 0.781 | 0.911 | 0.936 | 0.009
Table 6. Quantitative results for the different dilation settings (mean Dice).

Dilation Setting | CVC-300 | CVC-ColonDB
6, 12, 18 | 0.878 | 0.710
2, 4, 8 | 0.893 | 0.729
2, 3, 5 | 0.915 | 0.813
1, 2, 3 | 0.922 | 0.841
Table 7. Ablation experiment on two datasets (mean Dice).

Model | CVC-300 | CVC-ColonDB
Backbone | 0.871 | 0.709
EACA + backbone | 0.903 | 0.772
MFA + backbone | 0.886 | 0.737
EACA + MFA + backbone | 0.919 | 0.805
BCRA + EACA + MFA + backbone | 0.922 | 0.841
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Q.; Zhang, T.; Mosharaf, P.M.; Zhang, J. BMFA-Net: Boundary Constraint Multi-Level Feature Aggregation Framework for Precise Polyp Segmentation. Appl. Sci. 2024, 14, 4063. https://doi.org/10.3390/app14104063

AMA Style

Li Q, Zhang T, Mosharaf PM, Zhang J. BMFA-Net: Boundary Constraint Multi-Level Feature Aggregation Framework for Precise Polyp Segmentation. Applied Sciences. 2024; 14(10):4063. https://doi.org/10.3390/app14104063

Chicago/Turabian Style

Li, Qin, Tianchi Zhang, Parvej Md Mosharaf, and Jing Zhang. 2024. "BMFA-Net: Boundary Constraint Multi-Level Feature Aggregation Framework for Precise Polyp Segmentation" Applied Sciences 14, no. 10: 4063. https://doi.org/10.3390/app14104063
