ECF-Net: Enhanced, Channel-Based, Multi-Scale Feature Fusion Network for COVID-19 Image Segmentation

Ji, Zhengjie; Zhou, Junhao; Wei, Linjing; Bao, Shudi; Chen, Meng; Yuan, Hongxing; Zheng, Jianjun

doi:10.3390/electronics13173501

Open AccessArticle

ECF-Net: Enhanced, Channel-Based, Multi-Scale Feature Fusion Network for COVID-19 Image Segmentation

by

Zhengjie Ji

¹,

Junhao Zhou

²,

Linjing Wei

^1,*,

Shudi Bao

^2,3,*

,

Meng Chen

²,

Hongxing Yuan

²

and

Jianjun Zheng

⁴

¹

The College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070, China

²

The School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo 315211, China

³

Ningbo Institute of Digital Twin, Ningbo 315201, China

⁴

Ningbo No. 2 Hospital, Ningbo 315010, China

^*

Authors to whom correspondence should be addressed.

Electronics 2024, 13(17), 3501; https://doi.org/10.3390/electronics13173501

Submission received: 7 August 2024 / Revised: 27 August 2024 / Accepted: 28 August 2024 / Published: 3 September 2024

(This article belongs to the Special Issue Biomedical Image Processing and Classification, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Accurate segmentation of COVID-19 lesion regions in lung CT images aids physicians in analyzing and diagnosing patients’ conditions. However, the varying morphology and blurred contours of these regions make this task complex and challenging. Existing methods utilizing Transformer architecture lack attention to local features, leading to the loss of detailed information in tiny lesion regions. To address these issues, we propose a multi-scale feature fusion network, ECF-Net, based on channel enhancement. Specifically, we leverage the learning capabilities of both CNN and Transformer architectures to design parallel channel extraction blocks in three different ways, effectively capturing diverse lesion features. Additionally, to minimize irrelevant information in the high-dimensional feature space and focus the network on useful and critical information, we develop adaptive feature generation blocks. Lastly, a bidirectional pyramid-structured feature fusion approach is introduced to integrate features at different levels, enhancing the diversity of feature representations and improving segmentation accuracy for lesions of various scales. The proposed method is tested on four COVID-19 datasets, demonstrating mIoU values of 84.36%, 87.15%, 83.73%, and 75.58%, respectively, outperforming several current state-of-the-art methods and exhibiting excellent segmentation performance. These findings provide robust technical support for medical image segmentation in clinical practice.

Keywords:

COVID-19; channel enhancement; bidirectional feature pyramid; multi-scale feature fusion; image segmentation

1. Introduction

Since 2020, the coronavirus disease 2019 (COVID-19) has evolved into a global pandemic. As of 10 April 2020, this novel infectious disease, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has resulted in over 92,000 deaths. While the reverse transcription polymerase chain reaction (RT-PCR) serves as the primary diagnostic method for COVID-19, it is time consuming, prone to misdetection, and can yield false-positive results. In contrast, computed tomography (CT) plays a critical role in COVID-19 diagnosis. However, manually diagnosing a large number of CT scans is highly time consuming and labor intensive.

With the rapid advancement of artificial intelligence technology, medical image segmentation using deep learning has emerged as a prominent research area. The Unet [1] network, based on convolutional neural network (CNN) architecture, stands out as a pioneering model for medical image segmentation. Subsequently, various derivative models, such as DeepLab [2], SegNet [3], UNet++ [4], and RefineNet [5], have been introduced by researchers, all leveraging a CNN as a feature extraction network with encoder–decoder architecture to achieve significant segmentation performance. However, the inherent inductive bias of CNNs, which prioritizes local and translational invariance, limit its ability to effectively capture global information. The Transformer [6], originally proposed for natural language processing, has been adapted for computer vision tasks, notably with the Vision Transformer (ViT) [7] being a prominent example. By incorporating self-attention mechanisms, ViT excels in capturing the global context within images, benefiting segmentation tasks as demonstrated by models like Segmenter [8], SegFormer [9], and Swin-UNet [10]. Despite its capability in terms of modeling long-range dependencies, ViT tends to overlook local detailed features. To address this, recent research has integrated CNNs into Transformer architectures. This approach aims to fuse local and global features, thereby enhancing network representations and improving segmentation accuracy. For instance, TransUNet [11], combines a CNN for extracting low-level features with the Transformer architecture for global interactions, showcasing strong performance in tasks like CT-based multi-organ segmentation. Additionally, MobileViT [12] represents an efficient fusion of a CNN and a ViT, enabling feature extraction with reduced parameters.

While the aforementioned methods have shown promising results in image segmentation, medical CT images for COVID-19 present greater challenges, with complex textures, varying lesion morphologies, size discrepancies, and indistinct boundaries compared to natural scene images. Consequently, a single CNN or Transformer architecture for feature learning fails to adequately capture both local and global contextual information. Moreover, existing approaches that combine a CNN and the Transformer architecture typically employ sequential fusion, namely the CNN for low-level features and the Transformer architecture for high-level features, resulting in shallow networks lacking in global context and deep networks missing local detail. Furthermore, the resolution of feature maps decreases through successive convolutional pooling operations, leading to the loss of local details.

To address these challenges, we propose ECF-Net, a multi-scale feature fusion segmentation network based on channel enhancement. ECF-Net comprises four main components: (1) a channel extraction block utilizing the Swin Transformer, ResNeXt [13], and ResNeXt, with an efficient channel attention mechanism ECA [14] to extract diverse image channels in parallel; (2) a channel fusion block that combines extracted features from different channels to generate a high-dimensional feature space; (3) an adaptive channel generation block designed to filter out irrelevant and redundant information, retaining only essential details; and (4) a bidirectional feature pyramid fusion block that integrates multi-scale features from different stages, enhancing network feature expression to effectively address diverse lesion area characteristics.

This paper introduces several key contributions as follows:

(1) We propose a novel segmentation network, ECF-Net, centered on channel enhancement. This model introduces two primary innovations: first, a parallel strategy employing a CNN, Transformer architecture, and a CNN with channel attention for channel enhancement. Second, the integration of an adaptive channel generation block and a bidirectional feature pyramid fusion block to mitigate noise in the high-dimensional feature space and enhance feature expression by bidirectionally transferring features across different scales, thereby improving segmentation accuracy;

(2) Extensive experiments demonstrate that our proposed method significantly outperforms several currently popular methods used in COVID-19 CT image segmentation tasks.

The paper is structured as follows: Section 2 reviews related work, Section 3 details the network workflow and operational mechanisms of each module, Section 4 presents the comparative and ablation experiments, along with analysis of the experimental results, and Section 5 outlines the research methodology and proposes future directions.

2. Related Work

2.1. CNN-Based Methods

The convolutional neural network (CNN), a classical deep learning architecture, has eliminated the need for manual feature labeling in image segmentation, enabling automatic segmentation. The UNet network, an improvement on the fully convolutional network (FCN) [15], adopts an encoder–decoder structure with skip connections to preserve contextual information. This architecture has been widely adopted in various medical imaging applications, yielding good results [16,17,18].

2.2. Transformer-Based Methods

The self-attention mechanism introduced by Transformers, with its ability to establish long-range dependencies and capture contextual information, has gained popularity in various visual tasks. For instance, TransAttUnet [19] combines self-perceived attention with self-attention to effectively learn non-local interactions between encoder features. UNETR [20] utilizes Transformers to efficiently capture global multi-scale information, while Swin-UNet employs a hierarchical Swin Transformer [21], with a shifted window as both encoder and decoder. The CSWin Transformer [22] introduces a cross-window self-attention mechanism to reduce the computational overhead and enhance performance.

2.3. Methods Based on the Combination of a CNN and a Transformer

To enhance the overall network’s feature representation, combining CNN and Transformer architectures has become a prominent research area in computer vision. For example, CTCNet [23] incorporates two encoder branches: one utilizing ResNet34 to extract spatial and contextual features, and the other employing a Swin Transformer to capture long-range dependencies. These features are then fused through a feature complementation module. The method has demonstrated substantial improvements in multi-organ and cardiac segmentation tasks. TC-Net [24] employs CNN-based encoder and decoder structures to extract local information from medical images, while utilizing a Transformer branch to capture the global context. Additionally, it introduces a Local-Aware and Long-Range Dependency Fusion Strategy (LLCS) and a Dynamic Cyclic Focal Loss (DCFL) to address the class imbalance issue in multi-lesion segmentation. BRAU-Net++ [25] integrates a CNN and a Vision Transformer (ViT), utilizing bi-level routing attention as the core building block to construct a U-shaped encoder–decoder architecture. This design features a hierarchical construction of both the encoder and decoder to capture global semantic information, while mitigating computational complexity. The model demonstrates strong segmentation performance across three medical imaging datasets. Pact-Net [26] proposes an effective fusion of global information extracted by Transformers with local features extracted by CNNs using channel and spatial attention mechanisms, along with a multi-scale approach. TGDAUNet [27] introduces a medical image segmentation network based on a two-branch attentional UNet, combining a Transformer and a GCNN. It employs a polarized self-attention (PSA) module to reduce redundant information caused by multiple scales, thereby improving the coupling with feature information extracted from the Transformer backbone. TSCA-Net [28] utilizes a Transformer-based spatial and channel attention module to extract spatial and channel-related global complementary information from different layers of a U-shaped network. In the decoder, a spatial and channel feature fusion block is designed to integrate the Transformer features’ spatial and channel information. However, relying solely on attention mechanisms for feature fusion may cause the network to focus on salient image regions, while neglecting non-salient regions that contain critical details, thus reducing segmentation accuracy. Early signs of certain lesions may appear in these non-significant regions and ignoring them could lead to misdiagnosis. Furthermore, an over-reliance on salient regions may lead to overfitting, causing the model to perform well on training data, but struggle with new, unseen images. TFCN [29] proposes a Transformer for Fully Convolutional DenseNets (FC-DenseNet), combining the ResLinear-Transformer (RL-Transformer) and Convolutional Linear Attention Block (CLAB). While dense connectivity can enhance performance, the network still heavily relies on convolutional layers, potentially limiting its ability to handle complex nonlinear problems. This can result in an inability to effectively capture subtle boundary, shape, and texture features in lesion regions, thereby affecting the segmentation results. To fully leverage the feature learning capabilities of both CNN and Transformer architectures, we propose a novel ECF-Net for COVID-19 CT image segmentation.

3. Materials and Methods

We propose a novel multi-scale feature fusion segmentation network called ECF-Net, which is based on channel enhancement, as illustrated in Figure 1.

ECF-Net primarily comprises a channel extraction block, a channel fusion block, an adaptive feature extraction block, and a bidirectional feature pyramid fusion (BFP) block. Within these components, the channel extraction block employs a Swin-T block, a ResNeXt block, and a ResNeXt_ECA block in parallel with the extract features in three distinct manners. The Swin-T block effectively captures global image information and contextual details at different scales through a hierarchical approach. Additionally, the local window mechanism captures local features, while reducing the computational complexity, thereby avoiding the high computational overhead associated with traditional global attention mechanisms. The ResNeXt block, a variant of the Residual Network (ResNet), introduces grouped convolutions in residual connections to enhance the network’s expressive power, while maintaining computational efficiency. Given that the Efficient Channel Attention (ECA) mechanism improves the ability to model relationships between channels without significantly increasing the computational complexity, we designed the ResNeXt_ECA block to integrate ECA into the ResNeXt network. This integration efficiently captures dependencies between feature channels, further enhancing the network’s performance, particularly when processing multi-channel features and images with complex structures or details. Therefore, the channel fusion block integrates features from different stages of the extraction process, along the channel dimension, enhancing feature diversity. To address redundancy in the resulting high-dimensional features, we designed the adaptive feature extraction block (depicted in Figure 2), which includes two 1 × 1 ordinary convolutions, a 3 × 3 depth-separable convolution, and a sequence of batch normalization and ReLU activation functions. This design effectively reduces the channel dimensions, filtering out irrelevant or less relevant features. Finally, the bidirectional feature pyramid fusion block bidirectionally merges features of various scales. This enhances ECF-Net’s ability to accurately segment lesions with different scales in COVID-19 CT images.

3.1. Channel Extraction Block

In order to integrate both key local information and the global context effectively, we propose a multi-path parallel channel extraction block that leverages the diverse learning capabilities of the CNN and the Transformer, as illustrated in Figure 3.

3.1.1. Channel Extraction 1

As illustrated in (a), channel extraction 1 employs the Swin Transformer architecture utilizing a shift window. This choice is motivated by the significant computational expense and complexity associated with the traditional Multi-Head Self-Attention (MSA) mechanism in Transformers. In contrast, Swin-T introduces the Window Multi-Head Self-Attention Mechanism (W-MSA), which confines the attention computation within a window rather than across the entire global context, resulting in a notable reduction in the computational load. Furthermore, the Shifted Window Mechanism (SW-MSA) captures a broader range of global context by spanning the windows, while maintaining lower computational complexity. In (b), the Swin-T block is composed of two sub-layers: the Multi-Head Self-Attention Layer (W-MSA or SW-MSA) and the Multilayer Perceptron Layer (MLP). Each sub-layer is preceded by a LayerNorm (LN) layer and is interconnected via residual connections. For an input image

I \in R (H \times W \times C)

and a window size, the computational complexity of each window is detailed in Equation (1).

{Ω (MSA) = 4 hwC}^{2} {+ 2 (hw)}^{4} C

(1)

The total number of windows is

H / h \times W / w

, so the computational complexity of the W-MSA is as shown in Equation (2).

\begin{array}{l} Ω (W - MSA) & = (H / h \times {W / w) (4 hwC}^{2} {+ 2 (hw)}^{4} C) \\ {= 4 HWC}^{2} {+ 2 (hw)}^{4} HWC \end{array}

(2)

To address the limitations of the W-MSA, the SW-MSA facilitates information exchange and fusion among neighboring windows by shifting the window position across each layer. The calculation method for the SW-MSA is depicted in Equation (3).

\begin{array}{l} {\hat{z}}^{l} = W - MSA (LN (z^{l - 1})) + z^{l - 1}, \\ z^{l} = MLP (LN ({\hat{z}}^{l})) + {\hat{z}}^{l}, \\ {\hat{z}}^{l + 1} = SW - MSA (LN (z^{l})) + z^{l}, \\ z^{l + 1} = MLP (LN ({\hat{z}}^{l + 1})) + {\hat{z}}^{l + 1}, \end{array}

(3)

where

{\hat{z}}^{l}

and

z^{l}

represent the output features after the W-MSA and the MLP in the layer, respectively.

3.1.2. Channel Extraction 2

Since the original ResNet [30] employs standard convolutional operations, each layer’s filters are applied to every input channel. This significantly increases the network parameters and computational load. To address this, we propose ResNeXt as an alternative channel extraction network. ResNeXt enhances feature extraction by incorporating grouped convolutions and parallel paths, while maintaining a manageable computational cost. As illustrated in (c), a feature map with 256 channels is fed into the ResNeXt block, where it is processed through 32 branches. Each branch first performs a 1 × 1 convolution to generate a feature map with 4 channels, followed by 3 × 3 and 1 × 1 convolutions to restore the channel count to 256. Finally, the output features from all paths are added together to create a fused output feature, which is residually connected to the initial input features to produce the final output feature.

3.1.3. Channel Extraction 3

To better capture and emphasize locally useful features and reduce the loss of critical local information after a series of convolutional pooling operations, we propose introducing the Efficient Channel Attention (ECA) mechanism into ResNeXt. This mechanism dynamically adjusts the channel weights, while achieving diversified feature extraction through a multi-path structure, enhancing the validity and importance of features across different paths. Consequently, this improves the network’s overall feature expression capability, leading to better recognition of the salient features in lesion regions. Specifically, as illustrated in (d), for the input feature map

I \in R (H \times W \times C)

, the spatial information of each channel is aggregated into a scalar using global average pooling (GAP) to indicate the channel’s overall activation degree. The channel descriptor s is then generated, as calculated in Equation (4).

s_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} I_{i, j, c}

(4)

Here,

s_{c}

denotes the global average of the c channel, H and W denote the height and width of the feature map, respectively, and c denotes the number of channels.

Then, a one-dimensional convolution operation is performed on the channel descriptor s to generate

\hat{s}

, which captures the local interactions between channels. The

\hat{s}

is then passed through a sigmoid activation function to generate the weight coefficients for each channel, W. Finally, these weight coefficients W are used to adjust the weights of each channel in the input feature map, redistributing the importance of each channel to obtain the output feature map O. The computation method is shown in Equation (5).

\begin{array}{l} k = φ (C) = {⌊\frac{\log_{2} (C)}{γ} + \frac{b}{γ}⌋}_{odd} \\ \hat{s} = Conv 1 D (s, k) \\ w = σ (\hat{s}) \\ O = w \times I \end{array}

(5)

Here, k denotes the convolution kernel size, adaptively determined by the number of channels,

φ

denotes the adaptive function, C denotes the number of channels in the input feature map, b and

γ

are hyperparameters, and ⌊ ⌋ denotes rounding down to an odd number to ensure that the convolution kernel size is odd. Moreover, s and

\hat{s}

represent the channel descriptors before and after one-dimensional convolution, respectively. W denotes the channel weight coefficient. The sigmoid activation function is represented by σ. Finally, I and O denote the input feature map and output feature map, respectively.

3.2. Bidirectional Feature Pyramid Fusion Block (BFP)

To enhance the feature expression capability of the network, we propose a bidirectional feature pyramid fusion block. This block achieves comprehensive fusion and enhancement of the feature information by combining feature maps with different scales through top-down and bottom-up paths, as shown in Figure 4.

In the top-down path, the process starts from the high-level low-resolution feature map, gradually increasing the resolution through up-sampling operations, and fuses it with the low-level high-resolution feature map to enhance the global semantic understanding of the low-level feature map. Conversely, the bottom-up path starts from the low-level feature maps, gradually reducing their resolution through downsampling operations, and fuses them with the high-level feature maps. This process utilizes the detailed information of the low-level feature maps to enrich the spatial resolution of the high-level feature maps. This bidirectional fusion mechanism allows each layer of the feature map to merge with neighboring layers, ensuring that each layer in the feature pyramid contains information from different scales. This capability enables the network to better handle features of varying sizes and shapes, improving segmentation accuracy, as well as the network’s robustness and generalization ability.

3.3. BCE Loss

Since our study focuses on COVID-19 CT image segmentation, it addresses a pixel-level binary classification problem, specifically identifying the foreground (pneumonia foci) and the background (normal lung tissue). Therefore, using the binary cross-entropy loss function directly measures the difference between the predicted probabilities of the foreground and background and the true labels. The calculation formula is shown in Equation (6).

Loss = \frac{1}{N} \sum_{i = 1}^{N} (- {[y}_{i} {\log (p}_{i}) + (1 - y_{i}) \log (1 - p_{i})])

(6)

Here, N denotes the number of samples in each batch; y denotes the true label value, which takes the value of 0 or 1; p represents the probability that the network predicts the foreground; 1 − p represents the probability that the network predicts the background; log(p) measures the difference between the probability of predicting the foreground and the true label y; and log(1 − p) measures the difference between the probability of predicting the background and the true label 1 − y.

4. Experiments and Analysis of the Results

4.1. Datasets

We performed validation on three datasets: the COVID-19 CT Segmentation Dataset, the COVID-19 CT Lung and Infection Segmentation Dataset, and MosMedData. The COVID-19 CT Segmentation Dataset consists of two subsets, Dataset_A and Dataset_B. To ensure the fairness of the experiments, the datasets were divided into training and testing sets based on the types of cases. The parameter information of the datasets is presented in Table 1.

Dataset_A contains 100 annotated CT images, with 80 used for training and 20 for testing, while Dataset_B contains 373 images, with 316 used for training and 57 for testing. The COVID-19 CT Lung and Infection Segmentation Dataset includes 20 cases totaling 1844 images, from which 12 cases (1513 images) were used for training and 7 cases (331 images) for testing. MosMedData consists of 50 cases totaling 785 images, with 30 cases (626 images) used for training and 20 cases (159 images) used for testing. The size of the CT images in each of the four datasets is 512 × 512.

4.2. Experimental Environment

Our experimental environment is based on a Windows 10 64-bit operating system (Manufacturer: Microsoft Corporation Country: United States Region: Washington, Redmond) with an Intel(R) Core(TM) i9-10885H processor (Manufacturer: Intel Country: United States Region: Santa Clara, CA, USA), using the deep learning open-source framework PyTorch 1.7.1, and training accelerated by an NVIDIA RTX 3090 GPU (Manufacturer: NVIDIA Corporation Country: United States Region: Santa Clara, CA, USA). We used the AdamW optimizer, with an initial learning rate (LR) set to 1 × 10⁻⁴, a momentum range of (0.9, 0.999), the eps set to 1 × 10⁻⁸, and a weight decay of 0.05. A poly learning rate strategy was employed, decaying by a power of 0.9 after each iteration. We performed 50,000 iterations for all four datasets, with the input images uniformly adjusted to a resolution of 512 × 512.

4.3. Evaluation Metrics

To comprehensively analyze the COVID-19 image segmentation performance, we use nine evaluation metrics: the mean intersection over union (mIoU), sensitivity (SEN), specificity (SPC), the F-measure (F1), Dice’s similarity coefficient (DSC), the Hausdorff distance (HD), the mean absolute error (MAE), the number of floating-point operations (FLOPs), and the number of parameters (Params). These metrics evaluate the segmentation network model’s performance from various perspectives: the mIoU and DSC assess the overlap between the model’s segmentation results and the ground truth labels, SEN and SPC gauge the model’s ability to identify infected and normal regions, F1 combines precision and sensitivity, HD measures the geometric accuracy by calculating the boundary discrepancies, and MAE examines the magnitude of differences between the predicted and true values. Meanwhile, FLOPs and Params evaluate the model’s efficiency and resource requirements in terms of computational load and the number of parameters, respectively.

4.4. Evaluation Results

To verify the effectiveness of the proposed method, we conducted comparative experiments with other state-of-the-art methods on different COVID-19 datasets. The results of the experiments on COVID-19 CT Segmentation Dataset_A are shown in Table 2, which includes APCNet [31], DANet [32], DeepLabv3, FCN, UNet, GCNet [33], PSPNet [34], Segformer, Swin, UPerNet [35], ViT, TransUNet, VOLO [36], PoolFormer [37], CSWin, UNeXt [38], MRL-Net [39], BiseNetv2 [40], and AFFormer [41], totaling 19 methods. To ensure the fairness of the experiment, all training parameters and the experimental environment were kept the same. For the TransUNet, VOLO, PoolFormer, CSWin, and UNeXt methods, we referred to the experimental data in MRL-Net. As shown in Table 2, our ECF-Net achieves optimal performance in regard to nearly all the metrics, with an mIoU of 84.36% and a DSC of 83.08%, significantly outperforming the other methods. Additionally, the MAE is just 0.24%, indicating a minimal deviation between the predicted and actual values. Compared to CNN-based networks, such as UNet, DeepLabv3, and PSPNet, ECF-Net improves the mIoU by 5.94%, 2.30%, and 8.34%, respectively. This suggests that CNN-only networks, while capable of capturing local and multi-scale features, are limited in their ability to capture global contextual information. These networks typically rely on downsampling and upsampling to expand the receptive fields, which can lead to a loss of spatial resolution. In contrast, our network leverages the Swin Transformer and an efficient channel attention mechanism, enabling better global contextual information capture, while preserving detailed features. When compared to Transformer-based networks, such as Swin-T, ViT, and Segformer, ECF-Net achieves an mIoU improvements of 3.05%, 7.48%, and 4.30%, respectively. This is because Transformer-based networks often struggle with processing fine-grained information. ECF-Net, however, combines the strengths of both CNN and Transformer architectures, allowing for more accurate segmentation of lesion regions at different scales through its channel extraction blocks and multi-scale feature fusion.

In comparison with hybrid networks like TransUNet, PoolFormer, and AFFormer, which combine CNN and Transformer architectures, ECF-Net improves the mIoU by 6.53%, 4.07%, and 2.43%, respectively. This is because these networks typically combine CNN and Transformer features through feature splicing or weighting, which may be insufficient for fine-grained tasks, leading to limited feature fusion. Often, features are processed independently at different stages, causing a potential loss of important details or global contextual information during integration. In contrast, ECF-Net’s channel fusion block adopts a more sophisticated feature fusion approach, integrating extracted features at a deeper level to generate a high-dimensional feature space, thereby retaining more details and global information.

Compared to the dual-branch CNN-based BiseNetv2, ECF-Net improves the mIoU by 4.92%. BiseNetv2 uses a dual-branch structure to separate spatial and semantic information, improving the processing efficiency and segmentation to some extent. However, this separation may lead to inadequate information integration, particularly when dealing with complex multi-scale features. The dual-branch structure may also struggle to fully leverage global information. In contrast, ECF-Net fully integrates spatial and semantic information through its channel enhancement mechanism and multi-scale feature fusion block, adaptively filtering redundant information during feature fusion. The bidirectional feature pyramid fusion block further optimizes multi-scale feature integration, enabling the network to better handle complex image scenes and diverse lesion regions.

Additionally, DeepLabv3’s mIoU is 1.74% and 3.64% higher than that of FCN and UNet, respectively, demonstrating the effectiveness of its atrous convolution and multi-scale atrous spatial pyramid pooling (ASPP) module. These components allow the network to expand the receptive field without losing resolution, efficiently capturing global contextual information through multi-scale feature fusion, thereby enhancing segmentation performance. Swin-T outperforms ViT with a 1.18% higher mIoU and a 1.42 higher DSC, while also reducing the number of parameters and computation by 83.06 M and 206 G, respectively. This improvement in segmentation accuracy, combined with a significant reduction in the parameters and computation, is attributed to Swin-T’s efficient hierarchical design and sliding window attention mechanism, which is why it has been adopted as one of the channel extraction blocks in ECF-Net.

To intuitively demonstrate the differences in segmentation performance between our method and other state-of-the-art methods, five CT images were randomly selected from Dataset_A, and five different networks were tested to visualize the segmented lesion regions, as shown in Figure 5.

As shown in Figure 5, in the first and fourth COVID-19 CT images, the other four methods have missed detections, whereas our method accurately identifies tiny lesions. We also observe that Swin-T has a high overall leakage rate and DeepLabv3 exhibits more misdetections. Although our method has some discrepancies compared to the real values, it maintains a high overall fit to the real values, accurately recognizing and segmenting even tiny and fuzzy-bordered lesions. This indicates that our proposed channel extraction strategy, through three pathways, enables the network to capture both spatial detail information and contextual semantic information. Additionally, the bidirectional feature pyramid fusion block enhances the focus on boundary features, effectively addressing the difficulty of recognizing tiny lesions with fuzzy boundaries.

The experimental results for the COVID-19 CT Segmentation Dataset_B are shown in Table 3. The mIoU of ECF-Net reaches 87.15%, the Dice value is 85.57%, and the Hausdorff distance (HD) is 29.91%. Compared to Swin-T and ViT, which are based solely on the Transformer architecture, ECF-Net improves the mIoU by 5.06% and 12.24%, respectively, and reduces the HD value by 8.36% and 37.40%, indicating a significant reduction in bias between the predicted and actual boundaries. Compared to full convolutional networks, such as FCN, UNet, and DeepLabv3, ECF-Net improves the mIoU by 6.84%, 10.73%, and 10.83%, respectively. These results demonstrate that our method offers better smoothness and higher segmentation accuracy.

Table 4 and Table 5 present the experimental results for the COVID-19 CT Lung and Infection Segmentation Dataset and MosMedData, respectively. Our method achieves the highest mIoU value for both datasets, demonstrating superior segmentation accuracy compared to other state-of-the-art methods. However, the experimental results for MosMedData are not as strong as those for the other three datasets. This is attributed to the fact that the lesions in MosMedData are very small and low resolution, which increases the difficulty of the recognition and impacts the segmentation results. Nonetheless, our method still delivers the best performance metrics relative to several other methods, further demonstrating its strong generalization ability.

4.5. Ablation Study

To thoroughly validate the effectiveness of each component in our proposed method, we conducted a series of ablation experiments on the COVID-19 CT Segmentation Dataset_B. These experiments involved varying the channel extraction blocks, including and excluding feature generation blocks, incorporating and omitting bidirectional feature pyramid fusion blocks, and testing different loss functions.

4.5.1. Ablation Studies Involving Channel Extraction Blocks

From Table 6, it is evident that the mIoU value for Extract1, which uses only the Swin-T architecture for channel extraction, is 82.09%. This is 5.06% lower than the mIoU achieved by our proposed method, which employs three channel extraction blocks in parallel. The mIoU values for extract1 + extract2 and extract1 + extract3, which utilize two channel extraction blocks in parallel, are 84.91% and 85.28%, respectively. These are lower than our method’s mIoU by 2.24% and 1.87%, respectively. This suggests that a single channel extraction block alone cannot adequately capture both global and local information. Although using two channel extraction blocks in parallel addresses the limitations of a single block, extract1 + extract3 achieves a 0.37% higher mIoU than extract1 + extract2. This indicates that the extensive convolutional pooling operations during CNN-based methods may lead to the loss of some local detailed features. Therefore, incorporating the ECA channel attention mechanism helps capture fine-grained features and effectively addresses this issue.

4.5.2. Ablation Studies Involving Adaptive Feature Generation Block

From Table 7, it can be observed that the mIoU and Dice values for the network without the adaptive feature generation block are 86.34% and 84.50%, respectively. These values are 0.81% and 1.07% lower than those for the network with the adaptive feature generation block. This indicates that the addition of this module effectively maps high-dimensional features to a lower-dimensional space, eliminating redundant information and emphasizing key features, thereby improving segmentation accuracy.

4.5.3. Ablation Studies Involving BFP

Table 8 presents the experimental results for the network with and without the bidirectional feature pyramid fusion block. The introduction of this block results in a 1.43% increase in the mIoU and a 28.82% reduction in the Hausdorff distance (HD). This improvement is attributed to the block’s ability to perform top-down and bottom-up bidirectional feature fusion, effectively integrating feature information across different scales and enhancing feature expression. The experimental results underscore the effectiveness of this approach.

4.5.4. Ablation Studies Involving the Loss Function

A series of ablation experiments were conducted to assess the impact of different loss functions on the network’s training phase, with the results presented in Table 9. The mIoU is the lowest with HDLoss, at 80.07%. The mIoU values for FocalLoss and DiceLoss are similar, with a difference of only 0.30%. CrossEntropy Loss, the binary cross-entropy loss function used in this study, yields the highest performance metrics overall, except for the SPC and F1 values, which are slightly lower than those achieved with DiceLoss. This indicates that the CrossEntropy Loss is more effective at optimizing the network and reducing boundary errors.

5. Discussion and Conclusions

In this study, we propose a multi-scale feature fusion network, ECF-Net, based on channel enhancement to accurately segment COVID-19 lesion regions in lung CT images. The experimental results demonstrate that our method exhibits excellent segmentation performance, with mIoU values of 84.36%, 87.15%, 83.73%, and 75.58% across four COVID-19 CT image datasets, validating its effectiveness and robustness. This improvement in segmentation accuracy has significant potential to enhance surgical planning, optimize doses for adjuvant radiation therapy, and automate pathology analysis, thereby improving the quality and efficiency of healthcare services. Despite these promising results, our method has the following shortcomings:

(1) The datasets used in our experiments are publicly available and no actual clinical trials have been conducted, which limits the validation of our method in real-world scenarios;

(2) The network’s parameter count and computational load reach 98 MB and 261 GB, respectively, reflecting its high complexity in terms of both computation and storage requirements. A high parameter count generally allows the model to capture more detailed features, but also increases the storage needs. This can be challenging in clinical environments, where storage resources may be limited. Additionally, the increased computational load affects both training and inference speeds. During training, the higher computational volume extends the training time, necessitating powerful resources and longer processing cycles, which can increase the development costs and affect research timelines. During inference, where real-time processing is essential for many medical applications, the high computational demand may lead to slower processing speeds, potentially undermining the efficiency of real-time applications.

In future work, we will focus on two key aspects for further research and improvement. First, in terms of the datasets, we plan to collect CT image data from various hospitals and devices to comprehensively evaluate the model’s applicability and robustness. Additionally, we intend to conduct tests in real clinical environments to assess the performance of the proposed network model across different healthcare organizations and devices. This approach will help identify and address potential issues the model may encounter during practical applications, such as image noise or quality variations generated by different devices. By resolving these challenges, we aim to reduce the risk of misdiagnosis and omission, thereby enhancing diagnostic accuracy, speeding up diagnosis, and optimizing treatment decisions to minimize adjustments caused by inaccurate diagnoses. Second, regarding network lightweighting, we will explore techniques, such as pruning, quantization, and sparsification, to reduce the number of network parameters and the computational overhead. We will also investigate hardware-based optimization solutions, including improving the inference speed of models using accelerated hardware like GPUs or FPGAs. Furthermore, we plan to utilize parallel computing and optimization algorithms to enhance inference efficiency. Developing lightweight models is crucial for reducing data processing and transmission latency, thus shortening the time from image acquisition to result output. This is especially important in scenarios, such as emergency medicine and high-throughput screening, where improved efficiency and response times are vital.

Author Contributions

Conceptualization, Z.J. and J.Z. (Junhao Zhou); methodology, Z.J., L.W. and S.B.; validation, M.C., J.Z. (Junhao Zhou), Z.J. and H.Y.; writing—original draft preparation, Z.J.; writing—review and editing, S.B., H.Y. and M.C.; supervision, L.W., S.B. and J.Z. (Jianjun Zheng); funding acquisition, L.W. and S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Lanzhou Municipal Talent Innovation and Entrepreneurship Project (2021-RC-47), the Ministry of Science and Technology National Foreign Expertise Project (No. G2022042005L), Gansu Higher Education Institutions Industrial Support Project (No. 2023CYZC-54), Gansu Key R&D Program (No. 23YFWA0013), Gansu Agricultural University Aesthetic and Labor Education Teaching Reform Project (No. 2023-09), the open research fund of the National Mobile Communications Research Laboratory, Southeast University (No. 2023D15), Ningbo Clinical Research Center for Medical Imaging (No. 2022LYKFYB01).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Chen, L.C.; Papandreou, G.; Schroff, F.; Hartwig, A. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Alexey, D.; Lucas, B.; Alexander, K.; Dirk, W.; Xiaohua, Z.; Thomas, U.; Mostafa, D.; Matthias, M.; Georg, H.; Sylvain, G.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 10–25 June 2021; pp. 7262–7272. [Google Scholar] [CrossRef]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Le, L.; Alan, L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Shashank, M.; Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 1492–1500. [Google Scholar] [CrossRef]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018. [Google Scholar] [CrossRef] [PubMed]
Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711. [Google Scholar] [CrossRef] [PubMed]
Azad, R.; Bozorgpour, A.; Asadi-Aghbolaghi, M.; Merhof, D.; Escaler, S. Deep frequency re-calibration u-net for medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3274–3283. [Google Scholar] [CrossRef]
Chen, B.; Liu, Y.; Zhang, Z.; Lu, G.; Kong, A.W.K. TransAttUnet: Multi-Level Attention-Guided U-Net with Transformer for Medical Image Segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 55–68. [Google Scholar] [CrossRef]
Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Guo, B. CSWin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar] [CrossRef]
Yuan, F.; Zhang, Z.; Fang, Z. An effective CNN and Transformer complementary network for medical image segmentation. Pattern Recognit. 2023, 136, 109228. [Google Scholar] [CrossRef]
Zhang, Z.; Sun, G.; Zheng, K.; Yang, J.K.; Zhu, X.; Li, Y. TC-Net: A joint learning framework based on CNN and vision transformer for multi-lesion medical images segmentation. Comput. Biol. Med. 2023, 161, 106967. [Google Scholar] [CrossRef] [PubMed]
Lan, L.; Cai, P.; Jiang, L.; Liu, X.; Li, Y.; Zhang, Y. BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation. arXiv 2024, arXiv:2401.00722. [Google Scholar]
Chen, W.; Zhang, R.; Zhang, Y.; Bao, F.; Lv, H.; Li, L.; Zhang, C. Pact-Net: Parallel CNNs and Transformers for medical image segmentation. Comput. Methods Programs Biomed. 2023, 242, 107782. [Google Scholar] [CrossRef] [PubMed]
Song, P.; Li, J.; Fan, H.; Fan, L. TGDAUNet: Transformer and GCNN based dual-branch attention UNet for medical image segmentation. Comput. Biol. Med. 2023, 167, 107583. [Google Scholar] [CrossRef] [PubMed]
Fu, Y.; Liu, J.; Shi, J. TSCA-Net: Transformer based spatial-channel attention segmentation network for medical images. Comput. Biol. Med. 2024, 170, 107938. [Google Scholar] [CrossRef] [PubMed]
Jia, X.; Li, D. TFCN: Temporal-frequential convolutional network for single-channel speech enhancement. arXiv 2022, arXiv:2201.00480. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528. [Google Scholar] [CrossRef]
Xue, H.; Liu, C.; Wan, F.; Jiao, J.; Ji, X.; Ye, Q. Danet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6589–6598. [Google Scholar] [CrossRef]
Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar] [CrossRef]
Yuan, L.; Hou, Q.; Jiang, Z.; Feng, J.; Yan, S. VOLO: Vision outlooker for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 6575–6586. [Google Scholar] [CrossRef] [PubMed]
Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Yan, S. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar] [CrossRef]
Valanarasu JM, J.; Patel, V.M. UNeXt: MLP-based rapid medical image segmentation network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 23–33. [Google Scholar] [CrossRef]
Liu, S.; Cai, T.; Tang, X.; Wang, C. MRL-Net: Multi-scale Representation Learning Network for COVID-19 Lung CT Image Segmentation. IEEE J. Biomed. Health Inform. 2023, 27, 4317–4328. [Google Scholar] [CrossRef] [PubMed]
Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
Dong, B.; Wang, P.; Wang, F. Head-free lightweight semantic segmentation with linear transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 516–524. [Google Scholar] [CrossRef]

Figure 1. ECF-Net structure diagram.

Figure 2. Adaptive feature generation block.

Figure 3. Channel extraction block. (a) Three types of channel extraction blocks. (b) Swin-T Block. (c) ResNeXt Block. (d) ECA: Efficient Channel Attention.

Figure 4. Bidirectional feature pyramid fusion block (BFP).

Figure 5. Visualization comparison of the segmentation results: (a) original image, (b) Ground Truth, (c) DeepLabv3, (d) Swin-T, (e) AFFormer, (f) BiseNetv2, (g) ECF-Net.

Table 1. The parameter information of the datasets.

Dataset	No. of Cases	Total No. of Samples	No. of Samples in Training Set/Case	No. of Samples in Test Set/Case	Image Size
Dataset_A	>40	100	80	20	512
Dataset_B	9	373	316/7	57/2	512
MosMed	50	785	626/30	159/20	512
Inf	20	1844	1513/12	331/7	512

Inf: the COVID-19 CT Lung and Infection Segmentation Dataset.

Table 2. Experimental results for Dataset_A.

Method	FLOPs (G)	Params (M)	mIoU (%)	SEN (%)	SPC (%)	F1 (%)	DSC (%)	HD	MAE (%)
APCNet	205	53.98	68.45	59.09	97.13	60.04	59.68	115.00	0.53
DANet	211	47.45	79.63	69.95	99.04	81.5	76.75	94.09	0.28
DeepLabv3	270	65.74	82.06	75.95	98.95	82.81	80.1	49.49	0.26
FCN	198	47.12	80.86	76.46	98.62	79.87	78.56	51.64	0.28
UNet	23.77	13.40	78.42	72.55	95.52	75.37	73.10	44.63	6.88
GCNet	198	47.25	81.19	77.73	98.56	79.83	79.03	43.62	0.27
PSPNet	179	46.60	76.02	66.26	98.56	75.24	71.60	90.74	0.35
Segformer	7.88	3.71	80.06	72.17	98.92	80.89	77.39	57.49	0.28
Swin-T	236	58.94	78.06	66.34	99.09	80.41	74.49	97.47	0.31
UPerNet	237	64.04	76.26	69.01	98.28	74.02	72.06	92.97	0.36
ViT	442	142	76.88	72.57	98.02	73.37	73.07	77.83	0.36
TransUNet	33.49	110.52	77.83	75.87	94.40	73.87	73.27	50.58	7.32
VOLO	7.66	26.78	80.04	77.23	95.44	75.11	75.15	42.12	6.34
PoolFormer	9.98	57.71	80.29	76.25	95.43	76.16	75.21	41.38	6.26
CSWin	4.65	23.03	80.50	78.51	94.92	75.33	75.48	41.93	6.29
UNeXt	0.44	1.47	73.90	68.93	93.76	69.87	67.51	55.93	8.82
MRL-Net	20.01	85.06	82.75	77.78	96.49	79.47	78.13	35.67	5.29
BiseNetv2	7.64	2.20	79.44	72.63	98.69	79.15	76.57	47.10	0.30
AFFormer	2.93	2.16	81.93	66.01	99.23	73.77	72.13	43.47	2.49
ECF-Net	261	98	84.36	80.38	99.10	84.79	83.08	43.77	0.24