1. Introduction
Accurate image segmentation can more clearly identify changes in anatomic or pathologic structure in medical images [
1], which is crucial in various computer-aided diagnostic applications, including lesion contouring, surgical planning, and three-dimensional reconstruction. Medical image segmentation can detect and locate the boundaries of lesions in an image, helping to quickly identify potential tumors and cancerous regions, which may save clinicians diagnosis time and improve the likelihood of detecting tumors [
2]. Traditionally, the symmetric encoder–decoder structure has been the standard in medical image segmentation, where U-Net [
3] has become the benchmark of choice among different variants with great success.
The U-Net model is built from convolutions, whose fundamental operator has two characteristics, weight sharing and local connectivity, which provide the model with a degree of translation invariance. Although these characteristics help to create effective and versatile medical imaging systems, such models still require further improvements to aid clinicians in early disease diagnosis [
4]. Researchers have proposed various improved methods to add global context to convolutional neural networks (CNNs), among which influential methods include introducing an attention mechanism [
5,
6,
7,
8] and expanding convolution kernels [
9,
10,
11] to extend their receptive fields. However, the locality of the convolutional layer’s receptive field prevents such networks from effectively exploiting long-range semantic dependencies: their learning is confined to relatively small regions, and they fail to fully capture object-level information, especially for organs, whose texture, shape, and size tend to be weakly distinctive and exhibit sizable inter-patient variability.
Transformer [
12] has shown high performance in natural language processing tasks, and attention-based models have emerged as an appealing alternative because of their ability to efficiently handle very long sequence dependencies and to adapt to various vision tasks. Recent research has demonstrated that Transformer modules can entirely take over the role of conventional convolutions by operating on sequences of image patches; the most representative of these models is the Vision Transformer (ViT) [
13]. There have been many works proving that the ViT model can promote the development of many computer vision tasks, including semantic segmentation [
14], object detection [
15], and image classification [
13], among others. The accomplishments of ViT in processing natural images have captivated the medical field. In response, researchers are delving into the capabilities of Transformer for medical image segmentation to address the intrinsic receptive field limitations of CNNs and make them suitable for medical imaging applications [
16,
17,
18,
19].
However, the performance of Transformer-based models largely depends on pre-training [
20,
21]. There are two problems with Transformer-based models’ pre-training process. First, there is a scarcity of comprehensive and widely accepted large datasets for pre-training in the medical imaging field due to the extensive time professionals need to dedicate to annotating medical images (in contrast, ImageNet [
22], a large dataset, is available for pre-training on natural scene images). Secondly, pre-training consumes substantial time and computing resources. Moreover, using large natural image datasets to pre-train medical image segmentation models is difficult because of the domain gap between medical and natural images. In addition, open challenges remain across different types of medical images. For example, when Swin UNETR [
23] is pre-trained on a CT dataset and subsequently applied to different medical imaging modalities like MRI, its performance tends to decline. This is attributed to the significant differences in regional characteristics between CT and MRI images [
24].
Fully exploiting the respective advantages of CNNs and Transformers to effectively integrate fine-grained and coarse-grained image information, thereby boosting the precision and performance of deep learning models, has become an active research direction. In
Figure 1, we summarize various medical image segmentation methods that utilize a combined CNN and Transformer hybrid architecture. As shown in
Figure 1a, researchers have incorporated Transformer into models with a CNN as the backbone in different ways, either by adding Transformer modules or by replacing certain architectural components, to create a network that combines Transformer and a CNN in a serial or embedded fashion. However, this strategy only uses stacking to fuse fine-grained and coarse-grained features, which may reduce the fusion effect and not fully leverage the synergistic capabilities of both network types.
Figure 1b,c illustrate parallel frameworks of a CNN and Transformer, extracting distinct feature information from both structures and merging them multiple times before passing them to the decoder for decoding. In
Figure 1b, additional branches are used to fuse the CNN and Transformer branches, but this introduces network overhead and lacks effective interaction. For example, TransFuse [
25] uses a branch composed of BiFusion modules to simultaneously utilize the different characteristics of the CNN branch and the Transformer branch during the feature extraction process, alleviating the problem of ignoring intermediate features due to serial connections. However, its upsampling method struggles to effectively restore the mid-layer information, leading to a loss of detail.
In
Figure 1c, the input features are extracted independently by the Transformer and the CNN at each layer, then added and merged as the input of the next CNN or Transformer stage. However, due to the semantic differences between Transformer and CNN features, this strategy limits the fusion’s effectiveness. For example, both HiFormer [
26] and CiT-Net [
27] use CNN and Transformer branches to extract features, but HiFormer only implements one-way feature fusion from the CNN to the Transformer, while CiT-Net only fuses the two features through a single fusion method and sends them to the different branches. Both ignore the distinct characteristics of the two branches.
In our study, we present the collaborative cross-fusion network (CCFNet), designed for medical image segmentation. CCFNet’s encoder consists of two parts: first, in the shallow layers of the encoder, a CNN is employed to acquire convolutional depth features, compensating for the detail lost during upsampling; second, in the deep encoder layers, a collaborative cross-fusion of the Transformer and the CNN is applied, which can simultaneously enhance the image’s local and global representations. As shown in
Figure 1d, compared with other combination strategies, CCFNet achieves close information interaction between the CNN and Transformer while continuously modeling local and global representations. By continuously aggregating hierarchical global and local features and enabling their interaction, collaborative cross-fusion yields tighter information exchange and more thorough feature fusion, which makes CCFNet perform well in medical image segmentation tasks.
In CCFNet, fine-grained local information is extracted from high-resolution features using depthwise convolutions. Since low-resolution features contain more global information (location and semantic information), feature prediction can fuse long-range global information, and the self-attention mechanism facilitates the capture of deep information [
28]. CCFNet processes low-resolution features through a parallel fusion of the CNN and Transformer within the collaborative cross-fusion module (CCFM). This method capitalizes on the self-attention mechanism’s strong long-range dependency modeling to ensure accurate medical image segmentation. Considering the complementarity of the two networks’ features, the CCFM sequentially delivers global context from the Transformer branch to the CNN feature maps through the spatial feature injector (SFI) block. This integration significantly boosts the global perceptual capabilities of the CNN branch. Likewise, the collaborative self-attention fusion (CSF) block progressively reintroduces the local features from the CNN branch back into the Transformer, enriching the local detail and creating a dynamic interplay of fused features. In this way, local–global feature complementarity is achieved and the network’s feature-encoding capability is enhanced. The experiments utilize Synapse, an openly accessible dataset for multi-organ medical image segmentation. When the average Dice similarity coefficient (DSC) scores are compared with those of other hybrid models, our proposed CCFNet shows improved accuracy in organ segmentation. This paper’s important contributions are summarized below:
We propose CCFNet, a collaborative cross-fusion network, to integrate Transformer-based global representations with convolutional local features in a parallel interactive manner. Compared with other fusion methods, the collaborative cross-fusion module can not only encode the hierarchical local and global representations independently but also aggregate the global and local representations efficiently, maximizing the capabilities of the CNN and Transformer.
In the collaborative cross-fusion module, the CSF block is designed to adaptively fuse the local and global tokens according to their correlation and to reorganize the two features, introducing the convolution-specific inductive bias into the Transformer. The spatial feature injector block is designed to reduce the spatial information gap between local and global features, avoiding asymmetry between the extracted features and introducing the global information of the Transformer into the CNN.
On two publicly accessible medical image segmentation datasets, CCFNet outperforms other competitive segmentation models, validating its effectiveness and superiority.
3. Method
CCFNet follows a U-shaped structure featuring hierarchical decoder and encoder sections, in which skip connections facilitate the linkage between the decoder and encoder. It is essential to note that CCFNet is structured with two branches, which process information differently, as shown in
Figure 2. The two branches preserve global contexts and local features through the parallel fusion layer composed of CCFMs. In the CCFM, the CSF block can adaptively fuse local and global tokens according to their correlation, thus introducing convolution-specific inductive bias into the Transformer. The SFI block avoids asymmetry between the extracted features and introduces the global representations of the Transformer into the CNN branch, which has extracted local semantic features through the detail feature extractor (DFE) block. Features from the two parallel branches are successively cross-fused, finally realizing the complementarity of the two kinds of features. The proposed parallel branching approach has three main benefits. First, the CNN branch gradually extracts low-level, high-resolution features to obtain detailed spatial information, which helps the Transformer obtain rich features and accelerates its convergence. Second, the Transformer branch can capture global information while remaining sensitive to low-level contexts without building a deep network. Finally, during feature extraction, the proposed CCFM can leverage the different characteristics of the Transformer and the CNN to the fullest extent, continuously aggregating hierarchical representations from global and local features.
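To make the overall data flow concrete, the following is a minimal PyTorch-style sketch of the two-branch encoder as we read the description above. The two-branch layout, the stack of CCFMs, and the final additive fusion follow the text; the class and argument names, interfaces, and channel handling are assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class CCFNetEncoderSketch(nn.Module):
    """Illustrative wiring of the two-branch encoder: a convolutional stem,
    a ViT-style patch embedding, a stack of collaborative cross-fusion
    modules (CCFMs), and a final additive fusion.  Module internals are
    placeholders; sizes and interfaces are assumptions of this sketch."""

    def __init__(self, stem_stages, patch_embed, ccfm_layers, fuse_conv):
        super().__init__()
        self.stem = nn.ModuleList(stem_stages)    # hierarchical CNN stages (skip features)
        self.patch_embed = patch_embed            # tokens reshaped to an image-like map
        self.ccfms = nn.ModuleList(ccfm_layers)   # e.g., six CCFM blocks
        self.fuse_conv = fuse_conv                # convolution fusing the two branches

    def forward(self, x):
        skips = []                                # CNN features reused by the decoder
        f = x
        for stage in self.stem:
            f = stage(f)
            skips.append(f)
        t = self.patch_embed(x)                   # Transformer branch sees the same input image
        for ccfm in self.ccfms:
            f, t = ccfm(f, t)                     # collaborative cross-fusion of the two branches
        fused = self.fuse_conv(f + t)             # add the branch outputs, then fuse by convolution
        return fused, skips
```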
3.1. CNN Branch
The CNN branch adopts a feature pyramid structure. This is because, while in the Transformer branch patch embedding is used to project image patches into vectors, which results in the loss of local details, in the CNN the convolution kernels slide across overlapping feature maps, which provides the possibility of preserving fine local features. As a result, the CNN branch is able to supply local feature details to the Transformer branch. Specifically, as the network depth increases in the CNN branch, the resolution of the feature maps gradually decreases, the number of channels gradually increases, the receptive field gradually increases, and the feature encoding changes from local to global. Given an input image $x \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is its spatial resolution and $C$ is the number of channels, the feature map generated by the $l$-th convolutional block $C_l$ is represented as
$$f_l = C_l\!\left(f_{l-1};\, \theta_c\right) \in \mathbb{R}^{H_l \times W_l \times D}, \qquad l = 1, \dots, L, \quad f_0 = x,$$
where $D$ represents the dimension of the feature map, $\theta_c$ represents the parameters of the CNN branch, and $L$ represents the number of feature layers. Specifically, the first block $C_1$ is made up of 2 convolutions with strides 1 and 2, and each convolution block is followed by normalization and the GELU activation function to extract initial local features (such as edge and texture information). As shown in Figure 3a, $C_2$ and $C_3$ are stacked with SEConv blocks composed of three convolutional blocks and an SE module [6]; the number of SEConv blocks in $C_2$ and $C_3$ is 2 and 6, respectively. The efficient and lightweight SE module can be seamlessly integrated into the CNN architecture, helping CCFNet to enhance local details, suppress irrelevant regions, recalibrate channel features by modeling the relationships between channels, and improve the representational capacity of the network.
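For illustration, here is a minimal PyTorch sketch of an SE module and an SEConv block of the kind described (three convolutional blocks followed by an SE module). The kernel sizes, normalization, activation, and reduction ratio are assumptions of this sketch, since they are not specified above.

```python
import torch
import torch.nn as nn


class SEModule(nn.Module):
    """Squeeze-and-excitation: recalibrates channels by modeling
    inter-channel relationships.  The reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                      # channel-wise re-weighting


class SEConvBlock(nn.Module):
    """One SEConv block as described: three convolutional blocks followed by
    an SE module.  Kernel sizes, normalization, and activation are assumed."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )
        self.se = SEModule(out_ch)

    def forward(self, x):
        return self.se(self.convs(x))
```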
The CNN branch of the parallel fusion layer consists of a six-layer stack of modules, each consisting of a DFE block and an SFI block. The feature map output $f_i$ of each layer has the same resolution, and the output of the $i$-th layer is expressed as
$$f_i = \mathrm{SFI}\!\left(\mathrm{DFE}\!\left(f_{i-1}\right),\, t_i\right),$$
where $t_i$ is the $i$-th layer's CSF-block-coded image representation on the Transformer branch, with the same resolution as $f_i$. The structures of the DFE block and the SFI block are shown in Figure 3b. More detailed operations are described in Section 3.3 (the parallel fusion layer).
3.2. Transformer Branch
The CNN branch obtains rich local features under a limited receptive field through convolution operations, while the Transformer branch performs global self-attention through attention mechanisms. The Transformer branch has the same input image $x$ as the CNN branch. Following [13,16], in the patch embedding we first divide $x$ into a sequence of $N = \frac{HW}{P^2}$ patches, where the size of each patch is $P \times P$ and the default setting is $P = 16$. After splitting the input image into small patches, the patches are flattened into a sequence of 2D patches $\{x_p^i \in \mathbb{R}^{P^2 \cdot C} \mid i = 1, \dots, N\}$ and fed to a trainable linear layer, which converts the vectorized patches $x_p$ into a sequence embedding space with an output dimension of $D$; then, in order to facilitate fusion with the CNN branch, a reshape operation is used to generate the image-like feature map $t_0 \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$, which can be expressed as
$$t_0 = \mathrm{Reshape}\!\left(\left[x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E\right]\right),$$
where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the patch embedding projection. The Transformer branch in the parallel fusion layer is connected to six CSF blocks of attention operations, and each CSF block consists of a ConvAttention and a CMLP (convolution multi-layer perceptron). The feature map output $t_i$ of each layer has the same resolution. Therefore, the output of the $i$-th layer can be written as follows:
$$t_i = \mathrm{CSF}\!\left(t_{i-1},\, \hat{f}_i\right),$$
where $t_{i-1}$ and $\hat{f}_i$ are the two inputs of the CSF block, $\hat{f}_i$ is the intermediate output of the $i$-th layer's DFE module on the CNN branch, which has the same resolution as $t_{i-1}$, and $t_i$ is the encoded image representation. The structure of a CSF block is illustrated in Figure 3b; its detailed operations are described in Section 3.3.
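For concreteness, the following is a minimal PyTorch sketch of the patch embedding and reshape step described above. The embedding dimension, the default number of input channels, and the unfold-based patch extraction are illustrative assumptions (an equivalent strided convolution is also a common implementation).

```python
import torch
import torch.nn as nn


class PatchEmbedSketch(nn.Module):
    """ViT-style patch embedding as described: split the image into P x P
    patches, project each flattened patch with a trainable linear layer,
    then reshape the token sequence back into an (H/P) x (W/P) map so it
    can be fused with the CNN branch.  P = 16 and dim are assumptions."""
    def __init__(self, in_ch=3, dim=384, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * in_ch, dim)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, N, P*P*C), one row per patch
        patches = x.unfold(2, p, p).unfold(3, p, p)            # B, C, H/P, W/P, P, P
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                             # B, N, dim
        # reshape tokens back into an image-like map for fusion with the CNN
        return tokens.transpose(1, 2).reshape(b, -1, h // p, w // p)
```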
3.3. Parallel Fusion Layer
The parallel fusion layer has two branches, namely, the Transformer branch and the CNN branch, which process information in distinct ways. In the CNN branch, local features are collected hierarchically through convolution operations, and local cues are also saved as feature maps. The parallel fusion layer fuses the feature representations of the CNN in a parallel manner through cascaded attention modules, which maximizes the preservation of local features and global representations. The parallel fusion layer is composed of six stacked CCFMs.
An image has two completely different representations: global features and local features. The former focuses on modeling object-level relationships between distant parts, while the latter captures fine-grained details and is beneficial for pixel-level localization and the detection of tiny objects. As shown in
Figure 3b, a CCFM is used to efficiently combine these encoded features of the Transformer and CNN, which can interactively fuse convolution-based local features and Transformer-based global representations.
The CCFM has two inputs, $f_{i-1}$ and $t_{i-1}$, where $f_{i-1}$ is the input on the CNN branch with the same resolution as $t_{i-1}$, $t_{i-1}$ is the input on the Transformer branch, and $\hat{f}_i$ is the feature map formed after feature extraction on the CNN branch, with the same resolution and number of channels as $t_{i-1}$, which can be expressed as
$$\hat{f}_i = \mathrm{DFE}\!\left(f_{i-1}\right).$$
The Transformer aggregates information between global tokens, but the CNN only aggregates information within the limited local field of view of the convolution kernel, which leads to certain feature semantic differences between the Transformer and the CNN. Therefore, by superimposing the feature maps of the CNN and Transformer, the CSF block adaptively fuses the self-attention weights with the information common to both in order to calculate the mutual relationship between local tokens and global tokens.
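Read together, the DFE, CSF, and SFI blocks of one CCFM can be wired as in the following sketch; the sub-modules are placeholders, and the ordering of operations follows the description in this section rather than the authors' code.

```python
import torch.nn as nn


class CCFMSketch(nn.Module):
    """Wiring of one collaborative cross-fusion module as described in the
    text: the DFE block refines the CNN feature, the CSF block fuses it with
    the Transformer tokens, and the SFI block injects the resulting global
    context back into the CNN branch.  Sub-modules are placeholders."""
    def __init__(self, dfe, csf, sfi):
        super().__init__()
        self.dfe, self.csf, self.sfi = dfe, csf, sfi

    def forward(self, f_prev, t_prev):
        f_hat = self.dfe(f_prev)          # local features from the CNN branch
        t_next = self.csf(t_prev, f_hat)  # global tokens enriched with local detail
        f_next = self.sfi(f_hat, t_next)  # CNN features enriched with global context
        return f_next, t_next
```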
As shown in
Figure 3b, the CSF block consists of ConvAttention and CMLP. Like the traditional attention mechanism, the basic module of ConvAttention is multi-head self-attention (MHSA). As shown in Figure 4a, the difference is that ConvAttention has two inputs, the Transformer-branch feature $t_{i-1}$ and the CNN-branch feature $\hat{f}_i$, which are added to obtain the feature maps used as its input. In addition, ConvAttention uses convolutional mappings: one of these feature maps generates $Q$ through a convolutional mapping, and the other generates $K$ and $V$ through a convolutional mapping. Subsequently, we use the flatten operation to project the patches into the $d$-dimensional embedding space as the input of the underlying MHSA module in the ConvAttention block.
The MHSA is performed on the obtained $Q$, $K$, and $V$; an MHSA comprises $h$ parallel self-attention heads. The calculation process is as follows:
$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^{O},$$
where $W^{O}$ represents the multi-head trainable parameter weights. The self-attention of each head in MHSA is calculated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q K^{T}}{\sqrt{d}} + B\right) V,$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, which are obtained by convolution projection, $n$ denotes the number of patch tokens, $H_t \times W_t$ stands for the size of the feature map (so that $n = H_t W_t$), and $d$ is the query/key dimension. We follow [28,44] by including a relative position bias $B \in \mathbb{R}^{n \times n}$. Since the relative position along each axis lies in the range $[-H_t + 1, H_t - 1]$ (respectively $[-W_t + 1, W_t - 1]$), we parameterize a smaller bias matrix $\hat{B} \in \mathbb{R}^{(2 H_t - 1) \times (2 W_t - 1)}$; the values of $B$ are taken from $\hat{B}$.
As shown in Figure 4a, a CMLP is then applied, which consists of two convolution layers. The output $t_i$ obtained after the CMLP is used as the input of the Transformer branch in the next fusion module and, at the same time, it is feature-fused with the feature map of the same resolution on the CNN branch.
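A hedged PyTorch sketch of a CSF block along these lines is given below. The assignment of $Q$ to the Transformer feature and $K$/$V$ to the summed feature, the 1x1 convolutional mappings, the residual connections, the CMLP expansion ratio, and the omission of the relative position bias are all assumptions of this sketch, not details stated above.

```python
import torch
import torch.nn as nn


class ConvAttentionSketch(nn.Module):
    """Attention between the Transformer tokens and the summed
    CNN/Transformer feature map, with convolutional Q/K/V mappings."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q_map = nn.Conv2d(dim, dim, 1)       # assumed 1x1 convolutional mapping
        self.kv_map = nn.Conv2d(dim, 2 * dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, t, f_hat):
        b, c, h, w = t.shape
        s = t + f_hat                              # summed local/global feature map
        q = self.q_map(t)                          # assumption: Q from the Transformer feature
        k, v = self.kv_map(s).chunk(2, dim=1)      # assumption: K, V from the summed feature

        def tokens(x):                             # flatten maps to (B, heads, N, c/heads)
            return x.reshape(b, self.heads, c // self.heads, h * w).transpose(-1, -2)

        q, k, v = tokens(q), tokens(k), tokens(v)
        attn = (q @ k.transpose(-1, -2)) * self.scale   # relative position bias omitted here
        out = (attn.softmax(dim=-1) @ v).transpose(-1, -2).reshape(b, c, h, w)
        return self.proj(out)


class CSFBlockSketch(nn.Module):
    """CSF block: ConvAttention followed by a convolutional MLP (CMLP)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = ConvAttentionSketch(dim)
        self.cmlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                  nn.Conv2d(4 * dim, dim, 1))

    def forward(self, t, f_hat):
        t = t + self.attn(t, f_hat)                # residual connections assumed
        return t + self.cmlp(t)
```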
Given the varying receptive fields of the CNN and Transformer, the features they extract exhibit asymmetry; at the same time, the spatial information reflected by these features differs greatly. As shown in Figure 4b, when the Transformer branch is fused into the CNN branch, the SFI block uses a spatial attention weight map computed from the feature obtained on the Transformer branch. The calculation formula is as follows:
$$A = \sigma\!\left(\mathrm{Conv}\!\left(\left[F_{\mathrm{avg}};\, F_{\mathrm{max}}\right]\right)\right),$$
where $\sigma$ represents the sigmoid function, and $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ represent the cross-channel average-pooled features and max-pooled features, respectively. The attention map is multiplied by the feature map on the CNN branch to achieve spatial feature enhancement. Then, the result is concatenated with the feature map on the Transformer branch, and the features are further fused by convolution. The final output is used as the input of the CNN branch in the next fusion module. In the last layer of the parallel fusion layer, the two features are finally used as the input of the decoder through a fuse operation; specifically, the outputs of the CNN branch and Transformer branch are added together and fused through a convolution.
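A minimal sketch of the SFI block as described follows; the 7x7 kernel of the spatial attention convolution (as in CBAM-style spatial attention) and the 1x1 fusion convolution are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class SFIBlockSketch(nn.Module):
    """Spatial feature injector: a spatial attention map computed from the
    Transformer-branch feature re-weights the CNN-branch feature, and the
    result is concatenated with the Transformer feature and fused."""
    def __init__(self, dim):
        super().__init__()
        self.attn_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # kernel size assumed
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)           # fusion conv assumed

    def forward(self, f_hat, t):
        avg = t.mean(dim=1, keepdim=True)               # cross-channel average pooling
        mx, _ = t.max(dim=1, keepdim=True)              # cross-channel max pooling
        attn = torch.sigmoid(self.attn_conv(torch.cat([avg, mx], dim=1)))
        f_enh = f_hat * attn                            # spatially re-weighted CNN feature
        return self.fuse(torch.cat([f_enh, t], dim=1))  # concat and fuse -> next CNN input
```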
3.4. Decoder
The decoder in CCFNet is a pure convolution module that consists of multiple upsampling steps to decode the hidden features, with the final output being the segmentation result. First, bilinear interpolation is applied to the input feature map. The following operations are then repeated until the resolution of the original input is restored: the feature maps, with their resolution increased by a factor of 2, are concatenated with the feature maps from the corresponding skip connections, passed through successive convolution layers, and upsampled again using bilinear interpolation. Finally, the feature maps at the restored original resolution are fed into a final convolution layer (the segmentation head) to generate the pixel-level semantic prediction.
The decoder merges the encoder's semantic information through skip connections and concatenation operations to obtain more contextual information. The outputs of the three layers of the CNN branch in the encoder are sequentially connected to the three decoder layers to recover local spatial information and improve finer details. The parallel fusion layer performs a dual-stream fusion of the CNN and Transformer and sends the fused feature output of the two branches to the decoding layer.
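A minimal sketch of one such decoder step is shown below, under the assumption of 3x3 convolutions with batch normalization and ReLU; the actual kernel sizes and normalization are not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlockSketch(nn.Module):
    """One decoder step as described: upsample by 2 with bilinear
    interpolation, concatenate the skip feature from the CNN branch, and
    apply successive convolutions.  Kernel sizes and normalization assumed."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.convs(torch.cat([x, skip], dim=1))


# segmentation head: a final convolution mapping to per-class logits, e.g.
# head = nn.Conv2d(out_ch, num_classes, kernel_size=1)
```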
3.5. Loss Function
In general segmentation tasks, Dice loss [
29] and cross-entropy loss are both frequently used, with Dice loss being suitable for large-sized target objects and cross-entropy loss performing well for a uniform distribution of categories. Following the TransUNet [
16] literature, the loss function used in CCFNet training also takes the combined form of Dice loss and binary cross-entropy, computed over $I$ voxels and $J$ classes, where $\hat{y}_{ij}$ and $y_{ij}$, respectively, represent the predicted value and the true value of class $j$ at pixel $i$.
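As a concrete reference, a minimal PyTorch sketch of such a combined loss is given below; the equal weighting of the two terms, the use of standard multi-class cross-entropy, the soft-Dice formulation, and the smoothing term are assumptions of this sketch rather than details reported above.

```python
import torch
import torch.nn.functional as F


def combined_seg_loss(logits, target, dice_weight=0.5, eps=1e-5):
    """Combined cross-entropy + soft Dice loss over I pixels and J classes.
    logits: (B, J, H, W) raw predictions; target: (B, H, W) integer labels.
    The 0.5/0.5 weighting and the smoothing term eps are assumptions."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])   # B, H, W, J
    one_hot = one_hot.permute(0, 3, 1, 2).float()               # B, J, H, W
    dims = (0, 2, 3)                                            # sum over batch and pixels
    inter = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()     # averaged over classes
    return dice_weight * dice + (1 - dice_weight) * ce
```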