1. Introduction
Following the robust results achieved in Natural Language Processing (NLP) by the transformer [1] introduced by the Google team, that team then brought the transformer [2] into the field of computer vision (CV). Although this was not the first attempt to apply transformers to computer vision, it has become the most widely used model among researchers due to its simplicity and high scalability. At present, transformers have become a central pillar of powerful network architectures used for various computer vision tasks, such as image classification [3,4,5,6], object detection [7,8,9], and image segmentation [10,11,12,13]. Convolutional neural networks (CNNs) are traditionally used in medical image segmentation models. Owing to the excellent performance of transformers in the CV field, transformer frameworks have gradually been introduced into medical image segmentation, such as TransUNet [14], SwinUnet [15], Mamba-UNet [16], etc., and have achieved excellent results. Their disadvantage, however, is that training and inference require a large amount of computational resources, especially on large-scale datasets, which restricts their application on resource-constrained devices and in real-time settings.
The Visual Transformer (ViT) is not widely used in practice, mainly due to its large number of parameters and its high training and inference costs. Recently, a large body of research has focused on compressing Visual Transformer networks by finding more powerful and efficient architectures [17,18,19]. However, many traditional network architecture search methods, such as reinforcement learning, evolutionary algorithms, and Bayesian optimization, consume a lot of computational resources. Guo et al. [20] introduced the Single Path One Shot (SPOS) method, which greatly decreases the search cost compared to traditional approaches. However, it still requires evaluating thousands of sub-networks to identify the optimal sub-architecture, making the process very time-consuming; it often takes several weeks to complete.
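To make the SPOS idea above concrete, the following is a minimal toy sketch (not the original implementation; the layer names and candidate operations are illustrative): the super-network holds several candidate operations per layer, each training step activates exactly one uniformly sampled path, and the discrete space of subnets that must later be evaluated grows multiplicatively with the number of choices.

```python
import random

# Toy illustration of Single Path One Shot (SPOS) sampling: the
# supernet holds several candidate operations per layer, and each
# training step activates exactly one uniformly sampled path.

SEARCH_SPACE = {                     # candidate ops per layer (illustrative)
    "layer1": ["conv3x3", "conv5x5", "identity"],
    "layer2": ["conv3x3", "conv7x7"],
    "layer3": ["mhsa", "mlp", "identity"],
}

def sample_single_path(space, rng=random):
    """Uniformly pick one candidate operation for every layer."""
    return {layer: rng.choice(ops) for layer, ops in space.items()}

def num_subnets(space):
    """Size of the discrete search space that SPOS must later evaluate."""
    total = 1
    for ops in space.values():
        total *= len(ops)
    return total

path = sample_single_path(SEARCH_SPACE)
print(path)                           # one randomly activated architecture
print(num_subnets(SEARCH_SPACE))      # 3 * 2 * 3 = 18 candidate subnets
```

Even this tiny three-layer space contains 18 subnets; at realistic depths the count reaches thousands, which is why the evaluation stage dominates SPOS's search time.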
Recently, some researchers have used the scaling parameter of Batch Normalization (BN) as an indicator of operational importance to prune a network or search for sub-networks, such as the BN-NAS proposed by Chen et al. [21] and the SCP proposed by Kang [22]. Although these search methods train tens of times faster than standard SPOS, not all models contain a BN layer. More and more scholars have introduced transformers into medical image segmentation, but traditional transformers do not contain a BN layer; in addition, because a transformer's dependence on its input tokens shifts from shallow to deep layers, such a search strategy is not always practically feasible, and the structure of the searched model is not always optimal.
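The BN-based criterion used by these methods can be sketched in a few lines (a simplified illustration, not the code of BN-NAS or SCP; the values are made up): the absolute value of each channel's learned BN scale serves as its importance score, and the lowest-scoring channels are removed.

```python
# Simplified sketch of BN-scale-based channel ranking: the absolute BN
# scaling factor |gamma| serves as the importance score, and only the
# highest-scoring channels are kept.

def prune_by_bn_gamma(gammas, keep_ratio):
    """Return indices of channels to keep, ranked by |gamma|."""
    n_keep = max(1, int(len(gammas) * keep_ratio))
    ranked = sorted(range(len(gammas)),
                    key=lambda i: abs(gammas[i]), reverse=True)
    return sorted(ranked[:n_keep])

gammas = [0.91, 0.02, 0.45, 0.008, 0.63, 0.11]    # learned BN scales (made up)
print(prune_by_bn_gamma(gammas, keep_ratio=0.5))  # -> [0, 2, 4]
```

The criterion is essentially free to compute, which explains the speed-up over SPOS; its limitation, as noted above, is that it presupposes the presence of BN layers.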
Most models in the field of medical image segmentation do not contain BN, so the scaling factor of a BN layer cannot be used as a search metric. As such, this paper adopts an explicit soft mask as a search metric to indicate the global importance of each dimension in different modules. We perform a joint search over three dimensions of the transformer: its patch mechanism, Multi-Head Self-Attention (MHSA), and Multilayer Perceptron (MLP). We design additional learnable soft masks for the different modules and add L1-norm regularization to force sparsity on the masks. A tanh function is also introduced in the patch search section to prevent the mask values from exploding. In addition, the search process can cause a loss of accuracy; to address this problem, a correction module is introduced at the end of the model to prevent the accuracy degradation caused by the search.
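The masked-search idea described above can be sketched minimally as follows (this is a conceptual illustration, not the authors' implementation; all function names and values are assumed for the example): each prunable dimension carries a learnable soft mask, an L1 penalty added to the task loss pushes masks toward zero, and a tanh bounds the patch masks so they cannot explode during training.

```python
import math

# Minimal sketch of search via learnable soft masks: near-zero masks
# effectively switch a dimension off, L1 regularization encourages
# sparsity, and tanh bounds the patch-mask values.

def patch_mask(raw):
    """Bounded mask for the patch dimension: tanh keeps values in (-1, 1)."""
    return [math.tanh(v) for v in raw]

def l1_penalty(masks, weight=1e-4):
    """Sparsity regularizer added to the task loss."""
    return weight * sum(abs(m) for m in masks)

def masked_features(features, masks):
    """Apply soft masks element-wise; near-zero masks suppress dimensions."""
    return [f * m for f, m in zip(features, masks)]

raw = [2.0, -0.3, 0.0]
m = patch_mask(raw)
print([round(v, 3) for v in m])       # bounded mask values
print(l1_penalty(m))                  # contribution to the total loss
```

During training the mask values are optimized jointly with the network weights; dimensions whose masks shrink toward zero are the ones the search discards.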
In this paper, we propose a model search method for medical image segmentation, DFTransUNet-slim, a joint search method based on sparse masks with an implicit weight-sharing mechanism for better sub-network searching. The model contains no BN layer, which suits the transformer, and its primary benefit lies in search efficiency, as it can leverage pre-trained parameters and conduct a rapid search from them. Another advantage is its zero-cost subnet selection and high flexibility. Whereas a Single Path One Shot (SPOS) search requires thousands of subnets to be evaluated on validation data, once the search is completed, our method can obtain countless subnets and determine the final structure without additional evaluation, based on the target device's trade-off between accuracy and FLOPs. A further advantage is its ability to explore more detailed architectures, such as varying dimensions across different self-attention heads. The continuity of the search space across these dimensions allows architectures with unique dimensions and configurations to be identified in different layers and modules, which consistently yields superior sub-networks compared to other approaches.
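The zero-cost subnet selection mentioned above can be illustrated with a small sketch (hedged: an assumed, simplified view, not the paper's code; the mask values are invented): after the joint search, the trained mask magnitudes directly define a family of subnets, and sweeping a threshold trades retained capacity against FLOPs with no extra validation runs.

```python
# Illustrative sketch of zero-cost subnet selection from trained masks:
# every threshold value yields a different subnet for free.

def select_subnet(mask_values, threshold):
    """Keep every dimension whose |mask| exceeds the threshold."""
    return [i for i, m in enumerate(mask_values) if abs(m) > threshold]

def kept_ratio(mask_values, threshold):
    """Fraction of dimensions retained at a given threshold."""
    return len(select_subnet(mask_values, threshold)) / len(mask_values)

masks = [0.9, 0.05, 0.4, 0.01, 0.7, 0.2]          # illustrative trained masks
for t in (0.0, 0.1, 0.5):
    print(t, select_subnet(masks, t), kept_ratio(masks, t))
```

A deployment can thus pick the largest threshold whose resulting FLOP count fits the target device, without re-running any evaluation, which is the flexibility the comparison with SPOS highlights.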
In addition, tissues in medical images often have blurred boundaries, making it difficult to determine their exact location and shape; medical imaging devices may introduce noise and artifacts during acquisition, which affect the clarity and contrast of the image and thus increase the difficulty of segmentation. Different tissues or organs may also have similar grey values or texture features in an image, making them difficult to distinguish. In this paper, a DF module is added, which can effectively distinguish the boundaries of different tissues using the direction vectors of the pixels on the tissue boundaries and correct misclassified pixels. Furthermore, for models that are large, computationally expensive, and difficult to deploy in resource-constrained environments, we explore a multi-dimensional search approach that can effectively reduce model size.
The main contributions of this paper are as follows:
- (1)
A DFTransUnet model is proposed for medical image segmentation; it is based on the TransUnet framework and capable of pixel-level segmentation.
- (2)
By compressing the DFTransUnet framework, a new search framework, DFTransUnet-slim, is proposed. It is able to perform efficient searches on all three modules in a transformer—its Multi-Head Attention, Multilayer Perceptron, and patch mechanism.
2. Related Work
Efficient model and architecture search: Model compression reduces a model's resource consumption by reducing its number of parameters and its computational complexity while preserving its performance. Popular compression methods include channel pruning [23,24,25], quantization and binarization [26,27,28,29], knowledge distillation [30,31,32], and structure search [33,34,35]. Howard et al. proposed MobileNets [36], which decompose the convolutional filter into depthwise and pointwise convolutions to reduce the number of parameters in a convolutional neural network. Tan et al. introduced EfficientNet [37], which explores uniform scaling across the depth, width, and resolution dimensions to improve both accuracy and efficiency. Network Slimming [24], proposed by Liu et al., uses the BN scaling parameter as a measure to find the optimal substructure. Liu et al. [38] proposed Joint Pruning, a method that simultaneously searches for the optimal number of channels layer by layer while also considering depth and resolution to achieve more precise compression. NetAdapt [39] and AMC [40] employ feedback loops or reinforcement learning to determine the optimal number of channels in a CNN. Additionally, various neural architecture search (NAS) methods focus on exploring different structural operations, such as 3 × 3, 5 × 5, and 7 × 7 convolutions. For instance, SPOS builds a super-network that encompasses all possible configurations and uses an evolutionary search to identify sub-networks within it. However, such NAS methods, defined on discrete operational search spaces, are difficult to generalize and struggle with continuous channel-number search problems.
Efficient visual transformer: Many scholars are already exploring this area. For example, Dynamic-ViT [41] employed multiple hierarchical prediction modules to assess the importance of each patch and find the optimal selection of dynamic patches; however, this approach to patch pruning does not improve parameter efficiency. ViTAS [42] used evolutionary algorithms to search for the best architectures within a specified budget, but its search space is discrete and predefined, which significantly restricts its applicability. GLiT [17] combines a CNN with attention mechanisms to perform an evolutionary search over global and local modules and introduces a local module to model both local and global features. BigNAS [43] introduces a single-stage approach that generates efficient sub-models by slicing weight matrices. Building on this, AutoFormer [18] showed that weight entanglement is better than defining separate weight matrices for each possible sub-module, and it searches for the best sub-network using an evolutionary algorithm; because of the evolutionary search, however, its search space is likewise restricted to a discrete space. S2ViTE [19] provides end-to-end sparsity exploration for Visual Transformers using an iterative pruning and growth strategy. Its structured pruning approach, based on a Taylor-expansion loss term and an L1-norm score function, eliminates complete attention heads and Multilayer Perceptron neurons, but the complete elimination of attention heads is suboptimal, as it limits the learning dynamics of the transformer. Allowing the model to determine the optimal dimension of each attention head, rather than eliminating heads entirely, is a better alternative for pruning the Multi-Head Attention module. Huang et al. proposed Mamba-UNet [13], a medical image segmentation network that combines Visual Mamba Blocks (VSSs) with the UNet architecture, enhancing long-range dependency modelling to improve segmentation accuracy.
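The contrast drawn above between whole-head pruning and per-head dimension search can be sketched as follows (a toy illustration under assumed head counts and widths, not any paper's actual pruning code): instead of dropping an entire attention head, each head survives with an individually chosen width.

```python
# Toy contrast between S2ViTE-style whole-head pruning and a per-head
# dimension search: the latter keeps every head but shrinks each one
# to its own width.

def whole_head_pruning(head_dims, keep_heads):
    """Structured pruning: heads are kept or removed whole."""
    return [d for i, d in enumerate(head_dims) if i in keep_heads]

def per_head_dim_search(head_dims, keep_fraction_per_head):
    """Every head survives, but with its own (possibly reduced) width."""
    return [max(1, int(d * f))
            for d, f in zip(head_dims, keep_fraction_per_head)]

dims = [64, 64, 64, 64]                        # 4 heads, 64 dims each (toy)
print(whole_head_pruning(dims, keep_heads={0, 2}))        # -> [64, 64]
print(per_head_dim_search(dims, [1.0, 0.5, 0.25, 0.75]))  # -> [64, 32, 16, 48]
```

Both variants here keep roughly half the total width, but the per-head search preserves all four heads' attention patterns, which is the learning-dynamics argument made above.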
Medical Image Segmentation: Most existing models for processing medical images are variations or modifications of the Unet [44] framework and, although the ideas behind the modifications differ, the general framework of these models remains U-shaped. For example, Unet++ [45] redesigned the skip connections of Unet into dense skip paths, making semantically similar feature maps easier to optimize and improving segmentation accuracy; Unet3+ [46] proposed full-scale skip connections, which combine low-level details from feature maps of different scales with high-level semantics and maximize the use of the information in full-scale feature maps to improve segmentation accuracy. Many more models are based on variations or modifications of the Unet framework. With the interest surrounding transformers, more and more scholars have introduced transformers into medical image segmentation; the accompanying problem is that the number of model parameters grows sharply, which greatly hinders the use of medical image segmentation in practical applications.
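The U-shaped skip-connection pattern that all of these variants share can be sketched schematically (a deliberately abstract toy, assuming stand-in functions in place of real convolution, pooling, and upsampling blocks): the encoder stores a feature map at each scale, and the decoder fuses each stored map back in at the matching resolution.

```python
# Schematic U-Net skeleton: numbers stand in for feature maps, '*2' and
# '+1' stand in for conv blocks, and '+' stands in for the channel
# concatenation performed by a skip connection.

def unet_forward(x, depth=3):
    def encode(v):                  # stand-in for a downsampling conv block
        return v * 2

    def decode(v):                  # stand-in for an upsampling conv block
        return v + 1

    skips = []
    for _ in range(depth):          # contracting path: store each scale
        x = encode(x)
        skips.append(x)
    x = encode(x)                   # bottleneck
    for skip in reversed(skips):    # expansive path: fuse same-scale skip
        x = decode(x + skip)
    return x

print(unet_forward(1))
```

Unet++ and Unet3+ keep exactly this skeleton and change only how many stored encoder maps feed each decoder stage, which is why the overall framework stays U-shaped.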