1. Introduction
Instance segmentation is a challenging computer vision task that aims to make pixel-level dense predictions and to distinguish different instances in an image. Driven by the practical needs of diverse application scenarios, instance segmentation is in wide demand across industrial and everyday domains. Notably, in autonomous driving [1], instance segmentation plays a pivotal role in helping driving systems recognize distinct lane markings, vehicles, pedestrians, and obstacles, thus enabling an accurate assessment of the surrounding driving environment. Similarly, in industrial production settings, real-time and precise segmentation of objects captured in video frames from work sites can effectively mitigate safety risks and improve production efficiency. Furthermore, in areas such as medical image segmentation [2] and image editing and enhancement, faster and more accurate segmentation remains a constant goal. These factors motivate us to develop an instance segmentation method that optimally balances speed and accuracy.
Recent advancements in deep convolutional networks have led to two-stage models [3,4,5,6,7,8] such as Mask R-CNN and single-stage methods [9,10,11,12,13] such as YOLACT for instance segmentation. Single-stage methods offer faster inference [14] due to their end-to-end architecture, making them more suitable for practical scenarios. In recent years, the YOLO series of object detection models [15,16,17,18,19], renowned for their fast and accurate performance, have also spawned variants adapted for segmentation tasks, further propelling the advancement of instance segmentation. However, there is still room for improvement in the segmentation accuracy of single-stage methods. This motivates our first question: can we retain the real-time advantage of single-stage models while adding new mechanisms to improve their segmentation accuracy?
With the emergence of vision transformers [20] in computer vision, several transformer-based models, such as Mask Transfiner [21], QueryInst [22], SOLQ [23], and Mask2Former [24], have achieved breakthroughs in segmentation accuracy. Self-attention, the core component of transformers [25], captures long-range dependencies better than convolution. However, applying global self-attention throughout feature extraction causes the computational complexity and memory usage of the model to grow quadratically with the number of tokens in the feature map. This makes training the model on ordinary hardware difficult and results in unsatisfactory inference times for downstream tasks.
To address this problem, sparse attention strategies have become a promising alternative to global attention, and significant progress has been made in recent years. The pioneering Swin Transformer [20] introduced local and shifted windows for self-attention computation, greatly reducing computational cost. NAT [26] extracts features by conducting dot-product operations within a window defined by each pixel and its nearest neighbors. DiNAT [27] expands the receptive field by introducing dilation on top of NAT. Despite employing diverse sparsification techniques for key-value pair selection, all of these methods rely on manually defined rules to determine attention regions, so the selected key-value pairs are shared among all query regions. This indiscriminate application of sparse attention in every sub-region fails to attend to different targets differentially. This inspired our second question: can a novel sparse attention mechanism be designed that enables the model to perceive different semantic regions and adaptively search for attention windows?
Furthermore, we note that the aforementioned models conduct attention operations with a fixed window size, which constrains feature capture for objects of different sizes. Hence, our third question is how to simultaneously model global, regional, and local information to better adapt mask prediction to objects of different sizes.
To tackle these challenges and questions, this paper proposes a real-time segmentation model called ESAMask, with the objective of improving accuracy while preserving real-time performance. In response to the first question, the proposed model follows the design paradigm of single-stage models and introduces novel modules that are efficient and memory-friendly.
To address the second question, the paper introduces the Related Semantic Perceived Attention module (RSPA), which dynamically adapts to different semantic regions. RSPA performs coarse-grained correlation calculations among sub-regions of the feature map to retain, for each query region, a few key-value regions with high semantic correlation. Fine-grained attention operations are then performed on these relevant regions, strengthening the semantic representation of the feature maps.
For the third question, considering that targets in an image have different sizes, the paper designs the Mixed Receptive Field Context Perception Module (MRFCPM). This module fuses information from three branches: global content awareness, large-kernel region awareness, and convolutional channel attention. By explicitly modeling information at global, regional, and local scales, it improves the segmentation accuracy of multi-scale objects.
In addition, to further lighten the model, the paper introduces GSInvSAM in the network neck. GSInvSAM reduces redundant information and enhances cross-channel interaction by utilizing GSConv [28] and inverted bottleneck structures. Leveraging the parameter-free attention of SimAM [29], it helps the pyramid network focus on key feature areas without increasing computational cost.
Combining the above analysis and strategies, the contributions of this paper are summarized as follows:
- (1) We introduce RSPA into the backbone network, which supports differentiated attention to different semantic features in a sparse, adaptive manner.
- (2) We design GSInvSAM, which removes redundant information and strengthens feature associations between different channels during bidirectional pyramid feature aggregation.
- (3) We add MRFCPM to the prototype branch, which performs multi-level modeling of global, regional, and local representations and helps improve the segmentation of targets at different scales.
- (4) The design of the entire model and each component follows the principles of being lightweight, effective, and efficient. Experimental results show that our model achieves a better balance between accuracy and efficiency.
3. Methods
3.1. Overall Architecture
To leverage the simplicity and fast inference speed of single-stage segmentation models while incorporating the advantages of self-attention and long-range modeling, this paper introduces an effective and efficient real-time segmentation network called ESAMask. The network architecture is depicted in Figure 1. The backbone network encodes the input image across multiple stages, gradually transforming spatial information into high-dimensional channel information. By integrating the designed RSPA module into the feature map downsampled by a factor of 32, the network can effectively capture semantic variations during feature extraction without introducing excessive parameters. To enhance feature fusion across different scales, this study adopts a conventional two-way pyramid structure, but proposes a novel GSInvSAM to replace the commonly used CSP block. This module facilitates effective information fusion and interaction among different feature layers while reducing redundant parameters and computational cost. In the prediction head, an anchor-free decoupled head performs the classification and detection tasks, reducing the post-processing time associated with non-maximum suppression (NMS). For the segmentation task, the prototype branch is primarily responsible for mask prediction. Given the importance of fully exploiting features when generating accurate masks, a lightweight MRFCPM is designed and integrated into the prototype branch to cover the diverse range of feature representations required by targets of different scales.
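Reading the architecture as pseudocode, the overall data flow might be sketched as follows. All module classes and their interfaces here are placeholders for illustration rather than the released implementation.

```python
import torch.nn as nn

class ESAMaskSketch(nn.Module):
    def __init__(self, backbone, rspa, neck, det_head, proto_branch):
        super().__init__()
        self.backbone = backbone          # multi-stage encoder (strides 8/16/32)
        self.rspa = rspa                  # RSPA applied to the stride-32 feature map
        self.neck = neck                  # two-way pyramid built from GSInvSAM blocks
        self.det_head = det_head          # anchor-free decoupled head: classes, boxes, mask coefficients
        self.proto_branch = proto_branch  # prototype branch containing MRFCPM

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)     # multi-scale backbone features
        p5 = self.rspa(p5)                # semantic-aware sparse attention on the deepest map
        feats = self.neck((p3, p4, p5))   # bidirectional fusion with GSInvSAM
        dets = self.det_head(feats)       # detection outputs per pyramid level
        protos = self.proto_branch(feats[0])  # mask prototypes from the highest-resolution level
        return dets, protos
```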
3.2. Related Semantic Perceived Attention
Several recent works have designed windowed or sparse attention mechanisms to alleviate the heavy computation of MHSA. However, most of them rely on manually fixed rules that share one subset of key-value pairs indiscriminately across all regions and cannot perceive the semantic relevance of targets in different regions. In this work, we explore a dynamically adaptive, semantically relevant sparse attention mechanism and design the RSPA module. The main idea of RSPA is to first identify, for each region, the k + k/2 most semantically relevant sub-regions among all sub-regions of the feature map, discard irrelevant or weakly relevant regions, and finally perform token-level attention within the relevant regions retained for each region. The execution process of RSPA is shown in Figure 2.
Region division and related region search. For the input feature map X, we divide it into M × M non-overlapping grids. By linearly mapping the partitioned X, the Query, Key, and Value tensors are obtained (Q, K, V ∈ R^{M²×(HW/M²)×C}). To establish semantic associations for each region, this paper uses a directed graph to construct an adjacency matrix. Specifically, the average value of each region is first calculated to obtain the region-level query and key Q^r, K^r ∈ R^{M²×C}. Then, the affinities between different regions are obtained by matrix multiplication to construct an adjacency matrix A^r ∈ R^{M²×M²}. This process can be represented as follows:

A^r = Q^r (K^r)^T   (1)

where A^r_{ij} represents the semantic correlation between region i and region j, and (·)^T represents the matrix transpose.
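To make this routing step concrete, the following is a minimal PyTorch-style sketch of Eq. (1), assuming the feature map is kept in (B, H, W, C) layout with H and W divisible by M; variable names are our own.

```python
import torch

def region_adjacency(q, k, M):
    """q, k: (B, H, W, C) query/key maps; M: number of regions per side."""
    B, H, W, C = q.shape
    # partition into M*M non-overlapping regions and average-pool each region
    q_r = q.view(B, M, H // M, M, W // M, C).mean(dim=(2, 4)).reshape(B, M * M, C)
    k_r = k.view(B, M, H // M, M, W // M, C).mean(dim=(2, 4)).reshape(B, M * M, C)
    # Eq. (1): A^r = Q^r (K^r)^T, region-to-region semantic affinities
    return q_r @ k_r.transpose(-2, -1)          # (B, M*M, M*M)
```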
Next, we prune the adjacency matrix and perform a row-wise top-k operation to obtain a semantically related index matrix I^r ∈ N^{M²×k}. The formula is as follows:

I^r = topk(A^r)   (2)

where the topk(·) operation retrieves the indices of the k regions with the highest relevance to each query region, based on the magnitudes of the affinity matrix A^r.
Among the k correlated regions, those with higher correlation values are most likely located inside the same target, whereas those with the lowest of the k values, such as the (k−1)-th and k-th regions, are likely located near the target boundary. To improve the perception of contextual information inside and outside the target boundary during feature extraction, we borrow the idea of dilated convolution and add the dilated regions corresponding to the last k/2 relevant regions to the set of semantically relevant regions, where k/2 is rounded down when k is odd.
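The row-wise top-k routing of Eq. (2) and the dilated-region extension can be sketched as follows. Here expand_regions is a hypothetical helper standing in for the mapping from the lowest-ranked k/2 relevant regions to their dilated counterparts, since its exact definition is not spelled out above.

```python
import torch

def route_regions(a_r, k, expand_regions):
    """a_r: (B, M*M, M*M) adjacency matrix; k >= 2 assumed.
    Returns indices of the k + k//2 regions each query region attends to."""
    # Eq. (2): row-wise top-k indices of the most related regions
    idx_topk = torch.topk(a_r, k=k, dim=-1).indices            # (B, M*M, k)
    # the k//2 lowest-ranked relevant regions are likely near target
    # boundaries; map them to their dilated counterparts (hypothetical helper)
    idx_dilated = expand_regions(idx_topk[..., -(k // 2):])    # (B, M*M, k//2)
    return torch.cat([idx_topk, idx_dilated], dim=-1)          # (B, M*M, k + k//2)
```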
Associated region token attention. According to the index matrix I^r and the corresponding dilated regions, we can perform token-level attention between each query region i and the joint key-value pairs of its top k + k/2 semantically related regions. Since the relevant regions are scattered over different parts of the feature map, attending to them region by region for each query would be very inefficient. Therefore, before the attention operation, we first aggregate the key-value pair tensors of the relevant regions so that GPU-friendly token attention can be performed.
The process of the aggregation operation is shown in Formulas (3) and (4):

K^g = gather(K, I^r)   (3)
V^g = gather(V, I^r)   (4)

where the gather(·) operation aggregates the scattered related regions of K and V that correspond to the same query region, K^g is the key tensor after aggregation, and V^g is the value tensor after aggregation.
The process of token attention can be expressed as Equation (5):

O = softmax(Q (K^g)^T / √C) V^g   (5)

where C represents the number of channels; the scaling factor √C is used to avoid vanishing gradients and over-concentration of the attention weights.
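A possible implementation of the gather step (Eqs. (3) and (4)) and the token attention of Eq. (5) is sketched below, assuming the tensors have already been partitioned into R = M × M regions of T tokens each and S = k + k//2 regions are kept per query region.

```python
import torch
import torch.nn.functional as F

def routed_attention(q, k, v, idx):
    """q, k, v: (B, R, T, C) region-partitioned tensors; idx: (B, R, S) routed region indices."""
    B, R, T, C = k.shape
    S = idx.shape[-1]
    # Eqs. (3)-(4): gather the key/value tokens of the selected regions per query region
    idx_exp = idx[..., None, None].expand(B, R, S, T, C)
    k_g = torch.gather(k[:, None].expand(B, R, R, T, C), 2, idx_exp).reshape(B, R, S * T, C)
    v_g = torch.gather(v[:, None].expand(B, R, R, T, C), 2, idx_exp).reshape(B, R, S * T, C)
    # Eq. (5): token attention of each query region over its gathered key-value pairs
    attn = F.softmax(q @ k_g.transpose(-2, -1) / C ** 0.5, dim=-1)   # (B, R, T, S*T)
    return attn @ v_g                                                # (B, R, T, C)
```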
3.3. GSInvSAM
The backbone network is usually used as an encoder to extract image features. As the model deepens, spatial information is gradually converted into channel information, and the nonlinear representational capacity of the model increases. To fuse backbone feature information at different scales, various feature pyramid networks are widely used. However, directly concatenating the feature maps of two adjacent layers inevitably brings information redundancy and a lack of interaction between channels. To alleviate these problems when processing the concatenated feature maps in the neck, we propose the GSInvSAM structure based on GSConv [28], the inverted bottleneck, and SimAM [29], as shown in Figure 3.
GSInvBottleneck is the basic block of GSInvSAM. It consists of a GSConv [28] and two symmetric kernel-1 convolution operations. GSConv compresses redundant information by halving the number of channels and applying depthwise operations, and shuffles the channel features to enhance feature interaction. After GSConv, a symmetric pair of 1 × 1 convolutions performs channel expansion and then compression to further strengthen the fusion of channel information. Borrowing ideas from OSA [40], we aggregate GSInvBottleneck blocks of multiple depths to generate richer gradient flow information. In addition, we add the Simple Attention Module (SimAM) [29] at the end of GSInvSAM. Based on the optimal solution of an energy function, SimAM assigns a different weight to each pixel of the feature map, capturing important feature representations without adding any parameters.
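The following sketch illustrates, under our own reading of the text, how GSConv with channel shuffle, the 1 × 1 expand/compress pair of GSInvBottleneck, and parameter-free SimAM weighting could be composed; kernel sizes, the expansion ratio, and the residual connection are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # dense conv to half the output channels, then a depthwise conv on that half
        self.conv = nn.Sequential(nn.Conv2d(c_in, c_out // 2, 3, 1, 1, bias=False),
                                  nn.BatchNorm2d(c_out // 2), nn.SiLU())
        self.dwconv = nn.Sequential(nn.Conv2d(c_out // 2, c_out // 2, 5, 1, 2,
                                              groups=c_out // 2, bias=False),
                                    nn.BatchNorm2d(c_out // 2), nn.SiLU())

    def forward(self, x):
        y = self.conv(x)
        y = torch.cat([y, self.dwconv(y)], dim=1)   # dense half + depthwise half
        b, c, h, w = y.shape
        # channel shuffle with two groups to mix the halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSInvBottleneck(nn.Module):
    def __init__(self, c, expand=2):
        super().__init__()
        self.gsconv = GSConv(c, c)
        # symmetric 1x1 convolutions: channel expansion then compression
        self.pw = nn.Sequential(nn.Conv2d(c, c * expand, 1), nn.SiLU(),
                                nn.Conv2d(c * expand, c, 1))

    def forward(self, x):
        return x + self.pw(self.gsconv(x))          # residual connection (assumed)

def simam(x, lam=1e-4):
    # parameter-free SimAM weighting: each pixel is scaled by its inverse energy
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    e_inv = d / (4 * (v + lam)) + 0.5
    return x * torch.sigmoid(e_inv)
```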
3.4. Global Content-Aware Module
Self-attention mechanisms have achieved remarkable success in capturing long-range dependencies, especially for dense prediction tasks. However, the computational cost and memory usage of global self-attention grow quadratically with the number of tokens. To model global information while improving inference efficiency, this paper proposes a memory-friendly Global Content-aware Module (GCAM), which contains a lightweight and efficient axial attention branch for extracting global semantics and a detail extraction branch based on small-kernel convolution to retain local details. The structure of GCAM is shown in Figure 4.
Axial Attention. To extract global contextual information at low computational cost, we perform self-attention along the horizontal and vertical axes separately and then aggregate the information from both directions. Specifically, we convert the input feature map X into Query, Key, and Value tensors. Along the horizontal axis, we average-pool each row of the feature tensors to obtain Q^h, K^h, V^h ∈ R^{H×C}. The calculation of Q^h, K^h, and V^h can be expressed as follows:

Q^h_i = (1/W) Σ_{j=1}^{W} Q_{i,j}   (6)
K^h_i = (1/W) Σ_{j=1}^{W} K_{i,j}   (7)
V^h_i = (1/W) Σ_{j=1}^{W} V_{i,j},   i = 1, …, H   (8)

where W denotes the width of the image, j indexes the j-th column of the image, and H denotes the total number of rows of the image.
Along the vertical axis, we perform the same operation on the elements of each column to obtain Q^v, K^v, V^v ∈ R^{W×C}. The calculation of Q^v, K^v, and V^v can be expressed as follows:

Q^v_j = (1/H) Σ_{i=1}^{H} Q_{i,j}   (9)
K^v_j = (1/H) Σ_{i=1}^{H} K_{i,j}   (10)
V^v_j = (1/H) Σ_{i=1}^{H} V_{i,j},   j = 1, …, W   (11)

where H denotes the height of the image, i indexes the i-th row of the image, and W denotes the total number of columns of the image.
To make the feature tensors position-sensitive, we introduce axis position embeddings to perceive the position of features. The horizontal position embedding vector r^h is constructed by randomly initializing learnable parameters and performing linear interpolation to the required length; r^v is obtained in the same way. During training, the position vectors are updated dynamically according to the actual features. Position-aware axial attention can be expressed as Formula (12):

O^h_i = softmax((Q^h_i + r^h_i)(K^h + r^h)^T / √C) V^h,   O^v_j = softmax((Q^v_j + r^v_j)(K^v + r^v)^T / √C) V^v   (12)

where (i, j) denotes the position of a pixel, with i its row coordinate and j its column coordinate, and r^h and r^v denote the position vectors, which are added to the query tensor Q and key tensor K so that the position information of the feature map can be perceived.
The horizontal and vertical tensors with embedded position information are fed separately into the multi-head attention module for self-attention. To combine the feature information from both directions and model global information, we fuse the horizontal and vertical features using a simple and efficient broadcast operation. The time complexity of the axial average pooling is O(HWC), and the time complexity of the two axial self-attention operations is O((H² + W²)C), compared with O(H²W²C) for global self-attention. Thus, the axial attention branch significantly reduces the time complexity of modeling global dependencies.
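A minimal sketch of this axial attention branch is given below, covering the row/column pooling of Eqs. (6)-(11), interpolated learnable position embeddings, per-axis multi-head attention, and a broadcast fusion. The number of heads, the maximum embedding length, and the additive form of the broadcast are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention(nn.Module):
    def __init__(self, c, num_heads=4, max_len=64):
        super().__init__()
        # randomly initialized learnable axis position embeddings,
        # linearly interpolated to the actual H / W at run time
        self.pos_h = nn.Parameter(torch.randn(1, max_len, c) * 0.02)
        self.pos_w = nn.Parameter(torch.randn(1, max_len, c) * 0.02)
        # c must be divisible by num_heads (placeholder value)
        self.attn_h = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(c, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        xh = x.mean(dim=3).transpose(1, 2)     # (B, H, C) row-wise pooling, Eqs. (6)-(8)
        xw = x.mean(dim=2).transpose(1, 2)     # (B, W, C) column-wise pooling, Eqs. (9)-(11)
        ph = F.interpolate(self.pos_h.transpose(1, 2), size=H,
                           mode='linear', align_corners=False).transpose(1, 2)
        pw = F.interpolate(self.pos_w.transpose(1, 2), size=W,
                           mode='linear', align_corners=False).transpose(1, 2)
        # position embeddings are added to queries and keys (cf. Eq. (12))
        yh, _ = self.attn_h(xh + ph, xh + ph, xh)   # (B, H, C)
        yw, _ = self.attn_w(xw + pw, xw + pw, xw)   # (B, W, C)
        # broadcast fusion of the two 1-D axes back to a 2-D map
        return yh.transpose(1, 2)[..., None] + yw.transpose(1, 2)[:, :, None, :]
```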
Detail Extraction. To compensate for the local details lost when global information is extracted by axial attention, we design the Detail Extraction branch to capture and preserve local information. As shown in Figure 4, the Q, K, and V tensors are concatenated along the channel dimension, and local features are extracted by a small-kernel 3 × 3 depthwise separable convolution. Then, a pointwise convolution with kernel 1 and the corresponding normalization and activation operations reduce the channel dimension back to C. Finally, the Detail Extraction branch and the Axial Attention branch are fused multiplicatively to achieve mutual complementation of global and local information.
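The detail extraction branch and its multiplicative fusion with the axial branch might look as follows; the intermediate channel width (3C from concatenating Q, K, and V) and the normalization/activation choices are assumptions.

```python
import torch
import torch.nn as nn

class DetailExtraction(nn.Module):
    def __init__(self, c):
        super().__init__()
        # 3x3 depthwise convolution over the concatenated (Q, K, V) tensor
        self.dwconv = nn.Sequential(
            nn.Conv2d(3 * c, 3 * c, 3, 1, 1, groups=3 * c, bias=False),
            nn.BatchNorm2d(3 * c), nn.SiLU())
        # kernel-1 pointwise convolution reduces the channels back to C
        self.pwconv = nn.Sequential(
            nn.Conv2d(3 * c, c, 1, bias=False),
            nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, q, k, v):                 # q, k, v: (B, C, H, W)
        return self.pwconv(self.dwconv(torch.cat([q, k, v], dim=1)))

# inside GCAM (sketch): global and local branches are fused multiplicatively
# out = axial_attention(x) * detail_extraction(q, k, v)
```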
3.5. Mixed Receptive Field Context Perception Module
The generation of prototypes plays a key role in the quality of instance segmentation, and different prototypes represent different instance information in the feature maps. To enable the prototype branch of the head to fully extract and preserve the features encoded by the backbone, a novel Mixed Receptive Field Context Perception Module is designed in this paper. It jointly captures global, regional, and local representation information, which helps segment objects at different scales. The structure of MRFCPM is shown in Figure 5.
The module comprises three branches, global attention, regional attention, and channel attention, which extract key representation information at large, medium, and small scales, respectively. The global attention branch uses the lightweight GCAM designed in this paper to model large-scale, long-distance information dependencies. For small-scale targets and local details, standard convolution already extracts features well, so we directly use an ordinary convolution with a kernel of 3 to capture local features and a simple SE channel attention block to strengthen the interaction of channel information along key dimensions. For regional feature extraction, window attention is the most common choice. However, to reduce computational cost and keep the model lightweight and efficient, this paper does not adopt window attention; instead, a Large Kernel Region-aware Module (LKRAM) is designed to extract crucial region-specific information. The structure of LKRAM is shown in Figure 6.
Large Kernel Region-aware Module. The larger receptive field is the reason why window attention has an advantage over ordinary convolution. However, computing self-attention over an entire window inevitably introduces a large amount of computation. Inspired by large-kernel convolution and depthwise convolution, this paper argues that large-kernel depthwise convolution can provide a large receptive field similar to that of window self-attention while greatly reducing computational cost. Therefore, we use a large-kernel (e.g., 7 × 7) depthwise convolution to extract regional information. In addition, we use kernel-1 convolution scaling to perform expansion and compression operations on the channels, minimizing information redundancy between channels. In the whole module, two consecutive residual connections are used to ensure gradient stability, and the Batch Normalization (BN) commonly used with convolution is replaced with Layer Normalization (LN) to avoid the problem of weak model generalization caused by BN.
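A minimal sketch of LKRAM as described above could look like this: normalization, a 7 × 7 depthwise convolution for the large receptive field, a kernel-1 expand/compress pair, and two residual connections. The expansion ratio and the use of GroupNorm(1, C) as a channel-wise LayerNorm for convolutional maps are our assumptions.

```python
import torch
import torch.nn as nn

class LKRAM(nn.Module):
    def __init__(self, c, expand=4, kernel=7):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, c)     # LayerNorm-style normalization (LN instead of BN)
        # large-kernel depthwise convolution: window-like receptive field at low cost
        self.dwconv = nn.Conv2d(c, c, kernel, 1, kernel // 2, groups=c)
        self.norm2 = nn.GroupNorm(1, c)
        # kernel-1 convolutions: channel expansion followed by compression
        self.pw = nn.Sequential(nn.Conv2d(c, c * expand, 1), nn.GELU(),
                                nn.Conv2d(c * expand, c, 1))

    def forward(self, x):
        x = x + self.dwconv(self.norm1(x))  # first residual connection
        x = x + self.pw(self.norm2(x))      # second residual connection
        return x
```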