Article

Global–Local Query-Support Cross-Attention for Few-Shot Semantic Segmentation

by Fengxi Xie 1,†, Guozhen Liang 1,† and Ying-Ren Chien 2,*
1 Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10623 Berlin, Germany
2 Department of Electrical Engineering, National Ilan University, Yilan 260007, Taiwan
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2024, 12(18), 2936; https://doi.org/10.3390/math12182936
Submission received: 31 July 2024 / Revised: 10 September 2024 / Accepted: 19 September 2024 / Published: 21 September 2024

Abstract

Few-shot semantic segmentation (FSS) models aim to segment unseen target objects in a query image given only scarce annotated support samples. This challenging task requires effective use of the information contained in the limited support set. However, the majority of existing FSS methods either compress support features into several prototype vectors or construct pixel-wise support-query correlations to guide the segmentation, and thus fail to exploit the support information from a global–local perspective. In this paper, we propose Global–Local Query-Support Cross-Attention (GLQSCA), which exploits both global semantics and local details. Implemented with multi-head attention in a transformer architecture, GLQSCA treats every query pixel as a token and aggregates its segmentation label from the support mask values, weighted by its similarities with all foreground prototypes (global information) and all support pixels (local information). Experiments show that our GLQSCA significantly surpasses state-of-the-art methods on the standard FSS benchmarks PASCAL-5i and COCO-20i.

1. Introduction

In recent years, Deep Neural Networks (DNNs) have played a pivotal role in achieving substantial performance enhancements in semantic segmentation [1,2]. However, DNN-based methods require a substantial number of annotated examples extracted from extremely large-scale datasets. In low-data scenarios, DNNs face challenges in predicting unseen categories, primarily attributed to poor generalization. Humans, in contrast, exhibit the remarkable ability to rapidly master new skills or tasks based on prior knowledge and sparse examples. Motivated by this human cognitive pattern, few-shot learning (FSL) [3,4] was developed. The objective of FSL is to understand unseen classes with scarce annotated exemplars. FSL has found widespread application in computer vision tasks, such as image classification [5], object detection [6], image retrieval [7] and semantic segmentation [8,9,10,11,12,13,14,15,16,17,18,19,20,21]. In this work, we focus on few-shot semantic segmentation (FSS), where the model leverages only a small number of support images to segment novel targets in a query image.
The crux of FSS lies in effectively leveraging the information within the limited support set. Existing FSS methods utilize support information from two perspectives. On one hand, many prior works adopted the notion of metric-based prototyping [22], where support features are compressed into a limited number of class-wise prototypes against which the query features are compared for fine-grained mask predictions. For instance, PANet [9] extracts class-wise prototypes from support features via average pooling and then segments query objects by matching the query pixels with these prototypes in an embedding space. CANet [10] also adopts the class-wise average pooling strategy and proposes an iterative optimization algorithm to refine the predictions. To enhance the prototype-based semantic representation, PPNet [12] generates multiple prototypes to represent a single class. CECNet [13] proposes a clustered-patch attention mechanism to extract discriminative features and designs a new distance metric to measure the similarity between support-query feature pairs. SiGCN [14] introduces a support-induced GCN to locate query targets at different feature levels and proposes an instance association module to exploit high-level semantics from the support and query sets. QPENet [15] proposes a dual prototype enhancement branch to generate a prototype bank, which significantly enhances the prototype representation ability.
However, prototypical approaches are faced with non-negligible information loss resulting from the compression of support information. This problem, on the other hand, has encouraged researchers to explore pixel-wise support-query correlations. Noteworthy contributions in this direction include PGNet [11] and DAN [16], both of which employ graph attention mechanisms to establish pixel-to-pixel dense support-query correspondences. Moreover, HSNet [17] proposed 4D convolution operations to refine the dense support-query connections. Recently, researchers have started to explore applications of the transformer architecture [23] in aggregating pixel-wise support information into query predictions. For instance, CyCTR [24] calculates a support-query affinity map via multi-head attention to guide the query segmentation. DCAMA [20] also adopts scaled dot-product attention to aggregate the pixel-wise support-query correlations, where background cues in the support set are fully utilized.
Although pixel-wise approaches have pushed FSS performance to a new level, two major drawbacks remain in effectively matching support-query pairs. Firstly, many existing approaches mask out the support features with the support ground truths, thereby removing important cues for query prediction. Secondly, certain ambiguous query tokens lack clear similarities to any support pixel due to intra-class discrepancies, resulting in incomplete query segmentation.
To address the aforementioned issues, we propose Global–Local Query-Support Cross-Attention (GLQSCA) for FSS, aiming to fully exploit both local support-query similarities and global prototype-query correlations. As shown in Figure 1, GLQSCA additively aggregates the segmentation label of a query pixel from the support mask values, weighting each contribution by the pixel's similarities to the corresponding foreground-masked prototypes and to all the support image pixels. The underlying principle of GLQSCA can be likened to a voting process: a query pixel is voted as foreground if it is semantically similar to the foreground prototypes and foreground support pixels, and vice versa. Incorporating foreground prototypes plays a crucial role in resolving ambiguous target pixels, as the prototypes encapsulate more general characteristics of the novel classes. In addition, it is noteworthy that the GLQSCA query mask generation pipeline can be readily implemented with the multi-head attention architecture [23], where each query pixel serves as an input token, the matrix Q comprises the flattened query features, the matrix V consists of the flattened support mask values, and the matrix K is composed of the flattened support features added to the flattened foreground-masked prototype features. Subsequently, coarse query masks are obtained through softmax(QK^T)V and are mingled with the multi-level support and query features in the mixer to yield refined predictions.
Furthermore, the majority of previous methods [9,10,25,26] have relied on averaged or weighted prototypes under the few-shot setting, leading to further information loss. In contrast, our proposed GLQSCA maximizes the utilization of support information under different n-shot settings. Specifically, the mask value of each query token is aggregated from all n support masks and weighted by its similarities to all n corresponding prototypes and to all the pixels of the n support images. Meanwhile, the core of our mask aggregation pipeline is scaled dot-product attention and is thus non-parametric. Consequently, the same one-shot trained model can be reused for n-shot segmentation.
The primary contributions of our work can be summarized as follows:
  • We propose a novel Global–Local Query-Support Cross-Attention (GLQSCA) mechanism for FSS, which is designed to effectively leverage both global prototype-query connections and local pixel-wise support-query similarities from the support images. The performance exhibits significant improvement in terms of the mean intersection over union (MIoU) and the foreground–background IoU (FB-IoU).
  • The proposed GLQSCA can be easily and efficiently extended from one-shot to few-shot semantic segmentation with the full utilization of support information and without additional training.
  • The proposed GLQSCA was evaluated on the FSS benchmarks PASCAL-5i and COCO-20i. The results show that our GLQSCA surpasses current prevalent FSS algorithms under both one-shot and five-shot settings. Comprehensive ablation studies prove the effectiveness and accuracy of the proposed framework.

2. Problem Definition

In the context of FSS, we define a support set $S = \{(I_s^i, M_s^i(l))\}_{i=1}^{n}$, where $I_s^i$ is the support image of shot $i$ and $M_s^i(l)$ is the corresponding segmentation mask for a specific semantic class $l$. Similarly, we denote the query set as $Q = \{(I_q, M_q(l))\}$. The objective of FSS is to train a model $f(I_q, S)$ capable of taking both the support set $S$ and a query image $I_q$ as input and predicting a binary mask $\hat{M}_q$ for a novel category $l$.
The network is trained based on episodic learning [5]. We denote two image sets $D_{train}$ and $D_{test}$ whose object categories do not overlap. Both image sets comprise multiple episodes, each containing a randomly sampled support set $S$ and a query set $Q$. In the training stage, the model is optimized on $D_{train}$ with extensive annotated samples, learning the mapping from $(S, I_q)$ to the query prediction. During testing, the trained model is assessed on $D_{test}$ (novel classes) with scarce annotated samples.
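For concreteness, the following minimal sketch shows how one such episode can be assembled. It assumes a generic dataset object with hypothetical helpers classes_of() and sample_pair(); the names and signatures are illustrative and are not taken from the authors' code.

```python
import random

def sample_episode(dataset, split="train", n_shots=1):
    """Build one episode: an n-shot support set S and a query pair for a sampled class."""
    cls = random.choice(dataset.classes_of(split))                 # base classes for D_train, novel classes for D_test
    support = [dataset.sample_pair(cls) for _ in range(n_shots)]   # [(I_s^i, M_s^i(l)), ...]
    query_image, query_mask = dataset.sample_pair(cls)             # M_q(l) is used only to compute the loss
    return support, (query_image, query_mask), cls
```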

3. Methodology

3.1. Overall Framework

The overall architecture of the proposed GLQSCA is depicted in Figure 2. To simplify, we first outline the overall framework under the 1-shot setting. In each episode, multi-level support features $F_s^i \in \mathbb{R}^{B \times C \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}}$ and query features $F_q^i \in \mathbb{R}^{B \times C \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}}$ are generated from the input images by a pre-trained feature extractor, where $B$ is the batch size, $C$ is the channel number, and $H$ and $W$ represent the height and width of the input images, respectively. The index $i \in \{1, 2, 3, 4\}$ corresponds to the feature level. Similar to Min et al. [17], we utilize all the intermediate-level features. Meanwhile, we downsample the support masks to align with the dimensions of the multi-level image features. Then, we separately perform the cross-attention at feature levels $i \in \{2, 3, 4\}$. At each selected feature level, the support and query features, concatenated with the support mask(s), are input to the GLQSCA block to generate the coarse query mask. Afterward, the multi-level coarse masks are processed by a 3 × 3 convolution and subsequently resampled to the same dimensions $\frac{H}{8} \times \frac{W}{8}$. Finally, these intermediate query masks, along with the skip-connected image features at levels $\{1, 2\}$, are passed through the mixer block to generate the refined query predictions.
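A minimal sketch of this per-episode pre-processing is given below, assuming a callable extract_features that returns the four levels of backbone features (a stand-in for the frozen pre-trained extractor); it is illustrative rather than the authors' implementation.

```python
import torch.nn.functional as F

def prepare_episode_inputs(extract_features, image_s, image_q, mask_s):
    """Multi-level support/query features plus support masks downsampled to each feature size."""
    # image_s, image_q: (B, 3, H, W); mask_s: (B, 1, H, W) float binary support mask
    feats_s = extract_features(image_s)   # list of 4 feature maps; level i has spatial size H/2^(i+1) x W/2^(i+1)
    feats_q = extract_features(image_q)
    masks_s = [F.interpolate(mask_s, size=f.shape[-2:], mode="nearest") for f in feats_s]
    return feats_s, feats_q, masks_s
```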

3.2. Global–Local Query-Support Cross-Attention Block

In this subsection, we introduce our global–local context-aware cross-attention block, which is the core of our work. To incorporate global semantic information, we first generate the foreground prototype features as follows:
$F_c = F_s \otimes M_s$
$P = \mathrm{MAP}(F_c)$
$F_p = P.\mathrm{Repeat}(1, 1, HW)$
where $F_c \in \mathbb{R}^{B \times C \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}}$ refers to the class-aware features, $\mathrm{MAP}(\cdot)$ is the mean average pooling operation, $P \in \mathbb{R}^{B \times C}$ denotes the class-aware prototypes, and $F_p \in \mathbb{R}^{B \times C \times (\frac{H}{2^{i+1}} \cdot \frac{W}{2^{i+1}})}$ represents the class-aware prototype features. $\otimes$ denotes element-wise multiplication.
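The sketch below illustrates this prototype generation, reading MAP as masked average pooling over the foreground region (a common choice in FSS; a plain spatial mean over $F_c$ is an alternative reading of the text). Tensor names are illustrative.

```python
import torch

def foreground_prototype(f_s: torch.Tensor, m_s: torch.Tensor, eps: float = 1e-6):
    # f_s: (B, C, h, w) support features; m_s: (B, 1, h, w) binary foreground mask
    f_c = f_s * m_s                                          # foreground-masked, class-aware features F_c
    p = f_c.sum(dim=(2, 3)) / (m_s.sum(dim=(2, 3)) + eps)    # masked average pooling -> prototype P, (B, C)
    hw = f_s.shape[-2] * f_s.shape[-1]
    f_p = p.unsqueeze(-1).repeat(1, 1, hw)                   # prototype broadcast to every position -> F_p, (B, C, h*w)
    return f_c, p, f_p
```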
We adopt the multi-head attention in the transformer architecture and employ sine and cosine functions to create a positional encoding [23]. We apply the positional encoding and a linear projection $\mathrm{Linear}(\cdot)$ to generate the $Q$, $V$, and $K$ matrices from the flattened $F_q$, the flattened $M_s$, and the summation of the flattened $F_s$ and the flattened $F_p$ (weighted by the learnable parameter $\alpha$, which is initially set to 1), respectively:
$Q = \mathrm{Linear}\left(F_q.\mathrm{Reshape}\left(B, \tfrac{C}{Heads}, Heads, HW\right)\right)$
$V = M_s.\mathrm{Repeat}(1, Heads, 1, 1)$
$V = V.\mathrm{Reshape}(B, Heads, HW, 1)$
$K_s = F_s.\mathrm{Reshape}\left(B, \tfrac{C}{Heads}, Heads, HW\right)$
$K_p = F_p.\mathrm{Reshape}\left(B, \tfrac{C}{Heads}, Heads, HW\right)$
$K = \mathrm{Linear}(K_s \oplus \alpha K_p)$
Here, $Heads$ denotes the number of scaled dot-product attention heads, and $\oplus$ represents element-wise addition.
For each query token, we calculate the correlation feature $F_{corr}$, which measures its correlations to the class-wise prototypes (global information) and to all the support pixels (local information):
$F_{corr} = \mathrm{Softmax}\left(\dfrac{QK^{T}}{\sqrt{C/Heads}}\right)$
where the softmax is applied along the support-token dimension. Finally, we obtain the coarse query mask $M_{coarse} \in \mathbb{R}^{B \times 1 \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}}$ by multiplying $F_{corr}$ with $V$:
$M_{coarse} = F_{corr} V$
Intuitively, if a query token is more similar to the foreground prototypes and foreground support pixels, it is voted toward the foreground, and vice versa.
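The following PyTorch sketch re-implements this single-level attention under stated assumptions: positional encodings are omitted for brevity, the linear projections are applied per head, and the head outputs are merged by averaging (one plausible choice the text does not specify). It is an illustrative re-implementation, not the authors' released code.

```python
import torch
import torch.nn as nn

class GLQSCAAttention(nn.Module):
    """Single-level global-local query-support cross-attention (1-shot case)."""
    def __init__(self, channels: int, heads: int):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.d = heads, channels // heads
        self.proj_q = nn.Linear(self.d, self.d)          # Linear(.) applied to the query tokens
        self.proj_k = nn.Linear(self.d, self.d)          # Linear(.) applied to the combined keys
        self.alpha = nn.Parameter(torch.tensor(1.0))     # learnable weight on the prototype keys, initialised to 1

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, hw) -> (B, heads, hw, d)
        B, C, hw = x.shape
        return x.view(B, self.heads, self.d, hw).permute(0, 1, 3, 2)

    def forward(self, f_q, f_s, f_p, m_s):
        # f_q, f_s: (B, C, h, w) query/support features; f_p: (B, C, h*w) prototype features; m_s: (B, 1, h, w)
        B, C, h, w = f_q.shape
        q = self.proj_q(self._split_heads(f_q.flatten(2)))                      # queries from F_q
        k = self.proj_k(self._split_heads(f_s.flatten(2))                       # local keys K_s ...
                        + self.alpha * self._split_heads(f_p))                  # ... plus weighted prototype keys K_p
        v = m_s.flatten(2).unsqueeze(1).expand(B, self.heads, 1, h * w)         # support mask values
        v = v.transpose(-1, -2)                                                 # (B, heads, hw_s, 1)

        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d ** 0.5, dim=-1)   # (B, heads, hw_q, hw_s)
        coarse = (attn @ v).mean(dim=1)                                         # merge heads -> (B, hw_q, 1)
        return coarse.transpose(1, 2).reshape(B, 1, h, w)                       # coarse query mask M_coarse
```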

3.3. Mask–Feature Mixer

After obtaining the collection of intermediate query masks $M_{coarse}^i$ with $i \in \{2, 3, 4\}$, we skip-connect them with the query and support features of levels $\{1, 2\}$. These concatenated features are then passed through the mask–feature mixer to generate the final predictions. As shown in Figure 3, the mixer comprises three mixer blocks, each consisting of two consecutive convolution layers followed by ReLU activation. Two interleaved upsampling operations restore the resolution of the predicted query masks to that of the input images.
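A minimal sketch of this mixer structure follows; the kernel sizes, channel widths, upsampling factors, and exact placement of the skip connections are assumptions for illustration rather than the authors' configuration.

```python
import torch.nn as nn

def mixer_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # one mixer block: two consecutive 3x3 convolutions, each followed by ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

# three mixer blocks with two interleaved upsampling operations; the input is the
# concatenation of the multi-level coarse masks with the skip-connected features
mixer = nn.Sequential(
    mixer_block(in_ch=3 + 256, out_ch=128),   # 3 coarse masks + illustrative skip-feature channels
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    mixer_block(128, 64),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    mixer_block(64, 1),                        # single-channel query mask logits
)
```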

3.4. n-Shot Segmentation

In an n-shot setting (n > 1), n annotated support images are input to the network. Previous methods either perform n separate forward passes and ensemble the predicted masks [17], or average the support prototypes and retrain a specific model for n-shot inference [22,25,26]. However, these approaches inevitably lose support information and introduce additional computational burden. In contrast, we reuse the 1-shot trained model under different n-shot settings while effectively utilizing support information from both global and local perspectives. Concretely, the additional support features, prototype features, and support masks are treated as extra tokens in $K_s$, $K_p$, and $V$, which can be readily implemented by an appropriate tensor reshaping (as shown in Figure 4). The mask generation pipeline remains unchanged, since each query token is simply voted on by all prototype vectors and support pixels at once. Benefiting from the parameter-free scaled dot-product attention, the proposed GLQSCA can be switched from 1-shot to n-shot inference without retraining.
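A sketch of this token-level extension is shown below, assuming the per-shot tensors have already been flattened as in the 1-shot case; variable names are illustrative.

```python
import torch

def stack_support_tokens(feats_s, protos_p, masks_s):
    """Concatenate the n support shots along the token axis before building K and V."""
    # feats_s, protos_p: lists of n tensors (B, C, h*w); masks_s: list of n tensors (B, 1, h*w)
    k_s = torch.cat(feats_s, dim=-1)    # local support tokens,  (B, C, n*h*w)
    k_p = torch.cat(protos_p, dim=-1)   # prototype tokens,      (B, C, n*h*w)
    v = torch.cat(masks_s, dim=-1)      # support mask values,   (B, 1, n*h*w)
    return k_s, k_p, v                  # Q (from the query image) is unchanged
```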

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset and Evaluation Metrics

In this study, we conducted experiments on PASCAL-5i and COCO-20i. The PASCAL-5i dataset, originally introduced in the OSLSM [8] framework, augments PASCAL VOC 2012 [27,28] with additional annotations, and its 20 categories are split uniformly into 4 folds. We evaluated the model with cross-validation: three folds of PASCAL-5i are used for model training, and the remaining fold is held out for testing. COCO-20i is generated from MS-COCO [29] and is split into 4 folds, with 60 base categories and 20 novel categories in each fold. Following previous works [11,17,26], the mean intersection over union (MIoU) is employed for evaluation. For a fair comparison, the foreground–background IoU (FB-IoU) is also reported.
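For reference, one common way to compute the two metrics for binary FSS predictions is sketched below; evaluation protocols differ slightly across papers (e.g., some accumulate intersections and unions over a fold instead of averaging per image), so this is illustrative rather than the exact protocol used here.

```python
import torch

def class_iou(pred: torch.Tensor, gt: torch.Tensor, cls: int) -> float:
    # pred, gt: integer masks of the same shape; IoU of the given class
    inter = ((pred == cls) & (gt == cls)).sum().item()
    union = ((pred == cls) | (gt == cls)).sum().item()
    return inter / union if union > 0 else 1.0

def fb_iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # average of foreground (1) and background (0) IoU for a binary prediction
    return 0.5 * (class_iou(pred, gt, 1) + class_iou(pred, gt, 0))

def mean_iou(results_by_class: dict) -> float:
    # results_by_class: class id -> list of (pred, gt) binary mask pairs for that class
    per_class = [sum(class_iou(p, g, 1) for p, g in pairs) / len(pairs)
                 for pairs in results_by_class.values()]
    return sum(per_class) / len(per_class)
```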

4.1.2. Implementation Details

The experiments were implemented using PyTorch version 1.10.0 [30]. ResNet-50, ResNet-101, and base Swin transformer (Swin-B) models pre-trained on ImageNet [31] were selected as backbone feature extractors. The binary cross entropy (BCE) loss was adopted to supervise model training. Input images and support masks were resized to 384 × 384. For the hyperparameters, we set the learning rate, momentum, and weight decay to 0.001, 0.9, and 0.0001, respectively. A stochastic gradient descent (SGD) optimizer was used to update the network parameters (except those in the frozen backbone layers). We trained the network for 200 epochs with a batch size of 48 for both datasets. Training was performed on four NVIDIA A40 GPUs, and testing on an NVIDIA Tesla T4.
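The sketch below mirrors this training configuration; model and its backbone attribute are placeholders, and the with-logits variant of BCE is used here for numerical stability.

```python
import torch
import torch.nn as nn

def build_training_objects(model: nn.Module):
    """SGD on the non-backbone parameters with the hyperparameters listed above; BCE loss."""
    criterion = nn.BCEWithLogitsLoss()                 # binary cross entropy on the predicted query mask
    for p in model.backbone.parameters():              # backbone layers stay frozen
        p.requires_grad = False
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad),
        lr=1e-3, momentum=0.9, weight_decay=1e-4,
    )
    return criterion, optimizer
```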

4.2. Comparison with State-of-the-Art Methods

Table 1 reports the performance of our GLQSCA against prevalent FSS approaches in terms of mIoU and FB-IoU. For a fair comparison, we ran the source code of these models to evaluate their performance. With the ResNet-50 backbone, our method consistently outperformed all the state-of-the-art (SOTA) methods under all few-shot settings, improving on the second-best methods SiGCN [14] (one-shot) and QPENet [15] (five-shot) by 0.3% mIoU and 0.6% mIoU on PASCAL-5i, respectively. With ResNet-101, HSNet [17] and our method rank in the top two across the benchmarks with outstanding performance. Remarkably, with the Swin-B backbone, our method achieves state-of-the-art performance on all data folds, surpassing the second-best method by 1.2% mIoU (one-shot) and 1.3% mIoU (five-shot). On COCO-20i, our GLQSCA achieves the best performance for every combination of backbone and n-shot setting, outperforming the second-best approach DCAMA [20] by 1.2% mIoU (one-shot) and 1.5% mIoU (five-shot). Additionally, it can be observed that the proposed method significantly surpasses both the prototyping methods and the pixel-wise methods, indicating the effectiveness of combining local and global information in the current FSS paradigm.
Figure 5 and Figure 6 visualize representative segmentation examples under the one-shot and five-shot settings, respectively. The predictions are almost identical to the ground truths, and the five-shot setting yields noticeably better segmentation than the one-shot setting.

4.3. n-Shot Inference Analysis

As shown in Table 2, we also investigated the computational efficiency of the proposed method. The memory cost and inference time increase approximately linearly from one-shot to five-shot, which is consistent with the proposed mask aggregation mechanism, since the number of support tokens grows linearly with n.

4.4. Limitation Analysis

In Table 1, we observe that the proposed method underperforms in some data folds, especially with the ResNet-101 backbone. We attribute this to the method's reliance on the pixel-wise information provided by the extracted features; this local information is progressively diluted as the backbone layers become deeper. This explains why the proposed method achieves better performance with ResNet-50 than with ResNet-101, which is counter-intuitive.
In addition, our GLQSCA achieves the best performance in all data folds with the Swin-B backbone, indicating that the proposed method relies on a strong backbone. The proposed method also performs better on the larger dataset COCO-20i. These two factors inevitably lead to higher computational costs in training.
In addition, Figure 7 presents some representative failure cases of the proposed method. Failures occur when the query objects are very small (row 1), when there is intra-class discrepancy (row 2), and when there is inter-class similarity (row 3). The proposed method tends to yield incomplete query predictions when the target objects are too small, while intra-class discrepancy and inter-class similarity may lead to false activations that reduce the model's performance to some extent. These problems remain major challenges for the current few-shot semantic segmentation paradigm.

4.5. Ablation Studies

Ablative experiments were conducted by removing certain components to investigate the contribution of the components and parameters to the overall framework. We conducted a series of ablative experiments on PASCAL-5i under the one-shot setting. We used ResNet-50 as the backbone network for all the ablation experiments. Other experiment settings were the same as those mentioned in Section 4.1. Table 3 provides a detailed analysis of the contribution of each component and parameter within the framework of our segmentation approach, with MIoU and FB-IoU listed as the performance indicators.

4.5.1. Effectiveness of Incorporating Foreground Prototypes in Matrix K

We first conducted an ablative experiment to validate the effectiveness of utilizing global (prototype) information in the proposed framework. The second row of Table 3 reports the result of incorporating global information by adding the foreground prototypes to matrix K on top of the baseline. Compared with the baseline, incorporating foreground prototypes in matrix K yields performance gains of 4.6% mIoU and 2.9% FB-IoU. This result demonstrates that utilizing both local details and global semantics in the proposed mask aggregation pipeline further exploits the information in the given data, giving it a significant advantage over previous pixel-wise approaches.

4.5.2. Effectiveness of Learnable Parameter α

To evaluate the effectiveness of weighting the masked foreground prototypes with the learnable parameter α, we conducted an ablative experiment in which α was fixed to a constant value of 1 when calculating the matrix K. Comparing the second and third rows of Table 3, making α learnable brings improvements of 2.2% mIoU and 0.9% FB-IoU, indicating that α effectively balances the proportion of the pixel-wise similarities and the semantic correlations in the mask aggregation process. We also report the influence of different initial α values in Table 4. In this ablative experiment, we took the framework with the setting of the second row in Table 3 as the baseline and increased the initial α value by 0.2 for each sub-experiment. The accuracy is progressively enhanced as the initial α value increases, although the performance slightly decreases when the initial value is raised from 0 to 0.2.

5. Conclusions

In this paper, we proposed the Global–Local Query-Support Cross-Attention (GLQSCA) mechanism for FSS, which can be simply and effectively implemented through scaled dot-product attention in the transformer architecture. Our proposed method demonstrates two distinct advantages over previous approaches: (i) it fully exploits both global prototype-query correlations and local pixel-wise support-query similarities; (ii) the one-shot trained model can be readily switched to n-shot inference without retraining. On PASCAL-5i and COCO-20i, GLQSCA outperforms current cutting-edge few-shot semantic segmentation algorithms by a significant margin. These results confirm the efficacy and precision of this innovative framework.
The major challenges of the proposed method lie in small targets, intra-class discrepancy, inter-class similarity, and high computational cost. In the future, we will explore improving the mask aggregation mechanism to further exploit novel-class information and will design a lightweight transformer architecture to enhance computational efficiency. We will also attempt to adapt the proposed method to other few-shot learning tasks, such as object detection.

Author Contributions

Conceptualization, F.X., G.L. and Y.-R.C.; methodology, F.X. and G.L.; software, F.X. and G.L.; validation, F.X. and G.L.; formal analysis, F.X. and Y.-R.C.; resources, Y.-R.C.; data curation, F.X. and Y.-R.C.; writing—original draft, F.X., G.L. and Y.-R.C.; writing—review and editing, F.X., G.L. and Y.-R.C.; project administration, Y.-R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science and Technology Council, Taiwan (NSTC) under Grant 112-2221-E-197-022.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FSS: Few-shot semantic segmentation;
FSL: Few-shot learning;
DNN: Deep Neural Network;
MIoU: Mean intersection over union;
FB-IoU: Foreground–background intersection over union.

References

  1. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  2. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  3. Fink, M. Object classification from a single example utilizing class relevance metrics. Adv. Neural Inf. Process. Syst. 2004, 17, 1–8. [Google Scholar]
  4. Li, F.-F.; Fergus, R.; Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 594–611. [Google Scholar]
  5. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 1804. [Google Scholar]
  6. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. Lstd: A low-shot transfer detector for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  7. Triantafillou, E.; Zemel, R.; Urtasun, R. Few-shot learning through an information retrieval lens. Adv. Neural Inf. Process. Syst. 2017, 30, 1329. [Google Scholar]
  8. Shaban, A.; Bansal, S.; Liu, Z.; Essa, I.; Boots, B. One-shot learning for semantic segmentation. arXiv 2017, arXiv:1709.03410. [Google Scholar]
  9. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  10. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5217–5226. [Google Scholar]
  11. Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9587–9595. [Google Scholar]
  12. Liu, Y.; Zhang, X.; Zhang, S.; He, X. Part-aware prototype network for few-shot semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part IX 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 142–158. [Google Scholar]
  13. Lai, J.; Yang, S.; Zhou, J.; Wu, W.; Chen, X.; Liu, J.; Gao, B.; Wang, C. Clustered-patch element connection for few-shot learning. arXiv 2023, arXiv:2304.10093. [Google Scholar]
  14. Liu, J.; Bao, Y.; Yin, W.; Wang, H.; Gao, Y.; Sonke, J.; Gavves, E. Few-shot semantic segmentation with support-induced graph convolutional network. arXiv 2023, arXiv:2301.03194. [Google Scholar]
  15. Cong, R.; Xiong, H.; Chen, J.; Zhang, W.; Huang, Q.; Zhao, Y. Query-guided Prototype Evolution Network for Few-Shot Segmentation. IEEE Trans. Multimed. 2024, 26, 6501–6512. [Google Scholar] [CrossRef]
  16. Wang, H.; Zhang, X.; Hu, Y.; Yang, Y.; Cao, X.; Zhen, X. Few-shot semantic segmentation with democratic attention networks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 730–746. [Google Scholar]
  17. Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 6941–6952. [Google Scholar]
  18. Lu, Z.; He, S.; Zhu, X.; Zhang, L.; Song, Y.; Xiang, T. Simpler is better: Few-shot semantic segmentation with classifier weight Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 8741–8750. [Google Scholar]
  19. Azad, R.; Fayjie, A.R.; Kauffmann, C.; Ayed, I.B.; Pedersoli, M.; Dolz, J. On the texture bias for few-shot CNN segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 2674–2683. [Google Scholar]
  20. Shi, X.; Wei, D.; Zhang, Y.; Lu, D.; Ning, M.; Chen, J.; Ma, K.; Zheng, Y. Dense cross-query-and-support attention weighted mask aggregation for few-shot segmentation. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 151–168. [Google Scholar]
  21. Chen, H.; Yu, Y.; Dong, Y.; Lu, Z.; Li, Y.; Zhang, Z. Multi-context interaction network for few-shot segmentation. arXiv 2023, arXiv:2303.06304. [Google Scholar]
  22. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 2153. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  24. Zhang, G.; Kang, G.; Wei, Y.; Yang, Y. Few-shot segmentation via cycle-consistent Transformer. arXiv 2021, arXiv:2106.02320. [Google Scholar]
  25. Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065. [Google Scholar] [CrossRef] [PubMed]
  26. Lang, C.; Cheng, G.; Tu, B.; Han, J. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8057–8067. [Google Scholar]
  27. Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  28. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous detection and segmentation. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part VII 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 297–312. [Google Scholar]
  29. Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Yang, B.; Liu, C.; Li, B.; Jiao, J.; Ye, Q. Prototype mixture models for few-shot semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part VIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 763–778. [Google Scholar]
  33. Boudiaf, M.; Kervadec, H.; Masud, Z.I.; Piantanida, P.; Ayed, I.B.; Dolz, J. Few-shot segmentation without meta-learning: A good transductive inference is all you need? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13979–13988. [Google Scholar]
  34. Zhang, B.; Xiao, J.; Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8312–8321. [Google Scholar]
  35. Sun, G.; Liu, Y.; Liang, J.; Gool, L.V. Boosting few-shot semantic segmentation with Transformers. arXiv 2021, arXiv:2108.02266. [Google Scholar]
Figure 1. Visualization of the mask generation pipeline.
Figure 2. Illustration of our proposed Global–Local Query-Support Cross-Attention (GLQSCA).
Figure 3. The structure of the mask–feature mixer.
Figure 4. Global–Local Query-Support Cross-Attention in n-shot (n > 1) setting.
Figure 5. Qualitative results of the designed GLQSCA on the PASCAL-5i under one-shot setting.
Figure 6. Qualitative results of the designed GLQSCA on the PASCAL-5i under five-shot setting.
Figure 7. Visualization of the failure cases of the proposed GLQSCA on PASCAL-5i with the Resnet50 backbone.
Table 1. Quantitative results on PASCAL-5i and COCO-20i under one-shot and five-shot settings. The underlined results denote the second-best performance, while the best results are highlighted in bold.

PASCAL-5i [8]
Backbone | Model | Type | One-Shot (Fold-0 / Fold-1 / Fold-2 / Fold-3 / MIoU / FB-IoU) | Five-Shot (Fold-0 / Fold-1 / Fold-2 / Fold-3 / MIoU / FB-IoU)
ResNet-50 | PPNet [12] | Prototype | 52.7 / 62.8 / 57.4 / 47.7 / 55.2 / - | 60.3 / 70.0 / 69.4 / 60.7 / 65.1 / -
ResNet-50 | PMM [32] | Prototype | 52.0 / 67.5 / 51.5 / 49.8 / 55.2 / - | 55.0 / 68.2 / 52.9 / 51.1 / 56.8 / -
ResNet-50 | RPMM [32] | Prototype | 55.2 / 66.9 / 52.6 / 50.7 / 56.3 / - | 56.3 / 67.3 / 54.5 / 51.0 / 57.3 / -
ResNet-50 | RePRI [33] | Prototype | 59.8 / 68.3 / 62.1 / 48.5 / 59.7 / - | 64.6 / 71.4 / 71.1 / 59.3 / 66.6 / -
ResNet-50 | CECNet [13] | Prototype | 61.5 / 68.7 / 62.2 / 49.5 / 60.5 / - | 66.7 / 70.9 / 68.1 / 59.1 / 66.2 / -
ResNet-50 | PEFNet [25] | Prototype | 61.7 / 69.5 / 55.4 / 56.3 / 60.8 / 73.3 | 63.1 / 70.7 / 55.8 / 57.9 / 61.9 / 73.9
ResNet-50 | SCL [34] | Prototype | 63.0 / 70.0 / 56.5 / 57.7 / 61.8 / 71.9 | 64.5 / 70.9 / 57.3 / 58.7 / 62.9 / 72.8
ResNet-50 | TRFS [35] | Prototype | 62.9 / 70.7 / 56.5 / 57.5 / 61.9 / - | 65.0 / 71.2 / 55.5 / 60.9 / 63.2 / -
ResNet-50 | QPENet [15] | Prototype | 64.5 / 70.8 / 63.2 / 57.9 / 64.1 / 75.4 | 68.2 / 73.9 / 67.2 / 64.7 / 68.5 / 79.5
ResNet-50 | SiGCN [14] | Prototype | 64.6 / 69.2 / 64.6 / 58.8 / 64.3 / 75.5 | 68.5 / 72.1 / 66.5 / 65.7 / 68.2 / 78.3
ResNet-50 | DCAMA [20] | Pixel-wise | 66.0 / 71.3 / 59.3 / 57.3 / 63.5 / 74.9 | 70.3 / 73.5 / 63.5 / 65.6 / 67.5 / 78.8
ResNet-50 | GLQSCA (Ours) | Pixel-wise+Prototype | 67.5 / 72.3 / 59.6 / 59.0 / 64.6 / 75.7 | 71.0 / 75.5 / 64.2 / 65.9 / 69.1 / 79.7
ResNet-101 | CWT [18] | Prototype | 56.9 / 65.2 / 61.2 / 48.8 / 58.0 / - | 62.6 / 70.2 / 68.8 / 57.2 / 64.7 / -
ResNet-101 | DoG-LSTM [19] | Prototype | 57.0 / 67.2 / 56.1 / 54.3 / 58.7 / - | 57.3 / 68.5 / 61.5 / 56.3 / 60.9 / -
ResNet-101 | QPENet [15] | Prototype | 67.0 / 73.2 / 63.7 / 60.1 / 66.0 / 77.1 | 69.8 / 75.5 / 66.8 / 66.3 / 69.6 / 81.1
ResNet-101 | DAN [16] | Pixel-wise | 54.7 / 68.6 / 57.8 / 51.6 / 58.2 / 71.9 | 57.9 / 69.0 / 60.1 / 54.9 / 60.5 / 72.3
ResNet-101 | CyCTR [24] | Pixel-wise | 69.3 / 72.7 / 56.5 / 58.6 / 64.3 / - | 73.5 / 74.0 / 58.6 / 60.2 / 66.6 / -
ResNet-101 | HSNet [17] | Pixel-wise | 67.3 / 72.3 / 62.0 / 63.1 / 66.2 / 77.6 | 71.8 / 74.4 / 67.0 / 68.3 / 70.4 / 80.6
ResNet-101 | DCAMA [20] | Pixel-wise | 65.3 / 71.3 / 63.2 / 58.3 / 64.4 / 77.6 | 70.7 / 73.6 / 66.7 / 61.8 / 68.0 / 80.7
ResNet-101 | GLQSCA (Ours) | Pixel-wise+Prototype | 66.2 / 69.7 / 64.4 / 59.2 / 64.9 / 78.2 | 72.0 / 71.2 / 67.5 / 64.0 / 68.7 / 81.0
Swin-B | HSNet [17] | Pixel-wise | 67.9 / 74.0 / 60.3 / 67.0 / 67.3 / 77.9 | 72.2 / 77.5 / 64.0 / 72.6 / 71.6 / 81.2
Swin-B | DCAMA [20] | Pixel-wise | 72.2 / 73.8 / 64.3 / 67.1 / 69.3 / 78.5 | 75.7 / 77.1 / 72.0 / 74.8 / 74.9 / 82.9
Swin-B | GLQSCA (Ours) | Pixel-wise+Prototype | 73.5 / 75.2 / 65.5 / 67.8 / 70.5 / 79.2 | 77.0 / 78.2 / 73.3 / 76.3 / 76.2 / 83.5

COCO-20i [29]
Backbone | Model | Type | One-Shot (Fold-0 / Fold-1 / Fold-2 / Fold-3 / MIoU / FB-IoU) | Five-Shot (Fold-0 / Fold-1 / Fold-2 / Fold-3 / MIoU / FB-IoU)
ResNet-50 | PPNet [12] | Prototype | 36.5 / 26.5 / 26.0 / 19.7 / 27.2 / - | 48.9 / 31.4 / 36.0 / 30.6 / 36.7 / -
ResNet-50 | PMM [32] | Prototype | 29.3 / 34.8 / 27.1 / 27.3 / 29.6 / - | 33.0 / 40.6 / 30.3 / 33.3 / 34.3 / -
ResNet-50 | RPMM [32] | Prototype | 29.5 / 36.8 / 28.9 / 27.0 / 30.6 / - | 33.8 / 42.0 / 33.0 / 33.3 / 35.5 / -
ResNet-50 | TRFS [35] | Prototype | 31.8 / 34.9 / 36.4 / 31.4 / 33.6 / - | 35.4 / 41.7 / 42.3 / 36.1 / 38.9 / -
ResNet-50 | RePRI [33] | Prototype | 31.2 / 38.1 / 33.3 / 33.0 / 34.0 / - | 38.5 / 46.2 / 40.0 / 43.6 / 42.1 / -
ResNet-50 | CECNet [13] | Prototype | 37.9 / 41.3 / 35.2 / 37.9 / 38.1 / - | 44.3 / 51.2 / 45.2 / 46.1 / 46.7 / -
ResNet-50 | SiGCN [14] | Prototype | 38.7 / 46.3 / 41.3 / 37.5 / 41.4 / 62.7 | 44.9 / 54.5 / 46.5 / 45.9 / 48.0 / 66.2
ResNet-50 | QPENet [15] | Prototype | 41.5 / 47.3 / 40.9 / 39.4 / 42.3 / 67.4 | 47.3 / 52.4 / 44.3 / 44.9 / 47.2 / 69.5
ResNet-50 | CyCTR [24] | Pixel-wise | 38.9 / 43.0 / 39.6 / 39.8 / 40.3 / - | 41.1 / 48.9 / 45.2 / 47.0 / 45.6 / -
ResNet-50 | DCAMA [20] | Pixel-wise | 41.9 / 45.1 / 44.4 / 41.7 / 43.3 / 69.5 | 45.9 / 50.5 / 50.7 / 46.0 / 48.3 / 71.7
ResNet-50 | GLQSCA (Ours) | Pixel-wise+Prototype | 42.3 / 46.3 / 45.1 / 42.9 / 44.2 / 70.1 | 47.5 / 52.2 / 51.7 / 49.0 / 50.1 / 73.0
ResNet-101 | CWT [18] | Prototype | 30.3 / 36.6 / 30.5 / 32.2 / 32.4 / - | 38.5 / 46.7 / 39.4 / 43.2 / 42.0 / -
ResNet-101 | SCL [34] | Prototype | 36.4 / 38.6 / 37.5 / 35.4 / 37.0 / - | 38.9 / 40.5 / 41.5 / 38.7 / 39.9 / -
ResNet-101 | PEFNet [25] | Prototype | 36.8 / 41.8 / 38.7 / 36.7 / 38.5 / 63.0 | 40.4 / 46.8 / 43.2 / 40.5 / 42.7 / 65.8
ResNet-101 | QPENet [15] | Prototype | 39.8 / 45.4 / 40.5 / 40.0 / 41.4 / 67.8 | 47.2 / 54.9 / 43.4 / 45.4 / 47.7 / 70.6
ResNet-101 | DAN [16] | Pixel-wise | - / - / - / - / 24.4 / 62.3 | - / - / - / - / 29.6 / 63.9
ResNet-101 | HSNet [17] | Pixel-wise | 37.2 / 44.1 / 42.4 / 41.3 / 41.2 / 69.1 | 45.9 / 53.0 / 51.8 / 47.1 / 49.5 / 72.4
ResNet-101 | DCAMA [20] | Pixel-wise | 41.5 / 46.2 / 45.2 / 41.3 / 43.5 / 69.9 | 48.0 / 58.0 / 54.3 / 47.1 / 51.9 / 73.3
ResNet-101 | GLQSCA (Ours) | Pixel-wise+Prototype | 42.0 / 47.1 / 46.1 / 42.8 / 44.5 / 70.8 | 49.1 / 59.6 / 54.9 / 48.4 / 53.0 / 74.2
Swin-B | HSNet [17] | Pixel-wise | 43.6 / 49.9 / 49.4 / 46.4 / 47.3 / 72.5 | 50.1 / 58.6 / 56.7 / 55.1 / 55.1 / 76.1
Swin-B | DCAMA [20] | Pixel-wise | 49.5 / 52.7 / 52.8 / 48.7 / 50.9 / 73.2 | 55.4 / 60.3 / 59.9 / 57.5 / 58.3 / 76.9
Swin-B | GLQSCA (Ours) | Pixel-wise+Prototype | 50.2 / 53.5 / 53.7 / 51.0 / 52.1 / 73.8 | 56.8 / 61.5 / 60.4 / 60.5 / 59.8 / 77.5
Table 2. Efficiency analysis of n-shot inference.
Metric | 1-Shot | 2-Shot | 3-Shot | 4-Shot | 5-Shot
Memory (MiB) | 3241 | 3860 | 4752 | 5318 | 6106
Inference Time (s) | 0.16 | 0.26 | 0.35 | 0.56 | 0.65
Table 3. Ablation study on the effectiveness of different modules and parameters. The network with the settings of the first row was set as the baseline.
Pixel-Wise | Prototype | Alpha | MIoU (%) | FB-IoU (%)
✓ | - | - | 57.8 | 71.9
✓ | ✓ | - | 62.4 | 74.8
✓ | ✓ | ✓ | 64.6 | 75.7
Table 4. Ablation study on different initial values of α.
Initial α | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
MIoU (%) | 62.4 | 61.9 | 62.7 | 63.5 | 64.3 | 64.6
FB-IoU (%) | 74.8 | 74.2 | 74.6 | 74.9 | 75.2 | 75.7
