Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation

Li, Man; Ma, Xiaodong

doi:10.3390/electronics14071389

Open AccessArticle

Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation

by

Man Li

and

Xiaodong Ma

^*

College of Science, China Agricultural University, Beijing 100083, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1389; https://doi.org/10.3390/electronics14071389

Submission received: 12 February 2025 / Revised: 27 March 2025 / Accepted: 28 March 2025 / Published: 30 March 2025

(This article belongs to the Section Artificial Intelligence)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Generalized few-shot semantic segmentation (GFSS), which requires strong segmentation performance on novel classes while retaining the performance on base classes, is attracting increasing attention. Recent studies have demonstrated the effectiveness of applying visual prompts to solve GFSS problems, but there are still unresolved issues. Due to the confusion between the backgrounds and novel classes foreground during base class pre-training, the learned base visual prompts will mislead the novel visual prompts during novel class fine-tuning, leading to sub-optimal results. This paper proposes a background-enhanced visual prompting Transformer (Beh-VPT) to solve the problem. Specifically, we innovatively propose background visual prompts, which can learn potential novel class information in the background during base class pre-training and transfer the information to novel visual prompts during novel class fine-tuning via our proposed Hybrid Causal Attention Module. Additionally, we propose a background-enhanced segmentation head that is used in conjunction with background prompts to enhance the model’s capacity for learning novel classes. Considering the GFSS settings that take into account both base and novel classes, we introduce Singular Value Fine-Tuning in the non-meta learning paradigm to further unleash the full potential of the model. Extensive experiments show that the proposed method achieves state-of-the-art performance for GFSS on PASCAL-

5^{i}

and COCO-

20^{i}

datasets. For example, considering both base and novel classes, the improvements in mIoU range from 0.47% to 1.08% (COCO-

20^{i}

) in the one-shot and five-shot scenarios, respectively. In addition, our method does not cause a fallback of mIoU in base classes relative to the baseline.

Keywords:

few-shot learning; generalized few-shot semantic segmentation; visual prompting; transformer

1. Introduction

Deep convolution neural networks driven by large-scale labeled datasets [1,2] have achieved great success in many visual recognition tasks [3,4]. However, collecting large-scale labeled data is laborious and expensive, especially for dense prediction tasks such as semantic segmentation [5] and instance segmentation [5]. In contrast, humans can easily recognize new patterns or objects guided by a small number of samples. In order to reduce the need for a large number of dense annotations and mimic human learning process, few-shot semantic segmentation (FSS) [6] has been introduced to enable the segmentation of previously unseen objects with only a few annotated images. However, existing FSS methods [7,8] focus on improving the segmentation performance of the novel classes while ignoring the base classes. Thus, generalized few-shot semantic segmentation (GFSS) [9] is proposed to segment the novel classes while maintaining the performance of the base classes. GFSS defines a more realistic scenario, which requires better generalization and robustness of the network.

In the field of natural language processing [10], prompting techniques [11] have been proven to enable models to learn from a small number of samples and quickly adapt to new tasks in the absence of large-scale labeled data. Similarly, prompting techniques have been applied to computer vision for visual prompt tuning [12,13,14]. This is an effective mechanism to enhance the adaptability and learning ability of pre-trained models to new tasks. In light of the above, prompts techniques are inherently suitable for GFSS, which usually has sufficient base class data and a small amount of novel class data. The key to using visual prompting to solve GFSS problems lies in the following two points: (1) How can we make full use of the large amount of base class data to model an embedding to represent visual prompts? (2) How can we generalize the pre-trained visual prompts to the novel classes and maintain the performance of the base classes? Hossain et al. [15] developed a multi-scale visual prompting Transformer, which defines a series of learnable embeddings to represent base class visual prompts. Then, cross attention is calculated between the base visual prompts and the image features, which enables the prompts to learn the information of each class and eventually use the prompts for dense prediction. For the learning of novel class visual prompts, [15] designs a novel-to-base casual attention to pass the high-level information from the base visual prompts to the novel visual prompts. This attention mechanism guarantees that the novel prompts derived from a limited number of samples are distinctly separate from the base prompts, ultimately leading to the superior performance of GFSS.

Although [15] demonstrates that visual prompting can be effectively applied to GFSS, there are still unresolved issues. Ref. [15] focuses only on the foreground during pre-training of the base classes. However, as observed in [16,17], the background regions of the base class data are not homogeneous and potential novel class information is contained within these regions. Figure 1 illustrates the percentage of potential novel classes for each fold in the background region of the base class image. The current method treats the novel classes as the background when pre-training base prompts, which will lead to two problems: (1) treating the novel classes as the background leads to confusion and degrades the potential of novel class fine-tuning; (2) the pre-trained base prompts only have information about the base class, and the novel class data are not fully utilized. Therefore, further mining the potential novel class information in the background is a promising optimization direction in visual prompting-based GFSS. In addition, due to the extremely limited novel class data in GFSS, over-fitting becomes a critical issue that needs to be handled carefully. Specifically, during novel class fine-tuning in [15], the parameters of the backbone, pixel decoder, and base segmentation head are frozen, which leads to sub-optimal results of novel class learning. However, the model will quickly overfit on a sparse amount of novel class data without the freezing strategy, which will set back the performance of the base classes.

To address the above problem, we propose Beh-VPT, a background-enhanced visual prompting Transformer for GFSS. Beh-VPT uses a non-meta-learning paradigm, where the model is first trained based on a large amount of base class data. Then, several model modules are frozen and the remaining modules are fine-tuned using few data of novel classes. Our model design is inspired by [15], yet several novel modules are introduced. Specifically, we introduce background visual prompts, which are defined as learnable embedding vectors like base and novel visual prompts. During base class pre-training, background visual prompts are fed into the cross-attention module along with base class visual prompts and interact with image features. Background visual prompts learn potential novel class information in the background of base class images. Then, during novel class fine-tuning, the Hybrid Causal Attention Module (HCAM) is proposed to convey novel class information from background visual prompting to novel visual prompts. In addition, we further refine the model design to prevent model over-fitting on novel classes. First, we introduce the background-enhanced segmentation head, which enhances the model capacity of segmentation head used to predict novel classes, without affecting the base counterpart. Then, Singular Value Fine-Tuning (SVF) [19] is introduced in novel class fine-tuning, which makes the model less prone to over-fitting, leading to better model generalization. Our contributions can be summarized as follows:

We design a background-enhanced visual prompting Transformer (Beh-VPT) for generalized few-shot semantic segmentation. Background visual prompts are proposed to learn potential novel class information in the background during base class pre-training and to influence novel visual prompts during novel class fine-tuning via the Hybrid Causal Attention Module.
We introduce two methods to further enhance the model’s ability to learn novel classes and prevent over-fitting. The background-enhanced segmentation head boosts the capacity of the model to bring better performance of novel classes. Singular Value Fine-Tuning is introduced to unleash the full potential of the model. At the same time, none of them will degrade the performance of base classes.
Experiments on GFSS benchmark datasets demonstrate the effectiveness of our proposed method. The results show that the method proposed in this paper achieves better performance than current SOTA methods of GFSS in both one-shot and five-shot settings.

The remainder of this paper is organized as follows. In Section 2, we introduce the related works in the field of GFSS. In Section 3, we clarify the problem definition of GFSS. In Section 4, we illustrate the details of the proposed Beh-VPT. The experimental results and discussion are presented in Section 5. Finally, we conclude the paper in Section 6.

2. Related Work

2.1. Few-Shot Semantic Segmentation

Few-shot learning [20] is a quintessential challenge in the realm of computer vision. It involves training a model on a limited number of labeled examples from novel classes, with the goal of enabling the model to accurately predict these novel classes. Few-shot learning has been widely used in vision tasks such as image classifications [20,21], semantic segmentation [6,7,8], and object detection [22,23]. Unlike image-level and instance-level prediction, few-shot semantic segmentation (FSS) aims to provide dense pixel-level predictions for the query image with limited support for image–mask pairs. FSS can be categorized into prototype-based and dense matching-based methods based on the way of learning. The prototype-based approach [7,21,24,25,26] extracts prototypes from the support set to represent the features of each category. These prototypes are then used for segmentation of the images in the query set. The prototypes can be global vectors or representations of local regions. The dense matching-based approach [27,28] meticulously compares the features of the support set and the query set at the pixel level to achieve accurate segmentation. Specifically, this method first extracts dense feature representations from the support and query sets, which may include pixel-level or higher-level semantic features. Then, the samples in the support set are compared with the query set’s counterpart through matching mechanisms such as metric learning or attention to discover similarities or differences between them.

While the FSS model recognizes novel classes well given the corresponding support samples, even the state-of-the-art FSS models struggle to solve the common scenario in real-world applications that contain both base and novel classes without the prior knowledge of the target classes contained in query images.

2.2. Generalized Few-Shot Semantic Segmentation

In contrast to standard FSS [24,29], which focuses only on improving the performance of novel classes, generalized few-shot semantic segmentation (GFSS) [9,15,30,31] concentrates on segmenting the novel classes while maintaining the performance of the base classes. It defines a more realistic scenario that requires better generalization and robustness of the network. Ref. [9] presented the GFSS setting for the first time and proposed a framework that uses two modules to learn base and novel prototypes simultaneously. Ref. [30] proposed a strong baseline for GFSS, using knowledge distillation to maintain the performance on base classes. A concurrent work [31] built a set of orthogonal prototypes, each of which represents a semantic class, and made the prediction for each class separately. Recently, Hossain et al. [15] developed a multi-scale visual prompting Transformer, which defines a series of learnable embeddings to represent visual prompts. This method employs a non-meta-learning paradigm, first training visual prompts with a large amount of base class data. Then, a small amount of novel class data are used for fine-tuning, during which a novel causal attention allows the visual prompts of the novel classes to learn how to interact with image features and predict class masks. Despite [15] achieving state-of-the-art performance in GFSS, there are still some issues that remain to be addressed, such as low information utilization and susceptibility to over-fitting.

2.3. Prompting

In natural language processing tasks [11], prompting has become an important technique that can guide models to quickly adapt to new tasks by providing a few exemplary samples. It has also been applied to visual tasks and vision–language multi-modal tasks [12,13,14,32]. Flamingo [32] integrates powerful pre-trained visual and language models and proposes an innovative architecture capable of handling arbitrarily interleaved visual and textual data. The model accepts task-specific examples as prompts, demonstrating rapid adaptation capabilities across a variety of image and video tasks. Ref. [12] proposed VPT, which is an efficient parameter adjustment strategy for large-scale visual Transformer (ViT) models. By introducing a small number of trainable parameters in the input space, VPT can quickly adapt to new tasks while keeping the model’s backbone unchanged. Ref. [13] introduced a new method, Fair-VPT, aiming at addressing the potential unfairness issues that VPT might cause in image classification tasks. Fair-VPT divides prompts into cleaner prompts and target prompts to eliminate bias information related to sensitive attributes in pre-trained ViT models and to adapt to downstream classification tasks.

In recent years, prompts have been further applied to downstream vision tasks such as object detection and semantic segmentation to solve few-shot problems. Ref. [33] introduced a novel approach that leverages the power of transformers for few-shot object detection without the need for re-training and utilizes prompting strategies to adapt to novel classes with remarkable efficiency, eliminating the need for extensive fine-tuning. Ref. [34] presented a method to employ foundation models for few-shot object detection, showcasing their adaptability with minimal data. In the domain of FSS, Ref. [35] stands out with its auto-prompt network designed for cross-domain few-shot semantic segmentation. This method can generate prompts autonomously, allowing the model to adapt to new domains with limited labeled data. Ref. [36] introduced a dynamic class-aware enhancement strategy that improves few-shot segmentation performance through prompting. Unlike following the FSS settings, Ref. [15] used a non-meta-learning paradigm and defined base visual prompts and novel visual prompts to solve the GFSS problem.

In summary, current state-of-the-art GFSS methods still suffer from underutilization of information. Simply labeling all pixels except base classes as background during base class pre-training is detrimental to the model’s ability to generalize and solve diverse scenarios. In addition, the research on suppressing novel class over-fitting has not received much attention in current GFSS methods. Based on the above considerations, this paper attempts to address the abovementioned limitations by using the recently popular visual prompting technique.

3. Problem Definition

3.1. Standard Few-Shot Segmentation

Few-shot segmentation (FSS) aims to segment novel class foreground given only a few samples with dense annotations. Let us denote

Χ \in R^{H \times W \times 3}

as an RGB image of height

H

and width

W

, and let

P \in {[0,1]}^{| O | \times N}

be its predicted segmentation map, where

O = [0, H - 1] \times [0, W - 1]

is the set of all pixels coordinates and

N

is the number of classes to predict. In FSS, datasets are divided into two parts, corresponding to the base classes and novel classes. The base classes

C^{b a s e}

are strictly disjoint from the novel classes

C^{n o v e l}

, i.e.,

{{C}^{b a s e} {\cap C}^{n o v e l}} = \emptyset

. A conventional approach to this challenge is based on meta-learning. Typically, each segmentation task, also known as an episode, consists of the support set

S

and the query set

Q

. There are

C

classes and

K

support images for each class in the

C

-way

K

-shot segmentation task. We use

S = {x_{i}, y_{i}}_{i = 1}^{K}

to present the support set, containing

K

images and their corresponding binary segmentation masks

y_{i} \in {[0,1]}^{| O |}

. Relying on limited supervision, the model is then evaluated on the query set

Q = {x_{q}, y_{q}}

, where

y_{q}

is only used in the training phase. In addition, an alternative paradigm is non-meta learning. The model is initially trained based on a large amount of data from the base classes. Then, a portion of the parameters are frozen and the model is fine-tuned with a small amount of novel data. In the infer phase, the focus is on segmenting the novel classes, while the base classes are ignored.

3.2. Generalized Few-Shot Segmentation

The setting of generalized few-shot segmentation (GFSS) extends the standard setting based on the fact that real-world scenarios often require being able to recognize both base and novel classes. In this setup, the model needs to preserve the performance of the base classes while generalizing to the novel classes. In this paper, we follow the recent setup of the GFSS methods [15] based on the paradigm of non-meta learning. Specifically, the model needs to segment

C^{b a s e} {+ C}^{n o v e l}

+

C^{b g}

classes and, finally, give a prediction

P \in {[0, 1]}^{| O | \times (C^{b a s e} {+ C}^{n o v e l} + C^{b g})}

, where

C^{b g}

denotes the background and is set to 1.

4. Method

In this section, we first introduce the proposed background-enhanced visual prompting Transformer (Beh-VPT) model. Background visual prompts are introduced to learn more potential information about the novel classes during base pre-training phase. Then, we introduce the Hybrid Causal Attention Module (HCAM), which efficiently transfers information from base visual prompts and background visual prompts to novel counterparts unidirectionally. Finally, we introduce two methods to further improve the model’s learning ability for novel classes and prevent over-fitting.

4.1. Background-Enhanced Visual Prompting Transformer

The background-enhanced visual prompting Transformer (Beh-VPT) is trained in two training stages: base class pre-training and novel class fine-tuning. In both stages, the input image is fed into a convolution neural network-based encoder to extract features initially. The intermediate features are then up-sampled by the decoder to obtain multi-scale image features. In the decoding process, visual prompts representing base classes, novel classes, and backgrounds are used to calculate multi-scale cross-attention with the visual features to encode the corresponding classes information. Finally, a background-enhanced segmentation head takes the trained visual prompts and high-resolution image features as inputs and, finally, outputs dense predictions. The key idea of Beh-VPT is the design and use of visual prompts. Specifically, we use learnable embedding to define three types of visual prompts: base visual prompts, novel visual prompts, and background visual prompts. The three types of visual prompts are utilized in different training stages of the GFSS.

In the first stage, the model is pre-trained using base class data, as shown in Figure 2. Given B base classes, we define B base visual prompts, which are randomly initialized. In particular, we use the same way as the base visual prompts to define the B background prompts, which are designed primarily to mine valuable knowledge in the background of images during the pre-training stage of the base class data. Then, the base visual prompts and background visual prompts are first processed by

c o n c a t ()

, as shown in Equation (1). Then, concatenated vector is fed into the self-attention module to model the relationship between them. After learning the attention within visual prompts, cross-attention is utilized for the interaction between 2B visual prompts and multi-scale image features. The operations of self-attention and cross-attention are shown in Equation (2). The details of the equations are as follows:

V_{b b}^{l - 1} = c o n c a t (V_{b a s e}^{l - 1} + V_{b a c k}^{l - 1})

(1)

V_{b b}^{l} = C A (S A (V_{b b}^{l - 1}), F^{l - 1})

(2)

where

V_{b a s e}^{l - 1} \in R^{B \times C}

and

V_{b a c k}^{l - 1} {\in R}^{B \times C}

represent visual prompts for the base class and background of layer

l - 1

.

V_{b b}^{l - 1} {\in R}^{2 B \times C}

is obtained by concatenating them.

S A

and

C A

denote self-attention and cross-attention, respectively.

F^{l - 1} \in R^{H_{l - 1} W_{l - 1} \times C_{l - 1}}

denotes the corresponding image feature from the decoder.

H_{l - 1}

and

W_{l - 1}

denote height and width of image features at layer

l - 1

.

C

and

C_{l - 1}

denote the embedding dimension. After the multi-level Transformer attention, the refined base prompts and background prompts

V_{b b}^{l}

, and the high-resolution image features, are fed into the background-enhanced segmentation head to predict segmentation masks, which will be described in detail in Section 4.3.

In the second stage, the model is fine-tuned using novel class data, as shown in Figure 3. Given N novel classes, we define N novel visual prompts, which are initialized by masked average global pooling using images and the corresponding binary mask of N novel classes. Then, three kinds of visual prompts are fed into the proposed Hybrid Causal Attention Module (HCAM), which can teach the novel visual prompts how to prompt the image features. Details of HCAM will be introduced in Section 4.2. After HCAM, similar to the approach in pre-training, visual prompts are cross-attended with the multi-scale image features to learn the corresponding category information. Finally, the refined novel prompts and the high-resolution image features are fed into the background-enhanced segmentation head to predict segmentation masks. In this stage, the parameters of base visual prompts, background visual prompts, decoder, and segmentation head for base classes are frozen. It is worth noting that a small number of parameters in the encoder are not frozen. These unfrozen parameters are determined by the Singular Value Fine-Tuning, which will be introduced in Section 4.4.

4.2. Hybrid Causal Attention Module

Attributed to cross-attention learning with multi-scale image features, potential novel class information in the background of the base class images is captured by background prompts during base class pre-training. The next key challenge is how to transfer this information to the novel visual prompts. Ref. [15] uses a casual attention module to enable novel prompts to obtain the contextual modeling capabilities of base prompts and prevent novel prompts from negatively affecting base prompts. Inspired by this, the Hybrid Causal Attention Module (HCAM) is proposed to accomplish the interaction between the three types of visual prompts simultaneously, as shown in Figure 4. Specifically, self-attention is first computed only between novel visual prompts. Then, three types of visual prompts are fed to hybrid casual attention, which is based on the cross-attention mechanism in transformer decoder, and query

Q

, key

K

, and value

V

are obtained according to the following equation:

Q = V_{q}^{l} W^{q}; K = V_{b b}^{l} W^{k}; V = V_{b b}^{l} W^{v}

(3)

where

W^{q}

,

W^{k}

, and

W^{v}

are trainable linear transformation matrices, and

V_{q}^{l} {\in R}^{N \times C}

represent visual prompts for

N

novel classes of decoder layer

l

. The calculated matrices

Q

,

K

, and

V

are used for Hybrid Causal Attention operation, the process of which is shown in the following equation.

V_{n}^{l + 1} = s o f t m a x (\frac{Q K^{T}}{\sqrt{d}}) V

(4)

where

d

represents the dimension of the input vector.

In HCAM, novel prompts are updated by base prompts and background prompts through cross-attention mechanism. By treating novel prompts as query vectors and the combination of base and background prompts as key/value vectors, HCAM enables the base prompts to teach the novel prompts how to learn, i.e., to query novel class information in multi-scale visual features. HCAM can also pass potential information from the background prompts into the novel prompts, which enables the novel prompts to learn more novel class knowledge. In addition, base prompts and background prompts are directly transmitted to the next layer in HCAM, which ensures that they are not interrupted by novel prompts.

4.3. Background-Enhanced Segmentation Head

After the learning of visual prompts, the image features and visual prompts are fed into our proposed background-enhanced segmentation head for pixel-level prediction. Specifically, during base class training, base prompts

V_{b a s e} \in R^{B \times C}

and background prompts

V_{b a c k} \in R^{B \times C}

are first fed into the background-enhanced segmentation head to obtain the prototype. The background-enhanced segmentation head consists of base segmentation head

H_{b a s e}

and background segmentation head

H_{b a c k}

, both of which are multi-layer fully connected layers, as shown in Figure 2. Base prompts are firstly dot-producted with image features to obtain the similarity; then, the per-pixel per-class probability

P_{b a s e} \in R^{B \times H \times W}

is obtained by

s o f t m a x (\cdot)

, as shown in Equation (5). In particular, background prompts are flattened along the channel dimension after performing the same dot product operation. Subsequently, we use

s i g m o i d (\cdot)

to obtain the output of the background segmentation head, as shown in Equation (6). It is denoted by

P_{b a c k} \in R^{1 \times H \times W}

, which represents the probability that each pixel belongs to the background. The formulas are shown below:

P_{b a s e} = s o f t m a x (H_{b a s e} (V_{b a s e}) \cdot F^{T})

(5)

P_{b a c k} = s i g m o i d (f l a t t e n (H_{b a c k} (V_{b a c k})) \cdot F^{T})

(6)

During novel class fine-tuning, using base segmentation head to predict novel classes leads to significant over-fitting and poor base class segmentation performance. Therefore, Ref. [15] freezes parameters of base segmentation head and only learns a residual, which effectively mitigates over-fitting. However, only learning a residual reduces the capacity of the model for novel classes and, thus, degrades the upper limit of novel class fine-tuning. To address the above issue, background segmentation head is used to make dense predictions for the novel classes. In contrast, the base segmentation head is frozen to maintain the performance of the base classes, as shown in Figure 3. Flexible design of our background-enhanced segmentation head improves the model capacity and realizes the independence of the two segmentation heads of the base classes and the novel classes, thus leading to better GFSS performance.

4.4. Singular Value Fine-Tuning for Non-Meta Learning

Singular Value Fine-Tuning [19] is proposed to alleviate the over-fitting of few-shot semantic segmentation. In the meta-learning paradigm, this method decomposes backbone parameters into three successive matrices via Singular Value Decomposition, then only fine-tunes the singular values and keeps others frozen. It preserves the rich semantic cues in the visual backbone while helping the model adapt to the segmentation task. Therefore, it is natural to apply this method to the novel class fine-tuning of the non-meta-learning paradigm. Specifically, after base class training, the convolution layers in the pre-trained encoder are first decomposed according to the following equation:

W^{'} = U S V^{T}

(7)

where

W^{'} \in R^{C_{o} \times C_{i} K^{2}}

is flattened by weight matrix

W \in R^{C_{o} \times C_{i} \times K \times K}

of a convolution layer.

C_{i}

,

C_{i}

, and

K

represent the number of input channels, the number of output channels, and kernel size. Then, two matrices,

U \in R^{C_{o} \times R}

and

V^{T} \in R^{R \times C_{i} K^{2}}

, construct two convolution layers: a

R \times C_{i} \times K \times K

convolution layer and a

C_{o} \times R \times 1 \times 1

convolution layer.

S \in R^{R \times R}

is a diagonal matrix with singular values on the diagonal. Finally, a convolution layer in pre-trained encoder is decomposed into two new convolutions and an affine layer. Only

S

is fine-tuned, and its parameters take up a tiny fraction of the whole backbone. Singular Value Fine-tuning in the non-meta-learning paradigm effectively mitigates the over-fitting problem of the novel class fine-tuning and further improves the generalization of the model.

5. Experiments and Discussion

5.1. Datasets

The widely used benchmark datasets PASCAL-

5^{i}

[37] and COCO-

20^{i}

[2] are adopted to verify the segmentation performance. In line with [15,30], we report the mean results of our model on 10,000 test images for COCO-

20^{i}

, while for PASCAL-

5^{i}

we use all available test images. For both datasets, all classes are split into four folds and the experiments are performed in a cross-validation manner. For each fold, the set of novel classes is derived from one of these subsets, and the remaining subsets will be the set of base classes.

5.2. Evaluation Metrics

We use the mean intersection-over-union (mIoU) to evaluate model performance. Following [15,30], the mIoU of generalized few-shot semantic segmentation is obtained by averaging the novel and base class scores. The evaluated metrics are computed as follows:

I o U = \frac{T P}{T P + F P + F N}

(8)

m I o U = \frac{1}{C} \sum_{c = 1}^{C} {I o U}_{c}

(9)

where TP denotes true positive, FP denotes false positive, and FN denotes false negative. C represents the number of classes in each fold. The metrics are calculated as the mean of five separate runs.

5.3. Implementation Details

We use PSPNet [38] with ResNet50 [39] as our encoder, and use the same pixel decoder as in [15]. The dimension of the visual prompts is set to 256. The number of background visual prompts is set to

B

, which is the same as the number of base visual prompts. During base class pre-training, AdamW optimizer [40] with initial learning rate 1e-4 is used for updating the parameters, and the weight decay is set to 0.05. The model is trained on PASCAL-

5^{i}

for 100 epochs and COCO-

20^{i}

for 20 epochs. The batch size is set to 16. During novel class fine-tuning, the parameters of base visual prompts, background visual prompts, pixel decoder, and base segmentation head are frozen. We only optimize the parameters of the novel visual prompts, background segmentation head, and Hybrid Causal Attention Module. Due to the introduction of Singular Value Fine-tuning, we also optimize a small part of the parameters in the backbone. We use Singular Value Fine-tuning to update layers 3 and 4 of the backbone, and only the parameters of S are fine-tuned. The model is fine-tuned for 100 epochs. The learning rate and the weight decay of AdamW optimizer are set to 5e-3 and 0.05. Following [15], the model is trained with per-pixel cross-entropy loss in both of the above stages.

5.4. Comparison with the State-of-the-Art

5.4.1. Quantitative Results

Table 1 and Table 2 compare our Beh-VPT with the state-of-the-art methods on the PASCAL-

5^{i}

and COCO-

20^{i}

benchmarks. For a fair comparison, we use the same settings as in [15] to fine-tune DIaM [30] and POP [31]. All experiments are performed only in the inductive setting, where only a supervised few-shot support set is used to optimize and all test images are utilized to evaluate the model. All methods use ResNet50 as the backbone.

As shown in Table 1, Beh-VPT has 0.47% higher mIoU for one-shot and 0.56% higher mIoU for five-shot on COCO-

20^{i}

than [15]. It is worth noting that our one-shot and five-shot results on novel classes surpass [15] by 0.97% and 1.22% mIoU, respectively, demonstrating that the design of the background enhancement effectively boosts the prediction performance of the novel classes. For PASCAL-

5^{i}

, as shown in Table 2, the improvement of mIoU from [15] is 0.77% for one-shot and 0.76% for five-shot. The performance on the novel classes is also significantly improved. The introduction of Singular Value Fine-tuning allows a small number of parameters in the backbone to be unfrozen to fine-tune novel classes. This may result in a slight degradation of the base class but achieves better GFSS performance. Additionally, we compare the classical meta-learning-based FSS approaches on two benchmarks. We found that models designed specifically for FSS are unable to perform satisfactorily in both base and novel classes.

5.4.2. Qualitative Results

Figure 5 compares the qualitative results of our model and [15] for one-shot GFSS on PASCAL-

5^{i}

. For example, for the first two lines, our method achieves a more accurate prediction for the novel classes. In the third line, the base class dog and the novel class chair have a similar color, which is easy to confuse. It can be found that our method better predicts the boundary of the novel class chair, while maintaining comparable base class segmentation performance. Due to a series of designs for background enhancement, our model is less prone to confuse the foreground and background; furthermore, it predicts a clearer boundary. The fourth line demonstrates this equally well. In addition, the introduction of SVF helps us to further improve the prediction performance of the novel classes compared to the prediction without SVF in the fourth column. The last row demonstrates a bad case, where our method performs worse than [15]. Our method does not achieve better base classes prediction, which may be related to unfreezing part of the parameters in the backbone to fine-tune the novel classes.

5.5. Ablation Study

We perform a sequence of ablation studies to assess the effect of each component on the performance of segmentation. It should be noted that the experiments in this section are performed on COCO-

20^{i}

in the inductive setting.

5.5.1. Number of Background Visual Prompts

As shown in Table 3, we first evaluate the impact of the number of background visual prompts. Variation in the number of background prompts leads to different performance. Specifically, the number of background prompts have no apparent impact on the base classes. For the novel classes, the performance improves significantly as the number is raised to B, i.e., the same number as the base prompts. This is due to the fact that more background prompts better capture potential novel class information in the background. However, as the number is further boosted to 2B, we do not gain a more significant performance improvement.

5.5.2. Hybrid Causal Attention Module

In this paper, Hybrid Causal Attention Module (HCAM) plays two main roles: (1) making base prompts teach novel prompts how to query class information in multi-scale visual features; (2) passing potential information from the background into the novel prompts. We perform an ablation study of HCAM here to validate the above points. We first evaluate the performance of the different roles played by base and background prompts in the cross-attention of HCAM. As shown in Table 4, when the base and background prompts are not utilized in HCAM (Table 4, line 1), the model performs poorly, due to the fact that the rich high-level knowledge of the base and background prompts learned in the pre-training phase is not used. For the same reason, using only background prompts as key and value (Table 4, line 2) also has poor segmentation performance of novel classes. Further, using base prompts as key and value (Table 4, lines 3 and 4) significantly improves the performance of the novel classes. This confirms that base prompts can guide novel prompts to query class information. Comparing the third and fourth rows of Table 4, we can figure out that using both base and background prompts as key and value can further improve the performance of the novel classes, which indicates that HCAM is effective in passing potential novel class information from the background into the novel prompts.

In addition, we explore an ablation study considering whether novel prompts affect base prompts. From the experimental results in Table 5, it can be concluded that using novel prompts to affect the base prompts significantly reduces the performance of the base classes. Therefore, the base prompts are only used as key and value in HCAM. Parameters of the base prompts are not updated and are directly transmitted to the next layer.

5.5.3. Background-Enhanced Segmentation Head

The background-enhanced segmentation head predicts the background during base class pre-training while predicting the novel foreground during novel class fine-tuning. The experimental results are shown in Table 6. The background-enhanced segmentation head has higher accuracy of novel classes relative to a residual link. This can be attributed to the background-enhanced segmentation head further boosting the model capacity while retaining the potential novel information from the base class pre-training.

5.5.4. Singular Value Fine-Tuning

Singular Value Fine-Tuning (SVF) [19] is introduced to solve the over-fitting of the novel class fine-tuning. First, we compare the performance of Beh-VPT using different fine-tuning methods, as shown in Table 7. Unfreezing backbone training improves the performance of the novel classes; however, it significantly degrades the base classes. Compared to freezing backbone, the introduction of SVF improves the performance of the novel classes without making the base classes worse.

Then, we explore the influence of fine-tuning different subspaces and layers. Table 8 shows the performance of fine-tuning the parameters of different subspaces. We compare different combinations of the three subspaces and find that only fine-tuning

S

yields the best results. Either fine-tuning the

U

or

V

has a better fit on novel classes but leads to a severe degradation of base class performance due to the rapid over-fitting.

We compare the performance of fine-tuning different layers of the backbone. Table 9 shows the performance comparison of different combinations of convolution layers involved in the fine-tuning. First, when fine-tuning only one layer (rows 2 to 5 of Table 9), the models all achieve slight performance gains on the novel classes. For example, fine-tuning layer 4 improves more relative to the other layers due to the fact that features in deeper convolutions have high-level semantic information. Fine-tuning the subspace

S

of the deeper features enhances the model’s ability to abstract and understand novel classes. Then, as the number of fine-tuned layers increases, the model achieves a continuous increase in performance of novel classes. It is worth noting that fine-tuning all layers does not achieve the best performance. Finally, fine-tuning layers 2, 3, and 4 simultaneously achieves the best performance, which leads us to implement SVF in this way.

6. Discussion

In this section, we discuss the computational cost of the proposed method and then list the potential limitations and future directions. First, we evaluate the number of parameters and the inference efficiency of our model, as shown in Table 10. The newly added parameters mainly come from the background prompts and background-enhanced segmentation head, which are negligible compared to attention mechanisms and visual encoder–decoder structures. Their effect on inference time is also very minimal. In addition, SVF does not have any effect on the number of parameters and inference efficiency.

Now, we discuss potential limitations and future directions. In this paper, the background prompt is a key component that conveys potentially valuable information in the background to the novel prompts through HCAM. However, the ability to learn potential information in the background depends on the data distribution of the background in the datasets. Despite working well on two classic benchmarks of FSS, our method still needs further cross-dataset tests to prove the generality. In addition, the application of SVF is restrained. It is only introduced to the visual encoder, which leaves its full potential still to be unleashed. In the future, we will further validate the effectiveness of our method on other cross-domain datasets such as remote sensing [49,50], medical imaging [51], etc. As for the SVF, we consider applying it to the rest of the model, combined with some trick to prevent over-fitting.

7. Conclusions

In this work, we propose a background-enhanced visual prompting Transformer (Beh-VPT) to solve the confusion between the backgrounds and novel classes foreground. Specifically, we propose background visual prompts, which can learn potential novel class information in the background during base class pre-training and transfer the information to novel visual prompts during novel class fine-tuning via our proposed Hybrid Causal Attention Module. We introduce two modules to further enhance the model’s ability to learn novel classes and prevent over-fitting. Experiments on two GFSS benchmarks demonstrate the effectiveness of our proposed method. The results show that the method proposed in this paper achieves better performance than current SOTA methods of GFSS in both one-shot and five-shot settings. We believe that our work could shed light on future GFSS research based on visual prompts. For future works, we will explore applying background prompts to cross-domain datasets to validate generalization. In addition, we will consider applying SVF to other parts of the model to further improve the model’s performance.

Author Contributions

Conceptualization, M.L. and X.M.; data curation, M.L.; formal analysis, M.L.; methodology, M.L.; software, M.L.; writing—original draft, M.L.; writing—review and editing, M.L.; funding acquisition, X.M.; resources, X.M.; supervision, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Biological Breeding-National Science and Technology Major Project, grant number 2023ZD04076.

Data Availability Statement

The data that support the findings of this study are openly available at https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48 (accessed on 6 September 2020) and https://link.springer.com/article/10.1007/s11263-014-0733-5 (accessed on 25 June 2014).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542. [Google Scholar] [CrossRef]
Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321. [Google Scholar] [CrossRef]
Ren, W.; Tang, Y.; Sun, Q.; Zhao, C.; Han, Q.-L. Visual Semantic Segmentation Based on Few/Zero-Shot Learning: An Overview. IEEE/CAA J. Autom. Sin. 2024, 11, 1106–1126. [Google Scholar] [CrossRef]
Feng, Y.; Wang, Y.; Li, H.; Qu, M.; Yang, J. Learning What and Where to Segment: A New Perspective on Medical Image Few-Shot Segmentation. Med. Image Anal. 2023, 87, 102834. [Google Scholar] [CrossRef] [PubMed]
Yang, Y.; Chen, Q.; Feng, Y.; Huang, T. MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7131–7140. [Google Scholar]
Tian, Z.; Lai, X.; Jiang, L.; Liu, S.; Shu, M.; Zhao, H.; Jia, J. Generalized Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11563–11572. [Google Scholar]
Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.-N. Visual Prompt Tuning. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 709–727. [Google Scholar]
Park, S.; Byun, H. Fair-VPT: Fair Visual Prompt Tuning for Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 12268–12278. [Google Scholar]
Visual Tuning. ACM Computing Surveys. Available online: https://dl.acm.org/doi/full/10.1145/3657632 (accessed on 15 December 2024).
Hossain, M.R.I.; Siam, M.; Sigal, L.; Little, J.J. Visual Prompting for Generalized Few-Shot Segmentation: A Multi-Scale Approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
Zhang, J.-W.; Sun, Y.; Yang, Y.; Chen, W. Feature-Proxy Transformer for Few-Shot Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 6575–6588. [Google Scholar]
Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. Mining Latent Classes for Few-Shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8721–8730. [Google Scholar]
Chen, J.; Gao, B.-B.; Lu, Z.; Xue, J.-H.; Wang, C.; Liao, Q. SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes. arXiv 2021, arXiv:2104.09216. [Google Scholar]
Sun, Y.; Chen, Q.; He, X.; Wang, J.; Feng, H.; Han, J.; Ding, E.; Cheng, J.; Li, Z.; Wang, J. Singular Value Fine-Tuning: Few-Shot Segmentation Requires Few-Parameters Fine-Tuning. Adv. Neural Inf. Process. Syst. 2022, 35, 37484–37496. [Google Scholar]
Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A Comprehensive Survey of Few-Shot Learning: Evolution, Applications, Challenges, and Opportunities. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
Köhler, M.; Eisenbach, M.; Gross, H.-M. Few-Shot Object Detection: A Comprehensive Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11958–11978. [Google Scholar] [CrossRef]
Xin, Z.; Chen, S.; Wu, T.; Shao, Y.; Ding, W.; You, X. Few-Shot Object Detection: Research Advances and Challenges. Inf. Fusion 2024, 107, 102307. [Google Scholar] [CrossRef]
Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-Shot Image Semantic Segmentation with Prototype Alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5217–5226. [Google Scholar]
Ding, H.; Zhang, H.; Jiang, X. Self-Regularized Prototypical Network for Few-Shot Semantic Segmentation. Pattern Recognit. 2023, 133, 109018. [Google Scholar] [CrossRef]
Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid Graph Networks With Connection Attentions for Region-Based One-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9587–9595. [Google Scholar]
Min, J.; Kang, D.; Cho, M. Hypercorrelation Squeeze for Few-Shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6941–6952. [Google Scholar]
Hajimiri, S.; Boudiaf, M.; Ben Ayed, I.; Dolz, J. A Strong Baseline for Generalized Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11269–11278. [Google Scholar]
Liu, S.-A.; Zhang, Y.; Qiu, Z.; Xie, H.; Zhang, Y.; Yao, T. Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 11319–11328. [Google Scholar]
Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
Bulat, A.; Guerrero, R.; Martinez, B.; Tzimiropoulos, G. FS-DETR: Few-Shot DEtection TRansformer with Prompting and without Re-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023. [Google Scholar]
Han, G.; Lim, S.-N. Few-Shot Object Detection with Foundation Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28608–28618. [Google Scholar]
He, W.; Zhang, Y.; Zhuo, W.; Shen, L.; Yang, J.; Deng, S.; Sun, L. APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 23762–23772. [Google Scholar]
Bi, H.; Feng, Y.; Diao, W.; Wang, P.; Mao, Y.; Fu, K.; Wang, H.; Sun, X. Prompt-and-Transfer: Dynamic Class-Aware Enhancement for Few-Shot Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 131–148. [Google Scholar] [CrossRef]
Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101v3. [Google Scholar]
Xu, Q.; Zhao, W.; Lin, G.; Long, C. Self-Calibrated Cross Attention Network for Few-Shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 655–665. [Google Scholar]
Zhu, L.; Chen, T.; Yin, J.; See, S.; Liu, J. Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3370–3379. [Google Scholar]
Lang, C.; Cheng, G.; Tu, B.; Han, J. Learning What Not to Segment: A New Perspective on Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Jia, Y.; Li, J.; Wang, Q. Generalized Few-Shot Semantic Segmentation for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–10. [Google Scholar] [CrossRef]
Liu, W.; Wu, Z.; Zhao, Y.; Fang, Y.; Foo, C.-S.; Cheng, J.; Lin, G. Harmonizing Base and Novel Classes: A Class-Contrastive Approach for Generalized Few-Shot Segmentation. Int. J. Comput. Vis. 2023, 132, 1277–1291. [Google Scholar]
Sakai, T.; Kimura, D.; Qiu, H.; Katsuki, T.; Osogami, T.; Inoue, T. A Surprisingly Simple Approach to Generalized Few-Shot Semantic Segmentation. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior Guided Feature Enrichment Network for Few-Shot Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1050–1065. [Google Scholar] [CrossRef] [PubMed]
Zhang, B.; Xiao, J.; Qin, T. Self-Guided and Cross-Guided Learning for Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8312–8321. [Google Scholar]
Cermelli, F.; Mancini, M.; Bulo, S.R.; Ricci, E.; Caputo, B. Modeling the Background for Incremental Learning in Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9233–9242. [Google Scholar]
Liu, S.; Zhao, D.; Zhou, Y.; Tan, Y.; He, H.; Zhang, Z.; Tang, L. Full-Scale Change Detection Network for Remote Sensing Images Based on Deep Feature Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 1. [Google Scholar] [CrossRef]
Liu, S.; Zhao, D.; Zhou, Y.; Tan, Y.; He, H.; Zhang, Z.; Tang, L. Network and Dataset for Multiscale Remote Sensing Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 2851–2866. [Google Scholar] [CrossRef]
Pachetti, E.; Colantonio, S. A Systematic Review of Few-Shot Learning in Medical Imaging. Artif. Intell. Med. 2024, 156, 102949. [Google Scholar] [CrossRef]

Figure 1. Percentage of novel classes in each base-fold image (courtesy of Ref. [18]).

Figure 2. Architecture of the proposed background-enhanced visual prompting Transformer (Beh-VPT) during base class pre-training.

Figure 3. Architecture of the proposed background-enhanced visual prompting Transformer (Beh-VPT) during novel class fine-tuning.

Figure 4. Architecture of our proposed Hybrid Causal Attention Module.

Figure 5. Qualitative results of 1-way 1-shot on PASCAL-5^i. Each column from left to right represents the original image; ground truth mask; baseline [15] predictions; and our Beh-VPT predictions without Singular Value Fine-Tuning (SVF) and our Beh-VPT predictions with SVF, respectively.

Table 1. Comparison to SOTA methods on COCO-20ⁱ using the mIoU (%) evaluation metric. Tables 2–9 all use the same metrics.

Method	$COCO - 20^{i}$
	1-Shot			5-Shot
	Base	Novel	Mean	Base	Novel	Mean
SCCAN [41]	38.63	13.74	26.19	39.56	14.77	27.17
ABCB [42]	40.76	15.11	27.94	40.81	17.17	28.99
CAPL [9]	43.21	7.21	25.21	43.71	11.00	27.36
BAM [43]	49.84	14.16	32.00	49.85	16.63	33.24
DIaM [30]	42.69	15.32	29.00	38.47	20.87	29.67
POP [31]	30.38	9.63	20.00	24.53	16.19	20.36
BSPL [44]	30.96	12.37	21.67	25.03	20.00	22.52
CC [45]	46.89	8.83	27.86	47.11	12.69	29.90
BCM [46]	49.43	18.28	33.85	49.88	30.60	40.24
VP [15]	51.55	18.00	34.78	51.59	30.06	40.83
Beh-VPT (ours)	51.53	18.97	35.25	51.49	31.28	41.39

Table 2. Comparison to SOTA methods on PASCAL-5ⁱ.

Method	$PASCAL - 5^{i}$
	1-Shot			5-Shot
	Base	Novel	Mean	Base	Novel	Mean
PANET [24]	31.88	11.25	21.57	32.95	15.25	24.1
PFENET [29]	8.32	2.67	5.50	8.83	1.89	5.36
SCL [47]	8.88	2.44	5.66	9.11	1.83	5.47
MiB [48]	63.80	8.86	36.33	68.60	28.93	48.77
SCCAN [41]	60.46	30.55	45.51	61.13	31.62	46.38
ABCB [42]	62.93	33.02	47.98	63.94	38.70	51.32
CAPL [9]	64.80	17.46	41.13	65.43	24.43	44.93
BAM [43]	71.60	27.49	49.55	71.60	28.96	50.28
DIaM [30]	66.79	27.36	47.08	64.05	34.56	49.31
POP [31]	46.68	19.96	33.32	41.50	36.26	38.80
BSPL [44]	48.33	23.03	35.68	43.19	39.88	41.54
CC [45]	69.38	22.58	45.98	70.54	32.08	51.31
BCM [46]	71.15	41.24	56.20	71.23	55.36	63.29
VP [15]	74.58	34.99	54.79	74.86	50.34	62.60
Beh-VPT (ours)	74.58	36.53	55.56	74.84	51.87	63.36

Table 3. Ablation study on background visual prompts initialization.

Number of Background Visual Prompts	mIoU
Number of Background Visual Prompts	Base	Novel	Mean
0	51.54	18.20	34.87
B/2	51.52	18.36	34.94
B	51.53	18.97	35.25
2B	51.46	18.93	35.20

Table 4. Ablation study on HCAM: the performance of the different roles of base and background prompts.

Roles of Base and Background Prompts	mIoU
Roles of Base and Background Prompts	Base	Novel	Mean
novel prompts as query, key and value	51.53	7.53	29.53
novel prompts as query background prompts as key, and value	51.53	8.02	29.78
novel prompts as query base prompts as key, and value	51.53	18.00	34.77
novel prompts as query base prompts and background prompts as key and value	51.53	18.97	35.25

Table 5. Ablation study on HCAM: whether novel prompts affect base prompts.

Novel Prompts Affect Base Prompts?	mIoU
Novel Prompts Affect Base Prompts?	Base	Novel	Mean
Yes	46.67	18.60	32.64
No	51.53	18.97	35.25

Table 6. Ablation study on background-enhanced segmentation head.

Segmentation Head of Novel Classes	mIoU
Segmentation Head of Novel Classes	Base	Novel	Mean
same segmentation head as the base classes	23.34	18.02	20.02
Residual, like [15]	51.52	18.53	35.03
background-enhanced segmentation head	51.53	18.97	35.25

Table 7. Ablation study on tuning methods.

Tuning Methods	mIoU
Tuning Methods	Base	Novel	Mean
Unfreeze backbone	26.78	30.94	28.86
Freeze backbone	51.60	18.36	34.98
SVF	51.53	18.97	35.25

Table 8. Ablation study on fine-tuning different subspaces of convolution layer.

Subspace			mIoU
U	S	V	Base	Novel	Mean
√			33.75	19.93	26.84
	√		51.53	18.97	35.25
		√	36.37	21.07	28.72
√	√		35.08	21.86	28.47
	√	√	35.36	20.39	27.88
√		√	20.35	21.54	20.95
√	√	√	19.81	23.02	21.42

Table 9. Ablation study on fine-tuning different layers of backbone.

Layers				mIoU
layer 1	layer 2	layer 3	layer 4	Base	Novel	Mean
				51.60	18.36	34.98
√				51.60	18.39	35.00
	√			51.59	18.50	35.05
		√		51.60	18.51	35.06
			√	51.59	18.61	35.10
√	√			51.51	18.43	34.97
√		√		51.50	18.44	34.97
√			√	51.54	18.41	34.98
	√	√		51.55	18.52	35.04
	√		√	51.54	18.70	35.12
		√	√	51.55	18.77	35.16
√	√	√		51.49	18.66	35.08
√	√		√	51.50	18.62	35.06
	√	√	√	51.53	18.97	35.25
√	√	√	√	51.42	18.99	35.21

Table 10. Computational efficiency comparison: parameter and 1-shot inference time comparison on PASCAL-5ⁱ.

Method	Params.	FLOPs	Inference Time
VP [15]	69.19 M	55.16G	0.015s
Beh-VPT (ours)	69.34 M	55.93G	0.017s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, M.; Ma, X. Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation. Electronics 2025, 14, 1389. https://doi.org/10.3390/electronics14071389

AMA Style

Li M, Ma X. Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation. Electronics. 2025; 14(7):1389. https://doi.org/10.3390/electronics14071389

Chicago/Turabian Style

Li, Man, and Xiaodong Ma. 2025. "Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation" Electronics 14, no. 7: 1389. https://doi.org/10.3390/electronics14071389

APA Style

Li, M., & Ma, X. (2025). Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation. Electronics, 14(7), 1389. https://doi.org/10.3390/electronics14071389

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Background-Enhanced Visual Prompting Transformer for Generalized Few-Shot Semantic Segmentation

Abstract

1. Introduction

2. Related Work

2.1. Few-Shot Semantic Segmentation

2.2. Generalized Few-Shot Semantic Segmentation

2.3. Prompting

3. Problem Definition

3.1. Standard Few-Shot Segmentation

3.2. Generalized Few-Shot Segmentation

4. Method

4.1. Background-Enhanced Visual Prompting Transformer

4.2. Hybrid Causal Attention Module

4.3. Background-Enhanced Segmentation Head

4.4. Singular Value Fine-Tuning for Non-Meta Learning

5. Experiments and Discussion

5.1. Datasets

5.2. Evaluation Metrics

5.3. Implementation Details

5.4. Comparison with the State-of-the-Art

5.4.1. Quantitative Results

5.4.2. Qualitative Results

5.5. Ablation Study

5.5.1. Number of Background Visual Prompts

5.5.2. Hybrid Causal Attention Module

5.5.3. Background-Enhanced Segmentation Head

5.5.4. Singular Value Fine-Tuning

6. Discussion

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI