1. Introduction
Deep convolutional neural networks driven by large-scale labeled datasets [1,2] have achieved great success in many visual recognition tasks [3,4]. However, collecting large-scale labeled data is laborious and expensive, especially for dense prediction tasks such as semantic segmentation [5] and instance segmentation [5]. In contrast, humans can easily recognize new patterns or objects from only a handful of samples. To reduce the need for dense annotations and to mimic the human learning process, few-shot semantic segmentation (FSS) [6] has been introduced to enable the segmentation of previously unseen objects with only a few annotated images. However, existing FSS methods [7,8] focus on improving the segmentation performance of the novel classes while ignoring the base classes. Generalized few-shot semantic segmentation (GFSS) [9] has therefore been proposed to segment the novel classes while maintaining the performance of the base classes. GFSS defines a more realistic scenario that demands better generalization and robustness from the network.
In the field of natural language processing [10], prompting techniques [11] have proven able to help models learn from a small number of samples and quickly adapt to new tasks in the absence of large-scale labeled data. Similarly, prompting has been applied to computer vision as visual prompt tuning [12,13,14], an effective mechanism for enhancing the adaptability and learning ability of pre-trained models on new tasks. Prompting techniques are therefore inherently suitable for GFSS, which usually offers sufficient base class data and only a small amount of novel class data. The key to solving GFSS with visual prompting lies in two points: (1) How can we make full use of the large amount of base class data to learn embeddings that represent visual prompts? (2) How can we generalize the pre-trained visual prompts to the novel classes while maintaining the performance of the base classes? Hossain et al. [15] developed a multi-scale visual prompting Transformer, which defines a series of learnable embeddings to represent base class visual prompts. Cross-attention is then computed between the base visual prompts and the image features, which enables the prompts to capture the information of each class; the prompts are finally used for dense prediction. For learning novel class visual prompts, [15] designs a novel-to-base causal attention to pass high-level information from the base visual prompts to the novel visual prompts. This attention mechanism guarantees that the novel prompts, derived from a limited number of samples, remain distinctly separate from the base prompts, ultimately leading to superior GFSS performance.
Although [15] demonstrates that visual prompting can be effectively applied to GFSS, unresolved issues remain. Ref. [15] focuses only on the foreground during base class pre-training. However, as observed in [16,17], the background regions of the base class data are not homogeneous, and potential novel class information is contained within them. Figure 1 illustrates the percentage of potential novel classes in the background regions of the base class images for each fold. The current method treats the novel classes as background when pre-training the base prompts, which causes two problems: (1) treating the novel classes as background leads to confusion and degrades the potential of novel class fine-tuning; (2) the pre-trained base prompts carry information only about the base classes, so the novel class data are not fully utilized. Therefore, further mining the potential novel class information in the background is a promising optimization direction for visual prompting-based GFSS. In addition, because novel class data in GFSS are extremely limited, over-fitting becomes a critical issue that must be handled carefully. Specifically, during novel class fine-tuning in [15], the parameters of the backbone, pixel decoder, and base segmentation head are frozen, which leads to sub-optimal novel class learning. Without the freezing strategy, however, the model quickly overfits on the sparse novel class data, which sets back the performance of the base classes.
To address the above problems, we propose Beh-VPT, a background-enhanced visual prompting Transformer for GFSS. Beh-VPT follows a non-meta-learning paradigm: the model is first trained on a large amount of base class data; then, several modules are frozen and the remaining modules are fine-tuned using the few available novel class samples. Our model design is inspired by [15], yet several novel modules are introduced. Specifically, we introduce background visual prompts, defined as learnable embedding vectors like the base and novel visual prompts. During base class pre-training, the background visual prompts are fed into the cross-attention module together with the base class visual prompts and interact with the image features, allowing them to learn potential novel class information in the background of base class images. During novel class fine-tuning, the Hybrid Causal Attention Module (HCAM) is proposed to convey novel class information from the background visual prompts to the novel visual prompts. In addition, we further refine the model design to prevent over-fitting on the novel classes. First, we introduce the background-enhanced segmentation head, which enhances the capacity of the segmentation head used to predict the novel classes without affecting its base counterpart. Second, Singular Value Fine-Tuning (SVF) [19] is introduced into novel class fine-tuning, which makes the model less prone to over-fitting and leads to better generalization. Our contributions can be summarized as follows:
We design a background-enhanced visual prompting Transformer (Beh-VPT) for generalized few-shot semantic segmentation. Background visual prompts are proposed to learn potential novel class information in the background during base class pre-training and to influence novel visual prompts during novel class fine-tuning via the Hybrid Causal Attention Module.
We introduce two methods that further enhance the model’s ability to learn novel classes and prevent over-fitting. The background-enhanced segmentation head increases the model capacity and brings better performance on the novel classes, while Singular Value Fine-Tuning unleashes the full potential of the model. Neither of them degrades the performance of the base classes.
Experiments on GFSS benchmark datasets demonstrate the effectiveness of our proposed method. The results show that it achieves better performance than current state-of-the-art GFSS methods in both one-shot and five-shot settings.
The remainder of this paper is organized as follows. In Section 2, we introduce the related works in the field of GFSS. In Section 3, we clarify the problem definition of GFSS. In Section 4, we illustrate the details of the proposed Beh-VPT. The experimental results and discussion are presented in Section 5. Finally, we conclude the paper in Section 6.
4. Method
In this section, we first introduce the proposed background-enhanced visual prompting Transformer (Beh-VPT) model, in which background visual prompts are introduced to capture potential novel class information during the base pre-training phase. We then present the Hybrid Causal Attention Module (HCAM), which efficiently and unidirectionally transfers information from the base and background visual prompts to their novel counterparts. Finally, we introduce two methods that further improve the model’s ability to learn novel classes and prevent over-fitting.
4.1. Background-Enhanced Visual Prompting Transformer
The background-enhanced visual prompting Transformer (Beh-VPT) is trained in two stages: base class pre-training and novel class fine-tuning. In both stages, the input image is first fed into a convolutional neural network-based encoder to extract features. The intermediate features are then up-sampled by the decoder to obtain multi-scale image features. During decoding, visual prompts representing the base classes, the novel classes, and the background compute multi-scale cross-attention with the visual features to encode the corresponding class information. Finally, a background-enhanced segmentation head takes the trained visual prompts and the high-resolution image features as inputs and outputs dense predictions. The key idea of Beh-VPT is the design and use of visual prompts. Specifically, we use learnable embeddings to define three types of visual prompts: base visual prompts, novel visual prompts, and background visual prompts. The three types of visual prompts are utilized in different training stages of GFSS.
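To make the prompt design concrete, the following PyTorch-style snippet is a minimal sketch (not the full implementation) of how the three prompt types can be declared as learnable embeddings; the class name `VisualPrompts` and the specific dimensions are illustrative assumptions.

```python
# Minimal sketch: the three types of visual prompts as learnable embeddings.
# Class name and dimensions are illustrative, not the exact implementation.
import torch
import torch.nn as nn

class VisualPrompts(nn.Module):
    def __init__(self, num_base: int, num_novel: int, embed_dim: int):
        super().__init__()
        # Base and background prompts: trained during base class pre-training.
        self.base = nn.Parameter(torch.randn(num_base, embed_dim))
        self.background = nn.Parameter(torch.randn(num_base, embed_dim))
        # Novel prompts: added for the fine-tuning stage (randomly initialized
        # here; in our method they are initialized by masked average pooling).
        self.novel = nn.Parameter(torch.randn(num_novel, embed_dim))

prompts = VisualPrompts(num_base=15, num_novel=5, embed_dim=256)
print(prompts.base.shape, prompts.background.shape, prompts.novel.shape)
```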
In the first stage, the model is pre-trained using base class data, as shown in Figure 2. Given $B$ base classes, we define $B$ base visual prompts, which are randomly initialized. In particular, we define $B$ background prompts in the same way as the base visual prompts; they are designed primarily to mine valuable knowledge from the background of the images during base class pre-training. The base visual prompts and background visual prompts are first concatenated by $\mathrm{Concat}(\cdot)$, as shown in Equation (1). The concatenated vector is then fed into the self-attention module to model the relationship between the prompts. After learning the attention within the visual prompts, cross-attention is utilized for the interaction between the $2B$ visual prompts and the multi-scale image features. The self-attention and cross-attention operations are shown in Equation (2):

$$P^{l} = \mathrm{Concat}\big(P_{b}^{l},\, P_{bg}^{l}\big), \tag{1}$$

$$\hat{P}^{l} = \mathrm{CA}\big(\mathrm{SA}(P^{l}),\, F^{l}\big), \tag{2}$$

where $P_{b}^{l}$ and $P_{bg}^{l}$ represent the visual prompts for the base classes and the background at layer $l$, and $P^{l}$ is obtained by concatenating them. $\mathrm{SA}(\cdot)$ and $\mathrm{CA}(\cdot)$ denote self-attention and cross-attention, respectively. $F^{l} \in \mathbb{R}^{H_{l}W_{l} \times d}$ denotes the corresponding image feature from the decoder, where $H_{l}$ and $W_{l}$ denote the height and width of the image features at layer $l$ and $d$ denotes the embedding dimension. After the multi-level Transformer attention, the refined base prompts and background prompts $\hat{P}_{b}$ and $\hat{P}_{bg}$, together with the high-resolution image features, are fed into the background-enhanced segmentation head to predict segmentation masks, which will be described in detail in Section 4.3.
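The following sketch illustrates one decoder layer of Equations (1) and (2) using standard PyTorch attention modules; the number of heads, the tensor shapes, and the variable names are assumptions made for the example.

```python
# Illustrative sketch of Equations (1) and (2) for a single decoder layer.
import torch
import torch.nn as nn

B, d = 15, 256                      # number of base classes, embedding dimension
H_l, W_l = 32, 32                   # spatial size of the image features at layer l
P_b = torch.randn(1, B, d)          # base visual prompts  (batch, B, d)
P_bg = torch.randn(1, B, d)         # background visual prompts
F_l = torch.randn(1, H_l * W_l, d)  # flattened multi-scale image features

self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)   # SA(.)
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)  # CA(.)

# Equation (1): concatenate base and background prompts into 2B prompts.
P = torch.cat([P_b, P_bg], dim=1)

# Equation (2): self-attention among the 2B prompts, then cross-attention in
# which the prompts query the image features of this decoder layer.
P, _ = self_attn(P, P, P)
P_hat, _ = cross_attn(P, F_l, F_l)
P_b_hat, P_bg_hat = P_hat.split(B, dim=1)   # refined base / background prompts
```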
In the second stage, the model is fine-tuned using novel class data, as shown in Figure 3. Given $N$ novel classes, we define $N$ novel visual prompts, which are initialized by masked average global pooling over the images and the corresponding binary masks of the $N$ novel classes. The three kinds of visual prompts are then fed into the proposed Hybrid Causal Attention Module (HCAM), which teaches the novel visual prompts how to prompt the image features. Details of HCAM are introduced in Section 4.2. After HCAM, similar to the pre-training stage, the visual prompts are cross-attended with the multi-scale image features to learn the corresponding category information. Finally, the refined novel prompts and the high-resolution image features are fed into the background-enhanced segmentation head to predict segmentation masks. In this stage, the parameters of the base visual prompts, the background visual prompts, the decoder, and the segmentation head for the base classes are frozen. It is worth noting that a small number of parameters in the encoder are not frozen; these unfrozen parameters are determined by Singular Value Fine-Tuning, which will be introduced in Section 4.4.
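For the masked average global pooling used to initialize the novel prompts, a possible implementation is sketched below; the function name and tensor shapes are assumptions made for illustration.

```python
# Sketch of masked average global pooling for novel prompt initialization.
import torch

def masked_average_pooling(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """features: (d, H, W) image features of a support image.
    mask: (H, W) binary mask of one novel class.
    Returns a (d,) vector used to initialize the corresponding novel prompt."""
    mask = mask.float()
    area = mask.sum().clamp(min=1.0)                      # avoid division by zero
    return (features * mask.unsqueeze(0)).sum(dim=(1, 2)) / area

feat = torch.randn(256, 64, 64)          # features of a support image
mask = (torch.rand(64, 64) > 0.5)        # binary mask of a novel class
novel_prompt_init = masked_average_pooling(feat, mask)    # shape: (256,)
```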
4.2. Hybrid Causal Attention Module
Attributed to cross-attention learning with the multi-scale image features, potential novel class information in the background of the base class images is captured by the background prompts during base class pre-training. The next key challenge is how to transfer this information to the novel visual prompts. Ref. [15] uses a causal attention module to enable the novel prompts to acquire the contextual modeling capabilities of the base prompts while preventing the novel prompts from negatively affecting the base prompts. Inspired by this, the Hybrid Causal Attention Module (HCAM) is proposed to accomplish the interaction among the three types of visual prompts simultaneously, as shown in Figure 4. Specifically, self-attention is first computed only among the novel visual prompts. Then, the three types of visual prompts are fed into the hybrid causal attention, which is based on the cross-attention mechanism of the Transformer decoder; the query $Q$, key $K$, and value $V$ are obtained according to the following equation:

$$Q = P_{n}^{l} W_{Q}, \quad K = \mathrm{Concat}\big(P_{b}^{l}, P_{bg}^{l}\big) W_{K}, \quad V = \mathrm{Concat}\big(P_{b}^{l}, P_{bg}^{l}\big) W_{V}, \tag{3}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ are trainable linear transformation matrices, and $P_{n}^{l}$ represents the visual prompts for the $N$ novel classes at decoder layer $l$. The calculated matrices $Q$, $K$, and $V$ are used for the Hybrid Causal Attention operation, the process of which is shown in the following equation:

$$\hat{P}_{n}^{l} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V, \tag{4}$$

where $d_{k}$ represents the dimension of the input vector.
In HCAM, the novel prompts are updated by the base prompts and background prompts through the cross-attention mechanism. By treating the novel prompts as query vectors and the combination of base and background prompts as key/value vectors, HCAM enables the base prompts to teach the novel prompts how to learn, i.e., how to query novel class information from the multi-scale visual features. HCAM also passes potential information from the background prompts into the novel prompts, which allows the novel prompts to acquire more novel class knowledge. In addition, the base prompts and background prompts are transmitted directly to the next layer in HCAM, which ensures that they are not disturbed by the novel prompts.
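The following is a simplified sketch of the hybrid causal attention step in Equations (3) and (4); multi-head projections and normalization layers are omitted, and the module name is illustrative.

```python
# Simplified hybrid causal attention: novel prompts query the base and
# background prompts, which are passed to the next layer unchanged.
import math
import torch
import torch.nn as nn

class HybridCausalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V

    def forward(self, p_novel, p_base, p_bg):
        # Equation (3): queries from novel prompts; keys/values from the
        # concatenation of base and background prompts.
        q = self.w_q(p_novel)
        kv = torch.cat([p_base, p_bg], dim=1)
        k, v = self.w_k(kv), self.w_v(kv)
        # Equation (4): scaled dot-product attention updates the novel prompts only.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        p_novel = attn @ v
        # Base and background prompts are transmitted to the next layer untouched.
        return p_novel, p_base, p_bg

hca = HybridCausalAttention(dim=256)
p_n, p_b, p_bg = hca(torch.randn(1, 5, 256), torch.randn(1, 15, 256), torch.randn(1, 15, 256))
```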
4.3. Background-Enhanced Segmentation Head
After the learning of visual prompts, the image features and visual prompts are fed into our proposed background-enhanced segmentation head for pixel-level prediction. Specifically, during base class training, the base prompts $\hat{P}_{b}$ and background prompts $\hat{P}_{bg}$ are first fed into the background-enhanced segmentation head to obtain the prototypes. The background-enhanced segmentation head consists of a base segmentation head $\mathcal{H}_{b}$ and a background segmentation head $\mathcal{H}_{bg}$, both of which are multi-layer fully connected networks, as shown in Figure 2. The base prompts are first dot-producted with the image features to obtain the similarity; the per-pixel per-class probability $p_{b}$ is then obtained by a Softmax operation, as shown in Equation (5). In particular, the background prompts are flattened along the channel dimension after performing the same dot-product operation. Subsequently, we use $\mathcal{H}_{bg}$ to obtain the output of the background segmentation head, as shown in Equation (6); it is denoted by $p_{bg}$ and represents the probability that each pixel belongs to the background. The formulas are as follows:

$$p_{b} = \mathrm{Softmax}\big(F \cdot \mathcal{H}_{b}(\hat{P}_{b})^{\top}\big), \tag{5}$$

$$p_{bg} = \mathcal{H}_{bg}\big(\mathrm{Flatten}(F \cdot \hat{P}_{bg}^{\top})\big), \tag{6}$$

where $F$ denotes the high-resolution image features.
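A rough sketch of Equations (5) and (6) is given below; the depth and widths of the two fully connected heads are illustrative assumptions rather than the exact configuration.

```python
# Sketch of the background-enhanced segmentation head (Equations (5) and (6)).
import torch
import torch.nn as nn

d, B, HW = 256, 15, 128 * 128
F = torch.randn(1, HW, d)         # high-resolution image features
P_b_hat = torch.randn(1, B, d)    # refined base prompts
P_bg_hat = torch.randn(1, B, d)   # refined background prompts

# H_b: maps each base prompt to a class prototype of the same dimension.
head_b = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
# H_bg: maps the flattened B-channel background similarities to one logit per pixel.
head_bg = nn.Sequential(nn.Linear(B, B), nn.ReLU(), nn.Linear(B, 1))

# Equation (5): dot product with the base prototypes, then a per-pixel Softmax.
p_b = torch.softmax(F @ head_b(P_b_hat).transpose(-2, -1), dim=-1)   # (1, HW, B)

# Equation (6): the same dot product with the background prompts, flattened
# along the channel dimension, then fed to the background head.
p_bg = head_bg(F @ P_bg_hat.transpose(-2, -1))                        # (1, HW, 1)
```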
During novel class fine-tuning, using the base segmentation head to predict the novel classes leads to significant over-fitting and poor base class segmentation performance. Therefore, Ref. [15] freezes the parameters of the base segmentation head and only learns a residual, which effectively mitigates over-fitting. However, learning only a residual reduces the model capacity for the novel classes and thus lowers the upper limit of novel class fine-tuning. To address this issue, the background segmentation head is used to make dense predictions for the novel classes, while the base segmentation head is frozen to maintain the performance of the base classes, as shown in Figure 3. The flexible design of our background-enhanced segmentation head improves the model capacity and decouples the segmentation heads of the base classes and the novel classes, thus leading to better GFSS performance.
4.4. Singular Value Fine-Tuning for Non-Meta Learning
Singular Value Fine-Tuning [19] was proposed to alleviate over-fitting in few-shot semantic segmentation. In the meta-learning paradigm, this method decomposes the backbone parameters into three successive matrices via Singular Value Decomposition, then fine-tunes only the singular values and keeps the others frozen. It preserves the rich semantic cues in the visual backbone while helping the model adapt to the segmentation task. It is therefore natural to apply this method to the novel class fine-tuning of the non-meta-learning paradigm. Specifically, after base class training, the convolution layers in the pre-trained encoder are first decomposed according to the following equation:

$$W' = U S V^{\top}, \tag{7}$$

where $W' \in \mathbb{R}^{C_{out} \times C_{in}k^{2}}$ is flattened from the weight tensor $W \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$ of a convolution layer, and $C_{in}$, $C_{out}$, and $k$ represent the number of input channels, the number of output channels, and the kernel size, respectively. The two matrices $U$ and $V^{\top}$ construct two convolution layers: a $1 \times 1$ convolution layer and a $k \times k$ convolution layer. $S$ is a diagonal matrix with the singular values on the diagonal. Finally, each convolution layer in the pre-trained encoder is decomposed into two new convolutions and an affine layer. Only $S$ is fine-tuned, and its parameters take up a tiny fraction of the whole backbone. Singular Value Fine-Tuning in the non-meta-learning paradigm effectively mitigates the over-fitting problem of novel class fine-tuning and further improves the generalization of the model.
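The sketch below applies Equation (7) to a single convolution layer following the description above: the flattened kernel is decomposed by SVD, $U$ and $V^{\top}$ are frozen as a $1 \times 1$ and a $k \times k$ convolution, and only the singular values $S$ are fine-tuned. The exact wiring (padding, bias handling) is an assumption for illustration.

```python
# Singular Value Fine-Tuning on one convolution layer (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False)
c_out, c_in, k, _ = conv.weight.shape

# Flatten the 4-D kernel to (C_out, C_in * k * k) and decompose it: W' = U S V^T.
w = conv.weight.detach().reshape(c_out, c_in * k * k)
u, s, vh = torch.linalg.svd(w, full_matrices=False)

u = nn.Parameter(u, requires_grad=False)    # frozen: becomes a 1 x 1 convolution
vh = nn.Parameter(vh, requires_grad=False)  # frozen: becomes a k x k convolution
s = nn.Parameter(s, requires_grad=True)     # only the singular values are fine-tuned

def svf_conv(x: torch.Tensor) -> torch.Tensor:
    r = s.numel()
    x = F.conv2d(x, vh.reshape(r, c_in, k, k), padding=k // 2)  # V^T as k x k conv
    x = x * s.view(1, r, 1, 1)                                  # S as channel-wise affine
    return F.conv2d(x, u.reshape(c_out, r, 1, 1))               # U as 1 x 1 conv

out = svf_conv(torch.randn(2, 64, 32, 32))   # -> (2, 128, 32, 32)
```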
6. Discussion
In this section, we discuss the computational cost of the proposed method and then list potential limitations and future directions. First, we evaluate the number of parameters and the inference efficiency of our model, as shown in Table 10. The newly added parameters mainly come from the background prompts and the background-enhanced segmentation head, which are negligible compared to the attention mechanisms and the visual encoder–decoder structure. Their effect on inference time is also minimal. In addition, SVF has no effect on the number of parameters or the inference efficiency.
We now discuss potential limitations and future directions. In this paper, the background prompt is a key component that conveys potentially valuable information in the background to the novel prompts through HCAM. However, the ability to learn this potential information depends on the distribution of background content in the datasets. Although our method works well on two classic FSS benchmarks, further cross-dataset tests are needed to prove its generality. In addition, the application of SVF is restricted: it is only introduced into the visual encoder, so its full potential remains to be unleashed. In the future, we will further validate the effectiveness of our method on other cross-domain datasets such as remote sensing [49,50] and medical imaging [51]. As for SVF, we will consider applying it to the rest of the model, combined with appropriate techniques to prevent over-fitting.