**1. Introduction**

Owing to the strong representation ability of convolutional neural networks (CNNs) and the availability of large-scale datasets, semantic segmentation and object detection have developed tremendously. However, it is worth pointing out that annotating a large number of object masks is time-consuming, expensive, and sometimes infeasible in certain scenarios, such as computer-aided diagnosis systems. Moreover, without massive annotated data, the performance of deep learning models drops dramatically on classes that do not appear in the training dataset. Few-shot segmentation (FSS) is a promising field for tackling this issue. Unlike conventional semantic segmentation, which merely segments the classes appearing in the training set, few-shot segmentation utilizes one or a few annotated samples to segment new classes.

Most existing FSS methods follow a prototype-based paradigm. They first extract features from both query and support images; the support features and their masks are then encoded into a single prototype [1] representing foreground semantics, or a pair of prototypes [2,3] representing the foreground and background. Finally, a dense comparison is conducted between the prototype(s) and the query feature. Feature comparison is usually performed in one of two ways: with an explicit metric function (e.g., cosine similarity [3]) or an implicit metric function (e.g., RelationNet [4]).
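To make this paradigm concrete, the sketch below implements masked average pooling and an explicit cosine-based dense comparison in PyTorch; the tensor shapes and helper names are our own illustrative assumptions, not the API of any cited method.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Collapse the masked region of a support feature map into one prototype.

    feat: support feature map, shape (C, H, W)
    mask: binary foreground mask, shape (H, W)
    """
    masked = feat * mask.unsqueeze(0)                     # zero out background pixels
    return masked.sum(dim=(1, 2)) / (mask.sum() + 1e-6)   # prototype of shape (C,)

def dense_cosine_comparison(query_feat: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    """Explicit metric: compare the prototype against every query location.

    query_feat: query feature map, shape (C, H, W)
    Returns an (H, W) cosine-similarity map.
    """
    proto_map = prototype.view(-1, 1, 1).expand_as(query_feat)
    return F.cosine_similarity(query_feat, proto_map, dim=0)
```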

As shown in Figure 1a, it is widely acknowledged [2,5,6] that a single prototype generated by masked average pooling cannot carry sufficient information. Specifically, due to varied appearances and poses, masked average pooling only retains the information of discriminative pixels and ignores the information of plain pixels. To overcome this problem, multi-prototype strategies [2,5,6] have been proposed that divide foreground regions into several pieces.


**Figure 1.** Illustration of differences in prototype learning for 1-shot segmentation. (**a**) Single-prototype methods [1,7] tend to lose the information of plain pixels. (**b**) Multi-prototype methods [2,5,8] based on regional division may damage the representation of the whole object. (**c**) Our Complementary Prototype Generation module adaptively retains the information of both discriminative and plain pixels.

However, as shown in Figure 1b, these multi-prototype methods still suffer from two drawbacks. First, the representation of the foreground region as a whole is weakened, since existing methods split the region into several pieces and damage the correlation among the generated prototypes. Second, current methods often ignore the inter-class similarity between foreground and background, and their training strategy, which focuses on segmenting the main foreground objects, underestimates the need to discriminate between foreground and background. As a result, existing multi-prototype methods tend to misclassify background pixels as foreground.

In this paper, we propose a simple yet effective method, called the Dual Complementary prototype Network (DCNet), to overcome the above-mentioned drawbacks. Specifically, it is composed of two branches that segment the foreground and background in a complementary manner, and both segmentation branches rely on our proposed Complementary Prototype Generation (CPG) module. The CPG module extracts comprehensive support information from the support set. We first extract the average prototype through global average pooling with the support mask, and then obtain its attention weight on the support image by iteratively computing the cosine distance between the foreground feature and the average prototype. In this way, we can easily determine which parts of the information are focused on and which parts are ignored, without performing segmentation on the support image. We then use this attention weight to generate a pair of prototypes representing the focused and the ignored regions. By using a weight map to generate the prototypes for comparison, we preserve the correlation among the generated prototypes and avoid information loss to a certain extent.

Furthermore, we introduce background guided learning to pay additional attention to the inter-class similarity between the foreground and background. Considering that the background in support images is not always the same as that in the query image, we adopt a training manner different from foreground segmentation, where the query background mask is used as guidance for segmenting the background of the query image. In this way, our model can learn a more discriminative representation for distinguishing foreground from background. The proposed method effectively and efficiently improves performance on FSS benchmarks without extra inference cost.

The main contributions of this work are summarized as follows.

- We propose the Dual Complementary prototype Network (DCNet), which segments the foreground and background in a complementary manner for few-shot segmentation.
- We design a Complementary Prototype Generation (CPG) module that uses a weight map to generate a pair of complementary prototypes, preserving the correlation among them and reducing information loss.
- We introduce Background Guided Learning (BGL) as an auxiliary supervision that enhances the feature discriminability between the foreground and background.
- Our method effectively and efficiently improves performance on FSS benchmarks without extra inference cost.


### **2. Related Work**

### *2.1. Semantic Segmentation*

Semantic segmentation, which aims to classify each pixel, has been extensively investigated. Following the Fully Convolutional Network (FCN) [9], which uses fully convolutional layers instead of fully connected layers as the classifier for semantic segmentation, a large number of network frameworks have been designed. For example, U-Net adopted a multi-scale strategy and an encoder-decoder architecture to improve on FCN, and PSPNet introduced the pyramid pooling module (PPM) to capture object details. DeepLab [10,11] added an Atrous Spatial Pyramid Pooling (ASPP) module, a conditional random field (CRF) module, and dilated convolutions to the FCN architecture. Recently, attention mechanisms have been introduced: PSANet [12] uses point-wise spatial attention with a bi-directional information propagation paradigm, and channel-wise attention [13] and non-local attention [14–17] are also effective for segmentation. These methods have succeeded on large-scale datasets, but they are not designed to deal with rare and unseen classes and cannot accommodate them without fine-tuning.

### *2.2. Few-Shot Learning*

Few-shot learning focuses on the generalization ability of models, so that they can learn to predict novel classes from a few annotated examples [4,18–21]. Matching networks [19] were proposed for 1-shot learning, exploiting a special kind of mini-batch called an episode to match the training and testing environments, which enhances generalization to novel classes. The prototypical network [20] computes distances to representation cluster centers for few-shot classification. Finn et al. [21] proposed a model-agnostic algorithm for meta-learning. Even though few-shot learning has been extensively studied for classification, it remains hard to apply directly to segmentation because of the dense prediction involved.

### *2.3. Few-Shot Segmentation*

As an extension of few-shot learning, few-shot semantic segmentation has also received considerable attention recently. Shaban et al. [22] first proposed the few-shot segmentation problem, with a two-branch conditional network that learns parameters from the support images. Different from [22], later works [1–3,23,24] follow the idea of metric learning. Zhang et al. generate the foreground segmentation of the support class by measuring the embedding similarity between the query and supports, where the embeddings are extracted by the same backbone. Generally, metric-learning-based methods can be divided into two groups. One group is inspired by ProtoNet [20]: e.g., PANet [3] first embeds different foreground objects and the background into different prototypes via a shared feature extractor, and then measures the similarity between the query and the prototypes. The other group is inspired by RelationNet [4], which learns a metric function to measure similarity: e.g., Refs. [1,7,8] use an FPN-like structure to perform dense comparison with affinity alignment. Considering the incomplete representation of a single prototype, Li et al. [5] divide the masked region into pieces, the number of which is decided by the area of the masked region, and then conduct masked average pooling on each piece to generate a number of prototypes. Zhang et al. [6] utilize the covered and uncovered foreground regions, obtained through segmentation on the support images, to generate a pair of prototypes that retrieve the lost information. However, compared with the self-segmentation mechanism [6], our CPG does not need to segment the support images, and it obtains competitive performance at little cost. Compared with clustering methods [5,8], the ablation study shows that our method avoids over-fitting and yields stable performance in every setting.

Moreover, recent methods such as MLC [25] and SCNet [26] have started to make use of the knowledge hidden in the background. By exploiting pre-training knowledge to discover latent novel classes in the background, these methods bring large improvements to the few-shot segmentation task. However, we argue that such methods are difficult to apply in realistic scenarios, since a novel-class object is not only unlabelled but also unseen in the training set. Instead, we propose background guided learning to enhance the feature discriminability between the foreground and the background, which also improves the performance of the model.

### **3. Proposed Methods**

### *3.1. Problem Setting*

The aim of few-shot segmentation is to obtain a model that can learn to segment novel classes from only a few annotated support images. The few-shot segmentation model is trained on a dataset $D_{train}$ and evaluated on a dataset $D_{test}$. Let the class set of $D_{train}$ be $C_{train}$ and the class set of $D_{test}$ be $C_{test}$; there is no overlap between training and test classes, i.e., $C_{train} \cap C_{test} = \varnothing$.

Following the previous definition [22], we divide the images into two non-overlapping class sets $C_{train}$ and $C_{test}$. The training set $D_{train}$ is built on $C_{train}$ and the test set is built on $C_{test}$. We adopt the episodic training strategy, which has been demonstrated to be effective for few-shot recognition. Each episode is composed of a support set $S = \{(I_s^k, M_s^k)\}_{k=1}^{K}$ and a query set $Q = \{(I_q, M_q)\}$, forming a $K$-shot episode $\{S, I_q\}$, where $I_*$ and $M_*$ denote an image and its corresponding mask label, respectively. The training and test sets are then denoted by $D_{train} = \{S\}^{N_{train}}$ and $D_{test} = \{Q\}^{N_{test}}$, where $N_{train}$ and $N_{test}$ are the numbers of episodes for the training and test sets. Note that both the support mask $M_s$ and the query mask $M_q$ are provided in the training phase, but only the support mask $M_s$ is available in the test phase.
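As an illustration of this episodic protocol, the snippet below sketches how a $K$-shot episode could be sampled; the `dataset[c]` indexing scheme is a hypothetical placeholder, not an interface defined in the paper.

```python
import random

def sample_episode(dataset, classes, k_shot=1):
    """Draw one K-shot episode: K support (image, mask) pairs plus one query pair.

    dataset[c] is assumed to hold the (image, mask) pairs of class c.
    """
    c = random.choice(classes)                     # the episode's class
    pairs = random.sample(dataset[c], k_shot + 1)  # disjoint support and query samples
    support = pairs[:k_shot]                       # S = {(I_s^k, M_s^k)}_{k=1..K}
    query_image, query_mask = pairs[-1]            # Q = (I_q, M_q)
    # M_q supervises training only; at test time the model receives {S, I_q}.
    return support, (query_image, query_mask)
```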

### *3.2. Overview*

As shown in Figure 2, our Dual Complementary prototype Network (DCNet) is trained via the episodic scheme on support-query pairs. In episodic training, the support images and a query image are fed into a shared-weight encoder for feature extraction. Then, the query feature is compared with the prototypes of the current support class to generate a foreground segmentation mask via an FPN-like decoder. In addition, we propose an auxiliary supervision, named Background Guided Learning (BGL), in which our network learns a robust prototype representation for the class-agnostic background in an embedding space. Under this supervision, the query feature is compared with the prototypes of the query background to make a prediction on its own background. With this joint training strategy, our model can learn discriminative representations for the foreground and background.

Thus, the overall optimization target can be briefly formulated as:

$$
\mathcal{L}\_{\text{overall}} = \mathcal{L}\_{fg} + \gamma \mathcal{L}\_{bg} \tag{1}
$$

where $\mathcal{L}_{fg}$ and $\mathcal{L}_{bg}$ denote the foreground segmentation loss and background segmentation loss, respectively, and $\gamma$ is the balance weight, which is simply set to 1.
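In code, Equation (1) reduces to a one-line combination of the two branch losses (a minimal sketch; the branch losses are assumed to come from the two decoder predictions):

```python
import torch

def overall_loss(loss_fg: torch.Tensor, loss_bg: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Combine foreground and background losses as in Eq. (1); gamma is simply 1."""
    return loss_fg + gamma * loss_bg
```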

**Figure 2.** The framework of the proposed DCNet for 1-shot segmentation. First, the encoder generates feature maps $F_s$ and $F_q$ from the support and query images. Then, the support masks $M_s$ and the related features are fed into CPG to generate a pair of foreground prototypes $P_s$. Finally, $P_s$ is expanded and concatenated with the query feature $F_q$ as input to the decoder to predict the foreground of the query image. Meanwhile, in BGL, the query feature $F_q$ and its background mask $M_{bq}$ are fed into CPG to generate a pair of background prototypes $P_{bq}$; $P_{bq}$ is expanded and concatenated with the query feature $F_q$ as input to the decoder to predict the background of the query image.

In the following subsections, we first elaborate on our prototype generation algorithm. Then, background-guided learning in the 1-shot setting is introduced, followed by inference.

### *3.3. Complementary Prototypes Generation*

Inspired by SCL [6], we propose a simple and effective algorithm, named Complementary Prototypes Generation (CPG), as shown in Figure 3. The CPG algorithm generates a pair of complementary prototypes and aggregates the information hidden in the features based on cosine similarity. Specifically, given the support feature $F \in \mathbb{R}^{H \times W \times C}$ and its mask region $M \in \mathbb{R}^{H \times W}$, we extract a pair of prototypes that fully represent the information in the mask region.

As the first step, we extract the targeted feature $F' \in \mathbb{R}^{H \times W \times C}$ from $F$ by filtering through the mask $M$, as in Equation (2):

$$F' = F \odot M \tag{2}$$

where $\odot$ represents element-wise multiplication. Then, we initialize the prototype $P_0$ by masked average pooling, as in Equation (3):

$$P\_0 = \frac{\sum\_{i}^{H} \sum\_{j}^{W} F\_{i,j}'}{\sum\_{i}^{H} \sum\_{j}^{W} M\_{i,j}} \tag{3}$$

where $i, j$ are the coordinates of each pixel, and $H$ and $W$ denote the height and width of the feature $F'$, respectively. Since $M_{i,j} \in \{0, 1\}$, the sum of $M$ equals the area of the foreground region.

**Figure 3.** Illustration of the proposed Complementary Prototypes Generation. The similarity $S^t$ and prototype $P_0^t$ are obtained in the $t$-th iteration. The red arrow indicates the final result $S^T$ after $T$ iterations.

In the next step, we aggregate the foreground features into two complementary clusters. In each iteration $t$, we first compute the cosine distance matrix $S^t \in \mathbb{R}^{H \times W}$ between the prototype $P_0^{t-1}$ and the targeted feature $F'$ as follows:

$$S^t = \text{cosine}({F'}, P\_0^{t-1}) \tag{4}$$

Since we keep the ReLU layer in the encoder, the features are non-negative and the cosine distance is bounded in [0, 1]. To calculate the weight that each targeted feature contributes to $P_0^t$, we normalize the matrix $S^t$ as:

$$S\_{i,j}^t = \frac{S\_{i,j}^t}{\sum\_{i}^{H} \sum\_{j}^{W} S\_{i,j}^t} \tag{5}$$

The prototype is updated accordingly in each iteration, $P_0^t = \sum_i^H \sum_j^W S_{i,j}^t * F'_{i,j}$ (cf. Algorithm 1). Then, after the last iteration, based on the matrix $S^T$, we aggregate the features into two complementary prototypes:

$$P\_1 = \sum\_{i}^{H} \sum\_{j}^{W} S\_{i,j}^{T} \ast F\_{i,j}' \tag{6}$$

$$P\_2 = \sum\_{i}^{H} \sum\_{j}^{W} (1 - S\_{i,j}^{T}) \ast F\_{i,j}' \tag{7}$$

It is worth noting that these prototypes are not separated as in prior methods; the CPG algorithm uses a weight map to generate a pair of complementary prototypes. In this way, we retain the correlation between the prototypes. The whole procedure is delineated in Algorithm 1.

### **Algorithm 1** Complementary Prototypes Generation (CPG).

**Input:** targeted feature $F'$, corresponding mask $M$, number of iterations $T$.

**init** prototype $P_0^0$ by masked average pooling: $P_0^0 = \frac{\sum_i^H \sum_j^W F'_{i,j}}{\sum_i^H \sum_j^W M_{i,j}}$

**for** iteration $t$ in $\{1, \dots, T\}$ **do**

&emsp;Compute the association matrix between $F'$ and $P_0^{t-1}$: $S^t = \text{cosine}(F', P_0^{t-1})$

&emsp;Normalize the association: $S_{i,j}^t = S_{i,j}^t / (\sum_i^H \sum_j^W S_{i,j}^t)$

&emsp;Update the prototype: $P_0^t = \sum_i^H \sum_j^W S_{i,j}^t * F'_{i,j}$

**end for**

**generate** complementary prototypes from $S^T$: $P_1 = \sum_i^H \sum_j^W S_{i,j}^T * F'_{i,j}$, $P_2 = \sum_i^H \sum_j^W (1 - S_{i,j}^T) * F'_{i,j}$

**return** final prototypes $P_1$, $P_2$
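For clarity, here is a compact PyTorch rendering of Algorithm 1. It is a sketch under our own assumptions, a (C, H, W) feature layout, an (H, W) binary mask, and a small default iteration count, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cpg(feat: torch.Tensor, mask: torch.Tensor, T: int = 3, eps: float = 1e-6):
    """Complementary Prototypes Generation (Algorithm 1).

    feat: feature map F, shape (C, H, W)
    mask: binary mask M of the target region, shape (H, W)
    T:    number of iterations (T >= 1; the default is an assumption)
    Returns the complementary prototypes P1, P2, each of shape (C,).
    """
    f_prime = feat * mask.unsqueeze(0)                   # Eq. (2): F' = F ⊙ M
    p0 = f_prime.sum(dim=(1, 2)) / (mask.sum() + eps)    # Eq. (3): masked average pooling
    for _ in range(T):
        # Eq. (4): cosine similarity between F' and the current prototype
        proto_map = p0.view(-1, 1, 1).expand_as(f_prime)
        s = F.cosine_similarity(f_prime, proto_map, dim=0)
        s = s.clamp(min=0) * mask                        # safeguard (assumption): keep weights in [0,1] and inside M
        s = s / (s.sum() + eps)                          # Eq. (5): normalize into a weight map
        p0 = (f_prime * s.unsqueeze(0)).sum(dim=(1, 2))  # update P_0^t
    # Eqs. (6)-(7): aggregate F' into two complementary prototypes via S^T
    p1 = (f_prime * s.unsqueeze(0)).sum(dim=(1, 2))
    p2 = (f_prime * (1.0 - s).unsqueeze(0)).sum(dim=(1, 2))
    return p1, p2
```

Because both prototypes are weighted sums over the same normalized map $S^T$, their correlation is preserved rather than severed by a hard regional split.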

### *3.4. Background Guided Learning*

In previous works [1,5,6], background information has not been adequately exploited for few-shot learning. In particular, these methods only use foreground prototypes to make the final prediction on the query image during training. As a result, the representation of the class-agnostic background lacks discriminability. To solve this problem, we propose Background Guided Learning (BGL) via a joint training strategy.

As shown in Figure 2, BGL segments the background of the query image based on the query background mask $M_{bq}$. As the first step, the query feature $F_q$ and its background mask $M_{bq}$ are fed into the CPG module to generate a pair of complementary prototypes $P_{bq} = \{P_1, P_2\}$, following Algorithm 1. Next, we concatenate the complementary prototypes $P_{bq}$ with every spatial location of the query feature map $F_q$, as in Equation (8):

$$F\_m = \varepsilon(P\_1) \oplus \varepsilon(P\_2) \oplus F\_q \tag{8}$$

where $\varepsilon$ denotes the expansion operation, $\oplus$ denotes the concatenation operation, $P_1$ and $P_2$ are the complementary prototypes in $P_{bq}$, and $F_m$ denotes the concatenated feature. The concatenated feature $F_m$ is then fed into the decoder to generate the final prediction, as shown in Equation (9):

$$
\hat{M} = \mathcal{D}(F\_m) \tag{9}
$$

where $\hat{M}$ is the prediction of the model and $\mathcal{D}$ is the decoder. The loss $\mathcal{L}_{bg}$ is computed as:

$$\mathcal{L}\_{bg} = \text{CE}(\hat{M}\_{bq}, M\_{bq}) \tag{10}$$

where $\hat{M}_{bq}$ denotes the background prediction on the query image and CE denotes the cross-entropy loss.
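The BGL branch of Equations (8)-(10) can be sketched as follows, reusing the `cpg` routine from the previous sketch; the decoder is abstracted as any module mapping the concatenated feature to a two-class logit map, and all shapes and defaults are our assumptions.

```python
import torch
import torch.nn.functional as F

def bgl_loss(query_feat: torch.Tensor, query_bg_mask: torch.Tensor, decoder, T: int = 3) -> torch.Tensor:
    """Background Guided Learning: predict the query background from its own
    background prototypes and score it with cross-entropy (Eqs. (8)-(10)).

    query_feat:    query feature F_q, shape (C, H, W)
    query_bg_mask: binary background mask M_bq, shape (H, W)
    decoder:       assumed module mapping a (1, 3C, H, W) input to (1, 2, H, W) logits
    """
    _, h, w = query_feat.shape
    p1, p2 = cpg(query_feat, query_bg_mask, T=T)           # background prototypes P_bq
    # Eq. (8): expand each prototype over H x W and concatenate with F_q
    f_m = torch.cat([p1.view(-1, 1, 1).expand(-1, h, w),
                     p2.view(-1, 1, 1).expand(-1, h, w),
                     query_feat], dim=0).unsqueeze(0)
    logits = decoder(f_m)                                   # Eq. (9): M_hat = D(F_m)
    target = query_bg_mask.long().unsqueeze(0)              # background pixels labelled 1
    return F.cross_entropy(logits, target)                  # Eq. (10): L_bg = CE(M_hat_bq, M_bq)
```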

Intuitively, if the model can predict a good segmentation mask for the foreground using a prototype extracted from the foreground mask region, then a prototype learned from the background mask region should likewise be able to segment the background well. Thus, BGL encourages the model to better distinguish the background from the foreground.
