**4. Experiments**

### *4.1. Datasets and Evaluation Metrics*

### 4.1.1. Datasets

We evaluate our algorithm on two public few-shot segmentation datasets: PASCAL-5*<sup>i</sup>* [22] and COCO-20*<sup>i</sup>* [27]. PASCAL-5*<sup>i</sup>* is built from the PASCAL VOC 2012 and SBD datasets, and COCO-20*<sup>i</sup>* is built from the MS-COCO dataset. In PASCAL-5*<sup>i</sup>*, the 20 object classes of PASCAL VOC are split into 4 groups of 5 categories each. In COCO-20*<sup>i</sup>*, following the same scheme, the 80 classes of MS-COCO are divided into 4 groups of 20 categories each. On both datasets, we evaluate our approach following PFENet: we adopt the same category division and randomly sample 20,000 support-query pairs for evaluation.

For both datasets, we adopt 4-fold cross-validation, i.e., the model is trained on three folds (base classes) and evaluated on the remaining fold (novel classes). We report the results on each test fold, as well as the average performance over all four test folds.
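
As an illustration, the sketch below shows one way to construct these class splits. The helper names are ours, not from the paper's code, and the contiguous grouping follows the common PASCAL-5*<sup>i</sup>* convention (some COCO-20*<sup>i</sup>* protocols instead interleave class IDs across folds).

```python
# Illustrative sketch of the 4-fold class splits (not the paper's code).
def build_folds(num_classes: int, num_folds: int = 4):
    """Split class IDs 0..num_classes-1 into contiguous folds."""
    per_fold = num_classes // num_folds
    return [list(range(f * per_fold, (f + 1) * per_fold))
            for f in range(num_folds)]

def split_for_fold(folds, test_fold: int):
    """Novel classes = the held-out fold; base classes = the rest."""
    novel = folds[test_fold]
    base = [c for i, fold in enumerate(folds) if i != test_fold for c in fold]
    return base, novel

pascal_folds = build_folds(20)  # PASCAL-5i: 4 folds of 5 classes
coco_folds = build_folds(80)    # COCO-20i: 4 folds of 20 classes
base_classes, novel_classes = split_for_fold(pascal_folds, test_fold=0)
```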

### 4.1.2. Evaluation Metrics

Following previous work [7,27], we use the widely adopted class mean intersection over union (mIoU) as the major evaluation metric for the ablation study, since class mIoU is more reasonable than the foreground-background IoU (FB-IoU), as stated in [7]. For each class, the IoU is calculated as IoU = *TP*/(*TP* + *FP* + *FN*), where TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives. The mIoU is then the mean IoU over all classes in the test set. For FB-IoU, only the foreground and background are considered (*C* = 2). We take the average of the results over all folds as the final mIoU/FB-IoU.
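
To make the metrics concrete, the following is a minimal NumPy sketch of class mIoU and FB-IoU over a set of predicted label maps; the function names are ours, not from any evaluation toolkit.

```python
import numpy as np

def per_class_iou(preds, gts, num_classes):
    """IoU_c = TP / (TP + FP + FN), accumulated over the whole test set.
    preds, gts: iterables of integer label maps of shape (H, W)."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for p, g in zip(preds, gts):
        for c in range(num_classes):
            pc, gc = (p == c), (g == c)
            inter[c] += np.logical_and(pc, gc).sum()  # TP
            union[c] += np.logical_or(pc, gc).sum()   # TP + FP + FN
    return inter / np.maximum(union, 1)

def mean_iou(preds, gts, num_classes):
    """Class mIoU: mean of the per-class IoUs."""
    return per_class_iou(preds, gts, num_classes).mean()

def fb_iou(preds, gts):
    """FB-IoU: only foreground and background are considered (C = 2)."""
    return mean_iou(preds, gts, num_classes=2)
```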

### *4.2. Implementation Details*

Our approach is based on PFENet [1] with ResNet-50 as the backbone for a fair comparison with other methods. Following previous work [1,5,6], the backbone parameters are initialized with ImageNet pre-trained weights and kept fixed during training; all other layers are initialized with the default PyTorch settings. For PASCAL-5*<sup>i</sup>*, the network is trained for 100 epochs with an initial learning rate of 2.5 × 10<sup>−3</sup>, a weight decay of 1 × 10<sup>−4</sup>, a momentum of 0.9, and a batch size of 4. For COCO-20*<sup>i</sup>*, the network is trained for 50 epochs with a learning rate of 5 × 10<sup>−3</sup> and a batch size of 8. We use data augmentation during training: input images are transformed with random scaling, random horizontal flipping, and random rotation within [−10°, 10°], and then cropped to 473 × 473 (PASCAL-5*<sup>i</sup>* and COCO-20*<sup>i</sup>*) or 641 × 641 (COCO-20*<sup>i</sup>* only, the † variant in Table 1) for fair comparison. We implemented our model on 4 RTX 2080 Ti GPUs.
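
A sketch of the optimizer setup implied by these hyper-parameters is shown below. The network class is a placeholder stand-in for PFENet, and the COCO-20*<sup>i</sup>* weight decay is assumed to match the PASCAL-5*<sup>i</sup>* value, which the text does not state.

```python
import torch
import torch.nn as nn
import torchvision

class FewShotSegNet(nn.Module):
    """Placeholder stand-in for a PFENet-style network."""
    def __init__(self):
        super().__init__()
        # ImageNet pre-trained backbone, kept fixed during training.
        self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.head = nn.Conv2d(2048, 2, kernel_size=1)  # toy decoder head

cfg = {  # hyper-parameters transcribed from the text
    "pascal-5i": dict(lr=2.5e-3, weight_decay=1e-4, momentum=0.9,
                      epochs=100, batch_size=4, crop=473),
    "coco-20i":  dict(lr=5e-3, weight_decay=1e-4,  # weight decay assumed
                      momentum=0.9, epochs=50, batch_size=8, crop=473),
}["pascal-5i"]

model = FewShotSegNet()
for p in model.backbone.parameters():
    p.requires_grad = False  # freeze the backbone

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=cfg["lr"], momentum=cfg["momentum"], weight_decay=cfg["weight_decay"])
```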

### *4.3. Comparisons with State-of-the-Art*

### 4.3.1. COCO-20*<sup>i</sup>* Result

COCO-20*<sup>i</sup>* is a very challenging dataset that contains large numbers of objects in realistic scene images. We compare our approach against others on this dataset in Table 1. Our approach achieves state-of-the-art performance in both the 1-shot and 5-shot settings, with mIoU gains of 0.3% and 0.5%, respectively. Furthermore, compared to our baseline (PFENet with ResNet-101), our approach (with ResNet-101) obtains 9.1% and 12.6% mIoU increases in the 1-shot and 5-shot settings. In Table 2, our method obtains a top-performing 1-shot result and a competitive 5-shot result with respect to FB-IoU. These results demonstrate that the proposed method can handle more complex cases, since MS-COCO is a much more challenging dataset with diverse samples and categories.


**Table 1.** Comparison with other state-of-the-art methods on COCO-20*<sup>i</sup>* for 1-shot and 5-shot settings. † denotes models trained with 641 × 641 samples. All methods are tested at the original image size. **Bold** denotes the best performance and red denotes the second best performance.

**Table 2.** Comparison of FB-IoU on COCO-20*<sup>i</sup>*.

### 4.3.2. PASCAL-5*<sup>i</sup>* Result

In Table 3, we compare our method with other state-of-the-art methods on PASCAL-5*<sup>i</sup>*. Our method achieves performance on par with the state of the art in both the 1-shot and 5-shot settings. Additionally, our method significantly improves over PFENet on 1-shot and 5-shot segmentation, with mIoU increases of 1.6% and 4%, respectively. In Table 4, our method obtains a competitive 1-shot result and a top-performing 5-shot result with respect to FB-IoU. In Figure 4, we report qualitative results generated by our approach with PFENet [1] as the baseline. Our method makes correct predictions, and each part of our method independently improves the performance of the model.

**Table 3.** Comparison with state-of-the-art methods on PASCAL-5*<sup>i</sup>* for 1-shot and 5-shot settings. For a fair comparison, all methods are evaluated with a ResNet-50 backbone and tested on labels at their original sizes. **Bold** denotes the best performance and red denotes the second best performance.




**Table 4.** Comparison of FB-IoU on PASCAL-5*<sup>i</sup>* for 1-shot and 5-shot settings. We used ResNet-50 as the backbone.


**Figure 4.** Qualitative examples of 5-shot segmentation on PASCAL-5*<sup>i</sup>*. (**a**) The ground truth of the query images. (**b**) Results of the baseline (PFENet). (**c**) Results of BGL. (**d**) Results of CPG. (**e**) Results of the combination of BGL and CPG. Best viewed in color and zoomed in.

### *4.4. Ablation Study*

To verify the effectiveness of our proposed methods, we conduct extensive ablation studies with a ResNet-50 backbone on PASCAL-5*<sup>i</sup>*.

### 4.4.1. The Effectiveness of CPG

To verify the effectiveness of CPG, we conduct several experiments on prototype generation and compare CPG with other prototype generation algorithms. Since CPG is a kind of soft clustering algorithm, we first compare it with the Adaptive K-means algorithm (AK) provided by ASGNet [5] and with a traditional method, the Expectation-Maximization (EM) algorithm, as shown in Table 5. Compared to the baseline, both AK and EM degrade segmentation performance in the 1-shot setting, while our CPG offers a 0.6% improvement over the baseline. Compared to SCL [6], which needs to segment both support images and query images, our approach uses less computation and inference time (Table 6) while achieving competitive results in both the 1-shot and 5-shot settings. These results indicate the superiority of CPG for the few-shot segmentation task.
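
For reference, the EM baseline compared above can be sketched as soft clustering of the masked support features into *k* prototypes. This is our illustrative re-implementation under cosine-similarity assignments, not the paper's code.

```python
import torch
import torch.nn.functional as F

def em_prototypes(feat, mask, k=3, iters=10):
    """EM-style soft clustering of foreground support features.
    feat: (C, H, W) support feature map; mask: (H, W) binary mask.
    Assumes at least k foreground pixels."""
    c = feat.shape[0]
    fg = feat.reshape(c, -1)[:, mask.reshape(-1) > 0].t()  # (N, C) vectors
    fg = F.normalize(fg, dim=1)
    mu = fg[torch.randperm(fg.shape[0])[:k]]               # init from samples
    for _ in range(iters):
        resp = torch.softmax(fg @ mu.t(), dim=1)           # E-step: soft assignment
        mu = F.normalize(resp.t() @ fg, dim=1)             # M-step: weighted means
    return mu                                              # (k, C) prototypes
```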

**Table 5.** Ablation study on prototype generation in a 1-shot setting on PASCAL-5*<sup>i</sup>*.


**Table 6.** Ablation study on the effectiveness of different components, evaluated on PASCAL-5*<sup>i</sup>*. We report the mIoU and frames (number of episodes) per second (FPS) for 1-shot and 5-shot. CPG: Complementary Prototypes Generation. BGL: Background Guided Learning.


### 4.4.2. The Effectiveness of BGL

To demonstrate the effectiveness of our proposed BGL, we conduct both qualitative and quantitative analyses of BGL. We assume BGL affects feature representation in two ways: first, it enhances the feature representation of the novel classes; second, it promotes discrimination between the class-specific (foreground) features and the class-agnostic (background) features. Following [28], we measure the inter-class variance, the intra-class variance, and the discriminative function *φ*, defined as the inter-class variance divided by the intra-class variance.
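
A minimal computation of these three quantities, under the usual definitions (variance of the class means around the global mean for inter-class, mean squared distance to the class mean for intra-class), might look as follows; it is a sketch of the analysis protocol, not the code of [28].

```python
import torch

def discriminability(features, labels):
    """features: (N, C) embeddings; labels: (N,) integer class IDs.
    Returns (intra-class variance, inter-class variance, phi)."""
    classes = labels.unique()
    means = torch.stack([features[labels == c].mean(0) for c in classes])
    # Intra-class variance: average squared distance to the class mean.
    intra = torch.stack([
        ((features[labels == c] - means[i]) ** 2).sum(1).mean()
        for i, c in enumerate(classes)]).mean()
    # Inter-class variance: spread of the class means around the global mean.
    inter = ((means - means.mean(0)) ** 2).sum(1).mean()
    phi = inter / intra  # discriminative function
    return intra.item(), inter.item(), phi.item()
```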

As shown in Figure 5a,b,d, BGL not only enlarges the inter-class variance for novel classes but also increases their intra-class variance; since both terms grow, the ratio *φ* remains roughly unchanged. In other words, BGL does not improve the representation discriminability for novel classes. However, as shown in Figure 5c,e, BGL enlarges the inter-class distance and increases the discriminative function *φ* between the foreground and the background. Therefore, the effectiveness of BGL lies in promoting discrimination between the foreground and the background.

**Figure 5.** Discriminability analysis. (**a**) Intra-class variance on novel classes. (**b**) Inter-class variance on novel classes. (**c**) Inter-class variance on the foreground/background. (**d**) Discriminative function *φ* on the novel classes. (**e**) Discriminative function *φ* on the foreground/background.

### 4.4.3. The Effectiveness of BGL and CPG

To demonstrate the effectiveness of combining CPG and BGL, ablation studies are conducted on PASCAL-5*<sup>i</sup>*, as shown in Table 6. Compared with the baseline, using CPG or BGL alone improves performance by a large margin: 1.7% and 2.6% mIoU in the 5-shot setting, respectively. In addition, using CPG alone matches the current state-of-the-art performance achieved by SCL [6], and using BGL alone surpasses the state-of-the-art performance by a 2.2% mIoU score. Combining CPG and BGL achieves higher performance than either component alone, with a 4% improvement in total. In Figure 4, we show that using CPG or BGL alone may produce wrong segmentations on the background, but their combination improves the results. In Figure 6, we show some representative heatmap examples, which further illustrate how the combination of CPG and BGL helps the model segment precisely and accurately.

**Figure 6.** Heatmap examples on PASCAL-5*<sup>i</sup>* in a 5-shot setting. (**a**) Result of the baseline. (**b**) Result of CPG. (**c**) Result of BGL. (**d**) Result of the combination of BGL and CPG.
