In this section, we first define the few-shot semantic segmentation task in Section 3.1 and explain the motivation behind our proposed method, together with a framework overview, in Section 3.2. We then present the implementation details of the adaptive self-distillation prototype (ASD) module in Section 3.3 and of the self-support background prototype (SBP) module in Section 3.4. Finally, the optimization and inference of the proposed model, PCNet, are described in Section 3.5.
3.1. Problem Setting
The objective of few-shot segmentation is to accurately segment objects belonging to unseen classes using only a limited number of labeled samples as guidance. The dataset is divided into a training set $\mathcal{D}_{train}$ and a test set $\mathcal{D}_{test}$. In the standard approach for few-shot learning, the dataset is split by class: we define the classes in $\mathcal{D}_{train}$ as the base classes $\mathcal{C}_{base}$ and the classes in $\mathcal{D}_{test}$ as the novel classes $\mathcal{C}_{novel}$. Note that the two sets are disjoint, that is, $\mathcal{C}_{base} \cap \mathcal{C}_{novel} = \varnothing$. We adopt the episodic training mechanism, an effective approach for few-shot learning. In this setup, each episode is composed of a support set $S$ and a query set $Q$, both belonging to the same category. The support set contains $K$ samples, denoted by $S = \{(I_s^i, M_s^i)\}_{i=1}^{K}$, where each $(I_s^i, M_s^i)$ is an image-label pair consisting of a support image $I_s^i$ and its corresponding ground-truth mask $M_s^i$; this is the $K$-shot setting. The query set has one sample $Q = \{(I_q, M_q)\}$, where $I_q$ and $M_q$ denote the query image and its corresponding ground-truth mask, respectively. The input to the model is the pair $(I_q, S)$. During training, the model is guided by the support set to make predictions on the query image and is continuously optimized. At test time, the ground truth of the query image is not visible to the model and is used only to evaluate the performance of the method. After training is completed, the model can directly predict the segmentation of the novel classes in $\mathcal{D}_{test}$.
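As a concrete illustration of episodic sampling, the sketch below builds one K-shot episode from a class-indexed pool of image-mask pairs. The data layout and the `build_episode` helper are illustrative assumptions for this sketch, not our actual data pipeline.

```python
# Hypothetical sketch of episodic sampling for K-shot segmentation.
import random

def build_episode(images_by_class, klass, k, rng):
    """Sample one episode: a K-shot support set S and a single query pair Q."""
    pairs = rng.sample(images_by_class[klass], k + 1)  # k+1 distinct pairs
    support = pairs[:k]   # S = {(I_s^i, M_s^i)}, i = 1..K
    query = pairs[k]      # Q = (I_q, M_q); M_q is hidden from the model at test time
    return support, query

# Toy data: each "image"/"mask" is just a tag string for illustration.
images_by_class = {
    "boat": [(f"img{i}", f"mask{i}") for i in range(10)],
}
rng = random.Random(0)
support, query = build_episode(images_by_class, "boat", k=5, rng=rng)
```

The support and query pairs are drawn from the same category without replacement, so the query image never appears in its own support set.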
3.2. Motivation and Framework Overview
At present, mainstream FSS methods rely on a single support prototype and perform dense matching between this prototype and each pixel of the query image. However, the samples provided by the support and query sets may differ greatly in appearance due to realistic factors such as shooting angle and lighting, so the model must face the problem of the intra-class gap: even though the objects in the support and query images belong to the same class, they may look very different, and the features learned by the model can hardly generate representative prototypes. This problem has a huge impact on the model's ability to generalize. In addition, Qi Fan et al. [48] calculated the cosine similarity of cross-object and intra-object pixels, as shown in Table 2. Their analysis found that, although the background pixels of the support and query images seem chaotic, there are certain similarities between them. This may be related to the fact that objects of the same category often appear in similar scenes; for example, boats and rivers, or airplanes and skies, often appear together. In Figure 2, Chen et al. [49] showed that some novel classes appear in the training images of each fold. In other words, novel classes and base classes co-appear in the training images, so the model develops a certain bias toward the base classes. Through a study of previous models, we found that earlier methods have limitations, such as too many parameters, difficulty in adapting to object shapes seen from different angles, and a lack of ability to capture image position information, as summarized in Table 1. Targeting the object features in the datasets, we draw on the advantages of previous methods, attempt to break through their limitations, and propose PCNet, as shown in Figure 3.
Based on the above problems and observations, we propose an adaptive self-distillation prototype. The knowledge distillation method introduced by Hinton et al. [50] inspired us to cast the fusion of the information contained in the support prototype and the query prototype as a knowledge transfer problem. Therefore, we use self-distillation to fuse the support prototype and the query prototype into a teacher prototype, thereby realizing feature alignment. The adaptive self-distillation prototype can not only express the characteristics of the support prototype but also stay as close as possible to the unique characteristics of the target class in the query image. A base learner is introduced in the module to weaken the bias of the model toward the base classes, which greatly improves the adaptive ability of the model. Motivated by the observation that there is a certain similarity between the background pixels of the support and query images, we propose a self-support background prototype. Under the guidance of the adaptive self-distillation prototype, the model first makes a prediction on the query feature map to obtain a rough background mask of the query image. We then apply masked average pooling of this background mask over the query feature map to generate a self-support background prototype. Finally, we generate the final prediction map by combining the adaptive self-distillation prototype and the self-support background prototype to densely match the feature map of the query image.
3.3. Adaptive Self-Distillation Prototype Module
Current mainstream prototype learning methods, including PFENet, outperform prior work by a large margin on the PASCAL-$5^i$ and COCO-$20^i$ datasets. The prototypes generated by these methods can effectively transfer the target features to the query branch and have a certain guiding effect. However, in few-shot segmentation there are often notable differences between the support target and the query target, and it is difficult to extend the feature prototype provided by the support image to the complex and diverse query image. Therefore, during training we generate an additional prototype for the query image, called the query prototype, which carries the unique characteristics of the target class in the query image. In this way, the two prototypes can adaptively express more complete feature information. To address the challenge of aligning the features of the support prototype and the query prototype, we propose a self-distillation method. This approach facilitates the transfer of knowledge between the two prototypes, resulting in an adaptive self-distillation prototype; by leveraging this knowledge transfer, the model can effectively align the features of the support and query prototypes. After generating the prediction map under the guidance of the adaptive self-distillation prototype, we further introduce a base learner to suppress the base classes in the prediction map, which was proved effective in BAM [46].
As shown in Figure 4, we combine the query prototype, which carries unique features, with the support prototype, which carries common features. Together they form a teacher prototype, and a loss is calculated between the teacher prototype and the support prototype, which allows continuous adjustment and optimization of the teacher prototype. Finally, we apply self-distillation to generate the adaptive self-distillation prototype. Let $P_s$ denote the support prototype and $P_q$ denote the query prototype; they are obtained as follows:

$P_s = \mathcal{F}_{pool}(F_s, M_s), \quad P_q = \mathcal{F}_{pool}(F_q, M_q),$

where $F_s$ and $F_q$ represent the support and query feature maps obtained from the support and query images through the shared CNN, respectively, and $M_s$ and $M_q$ denote the support mask and query mask, respectively. $\mathcal{F}_{pool}$ represents the masked GAP operation, in which the mask $M$ and the feature map $F$ have the same height and width. $\hat{P}_s$ and $\hat{P}_q$ are the outputs of the softmax operation on $P_s$ and $P_q$. On the basis of knowledge distillation, we use the Kullback-Leibler divergence as the loss to adjust and optimize the support prototype. $\mathcal{L}_{kd}$ represents the loss between the support and teacher prototypes in the knowledge distillation process:

$\mathcal{L}_{kd} = \mathcal{F}_{KL}\big(P_t \,\|\, \hat{P}_s\big),$

where $P_t$ denotes the teacher prototype, equal to the fusion of the two softmaxed prototypes, $P_t = \frac{1}{2}(\hat{P}_s + \hat{P}_q)$, and $\mathcal{F}_{KL}$ represents the KL divergence function.
The support and query prototypes can each be seen as consisting of two parts, common features and unique features: $P_s = P_{com} + P_{uni}^s$ and $P_q = P_{com} + P_{uni}^q$. In the knowledge distillation process, the goal is to improve the prototype's expression of the common features while preserving the unique characteristics of the specific query target, which enhances the prototype's consistency and adaptability. Therefore, this method drives the two prototypes toward the common features, that is, $P_s \rightarrow P_{com} \leftarrow P_q$. The effectiveness of this module is demonstrated in our experiments.
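The prototype extraction and self-distillation loss described above can be sketched numerically as follows. The masked GAP, softmax, and KL steps follow the textual definitions; treating the teacher prototype as the mean of the two softmaxed prototypes, and all toy feature values, are assumptions made for this illustration.

```python
# Toy sketch of the adaptive self-distillation step (values are illustrative).
import math

def masked_gap(feat, mask):
    """Masked GAP: average the channel vectors over pixels where mask == 1."""
    c = len(feat[0][0])
    total, count = [0.0] * c, 0
    for row_f, row_m in zip(feat, mask):
        for vec, m in zip(row_f, row_m):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    return [t / count for t in total]

def softmax(v):
    e = [math.exp(x - max(v)) for x in v]
    s = sum(e)
    return [x / s for x in e]

def kl_div(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 2x2 feature maps with 3 channels and their binary masks.
f_s = [[[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]], [[0.2, 0.2, 0.6], [0.0, 0.0, 1.0]]]
f_q = [[[0.9, 0.1, 0.0], [0.4, 0.6, 0.0]], [[0.3, 0.1, 0.6], [0.1, 0.0, 0.9]]]
m_s = [[1, 1], [0, 0]]
m_q = [[1, 0], [0, 1]]

p_s = softmax(masked_gap(f_s, m_s))             # softmaxed support prototype
p_q = softmax(masked_gap(f_q, m_q))             # softmaxed query prototype
p_t = [(a + b) / 2 for a, b in zip(p_s, p_q)]   # teacher prototype (assumed mean fusion)
loss_kd = kl_div(p_t, p_s)                      # KL loss supervising the support prototype
```

Since both inputs to the mean are probability vectors, the teacher prototype is itself a valid distribution, and the KL term is always non-negative.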
To extend the method to the K-shot segmentation task, we make some modifications to the teacher prototype generation process. Since the $K$ support images are different, the generated support prototypes will also differ, and the teacher prototype needs to supervise each support prototype. A teacher prototype is therefore generated for each support prototype, and $\mathcal{L}_{kd}$ is computed as the average loss between each support prototype and its teacher prototype:

$\mathcal{L}_{kd} = \frac{1}{K} \sum_{i=1}^{K} \mathcal{F}_{KL}\big(P_t^i \,\|\, \hat{P}_s^i\big),$

where $P_s^i$ denotes the prototype of the $i$-th support sample, and $P_t^i$ denotes the teacher prototype generated when the $i$-th support image is the input. Then, we activate the target region in the query feature map $F_q$ under the guidance of the teacher prototype $P_t$ and generate a prediction result $\hat{M}_q$ through the decoder:

$\hat{M}_q = \mathcal{F}_{dec}\big(\mathcal{F}_{guide}(F_q, P_t)\big),$

where $\mathcal{F}_{guide}$ represents the process of prototype guidance for query feature map segmentation, and $\mathcal{F}_{dec}$ represents the decoder network for meta-learning. We also need to compute the loss between $\hat{M}_q$ and $M_q$ to update the parameters:

$\mathcal{L}_{meta} = \frac{1}{e} \sum_{j=1}^{e} \mathrm{BCE}\big(\hat{M}_q^j, M_q^j\big),$

where $M_q$ denotes the foreground mask of the query image, and $e$ denotes the number of training samples in each batch.
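The batch-averaged meta-loss can be sketched as below, assuming binary cross-entropy between predicted and ground-truth query masks averaged over the $e$ samples of a batch; the mask values and shapes are toy placeholders.

```python
# Sketch of a batch meta-loss: mean BCE over e query masks (toy values).
import math

def bce(pred, gt, eps=1e-7):
    """Mean binary cross-entropy over all pixels of one predicted mask."""
    total, n = 0.0, 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total += -(g * math.log(p) + (1 - g) * math.log(1 - p))
            n += 1
    return total / n

def meta_loss(preds, gts):
    """Average the per-sample BCE over the e samples in the batch."""
    e = len(preds)
    return sum(bce(p, g) for p, g in zip(preds, gts)) / e

# A batch of e = 2 toy 2x2 predictions and ground-truth masks.
preds = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
gts   = [[[1, 0], [0, 1]],         [[1, 0], [0, 1]]]
loss = meta_loss(preds, gts)
```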
In the training process, the model will be wrongly activated by the seen classes, that is, the base class bias. We introduce a base learner to suppress the base classes; the learner uses PSPNet [24] to segment base-class targets. The prediction mask $\hat{M}_b$ of the base classes produced by the PSPNet [24] network is expressed as:

$\hat{M}_b = \mathcal{F}_{base}\big(F_q^{b4}\big) \in \mathbb{R}^{(1+N_b) \times H \times W},$

where $F_q^{b4}$ is the Block-4 query feature map generated by the encoder, $\mathcal{F}_{base}$ represents the decoder of the standard semantic segmentation network, and $N_b$ denotes the number of base categories. The cross-entropy loss $\mathcal{L}_{base}$ is used to update the optimization parameters and is expressed as:

$\mathcal{L}_{base} = \frac{1}{e} \sum_{j=1}^{e} \mathrm{CE}\big(\hat{M}_b^j, M_b^j\big),$

where $M_b$ is the ground truth of all base classes in the query image, and $e$ represents the number of training samples in each batch.
We first combine all the base-class prediction maps to obtain the prediction of the base-class region relative to the few-shot task, that is, the irrelevant classes that are likely to be mis-activated. Then, we aggregate the predicted base-class region with the predicted background mask to obtain the full background prediction map:

$\hat{M}_{bg} = \mathcal{F}_{agg}\big(m_{bg}, m_{ic}\big),$

where $m_{bg}$ represents the background mask of the query image generated under the guidance of the adaptive self-distillation prototype, $m_{ic}$ represents the irrelevant-class set mask output by the base learner, and $\mathcal{F}_{agg}$ represents the aggregation function, a 1×1 convolution operation with specific initial parameters. $\hat{M}_{bg}$ represents the prediction of the background mask, to be further augmented with respect to the target class of the query image; it is also the output of the whole adaptive self-distillation module. We update the optimization parameters by calculating the loss $\mathcal{L}_{bg}$ between the predicted background map and the ground-truth background mask:

$\mathcal{L}_{bg} = \frac{1}{e} \sum_{j=1}^{e} \mathrm{BCE}\big(\hat{M}_{bg}^j, M_{bg}^j\big),$

where $M_{bg}$ is the ground-truth background mask, and $e$ represents the number of training samples in each batch.
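Because the aggregation operates on two single-channel masks, a 1×1 convolution reduces, per pixel, to a weighted sum plus bias. The sketch below illustrates this equivalence; the initial weights are an assumption chosen so that the output starts as the union of the two masks, not the parameters actually learned by the model.

```python
# Per-pixel view of a 1x1 convolution over two single-channel masks (toy sketch).

def aggregate(m_bg, m_ic, w_bg=1.0, w_ic=1.0, bias=0.0):
    """1x1-conv equivalent: w_bg*m_bg + w_ic*m_ic + bias per pixel, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, w_bg * a + w_ic * b + bias))
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(m_bg, m_ic)]

m_bg = [[1, 0], [0, 1]]   # background predicted under the ASD prototype
m_ic = [[0, 1], [0, 0]]   # base-class (irrelevant) region from the base learner
full_bg = aggregate(m_bg, m_ic)
```

With these initial weights the aggregation behaves as a mask union; training then adjusts the convolution parameters to weight the two sources.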
3.4. Self-Support Background Prototype Module
Through the adaptive self-distillation module, we make the prototype more suitable for the target class in the query image, and the base learner branch suppresses base-class targets. Together, these yield a relatively accurate background prediction mask with respect to the target class. To further enhance fine-grained accuracy, we add the self-support background prototype module, as shown in Figure 5.
Foreground pixels possess similar semantic information and appearance, making the prototype learning method applicable to them. Background pixels, however, are disorganized, since the background may contain numerous categories: their semantic similarity is limited to a local scope, and finding a shared semantic similarity for the entire background is challenging. This may require generating multiple targeted background prototypes and performing pixel-group matching for each query pixel. For the problem of generating prototypes from the background, clustering algorithms have been proposed and evaluated many times in previous work; although they solve the problem to some extent, clustering has the disadvantage of instability.
We apply the masked GAP operation to the query feature map with the background prediction map generated by the adaptive self-distillation module, thereby generating a background prototype from the query image itself:

$P_{bg} = \mathcal{F}_{pool}\big(F_q, \hat{M}_{bg}\big),$

where $P_{bg}$ denotes the self-support background prototype. We then update the final prototype as $P_f = \{P_t, P_{bg}\}$, pairing the adaptive self-distillation prototype with the self-support background prototype. Finally, we obtain the final prediction map $\hat{M}_f$ by computing the cosine distance between the final prototype $P_f$ and the query feature map $F_q$. We compute the loss function $\mathcal{L}_{final}$ on the cosine distance map generated during the final training phase:

$\mathcal{L}_{final} = \mathrm{BCE}\big(\hat{M}_f, M_q\big),$

where BCE is the binary cross-entropy loss, and $M_q$ is the ground truth of the query image.
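The self-support step can be sketched as below: pool the query features over the predicted background mask, then label each query pixel by whether it is more cosine-similar to the foreground prototype or to the background prototype. The tiny feature maps and the hard foreground/background decision rule are simplifying assumptions; the real model matches dense feature maps and produces soft scores.

```python
# Toy sketch of the self-support background prototype and cosine matching.
import math

def masked_gap(feat, mask):
    """Masked GAP: average the channel vectors over pixels where mask == 1."""
    c = len(feat[0][0])
    total, count = [0.0] * c, 0
    for row_f, row_m in zip(feat, mask):
        for vec, m in zip(row_f, row_m):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    return [t / count for t in total]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2x2 query feature map with 2 channels.
f_q = [[[1.0, 0.0], [0.9, 0.1]], [[0.1, 0.9], [0.0, 1.0]]]
bg_mask = [[0, 0], [1, 1]]      # background prediction from the ASD module
p_fg = [1.0, 0.0]               # adaptive self-distillation prototype (toy stand-in)
p_bg = masked_gap(f_q, bg_mask) # self-support background prototype from the query itself

# Final prediction: a pixel is foreground if it is closer to p_fg than to p_bg.
pred = [[1 if cos(v, p_fg) > cos(v, p_bg) else 0 for v in row] for row in f_q]
```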
3.5. Optimization
Based on the adaptive self-distillation module and the self-support background prototype module, we propose the PCNet model, as shown in Figure 3. For the whole model, we calculate the loss $\mathcal{L}_{kd}$ between the support and teacher prototypes in the self-distillation process. The query prediction mask $\hat{M}_q$ generated by the self-distillation process is supervised against the ground truth by the binary cross-entropy loss $\mathcal{L}_{meta}$. The cross-entropy loss $\mathcal{L}_{base}$ is used for the base-class prediction mask generated by the base learner; since the base learner needs to be trained separately, its loss does not participate in the optimization of the main model. We also calculate the loss $\mathcal{L}_{bg}$ between the predicted background map generated by the adaptive self-distillation module and the ground-truth background mask, which supervises the segmentation effect of this module. Finally, the query prediction mask generated after the correction of the self-support background prototype module is supervised against the ground truth by the binary cross-entropy loss $\mathcal{L}_{final}$. We realize end-to-end training of the model through the joint optimization of all the above losses:

$\mathcal{L} = \mathcal{L}_{kd} + \mathcal{L}_{meta} + \lambda_1 \mathcal{L}_{bg} + \lambda_2 \mathcal{L}_{final},$

where $\lambda_1$ and $\lambda_2$ are the coefficients of $\mathcal{L}_{bg}$ and $\mathcal{L}_{final}$, used to balance the supervision effects of the four losses. The values of the two coefficients are discussed in the ablation experiment section.
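The joint objective combines the four losses with two balancing coefficients, which can be sketched as follows; the coefficient values below are placeholder assumptions, since the actual values are determined in the ablation experiments.

```python
# Sketch of the joint training objective (coefficient values are placeholders).

def total_loss(l_kd, l_meta, l_bg, l_final, lam1=0.5, lam2=1.0):
    """Weighted sum of the four losses; lam1/lam2 balance the last two terms."""
    return l_kd + l_meta + lam1 * l_bg + lam2 * l_final

# Example with toy per-batch loss values.
loss = total_loss(l_kd=0.2, l_meta=0.3, l_bg=0.4, l_final=0.1)
```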