3.3.1. Dual-Filter Cross Attention Module
During the prototype learning stage, many FSMIS models leverage cross-attention-based approaches [22,23] to achieve interaction between support and query features, enhancing the ability of the support FG features to represent FG classes. However, feature interactions that fail to filter BG features may cause the query BG to be incorrectly fused with the support FG features, which directly lowers prototype quality and indirectly lowers segmentation accuracy.
To address this problem, we restrict the interaction to the query FG features and the support FG features by purposefully filtering BG features with the proposed DFCA module. This design allows for the stable fusion of FG features, thereby significantly enhancing the representation of the support FG features while avoiding the BG mismatching issue [44].
The entire process is illustrated in Figure 2. First, we feed the support features $F_s$, the query features $F_q$, and the support mask $M_s$ into the prior mask generation (PMG) block to obtain a prior query mask $\hat{M}_q$. Then, to reduce the disparity between the FG features of the support and query images, we fuse the support features $F_s$, the support mask $M_s$ (or the query features $F_q$ and the prior query mask $\hat{M}_q$), and the prior support prototype map $\tilde{P}_s$ using a 1 × 1 convolution, outputting the fused support features $F'_s$ (or the fused query features $F'_q$). Finally, we employ the proposed filter cross attention (FCA) block to facilitate mutual learning among the FG features while filtering out the influence of BG factors as much as possible, resulting in the enhanced support features $\hat{F}_s$ (or the enhanced query features $\hat{F}_q$).
(1) Prior mask generation
The PMG module applies MAP to the support image and support mask to compute the prior support prototype. The prior support prototype is then used to calculate the cosine similarity with the query image, which is subsequently thresholded to obtain the prior query mask. Using the generated prior query mask, we can initially separate the FG and BG regions of the query image.
Specifically, as shown in Figure 2, we conduct MAP [24] to obtain a prior support prototype $P_s$ by leveraging the support features $F_s$ and the corresponding mask $M_s$. The mathematical form of this process is expressed as follows:

$$P_s = \frac{\sum_{(x,y)} F_s(x,y) \odot M_s(x,y)}{\sum_{(x,y)} M_s(x,y)},$$

where $\odot$ represents the Hadamard product, $(x,y)$ denotes the pixel position in the original mask, $F_s$ and $M_s$ represent the support features and the corresponding binary FG mask, respectively, and $P_s$ represents the support prototype in the process of generating the prior mask.
Subsequently, we utilize $P_s$ to compute the negative cosine similarity (i.e., anomaly scores) with each location in the query features. This can be denoted as:

$$S(x,y) = -\alpha \, \frac{F_q(x,y) \cdot P_s}{\lVert F_q(x,y) \rVert \, \lVert P_s \rVert},$$

where $S$ is the anomaly score map for each position in $F_q$, $F_q$ represents the query features, $\alpha$ is the scaling factor introduced by Oreshkin et al. [45] to facilitate backpropagation, which is generally set to 20, and $\lVert \cdot \rVert$ represents the norm of a matrix.
After that, to make the process of thresholding the anomaly scores differentiable, we employ a shifted Sigmoid function [25], which ensures that regions with anomaly scores below the prior threshold $\tau$ obtain a higher FG probability, thereby obtaining the final prior query mask $\hat{M}_q$. The threshold $\tau$ follows the setting in [25]. The entire process is illustrated in the following formula:

$$\hat{M}_q(x,y) = 1 - \sigma_\kappa\big(S(x,y) - \tau\big),$$

where $\sigma_\kappa$ is the Sigmoid function with a steepness parameter $\kappa$.
(2) Feature fusion
Before the support and query FG features interact, it is important to recognize that they may not be similar. To close this gap and improve the quality of the interaction, we fuse the query features $F_q$, the prior support prototype features $\tilde{P}_s$, and the prior query mask $\hat{M}_q$; likewise, we fuse the support features $F_s$, the prior support prototype features $\tilde{P}_s$, and the support mask $M_s$. This aligns the query features with the support features in the same feature space, reducing the distribution discrepancy between them.
Specifically, as shown in Figure 2, we process $P_s$ to match the spatial size of the feature maps, thereby obtaining the prior support prototype features $\tilde{P}_s$. Subsequently, we conduct channel concatenation on $F_q$, $\tilde{P}_s$, and $\hat{M}_q$ and use a 1 × 1 convolution for dimensionality reduction to achieve feature fusion. Similarly, $F_s$ is concatenated with $\tilde{P}_s$ and $M_s$ to achieve feature fusion. The fused query features $F'_q$ and fused support features $F'_s$ are calculated as:

$$F'_q = \mathcal{C}_{1\times 1}\big(\mathcal{F}_{cat}(F_q, \tilde{P}_s, \hat{M}_q)\big), \qquad F'_s = \mathcal{C}_{1\times 1}\big(\mathcal{F}_{cat}(F_s, \tilde{P}_s, M_s)\big),$$

where $\mathcal{F}_{cat}(\cdot)$ denotes channel concatenation and $\mathcal{C}_{1\times 1}(\cdot)$ is a 1 × 1 convolution operation.
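As an illustration, the following PyTorch-style sketch shows one plausible form of this fusion step; the module name, the channel bookkeeping, and the assumption that $\tilde{P}_s$ is obtained by broadcasting $P_s$ to the spatial size of the feature map are ours.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate features, broadcast prototype map, and mask; reduce with a 1 x 1 conv."""
    def __init__(self, channels):
        super().__init__()
        # (features C) + (prototype map C) + (mask 1) -> C channels
        self.reduce = nn.Conv2d(2 * channels + 1, channels, kernel_size=1)

    def forward(self, feat, mask, proto):
        b, c, h, w = feat.shape
        proto_map = proto[:, :, None, None].expand(b, c, h, w)  # prior support prototype features
        return self.reduce(torch.cat([feat, proto_map, mask], dim=1))
```

The same module is applied twice: once to ($F_q$, $\hat{M}_q$, $P_s$) to produce the fused query features and once to ($F_s$, $M_s$, $P_s$) to produce the fused support features.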
(3) Filter cross attention
As shown in Figure 2, after feature fusion, FCA takes the fused support features $F'_s$, the fused query features $F'_q$, and both corresponding masks as inputs and then outputs the enhanced support features $\hat{F}_s$ and the enhanced query features $\hat{F}_q$. We use the masks as the first filter to remove BG factors from the feature maps, which involves performing a Hadamard product between the masks and the features. This process is shown in the following formulas:

$$F^{f}_s = F'_s \odot M_s, \qquad F^{f}_q = F'_q \odot \hat{M}_q,$$

where $F^{f}_s$ and $F^{f}_q$ are the FG parts of the fused support features and fused query features, respectively.
Then, we perform a cross-attention-based approach [20,23] to obtain $\hat{F}_s$ and $\hat{F}_q$. We take obtaining $\hat{F}_s$ as an example to illustrate the whole process. FCA first projects the support FG features into a sequence $Q$ and then projects the query FG features into sequences $K$ and $V$:

$$Q = F^{f}_s W_Q + b_Q, \qquad K = F^{f}_q W_K + b_K, \qquad V = F^{f}_q W_V + b_V,$$

where $W_Q$ and $b_Q$ are the weight matrix and bias term for generating $Q$, $W_K$ and $b_K$ are the weight matrix and bias term for generating $K$, and $W_V$ and $b_V$ are the weight matrix and bias term for generating $V$. After that, FCA conducts matrix multiplication to calculate the similarity between $Q$ and $K$ to obtain the similarity matrix $A$:

$$A = \frac{Q K^{\top}}{\sqrt{d_k}},$$

where $d_k$ is the dimension of $K$.
However, considering the possible inaccuracy of the prior query mask, the first filter may not completely remove BG factors. We therefore design a filtering function $\varphi(\cdot)$ in the cross attention as the second filter: it yields a filtered similarity matrix $A'$ by suppressing the similarity scores in $A$ that fall below an adaptive value $\beta$, which is computed from the maximum and average values of $A$.
The filtering function adaptively removes low-quality similarity scores that may exist in $A$. Then, we use the softmax function to normalize $A'$ and fuse the result with $V$, which can be denoted as:

$$\hat{F}^{f}_s = \mathrm{softmax}(A')\, V,$$

where $\hat{F}^{f}_s$ is the FG region of the enhanced support features.
After that, FCA adds the corresponding BG information back to $\hat{F}^{f}_s$, yielding the enhanced support features $\hat{F}_s$:

$$\hat{F}_s = \mathcal{A}_{bg}\big(\hat{F}^{f}_s\big),$$

where $\mathcal{A}_{bg}(\cdot)$ represents the operation of adding BG information.
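The following PyTorch-style sketch puts the two filters and the attention computation together for the support direction. The concrete choice of $\beta$ (here, the mean of the maximum and the average similarity score) and the way BG information is added back (here, re-attaching the BG part of the fused support features) are plausible assumptions used only for illustration.

```python
import torch
import torch.nn as nn

class FilterCrossAttention(nn.Module):
    """One direction of FCA: enhance the support features using the query FG features."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)

    def forward(self, fused_s, mask_s, fused_q, prior_mask_q):
        b, c, h, w = fused_s.shape
        # First filter: Hadamard product with the masks keeps FG features only.
        fg_s = (fused_s * mask_s).flatten(2).transpose(1, 2)        # (B, HW, C)
        fg_q = (fused_q * prior_mask_q).flatten(2).transpose(1, 2)  # (B, HW, C)

        q = self.to_q(fg_s)                       # queries from the support FG features
        k, v = self.to_k(fg_q), self.to_v(fg_q)   # keys/values from the query FG features
        sim = q @ k.transpose(1, 2) / c ** 0.5    # similarity matrix A, (B, HW, HW)

        # Second filter: suppress scores below the adaptive value beta
        # (assumed here to be the mean of the maximum and the average score).
        beta = 0.5 * (sim.amax(dim=-1, keepdim=True) + sim.mean(dim=-1, keepdim=True))
        sim = sim.masked_fill(sim < beta, float("-inf"))

        fg_out = (torch.softmax(sim, dim=-1) @ v).transpose(1, 2).reshape(b, c, h, w)

        # Add the BG information back (assumed: the BG part of the fused support features).
        return fg_out + fused_s * (1.0 - mask_s)
```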
Similarly, we can obtain the enhanced query features $\hat{F}_q$ by exchanging the roles of the two branches: the sequence $Q$ is projected from the query FG features $F^{f}_q$, the sequences $K$ and $V$ are projected from the support FG features $F^{f}_s$, and the same filtering, normalization, and BG-addition steps are then applied.
3.3.2. Onion Pooling
During prototype extraction, prototype bias is a significant challenge that every prototype learning method must confront. In addition, intra-class diversity causes notable differences between support and query features, so intra-class bias poses a further challenge.
To address these challenges, we propose the OP module and introduce the RPT module. We use the erosion pooling (EP) operation to help prototypes acquire richer contextual information, thus enhancing the ability of prototypes to represent FG features and alleviating the problem of prototype bias. Furthermore, we alleviate intra-class bias by using the self-attention mechanism and the RPT module to reduce the inconsistency between prototypes.
Specifically, we use the proposed EP operation to erode the support mask, creating layer-by-layer onion masks that prepare for extracting prototypes. Then, we conduct the MAP method to generate prototypes. Finally, we employ a self-attention-based method to enable mutual learning within the prototypes. This process is illustrated in Figure 3.
(1) Onion mask generation
Here, we refer to the eroded support masks as onion masks. We conduct several EP operations to shrink the FG region of the support mask progressively, resulting in multiple onion masks.
In the beginning, we obtain the BG mask by reversing the support mask. Subsequently, we perform a max-pooling operation on the BG mask to reduce the size of the FG region. These two steps simultaneously expand the BG region of the mask while reducing the FG region. Finally, we reverse the pooled BG mask again to obtain the FG mask, which is the onion mask. This process can be expressed as follows:

$$M^{(j)} = 1 - \mathrm{EP}\big(1 - M^{(j-1)}\big),$$

where $M^{(j)}$ represents the onion mask obtained after the $j$-th ($1 \le j \le n-1$) erosion pooling. It is worth noting that $M^{(j)}$ is the support mask $M_s$ when $j = 0$, and $\mathrm{EP}(\cdot)$ represents the EP operation, implemented as max pooling with a fixed window.
Finally, we obtain $n$ onion masks by executing EP $n-1$ times. Empirically, the upper limit of the number of onion layers is typically set to 4.
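A minimal sketch of onion mask generation is given below: reversing the mask, max pooling, and reversing again implements one EP step, and repeating it $n-1$ times yields the $n$ onion masks. The pooling window size used here is an illustrative default, not the setting fixed in this work.

```python
import torch.nn.functional as F

def onion_masks(mask_s, n=4, k=3):
    """mask_s: (B, 1, H, W) binary FG mask; returns n onion masks (the first is mask_s)."""
    masks = [mask_s]
    for _ in range(n - 1):
        bg = 1.0 - masks[-1]                                            # FG mask -> BG mask
        bg = F.max_pool2d(bg, kernel_size=k, stride=1, padding=k // 2)  # grow BG, i.e. erode FG
        masks.append(1.0 - bg)                                          # back to the (eroded) FG mask
    return masks
```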
(2) Prototype generation and enhancement
After obtaining the onion masks, we conduct MAP to extract support prototypes by leveraging the enhanced support features and the onion masks:

$$p_j = \frac{\sum_{(x,y)} \hat{F}_s(x,y) \odot M^{(j)}(x,y)}{\sum_{(x,y)} M^{(j)}(x,y)},$$

where $p_j$ represents the support prototype extracted using the $j$-th onion mask. It is worth mentioning that we obtain $n$ support prototypes from the $n$ generated onion masks.
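Continuing the sketches above, the following snippet applies MAP to the enhanced support features once per onion mask; shapes and names follow the earlier illustrative sketches.

```python
import torch

def onion_prototypes(enh_feat_s, masks):
    """enh_feat_s: (B, C, H, W) enhanced support features; masks: list of n (B, 1, H, W) onion masks."""
    protos = [
        (enh_feat_s * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1e-6)  # MAP per mask
        for m in masks
    ]
    return torch.stack(protos, dim=1)  # (B, n, C)
```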
To alleviate intra-class bias, we employ a self-attention mechanism [46] to enable the prototypes to learn from each other internally. Precisely, we concatenate the prototypes into one sequence $P$ and then project the result into the sequences $Q$, $K$, and $V$:

$$Q = P W_Q + b_Q, \qquad K = P W_K + b_K, \qquad V = P W_V + b_V,$$

where $W_Q$, $W_K$, and $W_V$ are the weight matrices for generating $Q$, $K$, and $V$, respectively, and $b_Q$, $b_K$, and $b_V$ are the bias terms for generating $Q$, $K$, and $V$, respectively. Subsequently, we compute the dot-product similarity between $Q$ and $K$ and normalize it using a softmax function to obtain the attention scores. We employ the attention scores to weight $V$ and then split the result into $n$ prototypes. The entire process is illustrated as follows:

$$\{\hat{p}_j\} = \mathrm{Split}\big(\mathrm{softmax}(Q K^{\top})\, V\big),$$

where $\hat{p}_j$ represents the support prototypes generated by mutual learning through the self-attention method (where $0 \le j \le n-1$) and $\mathrm{Split}(\cdot)$ represents splitting the concatenated prototype into $n$ prototypes.
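Finally, the prototype enhancement step can be sketched as standard self-attention [46] over the $n$ prototypes treated as a length-$n$ sequence; the layer names and the omission of any extra scaling follow the description above and are otherwise illustrative.

```python
import torch
import torch.nn as nn

class PrototypeSelfAttention(nn.Module):
    """Refine the n prototypes by letting them attend to each other."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(channels, channels)
        self.to_v = nn.Linear(channels, channels)

    def forward(self, protos):
        # protos: (B, n, C), e.g. the output of the MAP sketch above.
        q, k, v = self.to_q(protos), self.to_k(protos), self.to_v(protos)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # (B, n, n) attention scores
        return list((attn @ v).unbind(dim=1))                # split into n enhanced prototypes (B, C)
```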