1. Introduction
With the rapid improvement of computing power (supported by advanced hardware platforms) and the rise of deep learning algorithms, various scene awareness schemes have been integrated into smart cars. For instance, Light Detection and Ranging (LiDAR) and Radio Detection and Ranging (RADAR) are two important auxiliary instruments for scene understanding under poor lighting conditions, operating via laser pulses and radio waves, respectively [1]. They are highly valued for precise distance measurement (for scene reconstruction and mapping) but perform poorly at visual recognition. Ultrasonic sensors are specially designed for parking assistance, but they are limited to short-distance warnings. By contrast, vision sensors paired with state-of-the-art recognition and perception algorithms greatly improve the scene awareness ability of intelligent vehicles, and they also cost less than the abovementioned instruments. The algorithms commonly involved in scene understanding include object detection, semantic segmentation, and depth recovery [2], among others.
Traditional deep learning approaches often require a massive number of labeled samples. Computer vision tasks like image segmentation typically need numerous high-quality pixel-level annotations to guide model training [3,4]. However, acquiring such annotated data is expensive, time-consuming, or even infeasible. Few-shot learning leverages prior knowledge to generalize to new tasks with only a few supervised samples. Therefore, to reduce the labeling costs of image segmentation under scenarios with limited or scarce samples, many studies have incorporated few-shot learning into the image semantic segmentation field.
Few-Shot Semantic Segmentation (FSS) requires less annotated data to complete pixel-level semantic segmentation tasks, improving generalization ability. Normally, only a limited number of images is annotated for each category, which means that the model must learn intra-class features and transfer them to unseen classes. Current mainstream FSS approaches include metric learning-based methods, parameter prediction-based methods, fine tuning-based methods, and memory-based methods. Among them, metric learning-based methods [5,6,7,8,9] play a dominant role. The distances between the feature vectors of support images and query images in a high-dimensional space are used [5] to calculate the similarity between them so as to predict the category probability of each pixel in the image.
In this paradigm, the amount of annotation needed for unseen categories is greatly reduced, but annotations for the seen categories are still indispensable during training. Therefore, it remains challenging to further reduce the annotation requirements of few-shot semantic segmentation. A two-stage unsupervised image segmentation method was proposed by the authors of [10], who used K-means clustering to group pixels into semantic clusters and obtain salient regions with continuous semantic information. A self-supervised FSS method based on unsupervised saliency for prototype learning was devised in [11] to generate training pairs from a single training image, capturing the similarity between the query image and a specific region of the support image. The eigenvectors of the Laplacian matrix derived from the feature affinity matrix of a self-supervised network were utilized in [12] to eliminate the need for extensive annotation, effectively capturing the global representation of the object of interest in the support image. As mentioned earlier, traditional methods rely heavily on manual annotation. Although the above methods alleviate this problem to some extent under a self-supervised learning framework, most of them do not fully utilize features at different scales, leading to poor segmentation performance.
To solve the abovementioned problems, a self-supervised few-shot segmentation method based on saliency segmentation is proposed in this paper. Our method is built under a multi-task learning framework. Each saliency mask is randomly divided into two parts: one is used as a support mask, while the other serves as a query mask participating in the few-shot meta-learning of the model. To further enhance the meta-learning effect, multiple learning tasks are introduced after saliency segmentation to jointly improve few-shot segmentation performance. At the same time, to enhance the robustness of the model, noise addition and image augmentation are applied to the input image so as to better simulate FSS tasks. To fully utilize multi-scale features, a dense attention calculation mechanism is developed, which transforms the multi-scale feature maps into multi-scale dense attention blocks and yields the final prediction via inter-scale mixing. Together, these components form our self-supervised few-shot semantic segmentation method based on a multi-task learning scheme and dense attention computation.
The main contributions of this paper are summarized as follows:
A self-supervised few-shot segmentation method is proposed based on a multi-task learning paradigm. The unsupervised salient part of the image is split into two parts; one of them is used as a support image mask for few-shot segmentation, and the other part and the entire image are used to calculate the cross-entropy with the prediction results to realize multi-task learning so as to improve the generalization ability;
An efficient few-shot segmentation network based on dense attention computation is proposed. Multi-scale feature extraction is carried out using Swin Transformer so as to make full use of the multi-scale pixel-level correlation.
Experimental results obtained on three mainstream datasets show that our method surpasses other popular methods in segmentation accuracy, demonstrating its efficacy.
3. Method
In this section, we introduce our proposed MLDAC in detail. First, the task of self-supervised few-shot segmentation is clarified; then, our multi-task framework is introduced in Section 3.2. The core modules in MLDAC are described in Section 3.3.
3.1. Problem Definition
Fully supervised few-shot semantic segmentation. Traditional FSS is based on fully supervised learning and on the meta-learning paradigm called episodic training. Specifically, given images of a class and the corresponding masks in the training set (D_train), the model learns to find, in a new image from the test set (D_test), the region designated by the support mask, so as to accomplish the few-shot segmentation task. In real applications, both D_train and D_test consist of different classes of objects, and image pairs of the same category are selected to realize the meta-learning paradigm. Every image in D_train over the class set C_train has a complete annotation mask. The class set C_train of D_train shares no classes with the class set C_test of D_test (i.e., C_train ∩ C_test = ∅). In episodic training, each image pair contains duplicate image, mask, and class information, where the class information is the same, i.e., (I_s, M_s, c_s) and (I_q, M_q, c_q) with c_s = c_q, where I_s and I_q are the images, M_s and M_q are the ground-truth binary masks, and c_s and c_q are the class labels corresponding to the masks.
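The episodic protocol above can be sketched in a few lines of code. The class names, dataset layout, and helper function below are illustrative assumptions, not from the paper; they only show how support/query pairs of the same class are drawn while the train/test class sets stay disjoint.

```python
import random

# Hypothetical class splits: C_train and C_test are disjoint by construction.
TRAIN_CLASSES = ["bus", "car", "chair", "cow", "table"]      # C_train
TEST_CLASSES = ["plane", "bike", "bird", "boat", "bottle"]   # C_test

def sample_episode(dataset, classes, rng=random):
    """Pick one class, then a distinct support/query image pair of that class."""
    c = rng.choice(classes)
    support, query = rng.sample(dataset[c], 2)  # two distinct images
    return {"class": c, "support": support, "query": query}

# Toy dataset: two image ids per class is enough for a 1-shot episode.
dataset = {c: [f"{c}_0.jpg", f"{c}_1.jpg"] for c in TRAIN_CLASSES + TEST_CLASSES}

episode = sample_episode(dataset, TRAIN_CLASSES)
# The support and query share a class; train/test class sets never overlap.
assert episode["support"] != episode["query"]
assert set(TRAIN_CLASSES).isdisjoint(TEST_CLASSES)
```

At evaluation time, the same sampler would be called with the held-out class set instead of the training one.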
Self-supervised few-shot semantic segmentation (SFSS). For the self-supervised few-shot semantic segmentation problem, D_train consists of images without masks or labels, so the standard training process cannot be implemented. To solve this problem, a new SFSS method based on multi-task learning is proposed to build a self-supervised training process. After training, the same evaluation protocol as in standard FSS can be used to evaluate the learned meta-model on a multitude of segmentation tasks with few images.
3.2. Framework
To realize SFSS, a complete episodic training framework is constructed in this paper. The architecture of our proposed MLDAC based on multi-task learning consists of three inputs (query image, support image, and support mask) and one output (segmentation result), as shown in
Figure 1. The input is a single image I ∈ D_train without any annotation or class label. Self-supervised learning usually exploits data attributes to set up unsupervised tasks instead of relying on manual annotation. Therefore, unsupervised saliency prediction is utilized to obtain the saliency region M, which depicts an arbitrary object in the image with continuous and accurate edge information. Next, M is divided into two parts, M_s and M_q, the former of which is used as the support mask input to MLDAC, while M_q and M are used to calculate the losses as follows:

Loss1 = CE(P, M_q), (1)

Loss2 = CE(P, M), (2)

where P is the predicted mask and CE denotes pixel-wise cross entropy. Equations (1) and (2) are thus both cross-entropy loss functions with slightly different implementation details. For Loss1, the pixels of the support region M_s do not participate in the calculation of the loss function, so that the model focuses on learning the query region, weakening the impact of the support region. Loss1 and Loss2 guide model training in an alternating way, with probabilities set to a and 1 − a, respectively.
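As a concrete illustration (not the paper's exact implementation), the snippet below computes a binary pixel-wise cross-entropy in which the support-region pixels are masked out for the first loss and the whole saliency mask serves as the target for the second; all array names and shapes are assumptions.

```python
import numpy as np

def pixel_ce(pred_fg, target, valid=None, eps=1e-7):
    """Binary pixel-wise cross-entropy; `valid` selects pixels kept in the loss."""
    pred_fg = np.clip(pred_fg, eps, 1 - eps)
    ce = -(target * np.log(pred_fg) + (1 - target) * np.log(1 - pred_fg))
    return ce.mean() if valid is None else ce[valid].mean()

rng = np.random.default_rng(0)
pred = rng.uniform(size=(8, 8))          # predicted foreground probability
sal = rng.uniform(size=(8, 8)) > 0.3     # unsupervised saliency region M
half = rng.uniform(size=(8, 8)) > 0.5    # random split of M
m_q, m_s = sal & half, sal & ~half       # query / support halves

loss1 = pixel_ce(pred, m_q.astype(float), valid=~m_s)  # support pixels excluded
loss2 = pixel_ce(pred, sal.astype(float))              # entire saliency mask
```

During training, one of the two losses would be selected per iteration with probability a versus 1 − a.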
Meanwhile, since M_s and M_q come from the same source, to highlight the difference between them, we employ data augmentation techniques, including jittering, horizontal flipping, rotation, and random cropping. Gaussian noise is also added before image augmentation (i.e., the color of the selected query region is perturbed slightly to increase the diversity of the training data). The pseudocode of our proposed self-supervised few-shot semantic segmentation framework is given in Algorithm 1:
Algorithm 1 FSS self-supervised framework based on multi-task learning |
- 1: function MultiTaskSplit(I)
- 2: M ← UnsupervisedSaliencyDetection(I)
- 3: M_s, M_q ← RandomSplit(M)
- 4: if rand() < a then
- 5: target ← M_q
- 6: valid ← ¬M_s {Loss1: exclude support pixels}
- 7: else {with probability 1 − a}
- 8: target ← M
- 9: valid ← all pixels {Loss2}
- 10: end if
- 11: I_q ← Augment(I + GaussianNoise(b))
- 12: P ← MLDAC(I_q, I, M_s)
- 13: loss ← CrossEntropy(P, target, valid)
- 14: return loss |
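The steps of the framework above can be sketched as runnable code. The saliency detector stand-in, the pixel-wise split rule, and all function names below are illustrative assumptions made only to show the data flow of one self-supervised episode.

```python
import numpy as np

def fake_saliency(img):
    """Stand-in for the unsupervised saliency detector: threshold on intensity."""
    return img > img.mean()

def make_episode(img, a=0.15, noise_mean=0.0, noise_std=0.05, rng=None):
    """Split the saliency mask into support/query halves and pick the loss target."""
    if rng is None:
        rng = np.random.default_rng()
    sal = fake_saliency(img)
    half = rng.uniform(size=sal.shape) < 0.5    # pixel-wise random split (assumed)
    m_s, m_q = sal & half, sal & ~half          # support / query masks
    target = m_q if rng.uniform() < a else sal  # Loss1 vs. Loss2 target
    noisy = img + rng.normal(noise_mean, noise_std, img.shape)  # perturbed query
    return noisy, m_s, m_q, target

rng = np.random.default_rng(1)
img = rng.uniform(size=(16, 16))
noisy, m_s, m_q, target = make_episode(img, rng=rng)
assert not (m_s & m_q).any()                      # halves are disjoint
assert ((m_s | m_q) == fake_saliency(img)).all()  # and cover the saliency mask
```

The returned triple (noisy query image, support mask, target mask) corresponds to the inputs and supervision signal of one training episode.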
3.3. MLDAC Network Architecture
As shown in
Figure 2, our proposed MLDAC consists of the following three parts:
In the first part, a pre-trained feature extractor is applied to process both the query and the support images to obtain multi-scale query and support features and support image masks of the corresponding size;
The second part feeds the query features, support features, and support masks at each scale into multi-layer Dense Attention Computation Blocks (DACBs) as Q, K, and V, respectively; each DACB performs attention calculations over the query features, support features, and support mask at its scale;
The third part involves the aggregation of the outputs from the previous stage and the multi-scale features. This process produces the final prediction masks using a tailored mixer.
3.3.1. Feature Extraction and Masking
The first stage involves the extraction of semantic features at different levels. Here, Swin Transformer (Swin-B) is employed as the feature extractor, which captures both local fine-grained features and contextual semantic information. Through the bottom-up pathway, features at multiple scales are computed, enabling multi-scale feature learning. Following [7], after capturing image features of different sizes, the image mask is scaled to the corresponding support mask sizes via linear interpolation, allowing for cross-feature attention in different layers. In contrast to existing FSS models, the Swin Transformer (Swin-B) adopted here to extract features was pre-trained on ImageNet-1K.
3.3.2. Dense Attention Computation Block (DACB)
Our proposed DACB aggregates multi-scale features to produce semantic information. The initial stage follows the transformer architecture (i.e., scaled dot-product attention). The corresponding calculation is written as follows:

Attention(Q, K, V) = softmax(Q_p K_p^T / √d) V,

where Q, K, and V are the sets of query, key, and value vectors, respectively; d represents the dimension of the query and key vectors; and the subscript p indicates that absolute learnable position encoding has been added to Q and K.
In this paper, the query and support feature maps are denoted as F_q, F_s ∈ R^{h×w×c}, where h, w, and c represent the height, width, and number of channels of the feature maps, respectively. As shown in Figure 3, the query feature (F_q) and support feature (F_s) are flattened first, and each pixel value is regarded as a token. Then, after adding learnable linear position coding, the Q and K matrices are generated from the flattened F_q and F_s, and the multi-head attention mechanism is implemented as follows:

head_i = Attention(Q_i, K_i, V), i = 1, …, N,

where N is the number of attention heads, and the inputs Q_i and K_i are the i-th group from Q and K with dimension d/N. For the support mask, it is only necessary to flatten it to participate in the calculation of dense attention. Finally, the outputs of the multiple attention heads for each token are averaged, and the averaged output is reshaped into a two-dimensional tensor with dimensions h × w, which is the final dense attention computation output.
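The dense attention computation described above can be sketched as follows. The use of the support mask as V follows the text; the random position-encoding stand-in, the head-splitting rule, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dacb(f_q, f_s, mask_s, n_heads=4, rng=None):
    """Dense attention: query/support features as Q/K, support mask as V."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w, c = f_q.shape
    q = f_q.reshape(h * w, c)                     # one token per pixel
    k = f_s.reshape(h * w, c)
    pe = rng.normal(scale=0.02, size=(h * w, c))  # learnable PE stand-in
    q, k = q + pe, k + pe
    v = mask_s.reshape(h * w).astype(float)       # flattened support mask
    d = c // n_heads
    outs = []
    for i in range(n_heads):                      # per-head scaled dot-product
        qi, ki = q[:, i * d:(i + 1) * d], k[:, i * d:(i + 1) * d]
        attn = softmax(qi @ ki.T / np.sqrt(d))
        outs.append(attn @ v)                     # one scalar per query token
    return np.mean(outs, axis=0).reshape(h, w)    # average heads, back to 2-D

rng = np.random.default_rng(0)
out = dacb(rng.normal(size=(8, 8, 16)), rng.normal(size=(8, 8, 16)),
           rng.uniform(size=(8, 8)) > 0.5)
assert out.shape == (8, 8)
```

Because V holds binary mask values, each output pixel is a convex combination in [0, 1], i.e., an attention-weighted estimate of how strongly the query pixel matches the masked support region.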
3.3.3. Inter-Scale Mixing and Up-Sampling Module
After cross-feature dense attention computation on features from multiple layers at different scales, it is necessary to mix the attention results from these scales to obtain the final prediction. Our inter-scale mixing and up-sampling module has two parts: one stitches together the different layers directly after cross-feature dense attention computation, and the other improves the recognition of image features using skip connections. In this step, the size of each layer is adjusted via successive up-sampling operations.
First, dense attention computation is performed at the 1/32, 1/16, and 1/8 scales to obtain the attention-weighted results A_{1/32}, A_{1/16}, and A_{1/8}, which are subsequently processed through several convolution blocks and merged after suitable up-sampling operations. The outputs of the 1/32 and 1/16 scales are connected, resized, and concatenated to yield the output at the 1/8 scale as follows:

F_{1/8} = ↑A_{1/32} ⊕ ↑A_{1/16} ⊕ A_{1/8},

where ↑ is the up-sampling operation and ⊕ stands for the concatenation operation. F_{1/8} is then processed by a skip connection and decoding operation to obtain the final predicted mask. Specifically, the last-layer features extracted by the feature extractor at the 1/4 and 1/8 scales, denoted E_{1/4} and E_{1/8}, are concatenated as follows:

F_mix = ↑(F_{1/8} ⊕ E_{1/8}) ⊕ E_{1/4}.

Finally, the result is passed through a decoder (D) to produce the final mask prediction as follows:

P = D(F_mix).
The decoder is composed of several convolutional modules and ReLU blocks operating alternately, along with up-sampling operations to attain the final segmentation resolution. The decoder blocks gradually reduce the output channel dimension to 2 (one for foreground, one for background) for one-way segmentation. Two interleaved up-sampling operations restore the output size to match that of the input images.
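Shape-wise, the mixing and up-sampling stage can be illustrated with nearest-neighbour up-sampling and channel concatenation. The channel counts and the omission of the convolution blocks are simplifications made only to verify the scale arithmetic.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour up-sampling of an (h, w, c) tensor by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

h, w = 64, 64  # input resolution (illustrative)
a32 = np.random.rand(h // 32, w // 32, 8)  # attention output at the 1/32 scale
a16 = np.random.rand(h // 16, w // 16, 8)  # attention output at the 1/16 scale
f8 = np.random.rand(h // 8, w // 8, 16)    # backbone skip feature at 1/8 scale

# Up-sample 1/32 to 1/16, concatenate, then up-sample the mix to 1/8 and
# concatenate with the skip feature from the extractor.
mix16 = np.concatenate([upsample2x(a32), a16], axis=-1)
mix8 = np.concatenate([upsample2x(mix16), f8], axis=-1)
assert mix8.shape == (h // 8, w // 8, 32)
```

In the full model, each concatenation would be followed by convolution blocks, and the decoder would continue up-sampling until the input resolution is reached.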
4. Experiments and Results
Datasets. To validate the effectiveness of our proposed method, extensive experiments were conducted on the PASCAL-5^i, COCO-20^i, and FSS-1000 datasets.
PASCAL-5^i is built upon PASCAL VOC [37]. It has 20 categories that are further divided into 4 folds, namely 5^0, 5^1, 5^2, and 5^3, each containing different categories. For instance, 5^0 includes planes, bikes, birds, etc., while 5^1 includes buses, cars, chairs, etc. During the training for each fold, the other three folds are used as the training dataset. Our unsupervised training requires only image data, without the task or class information associated with the images. Hence, we use the images and unsupervised saliency maps from all folds during training, evaluate the mean intersection over union on each fold, and preserve the best outcomes.
Similar to PASCAL-5^i, COCO-20^i is derived from MS COCO [38], which consists of more than 120,000 images from 80 categories. It is split into four folds denoted by 20^0, 20^1, 20^2, and 20^3, each of which contains 20 categories.
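Both benchmarks assign classes to folds by the same rule; the sketch below uses the standard contiguous class-to-fold split, stated here as an assumption about the exact ordering.

```python
def build_folds(num_classes, num_folds=4):
    """Assign class ids 0..num_classes-1 to folds in contiguous blocks."""
    per_fold = num_classes // num_folds
    return [list(range(i * per_fold, (i + 1) * per_fold))
            for i in range(num_folds)]

pascal_folds = build_folds(20)  # PASCAL-5^i: 5 classes per fold
coco_folds = build_folds(80)    # COCO-20^i: 20 classes per fold

def train_classes(folds, held_out):
    """Training on fold `held_out` uses the classes of all remaining folds."""
    return [c for i, f in enumerate(folds) if i != held_out for c in f]

assert len(pascal_folds[0]) == 5 and len(coco_folds[0]) == 20
assert len(train_classes(pascal_folds, 0)) == 15
```

Cross-validation then cycles `held_out` over the four folds and averages the per-fold results.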
The FSS-1000 dataset [39] is set up with well-established categories; we use only images from the pre-defined training categories as support and do not use images from the target categories as part of the training set. For all datasets, the mean intersection over union (mIoU) is used as the metric, and one-shot segmentation results are reported and compared.
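The reported metric can be computed per episode and averaged; a minimal sketch for binary masks:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def mean_iou(preds, gts):
    """mIoU over a list of (prediction, ground-truth) mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

p = np.array([[1, 1], [0, 0]], dtype=bool)
g = np.array([[1, 0], [0, 0]], dtype=bool)
assert abs(iou(p, g) - 0.5) < 1e-9  # intersection 1, union 2
```

Benchmark protocols typically accumulate intersections and unions per class before averaging; the per-pair average above is a simplification.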
4.1. Implementation Details
All experiments were performed using the PyTorch framework. The pre-trained Swin-B model (trained on ImageNet-1K [33]) is used as the backbone feature extractor. Both support and query images have input sizes of
pixels. For optimization, the Adam optimizer was applied with a learning rate of
, a weight decay of
, and pixel-wise cross-entropy loss. Each model was trained on two RTX 3090 GPUs for 100 epochs on the PASCAL dataset and 30 epochs on the COCO dataset, with a batch size of 16.
4.2. Comparison with Other Popular Methods
Comparisons between our method and other state-of-the-art supervised few-shot segmentation approaches and self-supervised semantic segmentation approaches are presented in Table 1 and Table 2. Here, avg represents the mean intersection over union, 5^i (or 20^i) represents the average segmentation accuracy over all categories in the i-th fold, and FSS-1000 represents the segmentation accuracy on the FSS-1000 dataset. The supervised models utilized the ground-truth segmentation masks during the usual fold-based training, whereas the unsupervised models were trained on a training set without ground truth. As shown in Table 1, we achieved the best results among all the self-supervised methods and even surpassed two of the fully supervised methods. Similarly, for COCO and FSS-1000, we also achieved the best overall results among all the self-supervised methods, exceeding two of the fully supervised methods (on COCO).
Above all, the framework we propose is highly effective. When comparing results on the PASCAL dataset, we use the results reported in MaskSplit, where the Saliency* and MaskContrast* baselines from [10] are optimized unsupervised approaches representing that framework. To make a comprehensive comparison with an existing supervised few-shot method, the source code provided by MIANet [42] was used to re-run the experiment under the self-supervised setting, obtaining a result of 53.8%. Compared with supervised approaches, self-supervised approaches perform worse because they cannot learn intra-class information. Despite this, our approach performed exceptionally well, achieving a score of 55.1% on PASCAL. This score is two points higher than that of the original mask-splitting approach, MaskSplit, and is thus very competitive.
Table 2 shows a comparison between our method and other popular self-supervised and fully supervised few-shot segmentation methods on COCO and FSS-1000. As the results show, the performance of our method is greatly enhanced compared to current self-supervised few-shot segmentation methods, increasing from 23.3 to 26.8 on the COCO dataset and reaching 78.1 on the FSS-1000 dataset, which is a significant improvement. We attribute the superb results on the FSS-1000 dataset to its unsupervised saliency regions being more prominent and free of noise and to its relatively high within-class image similarity. It is worth noting that MaskSplit surpasses our method on some folds in both Table 1 and Table 2. The reason is that MaskSplit masks out all the background regions of the support image (i.e., masked pooling) during the self-supervised training process. This strategy is effective when facing a complex background. However, we realize better overall results by using dense attention computation to acquire both background and foreground information.
To further demonstrate the advantages of our model on cross-domain few-shot segmentation tasks, an additional experiment was conducted on the ISIC2018 dataset [44], which contains skin lesion images and is mainly used for medical image analysis and model training. It comprises thousands of high-resolution images of skin lesions, including benign lesions such as moles and pigmented nevi, as well as malignant lesions such as melanoma and basal cell carcinoma. A major advantage is that the images have already been annotated by professionals.
Comparative results on ISIC are shown in Table 3. PATNet [45] proposes a few-shot segmentation network based on pyramid anchoring transformation, which converts domain-specific features into domain-independent features so that downstream segmentation modules can quickly adapt to unknown domains. PMNet [46] proposes a pixel matching network that extracts domain-independent pixel-level dense matches and captures pixel-to-pixel and pixel-to-patch relationships in each support-query pair using bidirectional 3D convolution. Compared with PATNet [45] and PMNet [46], we achieved much higher accuracy.
4.3. Analysis of the Computational Complexity
In this section, we analyze the computational complexity of MLDAC in terms of model parameters (Params), floating-point operations (FLOPs), and inference time. Params indicates the number of parameters of the model (i.e., its size), and FLOPs indicates the computational cost of inference. The inference time is the time the model takes to produce the segmentation results. The experiment was conducted on two RTX 3090 GPUs with Swin-B as the backbone. A comparison is shown in Table 4. Compared with HSNet [7], although our method has a higher computational cost, it requires fewer iterations (i.e., a much shorter inference time).
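For intuition, the two cost measures can be estimated analytically for a single convolution layer. The formulas below are the usual textbook estimates (stride 1, bias included, FLOPs ≈ 2 × multiply-accumulates), not the paper's exact accounting.

```python
def conv2d_cost(c_in, c_out, k, h, w):
    """Parameter count and FLOPs of one k x k convolution over an
    h x w feature map (stride 1, 'same' padding, bias included)."""
    params = k * k * c_in * c_out + c_out  # weights + biases
    macs = k * k * c_in * c_out * h * w    # one MAC per weight per output pixel
    return params, 2 * macs                # FLOPs ~= 2 * MACs

params, flops = conv2d_cost(c_in=64, c_out=64, k=3, h=56, w=56)
assert params == 3 * 3 * 64 * 64 + 64  # 36,928 parameters
```

Summing such per-layer estimates over a network gives the Params and FLOPs figures reported in complexity tables; profilers compute the same quantities automatically.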
4.4. Visualization Results
In this section, visualization results are compared to demonstrate the segmentation quality of different methods. The first column in Figure 4 shows the ground-truth segmentation results; the second column shows the support images and corresponding masks; and the third and fourth columns show the segmentation results obtained by MaskSplit and our method, respectively. In comparison, our proposed method delineates the boundary of each object more clearly than MaskSplit. For complex backgrounds, MLDAC can better distinguish the foreground from the background based on the support images.
4.5. Ablation Study
A comprehensive ablation study was conducted on PASCAL to validate the effectiveness of our proposed method.
4.5.1. Multi-Task Learning Parameter Settings
In this section, the multi-task learning parameter (a) and noise injection parameter (b) of MLDAC are examined. Here, a represents the probability of selecting task 1 or task 2 during training, which balances the proportion of the two few-shot segmentation loss functions. When it is set to 0 or 1, the network degenerates into an ordinary single-target structure. The experimental results obtained using different values of a are shown in Figure 5(1). It can be seen that when a = 0.15, the model achieves the optimal result; therefore, we set a = 0.15 in all other experiments. Similarly, the results of sweeping b, the mean of the Gaussian noise added to the image, are shown in Figure 5(2), and the best-performing value is adopted.
4.5.2. The Architecture of MLDAC
As shown in Table 5, combinations of different schemes were validated to search for the optimal settings of MLDAC. The results confirm that our proposed learnable linear positional encoding and skip connections are indeed effective. The former enhances the connections between different features, and the latter strengthens the semantic information, making it easier to obtain the correlated region between the support and query images. Meanwhile, the 1/4- and 1/8-scale features accomplish the segmentation task in a more effective way.
It can also be seen from the data in Table 5 that multi-scale DACBs better capture semantic information at different scales and obtain better segmentation results than the compared variants. Removing the 1/8-scale dense attention computation block reduces the model performance by 0.6%, and consecutively removing the 1/16- and 1/32-scale dense attention computation blocks reduces the performance further.
4.5.3. Configuration of Learnable Absolute PE and Dense Skip Connections
Ablation studies were performed on the absolute learnable positional encodings and dense skip connections, with the results shown in Table 6. The absolute learnable position encoding is added only at the 1/32 and 1/16 levels of dense attention computation; the 1/8 level uses encodings fixed by sine and cosine functions of different frequencies to save training costs. Our dense skip connections at the 1/8 scale up-sample the features from the previous layer when using features from the later layer and perform the original skip-linking operation, using a conv module to resize the spliced features back to the original size. The ablation experiments show that absolute learnable position encoding is beneficial. Using the dense skip connection operation only on the 1/4 and 1/8 scales improves the final segmentation by exploiting the intermediate-layer features more efficiently, and combining the 1/4-scale and 1/8-scale features with conv blocks before the skip connections completes the segmentation task more efficiently.
5. Conclusions
Self-supervised methods have begun to prevail in multiple computer vision tasks, including semantic segmentation. In this paper, a self-supervised few-shot segmentation method based on multi-task learning and dense attention computation is proposed. Our method utilizes unsupervised saliency regions for self-supervised few-shot segmentation (FSS), which avoids the need for extensive manual annotation; these regions provide continuous semantic information that improves the training of self-supervised FSS. The multi-task learning scheme addresses the lack of category information by dividing the salient regions into query regions and support regions, and the introduced attention mechanism improves the segmentation accuracy of the model. Extensive experiments were conducted on PASCAL-5^i and COCO-20^i, on which our model achieved 55.1% and 26.8% one-shot mIoU, respectively. In addition, it reached 78.1% on FSS-1000.
Despite the appealing results we achieved, the proposed self-supervised FSS method based on saliency segmentation still cannot effectively provide continuous salient regions for objects of the same category. In the future, we plan to introduce an image generation scheme to construct a meta-learning paradigm for FSS so as to achieve higher segmentation accuracies.