1. Introduction
Benefiting from the rapid development of deep learning, remote-sensing object detection has achieved significant progress. Nonetheless, nearly all deep learning detectors rely heavily on large-scale annotated datasets to achieve satisfactory performance. Unfortunately, collecting sufficient labeled data is time-consuming or even infeasible, particularly for remote sensing images (RSIs) with large image sizes and complex backgrounds. In such cases, existing detectors suffer from severe overfitting. To mitigate this limitation, few-shot object detection (FSOD) has been proposed and has garnered increasing attention from researchers.
FSOD aims to leverage knowledge learned from data-rich base datasets to enhance the performance of novel classes with limited data. Concretely, most FSOD approaches can be divided into fine-tuning methods and prototype-based methods. Fine-tuning methods [1,2,3,4] transfer knowledge from base classes to novel classes through fine-tuning; they are more straightforward, yet their detection performance remains unsatisfactory. Prototype-based methods construct a class prototype for each class and detect few-shot classes by matching candidate regions with the corresponding class prototypes. Pioneering works [5,6,7,8] construct class prototypes from few-shot support images to detect objects in query images, as illustrated in Figure 1a. Inspired by the rapid development of natural language processing, current researchers are exploring the construction of class prototypes from textual class names [9,10] or class descriptions [11], as illustrated in Figure 1b. Owing to the robustness of class prototypes, prototype-based approaches have demonstrated superior detection performance compared to fine-tuning methods in numerous applications.
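To make the matching idea concrete, the following is a minimal sketch (not code from any cited work) of how candidate-region features might be classified against class prototypes via cosine similarity; the function name and background threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def match_regions_to_prototypes(region_feats: torch.Tensor,
                                prototypes: torch.Tensor,
                                bg_thresh: float = 0.5):
    """Assign each candidate region to the class prototype it matches best.

    region_feats: (num_regions, dim) features of candidate regions.
    prototypes:   (num_classes, dim) one prototype per class.
    Returns per-region class indices (-1 denotes background) and scores.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    sim = region_feats @ prototypes.t()        # cosine similarity matrix
    scores, labels = sim.max(dim=-1)           # best-matching class per region
    labels = torch.where(scores < bg_thresh,
                         torch.full_like(labels, -1), labels)
    return labels, scores
```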
However, most prototype-based object detectors [5,6,7,8,9,10,11] construct class prototypes from either a single-modal image feature or a class name, as depicted in Figure 1a,b. These single-modal features constrain the feature extraction capabilities of prototypes. In fact, visual and textual prototypes each have distinct strengths and drawbacks. On the one hand, textual prototypes generated by the text encoder contain rich prior knowledge about target classes and exhibit strong generalization capabilities, as the text encoder has been pre-trained on a large-scale natural language corpus. Nonetheless, since textual prototypes are domain-agnostic, they cannot provide specific spatial details about target classes in RSIs, which limits their applicability to remote-sensing object detection. On the other hand, as the saying goes, a picture is worth a thousand words: compared to textual prototypes, visual prototypes generated from RSIs provide richer details about target classes. Nevertheless, visual prototypes generated from limited support images lack generalization capability in few-shot scenarios, which significantly degrades detection performance. In summary, single-modal visual and textual prototypes have complementary strengths and limitations.
Therefore, to further boost the performance of FSOD in RSIs, we propose a framework that leverages both visual and textual prototypes, and we introduce a prototype aggregation module (PAM) that integrates the generalized prior knowledge of textual prototypes with the spatial details of visual prototypes. Specifically, as illustrated in Figure 1c, few-shot support images and textual class names are processed by their respective feature extractors to derive individual visual and textual prototypes. The highly generalizable textual prototypes and detail-rich visual prototypes are then aggregated by a prototype aggregator into multi-modal prototypes. The resulting multi-modal prototype retains both the generalized prior knowledge of textual prototypes and the detailed spatial information of visual prototypes. This multi-modal prototype is then integrated into the encoder and decoder of GroundingDINO [10] to detect targets.
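The exact design of the PAM is detailed in later sections; as a rough, hypothetical illustration of one common way to fuse a generalized textual vector with a spatially detailed visual feature map, a cross-attention aggregator could look as follows (all module and argument names are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class PrototypeAggregator(nn.Module):
    """Hypothetical cross-attention fusion of textual and visual prototypes.

    The textual prototype (one vector per class) queries the spatial features
    of the visual prototype, so the fused prototype keeps the text's semantic
    prior while absorbing spatial detail from the support images.
    """
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_proto, vis_proto_map):
        # text_proto:    (num_classes, dim)      textual prototypes
        # vis_proto_map: (num_classes, hw, dim)  flattened support feature maps
        query = text_proto.unsqueeze(1)                    # (num_classes, 1, dim)
        fused, _ = self.cross_attn(query, vis_proto_map, vis_proto_map)
        return self.norm(text_proto + fused.squeeze(1))    # residual fusion
```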
The introduction of pre-trained text encoders makes previous FSOD training strategies unsuitable for our approach, as these strategies do not take the text encoder into consideration, and the text encoder is inherently different from the vision encoder. Therefore, we conduct extensive ablation studies on the generalization capability of the pre-trained text encoder for FSOD in RSIs. Surprisingly, we find that the pre-trained text feature extractor exhibits remarkable robustness in FSOD tasks on RSIs. Even without any fine-tuning of the text encoder, a detector based on the pre-trained text encoder achieves performance comparable to previous state-of-the-art FSOD methods. We argue that this is because the text encoder is domain-agnostic: unlike images, which exhibit different characteristics across domains, textual semantic information typically remains similar across domains. Based on these ablation studies and previous findings, we propose a novel efficient two-stage training strategy (ETS) for FSOD on RSIs, which is the first to take the characteristics of the pre-trained text feature extractor into account.
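The full ETS schedule is described later; the core idea suggested by these ablations, keeping the domain-agnostic text encoder frozen while the domain-specific components are fine-tuned, can be sketched as follows (the `text_encoder` attribute name and learning rate are assumptions):

```python
import torch

def configure_ets_finetuning(detector, lr: float = 1e-4):
    """Freeze the domain-agnostic text encoder; fine-tune everything else.

    `detector.text_encoder` is an assumed attribute name; the point is only
    that the pre-trained text encoder receives no gradient updates.
    """
    for p in detector.text_encoder.parameters():
        p.requires_grad = False
    trainable = [p for p in detector.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```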
By integrating the multi-modal prototypes and the efficient two-stage training strategy with an advanced detector [10], we propose the multi-modal prototypical few-shot object detector (MP-FSDet) and achieve state-of-the-art few-shot object detection performance on DIOR and NWPU VHR-10.v2, two widely used datasets for few-shot object detection in RSIs. As shown in Figure 2, our method remarkably outperforms other state-of-the-art detectors, achieving a maximum performance improvement of 8.7% in the remote-sensing FSOD task.
The contributions of this paper are summarized as follows.
1. We propose a few-shot object detector based on multi-modal prototypes, aiming to enhance few-shot detection performance by combining visual and textual prototypes.
2. We propose a prototype aggregation method for constructing multi-modal prototypes, which preserves the spatial details of visual prototypes and the semantic priors of textual prototypes.
3. Based on comprehensive analyses and extensive experiments on different components of detectors, we propose an efficient two-stage training strategy (ETS) for FSOD on RSIs, which, to the best of our knowledge, is the first to take the characteristics of the pre-trained text feature extractor into account.
4. By integrating our proposed methods with GroundingDINO [10], we achieve state-of-the-art detection performance on two widely used benchmarks, attaining a maximum performance improvement of 8.7% in the remote-sensing FSOD task.
2. Related Work
In this section, we introduce related work from two aspects. First, we review generic object detection methods. Second, we discuss various few-shot object detection approaches and analyze the shortcomings of previous research.
2.1. Generic Object Detection
Based on handcrafted anchors or reference points, early convolution-based detectors are designed as either two-stage or one-stage models. Two-stage detectors [12,13] typically incorporate a region proposal network (RPN) to propose potential boxes and subsequently classify them in the second stage. One-stage detectors like YOLO v2 [14] and YOLO v3 [15] directly output offsets relative to predefined anchors along with classification scores. More recently, a transformer-based end-to-end detector [16] has been proposed that eliminates hand-designed components such as anchors and NMS. To alleviate the computational burden, Deformable DETR [17] predicts 2D anchor offsets and incorporates a deformable attention module that attends to a small set of sampling points around a reference point. By using a contrastive denoising training scheme, a mixed query selection method, and a look-forward-twice scheme, DINO [18] further boosts the performance of transformer-based detectors.
Recently, the rapid advancement of natural language processing has enabled researchers to leverage textual information for object detection. Object detection approaches [9,10,19,20,21] driven by text queries have achieved remarkable progress in natural scenes. They are trained using existing bounding-box annotations and aim to detect arbitrary classes with the help of generalized text queries. OVR-CNN [19] is a pioneering work that first pre-trains a visual projector on image–caption pairs to learn a rich vocabulary and then fine-tunes a detector on the detection dataset. By distilling knowledge from the CLIP [20] model into an R-CNN-like detector, ViLD [21] further advances open-set object detection. GLIP [9] first formulates object detection as a grounding task and achieves even stronger performance on fully supervised detection benchmarks without fine-tuning. GroundingDINO [10] applies the training strategy of GLIP to the stronger detector DINO [18] and achieves state-of-the-art open-set detection performance. MQ-Det [22] utilizes both textual and visual queries for detection, which is similar to our work; however, it uses only vectors as visual queries and thus fails to preserve the spatial details of the target. Furthermore, MQ-Det [22] is designed for natural scene images and does not account for the domain gap between remote-sensing and natural-scene imagery. Since these models are pre-trained only on natural images and use only text embeddings as prototypes, their detection performance degrades on RSIs.
Along with the rapid development of detectors for natural scenes, object detection methods for aerial images have also achieved tremendous progress [23]. To address the challenge of rotation invariance in feature representation, Mei et al. [24] propose a novel cyclic polar coordinate convolutional layer. Considering the expansive spatial coverage of aerial images and the diverse scales of objects, Deng et al. [25] propose a multi-scale object proposal network (MS-OPN), which comprises three proposal branches to predict multi-scale proposals. Due to the unique bird's-eye views of RSIs, objects in aerial images often exhibit arbitrary orientations. To achieve rotation-invariant detection, Cheng et al. [26] propose a Rotation-Invariant CNN (RICNN), which incorporates a novel rotation-invariant layer. Li et al. [27] propose a novel region proposal network that incorporates additional multi-angle anchors to detect objects with arbitrary orientations. Recent studies [28,29] have begun to explore the potential of detecting objects labeled with rotated bounding boxes.
The detectors mentioned above have achieved remarkable performance. However, most of them are prone to overfitting when labeled data become scarce, which significantly restricts the application of object detection to aerial images.
2.2. Few-Shot Object Detection
Because large-scale labeling is time-consuming or even impractical, FSOD has attracted extensive attention.
Current FSOD methods can be divided into transfer-learning methods [1,2,3,30] and prototype-based methods [5,6,7,9,10,22,31]. Transfer-learning methods fine-tune a pre-trained model on the limited data of the few-shot task; flagship works include LSTD [1], TFA [2], FSCE [3], and DeFRCN [30]. Prototype-based methods [5,6,7,9,10,22,31] build a class prototype for each class and detect novel classes by matching candidate regions to class prototypes. They are trained over multiple episodes across tasks for quick adaptation. Early works [5,6,7,31,32] employ only few-shot support images as class prototypes; however, the visual prototypes produced from few-shot images lack generalization. Inspired by the rapid development of natural language processing, the authors of [9,10,22] input class names into a pre-trained text encoder and utilize the resulting text features as class prototypes. As the text encoder is pre-trained on a large-scale textual corpus, the textual prototypes it generates exhibit strong generalization capabilities. Nevertheless, textual prototypes lack specific details about target classes.
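As a concrete illustration of this idea (a sketch, not the exact pipeline of the cited works, which use encoders such as BERT), class names can be embedded into textual prototypes with a pre-trained text encoder such as CLIP's; the checkpoint, prompt template, and class names below are arbitrary choices:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["airplane", "storage tank", "harbor"]  # example RSI classes
inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_protos = model.get_text_features(**inputs)   # (num_classes, dim)
# L2-normalize so the prototypes can be matched by cosine similarity
text_protos = text_protos / text_protos.norm(dim=-1, keepdim=True)
```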
Currently, FSOD for aerial images has attracted increasing attention. Li et al. [33] introduce a novel meta-learning approach with multi-scale detection capabilities tailored to the FSOD task in aerial images. Cheng et al. [8] develop a framework named Prototype-CNN, which leverages prototypes to recognize objects with limited annotations. Similarly, Le et al. [34] propose a novel prototypical network based on representation learning. Huang et al. [35] address the significant data imbalance between novel and base classes by proposing a novel balanced fine-tuning approach and incorporating a shared attention module to exploit the abundant ground information within aerial images. DH-FSDet [36] presents an innovative annotation sampling and preprocessing methodology. To tackle the limited scale diversity in aerial images, MSOCL [37] presents a multi-scale object contrastive learning framework. TEMO [11] employs class descriptions as class prototypes; however, the text encoder it utilizes is trained from scratch and thus lacks generalization capability. In addition, the single-modal prototypes in TEMO [11] cannot provide comprehensive information about target classes. Similarly, UMFT [38] proposes a multi-modal transformer to integrate query features from ViT with textual features extracted from BERT, aligning multi-modal representations in an end-to-end manner. To overcome the substantial scale and orientation variations of objects in RSIs, TINet [39] proposes a novel feature pyramid network (FPN) and utilizes prototype features to enhance query features. To rectify inconsistent label assignments between the base training and fine-tuning stages, SAE-FSDet [40] proposes a label-consistent classifier named LCC and a gradual RPN to enhance the localization of novel classes, inspired by the authors of [41,42,43].
Although prototype-based detection approaches have demonstrated remarkable performance in the field of FSOD, single-modality prototypes exhibit distinct strengths and limitations. To further boost the performance of FSOD, we propose a multi-modal prototype that can harness the strengths of both textual and visual prototypes. Moreover, the transferability of detectors based on multi-modal prototypes in the field of remote-sensing FSOD has not been explored. Through a comprehensive analysis, we find that the text encoder is domain-agnostic, while the visual encoder is domain-specific in RSIs, and we propose an efficient two-stage training strategy.