Article

Extreme R-CNN: Few-Shot Object Detection via Sample Synthesis and Knowledge Distillation

1 School of Computer Science and Engineering, Macau University of Science and Technology, Macau 999078, China
2 School of Computer Technology, Beijing Institute of Technology, Zhuhai 519088, China
3 Artificial Intelligence and Big Data College, Chongqing Polytechnic University of Electronic Technology, Chongqing 401331, China
4 Guangdong BOHUA UHD Video Innovation Center Co., Ltd., Shenzhen 518172, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(23), 7833; https://doi.org/10.3390/s24237833
Submission received: 15 October 2024 / Revised: 30 November 2024 / Accepted: 5 December 2024 / Published: 7 December 2024
(This article belongs to the Collection Artificial Intelligence in Sensors Technology)

Abstract

Traditional object detectors require extensive instance-level annotations for training. Conversely, few-shot object detectors, which are generally fine-tuned using limited data from unknown classes, tend to show biases toward base categories and are susceptible to variations within these unknown samples. To mitigate these challenges, we introduce a Two-Stage Fine-Tuning Approach (TFA) named Extreme R-CNN, designed to operate effectively with extremely limited original samples through the integration of sample synthesis and knowledge distillation. Our approach involves synthesizing new training examples via instance clipping and employing various data-augmentation techniques. We enhance the Faster R-CNN architecture by decoupling the regression and classification components of the Region of Interest (RoI), allowing synthetic samples to train the classification head independently of the object-localization process. Comprehensive evaluations on the Microsoft COCO and PASCAL VOC datasets demonstrate significant improvements over baseline methods. Specifically, on the PASCAL VOC dataset, the average precision for novel categories is enhanced by up to 15 percent, while on the more complex Microsoft COCO benchmark it is enhanced by up to 6.1 percent. Remarkably, in the 1-shot scenario, the AP50 of our model exceeds that of the baseline model in the 10-shot setting within the PASCAL VOC dataset, confirming the efficacy of our proposed method.

1. Introduction

Few-shot object detection is a task that poses significant challenges and deserves extensive research. Object detection forms the foundation for tasks such as intelligent surveillance and unmanned driving. To achieve better performance, detectors are usually trained on large annotated datasets, which requires substantial computational resources and costly labeling effort. Labeling samples for object-detection tasks is both time-consuming and labor-intensive. In some scenarios, it is difficult to obtain samples for object detection at all, such as medical images of rare diseases and photos of rare animals and plants. In certain applications that rely on sensor-collected data, amassing a large amount of image data for rare events in a short period of time can be challenging. In contrast, few-shot object detection needs only a few samples (usually fewer than 30) to train the model. As a result, in recent years, an increasing number of researchers have focused their efforts on few-shot object detection.
Human learning also starts with a small number of samples and proceeds incrementally. When children learn new knowledge, they do not forget what they already know; accordingly, our approach is also an incremental learning approach. We attempt to ensure that the detector does not forget the previously learned base categories when learning from the few-shot samples of novel classes.
Few-shot object-detection models are capable of recognizing novel categories using only a limited number of samples, a feature that is essential for the rapid deployment of object-detection models in dynamic environments. Two primary challenges arise in the context of few-shot object detection: (1) Insufficient samples: Training a comprehensive model from scratch is impractical due to the lack of sufficient data. To mitigate this, transfer learning is commonly employed, where a model is first pre-trained on base categories with ample data and then fine-tuned on the few-shot dataset. However, the small sample size during fine-tuning can lead to overfitting, which undermines the model’s generalization ability. (2) Performance decline on base categories: After fine-tuning, while the model gains the ability to identify novel categories, its performance in recognizing the base categories often declines.
This paper addresses the aforementioned challenges through two key strategies: (1) Enhancing sample size and diversity: In scenarios with limited samples, we aim to increase the sample size and diversity through data synthesis, thereby reducing the risk of overfitting. (2) Preserving recognition ability: We use knowledge distillation to maintain the model’s ability to recognize base categories as much as possible during the fine-tuning process.
Recently, Knowledge Distillation (KD) [1] has been introduced as a framework to compress models, enhancing the efficiency of training deep neural networks through a teacher–student learning paradigm. In this paper, we adopt a two-stage fine-tuning method to train a few-shot object detector, leveraging intermediate representations from the teacher model to guide and improve both the training process and the final performance of the student model.
Sample synthesis can reduce overfitting in few-shot object detection. Training an object detector typically requires extracting structure from very large, highly redundant datasets using enormous amounts of computation, and the number of training samples is a crucial factor that significantly affects the robustness and performance of the detector. We crop object instances from the original samples and synthesize new samples with data-augmentation methods such as adding noise to the instances. Randomly translating the target positions in the synthetic samples strengthens the detector's localization capability, while augmenting the instances improves its classification performance. Increasing the diversity of the sample distribution through sample synthesis effectively mitigates overfitting and significantly enhances the detector's overall performance.
To enhance instance-level discriminative feature representations on synthetic samples, we integrate contrastive learning into our framework. Contrastive learning has demonstrated its effectiveness in a variety of tasks, including classification [2], identification [3], and self-supervised learning [4,5,6]. This learning paradigm is particularly adept at generating representations that can effectively differentiate between similar instances, even when labeled data are scarce, by ensuring that representations of similar data points are closely aligned while those of dissimilar points are distinctly separated. We argue that object representations refined through contrastive learning, which emphasize intra-class compactness and inter-class distinctions, can reduce the likelihood of misclassifying novel objects as belonging to similar classes. By combining sample synthesis with contrastive learning, we aim to produce a more diverse and comprehensive set of training instances, enabling the model to learn finer-grained and context-sensitive representations, thus boosting its overall performance and robustness.
In this paper, we present Extreme R-CNN, shown in Figure 1, a model tailored for object detection in situations where the amount of training samples is extremely restricted. We begin by decoupling the regression and classification branches of the Faster R-CNN [7] head network. Initial training is conducted on base categories that are rich in data, followed by fine-tuning on a few-shot dataset comprising both base and novel categories. To boost the model’s performance in scenarios with extremely limited raw samples, we utilize data-augmentation techniques to produce new synthetic samples based on object instances. Knowledge distillation is applied to guide the training of the model’s backbone and Feature Pyramid Network (FPN). Additionally, on the classification branch of the head network, we implement Siamese networks and triplet loss to improve the model’s classification performance for Regions of Interest (RoIs). Ultimately, our model achieves AP50 scores in the 1-shot setting on the PASCAL VOC [8] dataset that exceed those of the baseline (TFA [9]) in the 10-shot setting.
To the best of our knowledge, our work is the first to integrate sample synthesis and knowledge distillation within Faster R-CNN for few-shot object detection. We refer to the model as Extreme R-CNN, which utilizes a straightforward yet effective two-stage fine-tuning approach (TFA) for few-shot object detection. Extensive experiments demonstrate the effectiveness of our straightforward design, which improves AP across all shot settings (1, 2, 3, 5, and 10 shots), achieving up to a +15% gain in average precision on the PASCAL VOC benchmark and a +6.1% gain on the challenging Microsoft COCO [10] benchmark compared to the baseline TFA.
Our contributions are threefold:
  • By synthesizing new samples through the data augmentation of object instances and fine-tuning the modified Faster R-CNN model with these samples, we significantly improve the model’s average precision for novel categories.
  • Utilizing knowledge distillation, we enhance the model’s recognition capability for novel categories while ensuring that its performance on base categories remains unaffected.
  • We validated the effectiveness of our approach in enhancing the performance of few-shot object-detection models through extensive experiments on the Microsoft COCO and PASCAL VOC datasets.

2. Related Work

2.1. Few-Shot Object Detection

Few-shot learning (FSL) aims to recognize new categories using only a limited number of labeled samples. The core concept behind FSL is to transfer knowledge from data-rich base categories to novel categories with few examples. In contrast, few-shot object detection presents a greater challenge, as it involves both classification and localization, and it remains underexplored. Few-shot object detection must identify novel objects and localize them within an image using only a few labeled samples. Two main approaches address the challenging problem of few-shot object detection (FSOD). Existing models can primarily be categorized into the following two types based on their architecture: (1) Single-branch based models [9,11,12,13,14]. These models aim to train object detectors using long-tailed training datasets that include both sample-rich base categories and sample-poor novel categories. The number of classes to be detected determines the architecture of the detector’s last classification layer. In [9], Wang et al. demonstrate that a simple two-stage fine-tuning approach (TFA) outperforms more complex meta-learning methods. To address the imbalance in the training set, two primary strategies are employed: re-sampling [9] and reweighting [15]. Subsequent studies have introduced a variety of techniques to enhance few-shot object detection (FSOD). These advancements include multi-scale positive sample refinement [12], which improves the quality of positive samples across different scales; image hallucination [13], which generates additional training data to augment the limited samples available; contrastive learning [11], which helps in learning more discriminative feature representations by contrasting positive and negative samples; and the incorporation of linguistic semantic knowledge [14], which leverages external information to improve model understanding and generalization. (2) Two-branch based models [16,17,18,19,20,21,22]. These models process support and query images concurrently through a Siamese network architecture and compute the similarity between image proposal regions and few-shot samples for detection. In [21], Kang et al. were the first to introduce a feature-reweighting module designed to aggregate features from the support and query sets. Multifeature fusion networks [16,18,22,23,24] have been introduced to achieve more robust feature aggregation. In [18], Han et al. utilized attention mechanisms to focus on foreground regions and performed feature alignment between the two inputs. In [17], Graph Convolutional Networks (GCNs) were applied to facilitate mutual adaptation between the two branches of the network. Other works [20,25,26,27] have employed more advanced nonlocal attention or transformer mechanisms [28,29,30] to enhance similarity learning between two inputs. Despite these advances, many studies still consider fine-tuning-based approaches as strong baselines that often outperform meta-learning methods. This suggests that the representation of features learned from base categories can be effectively transferred to novel categories, and simple modifications to the box predictor can lead to significant performance improvements [31]. We discourage the use of complex algorithms, as they tend to overfit and yield poor test results in few-shot object detection (FSOD). Our insight is that the degradation of average precision (AP) for novel categories primarily results from misclassifying novel instances as similar base categories. 
To address this issue, we leverage a Siamese network along with triplet loss to develop more discriminative object proposal representations, without increasing the model’s complexity. This approach enhances the model’s ability to distinguish between novel and base categories, thereby improving classification accuracy for novel instances.
In this paper, we utilize TFA, a form of transfer learning. In addition, we introduce knowledge-distillation and sample-synthesis techniques.

2.2. Sample Synthesis

Sample synthesis, also known as data augmentation or synthetic data generation, is a vital technique in deep learning for enhancing the quality and quantity of training datasets. This approach tackles issues such as data scarcity, class imbalance, and overfitting, and it reduces both cost and time, since collecting and labeling numerous real-world samples in any domain can be tedious and resource intensive. In [32], Nilsson et al. enhanced their pedestrian-detection system by generating augmented images in which virtual pedestrians were overlaid onto real image backgrounds; combining these augmented data with raw data significantly improved detection accuracy compared to using raw data alone. In [33], Wong et al. introduced two methods for data augmentation: synthetic data sampling, which generates additional samples in feature space, and data warping, which creates new samples in data space. Evaluating both with convolutional neural networks trained by backpropagation, they found that data warping outperformed synthetic data sampling, provided that reasonable transformations could be identified for the data. In [34], Lemley et al. implemented smart augmentation, automatically combining images to enhance regularization; the method learns the optimal way to merge two or more examples from the same class, thereby identifying the most effective augmentation technique for a given dataset. In [35], Shijie et al. assessed the effectiveness of different data-augmentation methods for image classification, testing rotation, flipping, color jittering, PCA jittering, adding noise, and Generative Adversarial Networks (GANs); they concluded that rotation, flipping, GANs, and cropping outperformed the other methods. In [36], Fujita et al. presented a data-augmentation strategy based on image transformations: by adding noise to the images, they generated new samples, alleviating the scarcity of data within the existing feature space, and the model trained with the augmented dataset outperformed the model trained with the raw dataset. In [37], Lei et al. evaluated the impact of various augmentation parameters on the performance of deep learning models, including the type of augmentation method, the number of samples per class in the original dataset, and the augmentation rate. Their findings revealed that geometric transformations do not consistently enhance model performance; in fact, combining two geometric transformations often reduced generalization, whereas integrating geometric transformations with photometric transformations yielded significantly better outcomes. Medical image analysis often encounters the challenge of limited data, as raw images of specific diseases are not consistently plentiful. In [38], Namozov et al. demonstrated that data-augmentation techniques can significantly enhance the classification capabilities of models by expanding and diversifying a computed tomography (CT) scan image dataset.
Unlike the aforementioned data-augmentation techniques, our approach involves cropping instances of novel categories from images. We then apply color channel transformations, add Gaussian noise, perform random translation, and more to generate multiple new samples, thereby increasing the variance of novel categories within the dataset.

2.3. Knowledge Distillation

Knowledge distillation is a technique designed to produce a smaller, more efficient model (the student model) that achieves accuracy comparable to a larger model (the teacher model) while being less computationally demanding and more readily deployable on resource-constrained devices or systems. The procedure trains the smaller model to mimic the output or behavior of the larger model by distilling its knowledge, i.e., its valuable information. In [39], Buciluǎ et al. presented an algorithm that trains a single deep neural network to mimic the output of an ensemble of models. In [40], Ba and Caruana applied the idea of Buciluǎ et al. to compress deep neural networks into shallower but wider architectures. In [1], Hinton et al. used the teacher model's output as a 'soft label' to train the student model and introduced a temperature-scaled cross-entropy loss to replace the L2 loss; compared to [39], they applied knowledge distillation as a more general method. In [41], Romero et al. employed a two-stage strategy to train the student model, using the output of the teacher model's intermediate layers as 'hints'.
Inspired by the aforementioned works, during the fine-tuning phase of training the few-shot object detector, we utilize the backbone and FPN network pre-trained on data-rich base categories from the first stage as the teacher model.

3. Method

3.1. Overview of Our Proposed Model (Extreme R-CNN)

In this work, we propose the Extreme R-CNN model, designed for few-shot object-detection scenarios based on the Faster R-CNN architecture. Our approach falls under the category of two-stage fine-tuning approaches in the field of few-shot object detection. Although existing few-shot object-detection models [9,11,12,18] have achieved promising results, they often experience a significant decrease in average precision (AP) for base categories when fine-tuning on datasets that include both base and novel categories.
In contrast to existing few-shot learning approaches, our approach focuses on a two-stage fine-tuning strategy that integrates sample synthesis and knowledge distillation. Our approach is designed to handle few-shot scenarios more effectively by augmenting the training set with synthetic data and using knowledge distillation to transfer the learned representations from the pre-trained model in the first stage. This combination helps improve generalization and robustness, especially when the number of raw samples is extremely limited.
Figure 1 illustrates an overview of our model. It is built upon the Faster R-CNN object-detection architecture. In Section 3.2, we detail how we synthesize samples for novel categories, and in Section 3.3 we explain the feature-distillation process for base categories. The architecture of the model’s head network is presented in Section 3.4 and Section 3.5.

3.2. Synthesizing Samples

In the fine-tuning phase, we synthesize samples, taking inspiration from diffusion models and generative adversarial networks. This approach ensures the generalization performance of the model even when the number of raw samples is extremely limited. We crop instances from raw images and apply data-augmentation techniques including RGB color channel transformation, adding Gaussian noise, and introducing salt-and-pepper noise. The augmented instances are then subjected to random zero-padding to generate new samples that match the size of the original image. A sample graph of the synthesized samples is illustrated in Figure 2.
From Figure 2, it can be seen that the first column shows the raw samples. The second column shows samples synthesized from instances cropped out of these raw samples: the instances are augmented with color channel transformations, Gaussian noise, and salt-and-pepper noise, and their locations are randomly shifted with respect to the raw samples. Similarly, the third column shows either different instances or the same instance synthesized using other data-augmentation methods. Both the data-augmentation technique and the position translation are applied randomly at each sampling.
Compared to commonly used data-augmentation methods, our sample-synthesis method augments only the instances, which increases the number and diversity of training samples and improves the generalization performance of the model during fine-tuning. At the same time, randomly shifting the locations of the instances helps enhance the class-agnostic object-localization capability of the model.
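As a rough illustration of the procedure described above, the following NumPy sketch crops one annotated instance, applies one randomly chosen augmentation, and pastes the result at a random position on a zero-padded canvas of the original image size. The function and variable names are illustrative and are not taken from the authors' released code.

```python
import numpy as np

def synthesize_sample(image: np.ndarray, box: tuple, rng=np.random):
    """Crop one instance from a raw image, augment it, and paste it at a
    random location on a zero-padded canvas of the original image size.

    image: HxWx3 uint8 array; box: (x1, y1, x2, y2) instance annotation.
    Returns the synthetic image and the shifted bounding box.
    """
    h, w, _ = image.shape
    x1, y1, x2, y2 = box
    instance = image[y1:y2, x1:x2].astype(np.float32)

    # Randomly pick one augmentation for this sampling round.
    choice = rng.randint(3)
    if choice == 0:                                   # RGB channel permutation
        instance = instance[..., rng.permutation(3)]
    elif choice == 1:                                 # additive Gaussian noise
        instance = instance + rng.normal(0, 10, instance.shape)
    else:                                             # salt-and-pepper noise
        mask = rng.rand(*instance.shape[:2])
        instance[mask < 0.02] = 0
        instance[mask > 0.98] = 255
    instance = np.clip(instance, 0, 255).astype(np.uint8)

    # Random translation: paste onto a zero canvas of the original size.
    ih, iw = instance.shape[:2]
    ox = rng.randint(0, w - iw + 1)
    oy = rng.randint(0, h - ih + 1)
    canvas = np.zeros_like(image)
    canvas[oy:oy + ih, ox:ox + iw] = instance
    return canvas, (ox, oy, ox + iw, oy + ih)
```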

3.3. Knowledge Distillation for Base Categories

Knowledge Distillation (KD) is a machine learning technique that transfers knowledge from a complex model, known as the teacher model, to a simpler model, referred to as the student model. When fine-tuning on novel and base categories in the few-shot setting, we use the backbone and Feature Pyramid Network (FPN) pre-trained during the initial training phase as the teacher model, and we employ a hint loss to encourage the student model to mimic the feature representations learned by the intermediate layers of the teacher model. The weights obtained from the initial training phase thus guide the subsequent fine-tuning phase, which helps ensure that performance on the base categories remains robust. The mathematical formulation of the hint loss is provided in Equations (1) and (2).
$$f(x_i) = \begin{cases} 1, & \text{if } x_i \in D_{\text{BaseCategories}} \\ 0, & \text{if } x_i \notin D_{\text{BaseCategories}} \end{cases} \tag{1}$$

$$L_{\text{Hints}} = \frac{1}{\sum_{i=1}^{N} f(x_i)} \sum_{i=1}^{N} f(x_i) \, \big\| F_T(x_i) - F_S(x_i) \big\|^2 \tag{2}$$
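A minimal PyTorch sketch of Equations (1) and (2) is given below, assuming the teacher and student backbone/FPN produce features of identical shape and that each image carries a flag indicating whether it belongs to the base categories. All tensor and function names are illustrative.

```python
import torch

def hint_loss(feat_teacher: torch.Tensor,
              feat_student: torch.Tensor,
              is_base: torch.Tensor) -> torch.Tensor:
    """Equation (2): mean squared feature distance over base-category images.

    feat_teacher, feat_student: (N, C, H, W) features from the frozen teacher
    and the trainable student backbone/FPN.
    is_base: (N,) float tensor, 1 if the image belongs to D_BaseCategories,
    0 otherwise (Equation (1)).
    """
    # Per-image squared L2 distance between teacher and student features.
    per_image = ((feat_teacher.detach() - feat_student) ** 2).flatten(1).sum(dim=1)
    # Average only over base-category images; avoid division by zero.
    denom = is_base.sum().clamp(min=1.0)
    return (is_base * per_image).sum() / denom
```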

3.4. Decoupled Classification and Regression Heads

Decoupling the classification and regression head networks significantly enhances the overall accuracy of object-detection models [42,43], particularly in few-shot scenarios. This architectural choice allows each task, classification and regression, to be optimized independently, leading to more specialized and effective sub-networks for their respective objectives. In R-CNN-based detectors, two head architectures are commonly used for the regression and classification tasks: the convolutional head (conv-head) and the fully connected head (fc-head). Generally, the conv-head is better suited to localization, while the fc-head excels at classification. The fc-head has greater spatial sensitivity than the conv-head, which aids in distinguishing between complete and partial objects, but it is less robust when regressing the whole object. Overall, fully connected heads are ideal for classification, while convolutional heads excel at regression. During the fine-tuning phase, which relies on a small sample set, decoupling the classification and regression heads ensures that class-agnostic bounding-box regression is largely unaffected by the limited sample size. To enhance classification performance when fine-tuning with insufficient samples, a Siamese network and a triplet loss are introduced, allowing the classification head to learn a feature space that is more easily classifiable and thus improving classification accuracy. As a result, the decoupled head architecture is better suited to few-shot object-detection scenarios.
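The decoupled head can be sketched roughly as follows: a fully connected branch produces the RoI embedding used for classification (and, as described in Section 3.5, for the Siamese/triplet objective), while a convolutional branch performs class-agnostic box regression. This is an assumed layer configuration for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DecoupledRoIHead(nn.Module):
    """Separate fc-based classification and conv-based box-regression branches."""

    def __init__(self, in_channels=256, roi_size=7, num_classes=20, feat_dim=1024):
        super().__init__()
        # fc-head: better suited to classification.
        self.cls_branch = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        self.cls_score = nn.Linear(feat_dim, num_classes + 1)  # +1 for background
        # conv-head: better suited to class-agnostic box regression.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bbox_pred = nn.Linear(in_channels, 4)  # class-agnostic box deltas

    def forward(self, roi_feats):  # roi_feats: (R, C, roi_size, roi_size)
        cls_embed = self.cls_branch(roi_feats)  # also fed to the Siamese/triplet loss
        return self.cls_score(cls_embed), self.bbox_pred(self.reg_branch(roi_feats)), cls_embed
```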

3.5. Using Siamese Network and Triplet Loss in Classification Head

Deep learning models tend to overfit when there are insufficient sample data. After decoupling the classification and regression head networks, we introduce the Siamese network and triplet loss during fine-tuning with the few-shot dataset. This approach minimizes the distance between positive examples (samples from the same class) and maximizes the distance between negative examples (samples from different classes), thus enabling the model to learn a more discriminative feature space. The architecture of the Siamese network is shown in Figure 1, and the formula for the triplet loss function is provided as Equation (3).
$$L_{\text{Triplet}} = \max\!\left( \alpha + \frac{f(A) \cdot f(N)}{\|f(A)\|\,\|f(N)\|} - \frac{f(A) \cdot f(P)}{\|f(A)\|\,\|f(P)\|},\; 0 \right) \tag{3}$$
where α is a positive constant known as the margin, and f(A), f(P), and f(N) denote the embeddings of the anchor, positive, and negative RoI samples, respectively.
To summarize, the overall loss of the Extreme R-CNN model consists of the loss from the Faster R-CNN model, the distillation loss from the backbone and FPN, and the triplet loss from the Siamese network. The total training loss of the detector is denoted as L_Total in Equation (4):
$$L_{\text{Total}} = \lambda_d L_{\text{Hints}} + L_{\text{Faster R-CNN}} + \lambda_t L_{\text{Triplet}} \tag{4}$$
where λ_d and λ_t are hyperparameters that weight L_Hints and L_Triplet, respectively.
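Under the definitions above, Equations (3) and (4) can be sketched in PyTorch as follows; the margin and loss weights shown are placeholders rather than the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Equation (3): cosine-similarity triplet loss over RoI embeddings,
    averaged over a batch of triplets."""
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin + sim_an - sim_ap).mean()

def total_loss(loss_faster_rcnn: torch.Tensor,
               loss_hints: torch.Tensor,
               loss_triplet: torch.Tensor,
               lambda_d: float = 1.0,
               lambda_t: float = 1.0) -> torch.Tensor:
    """Equation (4): overall training objective of Extreme R-CNN."""
    return lambda_d * loss_hints + loss_faster_rcnn + lambda_t * loss_triplet
```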

4. Experiments

We have conducted extensive experiments on both PASCAL VOC and Microsoft COCO benchmarks. Extreme R-CNN demonstrates significant performance improvements over all fine-tuning-based approaches, with a substantial margin in any shot scenario across all dataset splits. We strictly adhere to a consistent few-shot detection dataset-construction and -evaluation protocol [9,12,21,23] to ensure a fair and direct comparison. In this section, we begin by detailing the few-shot detection setup employed in our study. We then present a comprehensive comparison of our approach with contemporary few-shot detection methods on the PASCAL VOC and Microsoft COCO benchmarks. Finally, we conduct ablation experiments to analyze the impact of various components and design choices on the performance of our model.

4.1. Experiment Configuration

4.1.1. Datasets

Our method has been evaluated on the benchmark datasets PASCAL VOC and Microsoft COCO, following the protocols established by previous works [9,21]. To ensure a fair comparison, we adopt the data splits and annotations as described by TFA [9]. For the PASCAL VOC dataset, the 20 categories are divided into three distinct groups, each consisting of 15 base and 5 novel categories. The training is conducted using all available data from the base categories in the PASCAL VOC 2007 and 2012 trainval sets. In the K-shot settings, where K takes values of 1, 2, 3, 5, and 10, we randomly sample instances from previously unseen novel categories during training. In line with prior studies [9,21,23], we use the same three random partitions of base and novel categories, referred to as Novel Split 1, 2, and 3, as outlined in [21], to maintain consistency across experiments. Regarding the Microsoft COCO dataset, we define 60 categories that do not overlap with PASCAL VOC as our base categories, while the remaining 20 categories are designated as novel categories. We evaluate the detection performance for 10-shot and 30-shot settings on 5 K images from the Microsoft COCO 2014 validation dataset, ensuring a comprehensive evaluation of the few-shot learning capabilities of our model.

4.1.2. Evaluation Metrics

For the PASCAL VOC dataset, we present the Average Precision (AP) at an Intersection over Union (IoU) threshold of 0.5 for both base categories (denoted as bAP) and novel categories (denoted as nAP), evaluated on the PASCAL VOC 2007 test set. For the Microsoft COCO dataset, we provide the mean AP averaged across IoU thresholds ranging from 0.5 to 0.95 for novel categories (nAP), along with the AP at an IoU threshold of 0.75 for the novel categories (nAP75); these metrics are evaluated on a subset of 5K images from the Microsoft COCO 2014 validation dataset.

4.1.3. Implementation Details

Our model’s code is built using the Detectron2 [44] framework, employing a ResNet-101 [45] backbone along with a Feature Pyramid Network (FPN) [46]. All experiments are conducted in JupyterLab with an NVIDIA Tesla V100-SXM3 GPU (NVIDIA, Santa Clara, CA, USA; NVIDIA Operations in Macau). We train all models using standard SGD with a momentum of 0.9 and a weight decay of 1 × 10−4. The learning rate is set to 0.02 for training base categories and 0.001 for training novel categories. The batch size is 16 for all training runs. The model was fine-tuned for {3000, 5000, 6000, 10,000, 12,000} iterations for K = {1, 2, 3, 5, 10} shots on the PASCAL VOC dataset, and for {30,000, 40,000} iterations for K = {10, 30} shots on the Microsoft COCO dataset. Unless otherwise specified, we maintain the same hyperparameters as those used in Faster R-CNN.
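For readers using Detectron2, the hyperparameters listed above roughly correspond to a solver configuration such as the sketch below. The authors' actual configuration files are not reproduced here, so treat this as an assumed mapping of the stated settings onto standard Detectron2 config keys.

```python
from detectron2.config import get_cfg
from detectron2 import model_zoo

# Hypothetical fine-tuning configuration mirroring the hyperparameters above
# (not the Extreme R-CNN release; the base config is a standard model-zoo file).
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"))
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.WEIGHT_DECAY = 1e-4
cfg.SOLVER.IMS_PER_BATCH = 16          # batch size for all training runs
cfg.SOLVER.BASE_LR = 0.001             # 0.02 for base training, 0.001 for fine-tuning
cfg.SOLVER.MAX_ITER = 3000             # e.g., 1-shot PASCAL VOC fine-tuning
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20   # 15 base + 5 novel classes per VOC split
```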

4.2. Main Results

4.2.1. PASCAL VOC

As shown in Table 1, the Extreme R-CNN model achieves the second best average AP among the existing methods, trailing only the VFA model. Extreme R-CNN achieves the best results in 4 out of 15 settings and the second best in 11 out of 15 settings. Across the three novel splits, Extreme R-CNN outperforms the baseline (TFA w/cos) by relative margins ranging from 15.8% to 72.5%. Notably, our 1-shot result in Novel Split 1 even surpasses the baseline's (TFA w/cos) 10-shot result (57.0% vs. 56.0%), indicating that Extreme R-CNN is particularly effective in data-scarce scenarios. Moreover, the improvement is consistently stable across all Novel Split sets, demonstrating that Extreme R-CNN is not biased towards any specific subset of categories and has strong generalization capabilities. Additionally, Extreme R-CNN achieves a mean average precision of 54.9%, improving upon the baseline (TFA w/cos) by 15 percentage points (from 39.9% to 54.9%), which further underscores its effectiveness. Meanwhile, as Table 1 also shows, our model performs slightly worse than the VFA method. We note that VFA follows a meta-learning paradigm that employs variational feature aggregation on samples, and this way of augmenting samples may be better suited to meta-learning. However, VFA requires training multiple models on a variational feature-aggregation dataset and then selecting the optimal one, making its training process more complex and computationally demanding than our approach.

4.2.2. Microsoft COCO

The few-shot detection results for Microsoft COCO are presented in Table 2. Our Extreme R-CNN achieves the second best nAP among the fine-tuning-based methods under the same testing protocol and metrics. Compared to the baseline (TFA w/cos), our model shows improvements of +6.5% nAP in the 10-shot scenario and +5.6% nAP in the 30-shot scenario. Based on the results in Table 1 and Table 2, our Extreme R-CNN model achieves the second best average AP on the PASCAL VOC dataset and likewise the second best results on the Microsoft COCO dataset. This consistent performance across the two datasets highlights the effectiveness of our model in enhancing few-shot object detection. The improvement in the 10-shot setting is more pronounced than in the 30-shot setting, suggesting that our proposed model provides a greater benefit when the sample size is smaller. Similarly, our method outperforms DeFRCN in Table 1 but performs slightly worse than it in Table 2. Compared to the Microsoft COCO dataset, PASCAL VOC has significantly fewer categories and samples, which suggests that the effectiveness of our method in improving model performance decreases as the number of categories and the sample size increase. This observation highlights the need for further research to enhance our method's performance on more complex datasets.

4.3. Ablation Study

To decouple the classification and regression processes within the detector’s head network during the fine-tuning stage, we modified the Faster R-CNN model by bifurcating the head network into two distinct branches: one specialized for classification and the other for regression. Subsequently, we performed base training on this modified Faster R-CNN model using the base categories dataset, and then proceeded with fine-tuning on the novel categories dataset. The results from the fine-tuning process are depicted in the third row of Table 3.
We perform ablation experiments to demonstrate the effectiveness of the primary components proposed in Extreme R-CNN. Our work proposes three components: sample synthesis (SS), knowledge distillation (KD), and the Siamese network with triplet loss (SNTL). We utilize ResNet-101 with a Feature Pyramid Network (FPN) as the backbone and gradually incorporate each component into the baseline model according to the parameter configurations detailed in Section 4.1. Unless otherwise specified, all ablation experiments are performed on PASCAL VOC Novel Split 1, and the results are shown in Table 3.

4.3.1. Synthesizing Samples

To verify the effectiveness of synthesizing samples, we utilized synthesized data for fine-tuning the baseline TFA model. As demonstrated in the fourth row of Table 3, this approach improved the average score of nAP50 by 8.1% (from 51.1% to 59.2%). This indicates that the method of synthesizing samples is effective in enhancing the performance of an object detector trained with few-shot samples. Notably, the enhancement is more pronounced in the 1-shot scenario. We attribute this to the increased diversity brought about by data augmentation applied to the synthesized samples, which boosts the model’s generalization capability.

4.3.2. Knowledge Distillation

To evaluate the impact of knowledge distillation on detector performance, we integrated a teacher model into the baseline during the fine-tuning phase. The fifth row of Table 3 presents the results of the baseline with knowledge distillation. We observe that knowledge distillation improves the average nAP50 score by 0.6% (from 59.4% to 60.0%). The results show that while knowledge distillation does not significantly improve the nAP50 of the model, it effectively preserves the bAP50 for the base categories without a significant drop. Because knowledge distillation is applied only to the base categories during the fine-tuning phase, and the novel categories are unknown to the model pre-trained in the first phase, the teacher model cannot guide the training of novel categories. Therefore, knowledge distillation does not significantly improve the recognition accuracy of the model for novel categories. The detailed data are presented in Table 4.

4.3.3. Siamese Network and Triplet Loss (SNTL)

To understand the significance of SNTL, we incorporate the Siamese network into the classification head network of Faster R-CNN during the fine-tuning stage. Triplet loss is used to ensure that the distance between RoIs of the same class is closer than the distance between RoIs of different classes by satisfying a predefined margin. This enhances the representation space of the classification head network for RoIs, thus improving the overall classification performance. As shown in the sixth row of Table 3, the implementation of the SNTL improves the average nAP50 score by 2.8% (from 60.0% to 62.8%).

5. Conclusions and Future Work

In this paper, we propose Extreme R-CNN, an extension of Faster R-CNN specifically designed for few-shot object detection. Extreme R-CNN adopts a two-stage fine-tuning approach (TFA) that integrates knowledge distillation and a Siamese network, and during the fine-tuning phase the model is trained with synthetic samples. The effectiveness of our method for few-shot object detection has been validated through extensive experiments on the PASCAL VOC and Microsoft COCO datasets. In particular, Extreme R-CNN quickly learns to detect and recognize new classes even with extremely limited data, such as in the one-shot scenario. Our model therefore has important applications in situations where it is difficult to obtain multiple samples.
There are two major limitations of our approach that could be addressed in future studies. First, as the number of samples in each category increases, the effect of sample synthesis on improving model performance diminishes. In other words, with a large sample size, the impact of sample synthesis on enhancing model performance is not significant. Second, synthesizing new samples by cutting and pasting instances from sample images is not effective for identifying small objects within the samples. In the future, we will continue to investigate the impact of sample synthesis on improving model performance from multiple perspectives, including different synthesis methods, and the relationship between the number of synthesized samples and model performance. Additionally, we will explore how to ensure that the model’s ability to recognize base categories does not decrease significantly when new categories are added. We also plan to apply our method to different models and datasets.

Author Contributions

Conceptualization, S.Z. (Shenyong Zhang) and W.W.; Formal analysis, S.Z. (Shenyong Zhang); Funding acquisition, W.W.; Investigation, Z.W., H.L., R.L. and S.Z. (Shixiong Zhang); Methodology, S.Z. (Shenyong Zhang) and W.W.; Project administration, W.W.; Software, S.Z. (Shenyong Zhang); Supervision, W.W.; Validation, Z.W., H.L., R.L. and S.Z. (Shixiong Zhang); Writing—original draft, S.Z. (Shenyong Zhang); Writing—review & editing, S.Z. (Shenyong Zhang) and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Development Fund (FDCT) of Macau under Grant No. 0071/2022/A and No. 0095/2023/RIA2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Some or all data, models or code generated or used during the study are available from the first author and the corresponding author by request.

Conflicts of Interest

Author Ruochen Li was employed by the company Guangdong BOHUA UHD Video Innovation Center Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39. [Google Scholar]
  2. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  3. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. Adv. Neural Inf. Process. Syst. 2014, 27, 1988–1996. [Google Scholar]
  4. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742. [Google Scholar]
  5. Xie, J.; Zhan, X.; Liu, Z.; Ong, Y.S.; Loy, C.C. Delving into inter-image invariance for unsupervised visual representations. Int. J. Comput. Vis. 2022, 130, 2994–3013. [Google Scholar] [CrossRef]
  6. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  9. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar]
  10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Proceedings, Part V 13, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  11. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7352–7362. [Google Scholar]
  12. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Proceedings, Part XVI 16, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 456–472. [Google Scholar]
  13. Zhang, W.; Wang, Y.X. Hallucination improves few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13008–13017. [Google Scholar]
  14. Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic relation reasoning for shot-stable few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8782–8791. [Google Scholar]
  15. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2999–3007. [Google Scholar]
  16. Fan, Q.; Zhuo, W.; Tang, C.K.; Tai, Y.W. Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4013–4022. [Google Scholar]
  17. Han, G.; He, Y.; Huang, S.; Ma, J.; Chang, S.F. Query adaptive few-shot object detection with heterogeneous graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3263–3272. [Google Scholar]
  18. Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta Faster R-CNN: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Enter Virtual, 22 February–1 March 2022; Volume 36, pp. 780–789. [Google Scholar]
  19. Han, G.; Chen, L.; Ma, J.; Huang, S.; Chellappa, R.; Chang, S.F. Multi-modal few-shot object detection with meta-learning-based cross-modal prompting. arXiv 2022, arXiv:2204.07841. [Google Scholar]
  20. Hsieh, T.I.; Lo, Y.C.; Chen, H.T.; Liu, T.L. One-shot object detection with co-attention and co-excitation. Adv. Neural Inf. Process. Syst. 2019, 32, 2725–2734. [Google Scholar]
  21. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429. [Google Scholar]
  22. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586. [Google Scholar]
  23. Xiao, Y.; Lepetit, V.; Marlet, R. Few-shot object detection and viewpoint estimation for objects in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3090–3106. [Google Scholar] [CrossRef] [PubMed]
  24. Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. An Attention Mechanism Based AVOD Network for 3D Vehicle Detection. IEEE Trans. Intell. Veh. 2023, 8, 1–13. [Google Scholar] [CrossRef]
  25. Chen, D.J.; Hsieh, H.Y.; Liu, T.L. Adaptive image transformer for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12247–12256. [Google Scholar]
  26. Chen, T.I.; Liu, Y.C.; Su, H.T.; Chang, Y.C.; Lin, Y.H.; Yeh, J.F.; Chen, W.C.; Hsu, W.H. Dual-awareness attention for few-shot object detection. IEEE Trans. Multimed. 2021, 25, 291–301. [Google Scholar] [CrossRef]
  27. Doersch, C.; Gupta, A.; Zisserman, A. Crosstransformers: Spatially-aware few-shot transfer. Adv. Neural Inf. Process. Syst. 2020, 33, 21981–21993. [Google Scholar]
  28. Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  29. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  30. Liu, Q.; Pi, J.; Gao, P.; Yuan, D. STFNet: Self-Supervised Transformer for Infrared and Visible Image Fusion. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 1513–1526. [Google Scholar] [CrossRef]
  31. Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A baseline for few-shot image classification. arXiv 2019, arXiv:1909.02729. [Google Scholar]
  32. Nilsson, J.; Andersson, P.; Gu, I.Y.H.; Fredriksson, J. Pedestrian detection using augmented training data. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; IEEE: New York, NY, USA, 2014; pp. 4548–4553. [Google Scholar]
  33. Wong, S.C.; Gatt, A.; Stamatescu, V.; McDonnell, M.D. Understanding data augmentation for classification: When to warp? In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 30 November–2 December 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar]
  34. Lemley, J.; Bazrafkan, S.; Corcoran, P. Smart augmentation learning an optimal data augmentation strategy. IEEE Access 2017, 5, 5858–5869. [Google Scholar] [CrossRef]
  35. Shijie, J.; Ping, W.; Peiyi, J.; Siping, H. Research on data augmentation for image classification based on convolution neural networks. In Proceedings of the 2017 Chinese automation congress (CAC), Jinan, China, 20–22 October 2017; IEEE: New York, NY, USA, 2017; pp. 4165–4170. [Google Scholar]
  36. Fujita, K.; Kobayashi, M.; Nagao, T. Data augmentation using evolutionary image processing. In Proceedings of the 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; IEEE: New York, NY, USA, 2018; pp. 1–6. [Google Scholar]
  37. Lei, C.; Hu, B.; Wang, D.; Zhang, S.; Chen, Z. A preliminary study on data augmentation of deep learning for image classification. In Proceedings of the 11th Asia-Pacific Symposium on Internetware, Fukuoka, Japan, 28–29 October 2019; pp. 1–6. [Google Scholar]
  38. Namozov, A.; Im Cho, Y. An improvement for medical image analysis using data enhancement techniques in deep learning. In Proceedings of the 2018 International Conference on Information and Communication Technology Robotics (ICT-ROBOT), Busan, Republic of Korea, 6–8 September 2018; IEEE: New York, NY, USA, 2018; pp. 1–3. [Google Scholar]
  39. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [Google Scholar]
  40. Ba, J.; Caruana, R. Do deep nets really need to be deep? Adv. Neural Inf. Process. Syst. 2014, 27, 2654–2662. [Google Scholar]
  41. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  42. Zhang, S.; Wang, W.; Li, H.; Zhang, S. Bounding convolutional network for refining object locations. Neural Comput. Appl. 2023, 35, 19297–19313. [Google Scholar] [CrossRef]
  43. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking Classification and Localization for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  44. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 10 November 2022).
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  47. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. LSTD: A Low-Shot Transfer Detector for Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2 February 2018. [Google Scholar]
Figure 1. Overview of the Extreme R-CNN Framework. In the base training phase, the whole object detector, comprising the backbone, FPN, and head network, is jointly trained on the base classes. During this initial stage, the parameters of the teacher model are trained on these base classes. In the few-shot fine-tuning phase, the backbone and FPN are guided by the intermediate representations learned by the teacher model, which enhances the training process and boosts the representation capabilities of the student model. Furthermore, the box predictor is fine-tuned on a balanced dataset that includes both base and novel classes, augmented with synthesized samples. We decouple the classification and box regression of RoIs in the head network. For classification, we introduce a Siamese network and a triplet loss. Note that the RPN branch is not visualized for simplicity.
Figure 2. A sample graph of the synthesized samples.
Table 1. Experimental results on the VOC dataset. We evaluate the performance of Extreme R-CNN (AP50) on three different splits. The best results are in red and the second best results are in blue.
Method / Shot          | Novel Split 1 (1 / 2 / 3 / 5 / 10)     | Novel Split 2 (1 / 2 / 3 / 5 / 10)     | Novel Split 3 (1 / 2 / 3 / 5 / 10)     | Avg.
TFA w/cos (Baseline)   | 39.8 / 36.1 / 44.7 / 55.7 / 56.0       | 23.5 / 26.9 / 34.1 / 35.1 / 39.1       | 30.8 / 34.8 / 42.8 / 49.5 / 49.8       | 39.9
FSCE                   | 44.2 / 43.8 / 51.4 / 61.9 / 63.4       | 27.3 / 29.5 / 43.5 / 44.2 / 50.2       | 37.2 / 41.9 / 47.5 / 54.6 / 58.5       | 46.6
DeFRCN                 | 53.6 / 57.5 / 61.5 / 64.1 / 60.8       | 30.1 / 38.1 / 47.0 / 53.3 / 47.9       | 48.4 / 50.9 / 52.3 / 54.9 / 57.4       | 51.9
VFA                    | 57.7 / 64.6 / 64.7 / 67.2 / 67.4       | 41.4 / 46.2 / 51.1 / 51.8 / 51.6       | 48.9 / 54.8 / 56.6 / 59.0 / 58.9       | 56.1
Extreme R-CNN (Ours)   | 57.0 / 58.3 / 62.7 / 65.3 / 66.2       | 40.2 / 46.4 / 47.8 / 52.4 / 53.8       | 49.5 / 52.3 / 54.4 / 57.3 / 59.4       | 54.9
Improvement (%)        | +43.2 / +61.5 / +40.3 / +17.2 / +18.2  | +71.1 / +72.5 / +40.2 / +49.3 / +37.6  | +60.1 / +50.3 / +27.1 / +15.8 / +19.3  | +37.5
Table 2. Experimental results on the Microsoft COCO dataset. The backbone is the same as in Table 1. The best results are in red and the second best results are in blue.
Method                 | nAP (10-shot) | nAP (30-shot)
LSTD [47]              | 3.2           | 6.7
FSRW                   | 5.6           | 9.1
MetaDet                | 7.1           | 11.3
Meta-RCNN              | 8.7           | 12.4
MPSR                   | 9.8           | 14.1
TFA w/cos (Baseline)   | 10.0          | 13.7
FSCE                   | 11.9          | 16.4
FADI                   | 12.2          | 16.1
VFA                    | 16.2          | 18.9
DeFRCN                 | 18.5          | 22.6
Extreme R-CNN (ours)   | 16.5          | 19.3
Table 3. Ablation for primary components proposed in Extreme R-CNN.
Method                   | SS | KD | SNTL | Novel AP50 Split 1 (1 / 5 / 10) | Avg.
TFA w/cos (Baseline)     | –  | –  | –    | 39.8 / 55.7 / 56.0              | 50.5
TFA w/cos (Our reimpl.)  | –  | –  | –    | 40.2 / 55.9 / 57.2              | 51.1
Extreme R-CNN (Ours)     | ✓  | –  | –    | 51.1 / 62.7 / 64.3              | 59.4
Extreme R-CNN (Ours)     | ✓  | ✓  | –    | 51.8 / 63.3 / 65.0              | 60.0
Extreme R-CNN (Ours)     | ✓  | ✓  | ✓    | 57.0 / 65.3 / 66.2              | 62.8
Table 4. Base category forgetting comparison on PASCAL VOC Split 1. Prior to fine-tuning, the base AP50 achieved during base training is 81.2. Bold indicates the best value.
Method                      | Base AP50 (1 / 3 / 5 / 10)  | Novel AP50 (1 / 3 / 5 / 10)
MPSR [12]                   | 59.4 / 67.8 / 68.4 / –      | 41.7 / 51.4 / 55.2 / 61.8
FSCE [11]                   | 78.9 / 74.1 / 76.6 / –      | 44.2 / 51.4 / 61.9 / 63.4
TFA w/cos [9] (Baseline)    | 79.1 / 77.3 / 77.0 / –      | 39.8 / 44.6 / 55.7 / 56.0
TFA w/cos [9] (Our reimpl.) | 78.8 / 77.5 / 77.2 / 77.0   | 40.2 / 45.3 / 55.9 / 57.2
TFA w/cos [9] + SS          | 77.5 / 76.7 / 76.0 / 75.4   | 51.1 / 56.3 / 62.7 / 64.3
TFA w/cos [9] + SS + KD     | 78.4 / 77.3 / 77.0 / 76.5   | 51.8 / 56.9 / 63.3 / 65.0
Extreme R-CNN (Ours)        | 78.2 / 77.0 / 76.7 / 76.3   | 57.0 / 62.7 / 65.3 / 66.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

