1. Introduction
Conventional neural image classifiers have demonstrated exceptional performance across diverse setups and benchmarks. However, they still rely heavily on abundant labeled training data and, more importantly, struggle to transfer the knowledge they have learned to novel target categories.
Inspired by humans’ ability to learn from only a few examples, few-shot image classification has been proposed. It aims to identify target categories by adapting knowledge previously learned from source categories. This knowledge is typically stored in a deep embedding model used to match generic support and query image pairs. A promising line of work leverages meta-learning frameworks to optimize the initialization of model parameters [1,2,3] or to replace standard linear classifier heads [4,5,6] with class-agnostic distance functions.
Although these methods achieve strong performance, the key challenge in few-shot image classification remains overcoming the inductive bias inherited from the source classes so that the hypothesis space can be tailored to the few training instances of new target classes. At the same time, one or a few instances are rarely sufficient to capture the data distribution of a novel class, which can lead to flawed embeddings. To address these issues, refs. [7,8,9] introduce variations into the training data to encourage the model to learn a more general and adaptive representation; augmenting the learned embeddings with such data variety forces the model to acquire more comprehensive and discriminative embeddings. However, as depicted in the left images of Figure 1a, randomly occluded samples may either fully expose or entirely obscure crucial regions, which can hamper the model’s acquisition of new features and impair its ability to discriminate among categories and generalize effectively. As illustrated in the right images of Figure 1a, these approaches also only mix a given random pair of inputs and do not fully exploit the rich supervisory signal in the training data.
To address these challenges in few-shot learning, this paper introduces three modules: (1) Mask Mix (M-Mix), (2) Saliency Fuse (SF), and (3) Confident Clip Selector (CCS). The M-Mix module masks images under the guidance of saliency maps and then blends them, as shown in the upper image of Figure 1b, producing images that are more representative and carry richer information. Humans can identify objects from only a part of them; these images mimic that scenario, enabling the model to acquire a similar ability and learn better embeddings. To move closer to such a half-occluded scene, we pass the query through the CCS module so that it, too, contains only part of the object. The SF module, in turn, systematically examines how distinct salient regions across all input data can be mixed and matched, ensuring that each produced mixup example incorporates as many salient regions from diverse input sources as possible while preserving diversity among the resulting mixup examples.
The main contributions of this paper are as follows:
- (i)
We introduce a novel approach called M-Mix, designed to generate positive and negative samples for contrastive learning. By applying masks and merging images, this method enables the model to recognize object categories from parts of objects, reduces erroneous pixel correlations, and enhances the model’s generalization capability.
- (ii)
The proposed CCS method generates a larger number of higher-quality query instances. The increased diversity of training samples helps the model cope with noise, generalize to unseen data, and improve overall performance.
- (iii)
Our method performs excellently on the miniImageNet, tieredImageNet, FC100, and CUB datasets, surpassing state-of-the-art methods by approximately 0.2–1%. These results demonstrate the significant potential and advantages of our method for few-shot image classification.
2. Related Works
This work involves the application of meta-learning [1] and data augmentation to few-shot learning. To address the problems of insufficient samples and weak feature representations in few-shot image classification, we propose a data augmentation method. This section reviews the work on few-shot learning most relevant to this study.
Few-Shot Learning
Few-shot learning has garnered increasing attention in recent years, with current research in this field broadly divided into two categories: gradient-based methods and metric-based methods.
Gradient-based methods enhance the generalization ability and performance of the model by promoting faster and more accurate adaptation to new tasks and new data through gradient descent. The goal is to extract general knowledge from various few-shot tasks and quickly adapt to new ones. Meta-Baseline [1] enables rapid adaptation and high accuracy on new categories or tasks by meta-learning the evaluation metric of a whole-classification pre-trained model. LSFSL [2] improves the robustness and generalization ability of few-shot learning by exploiting implicit prior information in the data to learn more generalized features.
The main objective of metric learning is to measure similarities or distances between samples in the feature space more accurately, minimizing intra-class distance while maximizing inter-class distance. Traditional methods typically consist of two components: a feature extractor and a classifier. ProtoNet [4] represents each category by the mean vector of its support embeddings in the space learned by the neural network, serving as the prototype for that category, and has inspired much subsequent research. DeepEMD [5] integrates the Earth Mover’s Distance (EMD) into the network and designs a cross-reference mechanism to mitigate the impact of complex backgrounds and intra-class appearance variations, achieving end-to-end training for few-shot image classification with an optimal-matching-based distance metric over image regions under a small number of labeled samples.
3. Problem Definition
In this section, we formalize the few-shot recognition setting. Let $X_{train}$, $X_{val}$, and $X_{test}$ denote the training, validation, and test sets, respectively, with corresponding label sets $Y_{train}$, $Y_{val}$, and $Y_{test}$; the full datasets are written as $D_{train} = \{X_{train}, Y_{train}\}$, $D_{val} = \{X_{val}, Y_{val}\}$, and $D_{test} = \{X_{test}, Y_{test}\}$. Let $C_{train}$, $C_{val}$, and $C_{test}$ denote the classes found within the training, validation, and test sets, respectively. Few-shot learning (FSL) differs from the conventional supervised learning paradigm in that the classes present in the training and test sets are entirely distinct, i.e., $C_{train} \cap C_{test} = \emptyset$.
The primary challenge in few-shot learning is to enable a model to generalize well from a limited number of training examples. This involves developing methods that can effectively leverage the small dataset to achieve high performance on unseen data. Our approach addresses this by introducing the Mask Mixup Model and Enhanced Contrastive Learning techniques, which focus on improving the representation and discrimination of the limited training data.
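To make the episodic protocol concrete, the sketch below samples a standard N-way K-shot episode from a class-disjoint split; the data structures, helper name, and episode sizes are illustrative assumptions based on the common protocol rather than details given in this section.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, rng=None):
    """Sample an N-way K-shot episode.

    dataset: list of (image, label) pairs drawn from classes disjoint
             from those seen at training time (C_train ∩ C_test = ∅).
    Returns support and query lists of (image, label) pairs.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)

    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for lbl in classes:
        imgs = rng.sample(by_class[lbl], k_shot + q_queries)
        support += [(img, lbl) for img in imgs[:k_shot]]
        query   += [(img, lbl) for img in imgs[k_shot:]]
    return support, query

# Example with a toy dataset of 20 classes, 30 samples each.
toy = [(f"img_{c}_{i}", c) for c in range(20) for i in range(30)]
support, query = sample_episode(toy, n_way=5, k_shot=1, q_queries=15)
print(len(support), len(query))   # 5 and 75
```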
4. Methods
4.1. Method Overview
We introduce a novel model called the Mask Mixup Model (as illustrated in Figure 2a), which comprises three key modules. The first, M-Mix, optimizes an infoNCE loss adapted to few-shot learning and leverages augmentation to mine challenging samples. The second, CCS, uses a confidence-based criterion to select the desired query samples. The third, SF, removes the masking step of M-Mix and is primarily aimed at injecting noise so that the model learns better embeddings.
Data augmentation techniques play a crucial role in improving model generalization. In this paper, we propose the Mask Mixup and Saliency Fuse methods, which selectively retain and blend important features from images to generate more diverse and representative training samples. Additionally, the Confident Clip Selector enhances the diversity of training data by randomly cropping and selecting high-quality query samples. Our approach outperforms traditional methods such as random occlusion and mixup interpolation by effectively preserving important image information and clarifying class boundaries, thereby significantly boosting model performance in few-shot image classification tasks.
One of the challenges in contrastive learning lies in the selection of informative positive and negative samples. We build contrastive pairs from the support features. Since each query instance is labeled, for each query instance $q_i$ we treat support instances with the same label as positives and support instances with different labels as negatives. The same embedding network $f_\theta$ is used for both query and support instances. Our loss function is based on

$\mathcal{L}_i = -\log \frac{\sum_{j:\, y_j = y_i} \exp\big(f_\theta(q_i) \cdot f_\theta(s_j)\big)}{\sum_{j:\, y_j = y_i} \exp\big(f_\theta(q_i) \cdot f_\theta(s_j)\big) + \sum_{k:\, y_k \neq y_i} \exp\big(f_\theta(q_i) \cdot f_\theta(s_k)\big)},$

where $\mathcal{L}_i$ denotes the loss calculated for the $i$-th query sample. Here, $f_\theta(q_i)$ refers to the feature representation of the $i$-th query sample, while $f_\theta(s_j)$ and $f_\theta(s_k)$ represent the feature representations of support samples $j$ and $k$, respectively. Sample $j$ shares the same label $y_i$ as the query sample $i$, while $k$ has a different label $y_k \neq y_i$.
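A minimal PyTorch sketch of this support-based contrastive loss is given below; the function name, the plain dot-product similarity without a temperature, and the tensor shapes are our assumptions, not the authors’ released implementation.

```python
import torch
import torch.nn.functional as F

def support_contrastive_loss(query_feats, query_labels, support_feats, support_labels):
    """Hypothetical sketch of the support-based infoNCE loss described above.

    query_feats:    (Q, d) embeddings of query samples
    query_labels:   (Q,)   labels of query samples
    support_feats:  (S, d) embeddings of support samples
    support_labels: (S,)   labels of support samples
    """
    # Dot-product similarity between every query and every support embedding.
    sim = query_feats @ support_feats.t()                        # (Q, S)
    # pos_mask[i, j] = 1 if support j shares the label of query i.
    pos_mask = (query_labels[:, None] == support_labels[None, :]).float()

    # -log( sum_pos exp(sim) / sum_all exp(sim) ), computed in a numerically stable way.
    log_all = torch.logsumexp(sim, dim=1)                        # (Q,)
    log_pos = torch.logsumexp(sim.masked_fill(pos_mask == 0, float('-inf')), dim=1)
    return (log_all - log_pos).mean()

# Usage with random features for one 5-way episode (5 queries, 25 supports).
q = F.normalize(torch.randn(5, 64), dim=1)
s = F.normalize(torch.randn(25, 64), dim=1)
ql = torch.arange(5)
sl = torch.arange(5).repeat_interleave(5)
print(support_contrastive_loss(q, ql, s, sl))
```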
4.2. Mask Mix
Support images act as reference templates. Two images with differing labels from the query are chosen as negative samples, and two images sharing the same labels as the query are selected as positive samples saliency maps are generated for each image, facilitating the identification of the most significant information. We then semi-randomly mask the images: dividing each image into
blocks, designating the top
prominent blocks for labeling, randomly masking
blocks among these designated
blocks, and randomly masking
blocks within less prominent regions.
and
are randomly chosen, following the constraint
, where
M is a fixed value. Subsequently, we fuse the two positive samples and two negative samples separately. Each block on one image corresponds to a block on the other image, as illustrated in
Figure 2b. If both blocks represent either prominent or less prominent regions, they undergo mixup. Given two pairs of samples (input-label pairs)
and
, where
and
are input samples, and
and
are their corresponding labels, mixup creates new training samples and labels through linear interpolation, expressed as:
, where
is a random weight sampled from a Beta distribution.
If one block is significant while the other is not, the features of the significant region are prioritized and preserved during blending, ensuring that the model focuses on the crucial information in the image. If either block is masked, it is directly substituted by the corresponding block of the other image. Let $x_1$ and $x_2$ be two input images, written as $x_1 = \{x_1^{l}, x_1^{h}\}$ and $x_2 = \{x_2^{l}, x_2^{h}\}$, where $x_1^{l}$ and $x_2^{l}$ denote the non-prominent regions, $x_1^{h}$ and $x_2^{h}$ denote the prominent regions, and the masks are represented as $M$. The block-wise fusion can be written as

$\tilde{x}(b) = \begin{cases} \lambda\, x_1(b) + (1 - \lambda)\, x_2(b), & \text{if } x_1(b) \text{ and } x_2(b) \text{ are both prominent or both non-prominent}, \\ x_i(b), & \text{if only } x_i(b)\ (i \in \{1, 2\}) \text{ is prominent}, \\ x_j(b), & \text{if } x_i(b) \text{ is masked by } M \text{ and } x_j(b)\ (j \neq i) \text{ is not}. \end{cases}$
This operation generates contrasting samples that emphasize differences, enabling the model to prioritize crucial image regions and thereby enhancing its capacity to grasp pivotal information. The inner product of the query and the fused support is expressed as follows:

$F(q_i, \tilde{s}) = f_\theta(q_i) \cdot f_\theta(\tilde{s}), \qquad \tilde{s} = \mathrm{Fuse}\big(s^{(1)}, s^{(2)}\big),$

where $s^{(1)}$ and $s^{(2)}$ represent the first and second of the two supports we picked out, respectively. The fused infoNCE can be written as:

$\mathcal{L}_i = -\log \frac{\sum_{j:\, y_{s_j} = y_{q_i}} \exp\big(F(q_i, s_j)\big)}{\sum_{j:\, y_{s_j} = y_{q_i}} \exp\big(F(q_i, s_j)\big) + \sum_{k:\, y_{s_k} \neq y_{q_i}} \exp\big(F(q_i, s_k)\big)},$

where $\mathcal{L}_i$ represents the loss value for the $i$-th query sample, $y_{q_i}$ denotes the label of the $i$-th query sample, and $y_{s_j}$ denotes the label of the $j$-th support sample. $F$ denotes a similarity measure between two samples computed by the model; for example, $F(q_i, s_j) = f_\theta(q_i) \cdot f_\theta(s_j)$ is the similarity between the query sample $q_i$ and the support sample $s_j$, where $f_\theta$ is the feature extraction function. The numerator sums the exponentiated similarity scores of all support samples whose labels match that of the query sample $q_i$, while the denominator additionally includes the exponentiated similarity scores of the support samples whose labels do not match. In other words, the numerator consists of the similarity scores of support samples that share the label of query sample $i$, and the denominator includes the similarity scores of all support samples regardless of their labels. This contrastive loss maximizes the similarity between samples with the same label while minimizing the similarity between samples with different labels.
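To make the block-wise masking and fusion concrete, here is a small NumPy sketch of the M-Mix procedure described above; the grid size, saliency scoring, Beta parameters, and helper names are illustrative assumptions rather than the paper’s exact implementation (float images in [0, 1] are assumed).

```python
import numpy as np

def m_mix(img1, img2, sal1, sal2, n=10, top_t=25, M=25, lam=None, rng=None):
    """Illustrative M-Mix fusion of two images guided by their saliency maps.

    img1, img2: (H, W, C) float images; sal1, sal2: (H, W) saliency maps.
    n: grid size (n x n blocks); top_t: blocks treated as prominent;
    M: total number of masked blocks per image; lam: mixup weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(2.0, 2.0) if lam is None else lam
    H, W, _ = img1.shape
    bh, bw = H // n, W // n

    def block_info(sal):
        # Mean saliency per block; mark the top_t blocks as prominent.
        scores = sal[:bh * n, :bw * n].reshape(n, bh, n, bw).mean(axis=(1, 3))
        prominent = np.zeros((n, n), dtype=bool)
        idx = np.argsort(scores, axis=None)[-top_t:]
        prominent[np.unravel_index(idx, (n, n))] = True
        # Mask m1 prominent and m2 = M - m1 non-prominent blocks at random.
        m1 = rng.integers(0, min(M, top_t) + 1)
        masked = np.zeros((n, n), dtype=bool)
        for region, k in ((prominent, m1), (~prominent, M - m1)):
            cand = np.flatnonzero(region)
            pick = rng.choice(cand, size=min(k, cand.size), replace=False)
            masked[np.unravel_index(pick, (n, n))] = True
        return prominent, masked

    p1, k1 = block_info(sal1)
    p2, k2 = block_info(sal2)
    out = np.empty_like(img1)
    for r in range(n):
        for c in range(n):
            sl = np.s_[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            b1, b2 = img1[sl], img2[sl]
            if k1[r, c] and not k2[r, c]:      # masked block is replaced by the other image
                out[sl] = b2
            elif k2[r, c] and not k1[r, c]:
                out[sl] = b1
            elif p1[r, c] == p2[r, c]:         # both prominent or both non-prominent: mixup
                out[sl] = lam * b1 + (1 - lam) * b2
            else:                              # keep the prominent block
                out[sl] = b1 if p1[r, c] else b2
    return out
```

In training, this fusion would be applied separately to the two positive supports and to the two negative supports before the contrastive loss is computed.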
4.3. Confident Clip Selector
The quality and diversity of query samples directly impact the robustness and adaptability of models. While leveraging diverse and representative query samples helps models adapt to different environmental and contextual variations, it is crucial to note that not all cropped images from the query might be optimal; some might contain valuable samples, while others might not. Therefore, identifying and selecting the high-quality samples becomes pivotal for us. To better utilize query data, we propose the CCS method, as illustrated in
Figure 2c, to handle the query images.We have the original image and performed random cropping to generate approximately six new images. Each image has been processed by the same feature extractor to create corresponding feature vectors.
The design of the Confident Clip Selector (CCS) draws partly on pseudo-labeling and pseudo-ensemble techniques, which have been applied successfully in other machine learning settings. For instance, Bachman et al. [10], in “Learning with Pseudo-Ensembles”, discussed strategies that enhance model performance by generating pseudo-labels. Similarly, CCS generates multiple query samples and selects high-quality ones, thereby enhancing the model’s robustness and adaptability.
Given the $n$ cropped images, the matrix of feature vectors, denoted $X$, is of size $n \times d$, where $d$ is the dimension of the feature vectors. We solve

$\min_{\theta}\; \tfrac{1}{2}\,\lVert X - X\theta \rVert_F^2 + \gamma\, R(\theta),$

which aims to minimize the difference between the feature matrix $X$ and the predicted values $X\theta$ by adjusting the parameter $\theta$, measured through the squared Frobenius norm. Here, $X$ represents the matrix of image features, $\theta$ contains the parameters of a linear model used to combine these features, and $R(\theta)$ denotes the regularization term applied to control the complexity of the model. The ultimate objective is to sort the images according to this criterion.

As the regularization strength $\gamma$ increases, $\theta$ becomes increasingly sparse until all of its elements vanish. We record the value of $\gamma$ at which each crop’s coefficients disappear and rank the pseudo-labeled data by this vanishing point. Extremely confident or extremely uncertain images are unnecessary, so we remove those with the highest and lowest confidence and use the remaining crops as individual queries for contrastive learning.
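The sketch below shows one way such a sparsity-based ranking could be realized; the per-crop correction term, its closed-form soft-thresholding vanishing point, and the number of discarded crops are simplifying assumptions rather than the paper’s exact procedure.

```python
import numpy as np

def ccs_select(crop_feats, drop_low=1, drop_high=1):
    """Rank query crops by the regularization strength at which a sparse
    per-crop correction vanishes, then drop the extremes.

    crop_feats: (n, d) feature vectors of the n random crops.
    Under an L1 penalty, a soft-thresholded correction for crop i is zero in
    every coordinate once gamma >= max|r_i|, where r_i is the crop's residual
    from the mean crop feature; that vanishing point serves as a confidence
    score (smaller = better explained by the other crops = more confident).
    """
    prototype = crop_feats.mean(axis=0, keepdims=True)     # predicted value for every crop
    residuals = crop_feats - prototype                     # (n, d) residual matrix
    vanish_gamma = np.abs(residuals).max(axis=1)           # gamma at which the correction vanishes

    order = np.argsort(vanish_gamma)                       # most confident first
    # Discard the extremes: the most confident and the most uncertain crops.
    return order[drop_low:len(order) - drop_high]

# Example: six crops with 64-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 64))
print(ccs_select(feats))   # indices of the crops kept as individual queries
```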
4.4. Utilizing Saliency Fuse for Data Augmentation
Our data augmentation strategy is revised here: the mask required for the classification loss is eliminated. Consequently, after acquiring the saliency maps, we partition each image into $n \times n$ blocks. Among these blocks, we designate the most salient $t$ blocks for labeling and integrate the remaining blocks through the SF rule

$\tilde{x}(b) = \begin{cases} \lambda\, x_1(b) + (1 - \lambda)\, x_2(b), & \text{if } x_1(b) \text{ and } x_2(b) \text{ are both salient or both non-salient}, \\ x_i(b), & \text{if only } x_i(b)\ (i \in \{1, 2\}) \text{ is salient}, \end{cases}$

i.e., the M-Mix fusion with the masking step removed. This approach accentuates the significance of salient areas within an image, as illustrated in Figure 2d. Varied blending techniques result in images exhibiting unique visual traits, and this blending potential fosters the creation of more distinctive and discriminative images, providing diverse viewpoints for model training to adeptly recognize and understand crucial features. The formula for the mixed labels is
$\tilde{y} = \lambda\, y_1 + (1 - \lambda)\, y_2, \qquad \lambda = \frac{\lVert m \rVert_1}{n},$

where $\tilde{y}$ represents the resultant mixed label and $y_1$ and $y_2$ correspond to the labels of the two original samples. The term $n$ denotes the image size measured in patches, i.e., the number of patches in the image, and $\lVert m \rVert_1$ represents the count of activated elements, namely the number of elements with a value of 1 in the binary mask $m$. The parameter $\lambda$ is the mixing weight determining the degree of blending between the two labels. Additionally, $n_1$ and $n_2$ signify the counts of inactive patches in the two samples, respectively. This formula generates the final mixed label from the mixing weight and the number of mixed patches.
4.5. Algorithm Description
Algorithm 1, the Mask Mixup Model, operates on both the training and test datasets and produces saliency maps for the support images. For each support-query pair, the query image is randomly cropped to create new instances, after which feature vectors are generated and confidence is evaluated with the regression technique described above. Finally, the algorithm iteratively computes the loss functions $\mathcal{L}_{con}$ and $\mathcal{L}_{cls}$ for each pair in both the training and test sets, where $\mathcal{L}_{cls}$ represents the classification loss and $\mathcal{L}_{con}$ the contrastive loss.
Algorithm 1 Mask Mixup Model algorithm
Require: support-query pairs $(s, q)$ from the training and test sets $D$
Ensure: loss functions $\mathcal{L}_{con}$ and $\mathcal{L}_{cls}$
1: for all $(s, q)$ in $D$ do
2:   Generate saliency maps for $s$
3:   Compute the fused support samples with M-Mix and SF
4:   Randomly crop $q$ to create new images
5:   Generate feature vectors for each cropped image
6:   Evaluate confidence using linear regression or other methods
7:   Compute the inner product $F$ between $q$ and the fused support samples
8:   Compute loss $\mathcal{L}_{con}$
9:   Compute the classification predictions
10:  Compute loss $\mathcal{L}_{cls}$
11: end for
5. Experiments
5.1. Datasets
To validate the proposed method, we conducted experiments on several widely used datasets. miniImageNet is a subset of ImageNet consisting of 100 classes with 600 instances per class; the classes are divided into training, validation, and test sets of 64, 16, and 20 classes, respectively. tieredImageNet is also sampled from ImageNet and contains 779,165 images from 608 classes, divided into training, validation, and test sets of 351, 97, and 160 classes, respectively. Fewshot-CIFAR100 (FC100) is a subset of CIFAR-100; a common partitioning uses 60 classes for training and 20 classes each for validation and testing. CUB is a fine-grained classification dataset covering 200 bird species, with 100 base categories, 50 categories for evaluation, and the remaining 50 as novel categories. For each dataset, we report the number of classes, the number of samples per class, and the specific splits used for training, validation, and testing, so that the experiments in Section 5 can be interpreted clearly when benchmarking our model’s performance.
5.2. Implementation Details
For a fair comparison with previous works, we employed ResNet-12 as the backbone of our model, closely following the specifications outlined in TADAM [19]. Model parameters were initialized with the He-normal method. For optimization we chose Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1. In the miniImageNet experiments, the learning rate was adjusted at the 12,000th, 14,000th, and 16,000th episodes, while in the tieredImageNet experiments it was halved every 24,000 episodes. Model evaluation covered 2000 episodes in all experiments, and during training each batch comprised four episodes.
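For reference, here is a minimal PyTorch sketch of this optimization setup; the decay factor, momentum, weight decay, and the placeholder backbone are assumptions, since the text specifies only the optimizer, the initial learning rate, and the adjustment episodes.

```python
import torch
from torch import nn, optim

# Stand-in backbone; the paper uses ResNet-12 as specified in TADAM.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

# SGD with an initial learning rate of 0.1 (momentum/weight decay are assumed values).
optimizer = optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# miniImageNet: adjust the learning rate at episodes 12,000, 14,000, and 16,000
# (a decay factor of 0.1 is assumed; the scheduler is stepped once per episode).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12000, 14000, 16000], gamma=0.1)

for episode in range(3):                        # would run for ~18,000 episodes in practice
    images = torch.randn(4 * 80, 3, 84, 84)     # a batch of four episodes (placeholder data)
    loss = backbone(images).pow(2).mean()       # stands in for L_con + L_cls
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```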
5.3. Comparison with Other Methods
To assess the performance of our model, we compared it with several previous methods, including ProtoNet [4], infoPatch [8], LSFSL [2], MML [6], TALDS-Net [3], and RENet [11], among others. These methods either represent classical few-shot learning (FSL) approaches or have previously reported state-of-the-art results. Our findings are presented in Table 1.
In Table 1, we compare results on the miniImageNet and tieredImageNet datasets, with ResNet-12 as the backbone for all compared methods. Compared with DeepEMD [5], our method requires less computation, and compared with RENet [11] and CVET [12], it requires no additional parameters, demonstrating the superiority of the proposed method. In Table 2, we compare our method with others on the FC100 dataset, again using ResNet-12 as the backbone. In Table 3, the comparison is made on the CUB-200 dataset; even though some compared methods use backbones with more parameters than ours, the proposed method still outperforms them.
By visualizing spatial correspondences, we confirm the model’s ability to recognize images from partial information. Using the features extracted from the support image, we compute the inner product with each segment of the query image. Figure 3 demonstrates our method’s superiority in spatial relationship comprehension over the baseline. Notably, our model covers the foreground accurately and comprehensively and recognizes objects within noisy images without false activations elsewhere, further validating the effectiveness of our representation.
Table 1.
Comparison of our method with other methods, using ResNet-12 as the backbone network, on the miniImageNet and tieredImageNet datasets. Bold indicates the best performance.
| Method | miniImageNet 1-Shot | miniImageNet 5-Shot | tieredImageNet 1-Shot | tieredImageNet 5-Shot |
|---|---|---|---|---|
| ProtoNet [4] | 62.39 ± 0.21 | 80.53 ± 0.14 | 68.23 ± 0.23 | 84.03 ± 0.16 |
| CAN [13] | 63.85 ± 0.48 | 79.44 ± 0.34 | 69.89 ± 0.51 | 84.23 ± 0.37 |
| DeepBDC [14] | 67.83 ± 0.43 | 84.46 ± 0.28 | 72.34 ± 0.49 | **87.31 ± 0.32** |
| DeepEMD [5] | 65.91 ± 0.82 | 82.41 ± 0.56 | 71.16 ± 0.87 | 86.03 ± 0.58 |
| RENet [11] | 67.60 ± 0.44 | 82.58 ± 0.30 | 71.61 ± 0.51 | 85.28 ± 0.35 |
| DMF [15] | 67.76 ± 0.46 | 82.71 ± 0.31 | 71.89 ± 0.52 | 85.96 ± 0.35 |
| FRN [16] | 66.45 ± 0.19 | 82.83 ± 0.13 | 72.06 ± 0.22 | 86.89 ± 0.14 |
| ZN [17] | 67.35 ± 0.43 | 83.04 ± 0.29 | 72.28 ± 0.51 | 87.20 ± 0.34 |
| MML [6] | 67.58 ± 0.23 | 81.41 ± 0.20 | 71.38 ± 0.25 | 84.65 ± 0.20 |
| infoPatch [8] | 67.67 ± 0.45 | 82.44 ± 0.31 | 71.51 ± 0.52 | 85.44 ± 0.35 |
| LSFSL [2] | 64.67 ± 0.49 | 81.79 ± 0.18 | 71.17 ± 0.52 | 86.23 ± 0.22 |
| TALDS-Net [3] | 67.89 ± 0.20 | 84.31 ± 0.44 | 71.34 ± 0.32 | 86.12 ± 0.33 |
| Ours | **70.03 ± 0.36** | **84.74 ± 0.53** | **73.14 ± 0.57** | 86.72 ± 0.69 |
Table 2.
The 5-way 1-shot and 5-way 5-shot few-shot accuracies on FC100. All results of competitors are from the original papers. The bold data indicates optimal performance.
| Model | 5-Way 1-Shot | 5-Way 5-Shot |
|---|---|---|
| MAML [18] | 38.1 ± 1.7 | 50.4 ± 1.0 |
| TADAM [19] | 40.1 ± 0.4 | 56.1 ± 0.4 |
| ProtoNet [4] | 37.5 ± 0.6 | 52.5 ± 0.6 |
| MetaOptNet [20] | 41.1 ± 0.6 | 55.5 ± 0.6 |
| DeepEMD [5] | **46.5 ± 0.8** | **63.2 ± 0.7** |
| Rethink-Distill [21] | 44.6 ± 0.7 | 60.1 ± 0.6 |
| MML [6] | 44.43 ± 0.2 | 59.56 ± 0.3 |
| infoPatch [8] | 43.8 ± 0.4 | 58.0 ± 0.4 |
| Ours | 45.2 ± 0.3 | 60.8 ± 0.4 |
We employed t-SNE plots to visualize the embeddings. Specifically, we constructed a larger episode from the target categories of the miniImageNet dataset and fed it into both the baseline model [1] and our complete model. Figure 4 presents the visualizations of the embeddings. It is apparent that the embeddings produced by the baseline model lack compactness, with many categories blending together, whereas our method produces more compact embeddings, allowing the samples of each category to cluster more effectively.
Table 3.
The table compares our method with other methods on the CUB dataset. The bold data indicates optimal performance.
| Method | Backbone | CUB 1-Shot | CUB 5-Shot |
|---|---|---|---|
| Robust-20 [22] | ResNet-18 | 58.67 ± 0.7 | 75.62 ± 0.5 |
| RelationNet [23] | ResNet-18 | 67.59 ± 1.0 | 82.75 ± 0.6 |
| MAML [18] | ResNet-18 | 68.42 ± 1.0 | 83.47 ± 0.6 |
| ProtoNet [4] | ResNet-18 | 71.88 ± 0.9 | 86.64 ± 0.5 |
| Baseline++ [24] | ResNet-18 | 67.02 ± 0.9 | 83.58 ± 0.5 |
| S2M2R [25] | ResNet-34 | 72.92 ± 0.83 | 86.55 ± 0.67 |
| AA [26] | ResNet-18 | 74.22 ± 1.09 | 88.65 ± 0.55 |
| MixtFSL [27] | ResNet-18 | 73.94 ± 1.1 | 86.01 ± 0.5 |
| LaplacianShot [28] | ResNet-18 | 80.96 | 88.68 |
| RENet [11] | ResNet-12 | 79.49 ± 0.44 | 91.11 ± 0.24 |
| SetFeat [29] | ResNet-12 | 79.60 ± 0.80 | 90.48 ± 0.44 |
| LRD [25] | ResNet-12 | 79.56 ± 0.87 | 90.67 ± 0.35 |
| Ours | ResNet-12 | **81.33 ± 0.59** | **91.73 ± 0.37** |
5.4. Ablation Study
We conducted ablation studies of M-Mix, CCS, and SF against the baseline [1]. As shown in Table 4, each component contributed an improvement. Using the miniImageNet dataset for this analysis, we observed the impact of each component: M-Mix alone notably outperformed the baseline model; adding CCS on its own did not yield a notable gain, but combining it with M-Mix proved effective; and employing the proposed SF slightly improved the model’s generalization.
When constructing support images for contrastive learning, we divided the images into grids and applied masking to them. For simplicity, this analysis was conducted exclusively on the miniImageNet dataset. To keep the experiments consistent, we chose square grids, so that the number of blocks is a perfect square, using grid sizes of 8 × 8, 10 × 10, and 12 × 12. Under the 8 × 8 grid we masked 10, 16, and 20 blocks; with the 10 × 10 grid, 20, 25, and 30 blocks; and with the 12 × 12 grid, 30, 40, and 50 blocks. The synthesized images were then used for contrastive learning. As shown in Table 5, larger grid sizes lead to better results; however, since our input size is 84 × 84, overly large grids can introduce more noise within individual blocks. Opting for moderately sized grids therefore allows us to synthesize more information-rich samples and enhances performance.
In the CCS module, we conducted ablation experiments on the number of random crops, testing 4, 6, 8, 12, and 16 crops. With all other settings held constant, the experiments were run on the miniImageNet dataset, as shown in Figure 5. The results indicate that performance is optimal with six random crops: fewer crops leave the query set with too few samples, while more crops enlarge the dataset but can lead to overfitting. We therefore use six random crops.
5.5. Limitation
Our study has several limitations. Firstly, although the dataset used in this research is quite comprehensive, it may lack certain aspects of real-world variability, potentially limiting the generalizability of our findings. Additionally, the images synthesized by M-Mix may exhibit some distortion or bias compared with real data.
6. Conclusions
This paper presents an innovative approach to the challenge of few-shot image classification. Traditional methods often face problems caused by limited data, prompting the use of data augmentation techniques such as random occlusion or mixup interpolation to increase the diversity and generalization of labeled samples. However, these methods have drawbacks: random occlusion can result in the loss of crucial information, while mixup interpolation can create overly homogeneous data distributions that hinder class differentiation. To overcome these issues, this paper introduces a novel data augmentation method based on saliency mask mixing. The technique uses visual feature fusion and confidence pruning to intelligently select and preserve key image features, and it employs a saliency-based fusion approach to evaluate the importance of different regions, guiding the fusion process to generate more diverse and nuanced images. This enhances the model’s ability to distinguish between classes. The proposed method demonstrates superior performance across multiple standard few-shot image classification datasets, outperforming current state-of-the-art methods by approximately 0.2–1%.
Author Contributions
Conceptualization, K.X., Y.G. and X.C.; methodology, K.X. and Y.G.; software, K.X. and X.C.; validation, K.X.; writing—original draft preparation, K.X.; writing—review and editing, K.X., Y.G. and Y.C.; project administration, K.X. and Y.C.; funding acquisition, K.X. and Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded in part by the National Natural Science Foundation of China under grants 61802197 and 62072449.
Data Availability Statement
The data presented in this study are openly available in Refs. [30,31,32,33].
Acknowledgments
The authors thank the anonymous reviewers and the editors for their insightful comments and helpful suggestions for improving our manuscript.
Conflicts of Interest
Author Kai Xie was employed by the company Institute of NR Electric Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9042–9051. [Google Scholar] [CrossRef]
- Padmanabhan, D.; Gowda, S.; Arani, E.; Zonooz, B. LSFSL: Leveraging Shape Information in Few-shot Learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 4971–4980. [Google Scholar]
- Qiao, Q.; Xie, Y.; Zeng, Z.; Li, F. TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification. arXiv 2023, arXiv:2312.05449. [Google Scholar]
- Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4077–4087. [Google Scholar]
- Zhang, C.; Cai, Y.; Lin, G.; Shen, C. Deepemd: Differentiable earth mover’s distance for few-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5632–5648. [Google Scholar] [CrossRef] [PubMed]
- Chen, H.; Li, H.; Li, Y.; Chen, C. Multi-level metric learning for few-shot image recognition. In Proceedings of the International Conference on Artificial Neural Networks, Bristol, UK, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 243–254. [Google Scholar]
- Mangla, P.; Kumari, N.; Sinha, A.; Singh, M.; Krishnamurthy, B.; Balasubramanian, V.N. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2218–2227. [Google Scholar]
- Liu, C.; Fu, Y.; Xu, C.; Yang, S.; Li, J.; Wang, C.; Zhang, L. Learning a few-shot embedding model with contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8635–8643. [Google Scholar]
- Zhuo, L.; Fu, Y.; Chen, J.; Cao, Y.; Jiang, Y.G. Tgdm: Target guided dynamic mixup for cross-domain few-shot learning. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6368–6376. [Google Scholar]
- Bachman, P.; Alsharif, O.; Precup, D. Learning with pseudo-ensembles. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
- Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8822–8833. [Google Scholar]
- Yang, Z.; Wang, J.; Zhu, Y. Few-shot classification with contrastive learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 293–309. [Google Scholar]
- Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross attention network for few-shot classification. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7972–7981. [Google Scholar]
- Xu, C.; Fu, Y.; Liu, C.; Wang, C.; Li, J.; Huang, F.; Zhang, L.; Xue, X. Learning dynamic alignment via meta-filter for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5182–5191. [Google Scholar]
- Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8012–8021. [Google Scholar]
- Fei, N.; Gao, Y.; Lu, Z.; Xiang, T. Z-score normalization, hubness, and few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 142–151. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
- Oreshkin, B.; Rodríguez López, P.; Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
- Zhang, X.; Meng, D.; Gouk, H.; Hospedales, T.M. Shallow bayesian meta learning for real-world few-shot recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 651–660. [Google Scholar]
- Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.B.; Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 266–282. [Google Scholar]
- Dvornik, N.; Schmid, C.; Mairal, J. Diversity with cooperation: Ensemble methods for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3723–3731. [Google Scholar]
- Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208. [Google Scholar]
- Chen, Z.; Fu, Y.; Wang, Y.X.; Ma, L.; Liu, W.; Hebert, M. Image deformation meta-networks for one-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8680–8689. [Google Scholar]
- Yang, S.; Liu, L.; Xu, M. Free lunch for few-shot learning: Distribution calibration. arXiv 2021, arXiv:2101.06395. [Google Scholar]
- Afrasiyabi, A.; Lalonde, J.F.; Gagné, C. Associative alignment for few-shot image classification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 18–35. [Google Scholar]
- Afrasiyabi, A.; Lalonde, J.F.; Gagné, C. Mixture-based feature space learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9041–9051. [Google Scholar]
- Ziko, I.; Dolz, J.; Granger, E.; Ayed, I.B. Laplacian regularized few-shot learning. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 13–18 July 2020; pp. 11660–11670. [Google Scholar]
- Afrasiyabi, A.; Larochelle, H.; Lalonde, J.F.; Gagné, C. Matching feature sets for few-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9014–9024. [Google Scholar]
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. arXiv 2018, arXiv:1803.00676. [Google Scholar]
- Bertinetto, L.; Henriques, J.F.; Torr, P.H.; Vedaldi, A. Meta-learning with differentiable closed-form solvers. arXiv 2018, arXiv:1805.08136. [Google Scholar]
- Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J.K. This looks like that: Deep learning for interpretable image recognition. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]