Article

Mask Mixup Model: Enhanced Contrastive Learning for Few-Shot Learning

1
Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
2
Institute of NR Electric Co., Ltd., Nanjing 211102, China
3
School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
4
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6063; https://doi.org/10.3390/app14146063
Submission received: 6 June 2024 / Revised: 27 June 2024 / Accepted: 8 July 2024 / Published: 11 July 2024

Abstract

Few-shot image classification aims to improve the performance of traditional image classification when only limited data are available. Its main challenge lies in effectively utilizing sparse labeled samples to accurately estimate the true feature distribution. Recent approaches employ data augmentation techniques such as random masking or mixup interpolation to enhance the diversity and generalization of labeled samples. However, these methods still face several issues: (1) random masking can completely occlude or completely expose the foreground, causing the loss of crucial sample information; and (2) the uniform data distribution produced by mixup interpolation makes it difficult for the model to differentiate between categories and to delineate their boundaries. To address these challenges, this paper introduces a novel data augmentation method based on saliency mask blending. First, it selectively preserves key image features through adaptive selection and retention, using visual feature occlusion fusion and a confidence clipping strategy. Second, a visual feature saliency fusion approach is employed to estimate the importance of different image regions, guiding the blending process to produce more diverse and enriched images with clearer category boundaries. The proposed method achieves outstanding performance on multiple standard few-shot image classification benchmarks (miniImageNet, tieredImageNet, FC100, and CUB), surpassing state-of-the-art methods by approximately 0.2–1%.

1. Introduction

Conventional neural image classifiers have demonstrated exceptional performance across diverse setups and benchmarks, but they still rely heavily on abundant labeled training data. Most significantly, these methods struggle to transfer the knowledge they have learned to novel target categories.
Inspired by humans’ ability to learn from a few examples, the concept of few-shot image classification has been proposed. It aims to recognize target categories by adapting knowledge previously learned from source categories. This knowledge is typically stored in a deep embedding model used to match support and query image pairs in a class-generic manner. A promising line of work leverages meta-learning frameworks to optimize the initialization of model parameters [1,2,3] or to replace standard parametric linear classifier heads [4,5,6] with class-agnostic distance functions.
Although these methods have shown significant performance, the key challenge of few-shot image classification still lies in eliminating the inductive bias inherited from the source classes so that the hypothesis space can be tailored to the few training instances of the new target classes. Moreover, one or just a few instances cannot adequately capture the data distribution of the novel classes, which can lead to a flawed embedding. To address these issues, refs. [7,8,9] introduce variations into the training data to encourage the model to learn a more general and adaptive representation. Augmenting the learned embeddings with greater data variety further forces the model to acquire more comprehensive and discriminative embeddings. However, as depicted in the left images of Figure 1a, the outcomes generated from randomly obscured samples might either fully expose or entirely obscure crucial regions. This can hamper the model’s acquisition of new features or impair its capacity to discriminate among distinct categories and generalize effectively. As illustrated in the right images of Figure 1a, these approaches only mix a given random pair of inputs and do not fully exploit the rich supervisory signal in the training data.
To address these challenges in few-shot learning, this paper introduces three modules: (1) Mask Mix (M-Mix), (2) Saliency Fuse (SF), and (3) Confident Clip Selector (CCS). The M-Mix module masks images under the guidance of a saliency map and then blends them, as shown in the upper image of Figure 1b, producing images that are more representative and contain richer information. Humans can identify objects from only a part of them; these images mimic that scenario, enabling the model to acquire a similar ability and learn better embeddings. To better approximate such partially occluded scenes, the query is passed through the CCS module so that it, too, shows only part of the object. The SF module, in turn, systematically examines the mix-matching of distinct salient regions across all input data, ensuring that each produced mixup example incorporates as many salient regions from diverse input sources as possible while maintaining diversity among the resulting mixup examples.
The main contributions of this paper are as follows:
(i)
We introduce a novel approach called M-Mix, designed to generate positive and negative samples for contrastive learning. By masking and merging images, this method enables the model to recognize object categories from parts of objects, reducing erroneous pixel correlations and enhancing the model’s generalization capabilities.
(ii)
The proposed CCS method generates a larger quantity of higher-quality query instances. This increase in training sample diversity helps the model cope with noise, generalize to unseen data, and improve overall performance.
(iii)
Our method performs excellently on the miniImageNet, tieredImageNet, FC100, and CUB datasets, surpassing state-of-the-art methods by approximately 0.2–1%. These results demonstrate the significant potential and advantages of our method for few-shot image classification.

2. Related Works

This work involves the application of meta-learning [1] and data augmentation in few-shot learning. To address the issues of insufficient samples and poor feature representation ability in few-shot image classification, a data augmentation method is proposed. This section reviews the work related to the field of few-shot learning relevant to this study.

Few-Shot Learning

Few-shot learning has garnered increasing attention in recent years, with current research in this field broadly divided into two categories: gradient-based methods and metric-based methods.
Gradient-based methods enhance the generalization ability and performance of the model by promoting faster and more accurate adaptation to new tasks and new data through gradient descent. The goal is to extract general knowledge from various few-shot tasks and quickly adapt to new tasks. MetaBaseline [1] enables rapid adaptation and high accuracy when faced with new categories or tasks by performing meta-learning on the evaluation metrics of the overall classification pre-trained model. LSFLS [2] improves the robustness and generalization ability of few-shot learning by utilizing implicit prior information in the data to learn more generalized features.
The main objective of metric learning is to learn a feature space in which similarities or distances between samples are measured accurately, minimizing intra-class distance while maximizing inter-class distance. Traditional methods typically consist of two components: a feature extractor and a classifier. ProtoNet [4] represents each category by the mean vector of its samples in the embedding space learned by the neural network, serving as the prototype for that category, and has provided significant inspiration for subsequent research. DeepEMD [5] integrates the Earth Mover’s Distance (EMD) into the network and designs a cross-reference mechanism to mitigate the impact of complex backgrounds and intra-class appearance variations, achieving end-to-end training for few-shot image classification with an optimal-matching-based distance metric over image regions under a small number of labeled samples.
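For concreteness, the following PyTorch sketch illustrates the prototype idea behind ProtoNet [4]: class prototypes are support-feature means, and queries are scored by negative squared Euclidean distance. The function name and tensor layout are illustrative assumptions rather than code from the original work.

```python
import torch

def prototype_logits(support_feats, support_labels, query_feats, n_way):
    """Minimal ProtoNet-style scoring: each class prototype is the mean of its
    support embeddings, and queries are scored by negative squared Euclidean
    distance to each prototype."""
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(n_way)
    ])                                                      # (n_way, d)
    sq_dists = torch.cdist(query_feats, prototypes) ** 2    # (n_query, n_way)
    return -sq_dists                                        # larger logit = closer prototype
```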

3. Problem Definition

In this section, we delve into the few-shot recognition setting. Here, $D_{train}$, $D_{val}$, and $D_{test}$ denote the training, validation, and test sets, respectively, and their corresponding labels are $Y_{train}$, $Y_{val}$, and $Y_{test}$, so that $D_{train} = \{X_{train}, Y_{train}\}$, $D_{val} = \{X_{val}, Y_{val}\}$, and $D_{test} = \{X_{test}, Y_{test}\}$. $C_{train}$, $C_{val}$, and $C_{test}$ denote the classes contained in the training, validation, and test sets, respectively. Few-shot learning (FSL) differs from the conventional supervised learning paradigm in that the classes of the training and test sets are entirely disjoint, i.e., $C_{train} \cap C_{test} = \emptyset$.
The primary challenge in few-shot learning is to enable a model to generalize well from a limited number of training examples. This involves developing methods that can effectively leverage the small dataset to achieve high performance on unseen data. Our approach addresses this by introducing the Mask Mixup Model and Enhanced Contrastive Learning techniques, which focus on improving the representation and discrimination of the limited training data.
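For concreteness, the sketch below shows one way an N-way K-shot episode could be sampled from a labeled set whose classes are disjoint from the test classes. The function and variable names are illustrative assumptions, not part of our implementation.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """Sample one N-way K-shot episode (support + query) from an iterable of
    (image, label) pairs whose labels come from C_train."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    episode_classes = random.sample(list(by_class.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(episode_classes):
        # Assumes each class has at least k_shot + q_queries images.
        images = random.sample(by_class[cls], k_shot + q_queries)
        support += [(img, episode_label) for img in images[:k_shot]]
        query += [(img, episode_label) for img in images[k_shot:]]
    return support, query
```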

4. Methods

4.1. Method Overview

We introduce a novel model called the Mask Mixup Model (illustrated in Figure 2a), which comprises three key modules. The first, M-Mix, optimizes an infoNCE loss adapted to few-shot learning, leveraging augmentation techniques to mine challenging samples. The second, CCS, uses a confidence-based method to select the desired query samples. The third, SF, removes the masking step of the M-Mix method and is primarily aimed at introducing noise so that the model learns better embeddings.
Data augmentation techniques play a crucial role in improving model generalization. In this paper, we propose the Mask Mixup and Saliency Fuse methods, which selectively retain and blend important features from images to generate more diverse and representative training samples. Additionally, the Confident Clip Selector enhances the diversity of training data by randomly cropping and selecting high-quality query samples. Our approach outperforms traditional methods such as random occlusion and mixup interpolation by effectively preserving important image information and clarifying class boundaries, thereby significantly boosting model performance in few-shot image classification tasks.
One of the challenges in contrastive learning lies in the selection of informative positive and negative samples. We build contrastive pairs from the support features. Since each query instance has a label, for each query instance $x_i^q$ we treat support instances with the same label as positives and support instances with different labels as negatives. The same embedding network $\Phi$ is used for both query and support instances. Our loss function is based on
$$\mathcal{L}_i = -\log \frac{\sum_{y_j^s = y_i^q} e^{f_i^q \cdot f_j^s}}{\sum_{y_j^s = y_i^q} e^{f_i^q \cdot f_j^s} + \sum_{y_k^s \neq y_i^q} e^{f_i^q \cdot f_k^s}},$$
where $\mathcal{L}_i$ denotes the loss calculated for the i-th query sample. Here, $f_i^q$ refers to the feature representation of the i-th query sample, while $f_j^s$ and $f_k^s$ represent the feature representations of support samples j and k, respectively. Sample j shares the same label $y_i^q$ as query sample i, while sample k has a different label ($y_k^s \neq y_i^q$).
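A minimal PyTorch sketch of this loss is shown below, assuming query features of shape (N_q, d) and support features of shape (N_s, d); the function name and the absence of a temperature term are simplifying assumptions.

```python
import torch

def support_contrastive_loss(query_feats, query_labels, support_feats, support_labels):
    """InfoNCE-style loss over query/support pairs: supports sharing the query's
    label are positives, all other supports are negatives."""
    logits = query_feats @ support_feats.t()                 # inner products f_i^q . f_j^s
    exp_logits = torch.exp(logits)
    pos_mask = (query_labels.unsqueeze(1) == support_labels.unsqueeze(0)).float()
    pos_sum = (exp_logits * pos_mask).sum(dim=1)             # numerator: same-label supports
    all_sum = exp_logits.sum(dim=1)                          # denominator: positives + negatives
    return -torch.log(pos_sum / all_sum).mean()
```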

4.2. Mask Mix

Support images act as reference templates. Two images with labels different from the query are chosen as negative samples, and two images sharing the query’s label are selected as positive samples. Saliency maps are generated for each image to identify the most significant regions. We then semi-randomly mask the images: each image is divided into $N_{total}$ blocks, the top $N_{fore}$ most prominent blocks are designated as foreground, $M_{fore}$ blocks among these $N_{fore}$ blocks are randomly masked, and $M_{back}$ blocks within the less prominent regions are randomly masked. $M_{fore}$ and $M_{back}$ are chosen randomly, subject to the constraint $M_{fore} + M_{back} = M$, where M is a fixed value. Subsequently, we fuse the two positive samples and the two negative samples separately. Each block in one image corresponds to a block in the other image, as illustrated in Figure 2b. If the two corresponding blocks are both prominent or both non-prominent, they undergo mixup. Given two input-label pairs $(x_i, y_i)$ and $(x_j, y_j)$, mixup creates new training samples and labels through linear interpolation: $\text{Mixup}(x_i, x_j) = \lambda \cdot x_i + (1 - \lambda) \cdot x_j$, where $\lambda$ is a random weight sampled from a Beta distribution.
If one block is significant while the other is not, the features of the significant region are prioritized and preserved during blending, ensuring that the model focuses on the crucial information in the image. If either block is masked, it is directly substituted. Let $\hat{x}_1$ and $\hat{x}_2$ be two input images represented as $\hat{x}_1 = x_1^l + x_1^m$ and $\hat{x}_2 = x_2^l + x_2^m$, where $x_1^l$ and $x_2^l$ denote the non-prominent regions and $x_1^m$ and $x_2^m$ denote the prominent regions. The mask is denoted as M. The fusion rule can be written as:
$$M(x_1, x_2) = \begin{cases} M & \text{if } x_1^m \in M \text{ and } x_2^m \in M, \\ x_1 & \text{if } x_1^m \wedge x_2^l, \\ x_2 & \text{if } x_1^l \wedge x_2^m, \\ \text{Mixup}(x_1, x_2) & \text{otherwise}. \end{cases}$$
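A minimal NumPy sketch of this block-level fusion is given below. The grid size, the numbers of prominent and masked blocks, and the Beta parameter are illustrative hyperparameters, and the handling of a single masked block (the unmasked counterpart fills in) is one possible reading of the rule above rather than the exact procedure.

```python
import numpy as np

def m_mix(x1, x2, sal1, sal2, n_grid=10, n_fore=25, n_mask=25, alpha=2.0):
    """Block-wise Mask Mix of two images of shape (C, H, W), guided by
    saliency maps of shape (H, W)."""
    C, H, W = x1.shape
    bh, bw = H // n_grid, W // n_grid
    out = np.zeros_like(x1, dtype=float)
    lam = np.random.beta(alpha, alpha)                   # mixup weight

    def block_flags(sal):
        # Rank blocks by total saliency, mark the top n_fore as prominent,
        # then randomly mask n_mask blocks.
        scores = sal[:n_grid * bh, :n_grid * bw].reshape(n_grid, bh, n_grid, bw).sum(axis=(1, 3))
        order = np.argsort(scores.ravel())[::-1]
        prominent = np.zeros(n_grid * n_grid, dtype=bool)
        prominent[order[:n_fore]] = True
        masked = np.zeros(n_grid * n_grid, dtype=bool)
        masked[np.random.choice(n_grid * n_grid, n_mask, replace=False)] = True
        return prominent, masked

    prom1, mask1 = block_flags(sal1)
    prom2, mask2 = block_flags(sal2)

    for idx in range(n_grid * n_grid):
        r, c = divmod(idx, n_grid)
        sl = (slice(None), slice(r * bh, (r + 1) * bh), slice(c * bw, (c + 1) * bw))
        if mask1[idx] and mask2[idx]:
            out[sl] = 0.0                                # both blocks masked
        elif mask1[idx]:
            out[sl] = x2[sl]                             # masked block substituted by the other image
        elif mask2[idx]:
            out[sl] = x1[sl]
        elif prom1[idx] and not prom2[idx]:
            out[sl] = x1[sl]                             # keep the salient block
        elif prom2[idx] and not prom1[idx]:
            out[sl] = x2[sl]
        else:
            out[sl] = lam * x1[sl] + (1 - lam) * x2[sl]  # mixup when both (non-)prominent
    return out
```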
This operation generates contrasting samples, emphasizing differences, enabling the model to prioritize crucial image regions. Consequently, the model enhances its capacity to grasp pivotal information. The inner product of the query and support is expressed as follows:
$$F = f_i^{(q)} \times M\left[f_j^{(s_1)}, f_j^{(s_2)}\right],$$
where $s_1$ and $s_2$ denote the first and second of the two selected support samples, respectively. The fused infoNCE loss can be written as:
$$\mathcal{L}_i = -\log \frac{\sum_{y_j^s = y_i^q} e^{F}}{\sum_{y_j^s = y_i^q} e^{F} + \sum_{y_k^s \neq y_i^q} e^{F}},$$
where $\mathcal{L}_i$ represents the loss value for the i-th query sample, $y_i^q$ denotes the label of the i-th query sample, and $y_j^s$ denotes the label of the j-th support sample. F denotes a similarity measure between two samples computed by the model; for example, $F(f(x_i^q), f(x_j^s))$ is the similarity between the query sample $x_i^q$ and the support sample $x_j^s$, where $f(\cdot)$ is the feature extraction function. $\sum_{y_j^s = y_i^q} e^{F}$ is the sum of the exponentiated similarity scores of all support samples whose labels match the query label $y_i^q$, and $\sum_{y_k^s \neq y_i^q} e^{F}$ is the sum of the exponentiated similarity scores of all support samples whose labels do not match $y_i^q$.
Specifically, the numerator consists of the similarity scores of support samples that have the same label as the query sample i, and the denominator includes the similarity scores of all support samples regardless of their labels. This contrastive loss function aims to maximize the similarity between samples with the same labels while minimizing the similarity between samples with different labels.

4.3. Confident Clip Selector

The quality and diversity of query samples directly impact the robustness and adaptability of models. While leveraging diverse and representative query samples helps models adapt to different environmental and contextual variations, not all crops of a query image are equally useful; some contain valuable content, while others do not. Identifying and selecting the high-quality samples is therefore pivotal. To better utilize the query data, we propose the CCS method, illustrated in Figure 2c, to handle the query images. Starting from the original image, we perform random cropping to generate approximately six new images, and each cropped image is passed through the same feature extractor to produce a corresponding feature vector.
The design inspiration of the Confident Clip Selector (CCS) method draws partially from pseudo-labeling and pseudo-ensemble techniques, which have been successfully applied in other machine learning domains. For instance, Bachman [10] discussed strategies in their paper “Learning with Pseudo-Ensembles” that enhance model performance by generating pseudo-labels. Similarly, CCS generates multiple query samples and selects high-quality ones, thereby enhancing the model’s robustness and adaptability.
Given the n cropped images, the matrix of their feature vectors, denoted $X_{cropped}$, is of size $n \times d$, where d is the dimension of the feature vectors. We solve
$$\arg\min_{\gamma \in \mathbb{R}^{n \times N}} \left\| \tilde{X}_{cropped} - \tilde{X}_{cropped}\,\gamma \right\|_F^2 + \lambda R(\gamma),$$
which aims to minimize the difference between the feature matrix $\tilde{X}_{cropped}$ and the predicted values $\tilde{X}_{cropped}\,\gamma$ by adjusting the parameter $\gamma$, measured through the squared Frobenius norm. In this formula, $\tilde{X}_{cropped}$ is the matrix of image features, $\gamma$ contains the parameters of a linear model used to combine these features, and $R(\gamma)$ is the regularization term applied to control the complexity of the model. The ultimate objective is to sort the images based on this criterion.
As $\lambda$ increases, $\gamma$ becomes increasingly sparse until all of its elements vanish. We record the $\lambda$ value at which $\gamma_i$ vanishes and rank the pseudo-labeled data by this vanishing point. Images with extremely high or extremely low confidence are not useful, so we remove those with the highest and lowest confidence and use the rest as individual queries for contrastive learning.
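One possible instantiation of this ranking is sketched below with scikit-learn's lasso_path: each crop is reconstructed from the other crops under an L1 penalty, and a crop is scored by the largest regularization strength at which its reconstruction keeps a non-zero coefficient. The exact regularizer $R(\gamma)$ and solver may differ from those used here, and the names are ours.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def rank_and_filter_crops(features, drop_low=1, drop_high=1):
    """Rank cropped query images by a sparsity-based confidence score and drop
    the most and least confident ones. `features` is an (n, d) matrix of crop
    features."""
    n = features.shape[0]
    score = np.zeros(n)
    for i in range(n):
        X_others = np.delete(features, i, axis=0).T      # (d, n-1): other crops as predictors
        y = features[i]                                   # feature vector of crop i
        alphas, coefs, _ = lasso_path(X_others, y, n_alphas=50)
        nonzero = np.abs(coefs).sum(axis=0) > 0           # one flag per alpha on the path
        score[i] = alphas[nonzero].max() if nonzero.any() else 0.0
    order = np.argsort(score)                             # low -> high confidence
    return order[drop_low: n - drop_high]                 # indices of the crops kept as queries
```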

4.4. Utilizing Saliency Fuse for Data Augmentation

For the classification loss, we revise the data augmentation strategy and remove the masking step. After acquiring the saliency maps, we partition the image into $N_{total}$ blocks, designate the most salient $N_{fore}$ blocks for labeling, and integrate the remaining blocks through the SF formula
$$S(x_1, x_2) = \begin{cases} x_1 & \text{if } x_1^m \wedge x_2^l, \\ x_2 & \text{if } x_1^l \wedge x_2^m, \\ \text{Mixup}(x_1, x_2) & \text{otherwise}. \end{cases}$$
This approach accentuates the significance of salient areas within an image, as illustrated in Figure 2d. Varied blending techniques result in images exhibiting unique visual traits. This blending potential fosters the creation of more distinctive and discriminative images, providing diverse viewpoints for model training to adeptly recognize and understand crucial features. The formula for the mixed labels is:
$$g(y_0, y_1) = \frac{c(m_{inter})}{W \times W}\left(\lambda y_0 + (1-\lambda) y_1\right) + \frac{c(\lnot m_{inter} \cdot m_0)}{W \times W}\, y_0 + \frac{c(\lnot m_{inter} \cdot m_1)}{W \times W}\, y_1,$$
where $g(y_0, y_1)$ is the resulting mixed label and $y_0$ and $y_1$ are the labels of the two original samples. The term $W \times W$ denotes the image size, i.e., the number of patches in the image. $c(m_{inter})$ is the number of activated elements (elements with value 1) in the binary interpolation mask $m_{inter}$. The parameter $\lambda$ is the mixing weight that determines the degree of blending between the two labels. $c(\lnot m_{inter} \cdot m_0)$ and $c(\lnot m_{inter} \cdot m_1)$ denote the counts of non-interpolated patches contributed by the two samples, respectively. This formula generates the final mixed label from the mixing weight and the number of mixed patches.
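The label-mixing rule can be sketched as follows, under the assumption that $m_{inter}$, $m_0$, and $m_1$ are binary NumPy patch masks of size W × W and that $y_0$, $y_1$ are one-hot label vectors.

```python
def mixed_label(y0, y1, m_inter, m0, m1, lam):
    """Mixed label for a Saliency-Fuse sample: interpolated patches contribute a
    lambda-weighted blend of both labels, while patches copied unchanged from
    image 0 or image 1 contribute only that image's label. Weights are the
    fractions of patches contributed by each source."""
    total = m_inter.size                                   # W * W patches
    w_mix = m_inter.sum() / total
    w0 = ((1 - m_inter) * m0).sum() / total                # patches kept from image 0
    w1 = ((1 - m_inter) * m1).sum() / total                # patches kept from image 1
    return w_mix * (lam * y0 + (1 - lam) * y1) + w0 * y0 + w1 * y1
```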

4.5. Algorithm Description

Algorithm 1, the Mask Mixup Model, operates on both the training and test datasets, producing saliency maps for the support images. For each support-query pair, the query images are randomly cropped to create new instances, followed by feature-vector generation and confidence evaluation using regression techniques. Finally, the algorithm iteratively computes the loss functions $\mathcal{L}_i$ and $\mathcal{L}_c$ for each pair in both the training and test sets, where $\mathcal{L}_c$ denotes the classification loss.
Algorithm 1 Mask Mixup Model algorithm
Require: D train , D val , D test , X cropped , x 1 , x 2
Ensure: Loss function L i
 1: for all ( s , q ) in ( D train , D test ) do
 2: Generate saliency maps for s
 3: Compute M ( s 1 , s 2 )
 4: Randomly crop q to create new images
 5: Generate feature vectors for each cropped image
 6: Evaluate confidence using linear regression or other methods
 7: Compute the inner product F between q and the fused support samples.
 8: Compute Loss L i
 9: Compute S ( s 1 , s 2 )
10: Compute Loss L c
11: end for
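The schematic loop below mirrors Algorithm 1 in PyTorch-style code. Every entry of the `components` dictionary (saliency, M-Mix, CCS, SF, and the two losses) is a placeholder to be supplied separately, and the loss weight `lambda_c` is an assumption rather than a stated setting.

```python
def train_mask_mixup_model(model, episodes, optimizer, components, lambda_c=1.0):
    """Schematic training loop mirroring Algorithm 1. `episodes` yields
    (support, support_labels, query, query_labels); `components` maps names to
    user-supplied callables for each module described in Section 4."""
    for support, s_labels, query, q_labels in episodes:
        sal = components["saliency"](support)                        # step 2: saliency maps
        fused_support = components["m_mix"](support, sal)            # step 3: M(s1, s2)
        crops = components["random_crop"](query, num_crops=6)        # step 4
        selected = components["ccs"](model.embed(crops))             # steps 5-6: confident crops
        F = model.embed(selected) @ model.embed(fused_support).T     # step 7: inner product
        loss_i = components["fused_infonce"](F, q_labels, s_labels)  # step 8
        sf_x, sf_y = components["saliency_fuse"](support, sal)       # step 9: S(s1, s2)
        loss_c = components["cls_loss"](model(sf_x), sf_y)           # step 10
        optimizer.zero_grad()
        (loss_i + lambda_c * loss_c).backward()
        optimizer.step()
```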

5. Experiments

5.1. Datasets

To validate the proposed method, we conducted experiments on several widely used datasets. miniImageNet is a subset of ImageNet consisting of 100 classes with 600 instances per class; the classes are split into training, validation, and test sets of 64, 16, and 20 classes, respectively. tieredImageNet is also sampled from ImageNet and contains 779,165 images from 608 classes, split into training, validation, and test sets of 351, 97, and 160 classes, respectively. Fewshot-CIFAR100 (FC100) is a subset of CIFAR-100; a common partitioning uses 60 classes for training and 20 classes each for validation and testing. CUB is a fine-grained classification dataset covering 200 bird species, with 100 base categories, 50 categories for evaluation, and the remaining 50 as novel categories. For each dataset, we report the number of classes, the number of samples per class, and the specific splits used for training, validation, and testing, clarifying how the datasets are used to benchmark our model’s performance in Section 5.

5.2. Implementation Details

For a fair comparison with previous works, we employed ResNet-12 as the backbone of our model, closely following the specifications outlined in TADAM [19]. Model parameters were initialized with the He-normal method. For optimization, we used Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1. In the miniImageNet experiments, the learning rate was adjusted at the 12,000th, 14,000th, and 16,000th episodes; in the tieredImageNet experiments, the learning rate was halved every 24,000 episodes. Model evaluation covered 2000 episodes in all experiments. During training, each batch comprised four episodes.
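A configuration sketch matching these settings in PyTorch is given below; the decay factor at the miniImageNet milestones, as well as the momentum and weight decay, are not specified above and are therefore assumptions.

```python
import torch

def build_optimizer_and_scheduler(model, dataset="miniImageNet"):
    """Optimizer and learning-rate schedule sketch for the settings in Section 5.2."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)  # momentum/weight decay assumed
    if dataset == "miniImageNet":
        # Learning rate adjusted at the 12,000th, 14,000th, and 16,000th episodes.
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[12_000, 14_000, 16_000], gamma=0.1)  # decay factor assumed
    else:
        # tieredImageNet: halve the learning rate every 24,000 episodes.
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer, step_size=24_000, gamma=0.5)
    return optimizer, scheduler
```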

5.3. Comparison with Other Methods

To assess the performance of our model, we compared it to several previous methods, including ProtoNet [4], infoPatch [8], LSFSL [2], MML [6], TALDS-Net [3], and RENet [11] among others. These methods either represent classical Few-Shot Learning (FSL) approaches or have previously demonstrated the best-reported results. Our findings are presented in Table 1.
Table 1 compares results on the miniImageNet and tieredImageNet datasets, with ResNet-12 as the backbone for all compared methods. Compared with DeepEMD [5], our method requires less computation, and compared with RENet [11] and CVET [12], it does not require additional parameters, demonstrating the superiority of the proposed method. Table 2 compares our method with others on the FC100 dataset, again using ResNet-12 as the backbone. Table 3 presents the comparison on the CUB-200 dataset. Even though some methods use backbone networks with more parameters than ours, the proposed method still outperforms them.
By visualizing spatial correspondences, we confirm our ability to recognize images using partial information. Utilizing the features extracted from the supporting image, we compute the inner product for each segment of the query image. Figure 3 demonstrates our method’s superiority in spatial relationship comprehension over the benchmark. Notably, our model accurately and comprehensively covers the foreground and excels in recognizing objects within noisy images without false recognitions elsewhere. This further validates the effectiveness of our model representation.
Table 1. This is a comparison of our method with other methods using ResNet12 as the backbone network, on the miniImageNet and tieredImageNet datasets. Bold indicates the best performance.
| Method | miniImageNet 1-Shot | miniImageNet 5-Shot | tieredImageNet 1-Shot | tieredImageNet 5-Shot |
|---|---|---|---|---|
| ProtoNet [4] | 62.39 ± 0.21 | 80.53 ± 0.14 | 68.23 ± 0.23 | 84.03 ± 0.16 |
| CAN [13] | 63.85 ± 0.48 | 79.44 ± 0.34 | 69.89 ± 0.51 | 84.23 ± 0.37 |
| DeepBDC [14] | 67.83 ± 0.43 | 84.46 ± 0.28 | 72.34 ± 0.49 | **87.31 ± 0.32** |
| DeepEMD [5] | 65.91 ± 0.82 | 82.41 ± 0.56 | 71.16 ± 0.87 | 86.03 ± 0.58 |
| RENet [11] | 67.60 ± 0.44 | 82.58 ± 0.30 | 71.61 ± 0.51 | 85.28 ± 0.35 |
| DMF [15] | 67.76 ± 0.46 | 82.71 ± 0.31 | 71.89 ± 0.52 | 85.96 ± 0.35 |
| FRN [16] | 66.45 ± 0.19 | 82.83 ± 0.13 | 72.06 ± 0.22 | 86.89 ± 0.14 |
| ZN [17] | 67.35 ± 0.43 | 83.04 ± 0.29 | 72.28 ± 0.51 | 87.20 ± 0.34 |
| MML [6] | 67.58 ± 0.23 | 81.41 ± 0.20 | 71.38 ± 0.25 | 84.65 ± 0.20 |
| infoPatch [8] | 67.67 ± 0.45 | 82.44 ± 0.31 | 71.51 ± 0.52 | 85.44 ± 0.35 |
| LSFSL [2] | 64.67 ± 0.49 | 81.79 ± 0.18 | 71.17 ± 0.52 | 86.23 ± 0.22 |
| TALDS-Net [3] | 67.89 ± 0.20 | 84.31 ± 0.44 | 71.34 ± 0.32 | 86.12 ± 0.33 |
| ours | **70.03 ± 0.36** | **84.74 ± 0.53** | **73.14 ± 0.57** | 86.72 ± 0.69 |
Table 2. The 5-way 1-shot and 5-way 5-shot few-shot accuracies on FC100. All results of competitors are from the original papers. The bold data indicates optimal performance.
FC100 Accuracies

| Model | 5-Way 1-Shot | 5-Way 5-Shot |
|---|---|---|
| MAML [18] | 38.1 ± 1.7 | 50.4 ± 1.0 |
| TADAM [19] | 40.1 ± 0.4 | 56.1 ± 0.4 |
| ProtoNet [4] | 37.5 ± 0.6 | 52.5 ± 0.6 |
| MetaOptNet [20] | 41.1 ± 0.6 | 55.5 ± 0.6 |
| DeepEMD [5] | **46.5 ± 0.8** | **63.2 ± 0.7** |
| Rethink-Distill [21] | 44.6 ± 0.7 | 60.1 ± 0.6 |
| MML [6] | 44.43 ± 0.2 | 59.56 ± 0.3 |
| infoPatch [8] | 43.8 ± 0.4 | 58.0 ± 0.4 |
| ours | 45.2 ± 0.3 | 60.8 ± 0.4 |
We employed tSNE plots to visualize the embeddings. Specifically, we constructed a larger episode from the target categories within the miniImageNet dataset and fed it into both the baseline model [1] and our complete model. Figure 4 presents the visualizations of the embeddings. From the figure, it is apparent that the images generated by the baseline model lack compactness, with many categories blending together, whereas our method produces images that are more compact, allowing the images of each category to cluster more effectively.
Table 3. The table compares our method with other methods on the CUB dataset. The bold data indicates optimal performance.
| Method | Backbone | CUB 1-Shot | CUB 5-Shot |
|---|---|---|---|
| Robust-20 [22] | ResNet-18 | 58.67 ± 0.7 | 75.62 ± 0.5 |
| RelationNet [23] | ResNet-18 | 67.59 ± 1.0 | 82.75 ± 0.6 |
| MAML [18] | ResNet-18 | 68.42 ± 1.0 | 83.47 ± 0.6 |
| ProtoNet [4] | ResNet-18 | 71.88 ± 0.9 | 86.64 ± 0.5 |
| Baseline++ [24] | ResNet-18 | 67.02 ± 0.9 | 83.58 ± 0.5 |
| S2M2R [25] | ResNet-34 | 72.92 ± 0.83 | 86.55 ± 0.67 |
| AA [26] | ResNet-18 | 74.22 ± 1.09 | 88.65 ± 0.55 |
| MixtFSL [27] | ResNet-18 | 73.94 ± 1.1 | 86.01 ± 0.5 |
| LaplacianShot [28] | ResNet-18 | 80.96 | 88.68 |
| RENet [11] | ResNet-12 | 79.49 ± 0.44 | 91.11 ± 0.24 |
| setfeat [29] | ResNet-12 | 79.60 ± 0.80 | 90.48 ± 0.44 |
| LRD [25] | ResNet-12 | 79.56 ± 0.87 | 90.67 ± 0.35 |
| ours | ResNet-12 | **81.33 ± 0.59** | **91.73 ± 0.37** |

5.4. Ablation Study

We conducted comparative ablation studies of M-Mix, CCS, and SF against the baseline [1]. As illustrated in Table 4, each component contributed to improvements. Specifically using the miniImageNet dataset for this analysis, we observed the significant impact of each component. Our M-Mix notably outperformed the baseline model. While individually adding CCS did not notably enhance the model, its combined utilization with M-Mix proved effective. Employing our proposed SF slightly improved the model’s generalization.
When constructing support images for contrastive learning, we divided the images into grids and applied masking to them. For this analysis, we used only the miniImageNet dataset. To ensure consistency across experiments, we divided the images into square grids, so that the number of blocks is always a perfect square. We chose grid sizes of 8 × 8, 10 × 10, and 12 × 12. Under the 8 × 8 grid, we masked 10, 16, and 20 blocks; with the 10 × 10 grid, we masked 20, 25, and 30 blocks; and for the 12 × 12 grid, we masked 30, 40, and 50 blocks. We used these synthetically created images for contrastive learning. As shown in Table 5, larger grid sizes tend to yield better results. Nevertheless, since our input size is 84 × 84, very fine grids can introduce more noise within the blocks. Opting for moderately sized grids therefore allows us to synthesize more information-rich samples and improve performance.
In the CCS module, we conducted ablation experiments on the number of random crops, specifically testing 4, 6, 8, 12, and 16 crops. Keeping other settings constant, experiments were conducted on the miniImageNet dataset as shown in Figure 5. The results indicate that the performance is optimal when using six random crops. Fewer crops result in insufficient samples in the query set, while a larger number increases dataset size but can lead to overfitting. Therefore, we choose to use six random crops to achieve the best experimental outcomes.

5.5. Limitation

Our study has several limitations that warrant consideration. First, although the datasets used in this research are quite comprehensive, they may not capture certain aspects of real-world variability, potentially limiting the generalizability of our findings. Additionally, the synthesized images generated by M-Mix may exhibit some distortion or bias compared with real data.

6. Conclusions

This paper presents an innovative approach to the challenge of few-shot image classification. Traditional methods often struggle when data are limited, prompting the use of data augmentation techniques such as random occlusion or mixup interpolation to increase the diversity and generalization of labeled samples. However, these methods have drawbacks: random occlusion can result in the loss of crucial information, while mixup interpolation can create overly homogeneous data distributions that hinder class differentiation. To overcome these issues, this paper introduces a novel data augmentation method based on saliency mask mixing. This technique uses visual feature fusion and confidence pruning to intelligently select and preserve key image features. Additionally, it employs a visual feature saliency fusion approach to evaluate the importance of different regions, guiding the fusion process to generate more diverse and nuanced images. This enhances the model’s ability to distinguish between classes. The proposed method demonstrates superior performance across multiple standard few-shot image classification datasets, outperforming current state-of-the-art methods by approximately 0.2–1%.

Author Contributions

Conceptualization, K.X., Y.G. and X.C.; methodology, K.X. and Y.G.; software, K.X. and X.C.; validation, K.X.; writing—original draft preparation, K.X.; writing—review and editing, K.X., Y.G. and Y.C.; project administration, K.X. and Y.C.; funding acquisition, K.X. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Natural Science Foundation of China under grants 61802197 and 62072449.

Data Availability Statement

The data presented in this study are openly available in Refs. [30,31,32,33].

Acknowledgments

The authors thank the anonymous reviewers and the editors for their insightful comments and helpful suggestions for improving our manuscript.

Conflicts of Interest

Author Kai Xie was employed by the company Institute of NR Electric Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9042–9051. [Google Scholar] [CrossRef]
  2. Padmanabhan, D.; Gowda, S.; Arani, E.; Zonooz, B. LSFSL: Leveraging Shape Information in Few-shot Learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 4971–4980. [Google Scholar]
  3. Qiao, Q.; Xie, Y.; Zeng, Z.; Li, F. TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification. arXiv 2023, arXiv:2312.05449. [Google Scholar]
  4. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4077–4087. [Google Scholar]
  5. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. Deepemd: Differentiable earth mover’s distance for few-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5632–5648. [Google Scholar] [CrossRef] [PubMed]
  6. Chen, H.; Li, H.; Li, Y.; Chen, C. Multi-level metric learning for few-shot image recognition. In Proceedings of the International Conference on Artificial Neural Networks, Bristol, UK, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 243–254. [Google Scholar]
  7. Mangla, P.; Kumari, N.; Sinha, A.; Singh, M.; Krishnamurthy, B.; Balasubramanian, V.N. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2218–2227. [Google Scholar]
  8. Liu, C.; Fu, Y.; Xu, C.; Yang, S.; Li, J.; Wang, C.; Zhang, L. Learning a few-shot embedding model with contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 8635–8643. [Google Scholar]
  9. Zhuo, L.; Fu, Y.; Chen, J.; Cao, Y.; Jiang, Y.G. Tgdm: Target guided dynamic mixup for cross-domain few-shot learning. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6368–6376. [Google Scholar]
  10. Bachman, P.; Alsharif, O.; Precup, D. Learning with pseudo-ensembles. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  11. Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8822–8833. [Google Scholar]
  12. Yang, Z.; Wang, J.; Zhu, Y. Few-shot classification with contrastive learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 6–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 293–309. [Google Scholar]
  13. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross attention network for few-shot classification. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  14. Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7972–7981. [Google Scholar]
  15. Xu, C.; Fu, Y.; Liu, C.; Wang, C.; Li, J.; Huang, F.; Zhang, L.; Xue, X. Learning dynamic alignment via meta-filter for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5182–5191. [Google Scholar]
  16. Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8012–8021. [Google Scholar]
  17. Fei, N.; Gao, Y.; Lu, Z.; Xiang, T. Z-score normalization, hubness, and few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 142–151. [Google Scholar]
  18. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  19. Oreshkin, B.; Rodríguez López, P.; Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  20. Zhang, X.; Meng, D.; Gouk, H.; Hospedales, T.M. Shallow bayesian meta learning for real-world few-shot recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 651–660. [Google Scholar]
  21. Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.B.; Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 266–282. [Google Scholar]
  22. Dvornik, N.; Schmid, C.; Mairal, J. Diversity with cooperation: Ensemble methods for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3723–3731. [Google Scholar]
  23. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208. [Google Scholar]
  24. Chen, Z.; Fu, Y.; Wang, Y.X.; Ma, L.; Liu, W.; Hebert, M. Image deformation meta-networks for one-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8680–8689. [Google Scholar]
  25. Yang, S.; Liu, L.; Xu, M. Free lunch for few-shot learning: Distribution calibration. arXiv 2021, arXiv:2101.06395. [Google Scholar]
  26. Afrasiyabi, A.; Lalonde, J.F.; Gagné, C. Associative alignment for few-shot image classification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 18–35. [Google Scholar]
  27. Afrasiyabi, A.; Lalonde, J.F.; Gagné, C. Mixture-based feature space learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9041–9051. [Google Scholar]
  28. Ziko, I.; Dolz, J.; Granger, E.; Ayed, I.B. Laplacian regularized few-shot learning. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 13–18 July 2020; pp. 11660–11670. [Google Scholar]
  29. Afrasiyabi, A.; Larochelle, H.; Lalonde, J.F.; Gagné, C. Matching feature sets for few-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9014–9024. [Google Scholar]
  30. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  31. Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. arXiv 2018, arXiv:1803.00676. [Google Scholar]
  32. Bertinetto, L.; Henriques, J.F.; Torr, P.H.; Vedaldi, A. Meta-learning with differentiable closed-form solvers. arXiv 2018, arXiv:1805.08136. [Google Scholar]
  33. Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J.K. This looks like that: Deep learning for interpretable image recognition. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
Figure 1. (a) represents prior work. (b) showcases images generated by our M-Mix method and SF method.
Figure 2. (a) represents the overall scheme. (b) represents our M-Mix process. After the support image undergoes feature extraction, its saliency map is generated. Random blocking is applied, and the resulting positive and negative samples are fused separately. These samples are then used for contrastive learning against the query image. (c) showcases our CCS method. After random cropping of the query image, confidence scoring occurs post-feature extraction. The highest and lowest scores are filtered out. (d) displays our SF method, distinguishing itself from M-Mix by not applying blocking to the images. The resulting labels are mixed labels.
Figure 3. The heatmaps for image-spatial correspondence are displayed. We performed inner product calculations using features generated by the network for both support and query images, visualizing the outcomes as heatmaps.
Figure 4. We visualized tSNE plots of samples from the target classes. The upper plot represents the visualization for the baseline model, while the lower plot shows our model’s visualization.
Figure 5. Comparison experiments of different crop quantities in the CCS method on the miniImageNet dataset.
Table 4. Ablation experiments on different modules of the entire model reveal the individual significance of each component. The ones without a check mark are baseline. Bold represents the best.
| M-Mix | CCS | SF | 5-Way 1-Shot | 5-Way 5-Shot |
|---|---|---|---|---|
|  |  |  | 63.17% | 78.31% |
| ✓ |  |  | 67.33% | 81.93% |
|  | ✓ |  | 64.07% | 79.39% |
|  |  | ✓ | 65.21% | 81.44% |
| ✓ | ✓ |  | 69.72% | 83.47% |
| ✓ | ✓ | ✓ | **70.43%** | **85.41%** |
Table 5. Ablation experiments on the grid size and the number of masked blocks used when constructing support images, conducted on the miniImageNet dataset.
| Grid Size | Number of Mask Blocks | Accuracy |
|---|---|---|
| 8 × 8 | 10 | 69.71% |
| 8 × 8 | 16 | 68.97% |
| 8 × 8 | 20 | 68.58% |
| 10 × 10 | 20 | 68.75% |
| 10 × 10 | 25 | 70.43% |
| 10 × 10 | 30 | 69.84% |
| 12 × 12 | 30 | 69.14% |
| 12 × 12 | 40 | 68.85% |
| 12 × 12 | 50 | 68.43% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
