Article

Few-Shot Learning Based on Dimensionally Enhanced Attention and Logit Standardization Self-Distillation

School of Digital and Intelligent Industry, Inner Mongolia University of Science and Technology, Baotou 014010, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2928; https://doi.org/10.3390/electronics13152928
Submission received: 25 June 2024 / Revised: 13 July 2024 / Accepted: 23 July 2024 / Published: 24 July 2024

Abstract
Few-shot learning (FSL) is a challenging problem. Transfer learning methods offer a straightforward and effective solution to FSL by leveraging pre-trained models and generalizing them to new tasks. However, pre-trained models often lack the ability to highlight and emphasize salient features, a gap that attention mechanisms can fill. Unfortunately, existing attention mechanisms encounter issues such as high complexity and incomplete attention information. To address these issues, we propose a dimensionally enhanced attention (DEA) module for FSL. This DEA module introduces minimal additional computational overhead while fully attending to both channel and spatial information. Specifically, the feature map is first decomposed into 1D tensors of varying dimensions using strip pooling. Next, a multi-dimensional collaborative learning strategy is introduced, enabling cross-dimensional information interactions through 1D convolutions with adaptive kernel sizes. Finally, the feature representation is enhanced by calculating attention weights for each dimension using a sigmoid function and weighting the original input accordingly. This approach ensures comprehensive attention to different dimensions of information, effectively characterizing data in various directions. Additionally, we have found that knowledge distillation significantly improves FSL performance. To this end, we implement a logit standardization self-distillation method tailored for FSL. This method addresses the issue of exact logit matching, which arises from the shared temperature in the self-distillation process, by employing logit standardization. We present experimental results on several benchmark datasets where the proposed method yields significant performance improvements.

1. Introduction

With the rapid advancement of artificial intelligence (AI) technology, deep learning techniques are continuously evolving and improving. Deep learning, a crucial component of AI, finds widespread application across various domains, including image recognition [1,2], video retrieval [3,4], and object detection [5]. Developing a robust deep learning model typically necessitates a substantial volume of data to attain optimal performance. Insufficient data may lead to overfitting, compromising the model’s generalization capability. To address this challenge and enable models to learn from limited examples, researchers have explored leveraging prior knowledge acquired from previous tasks and transferring this knowledge to new tasks. This concept has inspired the development of FSL [6], which aims to facilitate rapid learning with minimal data.
In recent years, numerous meta-learning methods have been proposed to address few-shot problems. Meta-learning, also referred to as "learn-to-learn", seeks to enable models to learn new tasks rapidly. Among the various meta-learning algorithms, metric-based methods [7,8,9,10], initialization-based methods [11,12,13], and transfer learning methods [14,15,16,17] are widely used. Metric-based methods focus on acquiring a well-defined embedding space and learning feature representations within this space through carefully designed algorithms. Initialization-based methods seek to learn an initialization of model parameters that can be adapted to other few-shot tasks. Transfer learning methods aim to obtain a pre-trained model that generalizes well to new tasks, with or without fine-tuning. Chen et al. [14] argued that simply initializing model parameters performs slightly worse than meta-learning algorithms; they pre-trained the model with base class data and extracted novel class features by fine-tuning the pre-trained model. Tian et al. [15] contended that learning a good feature embedding is more effective than complex meta-learning algorithms; they achieved few-shot learning of novel classes by training a deep learning model on the base class data without fine-tuning, using the trained model as a base learner.
However, pre-trained models lack the ability to actively focus on the key information within the feature maps. To address this limitation, it has been proposed to incorporate attention mechanisms into deep learning models [18,19,20,21]. This enhancement enables models to selectively emphasize important information. Chen et al. [18] proposed a mutual correlation network (MCNet) to explore the global correlation between feature maps using the self-attention module, which improves the FSL performance through the powerful global information capturing ability of the self-attention module. Zhao et al. [19] proposed FSL-PRS based on prototype modification using a self-attention module to extract task-relevant features from a pre-trained network. Liu et al. [20] leveraged an SE module to enhance feature extraction through channel correlations. Zhu et al. [21] utilized a CBAM module to capture channel and spatial information, enabling the model to learn channel and spatial information autonomously.
All of the above methods use the attention mechanism to improve FSL performance. However, the self-attention module achieves high accuracy through complex attention computations, resulting in significant computational overhead that severely impacts model inference speed. The SE module focuses solely on channel information while disregarding spatial details, leading to suboptimal performance. The CBAM module addresses long-range spatial dependencies using large kernel convolutions, but this approach comes at the cost of extensive computational time. To effectively leverage channel and spatial information within feature maps without compromising accuracy and to alleviate computational burden, we propose a dimensionally enhanced attention (DEA) module. Initially, the feature map is decomposed into 1D tensors along different directions using strip pooling. Subsequently, 1D convolutions with adaptive kernel sizes are employed to collaboratively learn feature information across these directions. Finally, attention weights for different directions are derived using sigmoid functions to emphasize original information. Unlike common attention mechanisms, the DEA module achieves substantial improvements in model performance with minimal additional computational overhead.
In FSL, knowledge distillation has emerged as an effective approach for model compression [15,22,23,24,25]. Tian et al. [15] proposed a simple baseline algorithm that incorporates knowledge distillation to enhance model performance. Similarly, Rizve et al. [22] introduced a novel training mechanism that enables the model to learn joint feature representations that are both invariant and isovariant. Their approach leverages knowledge distillation to further enhance training and achieve significant improvements in FSL performance.
However, while these knowledge distillation methods have shown promise in boosting FSL performance, they do have certain limitations. Specifically, existing research has demonstrated that logit-based knowledge distillation methods often share the temperature between the teacher and student models during the distillation process. This sharing of temperature results in the exact matching of logits but disregards their differences, thereby limiting the potential of the student model. To address this issue, Sun et al. [23] proposed an alternative approach. They suggest setting the temperature as a weighted standard deviation of the logits and employing logit standardization as a pre-processing step before applying the softmax function. Inspired by their work, a logit standardization self-distillation method for FSL is constructed. In this method, standardized pre-processing of the logit during self-distillation is first performed. Subsequently, the pre-processed logit is transformed into a probability vector using a softmax function with temperature. Finally, the distillation effect is enhanced by minimizing the Kullback–Leibler (KL) divergence, thereby improving the accuracy of few-shot image classification.
In summary, the contributions of our paper are as follows:
An efficient dimensionally enhanced attention (DEA) module for FSL is proposed, enabling the model to highlight and emphasize critical information. Meanwhile, a multi-dimensional collaborative learning strategy is proposed to characterize information of different dimensions by 1D convolutions with adaptive kernel sizes.
A logit standardization self-distillation method applicable to FSL is constructed. This method enhances distillation effects by standardizing logits during the self-distillation process, leading to a significant improvement in few-shot image classification accuracy.
Extensive experiments on several benchmark datasets demonstrate that the proposed method achieves competitive performance and effectively enhances FSL.
The remainder of this article is organized as follows. Section 2 summarizes the related work. Section 3 describes the proposed methodologies in detail. Section 4 presents the details and results of the experimental and ablation studies. Section 5 shows the t-SNE visualization results. Finally, Section 6 summarizes our work.

2. Related Work

2.1. Few-Shot Learning

In recent years, FSL methods have been studied intensively, and FSL algorithms can generally be grouped into three main categories.
The first category comprises metric-based methods [7,8,9,10,26,27]. Snell et al. [7] proposed ProtoNet for learning a metric space where classification relies on calculating distances to prototypical representations of each class. Lyu et al. [8] proposed compositional prototypical networks (CPNs), aimed at learning transferable prototypes for each human-annotated attribute to explore class transferability and reuse novel class features. Li et al. [9] proposed the bi-similarity network (BSNet), which learns feature mappings based on two similarity measures (Euclidean and cosine distances) for different features. Du et al. [10] put forward a new framework named ProtoDiff, which provides effective class representations by employing task-guided diffusion models in the meta-training phase to incrementally generate prototypes. All these methods share a common approach: first, learning a robust metric space; then, obtaining image embeddings in this space; and finally, computing image similarities using distance metrics.
The second category encompasses initialization-based methods [11,12,13]. Finn et al. [11] proposed MAML, which improves initialization parameters through a small number of gradient updates. It trains a base learner in an inner loop for rapid adaptation and a meta-learner in an outer loop to enhance FSL performance. Wang et al. [12] introduced a neural architecture search algorithm for FSL based on MAML. Lee et al. [13] proposed MetaOptNet, aiming to achieve efficient feature embeddings by exploring the implicit differentiation of optimality conditions for convex problems and pairwise formulations of the optimality problem. Initialization-based methods rely on fine-tuning model parameters to generalize to other few-shot tasks, which can be cumbersome and less flexible.
The third category consists of transfer learning methods [14,15,16,17,28]. Chen et al. [14] proposed a pre-training base class classifier and then transferred the pre-trained model to new tasks. Tian et al. [15] adapted to new tasks by simply pre-training the model without fine-tuning it. Sun et al. [16] introduced meta-transfer learning (MTL) to transfer the weight parameters of a deep neural network (DNN) to an FSL task, thereby enhancing FSL performance. Upadhyay et al. [17] combined MTL with two-layer meta-optimization to integrate a multi-task learning set architecture into MTL scenarios, proposing MTML. Li et al. [28] combined visual and semantic information to transfer knowledge acquired in base classes to unseen classes. Transfer learning methods operate by training a model on one or more source tasks and generalizing the model to new tasks. Our approach is most relevant to transfer learning methods.

2.2. Attention Mechanisms

The attention mechanism, inspired by human vision studies, enhances model performance by focusing on relevant information and attenuating irrelevant details. Hu et al. [29] introduced an SE module, adapting inter-channel weights to optimize task-specific performance. Woo et al. [30] proposed a CBAM module, integrating channel and spatial attention to improve feature focus across dimensions. Hou et al. [31] developed a CA module, embedding spatial directional information into channel attention for enhanced model efficacy. Ouyang et al. [32] designed an EMA module, grouping channels into sub-features to evenly distribute spatial semantic information. Vaswani et al. [33] originally proposed the self-attention module, which is now widely used in various visual tasks to dynamically adjust attentional weights to capture remote dependencies. Qin et al. [34] applied multi-instance attention networks (MIANs), combining transformers with the multi-head self-attention module to enrich feature representation. Hou et al. [35] created a CAN, employing cross-attention mappings to refine discriminative features between support and query instances. Xiao et al. [36] emphasized semantic features, using word embeddings and semantic cross-attention modules to strengthen correlations between visual and semantic features. These studies underscore the efficacy of attention mechanisms in bolstering model performance across various tasks.

2.3. Knowledge Distillation

Knowledge distillation serves as a pivotal tool for compressing a complex teacher model into a lightweight student model by transferring “dark” knowledge. Initially proposed by Hinton et al. [37], the method trains a student model by minimizing the KL divergence between the predicted probabilities of the teacher and student models. This process uses a temperature-constrained softmax function to soften labels and refine predictive probabilities. Lu et al. [24] employed knowledge distillation to transfer knowledge from a large supervised pretrained model to a smaller one. They also introduced a self-supervised knowledge distillation method for unsupervised FSL, effectively enhancing performance. Le et al. [25] introduced an FSL method called the task affinity score (TAS), which identifies related tasks via a task affinity metric and employs knowledge distillation to improve FSL outcomes. Zhou et al. [38] innovated with a cross-view contrastive learning strategy and developed a multi-head knowledge distillation technique to further boost model performance. These studies collectively demonstrate the efficacy of knowledge distillation in FSL, showcasing its utility in improving model performance across various contexts.

3. Methodology

3.1. Problem Definition

FSL consists of two learning phases in which the dataset is partitioned into two disjoint class sets, the base classes $D_B$ and the novel classes $D_N$. In the first phase, the base classes $D_B = \{(X_B, Y_B)\}$ are used for standard supervised learning to train the model, aiming to obtain a pre-trained model with good generalization performance that can be adapted to new tasks, where $X_B$ is the set of images and $Y_B$ is the corresponding set of labels. In the second phase, we follow the general setup of meta-learning methods, i.e., the N-way K-shot setup, in which we randomly sample from the novel classes $D_N$ and divide the selected samples into support and query sets. The support set $X_S = \{(X_i, Y_i)\}_{i=0}^{N \times K}$ consists of $N$ categories, each containing $K$ labeled samples, where $N$ is generally small, e.g., 1 or 5. The query set $X_Q = \{(Q_i, Y_i)\}_{i=0}^{N \times M}$ contains $M$ samples from the same $N$ categories. The goal of the second phase is to correctly classify the samples of the query set. The notations are summarized in Table 1 for convenience.
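For concreteness, the following is a minimal sketch (our illustration, not part of the original protocol) of how an N-way K-shot episode can be sampled from the novel classes; the container name images_by_class and the helper sample_episode are assumptions introduced here.

import random
import torch

def sample_episode(images_by_class, n_way=5, k_shot=1, m_query=15):
    # images_by_class: dict mapping each novel-class label to a list of image tensors
    classes = random.sample(list(images_by_class.keys()), n_way)
    support_x, support_y, query_x, query_y = [], [], [], []
    for label, cls in enumerate(classes):
        images = random.sample(images_by_class[cls], k_shot + m_query)
        support_x += images[:k_shot]
        support_y += [label] * k_shot
        query_x += images[k_shot:]
        query_y += [label] * m_query
    return (torch.stack(support_x), torch.tensor(support_y),
            torch.stack(query_x), torch.tensor(query_y))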

3.2. Method Pipeline

Our overall approach can be represented in three phases: pre-training on the base classes, self-distillation on the base classes, and testing on the novel classes. Figure 1 illustrates the overall framework of the algorithm.
The classical ResNet12 [1] is used as the feature extractor. ResNet12 consists of four consecutive residual blocks, each with three convolutional layers of convolutional kernel size 3 × 3, and at the end of the residual block, a 2 × 2 max pooling layer is applied. After the four residual blocks, there is an adaptive average pooling layer for generating feature embeddings.
The feature extractor is trained using data from the base classes. Due to the small number of samples in the novel classes, it is challenging for the pre-trained model to learn a generalized feature representation; in addition, the pre-trained model struggles to focus on the salient features in the feature map, which can easily lead to a deterioration in the generalization ability for the novel classes. In order to give the model the ability to highlight salient features, a dimensionally enhanced attention (DEA) module is proposed as a plug-and-play module, which is embedded into each residual block of ResNet12 for model performance enhancement. The whole pre-training is shown in Figure 1a, and the model is optimized using a cross-entropy loss function.
$$L_{CE} = -\frac{1}{m}\sum_{i}\left[ Y_B^i \log \hat{Y}_B^i + \left(1 - Y_B^i\right)\log\left(1 - \hat{Y}_B^i\right) \right] \qquad (1)$$
where $m$ denotes the number of categories, $i$ indexes the i-th image in the base classes, $Y_B^i$ denotes the true label, and $\hat{Y}_B^i$ denotes the predicted label.
To further enhance model performance, we incorporate self-distillation for retraining, depicted in Figure 1b. In this phase, a new technique is introduced, called logit standardization, to mitigate performance degradation caused by temperature sharing between the teacher and student models. Based on this work, a logit standardization self-distillation method for FSL is constructed. Specifically, the pre-trained model is used as the teacher model, and the knowledge is transferred from the teacher model to the structurally identical student model via self-distillation. During knowledge transfer, the temperature is set as a weighted standard deviation of logits, allowing the student model to focus more on the underlying logit relationships of the teacher model. During self-distillation, the model is trained jointly with the cross-entropy loss function and the self-distillation loss function. The entire process can be expressed as
$$L = \alpha L_{CE} + \beta L_{KD} \qquad (2)$$
where $\alpha$ and $\beta$ are weighting factors (we set $\alpha$ to 0.1 and $\beta$ to 0.9); $L_{CE}$ is the cross-entropy loss; and $L_{KD}$ is the knowledge distillation loss.
In the testing phase, we train logistic regression (LR) as a classifier through the support set, which in turn predicts the category of the query image and obtains the corresponding predicted labels, as illustrated in Figure 1c.
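A minimal sketch of this testing procedure follows, assuming a frozen feature extractor named encoder and scikit-learn's LogisticRegression as the classifier; both names are placeholders introduced here.

import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def evaluate_episode(encoder, support_x, support_y, query_x, query_y):
    encoder.eval()
    # embed support and query images with the frozen backbone
    z_support = encoder(support_x).flatten(1).cpu().numpy()
    z_query = encoder(query_x).flatten(1).cpu().numpy()
    # fit logistic regression on the support embeddings and predict the query labels
    clf = LogisticRegression(max_iter=1000).fit(z_support, support_y.numpy())
    pred = clf.predict(z_query)
    return (pred == query_y.numpy()).mean()  # episode accuracy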

3.3. Dimensionally Enhanced Attention (DEA) Module

Before introducing the DEA module, we revisit the popular attention mechanisms. We find that the SE module [29] focuses only on inter-channel feature information and fails to model spatial location information, leading to a catastrophic loss of information as shown in Figure 2a. The CBAM module [30] attempts to map spatial information using large kernel convolutions, but this introduces serious redundancy and additional computational complexity, as shown in Figure 2c. The CA module [31] employs strip pooling to encode feature maps along both horizontal and vertical directions, aiming to map spatial information to the channel dimension for modeling. However, the repeated use of 2D convolutions introduces substantial computational complexity, which hinders efficient model inference, as shown in Figure 2b. Additionally, we note that SE [29], CBAM [30], and CA [31] employ dimensionality reduction operations, which can result in serious information loss.
To address the above problems of existing attention mechanisms, an efficient DEA module is designed, as shown in Figure 2d. Specifically, the DEA module can be viewed as a computational unit that transforms an input feature tensor into an enhanced tensor representation. The input feature tensor is assumed to be $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel, height, and width, respectively. The DEA module first uses strip pooling to decompose $F$ along the three directions: channel, height, and width. This decomposition allows the feature information to be learned independently across different dimensions, while also capturing the dependencies within each dimension and modeling feature information across dimensions. By dimensionally decomposing $F$, we obtain the channel descriptor $\varphi(F) = [\varphi_1, \varphi_2, \ldots, \varphi_C] \in \mathbb{R}^{C \times 1 \times 1}$, the height descriptor $\phi(F) = [\phi_1, \phi_2, \ldots, \phi_H] \in \mathbb{R}^{1 \times H \times 1}$, and the width descriptor $\psi(F) = [\psi_1, \psi_2, \ldots, \psi_W] \in \mathbb{R}^{1 \times 1 \times W}$. Mathematically, this process can be expressed as
$$\varphi_C(c) = \frac{1}{H}\sum_{h=1}^{H}\frac{1}{W}\sum_{w=1}^{W} F(c, h, w) \qquad (3)$$
$$\phi_H(h) = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{W}\sum_{w=1}^{W} F(c, h, w) \qquad (4)$$
$$\psi_W(w) = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{H}\sum_{h=1}^{H} F(c, h, w) \qquad (5)$$
After obtaining feature descriptors for the channel, height, and width dimensions, a multi-dimensional collaborative learning strategy is proposed to fully utilize information from these dimensions. In this strategy, we employ a simple and efficient method for attention computation using 1D convolution instead of traditional 2D convolution to enhance feature representation. This choice is motivated by treating the feature descriptor obtained from the aforementioned decomposition as a 1D sequential signal, where 1D convolution excels in processing sequential signals compared to 2D convolution, while also being computationally lighter. Additionally, within this strategy, the size of the convolution kernel in 1D convolution is adaptive. It dynamically adjusts the kernel size based on the number of channels, allowing the model to flexibly determine the extent of neighborhood interaction. This adaptive approach effectively reduces computational complexity while minimizing redundant information.
Taking the channel and height dimensions as examples: in the channel dimension, we first transform the obtained $C \times 1 \times 1$ channel tensor into a $1 \times C$ tensor suitable for 1D convolution. We then consider the relationship between each channel and its $k$ neighbors, employing nonlinear local cross-channel interactions to learn correlations between channels. In the height dimension, the obtained $1 \times H \times 1$ height tensor is transformed into a $1 \times H$ tensor; considering the information differences along the height of the feature map, we learn correlations between heights through cross-height interactions, which effectively avoids interference from irrelevant height information and reduces redundancy. The learning process for the width dimension proceeds in the same manner and is similar in principle to those for the channel and height dimensions. Unlike the SE, CBAM, and CA modules, the DEA module utilizes 1D convolution to facilitate information interaction across dimensions, which effectively avoids the information loss caused by dimensionality reduction. Moreover, the DEA module involves only a minimal number of parameters, ensuring computational efficiency without added complexity.
In the process of multi-dimensional collaborative learning, the size of the convolution kernel will adaptively change according to the number of channels. This adaptation allows layers with a larger number of channels to perform more cross-dimensional interactions, thereby enriching the feature information. The specific calculations are as follows:
$$k = \left| \frac{\log_2(C) - \lambda}{\gamma} \right|_{odd} \qquad (6)$$
where we set $\gamma$ to 1.5 and $\lambda$ to 1, and $|\zeta|_{odd}$ denotes the nearest odd number less than or equal to $\zeta$.
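For illustration only, the short sketch below evaluates Equation (6) for several channel widths; the widths 64, 160, 320, and 640 are assumed here as typical ResNet12 stage widths and are not prescribed by the method.

import math

def adaptive_kernel(C, lambd=1, gamma=1.5):
    k = int(abs((math.log2(C) - lambd) / gamma))
    return k if k % 2 else k - 1  # nearest odd number <= k

for C in (64, 160, 320, 640):
    print(C, adaptive_kernel(C))  # e.g., C = 64 gives k = 3, C = 640 gives k = 5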
Learning feature information in various dimensions through 1D convolutional collaboration with an adaptive convolutional kernel size avoids dimensionality reduction and effectively captures cross-dimensional information interactions. Simultaneously, the model refines the feature representation through calculating attention across different dimensions, thereby mitigating information redundancy. The interplay of information across channels directs the model’s focus towards objects within the feature map. Similarly, information exchange across heights and widths guides the model to emphasize object positional information, facilitating the establishment of robust spatial dependencies. During the 1D convolution learning process, the following representation is employed:
$$\eta_C = f_k(\varphi_C) \qquad (7)$$
$$\eta_H = f_k(\phi_H) \qquad (8)$$
$$\eta_W = f_k(\psi_W) \qquad (9)$$
where $f_k$ denotes the 1D convolution operation and $k$, calculated by Equation (6), is the size of the adaptive convolution kernel.
Finally, the obtained 1D tensor is transformed into the tensor size after strip pooling. The weights for the different dimensions are then obtained using the sigmoid function, which are multiplied with the input tensor to produce an enhanced feature representation. The process can be represented as follows:
$$F' = F \otimes \sigma\left(\eta_C(c)\right) \otimes \sigma\left(\eta_H(h)\right) \otimes \sigma\left(\eta_W(w)\right) \qquad (10)$$
where $\sigma$ denotes the sigmoid activation function and $\otimes$ denotes element-wise multiplication with broadcasting.
We embed the DEA module into the residual block of ResNet12, as shown in Figure 3b. Additionally, we provide the pseudo-code for the DEA module to facilitate easy reproduction, as shown in Algorithm 1.
Algorithm 1. PyTorch code for our proposed DEA module
import math
import torch.nn as nn

def DEA(x, channel, lambd=1, gamma=1.5):
    # x: input features with shape [N, C, H, W]
    # lambd, gamma: parameters of the mapping function in Equation (6)
    N, C, H, W = x.size()
    kernel = int(abs((math.log(channel, 2) - lambd) / gamma))
    kernel_size = kernel if kernel % 2 else kernel - 1  # nearest odd number <= kernel
    # note: for training, this Conv1d should be created once (e.g., in a module's __init__) and reused
    conv = nn.Conv1d(1, 1, kernel_size=kernel_size, padding=(kernel_size - 1) // 2, bias=False)
    sigmoid = nn.Sigmoid()
    # channel attention: strip pooling over H and W, then 1D convolution across channels
    out_c = x.mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
    out_c = out_c.view(N, C, 1).permute(0, 2, 1)
    out_c = sigmoid(conv(out_c).permute(0, 2, 1).view(N, C, 1, 1))
    # height attention: strip pooling over C and W, then 1D convolution across heights
    out_h = x.mean(dim=1, keepdim=True).mean(dim=3, keepdim=True)
    out_h = out_h.view(N, 1, H)
    out_h = sigmoid(conv(out_h).view(N, 1, H, 1))
    # width attention: strip pooling over C and H, then 1D convolution across widths
    out_w = x.mean(dim=1, keepdim=True).mean(dim=2, keepdim=True)
    out_w = out_w.view(N, 1, W)
    out_w = sigmoid(conv(out_w).view(N, 1, 1, W))
    # weight the input by the three attention maps (broadcasting)
    return x * out_c * out_h * out_w
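Because the function above creates its Conv1d on every call, the 1D convolution weights would not be learned during training. The following is our sketch (not the authors' released code) of a minimal module wrapper that creates the convolution once and reuses it; an instance of DEAModule(out_channels) can then be applied to the feature map inside each residual block, as in Figure 3b.

import math
import torch.nn as nn

class DEAModule(nn.Module):
    def __init__(self, channel, lambd=1, gamma=1.5):
        super().__init__()
        kernel = int(abs((math.log(channel, 2) - lambd) / gamma))
        kernel_size = kernel if kernel % 2 else kernel - 1
        self.conv = nn.Conv1d(1, 1, kernel_size=kernel_size,
                              padding=(kernel_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        N, C, H, W = x.size()
        # strip pooling along each direction, shared 1D convolution, sigmoid weighting
        out_c = self.sigmoid(self.conv(x.mean(dim=(2, 3)).view(N, 1, C))).view(N, C, 1, 1)
        out_h = self.sigmoid(self.conv(x.mean(dim=(1, 3)).view(N, 1, H))).view(N, 1, H, 1)
        out_w = self.sigmoid(self.conv(x.mean(dim=(1, 2)).view(N, 1, W))).view(N, 1, 1, W)
        return x * out_c * out_h * out_w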

3.4. Logit Standardization Self-Distillation

Conventional knowledge distillation involves training a teacher model with a large amount of data and then training a smaller student model under the guidance of the teacher model [37]. However, this method requires a substantial volume of data for model training. In FSL environments, the scarcity of data poses challenges when using conventional knowledge distillation to train student models, which often leads to difficulties in adequately training models and effectively characterizing knowledge. Therefore, the conventional knowledge distillation method is not suitable for FSL.
To address these challenges, the student model is trained using a self-distillation method. In self-distillation, both the teacher model and the student model share the same architecture, mitigating the risk of model complexity. The approach allows for using a smaller teacher model to obtain a student model with the same architecture, which effectively overcomes the data scarcity challenge in few-shot environments.
In the self-distillation process, we assume that the teacher model is denoted by $f_T$ and the student model by $f_S$. The model is trained on the base classes $\{(X_B, Y_B)\}$, where $X_B \in \mathbb{R}^{C \times H \times W}$ and $Y_B \in [1, K]$ denote the image and the corresponding label, respectively. Given the input $\{(X_B, Y_B)\}$, the teacher $f_T$ and the student $f_S$ predict the logit vectors $v_B$ and $z_B$, respectively; this process can be represented as
$$v_B = f_T(X_B) \qquad (11)$$
$$z_B = f_S(X_B) \qquad (12)$$
where $X_B$ are the base class images, $f_T$ is the teacher model, and $f_S$ is the student model.
The logits are then converted into probability vectors using a softmax function with temperature $T$, where the k-th item is expressed as follows:
$$q(v_B)^{(k)} = \frac{\exp\left(v_B^{(k)} / T\right)}{\sum_{m=1}^{K}\exp\left(v_B^{(m)} / T\right)} \qquad (13)$$
$$q(z_B)^{(k)} = \frac{\exp\left(z_B^{(k)} / T\right)}{\sum_{m=1}^{K}\exp\left(z_B^{(m)} / T\right)} \qquad (14)$$
where $v_B^{(k)}$ and $z_B^{(k)}$ denote the k-th items of $v_B$ and $z_B$, respectively. Finally, the student model is trained by minimizing the KL divergence, which can be expressed as
$$L_{KL}\left(q(v_B) \,\|\, q(z_B)\right) = \sum_{k=1}^{K} q(v_B)^{(k)} \log\left(\frac{q(v_B)^{(k)}}{q(z_B)^{(k)}}\right) \qquad (15)$$
where $v_B$ and $z_B$ denote the prediction logit vectors of the teacher and student models, respectively.
In logit-based self-distillation, the temperature $T$ is shared between the teacher and student models, disregarding the fact that the appropriate temperature could differ between the teacher and the student, as well as across different samples. To tackle this problem, a weighted logit standard deviation is introduced as an adaptive temperature, and logit standardization is used as a pre-processing step to map an arbitrary range of logits into a bounded range [23]. This approach allows the student logits to take arbitrary ranges and share variances while effectively maintaining the inherent relationships with the teacher logits. Specifically, we compute the mean and standard deviation of the logit vector $x$, which can be expressed as
$$\bar{x} = \frac{1}{K}\sum_{k=1}^{K} x^{(k)} \qquad (16)$$
$$\sigma(x) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(x^{(k)} - \bar{x}\right)^{2}} \qquad (17)$$
where $x$ denotes the logit vector.
By calculating the mean and standard deviation of any logit vector $x$ through Equations (16) and (17), the logit standardization can be expressed as $\omega(x)$:
$$\omega(x) = \frac{x - \bar{x}}{\sigma(x)} \qquad (18)$$
The standardized logit is then converted into a probability vector using a softmax function with temperature T , which can be expressed as
$$q\left(z_B; \bar{z}_B, \sigma(z_B)\right)^{(k)} = \frac{\exp\left(\omega(z_B / T)^{(k)}\right)}{\sum_{m=1}^{K}\exp\left(\omega(z_B / T)^{(m)}\right)} \qquad (19)$$
where $T$ is a fundamental temperature parameter and $\omega$ denotes the standardization function $\omega(x)$ defined in Equation (18).
Based on the above theoretical analysis, a logit standardization self-distillation method suitable for FSL is constructed. Figure 4 illustrates this process: the teacher model is trained on the base class data with the cross-entropy loss function in the first phase, and the knowledge is then transferred from the teacher model to the student model through self-distillation. The student model undertakes two tasks in the logit standardization self-distillation process. The first task computes the classification loss directly via the cross-entropy loss, which helps the model converge quickly; the student's output in this task is referred to as a hard prediction, since the softmax function is not softened, i.e., the temperature $T = 1$. In the second task, the logit vectors from both the teacher and student models are standardized, and the standardized logits are converted into probability vectors using the softmax function with temperature. The teacher's predictions are called soft labels, and the student's predictions are referred to as soft predictions. Finally, the KL divergence is minimized to compute the self-distillation loss, ensuring that the predictions of the student are closer to those of the teacher. Throughout the process, we jointly optimize the model with the cross-entropy loss from the first task and the self-distillation loss from the second task, balanced by the hyper-parameters $\alpha$ and $\beta$. Accordingly, Equation (2) can be rewritten as
$$L = \alpha L_{CE}\left(Y_B, f_S(X_B)\right) + \beta T^{2} L_{KL}\left(q(v_B), q(z_B)\right) \qquad (20)$$
where $\alpha$ and $\beta$ are weighting factors (we set $\alpha$ to 0.1 and $\beta$ to 0.9); $T$ is a fundamental temperature parameter; $L_{CE}$ is the cross-entropy loss; and $L_{KL}$ is the KL divergence.
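The following is a minimal sketch of this loss under our reading of [23], in which the per-sample standardization divides by the weighted standard deviation $T \cdot \sigma$; the function names and the eps term are assumptions introduced here, not the authors' code.

import torch
import torch.nn.functional as F

def standardize(logits, T, eps=1e-7):
    # Equations (16)-(18), with the weighted standard deviation T * sigma acting as the temperature
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True, unbiased=False)
    return (logits - mean) / (T * std + eps)

def self_distillation_loss(v_B, z_B, labels, T=4.0, alpha=0.1, beta=0.9):
    # v_B, z_B: teacher and student logits of shape [batch, K]; labels: ground-truth classes
    ce = F.cross_entropy(z_B, labels)                            # task 1: hard prediction (T = 1)
    p_teacher = F.softmax(standardize(v_B, T), dim=-1)           # soft labels
    log_p_student = F.log_softmax(standardize(z_B, T), dim=-1)   # soft predictions
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return alpha * ce + beta * (T ** 2) * kd                     # Equation (20)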

4. Experiments

4.1. Datasets

We conduct experiments on two benchmark datasets, miniImageNet [39] and CIFAR-FS [40].
miniImageNet. The miniImageNet dataset is derived from ImageNet and consists of 60,000 RGB images distributed across 100 categories. Each category contains 600 images, each with a resolution of 84 × 84 pixels. Following [41], we divide the dataset into 64 training classes, 16 validation classes, and 20 test classes.
CIFAR-FS. The CIFAR-FS dataset is derived from CIFAR-100 and comprises 100 categories, each with 600 images. The images in this dataset are smaller, with a resolution of 32 × 32 pixels, which presents a challenge for classification tasks. We randomly divide it into 64 training classes, 16 validation classes, and 20 test classes.

4.2. Implementation Details

Our experimental platform runs on a computer with Windows 10 and an NVIDIA GeForce GTX 1080 Ti GPU (MSI, Taiwan, China). All code is implemented in PyTorch version 1.10.1. We use ResNet12 [1] as the backbone network and optimize the model using the SGD optimizer with a momentum of 0.9 and weight decay set to 5 × 10−4. During training on all datasets, we train the model for 100 epochs with a batch size of 64. The learning rate is initialized at 0.05 and is decayed by a factor of 0.1 at the 60th and 80th epochs. In the self-distillation phase, we maintain the same experimental setup and set α to 0.1 and β to 0.9. For the testing phase, we randomly sample 600 tasks to train logistic regression as a classifier. We use two popular FSL settings, the 5-way 1-shot and the 5-way 5-shot, and report classification results with 95% confidence intervals.
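A condensed sketch of this optimization setup is given below; the model and train_loader objects are assumed to be defined elsewhere, and the loop shows only the pre-training loss.

import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=5e-4)
# decay the learning rate by a factor of 0.1 at the 60th and 80th of 100 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    for images, labels in train_loader:  # batch size 64
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()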

4.3. Comparison with State-of-the-Art Methods

Following the experimental setup, we conduct experiments on the miniImageNet and CIFAR-FS datasets. The results are shown in Table 2. We select ProtoNet [7], MAML [11], Baseline++ [14], MetaOptNet [13], MTL [16], Hyper ProtoNet [26], RFS-simple [15], RFS-distill [15], DSN [27], Meta-Baseline [42], TAS-simple [25], TAS-distill [25], MIAN [34], SENet [43], KT [44], and EFTS [45] as comparison methods. Among them, ProtoNet [7], Hyper ProtoNet [26], and DSN [27] are metric-based methods; Baseline++ [14], MTL [16], RFS-simple [15], and Meta-Baseline [42] are commonly used baseline methods in FSL, similar to transfer learning principles; MIAN [34] employs an attention mechanism to improve model performance, similar to our approach; MetaOptNet [13] and MAML [11] are initialization-based methods; and SENet [43], KT [44], and EFTS [45] are state-of-the-art FSL methods from recent years. For fairness, the results of these comparison methods are taken from those reported in the literature. As in previous works, the average accuracy is adopted to assess the effectiveness of all the FSL methods in the 5-way 1-shot and 5-way 5-shot settings.
First, compared to the four baseline methods (Baseline++ [14], MTL [16], RFS-simple [15], and Meta-Baseline [42]), our method improves performance by 13.35%, 6.12%, 5.3%, and 4.15%, respectively, in the 5-way 1-shot setting on miniImageNet. In the 5-way 5-shot setting, the improvements are 6.86%, 7.26%, 3.12%, and 3.5%, respectively. Similar performance gains are observed on CIFAR-FS. ProtoNet [7], Hyper ProtoNet [26], and DSN [27] aim to learn a distance metric suitable for FSL, and our method achieves the best average accuracy on miniImageNet and CIFAR-FS compared to these metric-based methods. MetaOptNet [13] and MAML [11] improve FSL performance by initializing task-dependent parameters, and our approach outperforms these initialization-based methods, achieving the best average accuracy on both datasets.
Second, our method achieves the best average accuracy on both datasets compared to RFS-distill [15] and TAS-distill [25], two methods using knowledge distillation. MIAN [34] employs a self-attention mechanism to enhance FSL performance, but our DEA module is simple and effective compared to self-attention. In terms of classification accuracy, it surpasses that achieved by MIAN.
Finally, our method is superior to the state-of-the-art methods SENet [43], KT [44], and EFTS [45]. SENet [43] balances prototype and example representations through spectral filtering, KT [44] proposes an effective data enhancement strategy for improving FSL performance, and EFTS [45] introduces an episodic free task selection strategy for choosing tasks with the highest affinity scores for co-training a meta-learner. Compared to these state-of-the-art methods, our method remains highly competitive, achieving the best average accuracy on both datasets.

4.4. Ablation Studies

We conduct ablation studies to evaluate core components of our methods: the DEA module and logit standardization self-distillation. Table 3 shows the experimental results of the ablation experiments on miniImageNet and CIFAR-FS under 5-way 1-shot and 5-way 5-shot settings.
Through these ablation experiments, we can find that the DEA module can effectively improve the model performance. In the 5-way 1-shot setting, average accuracies on miniImageNet and CIFAR-FS are improved by 2.47% and 1.60%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 1.61% and 1.05%, respectively. The DEA module can significantly improve the FSL performance by focusing on learning channel and spatial information in the full dimension, thereby enhancing feature representation. Additionally, the logit standardization self-distillation method also improves model performance. In the 5-way 1-shot setting, the average accuracies on both datasets are improved by 3.03% and 2.12%, respectively. In the 5-way 5-shot setting, the average accuracies are improved by 1.93% and 1.38%, respectively. These ablation experiments demonstrate the effectiveness of the DEA module and logit standardization self-distillation.
Most importantly, combining the two methods further demonstrates that the DEA module does not conflict with logit standardization self-distillation. Instead, using both methods simultaneously achieves maximum performance gains. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 4.01% and 3.71%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 3.03% and 2.29%, respectively, which is a significant performance improvement.

4.5. Performance Analysis of DEA Module

This section consists of two subsections in which we compare the DEA module with state-of-the-art attention methods and validate the effect of different components of the DEA module on model performance.

4.5.1. Comparison with State-of-the-Art Attention Mechanisms

To validate the effectiveness of the proposed DEA module, we compare it with several state-of-the-art attention methods, including the SE [29], CBAM [30], CA [31], and EMA [32] modules. For fairness, we use the same experimental environment and setup for all experiments, set the reduction rates of the SE [29], CBAM [30], CA [31], and EMA [32] modules to 4, 32, 16, and 8, respectively, and use the same ResNet12 backbone network. We report the 1-shot accuracy, 5-shot accuracy, number of parameters, FLOPs, and inference speed. The experimental results are shown in Table 4 and Table 5. From these results, it can be seen that adding an attention mechanism to the baseline improves model performance on both datasets, indicating that attention mechanisms are applicable to FSL and can enhance classification accuracy. Our proposed DEA module achieves the best results, with an overall performance improvement on both datasets. On miniImageNet, the DEA module improves 1-shot accuracy by approximately 1.13%, 0.89%, 1.69%, and 1.88% over the SE [29], CBAM [30], CA [31], and EMA [32] modules, respectively, and improves 5-shot accuracy by 0.84%, 0.65%, 0.71%, and 1.18%, respectively. Additionally, the DEA module hardly increases the number of parameters of the model and has the fewest FLOPs, indicating that it offers significant advantages.
It is important to note that the SE module [29], lacking attention to spatial information, achieves faster inference but at the cost of lower accuracy. In contrast, the CBAM [30], CA [31], EMA [32], and DEA modules all incorporate spatial information, which reduces inference speed. Among these, the DEA module demonstrates the fastest inference speed, suggesting it as the most efficient in attention computation. The CBAM [30], CA [31] and EMA [32] modules reduce channel dimensions through reduction rates during feature mapping, potentially leading to information loss due to reduced information encoding, while the DEA module employs multi-dimensional collaborative learning, learning different dimensions of information separately, which can effectively reduce the redundancy of the information and prevent information loss caused by dimensionality reduction. Moreover, the DEA module utilizes 1D convolution, reducing computational complexity compared to the CBAM module [30] and CA module [31] which use 2D convolutions. Notably, 1D convolution in the DEA module allows adaptive convolution kernel sizes, accommodating various receptive field sizes during feature map compression. This capability helps model long-range spatial dependencies while reducing memory costs.
To visually demonstrate the DEA module’s advantages, we employ the Grad-CAM tool [46] to generate gradient backpropagation heat maps. Figure 5 shows the visualization results of different modules under the 5-way 1-shot setting. The visualization shows that the DEA module outperforms others by accurately focusing on regions of interest across different categories. It effectively highlights and emphasizes broader areas compared to other modules, which contributes to improving the model performance.
In summary, the proposed DEA module enhances feature representation across different dimensions efficiently and performs attention computation effectively. Importantly, it achieves this with a minimal increase in model parameters, effectively reducing model complexity.

4.5.2. Ablation Analysis of Different Components in DEA Module

To verify the effectiveness of multi-dimensional collaborative learning in the DEA module, we conduct ablation experiments on its different components. We decompose the DEA module into two dimensions of attention: channel and spatial. The channel dimension attention consists of channel attention, while the spatial dimension attention includes attention in both the height and width directions. The results of these experiments on miniImageNet and CIFAR-FS are presented in Table 6.
Our findings indicate that channel dimension attention improves model performance, albeit modestly. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 0.18% and 0.46%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 0.14% and 0.50%, respectively. Although these gains are not substantial, they suggest that channel attention has a positive impact. However, spatial dimension attention improves the model performance significantly. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 0.78% and 1.05%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 0.52% and 0.73%, respectively. These results underscore the critical importance of focusing on spatial information for improving model performance. Further, we synergize the channel dimension and spatial dimension information for learning and improve the average accuracy by 2.47% and 1.60% in both datasets in the 5-way 1-shot setting, and 1.61% and 1.05% in the 5-way 5-shot setting. These results clearly demonstrate that integrating information from both channel and spatial dimensions is essential and that a multi-dimensional collaborative learning strategy is highly effective.
To provide a more intuitive representation of the influence of different dimensions of information attention on model performance, Figure 6 shows the heat map generated by the Grad-CAM tool [46], visualizing different dimensions of attention under the 5-way 1-shot setting. The visualization indicates that a singular focus on one dimension of information leads the model to concentrate on non-comprehensive information, potentially resulting in information loss. Conversely, a comprehensive focus on all dimensions enriches the feature information, enabling the model to better highlight and focus on salient regions.

4.6. Logit Standardization Self-Distillation Experimental Analysis

This section comprises three subsections. First, we verify the effect of the temperature T on self-distillation. Next, we explore the validity of the logit standardization. Finally, we investigate the impact of different values of the hyper-parameters α and β in the loss L on the experimental results.

4.6.1. Effect of Temperature T on Self-Distillation

To analyze the effect of the temperature T on model performance during the logit standardization self-distillation process, we conduct ablation experiments on the miniImageNet and CIFAR-FS datasets. The experimental results are shown in Figure 7. We observe that logit standardization self-distillation is insensitive to the temperature T. The best performance on miniImageNet is achieved when the temperature T = 4, while the optimal performance on CIFAR-FS occurs at T = 6. Based on these findings, we set T to 4 for miniImageNet and 6 for CIFAR-FS, unless otherwise stated.

4.6.2. Effect of Logit Standardization on Self-Distillation

To validate the effect of logit standardization on the distillation effect during self-distillation, we conduct experimental validation on miniImageNet and CIFAR-FS. The experimental results are shown in Table 7. According to our findings, the inclusion of logit standardization leads to performance improvements on both datasets. On miniImageNet, the 1-shot accuracy increases by 0.76% and the 5-shot accuracy by 1.04%. On CIFAR-FS, the 1-shot and 5-shot accuracies improve by 0.96% and 0.89%, respectively. These results demonstrate that logit standardization significantly enhances the distillation effect and improves model performance. Moreover, self-distillation based on logit standardization is suitable for FSL and can substantially enhance few-shot classification accuracy.

4.6.3. Validity Analysis of Hyper-Parameters

To investigate the impact of the hyper-parameters α and β in the loss function L of logit standardization self-distillation on model performance, we conduct experimental validation on miniImageNet and CIFAR-FS. Figure 8 shows the results for different values of the hyper-parameter α. Note that all experiments follow the setup α + β = 1. As the value of α increases, the value of β decreases, meaning that the weight of the cross-entropy loss gradually increases while the weight of the self-distillation loss gradually decreases. We observe that the model's performance fluctuates during this process; however, overall, as α keeps increasing, the model's performance exhibits a decreasing trend. This suggests that, in the logit standardization self-distillation process, the self-distillation loss has a greater impact on model performance than the cross-entropy loss. Ultimately, for both datasets, the best results are obtained when α = 0.1 and β = 0.9.

5. t-SNE Visualization

To verify the effectiveness of the proposed method in FSL, we conduct a visualization analysis using the t-SNE visualization tool [47]. Each subfigure in Figure 9 shows the visualization results for a randomly sampled 5-way 1-shot task from each dataset, with Figure 9a showing the results for miniImageNet and Figure 9b for CIFAR-FS. From this figure, it is evident that without any additional methods, the boundaries between categories are fuzzy and the cluster distributions are more dispersed, resulting in poor clustering performance. However, when the proposed DEA module is incorporated, significant improvements are observed on both datasets. The class boundaries become more distinct, indicating that the DEA module enhances the discriminative properties between classes, which helps improve model performance. Further improvements are noted when logit standardization self-distillation is applied: the clustering effect is optimal on both datasets, with the clearest class boundaries. This demonstrates that logit standardization self-distillation can further improve model performance in favor of FSL. The t-SNE visualization convincingly shows that the proposed method significantly improves model performance, making the categories more distinguishable, which is beneficial for few-shot image classification tasks.
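A minimal sketch of how such a t-SNE plot can be produced follows, assuming features (an [n, d] array of embeddings from the trained backbone) and labels for a sampled task are available; both names are placeholders.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    # project the high-dimensional embeddings to 2D and color points by class
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
    plt.axis("off")
    plt.show()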

6. Conclusions

Few-shot learning (FSL) is a crucial approach for addressing data scarcity and expensive data labeling. FSL aims to learn quickly from limited labeled data, and transfer learning methods are commonly employed in FSL. To enhance the performance of pre-trained models, we propose an efficient dimensionally enhanced attention (DEA) module. The DEA module decomposes the feature map into tensors of different orientations through strip pooling, allowing the model to capture feature information efficiently and highlight salient features using a multi-dimensional collaborative learning strategy, which learns cross-dimensional information interactions through 1D convolutions with adaptive kernel sizes. Furthermore, we construct a logit standardization self-distillation method applicable to FSL. By employing logit standardization, we avoid the exact logit matching caused by a shared temperature, thereby enhancing the self-distillation effect. We conduct comprehensive experiments on few-shot benchmark datasets to evaluate the feasibility of the proposed method, and the results demonstrate that it significantly improves FSL performance. However, the proposed method is implemented in three stages (pre-training on the base classes, self-distillation on the base classes, and testing on the novel classes), which can be cumbersome, and it has not yet been validated in other few-shot domains. In future work, we will optimize the proposed method. In addition, the proposed method may benefit other FSL tasks, and we plan to explore its feasibility in cross-domain few-shot image classification tasks.

Author Contributions

Methodology, Y.T.; Software, Y.T.; Validation, Y.T.; Formal analysis, Y.T.; Investigation, G.L. and M.Z.; Writing—original draft, Y.T.; Writing—review & editing, J.L.; Funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the National Natural Science Foundation of China (grant number: 62066036) and supported by the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (grant numbers: NJYT22074); the Basic scientific research business fee project for directly affiliated universities in Inner Mongolia Autonomous Region (grant numbers: 2023XKJX020); and the Inner Mongolia Autonomous Region Natural Science Foundation (grant numbers: 2022MS06009).

Data Availability Statement

Our experimental data are open-source, and it is a dataset commonly used in few-shot image classification. These datasets can be obtained on Kaggle. The specific addresses are as follows: miniImagenet can be obtained from https://www.kaggle.com/datasets/arjunashok33/miniimagenet (accessed on 18 April 2024). CIFAR-FS can be obtained from https://www.kaggle.com/datasets/seba1511/cifarfs (accessed on 18 April 2024). You can obtain the data from the author Yuhong Tang ([email protected]) on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  2. Jiang, S.; Zhu, Y.; Liu, C.; Song, X.; Li, X.; Min, W. Dataset bias in few-shot image recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 229–246. [Google Scholar] [CrossRef] [PubMed]
  3. Zhu, B.; Flanagan, K.; Fragomeni, A.; Wray, M.; Damen, D. Video Editing for Video Retrieval. arXiv 2024, arXiv:2402.02335. [Google Scholar]
  4. Xian, Y.; Korbar, B.; Douze, M.; Torresani, L.; Schiele, B.; Akata, Z. Generalized few-shot video classification with video retrieval and feature generation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8949–8961. [Google Scholar] [CrossRef] [PubMed]
  5. Xin, Z.; Chen, S.; Wu, T.; Shao, Y.; Ding, W.; You, X. Few-shot object detection: Research advances and challenges. Inf. Fusion 2024, 107, 102307. [Google Scholar] [CrossRef]
  6. Li, F.-F. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nice, France, 13–16 October 2003; pp. 1134–1141. [Google Scholar]
  7. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  8. Lyu, Q.; Wang, W. Compositional Prototypical Networks for Few-Shot Classification; Association for the Advancement of Artificial Intelligence (AAAI): Washington, DC, USA, 2023; pp. 9011–9019. [Google Scholar]
  9. Li, X.; Wu, J.; Sun, Z.; Ma, Z.; Cao, J.; Xue, J.H. BSNet: Bi-similarity network for few-shot fine-grained image classification. IEEE Trans. Image Process. 2020, 30, 1318–1331. [Google Scholar] [CrossRef] [PubMed]
  10. Du, Y.; Xiao, Z.; Liao, S.; Snoek, C. ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. In Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
  11. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  12. Wang, H.; Wang, Y.; Sun, R.; Li, B. Global convergence of maml and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9797–9808. [Google Scholar]
  13. Lee, K.; Maji, S.; Ravichandran, A.; Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10657–10665. [Google Scholar]
  14. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A closer look at few-shot classification. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  15. Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.B.; Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 266–282. [Google Scholar]
  16. Sun, Q.; Liu, Y.; Chua, T.S.; Schiele, B. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 403–412. [Google Scholar]
  17. Upadhyay, R.; Chhipa, P.C.; Phlypo, R.; Saini, R.; Liwicki, M. Multi-task meta learning: Learn how to adapt to unseen tasks. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–10. [Google Scholar]
  18. Chen, D.; Chen, F.; Ouyang, D.; Shao, J. Mutual Correlation Network for few-shot learning. Neural Netw. 2024, 175, 106289. [Google Scholar] [CrossRef] [PubMed]
  19. Zhao, P.; Wang, L.; Zhao, X.; Liu, H.; Ji, X. Few-shot learning based on prototype rectification with a self-attention mechanism. Expert Syst. Appl. 2024, 249, 123586. [Google Scholar] [CrossRef]
  20. Liu, Y.; Zhang, H.; Yang, Y. Few-shot image classification based on asymmetric convolution and attention mechanism. In Proceedings of the 2022 4th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 25–27 March 2022; pp. 217–222. [Google Scholar]
  21. Zhu, Y.; Liu, C.; Jiang, S. Multi-attention Meta Learning for Few-shot Fine-grained Image Recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence IJCAI-20, Virtual, 7–15 January 2021; pp. 1090–1096. [Google Scholar]
  22. Rizve, M.N.; Khan, S.; Khan, F.S.; Shah, M. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10836–10846. [Google Scholar]
  23. Sun, S.; Ren, W.; Li, J.; Wang, R.; Cao, X. Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15731–15740. [Google Scholar]
  24. Lu, Y.; Wen, L.; Liu, J.; Liu, Y.; Tian, X. Self-supervision can be a good few-shot learner. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 740–758. [Google Scholar]
  25. Le, C.P.; Dong, J.; Soltani, M.; Tarokh, V. Task affinity with maximum bipartite matching in few-shot learning. arXiv 2021, arXiv:2110.02399. [Google Scholar]
  26. Khrulkov, V.; Mirvakhabova, L.; Ustinova, E.; Oseledets, I.; Lempitsky, V. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6418–6428. [Google Scholar]
  27. Simon, C.; Koniusz, P.; Nock, R.; Harandi, M. Adaptive subspaces for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4136–4145. [Google Scholar]
  28. Li, M.; Wang, R.; Yang, J.; Xue, L.; Hu, M. Multi-domain few-shot image recognition with knowledge transfer. Neurocomputing 2021, 442, 64–72. [Google Scholar] [CrossRef]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  31. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  32. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  34. Qin, Z.; Wang, H.; Mawuli, C.B.; Han, W.; Zhang, R.; Yang, Q.; Shao, J. Multi-instance attention network for few-shot learning. Inf. Sci. 2022, 611, 464–475. [Google Scholar] [CrossRef]
  35. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross attention network for few-shot classification. In Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  36. Xiao, B.; Liu, C.L.; Hsaio, W.H. Semantic Cross Attention for Few-shot Learning. In Proceedings of the Asian Conference on Machine Learning (ACML), Istanbul, Türkiye, 11–14 November 2023; pp. 1165–1180. [Google Scholar]
  37. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  38. Zhou, J.; Cai, Q. Global and local representation collaborative learning for few-shot learning. J. Intell. Manuf. 2024, 35, 647–664. [Google Scholar] [CrossRef]
  39. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; pp. 3630–3638. [Google Scholar]
  40. Bertinetto, L.; Henriques, J.F.; Torr, P.H.S.; Vedaldi, A. Meta-learning with differentiable closed-form solvers. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  41. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  42. Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 9062–9071. [Google Scholar]
  43. Zhang, T.; Huang, W. SENet: A Spectral Filtering Approach to Represent Exemplars for Few-shot Learning. arXiv 2024, arXiv:2305.18970. [Google Scholar]
  44. Li, P.; Liu, F.; Jiao, L.; Li, S.; Li, L.; Huang, X. Knowledge transduction for cross-domain few-shot learning. Pattern Recognit. 2023, 141, 109652. [Google Scholar] [CrossRef]
  45. Zhang, T. Episodic-free Task Selection for Few-shot Learning. arXiv 2024, arXiv:2402.00092. [Google Scholar]
  46. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  47. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Overview of our proposed FSL approach. (a) The DEA module is embedded into the model, which is pre-trained on the base class data and optimized with the cross-entropy loss; "FC" denotes the adaptive average pooling layer. (b) Knowledge is transferred from the pre-trained model to a new model with the same structure via logit standardization self-distillation; the new model is optimized by combining the cross-entropy loss and the self-distillation loss. (c) Using the support set of the novel classes, we train a logistic regression (LR) classifier and predict the categories of the query set.
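For concreteness, the evaluation stage in Figure 1c amounts to fitting a logistic regression classifier on frozen support-set embeddings and scoring the query set of the same episode. The snippet below is an illustrative sketch only, not the authors' released code; `embed_fn`, `support_x`, `support_y`, `query_x`, and `query_y` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_episode(embed_fn, support_x, support_y, query_x, query_y):
    # Embed support and query images with the frozen, self-distilled backbone.
    z_support = np.stack([embed_fn(x) for x in support_x])   # (N*K, d)
    z_query = np.stack([embed_fn(x) for x in query_x])       # (N*M, d)

    # Train a simple LR classifier on the support embeddings only.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(z_support, support_y)

    # Accuracy on the query set of the episode.
    return (clf.predict(z_query) == np.asarray(query_y)).mean()
```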
Figure 2. Structural diagrams of the different attention mechanisms: (a) SE module, (b) CA module, (c) CBAM module, and (d) the proposed dimensionally enhanced attention (DEA) module, where "X Avg pool" and "Y Avg pool" denote 1D horizontal global pooling and 1D vertical global pooling, respectively; "BN" denotes batch normalization; "GAP" and "GMP" denote global average pooling and global max pooling, respectively; and "C Avg pool", "H Avg pool", and "W Avg pool" denote the average pooling in the channel direction, height direction, and width direction, respectively.
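The layer-level definition of the DEA module is given in the methods section of the paper. Purely as an illustration of the structure sketched in Figure 2d (directional average pooling, a lightweight 1D convolution per direction, and sigmoid gates that reweight the input), a minimal PyTorch-style sketch might look as follows. The fixed kernel size `k` and the omission of the cross-dimensional interaction step are simplifying assumptions of this sketch, not the proposed design.

```python
import torch
import torch.nn as nn

class DEASketch(nn.Module):
    """Illustrative sketch only (not the authors' code): attention weights are
    computed along the channel, height, and width directions by average-pooling
    the other two dimensions, filtering each 1D descriptor with a 1D
    convolution, and gating the input with sigmoid weights."""
    def __init__(self, k: int = 3):
        super().__init__()
        # One lightweight 1D conv per direction; k stands in for the
        # adaptively chosen kernel size described in the paper.
        self.conv_c = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_h = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_w = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Strip pooling: collapse two dimensions, keep one 1D descriptor each.
        dc = x.mean(dim=(2, 3)).unsqueeze(1)   # (B, 1, C)
        dh = x.mean(dim=(1, 3)).unsqueeze(1)   # (B, 1, H)
        dw = x.mean(dim=(1, 2)).unsqueeze(1)   # (B, 1, W)
        ac = torch.sigmoid(self.conv_c(dc)).view(b, c, 1, 1)
        ah = torch.sigmoid(self.conv_h(dh)).view(b, 1, h, 1)
        aw = torch.sigmoid(self.conv_w(dw)).view(b, 1, 1, w)
        # Reweight the input along each dimension.
        return x * ac * ah * aw
```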
Figure 3. Residual structure diagram. (a) Original residual block; (b) DEA-based residual block.
Figure 4. Overview of logit standardization self-distillation. We train the teacher and student models on the base class data, where the teacher f T and the student f S share the same structure and are both ResNet-12 models. During self-distillation, we standardize the logits and optimize the student model by combining the cross-entropy loss and the self-distillation loss.
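A compact sketch of the objective described in Figure 4 is given below, assuming z-score standardization of the logits over the class dimension before temperature-scaled softmax, in the spirit of logit standardization [23]. The variable names, the temperature T, and the weights alpha and beta are placeholders rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def standardize(logits: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Z-score standardization of logits over the class dimension."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)

def self_distill_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 1.0, beta: float = 1.0):
    """Cross-entropy on the ground truth plus a KL term between the
    temperature-scaled, standardized teacher and student distributions."""
    ce = F.cross_entropy(student_logits, labels)

    s = F.log_softmax(standardize(student_logits) / T, dim=-1)
    t = F.softmax(standardize(teacher_logits).detach() / T, dim=-1)
    kd = F.kl_div(s, t, reduction="batchmean") * (T ** 2)

    return alpha * ce + beta * kd
```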
Figure 5. Grad-CAM visualization results under the 5-way 1-shot setting. We visualize four methods: SE [29], CBAM [30], CA [31], and DEA. The DEA module focuses more closely on the regions of interest in the feature map and covers a wider area.
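Maps such as those in Figure 5 can be produced with standard Grad-CAM [46]. The snippet below is a generic sketch using PyTorch hooks, with `model`, `target_layer`, and `image` as placeholder arguments; it is not the visualization code used for the figure.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of the class score, then apply ReLU."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image)                        # image: (1, 3, H, W)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        a, g = acts[0], grads[0]                     # (1, C, H', W')
        weights = g.mean(dim=(2, 3), keepdim=True)   # GAP over spatial dims
        cam = F.relu((weights * a).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.squeeze().detach()
    finally:
        h1.remove()
        h2.remove()
```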
Figure 6. Grad-CAM visualization results under the 5-way 1-shot setting. "Channel" denotes channel dimensional attention, "Spatial" denotes spatial dimensional attention, and "Full" represents the combination of both the channel and spatial dimensions of attention.
Figure 7. Logit standardization self-distillation temperature T with values ranging from 2 to 7. We validated the effect of temperature T on model performance on miniImageNet and CIFAR-FS.
Figure 8. Ablation experiments on the effect of the hyper-parameters α and β in the self-distillation loss L. The effects of the hyper-parameters are verified on miniImageNet and CIFAR-FS.
Figure 9. t-SNE visualization effect plots for each subfigure: (i) the effect plot obtained at baseline; (ii) the effect plot obtained when the DEA module is added; (iii) the effect plot obtained when the DEA module and logit standardization self-distillation are incorporated.
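Plots of this kind can be reproduced with scikit-learn's t-SNE [47] on the extracted embeddings. A minimal sketch is shown below, where `features` and `labels` are assumed to be the backbone embeddings and class labels of the sampled novel-class images.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="t-SNE of novel-class embeddings"):
    # Project the high-dimensional embeddings to 2D for visualization.
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.axis("off")
    plt.show()
```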
Table 1. Definition of mathematical notations.

Notation | Definition
D_B | Base classes
D_N | Novel classes
{(X_B, Y_B)} | Base class images and corresponding labels
X_S = {(X_i, Y_i)}_{i=0}^{N×K} | Support set
X_Q = {(Q_i, Y_i)}_{i=0}^{N×M} | Query set
N | The number of classes
K | The number of samples in each class of the support set
M | The number of samples in each class of the query set
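The notation above corresponds to the standard N-way K-shot episode construction. As a minimal illustration (not the authors' data pipeline), an episode with N classes, K support samples, and M query samples per class could be drawn as follows, where `images_by_class` is a hypothetical mapping from class name to a list of images.

```python
import random

def sample_episode(images_by_class, N=5, K=1, M=15):
    """Sample an N-way episode: K support and M query images per class."""
    classes = random.sample(list(images_by_class), N)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = random.sample(images_by_class[cls], K + M)
        support += [(img, label) for img in picks[:K]]   # X_S: N*K pairs
        query += [(img, label) for img in picks[K:]]     # X_Q: N*M pairs
    return support, query
```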
Table 2. Performance comparison on 5-way 1-shot and 5-way 5-shot tasks in terms of accuracy (%) with 95% confidence intervals on miniImageNet and CIFAR-FS.

Method | Backbone | miniImageNet 1-shot | miniImageNet 5-shot | CIFAR-FS 1-shot | CIFAR-FS 5-shot
ProtoNet [7] | ResNet-12 | 60.37 ± 0.83 | 78.02 ± 0.57 | 72.20 ± 0.70 | 83.50 ± 0.50
MAML [11] | Conv-4 | 48.70 ± 1.84 | 63.11 ± 0.92 | 58.90 ± 1.90 | 71.50 ± 0.28
Baseline++ [14] | ResNet-12 | 53.97 ± 0.79 | 75.90 ± 0.61 | 67.02 ± 0.90 | 83.58 ± 0.54
MetaOptNet [13] | ResNet-12 | 62.64 ± 0.61 | 78.63 ± 0.46 | 72.0 ± 0.70 | 84.2 ± 0.50
MTL [16] | ResNet-12 | 61.2 ± 1.8 | 75.5 ± 0.8 | — | —
Hyper ProtoNet [26] | ResNet-18 | 59.47 ± 0.20 | 76.84 ± 0.14 | 64.02 ± 0.24 | 82.53 ± 0.14
RFS-simple [15] | ResNet-12 | 62.02 ± 0.63 | 79.64 ± 0.44 | 71.5 ± 0.80 | 86.0 ± 0.50
RFS-distill [15] | ResNet-12 | 64.82 ± 0.60 | 82.14 ± 0.43 | 73.9 ± 0.80 | 86.9 ± 0.50
DSN [27] | ResNet-12 | 62.64 ± 0.66 | 78.83 ± 0.45 | 72.3 ± 0.78 | 85.1 ± 0.50
Meta-Baseline [42] | ResNet-12 | 63.17 ± 0.23 | 79.26 ± 0.17 | — | —
TAS-simple [25] | ResNet-12 | 64.71 ± 0.43 | 82.08 ± 0.45 | 73.47 ± 0.42 | 86.82 ± 0.49
TAS-distill [25] | ResNet-12 | 65.13 ± 0.39 | 82.47 ± 0.52 | 74.02 ± 0.55 | 87.65 ± 0.58
MIAN [34] | ResNet-12 | 64.27 ± 0.35 | 81.24 ± 0.26 | — | —
SENet [43] | ResNet-12 | 61.67 ± 0.65 | 80.04 ± 0.49 | 71.63 ± 0.73 | 86.23 ± 0.48
KT [44] | ResNet-12 | 66.69 ± 1.01 | 81.53 ± 0.73 | 74.9 ± 1.1 | 86.4 ± 0.7
EFTS [45] | ResNet-12 | 63.77 ± 0.85 | 79.82 ± 0.55 | 74.85 ± 0.84 | 87.41 ± 0.59
Ours | ResNet-12 | 67.32 ± 0.67 | 82.76 ± 0.61 | 77.62 ± 0.52 | 88.95 ± 0.87
Table 3. Average accuracy (%) of different components of the proposed method on miniImageNet and CIFAR-FS.

Method | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot
Baseline | 63.31 ± 0.65 | 79.73 ± 0.57 | 73.91 ± 0.75 | 86.66 ± 0.81
+DEA | 65.78 ± 0.57 | 81.34 ± 0.67 | 75.51 ± 0.34 | 87.71 ± 0.38
+Logit standardization self-distillation | 66.34 ± 0.53 | 81.66 ± 0.65 | 76.03 ± 0.52 | 88.04 ± 0.53
Ours (full) | 67.32 ± 0.67 | 82.76 ± 0.61 | 77.62 ± 0.52 | 88.95 ± 0.87
Table 4. Experimental results of different attention methods on miniImageNet.

Method | Backbone | 1-shot (%) | 5-shot (%) | Params | FLOPs | Inference
Baseline | ResNet-12 | 63.31 ± 0.65 | 79.73 ± 0.57 | 12.42 M | 3.51 G | 8.82 FPS
+SE [29] | ResNet-12 | 64.65 ± 0.37 | 80.50 ± 0.46 | 12.70 M | 3.51 G | 8.81 FPS
+CBAM [30] | ResNet-12 | 64.89 ± 0.33 | 80.69 ± 0.54 | 12.46 M | 3.51 G | 8.26 FPS
+CA [31] | ResNet-12 | 64.09 ± 0.65 | 80.63 ± 0.55 | 12.53 M | 3.51 G | 8.42 FPS
+EMA [32] | ResNet-12 | 63.90 ± 0.51 | 80.16 ± 0.63 | 12.51 M | 3.69 G | 7.56 FPS
+DEA (ours) | ResNet-12 | 65.78 ± 0.57 | 81.34 ± 0.67 | 12.42 M | 3.51 G | 8.47 FPS
Table 5. Experimental results of different attention methods on CIFAR-FS.

Method | Backbone | 1-shot (%) | 5-shot (%) | Params | FLOPs | Inference
Baseline | ResNet-12 | 73.91 ± 0.75 | 86.66 ± 0.81 | 12.42 M | 523.11 M | 65.77 FPS
+SE [29] | ResNet-12 | 74.14 ± 0.37 | 86.93 ± 0.65 | 12.70 M | 523.38 M | 63.19 FPS
+CBAM [30] | ResNet-12 | 74.26 ± 0.67 | 87.02 ± 0.30 | 12.46 M | 523.28 M | 51.55 FPS
+CA [31] | ResNet-12 | 75.03 ± 0.36 | 87.06 ± 0.22 | 12.53 M | 523.89 M | 57.01 FPS
+EMA [32] | ResNet-12 | 74.28 ± 0.47 | 87.09 ± 0.65 | 12.51 M | 550.97 M | 48.22 FPS
+DEA (ours) | ResNet-12 | 75.51 ± 0.34 | 87.71 ± 0.38 | 12.42 M | 523.11 M | 57.97 FPS
Table 6. Average accuracy (%) of different dimensions of attention on miniImageNet and CIFAR-FS.

Method | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot
Baseline | 63.31 ± 0.65 | 79.73 ± 0.57 | 73.91 ± 0.75 | 86.66 ± 0.81
Channel dimensional attention | 63.49 ± 0.67 | 79.87 ± 0.81 | 74.37 ± 0.49 | 87.16 ± 0.43
Spatial dimensional attention | 64.09 ± 0.31 | 80.25 ± 0.44 | 74.96 ± 0.72 | 87.39 ± 0.31
DEA (full) | 65.78 ± 0.57 | 81.34 ± 0.67 | 75.51 ± 0.34 | 87.71 ± 0.38
Table 7. Effect of logit standardization on self-distillation, where "—" indicates self-distillation without logit standardization.

Method | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot
— | 66.56 ± 0.51 | 81.72 ± 0.49 | 76.66 ± 0.52 | 88.06 ± 0.34
+Logit standardization | 67.32 ± 0.67 | 82.76 ± 0.61 | 77.62 ± 0.52 | 88.95 ± 0.87
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
