1. Introduction
With the rapid development of neural networks, remarkable progress has been made in image processing, such as classification [
1,
2], object detection [
3], and semantic segmentation [
4]. These improvements rely heavily on training large-scale models with numerous labelled images. However, data annotation is both costly and time-consuming, so the amount of labelled data is often limited. With insufficient labelled data, models tend to overfit or underfit, which degrades performance [5]. To address this challenge, the computer vision community has proposed few-shot learning methods, which mimic human reasoning and quickly acquire new knowledge from only a few examples. Specifically, few-shot learning adopts an episodic training strategy: in every episode, the model is trained on a support set and evaluated on a query set.
Scholars have attempted to apply transfer learning [
6] and meta-learning to achieve few-shot learning. Currently, meta-learning-based few-shot image classification relies primarily on metric learning. The main idea is to compute the distance between the query feature and the support feature using predefined mathematical metrics or learnable classifiers. In other words, the classification results depend on the distance, in the latent space, between the query features and the descriptors [7] or representative points of the support sets. The typical ProtoNet [8] computes a prototype for each class as the mean of its support features and classifies queries by their Euclidean distance to these prototypes, on the assumption that the prototypes capture the inductive bias of the support images. DeepEMD [9] introduces the Earth Mover’s Distance (EMD) to compare feature representations: it divides images into small patches and calculates the optimal matching between them to determine the distance. Additionally, DeepBDC [10] measures similarity through Brownian Distance Covariance, which quantifies the discrepancy between the joint characteristic distributions of input images. Unlike the methods above, which use predefined mathematical metrics, the Relation Network [11] trains a learnable classifier to compare the feature vectors of input images for classification.
Applying few-shot learning to fine-grained image classification effectively overcomes the challenge of limited labelled data and achieves positive results [
12]. However, the challenge in fine-grained image classification extends beyond the issue of limited labelled data. The difference between fine-grained image classification and general image classification can be seen in
Figure 1. General images from different classes exhibit significant differences, so the background has a minimal impact on classification. In fine-grained images, however, the large intra-class variations and high inter-class similarities between subcategories severely hinder classification. As shown in Figure 1a, the images in the first row all belong to California gulls, yet variations in background and posture make them look quite different. In contrast, the images in the same column belong to different gull species but share similar backgrounds and postures, differing only slightly in their wings and beaks. This challenge, unique to fine-grained images, amplifies the difficulty of few-shot classification. Therefore, accurately distinguishing these subtle but critical features plays a significant role in few-shot fine-grained image classification.
Some existing metric-based methods for few-shot learning are directly employed in fine-grained image classification, relying on complex network structures to extract features [
13]. FEAT [
14] generates distinctive and task-specific features through set-to-set functions and embedding adaptation. CTX [
15], which introduces a cross transformer to retrieve features, maps query instances into the support latent space and classifies targets with the aid of self-supervised learning. HelixFormer [
16] leverages the cross-image semantic relationships between the query and support features within a transformer-based structure. CSCAM [
17] introduces a module combining channel attention and spatial attention, and aims to extract discriminative regions through a cross-attention module.
However, this kind of method has a problem. After extracting feature maps with various efficient extractors, it is necessary to convert them into single-vector representations for the metric functions. Converting spatial features into vector representations loses spatial and positional information and can lead to overfitting to posture. Taking global pooling as an example, the commonly used softmax classifier averages the feature map of the input image [8,11], which retains only global information while discarding local details and positions, leading to overfitting to postures and overlooking potentially informative regions. Some attempts address this issue by expanding the receptive field [15], but this induces a new problem: the model overfits to irrelevant information such as the background. Therefore, FRN [18] introduces a different approach that reconstructs the query feature from the support features and then compares the reconstructed feature with the query feature for classification. Support images can reconstruct query images of the same class with low error because they share similar feature mappings, whereas reconstructing query images of different classes produces substantial errors due to inter-class variations. This is why the category similarity between images can be measured by the reconstruction error. Compared with metric learning, this method preserves spatial details and avoids overfitting to posture, thus decreasing the influence of intra-class variations. However, while the support–query feature reconstruction encourages the model to learn distinct differences between classes and thus improves inter-class separability, it does little to reduce the large variations within the same class.
Consequently, we propose a channel-wise attention-enhanced feature mutual reconstruction approach for few-shot fine-grained image classification. We treat feature reconstruction as a ridge regression problem and achieve the best reconstruction using the least squares method. Besides the support–query feature reconstruction, we additionally adopt a reverse query–support reconstruction strategy, which aims to reduce the differences between same-class images. This strategy compresses the intra-class differences, encouraging the model to learn more consistent and compact representations for similar instances. The support–query feature reconstruction improves the separability between classes, while the reverse query–support reconstruction focuses on reducing discrepancies within the same class.
This seemingly simple method encourages the model to focus not only on the significant differences between categories (through the support–query feature reconstruction), but also on reducing the gap within the same category (through the query–support feature reconstruction). This mutual learning mechanism enables our model to perform more robustly in fine-grained image classification tasks, especially when the training samples are scarce.
Our channel-wise feature mutual reconstruction contains four modules: (1) a feature extractor, (2) a channel-wise attention module, (3) a feature mutual reconstruction module, and (4) a feature similarity calculation module. In order to weaken the semantic difference caused by background and posture, we propose a channel-wise attention module. This module highlights the key parts of the targets and ensures that the features accurately represent the category information.
In summary, our contributions can be listed as follows:
We propose a channel-wise attention mechanism. This approach uses channel-wise self-attention to obtain object-specific channel weights, which help the features suppress background noise and focus on the salient parts of the target. By weakening the similarities introduced by shared backgrounds, the module reduces the inter-class similarities and the classification errors they cause.
We introduce a feature mutual reconstruction module, which reconstructs features using the channel-wise enhanced features. This mutual reconstruction yields smaller intra-class variations and larger inter-class differences. Ablation experiments show that mutual reconstruction promotes a stronger interaction between the support and query sets, maximizing their contributions to the classification task.
To prove the validity of our approach, we conduct several experiments on classic fine-grained image datasets, including CUB-200-2011 [
19], Stanford Cars [
20], Stanford Dogs [
21], and Aircraft [
22], and compare our results with those of other advanced methods.
The structure of this paper can be summarized as follows:
Section 2 describes the materials and methods proposed in this paper, detailing the channel-wise attention module, which is complementary to the feature mutual reconstruction.
Section 3 presents the experimental results, comparing models across few-shot fine-grained datasets, while examining the impact of each branch on performance. Finally,
Section 4 concludes the paper, discussing the results, limitations, and future directions for few-shot fine-grained image classification.
2. Materials and Methods
The overall architecture of our approach is illustrated in
Figure 2. A feature extractor first computes the feature maps of both the support and query instances in every episode. We then employ a channel-wise attention module (CAM) to generate attention weights that emphasize the most informative regions of the objects. This attention mechanism redistributes weights toward object-relevant channels, effectively enhancing the feature maps for subsequent processing. After that, we apply a feature mutual reconstruction module (FMRM) to reconstruct both the support and query features, leveraging the mutual relationships between the enhanced features. The classification results are determined by the similarity between the reconstructed features and the channel-wise enhanced features.
2.1. Problem Formulation
In a standard few-shot classification task, we divide the dataset into three parts, namely the training set $\mathcal{D}_{\mathrm{train}}$, the test set $\mathcal{D}_{\mathrm{test}}$, and the validation set $\mathcal{D}_{\mathrm{val}}$, as in other conventional model training processes. During training, the model improves its performance on C-way K-shot classification tasks. In every training episode, the model is provided with a support set $\mathcal{S}$ (meta-training set) and a query set $\mathcal{Q}$ (meta-test set), both drawn from the training set $\mathcal{D}_{\mathrm{train}}$. Specifically, in every episode, $C$ classes are randomly selected from $\mathcal{D}_{\mathrm{train}}$; for each of these classes, $K$ labelled images are provided as the support set $\mathcal{S}$ and $M$ unlabelled images are provided as the query set $\mathcal{Q}$. Images in the support set and query set belong to the same classes but do not overlap, so the total number of samples in each episode is $C \times (K + M)$. This setup ensures that the model is trained to recognize new classes from a limited number of samples.
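To make the episodic setup concrete, the following minimal Python sketch shows one way to sample a C-way K-shot episode with M query images per class; the `images_by_class` dictionary layout and the function name are illustrative assumptions rather than part of our pipeline.

```python
import random

def sample_episode(images_by_class, C=5, K=1, M=15, seed=None):
    """Sample one C-way K-shot episode with M query images per class.

    images_by_class: dict mapping class label -> list of images (paths or tensors).
    Returns (support, query), each a list of (image, episode_label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(images_by_class), C)      # C random classes
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = rng.sample(images_by_class[cls], K + M)   # disjoint support/query images
        support += [(img, episode_label) for img in picks[:K]]
        query   += [(img, episode_label) for img in picks[K:]]
    # total samples per episode: C * (K + M)
    return support, query
```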
2.2. Channel-Wise Attention Module (CAM)
After the feature extractor $f_{\theta}$, we obtain the feature representations of the support and query images:
$$ F_{s}^{c} = f_{\theta}\big(x_{s}^{c}\big), \qquad F_{q}^{i} = f_{\theta}\big(x_{q}^{i}\big), $$
where $x_{s}^{c}$ represents the images of the $c$th class of the support set and $x_{q}^{i}$ is the $i$th instance of the query set. Each feature map $F \in \mathbb{R}^{D \times H \times W}$, where $D$, $H$, and $W$ denote the number of channels, the height, and the width, respectively.
Previous work such as SENet [23] has shown that channel attention weights are beneficial for image classification. They reassign weights across different channels, enabling the model to focus on the distinctive areas of the input features. In fine-grained image classification, this helps to reduce the impact of the background and highlight the fine-grained objects. The channel-wise attention module we propose is shown in
Figure 3. The core idea is to calculate the correlation along the channel dimension of the input features, allowing the model to focus on distinctive regions of the input features. Specifically, we compute the correlation between feature channels and aggregate these correlations.
The input feature of the channel-wise attention module is denoted as $F$. We normalize $F$ along the channel dimension and obtain $Q$, $K$, and $V$ through $1 \times 1$ convolution kernels $W_{Q}$, $W_{K}$, and $W_{V}$:
$$ Q = W_{Q} * \mathrm{Norm}(F), \qquad K = W_{K} * \mathrm{Norm}(F), \qquad V = W_{V} * \mathrm{Norm}(F), $$
where $Q, K, V \in \mathbb{R}^{D \times N}$ with $N = H \times W$. The structure of channel-wise self-attention (CSA) is quite similar to the Multi-Head Self-Attention (MHSA) introduced by ViT [24]. However, there are still some differences. In MHSA, attention is computed between spatial tokens, so the attention map is of size $N \times N$. In our proposed CSA, self-attention operates along the channel dimension, so our attention map is of size $D \times D$.
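For illustration, a minimal PyTorch sketch of channel-wise self-attention is given below. The GroupNorm-based channel normalization, the scaling factor, the residual-free design, and the sigmoid-pooling aggregation into per-channel weights are assumptions made for this example and may differ from our exact implementation.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Channel-wise self-attention (CSA) producing per-channel weights.
    Attention is computed between channels (D x D map) rather than
    between spatial tokens (HW x HW) as in standard MHSA."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)            # channel normalization (assumption)
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)

    def forward(self, x):                           # x: [B, D, H, W]
        b, d, h, w = x.shape
        q, k, v = self.to_qkv(self.norm(x)).chunk(3, dim=1)
        q, k, v = (t.flatten(2) for t in (q, k, v))                            # [B, D, HW]
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)   # [B, D, D]
        out = attn @ v                                                         # [B, D, HW]
        # Aggregate the attended channels into per-channel weights w in R^D
        # (this aggregation is an assumption; the paper's exact choice may differ).
        weights = torch.sigmoid(out.mean(dim=-1))                              # [B, D]
        return weights
```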
2.3. Feature Mutual Reconstruction Module (FMRM)
The reconstruction process is shown in
Figure 4. It includes two branches: the query feature reconstruction (QFR) branch and the support feature reconstruction (SFR) branch. QFR is designed to enlarge the inter-class differences, while SFR is designed to minimize the intra-class variation. Before the reconstruction, we enhance the support feature $F_{s}^{c}$ and the query feature $F_{q}^{i}$ using the support channel-wise weights $w_{s}^{c} \in \mathbb{R}^{D}$:
$$ \hat{F}_{s,(d)}^{c} = w_{s,(d)}^{c} \cdot F_{s,(d)}^{c}, \qquad \hat{F}_{q,(d)}^{i,c} = w_{s,(d)}^{c} \cdot F_{q,(d)}^{i}, \qquad d = 1, \dots, D, $$
where $w_{s,(d)}^{c}$ represents the scalar value at the $d$th dimension of $w_{s}^{c}$, $F_{s,(d)}^{c}$ is the $d$th channel of the support feature $F_{s}^{c}$, and $F_{q,(d)}^{i}$ is the $d$th channel of the query feature $F_{q}^{i}$.
Feature reconstruction aims to find a matrix $W$ that satisfies $W \hat{S}_{c} \approx \hat{Q}_{i}$, in which the enhanced query feature $\hat{F}_{q}^{i,c}$ is reshaped to $\hat{Q}_{i} \in \mathbb{R}^{HW \times D}$ and the $K$ enhanced support features of the $c$th class are reshaped and stacked into $\hat{S}_{c} \in \mathbb{R}^{KHW \times D}$, which represents the feature pool of the $c$th class. The optimal reconstruction weights are given by the ridge regression
$$ \bar{W} = \mathop{\arg\min}_{W} \; \big\| \hat{Q}_{i} - W \hat{S}_{c} \big\|_{F}^{2} + \lambda \, \| W \|_{F}^{2} . $$
Solving this formulation with the least squares method, we can find that
$$ \bar{W} = \hat{Q}_{i} \hat{S}_{c}^{\top} \big( \hat{S}_{c} \hat{S}_{c}^{\top} + \lambda I \big)^{-1}, $$
where $\| \cdot \|_{F}$ denotes the Frobenius norm and $\lambda$ is the ridge regression penalty, which is designed to keep the optimization tractable. The reconstruction can then be calculated as follows:
$$ \tilde{Q}_{i}^{c} = \bar{W} \hat{S}_{c}, $$
where $\tilde{Q}_{i}^{c}$ is the new query feature reconstructed from the support features and $\lambda$ is set following [18]. After the query feature reconstruction branch, we obtain the reconstructed query feature $\tilde{Q}_{i}^{c}$ and an enhanced query feature $\hat{Q}_{i}^{c}$ that focuses on the support classes. Similarly, we enhance the support feature and the query feature with the query channel-wise weights $w_{q}^{i} \in \mathbb{R}^{D}$:
$$ \check{F}_{s,(d)}^{c,i} = w_{q,(d)}^{i} \cdot F_{s,(d)}^{c}, \qquad \check{F}_{q,(d)}^{i} = w_{q,(d)}^{i} \cdot F_{q,(d)}^{i}, \qquad d = 1, \dots, D. $$
We then reconstruct a new support feature from the enhanced query feature pool $\check{Q}_{i} \in \mathbb{R}^{HW \times D}$:
$$ \tilde{S}_{c}^{i} = \check{S}_{c}^{i} \check{Q}_{i}^{\top} \big( \check{Q}_{i} \check{Q}_{i}^{\top} + \lambda' I \big)^{-1} \check{Q}_{i}, $$
where $\check{S}_{c}^{i} \in \mathbb{R}^{KHW \times D}$ is the query-weight-enhanced support feature pool and $\lambda'$ is the corresponding ridge regression penalty, set in the same way as $\lambda$.
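The closed-form ridge regression above maps directly onto batched linear algebra. The sketch below, written under assumed feature shapes and an illustrative fixed penalty `lam`, reconstructs a target feature map from a feature pool; calling it with the support pool and the query map gives the QFR direction, and swapping the arguments gives the SFR direction.

```python
import torch

def ridge_reconstruct(source, target, lam=0.1):
    """Reconstruct `target` from `source` via closed-form ridge regression.

    source: [N_s, D] feature pool used for the reconstruction
            (e.g. K*H*W support vectors of one class in the QFR branch)
    target: [N_t, D] features to be reconstructed
            (e.g. H*W query vectors in the QFR branch)
    Returns the reconstruction of `target`, shape [N_t, D].
    """
    # W_bar = target @ source^T @ (source @ source^T + lam * I)^{-1}
    gram = source @ source.t()                               # [N_s, N_s]
    gram = gram + lam * torch.eye(gram.size(0), device=gram.device)
    w_bar = target @ source.t() @ torch.linalg.inv(gram)     # [N_t, N_s]
    return w_bar @ source                                     # [N_t, D]

# Illustrative sizes; in practice the inputs are the channel-wise enhanced features.
D, H, W, K = 64, 5, 5, 5
support_pool = torch.randn(K * H * W, D)     # enhanced support features of class c
query_feat   = torch.randn(H * W, D)         # enhanced query feature map
q_recon = ridge_reconstruct(support_pool, query_feat)   # QFR: support reconstructs query
s_recon = ridge_reconstruct(query_feat, support_pool)   # SFR: query reconstructs support
```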
2.4. Classifier and Loss
We obtain the reconstructed features $\tilde{Q}_{i}^{c}$ and $\tilde{S}_{c}^{i}$ and the enhanced features $\hat{Q}_{i}^{c}$ and $\check{S}_{c}^{i}$ after the FMRM. We calculate the similarity between the enhanced query feature $\hat{Q}_{i}^{c}$ and the reconstructed query feature $\tilde{Q}_{i}^{c}$ as
$$ d_{\mathrm{QFR}}(i, c) = \big\| \hat{Q}_{i}^{c} - \tilde{Q}_{i}^{c} \big\|_{F}^{2}, $$
and compute the similarity between the enhanced support feature $\check{S}_{c}^{i}$ and the reconstructed support feature $\tilde{S}_{c}^{i}$ as
$$ d_{\mathrm{SFR}}(i, c) = \big\| \check{S}_{c}^{i} - \tilde{S}_{c}^{i} \big\|_{F}^{2}. $$
These two terms measure the reconstruction similarities of QFR and SFR, respectively. To measure the total reconstruction similarity of the FMRM, we calculate the total distance via a weighted summation of the two distances. Thus, the total distance between the $i$th query instance and the $c$th class is
$$ d(i, c) = w_{1} \, d_{\mathrm{QFR}}(i, c) + w_{2} \, d_{\mathrm{SFR}}(i, c). $$
Inspired by [15,18], we set $w_{1}$, $w_{2}$, and $\gamma$ as three learnable parameters, whose initial values are 1.00. $w_{1}$ and $w_{2}$ are designed to dynamically adjust the importance of each branch, while $\gamma$ is introduced to control the peakiness of the distribution derived from the total distance. The probability that the $i$th query instance belongs to the $c$th class is given by
$$ p\big(y_{i} = c \mid x_{q}^{i}\big) = \frac{\exp\big(-\gamma \, d(i, c)\big)}{\sum_{c'=1}^{C} \exp\big(-\gamma \, d(i, c')\big)}. $$
During training, we employ a cross-entropy function to calculate the classification loss:
$$ L_{\mathrm{ce}} = - \frac{1}{CM} \sum_{i=1}^{CM} \sum_{c=1}^{C} \mathbb{1}\!\left[ y_{i} = c \right] \log p\big(y_{i} = c \mid x_{q}^{i}\big). $$
To improve the quality of the reconstructed features, we additionally introduce a reconstruction loss $L_{\mathrm{rec}}$, computed from the row-normalized support and query features, where $q$ is the number of query images. This loss encourages orthogonality between features, so it enlarges the differences between them and helps the module reduce the similarity caused by the background. The total training loss is
$$ L = L_{\mathrm{ce}} + \alpha L_{\mathrm{rec}}. $$
Following [25], we set $\alpha$ as 0.03.
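As a rough illustration of the classifier described above, the sketch below combines the two reconstruction distances with the learnable branch weights and temperature, applies the softmax, and adds the 0.03-weighted auxiliary term. The tensor shapes are assumptions, and the reconstruction loss is passed in as a precomputed value since its exact form depends on the normalized features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionClassifier(nn.Module):
    """Weighted sum of the QFR and SFR distances, softmax classification,
    and the total training loss (cross-entropy + 0.03 * reconstruction loss)."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.tensor(1.0))     # QFR branch weight
        self.w2 = nn.Parameter(torch.tensor(1.0))     # SFR branch weight
        self.gamma = nn.Parameter(torch.tensor(1.0))  # peakiness of the softmax

    def forward(self, d_qfr, d_sfr, labels, rec_loss=None):
        # d_qfr, d_sfr: [num_query, C] squared reconstruction errors per class
        dist = self.w1 * d_qfr + self.w2 * d_sfr      # total distance
        logits = -self.gamma * dist                   # smaller distance -> higher score
        loss = F.cross_entropy(logits, labels)
        if rec_loss is not None:                      # auxiliary reconstruction loss
            loss = loss + 0.03 * rec_loss
        return logits.softmax(dim=-1), loss
```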
4. Conclusions
In this paper, we introduced a channel-wise attention-enhanced feature mutual reconstruction mechanism, a reconstruction-based method for few-shot fine-grained image classification designed to alleviate the large intra-class differences and high inter-class similarities. We utilized a channel-wise attention module (CAM) to reassign the channel weights of the support and query features, enabling the model to focus on the distinguishing parts of the targets. We then reconstructed the support and query features from these attention-enhanced features: support features were reconstructed from the query-weight-reassigned feature maps to minimize intra-class variation, while query features were reconstructed from the support-weight-reassigned feature maps to maximize inter-class variation. The classification results were obtained from the similarity between the reconstructed features and the attention-enhanced features.
The results on four widely used fine-grained benchmarks indicate that our method is superior to previous methods and support the robustness of our model. Additionally, the ablation studies confirm that the CAM and FMRM play essential and complementary roles in enhancing overall performance and reducing classification errors. Each reconstruction branch affects the model differently, and both are indispensable. From the visualizations of our model, we can conclude that SFR reduces the differences within the same class, while QFR helps the model learn the differences between classes.
Despite the positive results achieved by our model, it suffers from a few limitations. As shown in
Figure 7, although SFR has demonstrated its effectiveness, the influence of pose variations still negatively impacts its performance. Furthermore, the model’s performance is highly dependent on computational resources during training. Addressing these two limitations will be a focus of our future work.