1. Introduction
Fine-grained Visual Categorization (FGVC) presents significant challenges due to subtle inter-class differences and large intra-class variations [1]. Across different categories, objects often share similarities in shape, color, and texture [2], making classification difficult. At the same time, variations within a single category arise due to factors such as pose, viewpoint, and background differences, further complicating the recognition process. For example, in bird species classification, distinguishing between two visually similar species may rely on minute differences in their feather patterns or beak structure. Additionally, variations in viewing angles can result in vastly different visual representations of the same bird. This discrepancy occurs because an image is a two-dimensional projection of a three-dimensional object, where changes in perspective naturally yield distinct patterns and textures.
Due to the high cost of manual labeling [3], weakly supervised learning has become a widely adopted approach for FGVC. Deep learning-based methods [4] have significantly advanced this field, and among them, high-order feature pooling techniques have gained particular prominence. These methods apply higher-order transformations to convolutional neural network (CNN) features before classification, with notable examples including bilinear models [5,6,7,8,9,10,11,12,13,14] and kernel pooling [15,16]. High-order feature fusion enhances contextual semantic information by capturing complex feature interactions, leading to notable accuracy improvements across different FGVC tasks [17,18].
A key challenge in high-order feature fusion is the presence of confounding factors in images. When features are extracted and merged without considering these confounders, the model may incorporate misleading information, leading to incorrect classifications. Objects in FGVC tasks often appear in distinct environments that may influence classification decisions. As illustrated in Figure 1, Great Crested Flycatchers are commonly seen in forested areas (green backgrounds), while Olive-Sided Flycatchers are typically observed against a sky backdrop (blue). Consequently, a model may misclassify a bird solely based on its background rather than its intrinsic attributes.
Given the limitations of conventional background noise removal techniques, distinguishing meaningful background cues from misleading ones is crucial. Unlike standard denoising methods that indiscriminately suppress background information, a more effective approach should be able to selectively preserve useful features while mitigating the impact of confounding factors.
To address this, causal intervention is introduced to improve fine-grained recognition by disentangling relevant and irrelevant information. In recent years, causal models have been successfully applied in computer vision [19,20,21,22,23], particularly through causal representation learning [24,25]. Unlike traditional machine learning models that assume independent and identically distributed (i.i.d.) data, causal representation learning leverages stable causal mechanisms across datasets, making it robust to challenges such as limited samples, imbalanced data, and biased observations.
Inspired by the work of Rao et al. [26] on counterfactual attention learning for fine-grained image classification, this study extends causal reasoning into a comprehensive high-order feature learning framework. To better utilize high-order features for fine-grained categorization, Interventional High-Order Feature Fusion (IHFF) is proposed, integrating causal reasoning with high-order feature learning. The approach is built upon structural causal models (SCMs) [24,25], aiming to reconstruct latent causal relationships that capture meaningful semantic interactions.
Previous methods relied on high-order feature fusion to extract fine-grained details. However, when encountering highly similar objects across different classes, conventional feature fusion is often insufficient for accurate classification. To address this limitation, high-order features are decomposed into two components:
(1) Intra-object relationships—capturing structural dependencies within the same object.
(2) Inter-object relationships—modeling semantic differences among distinct objects within the same category.
By fusing these two levels of semantic relationships, a more discriminative and context-aware representation of fine-grained details is obtained. Furthermore, analysis of the structural causal model for FGVC enables the identification of causal links through which confounders affect classification performance. Causal interventions are then applied to sever these links, reducing confounding effects and increasing the proportion of effective features in high-order representations. Causal intervention enables the selective extraction of high-order features while preserving essential patterns without overemphasizing misleading information. The contributions of this paper are summarized below:
(1) Constructing a general SCM for FGVC and introducing high-order feature fusion via tensor product spaces to describe the relationships between different semantic information spaces.
(2) Analyzing the reconstructed SCM to identify causal links through which confounders influence classification outcomes and applying causal interventions to mitigate their impact.
(3) Proposing IHFF, which simultaneously applies high-order feature fusion and causal intervention.
(4) Conducting comprehensive experiments on three widely used fine-grained public datasets (CUB-200-2011, FGVC-Aircraft, and Stanford Cars) to demonstrate the effectiveness of the proposed method.
The subsequent sections of this paper are structured as follows: Section 2 reviews the relevant literature. Section 3 introduces the structural causal model and its application to FGVC. Section 4 details the proposed methodology. Section 5 presents the experimental setup, results, and analysis. Finally, Section 6 concludes the study.
3. Structural Causal Model for FGVC
This section outlines the construction of a causal graph for fine-grained image classification, illustrated in Figure 2 [24,25]. It begins by analyzing causal assumptions and identifying confounding factors within a baseline model. It then describes how causal interventions can be implemented by effectively severing causal links. This approach clarifies the mechanisms underpinning the baseline model and pinpoints strategic interventions to mitigate biases and enhance accuracy.
3.1. General Structural Causal Model for FGVC
S→X: In this causal link, X represents the feature representation, while S refers to the input image, which inherently contains both natural semantic information and confounding factors. For instance, consider a dataset S and its corresponding feature extraction network: this causal link implies that the feature map X is derived from the input image S through the transformation applied by that network.
S→P←X: P represents a transformed version of X, with its foundation stemming from S. This link consists of two key components:
S→P: The space P is spanned by the basis of the feature space and serves as the projection of the input image S onto this feature space. This projection is typically realized through linear transformations in neural networks. Consequently, P encapsulates not only essential semantic information but also features influenced by confounding factors.
X→P: Feature map X undergoes linear or nonlinear transformations, resulting in the formation of the feature space P, which subsequently feeds into the fully connected layers of the model.
This paper differentiates the foundation of the feature space P into two distinct levels: the classifier level and the class label level. To mitigate the influence of confounding factors, causal interventions are applied through backdoor adjustments at each of these levels separately.
X→Y←P: Let Y represent the classification outcome (e.g., logits), which is influenced by the feature map X through two distinct pathways: (1) a direct projection from X to Y, and (2) an intermediary projection via P. Typically, the direct path can be disregarded if X is fully encapsulated by P, especially when taking into account the dimensionality reduction of features. For example, in a conventional neural network architecture composed of convolutional and pooling layers, the feature map X can be completely described in a linear manner using the basis of the feature space. Consequently, the structural causal model simplifies to S→P→Y. However, in FGVC, this simplification of the causal links overlooks the role of contextual factors, which are critical in high-order feature fusion, as these high-order features cannot be linearly represented by their bases. Regarding the pathway through the intermediary P, this mechanism naturally arises because the variables forming the basis of the final classification function are derived from the basis of P, suggesting that the classification function can always be expressed as an implicit function of the feature space P.
3.2. Reconstructed Structural Causal Model for FGVC
In FGVC, capturing higher-order semantic information is crucial. Advanced feature extractors that can handle high-order features have shown superior effectiveness in achieving precise classification. Their success is largely due to their ability to discern subtle differences between highly similar sub-categories. Additionally, these extractors are pivotal in identifying invariant features across different poses, scales, and rotations. Traditional causal interventions typically focus on an SCM with low-dimensional features, which are inadequate for FGVC. Therefore, we propose reconstructing the causal link to amplify the impact of higher-order semantic information through the following model:
X→H←P: H represents the pairwise features derived from two distinct sets of features extracted by networks, such as outputs from different convolutional neural networks (CNNs). This link helps reconstruct the relationships between high-order features (pairwise features) and their representation in lower dimensions. For example, let P denote a linear combination of k + m base vectors, along with a residual noise component; this formulation enhances our understanding and manipulation of complex feature interactions within FGVC:

$P = \sum_{i=1}^{k} \alpha_i v_i + \sum_{j=1}^{m} \beta_j w_j + e,$

where $e$ is the residual noise, and $\{v_i\}_{i=1}^{k}$ and $\{w_j\}_{j=1}^{m}$ are from two different feature extractors. The relationship between them can be described by the tensor product:
Definition 1. Tensor product of multilinear functions.
Given a k-linear function $f$, an m-linear function $g$, and the set of all multilinear functions $L$ (with $f, g \in L$), define the tensor product of the two as the (k + m)-linear function $f \otimes g$; it satisfies the following:

$(f \otimes g)(v_1, \dots, v_k, w_1, \dots, w_m) = f(v_1, \dots, v_k)\, g(w_1, \dots, w_m).$
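As a quick illustration of Definition 1 (not taken from the paper), the following sketch numerically checks the defining identity for a 2-linear function f and a 1-linear function g; the coefficient tensors A and b are arbitrary stand-ins.

```python
import torch

# Numerical check of (f ⊗ g)(v1, v2, w) = f(v1, v2) * g(w) for a 2-linear f
# and a 1-linear g, both represented by their coefficient tensors.
d = 3
A = torch.randn(d, d)        # coefficients of the 2-linear function f
b = torch.randn(d)           # coefficients of the 1-linear function g
v1, v2, w = torch.randn(d), torch.randn(d), torch.randn(d)

f = lambda x, y: x @ A @ y                        # f(v1, v2)
g = lambda z: b @ z                               # g(w)
T = torch.einsum("ij,k->ijk", A, b)               # coefficient tensor of f ⊗ g
fg = torch.einsum("ijk,i,j,k->", T, v1, v2, w)    # (f ⊗ g)(v1, v2, w)

assert torch.allclose(fg, f(v1, v2) * g(w), atol=1e-5)
```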
It is evident that the basis of H is derived from the basis of P, enabling the causal relationship of H to inherit the causal structure of P. In other words, three causal links can be identified: X→H, P→H, and H→Y. Let the last convolutional layers of two feature extractors be represented by two vector spaces, V and W. Thus, P is the direct sum of V and W, $P = V \oplus W$, while H is the tensor product space of V and W (as discussed in Section 4.1), $H = V \otimes W$. There are fundamental differences between H and P, since $\dim(V \otimes W) = \dim V \cdot \dim W$ whereas $\dim(V \oplus W) = \dim V + \dim W$. Therefore, because H is an increased-dimensionality representation of the feature space P (as proven in Appendix A), P can be fully represented by H. Consequently, similar to the previous analysis, the connection P→H can be omitted. Thus, only the causal link X→H→Y needs to be considered. This simplification allows the causal interventions to focus on the backdoor adjustment of the S→X path.
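The following toy check (illustrative only, not from the paper) makes the dimension comparison above concrete for small vector spaces V and W.

```python
import torch

# dim(V ⊕ W) = dim V + dim W, while dim(V ⊗ W) = dim V * dim W,
# so H = V ⊗ W lives in a much larger space than P = V ⊕ W.
dim_v, dim_w = 4, 3
v, w = torch.randn(dim_v), torch.randn(dim_w)

direct_sum = torch.cat([v, w])       # element of P = V ⊕ W
tensor_prod = torch.outer(v, w)      # (rank-1) element of H = V ⊗ W

print(direct_sum.numel())   # 7  = 4 + 3
print(tensor_prod.numel())  # 12 = 4 * 3
```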
4. Method
The structural causal model for FGVC in Section 3 consists of a single causal chain from S to Y: S→X→H→Y. To mitigate the influence of confounding factors on the classification outcome, backdoor adjustments [46] are applied.
This section constructs mathematical models for causal interventions in FGVC, consisting of two parts. As shown in Figure 3, first, the high-order feature space H is defined. Second, backdoor adjustments are applied by severing the causal link through which confounding factors influence Y, specifically the path S→X.
4.1. High-Order Feature Space
Given the last convolutional layer V of a network, V is manually divided into k parts: $V = V_1 \oplus V_2 \oplus \dots \oplus V_k$. Similarly, the last convolutional layer W of another network is partitioned into m parts: $W = W_1 \oplus W_2 \oplus \dots \oplus W_m$. Each part is treated as a group representing a subset of semantic information. Furthermore, each part is subdivided into smaller vector spaces, with each subspace of $V_i$ corresponding positionally to a subspace of $W_i$. It is essential that k and m are equal. The classification effects are then represented by two multilinear functions:

$Y = f(V_1, V_2, \dots, V_k), \qquad Y = g(W_1, W_2, \dots, W_m),$

where Y denotes the classification labels.
In fact, f and g can be treated as linear functions on $V_1 \otimes \dots \otimes V_k$ and $W_1 \otimes \dots \otimes W_m$, respectively. Thus, Equation (4) is obtained:

$f(v_1 \otimes \dots \otimes v_k) = f(v_1, \dots, v_k), \qquad g(w_1 \otimes \dots \otimes w_m) = g(w_1, \dots, w_m).$

Assume that V has a basis $\{v_1, \dots, v_s\}$ and W has a basis $\{w_1, \dots, w_t\}$. Thus, the tensor product space $V \otimes W$ has the basis $\{v_i \otimes w_j\}$. Moreover, based on the tensor product of functions defined in Equation (2), each element $f \otimes g$ belongs to the space of multilinear functions L; thus the span of $\{v_i \otimes w_j\}$ is a subspace of L, and $\{v_i \otimes w_j\}$ is the basis of $V \otimes W$. It follows that $\dim(V \otimes W) = \dim V \cdot \dim W$.
The derivation above demonstrates that the constructed bilinear mapping satisfies a universal property. This implies that, up to isomorphism, the bilinear mapping from V and W to L is unique. Hence, the tensor product is rigorously proven to be unique as well.
4.2. Feature Extraction with Contextual Semantic Information
To extract high-order features with contextual semantic information, it is essential to learn pairwise features from a single image while considering the inter-relationships within it. Suppose V is the last convolutional layer of an image: the inter-relationship feature can be extracted by the tensor product $V \otimes V$. Typically, high-order features are extracted using two different networks on a single image; however, this does not render the tensor product of a feature space with itself meaningless. In fact, experiments on the bilinear CNN (B-CNN) have indicated that utilizing the same two neural networks to extract features can also enhance classification accuracy. The discussion in Appendix A.2 offers a more rigorous mathematical explanation for these experimental outcomes: taking the tensor product of a feature space with itself essentially results in dimensionality expansion.
Next, consider another image that shares the same label as the first, along with its corresponding final convolutional layer representation W. Pairwise relationships between the two images are captured using the tensor product $V \otimes W$. With both the inter-relationship ($V \otimes V$) and the outer-relationship ($V \otimes W$) established, feature fusion is then applied to integrate them. In tensor product operations, the sequence in which inter- or outer-relational features are extracted does not affect the final outcome of the fusion; this property is consistent with real-world scenarios. A formal proof of this property is presented below.
An element f within $(Z \otimes V) \otimes W$ can be unfolded as follows:

$f = \sum_{i,j,l} a_{ijl}\, (z_i \otimes v_j) \otimes w_l.$

Similarly, a function g within $Z \otimes (V \otimes W)$ can be unfolded as follows:

$g = \sum_{i,j,l} b_{ijl}\, z_i \otimes (v_j \otimes w_l),$

where $\{z_i\}$ is a basis of Z, and $\{v_j\}$ and $\{w_l\}$ are bases of V and W. Similarly, $\{z_i \otimes v_j \otimes w_l\}$ is also a basis of $Z \otimes V \otimes W$. Thus, Equation (7) is obtained:

$(Z \otimes V) \otimes W \cong Z \otimes (V \otimes W).$

Then, the high-order feature space H is revisited; it is composed of the inter-relationship $V \otimes V$ and the outer-relationship $V \otimes W$:

$H = (V \otimes V) \otimes W \cong V \otimes (V \otimes W).$

From Equation (8), it is clear that the calculation order does not alter the result: whether the inter-relationship $V \otimes V$ is computed first or the outer-relationship $V \otimes W$ is computed first, both ultimately yield Equation (8). This equation confirms, from another perspective, the uniqueness of the constructed higher-order feature space H.
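A minimal numerical sketch of this order-independence, using the Kronecker product as the concrete tensor product on pooled feature vectors (the vector sizes are arbitrary and only for illustration):

```python
import torch

# Inter-/outer-relationship fusion with the Kronecker product: the order in
# which the two tensor products are taken does not change the fused feature.
v = torch.randn(4)   # pooled feature of one image (stands in for V)
w = torch.randn(4)   # pooled feature of a second image with the same label (W)

inter = torch.kron(v, v)          # inter-relationship  V ⊗ V
fused_1 = torch.kron(inter, w)    # (V ⊗ V) ⊗ W
outer = torch.kron(v, w)          # outer-relationship  V ⊗ W
fused_2 = torch.kron(v, outer)    # V ⊗ (V ⊗ W)

assert torch.allclose(fused_1, fused_2, atol=1e-5)
print(fused_1.shape)   # torch.Size([64]) = 4 * 4 * 4
```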
4.3. Causal Intervention with Rebuilt Causal Link
Let Y be the classification effect, X be the input feature, and z be the semantic information set containing confounding factors. Then the probability output formula in the general network is as follows:

$P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X).$

Causal intervention is essentially an adaptive weighted probability involving the traversal of all objects in the set Z and the calculation of the conditional probability after intervention. Normally, in an SCM with only one collider (in an SCM, the junction S→P←X is called a "collider", making S and X independent even though S and X are linked via P), the intervention is as follows:

$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z).$
Furthermore, by taking into account the internal relationships of a single image, Equation (10) is transformed into the following:

$P(Y \mid do(X)) = \sum_{z} P(Y \mid x_1, x_2, z)\, P(z),$

where $x_1$ and $x_2$ are two events (features) that occur within the same scenario Z.
Assume that $x_1$ and $x_2$ are causally influential features co-determining the label Y. For example, let Y = 008.Rhinoceros_Auklet, let $x_1$ represent double white stripes on the eyes, and let $x_2$ represent a red bird beak. Then, under Equation (11), $P(Y \mid x_1, x_2, z) = P(Y \mid x_1, x_2)$, hence we have

$P(Y \mid do(X)) = \sum_{z} P(Y \mid x_1, x_2)\, P(z) = P(Y \mid x_1, x_2).$

Obviously, this probability approaches 1 in theoretical computation, indicating that applying backdoor adjustment after high-order feature fusion is effective: Equation (10) can correctly adjust the probability.
Similarly to the previous derivation, to perform backdoor adjustment with pairwise features from two images, the features of the two images can be averaged, as they cannot appear simultaneously:

$P(Y \mid do(X)) = \frac{1}{2}\sum_{z} \big[ P(Y \mid X_a, z) + P(Y \mid X_b, z) \big]\, P(z),$

where $X_a$ and $X_b$ are different images with the same category label.
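The following toy sketch (with made-up probabilities, not the paper's numbers) contrasts the observational prediction of Equation (9) with the backdoor-adjusted prediction of Equation (10) for a single binary confounder such as the background.

```python
import torch

# Backdoor adjustment: weight each confounder stratum z by its prior P(z)
# instead of the image-dependent P(z | X).
p_y_given_x_z = torch.tensor([[0.9, 0.4],   # P(Y = y | X, z), columns: z in {forest, sky}
                              [0.1, 0.6]])  # rows: two candidate labels y
p_z_given_x = torch.tensor([0.8, 0.2])      # P(z | X): this bird is mostly photographed in forests
p_z = torch.tensor([0.5, 0.5])              # P(z): prior over backgrounds

p_y_obs = p_y_given_x_z @ p_z_given_x       # Eq. (9): observational, background-biased
p_y_do = p_y_given_x_z @ p_z                # Eq. (10): interventional, P(Y | do(X))
print(p_y_obs, p_y_do)   # tensor([0.8000, 0.2000]) tensor([0.6500, 0.3500])
```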
4.4. Interventional High-Order Feature Learning
Suppose that V is the last convolutional layer of one image extracted by one network, and that V is divided into k equal-sized, disjoint subsets in order. Similarly, let W denote the last convolutional layer of a second image extracted by another network and partitioned in the same way. Thus, Equation (14) is obtained:

$V = V_1 \oplus V_2 \oplus \dots \oplus V_k, \qquad W = W_1 \oplus W_2 \oplus \dots \oplus W_k.$
Since the output layer is divided in order, the semantic information within each individual sub-feature space should be considered positionally similar. Therefore, the focus is on extracting the relationships between them.
For each subspace, it is further divided into p parts in order, as expressed in Equation (15). The tensor product is then applied to each pair of these parts within the same subspace (Equation (16)), so that the inter-relationship tensor product space of each subspace is obtained. Next, the same operation is applied between this space and W to obtain Equation (17).
As there is no prior knowledge during training, it is difficult to determine the number of features that have a causal effect on classification. In other words, the number of confounder strata z is unknown and, to some extent, unbounded. Thus, it is prohibitive to achieve the above backdoor adjustment directly through Equation (13). However, the probability can be approximated using the inverse probability weighting formulation in Equation (18).
Thus, a multi-head strategy is naturally applied [47]. Each of the k tensor product subspaces can be considered a fine-grained sampling of the confounders. Hence, the logit calculation with the classifiers' backdoor adjustments is formulated in Equation (19), in which each classifier head is assigned a weight.
Next, feature backdoor adjustments are applied. Class adjustments are quantized into weights that scale the class mean features and are subsequently normalized, as in Equation (20), where the weight of the i-th class is the probability that x belongs to the i-th label and the mean feature is that of the i-th class.
To make the causal intervention more fine-grained, the two backdoor adjustments are applied simultaneously. By combining and organizing Equations (19) and (20), Equation (21) is obtained, where ⊕ denotes vector concatenation. This combination is straightforward: vector concatenation treats the classifier backdoor adjustments as equally important as the feature backdoor adjustments.
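A hedged sketch of how the two adjustments of Section 4.4 might be wired together is given below; the module name InterventionalHead, the dimensions, and the exact weighting and normalization choices are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterventionalHead(nn.Module):
    """Sketch: k parallel classifiers over k fused subspaces (classifier
    backdoor adjustment) plus a class-mean reweighting of features
    (feature backdoor adjustment), combined by concatenation as in Eq. (21)."""

    def __init__(self, k=8, sub_dim=64 * 64, num_classes=200):
        super().__init__()
        self.classifiers = nn.ModuleList([nn.Linear(sub_dim, num_classes) for _ in range(k)])
        # Running class-mean features, assumed to be updated elsewhere during training.
        self.register_buffer("class_means", torch.zeros(num_classes, k * sub_dim))

    def forward(self, subspaces):                    # subspaces: (B, k, sub_dim)
        # (1) Classifier backdoor adjustment: average the k heads' logits.
        logits = torch.stack(
            [clf(subspaces[:, i]) for i, clf in enumerate(self.classifiers)], dim=1
        ).mean(dim=1)                                # (B, num_classes)
        # (2) Feature backdoor adjustment: reweight class means by P(y | x) and normalize.
        probs = F.softmax(logits, dim=-1)
        adjusted = F.normalize(probs @ self.class_means, dim=-1)
        # (3) Concatenate both adjusted representations (the ⊕ of Eq. (21)).
        flat = F.normalize(subspaces.flatten(1), dim=-1)
        return torch.cat([flat, adjusted], dim=-1)
```

A final linear classifier over the concatenated vector would then produce the class scores.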
5. Experiments
This section evaluates the experimental results from three key perspectives: (1) a comparative analysis of traditional accuracy metrics, (2) ablation studies to assess the effectiveness of the proposed method, and (3) an investigation of the impact of different hyperparameter values through comparative experiments.
5.1. Datasets
The effectiveness of the proposed interventional multilinear learning method is assessed on three widely used datasets for Fine-grained Visual Categorization: Caltech-UCSD Birds (CUB-200-2011) [48], FGVC-Aircraft [49], and Stanford Cars [50]. The datasets' details are shown in Table 1: (1) Caltech-UCSD Birds-200-2011 (CUB) is an extension of CUB-200, which includes 200 classes, and each class has around 60 samples. (2) The FGVC-Aircraft dataset contains 10,200 aircraft images, with each of the 100 different aircraft model variants having 102 images. (3) The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images. It is important to note that only category labels are used in the experiments.
5.2. Implementation Details
Overall framework. The 16-layer Visual Geometry Group network (VGG-16) and the 18-layer ResNet (ResNet-18) [51] were pre-trained and used as backbones. When VGG-16 was employed as the backbone, the input image was first resized to 448 × 448 pixels, the input resolution used for the VGG-16 backbone in this setting. For fine-tuning the fully connected layers, the 1000-way classification layer pre-trained on the ImageNet dataset was replaced with a k-way softmax layer, where k corresponds to the number of classes in the fine-grained dataset. The final pooling layer was then replaced with a high-order feature pooling layer.
It is important to note that the network’s previous parameters were frozen to allow for training of only the last layer. For high-order feature pooling, bilinear feature fusion was used as the inter-relationship feature extractor on a single image. To capture outer-relationship features, this bilinear feature was fused with the output layer of another image sharing the same label but differing from the first. The parameters of the softmax layer were randomly initialized. The fully connected layer was trained with a higher learning rate while monitoring the accuracy of the validation set. After training the fully connected layer, its parameters were incorporated into the overall network training, with previously frozen parameters unfrozen.
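A hedged sketch of this two-stage schedule is shown below; the module names (backbone, ihff_head), the stand-in head, and the stage-2 learning rate are illustrative assumptions, while the stage-1 learning rate of 1 follows the configuration details given later.

```python
import torch
import torchvision

# Stage 1: freeze the pre-trained backbone and train only the new layers.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
ihff_head = torch.nn.Linear(4096, 200)   # stand-in for the high-order pooling + classifier
for p in backbone.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.SGD(ihff_head.parameters(), lr=1.0)

# Stage 2: unfreeze the backbone and fine-tune the whole network at a lower rate.
for p in backbone.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.SGD(
    [{"params": backbone.parameters()}, {"params": ihff_head.parameters()}],
    lr=1e-3,   # illustrative value; the exact fine-tuning rate is not reproduced here
)
```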
The classification layer during initial training can be interpreted as a prior probability. Due to the difficulty and labor-intensive nature of obtaining part-level annotations in fine-grained datasets, backdoor adjustment could not be performed as part-level prior probabilities were unavailable. Therefore, using an adaptive network to acquire part-level prior probabilities proved both suitable and efficient.
Feature fusion details. Taking VGG-16 as an example, the output feature map size is 512 × 28 × 28. Through bilinear fusion with itself, the inter-relationship is represented as a 512 × 512 feature map. For causal intervention, the feature space is partitioned into k parts (with k set to 8 in VGG-16), resulting in eight smaller subspaces, each of size 64 × 64. Similarly, the output layer of another image is segmented into eight parts, each of size 64.
For outer-relationship feature fusion, each 64-dimensional vector is first expanded into a 64 × 1 matrix. The Kronecker product is then computed between each 64 × 64 matrix and the corresponding 64 × 1 matrix, yielding a high-order feature subspace of size 64 × 64 × 64, which can be reshaped into a 512 × 512 structure. By introducing causal intervention into high-order feature fusion, the final feature space is reduced from 512 × 512 × 512 to 8 × 512 × 512, while preserving the rank structure of the Kronecker product.
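The shapes above can be traced with the short sketch below; splitting the 512 channels into k = 8 groups before the bilinear step is one plausible reading of the partitioning and is an assumption, not the released implementation.

```python
import torch

# Trace the fusion sizes for the VGG-16 case with random stand-in features.
B, C, HW, k = 2, 512, 28 * 28, 8
feat_a = torch.randn(B, C, HW)       # conv5 features of image 1, flattened spatially
feat_b = torch.randn(B, C)           # pooled features of a same-label image

ga = feat_a.reshape(B, k, C // k, HW)                 # (B, 8, 64, 28*28)
gb = feat_b.reshape(B, k, C // k)                     # (B, 8, 64)

# Inter-relationship: per-group bilinear pooling -> (B, 8, 64, 64)
inter = torch.einsum("bkch,bkdh->bkcd", ga, ga) / HW

# Outer-relationship: Kronecker with the 64-dim vector of the other image,
# i.e. (64 x 64) ⊗ (64 x 1) -> 64^3 entries, reshaped to 512 x 512.
fused = torch.einsum("bkcd,bke->bkcde", inter, gb)    # (B, 8, 64, 64, 64)
fused = fused.reshape(B, k, 512, 512)                 # final 8 x 512 x 512 tensor
print(fused.shape)   # torch.Size([2, 8, 512, 512])
```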
The operation of reshaping the fused feature into an 8 × 512 × 512 structure is intentional. Here, the “8” represents the use of eight parallel classifiers, each corresponding to a distinct layer of backdoor adjustment within the causal intervention framework. These classifiers are designed to process the feature maps at different levels, allowing the model to simulate multiple intervention scenarios and refine the feature representations accordingly.
This design enables the model to learn not just from observed data, but also from intervention-based reasoning, which strengthens its ability to generalize and identify causally relevant patterns. While the causal reasoning module may appear abstract, it plays a vital role in guiding the model toward more discriminative and reliable decision-making in fine-grained classification tasks.
Configuration details. During the preprocessing of the training set, data augmentation is performed using RandomHorizontalFlip, followed by random cropping to a size of 448 and normalization. To ensure consistency with real-world image classification tasks, RandomHorizontalFlip is not applied to the validation or test sets. Initially, only the classifiers are trained using logistic regression, with a batch size of 16, a weight decay of 1 × , and a learning rate of 1. Subsequently, the entire network is fine-tuned using stochastic gradient descent, with a batch size of 64, a weight decay of 1 × , and a learning rate of 1 × .
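A hedged torchvision sketch of this preprocessing is given below; the resize size of 512 before cropping and the ImageNet normalization statistics are assumptions, since only the flip, the 448 crop, and normalization are stated above.

```python
import torchvision.transforms as T

norm = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

train_tf = T.Compose([
    T.Resize(512),               # assumed pre-crop size
    T.RandomHorizontalFlip(),    # augmentation used on the training set only
    T.RandomCrop(448),
    T.ToTensor(),
    norm,
])
test_tf = T.Compose([
    T.Resize(512),
    T.CenterCrop(448),           # no random flip on validation/test sets
    T.ToTensor(),
    norm,
])
```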
5.3. Results and Analysis
5.3.1. Comparative Analysis and Efficacy of IHFF in FGVC
Table 2 and Figure 4 present the comparative experiments between the proposed method and several classical methods, all of which were fine-tuned. The key findings are as follows:
(1) Accuracy improvements: The results demonstrate that the new method, IHFF, leads to significant accuracy enhancements across a variety of datasets and backbone networks, particularly showing notable improvements over the baseline model (B-CNN with VGG-16 as the backbone). This indicates that IHFF is effective across different datasets and backbone architectures.
(2) Effectiveness of feature fusion and causal reasoning: The data in Table 2 clearly show that methods incorporating feature fusion are generally more effective. For instance, IHFF achieves an accuracy of 90.7% on the CUB dataset using a ResNet backbone, compared to DBTNet's 88.1%. This underscores the utility of causal reasoning in enhancing Fine-grained Visual Categorization, with IHFF showing an average improvement of 3.40% over B-CNN. Furthermore, among the newer models, IHFF achieves state-of-the-art performance, surpassed only by DCAL. Models such as GBP, SFSCF-Net, and I2-HOFI are all enhancements based on high-order feature fusion, which underscores the effectiveness and significance of the causal intervention strategy employed in IHFF.
(3) Comparison with B-CNN: The B-CNN initially introduced high-order feature fusion into Fine-grained Visual Categorization. Our proposed IHFF method outperforms the B-CNN, demonstrating an average improvement of 3.43% across the three datasets and an even higher accuracy gain of 6.07% with the ResNet backbone. These findings validate the application of causal reasoning in this domain and illustrate that integrating causal interventions into high-order feature fusion boosts performance rather than working against it.
(4) Comparison with CAL: CAL pioneers the use of counterfactual causal reasoning in Fine-grained Visual Categorization but shows a lower accuracy on the CUB dataset compared to IHFF. The variance may be linked to the differing depths of the backbone networks utilized. Nonetheless, IHFF’s superior accuracy supports the effectiveness of our causal intervention approach and its underlying mathematical principles, showcasing the potential of causal reasoning in fine-grained image analysis.
(5) Comparison with DCAL: DCAL, currently the top-performing network in FGVC, achieves a higher overall accuracy than IHFF. Both methods employ high-order feature fusion, but DCAL may have an edge due to its integration of self-attention mechanisms, which likely improves its capability of capturing contextual information.
(6) Dataset-Specific Performance: IHFF exhibits notably higher performance improvement on the CUB dataset compared to the Aircraft and Cars datasets. This may be attributed to the CUB dataset having a larger number of categories and more training images per category. The data suggest that causal intervention learning particularly enhances network performance on datasets with smaller samples by mitigating confounding factors through backdoor adjustments, thereby focusing on truly impactful features. Conversely, the lesser improvement on the Aircraft dataset may be due to the classification task relying less on feature interactions to extract contextual semantic information, as identifying aircraft types might often depend more on recognizing distinct physical features, such as the number of windows.
Table 2. Top-1 classification accuracy.
Method | Backbone | CUB | Cars | Aircraft | Feature Fusion |
---|---|---|---|---|---|
ResNet-50 [51] | ResNet-50 | 84.5 | - | - | × |
B-CNN [5] | VGGD+VGGM | 84.1 | 91.3 | 83.9 | ✓ |
DBTNet [13] | ResNet-101 | 88.1 | 94.5 | 91.6 | × |
Improved B-CNN [52] | VGGD+VGGM | 85.8 | 92.0 | 88.5 | ✓ |
LRBP [7] | VGG-16 | 84.2 | 90.9 | 87.3 | ✓ |
HBP [11] | VGG-16 | 87.1 | 93.7 | 90.3 | ✓ |
GBP [53] | GCNN | 87.8 | 93.5 | 89.6 | ✓ |
SFSCF-Net [54] | ResNet-50 | 89.6 | 94.5 | - | ✓ |
I2-HOFI [54] | ResNet-50 | 90.1 | 94.3 | 92.3 | ✓ |
CAL [26] | ResNet-101 | 90.6 | 95.5 | 94.2 | × |
DCAL [55] | R50-ViT-Base | 92.0 | 95.3 | 93.3 | ✓ |
IHFF (ours) | VGG-16 | 87.4 | 93.9 | 88.2 | ✓ |
IHFF (ours) | ResNet-50 | 90.7 | 94.8 | 92.0 | ✓ |
Figure 4. Bar chart of Top-1 classification accuracy.
5.3.2. Ablation Studies
The ablation experiment freezes the parameters of the backbone network and only trains the fully connected layer with a learning rate of 1, without fine-tuning. It is important to note that the baseline model removes the final pooling and fully connected layers from the backbone networks, replacing them with bilinear pooling layers [5]:

$B = \sum_{s} x_s\, y_s^{\top},$

where $x_s$ and $y_s$ represent the output layer parts of the different networks that correspond to the same position s.
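A minimal sketch of this bilinear pooling baseline is shown below; the signed square-root and l2 normalization steps commonly paired with B-CNN are included as an assumption.

```python
import torch
import torch.nn.functional as F

# Bilinear pooling: sum of outer products of the two networks' features
# taken at the same spatial positions.
x = torch.randn(512, 28 * 28)   # network A conv features (channels x positions)
y = torch.randn(512, 28 * 28)   # network B conv features at the same positions
bilinear = (x @ y.t()) / (28 * 28)                       # (512, 512) pooled descriptor
z = torch.sign(bilinear) * torch.sqrt(bilinear.abs())    # signed square root
z = F.normalize(z.flatten(), dim=0)                      # l2 normalization
```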
Table 3 and Figure 5 reveal that prior research has highlighted the value of feature fusion in Fine-grained Visual Categorization, and our findings substantiate this further:
(1) Enhanced bilinear pooling: Utilizing higher-order feature fusion via the Kronecker product has significantly enhanced two-dimensional bilinear pooling. This approach resulted in accuracy increases of 4.96% for VGG-16 and 4.81% for ResNet-18.
(2) Backdoor adjustment efficacy: Applying backdoor adjustments to the class has boosted performance by an average of 6.28%, whereas adjustments to the classifier have shown a slightly higher average improvement of 6.44%. However, applying backdoor adjustments across all methods only yielded a modest average increase of 1.22%, indicating that the overall impact of backdoor adjustments may be limited.
(3) Impact of post-adjustment: Implementing post-adjustments without incorporating high-order feature fusion led to a performance boost of about 1%, which is lower than when using high-order feature fusion alone. Nevertheless, the results were less effective compared to scenarios where adjustments were applied after high-order feature fusion. This finding underscores the potential of causal interventions in managing high-dimensional data by eliminating confounding factors and retaining more impactful features.
(4) Overall methodology impact: The collective application of all methods led to an improvement of approximately 7.58% over the baseline model and about 1.20% over using high-order feature fusion alone. Although the proposed method markedly enhances accuracy compared to the baseline, the combined benefit of all methods is not additive, likely due to the diminishing returns associated with increased complexity.
Table 4 demonstrates the effectiveness of the proposed IHFF module after fine-tuning the entire network with a batch size of 64 and the same weight decay and learning rate as in the fine-tuning configuration of Section 5.2. Even after end-to-end training, the high-order feature fusion and causal intervention module continued to yield an improvement in classification accuracy. However, the overall accuracy gain was lower compared to the scenario where only the proposed module was trained while keeping the backbone frozen. This may be attributed to the increased complexity and number of trainable parameters during full fine-tuning, which potentially introduces model instability and partial overfitting to the training data.
Table 3. Ablation experiment results with only high-order feature fusion and causal intervention modules trained.
Backbone | Method | Accuracy | Comparison | High-Order Features |
---|---|---|---|---|
VGG-16 | Baseline | 74.86 | - | × |
 | With feature fusion | 79.81 | +4.95 | ✓ |
 | With class BDA | 81.05 | +6.19 | ✓ |
 | With classifier BDA | 81.24 | +6.38 | ✓ |
 | With both BDA | 80.17 | +5.31 | × |
 | With all | 82.22 | +7.36 | ✓ |
ResNet-18 | Baseline | 77.92 | - | × |
 | With feature fusion | 82.73 | +4.81 | ✓ |
 | With class BDA | 84.29 | +6.37 | ✓ |
 | With classifier BDA | 84.42 | +6.50 | ✓ |
 | With both BDA | 84.06 | +6.14 | × |
 | With all | 85.71 | +7.79 | ✓ |
Figure 6 shows the results of an ablation study on pairwise learning, utilizing a VGG-16 backbone network. It should be noted that this ablation experiment froze the parameters of the backbone network and only trained high-order feature fusion and causal intervention modules. The figure clearly demonstrates that feature fusion strategies that incorporate outer relationships generally outperform those limited to inter-relationships. With few exceptions, the accuracy gained by integrating both inter- and outer-relationships consistently surpasses that achieved through inter-relationship alone. Since traditional feature fusion methods primarily leverage inter-relationship information derived from features within the same image, these findings are significant. They point to a promising new direction for enhancing feature fusion techniques in Fine-grained Visual Categorization.
Figure 5. Results for ablation studies with different backbones on CUB-200-2011. The blue line refers to backbone VGG-16, the orange line refers to ResNet-18.
Table 4. Ablation experiment results with the fine-tuned network.
Backbone | Method | Accuracy | Comparison | High-Order Features |
---|---|---|---|---|
VGG-16 | Baseline | 84.12 | - | × |
 | With feature fusion | 85.39 | +1.17 | ✓ |
 | With class BDA | 86.13 | +2.01 | ✓ |
 | With classifier BDA | 86.38 | +2.26 | ✓ |
 | With both BDA | 86.00 | +1.82 | × |
 | With all | 87.42 | +3.30 | ✓ |
Figure 6. Ablation study results using VGG-16 backbone on pairwise learning. The blue line represents IHFF, while the orange line denotes the variant where outer-relationships are excluded during feature fusion, illustrating the impact of these relationships on performance.
5.3.3. Research on Varying Numbers of Classifiers
Figure 7 illustrates the accuracy variation with respect to different values of k (the number of classifiers) and the number of epochs for the VGG-16 and ResNet-18 backbones. It can be observed that as k increases, the initial accuracy improves more rapidly. This improvement may be due to the multiple classifier layers, intervened on through the backdoor adjustment, pre-training the fully connected layers, which leads to better initial performance. However, as training progresses, an excessive number of classifiers results in a decrease in accuracy.
Specifically, Figure 8 shows accuracy variations for different numbers of classifiers. Generally, performance is better when the hyperparameter k is set to 8 or 16; beyond this range, a sharp decline in performance occurs. Contrary to the analysis in ablation study point 4, this decline is not caused by overfitting. It can instead be explained by the total dimension at the final classification layer, which is given by

$\dim f = k \left(\frac{d}{k}\right)^{3} = \frac{d^{3}}{k^{2}},$

where f is the classification function and d is the channel dimension of the fused features (d = 512 in the setting of Section 5.2).
From Equation (23), it is evident that when the number of classifiers doubles, the overall dimensionality is reduced by a factor of 4. This suggests that an excessive number of classifiers leads to a loss of semantic information: by dividing the feature space too finely, the same semantic features can be split into separate parts, resulting in the loss of fine-grained part-level details. This finding indirectly supports the effectiveness of the proposed approach in extracting contextual semantic information, as discussed in Section 4.2.
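A quick arithmetic check of this factor-of-4 behavior, using the channel dimension of 512 from Section 5.2 and the dimensionality expression reconstructed above (an assumption about the exact form of Equation (23)):

```python
# Total fused dimensionality k * (512 / k)^3 = 512^3 / k^2 for several k.
for k in (4, 8, 16, 32):
    print(k, k * (512 // k) ** 3)
# 4  -> 8,388,608
# 8  -> 2,097,152  (= 8 x 512 x 512, and 4x smaller than k = 4)
# 16 -> 524,288    (4x smaller again)
# 32 -> 131,072
```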
5.3.4. Convergence Speed Analysis
Figure 9 illustrates the convergence speed of the loss function under different values of the hyperparameter k. The experiment compares model convergence across varying k settings. The results show that increasing k significantly accelerates convergence and reduces the final loss. The blue curve, representing the baseline model without causal reasoning, exhibits the slowest convergence, with the loss stabilizing around 2 after 30 epochs; this indicates that the absence of causal reasoning hinders convergence efficiency. With k = 4 (orange curve), the initial loss reduction is faster than the baseline, but the overall decline remains moderate, ending with a final loss of around 1. With k = 8 (red curve), convergence further improves, showing a sharp decline in the first 10 epochs and reaching a final loss near 0.6. The best performance is observed with k = 16 (green curve), where the loss drops rapidly in the early epochs and stabilizes after 20 epochs at approximately 0.3. These findings suggest that higher k values enhance both convergence speed and model performance, and that the inclusion of causal reasoning contributes significantly to more efficient training and faster inference.
5.3.5. Visualization
Figure 10 presents the generated heat map, highlighting the selected high-response regions. The results indicate that IHFF consistently identifies the most discriminative areas within an image. Specifically, in the CUB dataset, it effectively emphasizes feature-rich regions, such as the bird’s beak and feather texture.
In DCAL, self-attention is modified by replacing the keys (K) and values (V) with representations from two separate images, allowing the model to capture interactions between different instances of the same class. Inspired by this, to extract more effective features through paired learning, IHFF introduces an auxiliary feature extraction path (Stream C) that processes a second image of the same category. The resulting features are then downsampled to reduce their impact on the main feature stream, allowing the model to benefit from cross-image guidance while preserving the stability and fine-grained nature of the bilinearly pooled features from Streams A and B.
The goal is not to alter or dilute the main feature representation, but to provide additional semantic cues that can help the model better identify discriminative regions by learning from another instance of the same class. This mechanism effectively encourages the model to focus attention on the most informative parts of the object. The visualization experiment supports the effectiveness of this auxiliary feature extraction path. In Figure 10a, it can be seen that the high-response attention areas of IHFF are concentrated in regions where fine-grained targets have identifiable features, such as bird heads and beaks. This demonstrates that IHFF, like most strong models, can effectively guide attention to informative regions.
In Figure 10b, it can be observed that causal intervention effectively reduces misleading attention, guiding the network toward more relevant discriminative cues. For instance, in the first image of Figure 10b the attention focuses primarily on the bird itself, excluding tree branches from the high-response features. This suggests that causal intervention helps disentangle features with statistical correlation but no causal relationship. Specifically, since this bird species frequently perches on tree branches, the model may mistakenly learn branch features as intrinsic to the bird. However, tree branches are merely environmental elements and not part of the bird itself. The causal intervention successfully mitigates this confounding factor.
Similarly, the third image in Figure 10b demonstrates that causal intervention prevents the misinterpretation of tree stump textures as part of the bird's feather pattern, ensuring that high-response areas are concentrated in the correct regions. Furthermore, the second image in Figure 10b demonstrates the ability of causal representation learning to reinforce essential discriminative features. Since the bird's tail feathers are its most distinctive characteristic, causal representation learning effectively captures this feature even in the presence of a complex background. Rather than dispersing attention across the entire bird, it focuses on the most discriminative region, enhancing classification accuracy. This discriminative and targeted focus prevents misclassification, which could otherwise arise because the bird's predominantly black body blends with the background.
6. Conclusions
This paper introduces a novel high-order feature learning framework with causal inference for fine-grained categorization. By leveraging a tensor product space, the framework extracts high-order feature representations while mitigating the influence of confounding factors through causal interventions, specifically backdoor adjustments. To the best of the author’s knowledge, this is the first comprehensive application of causal representation learning in fine-grained image analysis tasks.
The proposed method does not require bounding boxes or part annotations and can be trained end-to-end, making it flexible and widely applicable. Extensive experiments on three benchmark datasets (CUB-200-2011, FGVC-Aircraft, and Stanford Cars) demonstrate the effectiveness and robustness of the IHFF approach.
By leveraging causal inference and high-order feature learning, this method enhances robustness and interpretability, making it beneficial in real-world scenarios where fine-grained distinctions are crucial. This work is expected to inspire further exploration of causal interventions in fine-grained visual classification and other computer vision tasks. The integration of causal reasoning into computer vision presents a promising direction, and the success of this method may encourage the adoption of causal models across diverse deep learning domains, which could further drive the development of multimodal model fusion, akin to the advancements seen with transformer architectures.