3.1. Problem Setup
In class-incremental learning, a model learns a sequence of tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_T\}$. For each task $\mathcal{T}_t$, there is a dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{n_t}$ that contains $n_t$ training examples $x_i$ and their labels $y_i$. $\mathcal{D}_t$ contains the class set $C_t$; because the class sets are disjoint, the total number of classes is $\sum_{t=1}^{T} |C_t|$. After training on a task, a memory buffer $\mathcal{M}$ retains a small number of samples to help the model mitigate forgetting. The memory buffer contains samples from previous tasks. The data available for the current training session is represented by $\hat{\mathcal{D}}_t = \mathcal{D}_t \cup \mathcal{M}$. Let $F_i$ denote the module added at step $i$. Given an input $x$, it outputs a feature vector $F_i(x) \in \mathbb{R}^d$, where $d$ is the embedding dimension.
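To make the protocol concrete, here is a hypothetical NumPy sketch of the setup above: $T$ disjoint tasks, each contributing new classes, plus a small replay memory $\mathcal{M}$ drawn from earlier tasks. Uniform random sampling stands in for whatever exemplar-selection rule a real method uses, and all sizes are made up for illustration.

```python
import numpy as np

# Sketch of the class-incremental protocol: disjoint tasks plus a replay
# buffer. Sizes and the exemplar-selection rule are illustrative only.
rng = np.random.default_rng(0)
num_tasks, classes_per_task, samples_per_class = 5, 2, 20
X = rng.normal(size=(num_tasks * classes_per_task * samples_per_class, 8))
y = np.repeat(np.arange(num_tasks * classes_per_task), samples_per_class)

memory_x, memory_y = [], []
for t in range(num_tasks):
    task_classes = np.arange(t * classes_per_task, (t + 1) * classes_per_task)
    mask = np.isin(y, task_classes)
    # D_hat_t = D_t ∪ M: current-task data plus everything in the buffer
    d_hat_x = np.concatenate([X[mask]] + memory_x)
    d_hat_y = np.concatenate([y[mask]] + memory_y)
    # after "training", retain two exemplars per new class in the memory
    for c in task_classes:
        idx = rng.choice(np.flatnonzero(y == c), size=2, replace=False)
        memory_x.append(X[idx])
        memory_y.append(y[idx])

print(d_hat_x.shape[0], len(np.unique(np.concatenate(memory_y))))  # 56 10
```

At the final step, the training set holds 40 current-task samples plus 16 buffered exemplars, and the buffer covers all 10 classes seen so far.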
Dynamically expandable networks (DENs). When learning a new task, DENs augment the model with a new module, which can be a feature extractor or a task token. Previous modules are frozen to preserve their learned information. The features or embeddings produced by each module are then fed into a classifier. As specific implementations of the DEN framework, DER [4] adds a new feature extractor $F_t$ for each new task, while the previous feature extractors $F_1, \dots, F_{t-1}$ are frozen. The concatenated features $u(x) = [F_1(x), \dots, F_t(x)]$ are then sent to the task-specific classifier $H_t$ at step $t$. DyTox [5] adds a new task token for each task; each task-specific embedding from the task attention block is fed to a task-specific classifier. MCG [6] uses DER as its baseline and further incorporates additional feature extractors. For each task, it computes multiple centers so that, during testing, weights are derived by comparing the similarity between a sample and each task center. These weights are then used to amplify the contribution of the most relevant feature extractor.
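The DER-style expansion scheme can be sketched as follows, with random linear maps standing in for trained feature extractors (names, dimensions, and the plain matrix classifier are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

# Minimal sketch of a DER-style expandable network: each task adds a new
# linear "feature extractor"; old ones stay frozen, and their outputs are
# concatenated before a classifier rebuilt at every step.
rng = np.random.default_rng(0)
d_in, d_feat = 8, 4

class ExpandableNet:
    def __init__(self):
        self.extractors = []      # frozen modules F_1, ..., F_{t-1}
        self.classifier = None    # task-step classifier H_t

    def add_task(self, num_total_classes):
        self.extractors.append(rng.normal(size=(d_in, d_feat)))
        # H_t maps the concatenated features [F_1(x), ..., F_t(x)] to logits
        self.classifier = rng.normal(
            size=(len(self.extractors) * d_feat, num_total_classes))

    def forward(self, x):
        feats = np.concatenate([x @ W for W in self.extractors], axis=-1)
        return feats @ self.classifier

net = ExpandableNet()
net.add_task(10)    # step 1: 10 classes
net.add_task(20)    # step 2: 10 more classes, feature width doubles
logits = net.forward(rng.normal(size=(3, d_in)))
print(logits.shape)  # (3, 20)
```

The key structural point is that the classifier input width grows with every added module, while the earlier extractors are never updated.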
Auxiliary loss. To enable the newly added feature extractor $F_t$ to focus specifically on task $\mathcal{T}_t$ and effectively distinguish between new and old tasks, DER trains $F_t$ with an auxiliary loss. An auxiliary classifier $H_t^a$ is used to solve a $(|C_t| + 1)$-way classification problem on $F_t(x)$, treating all previously learned classes as one class; $|C_t|$ is the number of classes in $\mathcal{T}_t$. The auxiliary loss improves both the average accuracy and the last-step accuracy of DER [4], and is also widely used in DyTox and MCG.
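The relabeling behind this auxiliary objective is simple to sketch: all old classes collapse to a single label and the new classes take the remaining labels, giving one extra "old" class on top of the current-task classes (the function name and label convention below are illustrative assumptions):

```python
import numpy as np

# Sketch of the auxiliary relabeling: every old class maps to label 0,
# and the new-task classes map to 1..|C_t|, yielding a (|C_t|+1)-way
# problem for the new extractor's auxiliary head.
def auxiliary_labels(y, old_classes, new_classes):
    aux = np.zeros_like(y)                 # old classes -> single label 0
    for k, c in enumerate(sorted(new_classes), start=1):
        aux[y == c] = k                    # new classes -> 1..|C_t|
    return aux

y = np.array([0, 3, 7, 8, 9, 2])           # mixture of old and new classes
print(auxiliary_labels(y, old_classes={0, 1, 2, 3}, new_classes={7, 8, 9}))
# [0 0 1 2 3 0]
```

A standard cross-entropy over these remapped labels is then all the auxiliary classifier needs.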
3.2. Rethinking the Stability–Plasticity Dilemma
DENs add a fresh module for every incoming task and freeze earlier modules to curb forgetting. However, our findings indicate that DENs still suffer from catastrophic forgetting.
Catastrophic forgetting in early tasks. After training on all tasks, we observe a substantial performance drop for early tasks in the final model configuration for DER. As shown in Figure 1, after completing training on 10 tasks in ImageNet100-B0S10, a significant degradation in performance is evident, with a large accuracy gap between the first and last tasks. We attribute this primarily to the auxiliary loss's inadequate utilization of old-task information and to the feature bias arising from the imbalance between new and old class samples.
Feature bias in DENs. The data imbalance problem arises during training on the new task, where the number of new-class samples in $\mathcal{D}_t$ is far greater than the number of old-class samples in the memory $\mathcal{M}$. This imbalance biases the model toward classifying inputs as new classes. Given that the modules corresponding to the old tasks are frozen, this dominance is due to the features extracted by the new feature extractor. As illustrated in Figure 2, after training 10 tasks on CIFAR100-B0 with DER, for each task $\mathcal{T}_i$ the features extracted by $F_i$ exhibit significantly higher norms than those of any other task $\mathcal{T}_j$ extracted by $F_i$, and higher than those of $\mathcal{T}_i$ extracted by any other extractor $F_j$, where $j \neq i$.
Consequently, the new module focuses on the current task and dominates the model, leading to catastrophic forgetting. Although this specialization is expected, it risks under-utilizing cross-task knowledge. These observations emphasize the critical need for strategies that more effectively balance the training influence between new and old tasks, aiming to mitigate feature bias.
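The norm comparison behind Figure 2 can be reproduced schematically. In the sketch below, random linear maps stand in for trained extractors and the bias is injected by hand, purely to illustrate how the per-module, per-task norm matrix is measured:

```python
import numpy as np

# Diagnostic sketch for the feature-bias observation: mean L2 norm of
# each module's features on each task. The 3x scaling on the diagonal
# emulates (by hand) the bias observed in trained DER models.
rng = np.random.default_rng(0)
extractors = [rng.normal(size=(8, 4)) for _ in range(3)]
task_data = [rng.normal(size=(16, 8)) for _ in range(3)]

norms = np.zeros((3, 3))                    # norms[i, t]: module i, task t
for i, W in enumerate(extractors):
    for t, x in enumerate(task_data):
        feats = (x @ W) * (3.0 if i == t else 1.0)   # injected bias
        norms[i, t] = np.linalg.norm(feats, axis=1).mean()

# the diagonal (each module on its own task) dominates its row
off_diag_max = (norms + np.diag([-np.inf] * 3)).max(axis=1)
print(bool((norms.diagonal() > off_diag_max).all()))  # True
```

In a real measurement, `task_data` would be held-out samples per task and `extractors` the frozen trained modules; the matrix structure is the same.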
Auxiliary loss hurts stability. Auxiliary loss helps the new module focus on learning the new task. However, while this approach may reduce the complexity of the feature space, it inadvertently increases confusion among old task classes. Specifically, the auxiliary loss fails to effectively enhance intra-class compactness and inter-class separation, leading to a degradation in discriminative ability among classes. Data imbalance amplifies the problem: the new-task module dominates, causing accuracy on earlier tasks to plummet. By design, the auxiliary loss does not directly address the need to maintain robust feature representations for old tasks, thereby undermining the stability of the model’s learning over time.
Auxiliary loss influences plasticity. During training, the auxiliary loss deliberately avoids distinguishing the classes from previous tasks so that the new feature extractor can concentrate on the current task. This blurring of distinctions among old classes can degrade the model's capacity to recognize the unique characteristics of each class, leading to a diminished ability to effectively adapt and apply learned insights to new classes. Consequently, while the auxiliary loss helps mitigate interference from old tasks, it may compromise the nuanced understanding needed for new tasks, thus affecting plasticity. Following the completion of each task, we assess the impact of the auxiliary loss on the plasticity of each module $F_i$ associated with its corresponding task $\mathcal{T}_i$. After training task $\mathcal{T}_N$ and obtaining $N$ frozen modules $F_1, \dots, F_N$, we evaluate the plasticity of each module $F_i$ by training a new classifier solely with $F_i$ for classifying the corresponding task $\mathcal{T}_i$. This setup constitutes a $|C_i|$-way classification problem, where $|C_i|$ is the number of classes in task $\mathcal{T}_i$.
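This plasticity probe amounts to a linear probe on frozen features. The sketch below uses synthetic data, a random frozen module, and a least-squares fit to one-hot targets as a stand-in for the gradient-based classifier training; only the procedure, not the numbers, mirrors the evaluation:

```python
import numpy as np

# Sketch of the plasticity probe: freeze a module F_i, then fit a fresh
# |C_i|-way linear classifier on its features for task T_i alone.
# Least squares to one-hot targets stands in for SGD training.
rng = np.random.default_rng(0)
n, d_in, d_feat, num_classes = 60, 8, 16, 3
y = rng.integers(0, num_classes, size=n)
# widely separated class means, so a good probe should fit them easily
x = rng.normal(size=(n, d_in)) + 10.0 * np.eye(d_in)[y]

W_frozen = rng.normal(size=(d_in, d_feat))   # the frozen module F_i
feats = np.maximum(x @ W_frozen, 0.0)        # ReLU features, no gradients
probe, *_ = np.linalg.lstsq(feats, np.eye(num_classes)[y], rcond=None)

acc = ((feats @ probe).argmax(1) == y).mean()
print(acc)
```

The probe accuracy, computed per module, is the plasticity score compared across training losses in Figure 3.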
As depicted in Figure 3a, for the initial 10 classes of CIFAR100, the classification accuracy of the two methods is comparable. However, as the number of previously learned classes grows, the auxiliary loss begins to negatively impact the ability of module $F_i$ to effectively learn its corresponding task $\mathcal{T}_i$, when compared to the CEC loss. Figure 3b illustrates that on ImageNet-1000, the CEC loss consistently outperforms the auxiliary loss.
These observations suggest that, while the auxiliary loss facilitates easier learning of new tasks, it does so at the cost of impairing the model's ability to accurately recall previously learned tasks, and it also limits the new module's plasticity. To mitigate these issues, it is crucial to develop strategies that not only balance the learning between new and old tasks but also fully leverage existing knowledge from old tasks to support new-task learning. Such strategies enhance feature separation and compactness across all classes, thereby reducing forgetting.
3.3. Asymmetric Contrastive Auxiliary Loss
We therefore propose two asymmetric contrastive auxiliary losses for DENs: the current feature contrastive loss (CEC loss) and the cross-feature contrastive loss (CFC loss). These losses aim to enhance the network's capability to retain knowledge from previously learned tasks while effectively learning new tasks.
As illustrated in Figure 4, when the model undertakes a new task $\mathcal{T}_t$ at step $t$, a corresponding new module $F_t$ is introduced. The previous modules $F_1, \dots, F_{t-1}$ are frozen to prevent interference from subsequent learning processes.
Current feature contrastive loss (CEC). The CEC loss is engineered to refine the representation of the current task by leveraging both new and old task samples, thereby enhancing intra-class compactness and inter-class separation within the new module. This not only improves the plasticity of the module with respect to new tasks but also equips it to better differentiate between classes from earlier tasks.
When training on a new task $\mathcal{T}_t$, the module processes a batch of $N$ samples drawn from the current task combined with samples from the memory buffer $\mathcal{M}$. Two data augmentation techniques are applied to produce a total of $2N$ examples. The CEC loss trains the new module $F_t$ to discriminate between the current task and samples from $\mathcal{M}$. The features generated by $F_t$ are transformed into a projected vector $z_i = g(F_t(x_i))$ by a two-layer perceptron projector $g$. The CEC loss is as follows:
$$\mathcal{L}_{CEC} = \sum_{i=1}^{2N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{j \neq i} \exp(z_i \cdot z_j / \tau)}, \qquad (1)$$
where $P(i) \equiv \{j \mid j \neq i,\ y_j = y_i\}$ represents the set of indices corresponding to samples of the same class as $x_i$, and $\tau$ is a positive scalar temperature parameter. By using Equation (1), $F_t$ learns to distinguish each class from both the old and new tasks. Old samples obtained from $\mathcal{M}$ can serve as anchors to aid $F_t$ in learning to distinguish them from other classes. This is particularly beneficial for mitigating the forgetting of old classes.
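Equation (1) has the form of a supervised contrastive loss over the $2N$ projected views. The NumPy sketch below implements that form (averaged over anchors rather than summed, and with cosine-normalized embeddings, both common conventions we assume here for stability):

```python
import numpy as np

# Supervised contrastive form of Eq. (1): positives P(i) are other
# samples sharing the anchor's label; tau is the temperature.
def cec_loss(z, y, tau=0.1):
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    n = len(y)
    not_self = ~np.eye(n, dtype=bool)
    log_den = np.log(np.where(not_self, np.exp(sim), 0.0).sum(1))
    loss, count = 0.0, 0
    for i in range(n):
        pos = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        if len(pos):
            loss += np.mean(log_den[i] - sim[i, pos])  # -log p over P(i)
            count += 1
    return loss / count

rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 1, 2, 2])
z_random = rng.normal(size=(6, 4))
z_clustered = np.eye(3)[y] + 0.01 * rng.normal(size=(6, 3))
print(cec_loss(z_clustered, y) < cec_loss(z_random, y))  # True
```

Tightly clustered same-class embeddings give a lower loss than random ones, which is exactly the intra-class compactness and inter-class separation the CEC loss is meant to enforce.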
Cross-feature contrastive loss (CFC). When the old tasks were trained, the current task $\mathcal{T}_t$ was not visible to the modules $F_1, \dots, F_{t-1}$. As a result, $F_1, \dots, F_{t-1}$ struggle to differentiate features related to $\mathcal{T}_t$; the features of $\mathcal{T}_t$ derived from them are ineffectual, offering no aid in learning $\mathcal{T}_t$. However, as $F_1, \dots, F_{t-1}$ have accumulated valuable knowledge from past tasks, this existing knowledge can be harnessed to enhance the learning capabilities of $F_t$. The cross-feature contrastive loss (CFC) leverages features of old tasks preserved in $F_1, \dots, F_{t-1}$ to aid $F_t$ in effectively distinguishing between old and new tasks, thus helping to establish robust decision boundaries. This enhances the model's generalization capabilities across tasks.
The operational mechanism of the CFC loss involves projecting features from both old and new tasks into a shared feature space where they can be directly compared. Features of the old-task samples stored in the memory buffer $\mathcal{M}$, processed by $F_1, \dots, F_{t-1}$, are projected by a shared projector $g$ and denoted by $Z^{old}$. Similarly, the features from the new module $F_t$, corresponding to all available samples, are also projected and denoted by $Z^{new}$. The combined set of features, $Z = Z^{old} \cup Z^{new}$, serves as the basis for the contrastive learning process:
$$\mathcal{L}_{CFC} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{j \in I,\ j \neq i} \exp(z_i \cdot z_j / \tau)}, \qquad (2)$$
where $I$ indexes the combined set $Z$ and $P(i)$ is defined as in Equation (1). The projector $g$ is shared by the CEC loss and the CFC loss, and is discarded once the training of the specific task is complete.
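The asymmetric data flow is the distinctive part: old modules embed only memory samples, while the new module embeds everything, and both pass through one shared projector. The sketch below traces only this wiring, with random matrices standing in for the trained extractors and projector:

```python
import numpy as np

# Asymmetric CFC construction: frozen old modules see memory samples
# only; the new module sees all samples; a shared projector g maps both
# into one comparison space. All weights here are random stand-ins.
rng = np.random.default_rng(0)
d_in, d_feat, d_proj = 8, 4, 3
old_extractors = [rng.normal(size=(d_in, d_feat)) for _ in range(2)]  # frozen
new_extractor = rng.normal(size=(d_in, d_feat))                       # trainable
projector = rng.normal(size=(d_feat, d_proj))                         # shared g

x_new, y_new = rng.normal(size=(6, d_in)), np.array([4, 4, 5, 5, 6, 6])
x_mem, y_mem = rng.normal(size=(4, d_in)), np.array([0, 1, 2, 3])

# Z_old: memory samples through each frozen module; Z_new: all samples
z_old = np.concatenate([x_mem @ W @ projector for W in old_extractors])
z_new = np.concatenate([x_new, x_mem]) @ new_extractor @ projector
z_all = np.concatenate([z_old, z_new])
labels = np.concatenate([np.tile(y_mem, len(old_extractors)), y_new, y_mem])

print(z_all.shape, labels.shape)  # (18, 3) (18,)
```

The combined set `z_all` with `labels` then feeds the same supervised contrastive objective as the CEC loss.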
Total loss for learning new modules. Following DER, we concatenate the features from all modules and send the concatenated features to the classifier (for DyTox, following its original design, we feed the embeddings to the corresponding classifier). The concatenated features are denoted $u(x) = [F_1(x), \dots, F_t(x)]$, and the classifier is denoted $H_t$. A cross-entropy loss is applied to the current task data $\mathcal{D}_t$ and the memory buffer $\mathcal{M}$ as follows:
$$\mathcal{L}_{CE} = -\frac{1}{|\hat{\mathcal{D}}_t|} \sum_{(x, y) \in \hat{\mathcal{D}}_t} \log p_{H_t}(y \mid u(x)), \qquad (3)$$
where $\hat{\mathcal{D}}_t = \mathcal{D}_t \cup \mathcal{M}$. The overall loss for training $F_t$ is
$$\mathcal{L} = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{CEC} + \lambda_2 \mathcal{L}_{CFC}, \qquad (4)$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters, and $\lambda_2 = 0$ when the model learns the first task, since no previous modules exist yet.
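The combined objective of Equation (4) reduces to a simple weighted sum with the CFC term gated on the task index. The loss values and lambda settings below are illustrative placeholders, not the paper's tuned values:

```python
# Combined objective of Eq. (4): cross-entropy plus the two weighted
# contrastive terms; the CFC term is switched off on the first task.
def total_loss(l_ce, l_cec, l_cfc, lam1, lam2, first_task):
    if first_task:
        lam2 = 0.0          # no frozen old modules to contrast against
    return l_ce + lam1 * l_cec + lam2 * l_cfc

print(total_loss(2.0, 1.0, 4.0, lam1=0.5, lam2=0.25, first_task=False))  # 3.5
print(total_loss(2.0, 1.0, 4.0, lam1=0.5, lam2=0.25, first_task=True))   # 2.5
```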
3.4. Feature Adjustment
After obtaining the modules $F_1, \dots, F_t$ at stage 1, we next address the feature bias problem at stage 2. Due to data imbalance, both the module and the classifier tend to be biased toward the new task. The frozen modules $F_1, \dots, F_{t-1}$ store the knowledge of their corresponding previous tasks $\mathcal{T}_1, \dots, \mathcal{T}_{t-1}$, and fine-tuning the feature extractors on a balanced dataset would disturb this stored knowledge and harm the model's performance. To solve the feature bias problem, we instead introduce a learnable parameter $w_i$ for each module $F_i$, optimized to ensure a balanced representation across tasks. These parameters can be adjusted to determine suitable weights for adapting to the entire task sequence. We then concatenate the weighted features as follows:
$$\tilde{u}(x) = [w_1 F_1(x), \dots, w_t F_t(x)]. \qquad (5)$$
All feature modules remain frozen so that their knowledge is preserved. Following DER, we then sample a class-balanced dataset from both the current task $\mathcal{D}_t$ and the memory buffer $\mathcal{M}$, ensuring equitable representation across all classes. We re-train a fresh classifier with a temperature-scaled cross-entropy loss while jointly learning the weights $w_1, \dots, w_t$.
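The rebalancing effect of Equation (5) can be sketched directly. Below, the $w_i$ are set by hand to equalize mean feature norms, which is the balancing effect the learned weights are meant to achieve; in the actual method they are learned jointly with the fresh classifier:

```python
import numpy as np

# Stage-2 weighted concatenation of Eq. (5): frozen per-module features
# are rescaled by scalars w_i before the fresh classifier. The w_i here
# are hand-set (not learned) to show the norm-balancing effect.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(30, 4)) * s for s in (3.0, 1.0)]  # biased norms
norms = [np.linalg.norm(f, axis=1).mean() for f in feats]
w = np.array([norms[1] / m for m in norms])     # down-weight the big module
u = np.concatenate([wi * f for wi, f in zip(w, feats)], axis=1)

balanced = [np.linalg.norm(u[:, 4 * i:4 * (i + 1)], axis=1).mean()
            for i in range(2)]
print(np.allclose(balanced[0], balanced[1]))  # True: norms now match
```

Because the modules stay frozen, only the scalars $w_i$ and the classifier carry gradients in this stage, which keeps stage 2 cheap.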
Figure 5 illustrates the impact of our feature adjustment mechanism. Prior to adjustment, the feature norms of earlier tasks are significantly larger than those of later tasks, indicating an imbalance. Our feature adjustment (F.A.) mechanism reduces the norms of earlier tasks while increasing those of later tasks, resulting in a more balanced representation across all tasks and effectively mitigating the imbalance.