Article

Rethinking the Stability–Plasticity Dilemma of Dynamically Expandable Networks

1 Lab of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
2 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
3 China Mobile (HangZhou) Information Technology Co., Ltd., Hangzhou 311121, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(9), 1379; https://doi.org/10.3390/sym17091379
Submission received: 15 July 2025 / Revised: 8 August 2025 / Accepted: 12 August 2025 / Published: 23 August 2025
(This article belongs to the Section Computer)

Abstract

Symmetry and asymmetry between past and future knowledge are at the heart of continual learning. Deep neural networks typically lose the temporal symmetry that would preserve earlier knowledge when the network is trained sequentially, a phenomenon known as catastrophic forgetting. Dynamically expandable networks (DENs) attempt to restore symmetry by allocating a dedicated module—such as a feature extractor or a task token—for every new task while freezing all previously learned modules. Although this strategy yields high average accuracy, we observe a pronounced asymmetry: earlier tasks still degrade over time, indicating that frozen modules alone do not guarantee knowledge conservation. Moreover, feature bias, arising from the imbalance between old and new samples, further exacerbates the forgetting issue. This raises a fundamental challenge: how can multiple feature extractors be coordinated more effectively to mitigate catastrophic forgetting while enabling the robust acquisition of new tasks? To address this challenge, we propose two asymmetric, contrastive auxiliary losses that exploit rich information from previous tasks to guide new task learning. Specifically, our approach integrates features extracted by both frozen and current modules to reinforce task boundaries while facilitating the learning process. In addition, we introduce a feature adjustment mechanism to alleviate the bias caused by class imbalance. Extensive experiments applying our approach to representative DENs, including DyTox and MCG, demonstrate that it reduces catastrophic forgetting and achieves state-of-the-art performance on ImageNet-100.

1. Introduction

Human cognition naturally maintains symmetry between assimilating new information and preserving prior knowledge. In deep models, preserving temporal symmetry means giving equal importance to all tasks throughout training. Sequential training breaks this symmetry, resulting in catastrophic forgetting [1], where the model shifts towards the new task and forgets previous ones. Continual learning aims to balance plasticity (acquiring new skills) and stability (retaining prior knowledge). Yet achieving high accuracy within a single network while respecting this symmetry remains a formidable challenge.
To address this challenge, various approaches have been developed. DualNet [2] employs two feature extractors, mimicking the hippocampus and neocortex to create complementary learning systems [3]. Dynamically expandable networks add a separate, expandable module (a feature extractor or a task token) for each task and freeze these modules when learning subsequent tasks. For instance, DER [4] adds a new feature extractor for each new task and freezes it once the task is learned. This mechanism allows the model to consolidate old knowledge in the corresponding feature extractors, akin to synaptic consolidation in the brain [3]. To facilitate the effective learning of new tasks while distinguishing them from old tasks, DER introduces an auxiliary loss, which has been widely adopted in methods such as DyTox [5], MCG [6], and DNE [7]. DyTox [5] expands task tokens for each task based on transformer architectures, using the auxiliary loss to enhance diversity between task tokens. Task tokens are learnable vectors—one per task—appended only in the final Transformer block.
A crucial question is whether freezing old feature extractors effectively prevents catastrophic forgetting. Our observations reveal that earlier tasks suffer more severe forgetting than later ones. As illustrated in Figure 1, DER shows significantly lower accuracy on the first and second tasks than on subsequent tasks, a pattern also observed with DyTox. Because future classes are invisible to a module while it is being trained, the features that previous modules produce for new-task samples carry little meaningful information. This creates confusion when these features are fed to the classifiers, leading old modules to misclassify samples from new classes.
Moreover, the auxiliary loss forces new modules to collapse all old classes into a single category, which limits their ability to distinguish among old classes. Additionally, the imbalance between old and new classes allows the new-task feature extractor to dominate, biasing the model toward the new classes and causing it to misclassify samples from old classes—a phenomenon we term feature bias. While the knowledge of old classes is consolidated in the corresponding modules, this consolidation does not effectively translate into the new modules’ learning process.
A key challenge is how to effectively coordinate multiple feature extractors and integrate them into a unified model to mitigate catastrophic forgetting and enhance new task learning. By leveraging the consolidated knowledge from previous tasks embedded in old modules, we can enhance the new modules’ ability to better distinguish the features of different classes. We propose asymmetric contrastive losses that not only aid in learning new tasks but also enable new modules to discriminate among the categories of old tasks, thus reducing feature confusion. Our main contributions are as follows:
  • We reveal that simply freezing feature extractors fails to prevent forgetting in dynamically expandable networks—an overlooked limitation in existing research. Through in-depth analysis, we identify key factors shaping the stability–plasticity trade-off, demonstrating that widely used auxiliary losses and feature bias arising from the imbalance between old and new classes are major contributors to catastrophic forgetting.
  • We introduce two novel asymmetric contrastive auxiliary losses, the current feature contrastive loss (CEC loss) and the cross feature contrastive loss (CFC loss), as alternatives to the traditional auxiliary loss. These losses help the model achieve a better balance between plasticity and stability, and our approach integrates seamlessly with models that employ an auxiliary loss. We further propose learning adjustable parameters to mitigate feature bias, ensuring that each module maintains its dominance when handling its own task.
  • Our approach significantly enhances accuracy on three benchmarks compared to baseline methods, achieving state-of-the-art performance on the ImageNet benchmark. Furthermore, it offers substantial benefits to other methods such as DyTox and MCG, despite their differing network structures (ResNet and ViT) and methodologies.

2. Related Works

Continual learning aims to learn new tasks while mitigating catastrophic forgetting of old tasks. Elastic weight consolidation (EWC) [8] introduces a regularization term that limits the modification of parameters crucial for old tasks to prevent catastrophic forgetting. Ref. [9] likewise aims to mitigate catastrophic forgetting by imposing constraints on network parameters.
Knowledge distillation [10] is another widely used technique in incremental learning [11,12,13]. Ref. [14] augments the features of rehearsal samples at each layer for rehearsal-based methods.
Sample imbalance between old and new categories often biases the classifier toward the latter. Several methods [12,15,16,17,18,19] explicitly address this bias. Kim and Han [20] proposed analytical tools for quantifying the stability–plasticity trade-off of learned features.
Single-extractor models often struggle to retain high accuracy under continual learning. AANets [21] and DualNet [2] used two feature extractors to achieve a better stability–plasticity trade-off. However, fully eliminating catastrophic forgetting with only two extractors remains challenging. Dynamically expandable networks allocate a fresh module—such as a feature extractor or task token—for every task while freezing all prior modules [6,22,23,24]. DER [4] dynamically expands the network and employs an auxiliary loss to guide each new extractor toward the current task. DyTox [5] instead appends a task-specific token per task within a Transformer backbone. SEED [25] trains multiple networks in which several tasks share one extractor while discarding old samples. DNE [7] utilizes dense connections between the intermediate layers of task expert networks, employing feature sharing and reusing to facilitate knowledge transfer from old to new tasks. MCG [6] adds a gating network that predicts the task identity and selects the most relevant extractor at inference time. Ref. [26] introduces a data-augmentation scheme shown to alleviate forgetting.
Contrastive learning [27,28,29] formulates pretext tasks on unlabeled data so that a network learns richer representations. Such representations transfer readily to downstream tasks. Supervised contrastive learning [30] extends the self-supervised objective to fully supervised settings. Co2L [31] shows that contrastively learned representations exhibit greater robustness to catastrophic forgetting. Ref. [32] proposes maximizing mutual information online to combat forgetting. Ref. [33] combines supervised and self-supervised contrastive losses during pre-training to improve few-shot class-incremental learning.
Recent work leverages pre-trained models such as Vision Transformers (ViT) [34] and introduces prompt-based strategies for incremental learning [35,36,37,38,39,40,41]. Ref. [42] expands the subspace with an adapter for each new task in pre-trained models.

3. Methodology

3.1. Problem Setup

In class-incremental learning, a model learns a sequence of tasks $\{T_1, \dots, T_N\}$. Each task $T_i$ comes with a dataset $D_i = \{(x_i^j, y_i^j)\}_{j=1}^{N_i}$ containing $N_i$ training examples $x_i^j$ and their labels $y_i^j$. $D_i$ covers $C_i$ classes; because the class sets are disjoint, the total number of classes is $C = \sum_{i=1}^{N} C_i$. After training on a task, a memory buffer retains a small number of samples to help the model mitigate forgetting. The memory buffer $M_t = \{(x_j, y_j)\}_{j=1}^{m}$ contains samples from previous tasks, and the data available in the current training session is $\hat{D}_t = D_t \cup M_t$. Let $F_i$ denote the module added at step $i$; given an input $x$, it outputs a feature vector $r = F_i(x) \in \mathbb{R}^d$, where $d$ is the embedding dimension.
Dynamically expandable networks (DENs). When learning a new task, DENs augment the model with a new module, which can be a feature extractor or a task token. Previous modules are frozen to preserve their learned information, and the features or embeddings produced by each module are fed into a classifier. As a concrete instantiation of the DEN framework, DER adds a new feature extractor $F_t$ for each new task while the previous extractors $F_1, \dots, F_{t-1}$ are frozen. The concatenated features $\mu = [F_1(x), \dots, F_t(x)] = \Phi_t(x)$ are then sent to the task-specific classifier $H_t$ at step $t$. DyTox [5] adds a new task token for each task, and each task-specific embedding from the task attention block is fed to a task-specific classifier. MCG [6] uses DER as its baseline and further incorporates additional feature extractors. For each task, it computes multiple centers so that, during testing, weights are derived by comparing the similarity between a sample and each task center; these weights amplify the contribution of the most relevant feature extractor.
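To make the DEN mechanics concrete, the following PyTorch-style sketch mirrors the description above: each step freezes the existing extractors, appends a new one, and feeds the concatenated feature $\Phi_t(x)$ to a classifier $H_t$. Class and function names (DENModel, add_task, make_extractor) are our own illustration and are not taken from the authors’ code.

```python
# Minimal DER-style DEN sketch (illustrative names, not the authors' code).
import torch
import torch.nn as nn

class DENModel(nn.Module):
    def __init__(self, make_extractor, feat_dim):
        super().__init__()
        self.make_extractor = make_extractor   # factory returning a backbone F_i
        self.feat_dim = feat_dim
        self.extractors = nn.ModuleList()      # F_1, ..., F_t
        self.classifier = None                 # H_t, rebuilt at every step
        self.num_classes = 0

    def add_task(self, num_new_classes):
        # Freeze every previously learned module, then append a trainable one.
        for f in self.extractors:
            for p in f.parameters():
                p.requires_grad_(False)
        self.extractors.append(self.make_extractor())
        self.num_classes += num_new_classes
        # New classifier over the concatenated feature Phi_t(x).
        # (DER extends/reuses the old classifier; a fresh one is built here for brevity.)
        self.classifier = nn.Linear(self.feat_dim * len(self.extractors),
                                    self.num_classes)

    def forward(self, x):
        feats = [f(x) for f in self.extractors]   # [F_1(x), ..., F_t(x)]
        phi = torch.cat(feats, dim=1)             # Phi_t(x)
        return self.classifier(phi)
```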
Auxiliary loss. To encourage the newly added feature extractor $F_t$ to focus on task $T_t$ and to distinguish the new task from old ones, DER trains $F_t$ with an auxiliary loss. An auxiliary classifier $H_t^a$ solves a $(C_t + 1)$-way classification problem on $\hat{D}_t$, treating all previously learned classes as a single class, where $C_t$ is the number of classes in $D_t$. The auxiliary loss improves both the average accuracy and the last step accuracy of DER by at least 2% [4] and is widely used in DyTox and MCG.
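A minimal sketch of this relabeling, assuming integer class labels that grow with each task (all names are illustrative):

```python
# (C_t + 1)-way auxiliary target construction: old classes collapse to label 0,
# new classes are mapped to 1..C_t.
import torch
import torch.nn.functional as F

def auxiliary_loss(aux_logits, targets, num_old_classes):
    """aux_logits: output of the auxiliary head H_t^a, shape (B, C_t + 1).
    targets: original labels; labels < num_old_classes belong to old tasks."""
    aux_targets = torch.where(
        targets < num_old_classes,
        torch.zeros_like(targets),            # every old class -> class 0
        targets - num_old_classes + 1)        # new classes -> 1 .. C_t
    return F.cross_entropy(aux_logits, aux_targets)
```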

3.2. Rethinking the Stability–Plasticity Dilemma

DENs add a fresh module for every incoming task and freeze earlier modules to curb forgetting. However, our findings indicate that DENs still suffer from catastrophic forgetting.
Catastrophic forgetting in early tasks. After training on all tasks, we observe a substantial performance drop for early tasks in the final model configuration for DER. As shown in Figure 1, after completing training on 10 tasks in ImageNet100-B0S10, a significant degradation in performance is evident, with an accuracy gap exceeding 24 % between task T 0 and T 7 . We attribute this primarily to the auxiliary loss’s inadequate utilization of old task information and the feature bias arising from the imbalance between new and old class samples.
Feature bias in DENs. The data imbalance problem arises during the training of a new task, where the number of new-class samples in $T_i$ is far greater than the number of old-class samples in the memory $M_i$. This imbalance biases the model toward classifying inputs as new classes. Given that the modules corresponding to the old tasks are frozen, this dominance stems from the features extracted by the new feature extractor. Figure 2 illustrates this after training 10 tasks on CIFAR100-B0 with DER: for each task $T_i$, the features of its samples extracted by $F_i$ exhibit significantly higher norms than the features of any other task $T_j$ extracted by $F_i$, and higher than the features of $T_i$ extracted by any other extractor $F_j$, where $j \neq i$.
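The statistic behind Figure 2 can be reproduced with a routine of the following shape (the loaders and extractors are assumed to exist; this is a sketch, not the authors’ measurement code):

```python
# Mean L2 feature norm of every (extractor F_i, task T_j) pair.
import torch

@torch.no_grad()
def mean_feature_norms(extractors, task_loaders):
    """Returns norms[i][j] = mean ||F_i(x)|| over test samples of task T_j."""
    norms = [[0.0] * len(task_loaders) for _ in extractors]
    for j, loader in enumerate(task_loaders):
        total, sums = 0, [0.0] * len(extractors)
        for x, _ in loader:
            for i, f in enumerate(extractors):
                sums[i] += f(x).norm(dim=1).sum().item()
            total += x.size(0)
        for i in range(len(extractors)):
            norms[i][j] = sums[i] / total
    return norms
```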
Consequently, the new module F t focuses on the current task, and F t dominates in the model, leading to the catastrophic forgetting problem. Although this specialization is expected, it risks under-utilizing cross-task knowledge. These observations emphasize the critical need for strategies that more effectively balance the training influence between new and old tasks, aiming to mitigate feature bias.
Auxiliary loss hurts stability. Auxiliary loss helps the new module focus on learning the new task. However, while this approach may reduce the complexity of the feature space, it inadvertently increases confusion among old task classes. Specifically, the auxiliary loss fails to effectively enhance intra-class compactness and inter-class separation, leading to a degradation in discriminative ability among classes. Data imbalance amplifies the problem: the new-task module dominates, causing accuracy on earlier tasks to plummet. By design, the auxiliary loss does not directly address the need to maintain robust feature representations for old tasks, thereby undermining the stability of the model’s learning over time.
Auxiliary loss influences plasticity. During training, the auxiliary loss deliberately avoids distinguishing the classes from previous tasks so that the new feature extractor can concentrate on the current task. This blurring of distinctions among old classes can degrade the model’s capacity to recognize the unique characteristics of each class, leading to a diminished ability to effectively adapt and apply learned insights to new classes. Consequently, while the auxiliary loss helps mitigate interference from old tasks, it may compromise the nuanced understanding needed for new tasks, thus affecting plasticity. To quantify this effect, after training task $T_N$ and obtaining $N$ frozen modules $F_1, F_2, \dots, F_N$, we evaluate the plasticity of each module $F_i$ by training a new classifier solely on top of the frozen $F_i$ for classifying its corresponding task $T_i$. This setup constitutes a $C_i$-way classification problem, where $C_i$ is the number of classes in task $T_i$.
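This plasticity probe amounts to linear probing on a frozen module. A hedged sketch is given below; hyperparameters such as the epoch count and learning rate are illustrative, not the paper’s exact values.

```python
# Train a fresh C_i-way classifier on top of the frozen module F_i.
import torch
import torch.nn as nn
import torch.nn.functional as F

def probe_plasticity(frozen_module, loader, feat_dim, num_classes,
                     epochs=30, lr=0.1):
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    frozen_module.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():            # the module stays frozen
                feat = frozen_module(x)
            loss = F.cross_entropy(head(feat), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head                              # evaluate it on the T_i test set
```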
As depicted in Figure 3a, for the initial 10 classes of CIFAR100, the classification accuracy of the two methods is comparable. However, as the number of previously learned classes grows, the auxiliary loss begins to negatively impact the ability of module F i to effectively learn its corresponding task T i , when compared to CEC loss. Figure 3b illustrates that on ImageNet-1000, CEC loss consistently outperforms auxiliary loss.
These observations suggest that, while auxiliary loss facilitates easier learning for new tasks, it does so at the cost of impairing the model’s ability to accurately recall information from previously learned tasks, and affecting the new module’s plasticity. To mitigate these issues, it is crucial to develop strategies that not only balance the learning between new and old tasks but also fully leverage the existing knowledge from old tasks to support new task learning. This approach will enhance feature separation and compactness across all classes, thereby reducing forgetting.

3.3. Asymmetric Contrastive Auxiliary Loss

We therefore propose two asymmetric contrastive auxiliary losses for DENs: the current feature contrastive loss (CEC loss) and the cross-feature contrastive loss (CFC loss). These losses aim to enhance the network’s capability to retain knowledge from previously learned tasks while effectively learning new tasks.
As illustrated in Figure 4, when the model undertakes a new task $T_t$ at step $t$, a corresponding new module $F_t$ is introduced. Previous modules $\{F_1, \dots, F_{t-1}\}$ are frozen to prevent interference from subsequent learning processes.
Current feature contrastive loss (CEC). The CEC loss is engineered to refine the representation of the current task by leveraging both new and old task samples, thereby enhancing intra-class compactness and inter-class separation within the new module. This not only improves the plasticity of the module with respect to new tasks but also equips it to better differentiate between classes from earlier tasks.
When training on a new task $T_t$, the new module processes a batch of $N$ samples drawn from the current task together with samples from the memory buffer $M_t$. Two data augmentations are applied to each sample, producing a total of $2N$ examples. The CEC loss trains the new module $F_t$ to discriminate between the current task and the samples from $M_t$. The features generated by $F_t(x)$ are transformed into a projected vector $z = Proj(F_t(x))$ by a two-layer perceptron projector $Proj$. The CEC loss is as follows:
$$\mathcal{L}_{CEC} = \sum_{i=1}^{2N} \frac{-1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1,\, k \neq i}^{2N} \exp(z_i \cdot z_k / \tau)}, \quad (1)$$
where $P(i) = \{ j \in \{1, \dots, 2N\} : j \neq i \wedge y_j = y_i \}$ denotes the set of indices of samples sharing the class of $x_i$, and $\tau$ is a positive scalar temperature parameter. By using Equation (1), $F_t$ learns to distinguish each class from both the old and new tasks. The old samples drawn from $M_t$ serve as anchors that help $F_t$ learn to separate them from other classes, which is particularly beneficial for mitigating the forgetting of old classes.
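Equation (1) follows the supervised contrastive formulation of [30]. A sketch of its computation over the $2N$ projected views is given below (averaging over anchors rather than summing; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def cec_loss(z, labels, tau=0.1):
    """z: (2N, d) projected features of the two augmented views.
    labels: (2N,) class labels aligned with z."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                 # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))       # exclude k = i
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(1).clamp(min=1)              # |P(i)|
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count)
    return per_anchor.mean()
```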
Cross-feature contrastive loss (CFC). During the training of old tasks, the current task $T_t$ is not visible to the modules $F_{old} = \{F_1, \dots, F_{t-1}\}$, which were trained on previous tasks. As a result, $F_{old}$ struggle to differentiate the features related to $T_t$. Features of $T_t$ derived from $F_{old}$ are ineffectual, offering no aid in learning $T_t$. However, as $F_{old}$ have accumulated valuable knowledge from past tasks, this existing knowledge can be harnessed to enhance the learning capabilities of $F_t$. The cross-feature contrastive loss (CFC) leverages features from old tasks preserved in $F_{old}$ to aid $F_t$ in effectively distinguishing between old and new tasks, thus helping to establish robust decision boundaries. This will enhance the model’s generalization capabilities across tasks.
The operational mechanism of the CFC loss involves projecting features from both old and new tasks into a shared feature space where they can be directly compared. Features of old-task samples stored in the memory buffer $M_t$, extracted by $F_{old}$ and passed through the shared projector $Proj$, form the set $Q_{t\_pre} = \{ Proj(F_i(x_j)) \mid F_i \in F_{old},\ x_j \in M_t \}$. Similarly, the features produced by the new module $F_t$ for all available samples are projected and denoted by $Q_{t\_cur} = \{ Proj(F_t(x_j)) \mid x_j \in (M_t \cup T_t) \}$. The combined set $Q_t = Q_{t\_cur} \cup Q_{t\_pre}$ serves as the basis for the contrastive learning process:
$$\mathcal{L}_{CFC} = \sum_{\substack{i=1 \\ x_i \in T_t}}^{2N} \frac{-1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{z_k \in Q_t,\, z_k \neq z_i} \exp(z_i \cdot z_k / \tau)}.$$
The projector is shared by CEC loss and CFC loss, and will be discarded once the specific task training is complete.
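A hedged sketch of the CFC computation follows: anchors are the current-task projections from $F_t$, while the contrast set $Q_t$ additionally contains memory samples projected through the frozen modules. Tensor and argument names are our own, and we assume that positives are drawn from the current-feature set $Q_{t\_cur}$.

```python
import torch
import torch.nn.functional as F

def cfc_loss(z_cur, y_cur, is_new_task, z_pre, tau=0.1):
    """z_cur: (2N, d) projections of F_t features for the current batch.
    y_cur: (2N,) labels; is_new_task: (2N,) bool mask of samples from T_t.
    z_pre: (M, d) projections of frozen-module features of memory samples."""
    z_cur = F.normalize(z_cur, dim=1)
    z_pre = F.normalize(z_pre, dim=1)
    z_all = torch.cat([z_cur, z_pre], dim=0)              # the contrast set Q_t
    sim = z_cur @ z_all.t() / tau                         # (2N, 2N + M)
    n = z_cur.size(0)
    self_mask = torch.zeros_like(sim, dtype=torch.bool)
    self_mask[:, :n] = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))       # drop z_k = z_i
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (y_cur.unsqueeze(1) == y_cur.unsqueeze(0)) & ~self_mask[:, :n]
    pos_count = pos_mask.sum(1).clamp(min=1)              # |P(i)|
    per_anchor = -(log_prob[:, :n].masked_fill(~pos_mask, 0.0).sum(1) / pos_count)
    return per_anchor[is_new_task].mean()                 # anchors x_i in T_t only
```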
Total loss for learning new modules. Following DER, we concatenate the features from all modules and send the concatenated features to the classifier (for DyTox, following its original design, we instead feed the task embeddings to the corresponding classifiers). The concatenated features are denoted by $\Phi_t(x) = [F_1(x), \dots, F_t(x)]$ and the classifier by $H_t$. The cross-entropy loss is applied to the current task data $D_t$ and the memory buffer $M_t$ as follows:
$$\mathcal{L}_{CE} = \sum_{i=1}^{|M_t| + |D_t|} -\log \big( H_t(\Phi_t(x_i)) \big)_{y_i}.$$
The overall loss for training F t is
$$\mathcal{L}(x, y) = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{CEC} + \lambda_2 \mathcal{L}_{CFC},$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters, and $\lambda_2 = 0$ when the model learns the first task.
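Putting the pieces together, the overall objective can be composed as below, reusing the cec_loss and cfc_loss sketches above; the $\lambda$ values follow the sensitivity study in Section 4.7 (0.9 and 0.7), and the remaining names are assumptions.

```python
import torch.nn.functional as F

def total_loss(logits, targets, z, labels, is_new, z_pre,
               task_id, lam1=0.9, lam2=0.7):
    # z: (2N, d) projections of F_t features; z_pre: frozen-module projections.
    loss = F.cross_entropy(logits, targets)        # L_CE on D_t and M_t
    loss = loss + lam1 * cec_loss(z, labels)
    if task_id > 0:                                # lambda_2 = 0 on the first task
        loss = loss + lam2 * cfc_loss(z, labels, is_new, z_pre)
    return loss
```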

3.4. Feature Adjustment

After obtaining the modules $F_1, \dots, F_t$ in stage 1, we address the feature bias problem in stage 2. Due to data imbalance, both the module and the classifier tend to be biased toward the new task. The frozen modules $F_{old}$ store the knowledge of their corresponding previous tasks, and fine-tuning the feature extractors on a balanced dataset would disturb this consolidated knowledge and hurt the model’s performance. To address the feature bias problem, we instead introduce a learnable parameter $\alpha_i$ for each module $F_i$, optimized to produce a balanced representation across tasks; these parameters determine suitable weights for each module over the whole label space. We then concatenate the weighted features $\alpha_i \cdot F_i(x)$ as follows:
$$F_{balance} = \mathrm{concatenate}\big( \{ \alpha_i \cdot F_i(x) \}_{i=1}^{t} \big).$$
All feature modules remain frozen so that their knowledge is preserved. Following DER, we then sample a class-balanced dataset from the current task $T_t$ and the memory buffer $M_t$, ensuring equitable representation across all classes. On this balanced data we re-train a fresh classifier with a temperature-scaled cross-entropy loss (temperature $\delta$) while jointly learning the $\alpha_i$.
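A sketch of this stage-2 adjustment is shown below, with one learnable scalar $\alpha_i$ per frozen module and a fresh classifier trained on temperature-scaled logits; the class name, the temperature handling, and the optimizer example are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureAdjust(nn.Module):
    def __init__(self, frozen_extractors, feat_dim, num_classes, delta=5.0):
        super().__init__()
        self.extractors = nn.ModuleList(frozen_extractors)  # kept frozen
        self.alpha = nn.Parameter(torch.ones(len(frozen_extractors)))
        self.classifier = nn.Linear(feat_dim * len(frozen_extractors),
                                    num_classes)
        self.delta = delta                                   # CE temperature

    def forward(self, x):
        with torch.no_grad():                                # extractors stay fixed
            feats = [f(x) for f in self.extractors]
        weighted = [a * f for a, f in zip(self.alpha, feats)]
        logits = self.classifier(torch.cat(weighted, dim=1))
        return logits / self.delta                           # temperature-scaled CE

# During stage 2 only alpha and the classifier receive gradients, e.g.
# torch.optim.SGD([module.alpha] + list(module.classifier.parameters()), lr=0.1)
```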
Figure 5 illustrates the impact of our feature adjustment mechanism. Prior to adjustment, the feature norms of earlier tasks are significantly larger than those of later tasks, indicating an imbalance. Our feature adjustment (F.A.) mechanism reduces the norms of earlier tasks while increasing those of later tasks, resulting in a more balanced representation across all tasks and effectively mitigating the imbalance.

4. Experiments

This section evaluates the proposed method on three widely used benchmarks—CIFAR-100, ImageNet-100, and ImageNet-1000—under multiple class-incremental settings.

4.1. Setup

Datasets. ImageNet-1000 consists of 1000 classes with approximately 1.2 million training images and 50,000 validation images. ImageNet-100 is the 100-class subset. CIFAR100 includes 32 × 32 pixel color images across 100 classes, with 50,000 training images and 10,000 evaluation images.
Benchmarks. We follow standard splits on ImageNet-100, ImageNet-1000, and CIFAR-100. For all settings, we use the same class order as DER for ImageNet-1000 for fairness.
(1) On ImageNet100-B0, the first task includes 10 classes and each subsequent task adds 10 classes, giving 10 tasks in total; the memory size is 2000 exemplars.
(2) On ImageNet100-B50, the first task has 50 classes and each new task adds 5 classes, giving 11 tasks in total; 20 samples are preserved per class.
(3) On ImageNet1000-B0, the first task has 100 classes and each subsequent task adds 100 classes, giving 10 tasks in total; the memory size is 20,000 exemplars.
For CIFAR100, the first task contains 5, 10, or 20 classes and each subsequent task adds 5, 10, or 20 classes, giving 20, 10, or 5 tasks, respectively; the memory size is 2000 exemplars in these settings.
In all experiments, we use the herding selection strategy, following DER, to select the samples stored in $M$.
Implementation Details. Our implementation is based on PyTorch 1.10.2. Following DER, we adopt the same network structure for the feature extractor $F_t$: a modified ResNet-18 for CIFAR100 and the standard ResNet-18 for ImageNet-100 and ImageNet-1000. We use SGD as the optimizer with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.0005. The batch size is 128 for CIFAR100 and 256 for ImageNet-100 and ImageNet-1000. Following [30], the temperature $\tau$ is fixed at 0.1.
The projector head follows the configuration of [30], with an output size of 128 in all experiments. After each task is completed, the projector is discarded, so no projector is needed during testing. For the temperature parameter $\delta$, we fully adhere to the settings used in DER [4], setting $\delta = 5$ for CIFAR-100 and $\delta = 1$ for ImageNet. Compared with DER, we add only as many extra learnable variables as there are feature extractors (one scalar $\alpha_i$ per extractor).
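For reference, a two-layer projection head with a 128-dimensional output can be written as follows; the hidden width is an assumption, as the paper specifies only the output size and that the head follows [30].

```python
import torch.nn as nn

def make_projector(in_dim, hidden_dim=512, out_dim=128):
    # Two-layer MLP projector; discarded after each task and unused at test time.
    return nn.Sequential(nn.Linear(in_dim, hidden_dim),
                         nn.ReLU(inplace=True),
                         nn.Linear(hidden_dim, out_dim))
```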
Evaluation metrics. We report last-step and average accuracy. For ImageNet-100 and ImageNet-1000, we evaluate the top-1 and top-5 accuracy. Following the methodologies used by DER and DyTox, we calculated several key performance metrics: backward transfer (BWT), forward transfer (FWT), average accuracy, last step accuracy, and forgetting.
$A_i$ denotes the mean accuracy over all classes learned up to and including task $T_i$, and $A_{T_j}^{i}$ is defined as the accuracy on task $T_j$ after the completion of task $T_i$.
FWT. Following GEM [43] and DER [4], we calculate the FWT as follows:
$$FWT = \frac{1}{T-1} \sum_{i=2}^{T} \big( A_{T_i}^{i} - \hat{A}_{T_i}^{i} \big),$$
where $\hat{A}_{T_i}^{i}$ is the test accuracy on $T_i$ of a model trained on the available data with only the cross-entropy loss from random initialization.
BWT. Following GEM [43] and DER [4], we calculate the BWT as follows:
$$BWT = \frac{1}{T-1} \sum_{i=2}^{T} \frac{1}{i} \sum_{j=1}^{i} \big( A_{T_j}^{i} - A_{T_j}^{j} \big).$$
Avg. The average accuracy is the mean of the step accuracies $A_i$ after finishing the last task $T$:
$$Avg = \frac{1}{T} \sum_{i=1}^{T} A_i .$$
Last. Last step accuracy is defined as the accuracy measured on the entire test set after all training on the final task has been completed.
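For clarity, these metrics can be computed from an accuracy matrix acc, where acc[i][j] is the accuracy on task $T_j$ after finishing task $T_i$ (0-indexed below) and rand_acc[i] plays the role of $\hat{A}$. This is an illustrative sketch of the formulas above, not the evaluation code of [4,43].

```python
def avg_accuracy(step_acc):                 # step_acc[i] = A_{i+1}
    return sum(step_acc) / len(step_acc)

def bwt(acc):                               # acc[i][j] = A_{T_{j+1}}^{i+1}
    T = len(acc)
    total = 0.0
    for i in range(1, T):
        total += sum(acc[i][j] - acc[j][j] for j in range(i + 1)) / (i + 1)
    return total / (T - 1)

def fwt(acc, rand_acc):                     # rand_acc[i] = accuracy trained from scratch
    T = len(acc)
    return sum(acc[i][i] - rand_acc[i] for i in range(1, T)) / (T - 1)
```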
Comparative methods. We compare our method with iCaRL [11], BiC [16], WA [17], DER [4], DyTox [5], FOSTER [22], BEEF [23], MCG [6].

4.2. Evaluation on ImageNet-100

Table 1 (top) shows the test accuracy on ImageNet-100-B0. Compared to the baseline method DER, our method improves the last step accuracy from 66.70% to 72.36% (+5.66%), and the average accuracy rises from 77.18% to 80.61% (+3.43%). Compared with the SOTA method MCG [6], our model has 11.2 million fewer parameters while its last-step top-1 accuracy is 0.84% higher. For top-5 accuracy, our method achieves 0.25% higher average accuracy and 0.64% higher last-step accuracy than the SOTA method. Table 1 (bottom) shows that, on ImageNet-100-B50, our method outperforms DER by 2.15% and MCG by 0.52% in average accuracy. Figure 6 illustrates the accuracy at each step on ImageNet-100.
As shown in Figure 7, we use t-SNE [44] to visualize the feature distributions. The left plot presents the results of the baseline DER method, where features of old task classes appear scattered and less compact, indicating weaker retention of old task representations. In contrast, the right plot highlights the results of our proposed method, where features form more distinct clusters, demonstrating the better preservation of old task boundaries and improved feature separability.

4.3. Evaluation on ImageNet-1000

Table 1 (top) also shows the test accuracy on ImageNet-1000-B0, where our method achieves state-of-the-art performance. Compared with the baseline method DER, our method improves the last step accuracy from 60.16% to 64.27% (+4.11%) and the average accuracy from 68.84% to 72.19% (+3.35%). We use 11.2 million fewer parameters than the SOTA method, yet outperform it by about 1.93% in last-step top-1 accuracy and by about 1.39% in average top-1 accuracy. For last-step top-5 accuracy, our method is 2.09% higher than the SOTA method.

4.4. Evaluation on CIFAR100

Table 2 presents the results on CIFAR-100. Compared with the baseline method DER, our method demonstrates improved accuracy across three settings. In the CIFAR-100-B0 10-step setting, the last-step accuracy increases from 65.22 % to 68.13 ± 0.28 % . For the 20-step setting, our method boosts the last-step accuracy from 62.48 % to 64.37 ± 0.42 % . Our method also enhances the average accuracy from 74.09 ± 0.33 % to 76.02 ± 1.27 % . In the 5-step setting, our method improves the accuracy from 76.80 ± 0.79 % to 78.4 ± 0.99 % . Compared with state-of-the-art methods such as MCG, our approach utilizes approximately 11.2 million fewer parameters while achieving competitive results. Figure 8 illustrates the accuracy of each step on CIFAR-100.
RandAugment. Following the approach of MCG [6], we employ RandAugment, as depicted in Table 3, Table 4 and Table 5. On CIFAR100-B0S10, our method achieves a last-step accuracy that is higher by 0.73 % compared to MCG. By implementing RandAugment, the last-step accuracy increases from 68.34 % to 69.93 % . Table 5 presents the results on CIFAR100-B50, where the first task includes 50 classes, and each subsequent task adds 25 and 10 classes, respectively. Utilizing RandAugment, we achieve higher accuracy on CIFAR100 than MCG.

4.5. Ablation Study

We conduct an ablation study on ImageNet100-B0S10 to analyze the effect of each component, as shown in Table 6. Using only $\mathcal{L}_{CEC}$ improves the last step accuracy from 67.6% to 70.30% (+2.7%). Feature adjustment further enhances the accuracy to 71.44% (+1.14%). With the assistance of $\mathcal{L}_{CEC}$ and $\mathcal{L}_{CFC}$, the last step accuracy improves to 72.2% (+4.6%). Feature adjustment is beneficial for DER, improving the last step accuracy from 67.6% to 69.16%.
To further elucidate the impact of our method on plasticity and stability, we calculated the backward transfer (BWT) and forward transfer (FWT) for representation [4]. Our results, presented in Table 7, demonstrate that our method significantly enhances the model’s ability to resist forgetting: CEC contributes positively to both plasticity and stability, whereas CFC markedly improves stability by effectively distinguishing between the features of new and old tasks, though it may slightly constrain plasticity by rigidly separating task features. Our method achieves a better overall performance by balancing plasticity and stability.

4.6. Effects of Auxiliary Loss

To assess the effects of auxiliary losses on the interaction between new and old tasks, we compare the performance of DER using auxiliary loss and our proposed losses in predicting task ID, as illustrated in Figure 9. The comparison reveals that our proposed losses significantly enhance the ability to maintain task boundaries and reduce detrimental impacts on previously learned tasks. Figure 10 shows that, while our losses start slightly worse, they progressively reduce interference with prior tasks.

4.7. Sensitive Study of Hyperparameters

We study the sensitivity to λ 1 and λ 2 , detailed results of which appear in Table 8 and Table 9. We use values of 0.9 and 0.7, respectively.

4.8. Computational and Parameter Overhead

In DER, a separate feature extractor is added for each task, making feature extraction the dominant runtime cost (30 ms per iteration) and incurring an additional 25 ms for each new extractor. In contrast, computing CEC and CFC adds merely 1.1 ms per training iteration. Both losses are disabled at inference, so runtime speed is unaffected. The feature-adjustment step in DER’s second stage merely scales each extractor’s output by a learned scalar, and our lightweight projector is instantiated only for the newly added extractor and removed after training, resulting in negligible parameter overhead.

4.9. Memory Size Setting

Table 10 reports results with memory budgets of 500, 1000 and 2000 exemplars. As expected, larger buffers yield higher accuracy.

4.10. Evaluation on DyTox

DyTox proposes a transformer architecture-based model with shared encoders and decoders across all tasks. A task token is added for learning each corresponding task. DyTox applies task attention in the final Transformer block only. Each task-specific embedding from the task attention block is fed to task-specific classifiers. DyTox uses cross-entropy loss, auxiliary loss, and additionally, a distillation loss.
Implementation on DyTox. For DyTox, we only replaced the auxiliary loss used in DyTox with our proposed contrastive loss. Additionally, we introduced a projector during training, which is discarded after training. We use the task tokens from the last task attention block to calculate the CEC and CFC loss.
Results on DyTox. We evaluated our contrastive losses on DyTox, whose authors have corrected the results on their official GitHub repository regarding the use of distributed memory. For consistency and fairness, all experiments detailed in Table 11 were conducted under the distributed memory setting using two 3090 Ti GPUs. Because DyTox uses task-specific classifiers, feature adjustment is not utilized.
As shown in Table 11, with the help of CEC and CFC loss, the average accuracy on CIFAR100 improves from 71.20 % to 73.21 % , and the last step accuracy improves from 57.3 % to 60.33 % .
Although DER (CNN backbone) and DyTox (ViT backbone + task tokens) introduce different per-task modules, our CEC and CFC losses improve both models, confirming that the two losses are architecture-agnostic. By injecting old-task information during new-task training, CEC and CFC sharpen the separation between task-specific modules while preserving the stability of previously learned ones.
Task tokens distance. DyTox reported that the auxiliary loss enhances the diversity among task tokens by increasing the minimal Euclidean distance between them by 8%. Following DyTox, we compared the task-token distances obtained with the different losses. As shown in Table 11, the minimal Euclidean distance increases by 7.3% with CEC compared to the distance obtained with the auxiliary loss. This finding underscores that our loss function offers enhanced capability in differentiating between task tokens, suggesting potential improvements in task-specific classifier performance.
We compared the accuracy of DyTox at each step using different loss functions on CIFAR-100B0S10, as illustrated in Figure 11. During the learning of the first 10 classes, the original DyTox method (using auxiliary loss) demonstrates accuracy comparable to our methods. However, as more tasks are learned, our methods show a clear accuracy advantage. The DyTox with CFC method maintains similar accuracy to the original DyTox in the early stages, but outperforms it as the number of tasks increases. CEC loss, in contrast, rapidly demonstrates a marked improvement in accuracy right from the early stages.
As shown in Table 12, with the help of our losses, the average accuracy on ImageNet100-B0S10 improves from 73.22% to 74.80%, and the last step accuracy improves from 60.82% to 63.62%.
Forgetting and BWT. Following DyTox, we calculate the forgetting and the backward transfer for representation, as shown in Table 11. Compared to CEC, CFC has a more pronounced effect in improving BWT and reducing forgetting; this phenomenon is consistent in both the DER and DyTox frameworks. CFC thus exhibits a stronger ability to resist forgetting than CEC.

4.11. Evaluation on MCG

Results on MCG. MCG [6] builds upon DER by assigning a new feature extractor for each task and freezing those from previous tasks to preserve knowledge. For each task, multiple centers are computed based on the feature distribution of its samples. A dedicated feature extractor is used to extract global features, which are then matched with the task-specific centers to compute dynamic weights. These weights adjust the contribution of each feature extractor, ensuring that the most relevant one plays a dominant role while minimizing the influence of less relevant ones.
We apply our contrastive loss to the MCG model [6]. We only replaced the auxiliary loss used in MCG with our proposed contrastive loss. As shown in Table 3, our method improves the average accuracy on CIFAR-100 from 77.69 % to 78.84 % , and the last-step accuracy increases from 67.57 % to 69.4 % . Similarly, as shown in Table 13 on ImageNet100-B0S10, our approach raises the average accuracy from 80.46 % to 81.07 % , and the last-step accuracy improves from 71.52 % to 72.1 % .

5. Conclusions

We analyze the factors that govern stability–plasticity in dynamically expandable networks. Although auxiliary loss assists new modules in learning new tasks, it adversely affects both plasticity and stability. We address the stability–plasticity trade-off in dynamically expandable networks by replacing the standard auxiliary loss with an asymmetric contrastive auxiliary loss and by introducing learnable feature-weight adjustment to counter sample-induced feature bias, together yielding consistent accuracy gains on DER, DyTox, and MCG.

Author Contributions

Conceptualization, M.D. and R.L.; methodology, M.D.; software, M.D.; validation, M.D. and R.L.; formal analysis, M.D.; investigation, M.D.; resources, M.D.; data curation, M.D.; writing—original draft preparation, M.D.; writing—review and editing, R.L.; visualization, M.D.; supervision, M.D.; project administration, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available. CIFAR-100 can be accessed at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 7 August 2025) and ImageNet can be accessed at https://www.image-net.org/ (accessed on 7 August 2025).

Acknowledgments

The authors would like to thank our supervisor and colleagues for their valuable guidance and support throughout this work.

Conflicts of Interest

Author Rui Li was employed by the company “China Mobile (Hangzhou) Information Technology Co., Ltd.”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. McCloskey, M.; Cohen, N. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation; Elsevier: Amsterdam, The Netherlands, 1989; Volume 24, pp. 109–165. [Google Scholar]
  2. Pham, Q.; Liu, C.; Hoi, S. DualNet: Continual Learning, Fast and Slow. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–14 December 2021; pp. 16131–16144. [Google Scholar]
  3. McClelland, J.; McNaughton, B.; O’Reilly, R. Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights from the Successes and Failures of Connectionist Models of Learning and Memory. Psychol. Rev. 1995, 102, 419–457. [Google Scholar] [CrossRef] [PubMed]
  4. Yan, S.; Xie, J.; He, X. DER: Dynamically Expandable Representation for Class Incremental Learning. In Proceedings of the 34th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3014–3023. [Google Scholar]
  5. Douillard, A.; Ramé, A.; Couairon, G.; Cord, M. DyTox: Transformers for Continual Learning with Dynamic Token Expansion. In Proceedings of the 35th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 9285–9295. [Google Scholar]
  6. Cai, T.; Zhang, Z.; Tan, X.; Qu, Y.; Jiang, G.; Wang, C.; Xie, Y. Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7298–7307. [Google Scholar]
  7. Hu, Z.; Li, Y.; Lyu, J.; Gao, D.; Vasconcelos, N. Dense Network Expansion for Class Incremental Learning. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 11858–11867. [Google Scholar]
  8. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming Catastrophic Forgetting in Neural Networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  9. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory Aware Synapses: Learning What (Not) to Forget. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 139–154. [Google Scholar]
  10. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  11. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010. [Google Scholar]
  12. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Virtual Event, 23–28 August 2020; pp. 86–102. [Google Scholar]
  13. Yu, L.; Twardowski, B.; Liu, X.; Herranz, L.; Wang, K.; Cheng, Y.; Jui, S.; van de Weijer, J. Semantic Drift Compensation for Class-Incremental Learning. In Proceedings of the 33rd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 6982–6991. [Google Scholar]
  14. Zheng, B.; Zhou, D.W.; Ye, H.J.; Zhan, D.C. Multi-Layer Rehearsal Feature Augmentation for Class-Incremental Learning. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  15. Hou, S.; Pan, X.; Loy, C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 831–839. [Google Scholar]
  16. Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; Fu, Y. Large Scale Incremental Learning. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 374–382. [Google Scholar]
  17. Zhao, B.; Xiao, X.; Gan, G.; Zhang, B.; Xia, S.T. Maintaining Discrimination and Fairness in Class Incremental Learning. In Proceedings of the 33rd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 13208–13217. [Google Scholar]
  18. Boschini, M.; Bonicelli, L.; Buzzega, P.; Porrello, A.; Calderara, S. Class-Incremental Continual Learning into the Extended Der-Verse. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5497–5512. [Google Scholar] [CrossRef] [PubMed]
  19. He, J. Gradient Reweighting: Towards Imbalanced Class-Incremental Learning. In Proceedings of the 41st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16668–16677.
  20. Kim, D.; Han, B. On the Stability-Plasticity Dilemma of Class-Incremental Learning. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 20196–20204. [Google Scholar]
  21. Liu, Y.; Schiele, B.; Sun, Q. Adaptive Aggregation Networks for Class-Incremental Learning. In Proceedings of the 34th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 2544–2553. [Google Scholar]
  22. Wang, F.Y.; Zhou, D.W.; Ye, H.J.; Zhan, D.C. FOSTER: Feature Boosting and Compression for Class-Incremental Learning. In Proceedings of the 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 398–414. [Google Scholar]
  23. Wang, F.Y.; Zhou, D.W.; Liu, L.; Ye, H.J.; Bian, Y.; Zhan, D.C.; Zhao, P. BEEF: Bi-Compatible Class-Incremental Learning via Energy-Based Expansion and Fusion. In Proceedings of the The 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  24. Chen, X.; Chang, X. Dynamic Residual Classifier for Class Incremental Learning. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 18743–18752. [Google Scholar]
  25. Rypeść, G.; Cygert, S.; Khan, V.; Trzcinski, T.; Zieliński, B.; Twardowski, B. Divide and Not Forget: Ensemble of Selectively Trained Experts in Continual Learning. In Proceedings of the The 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  26. Ferdinand, Q.; Clement, B.; Papadakis, P.; Oliveau, Q.; Le Chenadec, G. Feature expansion and enhanced compression for class incremental learning. Neurocomputing 2025, 634, 129807. [Google Scholar] [CrossRef]
  27. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 33rd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
  28. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–12 December 2020; pp. 21271–21284. [Google Scholar]
  29. Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G. Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv 2020, arXiv:2006.10029. [Google Scholar]
  30. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–12 December 2020; pp. 18661–18673. [Google Scholar]
  31. Cha, H.; Lee, J.; Shin, J. Co2L: Contrastive Continual Learning. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event, 11–17 October 2021; pp. 9516–9525. [Google Scholar]
  32. Guo, Y.; Liu, B.; Zhao, D. Online Continual Learning Through Mutual Information Maximization. In Proceedings of the 39th International Conference on Machine Learning (ICML). PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 8109–8126. [Google Scholar]
  33. Ahmed, N.; Kukleva, A.; Schiele, B. OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning. In Proceedings of the 41st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  35. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to Prompt for Continual Learning. In Proceedings of the 35th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 139–149. [Google Scholar]
  36. Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning. In Proceedings of the 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 631–648. [Google Scholar]
  37. Wang, Y.; Huang, Z.; Hong, X. S-Prompts Learning with Pre-Trained Transformers: An Occam’s Razor for Domain Incremental Learning. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  38. Smith, J.; Karlinsky, L.; Gutta, V.; Cascante-Bonilla, P.; Kim, D.; Arbelle, A.; Panda, R.; Feris, R.; Kira, Z. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 11909–11919. [Google Scholar]
  39. Wang, L.; Xie, J.; Zhang, X.; Huang, M.; Su, H.; Zhu, J. Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-Optimality. In Proceedings of the 37th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  40. Wang, R.; Duan, X.; Kang, G.; Liu, J.; Lin, S.; Xu, S.; Lü, J.; Zhang, B. AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 3654–3663. [Google Scholar]
  41. Lu, H.Y.; Lin, L.K.; Fan, C.; Wang, C.; Fang, W.; Wu, X.J. Knowledge-guided prompt-based continual learning: Aligning task-prompts through contrastive hard negatives. Knowl.-Based Syst. 2025, 310, 113009. [Google Scholar] [CrossRef]
  42. Zhou, D.W.; Sun, H.L.; Ye, H.J.; Zhan, D.C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In Proceedings of the 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  43. Lopez-Paz, D.; Ranzato, M. Gradient Episodic Memory for Continual Learning. arXiv 2017, arXiv:1706.08840. [Google Scholar]
  44. Van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Model performance on each task with different auxiliary loss after finishing training the last task on ImageNet100-B0 of 10 steps. We compare the accuracy of the two proposed contrastive auxiliary losses (CEC and CFC) and DER. The x axis shows the class ID for each task.
Figure 2. Task-specific feature norm bias. The plot reports the mean feature norm for each task T i when its samples are fed through every DER feature extractor F j on CIFAR-100 B0S10. The x axis lists the class indices belonging to each task; the y axis shows the mean feature norm.
Figure 3. Test accuracy of each task T i . When finishing learning N tasks (DER means training with auxiliary loss, CEC means using our proposed CEC loss to replace the auxiliary loss used in DER for training the new feature extractor), we obtain N feature extractors corresponding to N tasks. Subsequently, we use the frozen feature extractor F i and the corresponding task data D i to train a new classifier H i (the number of categories is C i ). We use the frozen feature extractor F i and the new classifier H i to test T i .
Figure 4. Overview of our asymmetric contrastive auxiliary loss and feature adjustment. This figure illustrates the two-stage process when the model encounters a new task T t . A new module (feature extractor or task token) F t is added to the model. We use the cross feature contrastive loss (CFC) and current feature contrastive loss (CEC) to help F t learn more discriminative representations (Stage 1). Both CEC and CFC can be used with feature extractors or task tokens. After finishing the training of the feature extractors, we introduce adjustable parameters for each feature extractor to fine-tune the weight of each feature (Stage 2).
Figure 5. Average feature norm of each task. The green line shows the original feature norms; the red line shows the feature norms after feature adjustment.
Figure 6. Comparison of Top-5 accuracy at each step with other methods on ImageNet100-B0S10.
Figure 7. t-SNE [44] visualization of the test set for ImageNet100B0S10 using the first task for illustration. The star symbols denote the prototypes of each class. The left plot presents the visualization of the baseline DER method, while the right plot showcases the results achieved with our proposed method.
Figure 8. Comparison of accuracy at each step with other methods on CIFAR100-B50S5.
Figure 9. Predicting task ID. Model performance in predicting the task ID with different auxiliary losses for DER after training the last task on ImageNet100-B0 with 10 steps. “Ours” refers to training DER with the CEC and CFC losses. The abscissa represents the class numbers of each task.
Figure 10. Predicting task ID. Model performance in predicting the task ID of the third task with different auxiliary losses, evaluated after training each task on ImageNet100-B0 from step three to the last step. We compare the accuracy of the contrastive auxiliary losses against DER. “Ours” refers to training the model with the CEC and CFC losses. The abscissa represents the current task step.
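For Figures 9 and 10, a task-ID prediction can be read out of a class-incremental model by, for example, taking the top-1 class prediction and mapping it to the task that owns that class. The helper below illustrates this under the assumption of an equal number of classes per task; it is not necessarily the exact protocol used in these experiments.

```python
import torch

def predict_task_id(logits, classes_per_task):
    """One plausible way to obtain a task-ID prediction from a class-IL model:
    take the argmax class and map it back to the task that owns that class.
    Illustrative assumption only, not necessarily the paper's protocol."""
    pred_class = logits.argmax(dim=1)          # (batch,) predicted class indices
    return pred_class // classes_per_task      # task index per sample
```

On ImageNet100-B0 with 10 steps, classes_per_task would be 10, so a predicted class index of 37 maps to task 3.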
Figure 11. Accuracy on CIFAR100B0S10 of DyTox with different auxiliary losses. DyTox_Aux denotes the default DyTox method employing the auxiliary loss.
Table 1. Test accuracy on the ImageNet-100 and ImageNet-1000 datasets. The first two tables report results on ImageNet-100-B0 and ImageNet-1000-B0; the third reports results on ImageNet-100-B50. Results of the comparison methods come from [4,5,6]. DyTox [5] has an erratum regarding distributed memory on their official GitHub repository. The results of SEED [25] are taken from their paper, where old samples are not retained. #pcount denotes the average number of parameters across test steps, in millions. Avg (%) is the average accuracy and Last (%) is the accuracy after the last step. Bold values indicate the best performance.
ImageNet-100 B0:

| Method | #pcount | Top-1 Avg | Top-1 Last | Top-5 Avg | Top-5 Last |
|---|---|---|---|---|---|
| iCaRL | 11.22 | - | - | 83.60 | 63.80 |
| RPSNet | - | - | - | 87.9 | 74.0 |
| BiC | 11.22 | - | - | 90.60 | 84.40 |
| WA | 11.22 | - | - | 91.0 | 84.1 |
| SEED | - | 67.8 ± 0.3 | - | - | - |
| DER w/o P | 61.6 | 77.18 | 66.70 | 93.23 | 87.52 |
| DyTox | 11.0 | 71.85 | 57.94 | 90.72 | 83.52 |
| FOSTER | 11.2 | 78.40 | 69.91 | - | - |
| BEEF | - | 77.62 | 68.78 | 93.66 | 89.32 |
| BEEF-Compress | - | 79.34 | 71.12 | 93.30 | 88.94 |
| MCG | 72.80 | 80.46 | 71.52 | 94.76 | 90.90 |
| Ours | 61.6 | **80.61** | **72.36** | **94.96** | **91.54** |

ImageNet-1000 B0:

| Method | #pcount | Top-1 Avg | Top-1 Last | Top-5 Avg | Top-5 Last |
|---|---|---|---|---|---|
| iCaRL | 11.2 | 38.40 | 22.70 | 63.70 | 44.00 |
| RPSNet | - | - | - | - | - |
| BiC | 11.2 | - | - | 84.00 | 73.20 |
| WA | 11.2 | 65.67 | 55.60 | 86.60 | 81.10 |
| SEED | - | - | - | - | - |
| DER w/o P | 61.6 | 68.84 | 60.16 | 88.17 | 82.86 |
| DyTox | 11.4 | 68.14 | 59.75 | 87.03 | 82.93 |
| FOSTER | 11.2 | 68.34 | - | - | - |
| BEEF | - | 67.06 | 58.67 | 86.21 | 81.73 |
| BEEF-Compress | - | - | - | - | - |
| MCG | 72.80 | 70.80 | 62.30 | 88.65 | 84.14 |
| Ours | 61.6 | **72.19** | **64.27** | **90.25** | **86.23** |

ImageNet100-B50:

| Method | #pcount | Top-1 Avg | Top-1 Last | Top-5 Avg | Top-5 Last |
|---|---|---|---|---|---|
| UCIR | 11.22 | 68.09 | 57.3 | - | - |
| PODNet | 11.22 | 74.33 | - | - | - |
| TPCIL | 11.22 | 74.81 | 66.91 | - | - |
| SEED | - | 70.9 ± 0.5 | - | - | - |
| FOSTER | 11.22 | 77.54 | - | - | - |
| DER w/o P | 67.2 | 78.20 | 74.92 | 94.20 | 91.30 |
| MCG | 78.4 | 79.83 | 75.24 | 94.98 | 92.72 |
| Ours | 67.2 | **80.35** | **75.48** | **95.31** | **93.48** |
Table 2. Results on CIFAR100-B0, averaged over three different class orders. Results of the comparison methods come from MCG [6], DyTox [5], and DER [4]. #pcount denotes the average number of parameters across test steps, in millions. Avg (%) is the average accuracy and Last (%) is the accuracy after the last step. MCG uses RandAugment when training on CIFAR100; in this table, our method does not use RandAugment.
| Method | #pcount (5 steps) | Avg (5 steps) | Last (5 steps) | #pcount (10 steps) | Avg (10 steps) | Last (10 steps) | #pcount (20 steps) | Avg (20 steps) | Last (20 steps) |
|---|---|---|---|---|---|---|---|---|---|
| iCaRL | 11.2 | 71.14 | - | 11.2 | 65.27 ± 1.02 | 50.74 | 11.22 | 61.20 ± 0.83 | 43.75 |
| UCIR | 11.2 | 62.77 | - | 11.2 | 58.66 ± 0.71 | 43.39 | 11.2 | 58.17 ± 0.30 | 40.63 |
| BiC | 11.2 | 73.10 | - | 11.2 | 68.80 ± 1.20 | 53.54 | 11.2 | 66.48 ± 0.32 | 47.02 |
| WA | 11.2 | 72.81 | - | 11.2 | 69.46 ± 0.29 | 53.78 | - | 67.33 ± 0.15 | 47.31 |
| PODNet | 11.2 | 66.70 | - | 11.2 | 58.03 ± 1.27 | 41.05 | 11.2 | 53.97 ± 0.85 | 35.02 |
| RPSNet | 60.6 | 70.5 | - | 56.5 | 68.60 | 57.05 | - | - | - |
| DER w/o P | 33.6 | 76.80 ± 0.79 | - | 61.6 | 75.36 ± 0.36 | 65.22 | 117.6 | 74.09 ± 0.33 | 62.48 |
| DyTox | - | - | - | 10.73 | 73.66 ± 0.02 | 60.67 ± 0.34 | 10.74 | 72.27 ± 0.18 | 56.32 ± 0.61 |
| FOSTER | 11.2 | 72.56 | - | 11.2 | 72.90 | - | 11.2 | 70.65 | - |
| BEEF | - | 72.31 | 62.58 | - | 71.94 | 60.98 | - | 69.84 | 56.71 |
| BEEF-Compress | - | 73.05 | 62.48 | - | 72.93 | 61.45 | - | 71.69 | 57.06 |
| MCG | 44.8 | 78.15 ± 0.58 | - | 72.8 | 77.40 ± 0.94 | - | 128.8 | 76.20 ± 1.18 | - |
| Ours | 33.6 | 78.4 ± 0.99 | 70.97 ± 0.25 | 61.6 | 77.17 ± 1.11 | 68.13 ± 0.28 | 117.6 | 76.02 ± 1.27 | 64.37 ± 0.42 |
Table 3. Results on CIFAR100-B0S10, Order 2. MCG [6] trains its gates using DER’s auxiliary loss. We replace the auxiliary loss with our losses (CEC and CFC) while keeping everything else unchanged, denoted as “MCG + Ours” in the table. MCG uses RandAugment when training on CIFAR100; “Ours + RA” indicates that our method also uses RandAugment. The results of MCG are obtained by running their official open-source code. #pcount denotes the average number of parameters across test steps, in millions. Bold values indicate the best performance.
| Method | #pcount | Avg | Last |
|---|---|---|---|
| MCG | 72.8 | 77.69 | 67.57 |
| MCG + Ours | 72.8 | **78.84** | 69.4 |
| Ours | 61.6 | 77.33 | 68.34 |
| Ours + RA | 61.6 | 78.81 | **69.93** |
Table 4. Results on CIFAR100-B0S20. MCG [6] uses RandAugment when training on CIFAR100; “Ours + RA” indicates that our method also uses RandAugment. #pcount denotes the average number of parameters across test steps, in millions. The results of MCG are taken from their paper. Bold values indicate the best performance.
| Method | #pcount | Avg | Last |
|---|---|---|---|
| MCG | 128.8 | 76.20 ± 1.18 | - |
| Ours | 117.6 | 76.02 ± 1.27 | **64.37 ± 0.42** |
| Ours + RA | 117.6 | **77.27 ± 0.89** | 64.27 ± 0.59 |
Table 5. Results on CIFAR100-B50, averaged over three different class orders. MCG [6] uses RandAugment when training on CIFAR100; “Ours + RA” indicates that our method also uses RandAugment. #pcount denotes the average number of parameters across test steps, in millions. The results of MCG are sourced from their published paper.
| Method | #pcount (2 steps) | Avg (2 steps) | #pcount (5 steps) | Avg (5 steps) |
|---|---|---|---|---|
| MCG | 33.6 | 76.72 ± 0.62 | 50.4 | 76.19 ± 0.29 |
| Ours + RA | 22.4 | 77.49 ± 0.56 | 39.2 | 76.31 ± 0.62 |
Table 6. Ablation study. Aux. denotes the auxiliary loss used in DER. F.A. stands for feature adjustment. A checkmark (✓) indicates that the corresponding component is used, while a cross (×) indicates it is not used.
| Aux. | CEC. | CFC. | F.A. | Avg | Last |
|---|---|---|---|---|---|
| ✓ | × | × | × | 77.32 | 67.6 |
| ✓ | × | × | ✓ | 78.01 | 69.16 |
| × | ✓ | × | × | 79.25 | 70.30 |
| × | ✓ | × | ✓ | 79.98 | 71.44 |
| × | ✓ | ✓ | × | 80.20 | 72.2 |
| × | × | ✓ | × | 79.19 | 70.98 |
| × | ✓ | ✓ | ✓ | 80.61 | 72.36 |
Table 7. BWT and FWT on ImageNet100-B0S10.
| Method | BWT | FWT |
|---|---|---|
| DER (w/o P) | −11.72 | −6.5 |
| CEC. | −11.00 | −6.15 |
| CFC. | −7.49 | −9.2 |
| CEC. + CFC. | −6.71 | −10.73 |
| CEC. + CFC. + F.A. | −5.91 | −11.57 |
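BWT and FWT in Table 7 follow the usual continual-learning conventions. Assuming R[i][j] is the test accuracy on task j after training up to task i and b[j] is the accuracy of an independently initialized reference model on task j, the standard formulas can be computed as in the sketch below; this is an illustration of those definitions, not the paper's evaluation code.

```python
def backward_transfer(R):
    """BWT: mean over old tasks of (accuracy after the last task minus accuracy
    measured right after that task was learned). R[i][j] = accuracy on task j
    after training task i."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def forward_transfer(R, b):
    """FWT: mean over tasks of (accuracy on task j just before it is trained
    minus the accuracy b[j] of an independent reference model on task j)."""
    T = len(R)
    return sum(R[j - 1][j] - b[j] for j in range(1, T)) / (T - 1)
```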
Table 8. Sensitivity study of λ1 for the CEC loss.
| Metric | λ1 = 0.9 | λ1 = 0.95 | λ1 = 1 |
|---|---|---|---|
| Last | 70.30 | 70.24 | 70.28 |
Table 9. Sensitivity study of λ2, with λ1 fixed at 0.9.
| Metric | λ2 = 0.5 | λ2 = 0.7 | λ2 = 0.9 | λ2 = 1.2 |
|---|---|---|---|---|
| Last | 71.6 | 72.2 | 72.1 | 72.14 |
Table 10. Impact of memory size on performance.
| Memory | 500 | 1000 | 2000 |
|---|---|---|---|
| Avg | 71.43 | 75.17 | 78.81 |
Table 11. Ablation study on CIFAR100-B0S10 for DyTox. The first row is obtained by running their official open-source code. DyTox [5] corrected their results on their official GitHub repository regarding the use of distributed memory; all results in this table use distributed memory. Aux. denotes the auxiliary loss used in DyTox.
| Aux. | CEC. | CFC. | Avg | Last | Token_min_dist | Token_max_dist | bwt | Forgetting |
|---|---|---|---|---|---|---|---|---|
| ✓ | × | × | 71.20 | 57.3 | 0.5015 | 0.6486 | 16.96 | 23.44 |
| × | ✓ | × | 73.02 | 59.87 | 0.5385 (+7.3%) | 0.6300 (−2.8%) | 15.09 | 20.65 |
| × | × | ✓ | 72.41 | 60.24 | 0.5082 (+1.3%) | 0.6853 (+5.6%) | 7 | 11.26 |
| × | ✓ | ✓ | 73.21 | 60.33 | 0.5181 (+3.3%) | 0.6533 (+0.7%) | 14.65 | 19.68 |
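Token_min_dist and Token_max_dist summarize how spread out the learned task tokens are. The sketch below computes the minimum and maximum pairwise distance under the assumption of cosine distance between L2-normalized tokens; the exact metric behind these columns is not specified here, so treat the snippet purely as an illustration.

```python
import torch
import torch.nn.functional as F

def token_distance_range(task_tokens):
    """Min/max pairwise cosine distance between task tokens (num_tasks, dim).
    The metric is an assumption for illustration; the table may use another one."""
    t = F.normalize(task_tokens, dim=1)           # unit-norm tokens
    dist = 1.0 - t @ t.t()                        # pairwise cosine distance
    mask = ~torch.eye(len(t), dtype=torch.bool)   # ignore self-distances
    off_diag = dist[mask]
    return off_diag.min().item(), off_diag.max().item()
```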
Table 12. Results on ImageNet100-B0S10. DyTox [5] uses DER’s auxiliary loss. We replace the auxiliary loss with our losses (CEC and CFC) while keeping everything else unchanged, denoted as “DyTox + Ours” in the table. The result of DyTox is obtained by running their official open-source code. DyTox corrected their results on their official GitHub repository regarding the use of distributed memory. All results in this table were obtained under the distributed-memory setting using two 3090 Ti GPUs.
| Method | Avg | Last |
|---|---|---|
| DyTox | 73.22 | 60.82 |
| DyTox + Ours | 74.80 | 63.62 |
Table 13. Results on ImageNet100-B0S10. MCG [6] trains its gates using DER’s auxiliary loss. We replace the auxiliary loss with our losses (CEC and CFC) while keeping everything else unchanged, denoted as “MCG + Ours” in the table. The results of MCG are taken from their paper. #pcount denotes the average number of parameters across test steps, in millions. Bold values indicate the best performance.
| Method | #pcount | Avg | Last |
|---|---|---|---|
| MCG | 72.8 | 80.46 | 71.52 |
| MCG + Ours | 72.8 | **81.07** | 72.1 |
| Ours | 61.6 | 80.61 | **72.36** |