Article

Image–Text (IT)-Prompt: Prompt-Based Learning Framework Empowered by the Cluster-Based Nearest Class Mean (C-NCM) for Rehearsal-Free Contrastive Language–Image Pretraining (CLIP)-Based Continual Learning

1 School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
2 State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 2966; https://doi.org/10.3390/app15062966
Submission received: 21 January 2025 / Revised: 5 March 2025 / Accepted: 7 March 2025 / Published: 10 March 2025

Abstract

The Contrastive Language–Image Pretraining (CLIP) model has demonstrated remarkable zero-shot capabilities through contrastive learning on large-scale image–text datasets, sparking interest in developing continual learning methods that extend its knowledge while preserving its zero-shot performance. However, traditional approaches often modify the pretrained parameters of CLIP, compromising its zero-shot capabilities, and face challenges due to the substantial parameter size of CLIP and lengthy training times. To address these issues, we propose the Image–Text Prompt (IT-Prompt) method, which leverages the inherent correlation between visual and textual information to train discrete prompts dedicated to individual tasks, serving as repositories for task-specific knowledge. By employing discrete textual prompts as guidance, we ensure the uniqueness of each task’s prompt and prevent interference among tasks, thus alleviating catastrophic forgetting during continual learning. While retaining the pretrained parameters of CLIP, our approach introduces only a small number of additional trainable parameters, enhancing training efficiency and preserving the original zero-shot capabilities of CLIP. Building on IT-Prompt, we further introduce a Cluster-based Nearest Class Mean (C-NCM) classifier, which eliminates the need for Softmax classifiers to store and retrain old task samples, significantly improving training efficiency and reducing resource consumption. Experiments demonstrate that our method achieves over a 10% performance improvement compared to state-of-the-art CLIP-based continual learning methods, with enhanced efficiency and reduced overhead.

1. Introduction

In recent years, deep learning models have excelled in processing large datasets in a one-shot manner [1,2]. However, extending the knowledge of pretrained models to new datasets often necessitates retraining on these entire datasets [3], incurring significant economic and temporal costs. This challenge is particularly acute for open-source multimodal models like Contrastive Language–Image Pretraining (CLIP) [4], which are pretrained on extensive and complex image–text pairing datasets and demonstrate robust zero-shot capabilities across various tasks [5]. The prohibitive cost of retraining the entire CLIP model renders it impractical for real-world applications.
As illustrated in Figure 1, various continual learning methods have been developed to train models on sequential tasks without degrading performance on previous tasks [3,6,7]. Replay-based methods, in particular, have proven effective [8,9,10,11,12,13]. These methods involve sampling instances from a stored data pool of previous tasks and combining them with samples from new tasks, thus preserving earlier task performance. Nevertheless, identifying suitable replay strategies for models like CLIP is challenging due to the vast and diverse nature of their original datasets.
Concurrently, research has explored continual learning strategies based on CLIP, including knowledge distillation methods [14,15]. However, these approaches might necessitate adjustments to the pretrained parameters of CLIP for downstream tasks, potentially compromising its zero-shot capabilities. Therefore, it is essential to explore innovative methodologies for continual learning in multimodal models like CLIP that maintain zero-shot performance, reduce training costs, and enhance training efficiency.
Prompt-based continual learning methods are emerging as promising solutions. These methods keep the parameters of pretrained feature extractors intact while introducing a small set of learnable key-prompt parameters tailored to new task sequences [2,16,17,18,19]. However, existing prompt-based methods primarily utilize the knowledge from a single modality within pretrained models [20,21]. For multimodal models like CLIP, effectively harnessing pretrained knowledge from both image and text modalities is crucial. Inspired by this, we propose IT-Prompt, a novel prompt-based method that guides key-prompt learning from both image and text perspectives, thus enabling comprehensive knowledge expansion within the pretrained CLIP model. Building on IT-Prompt, we introduce a Cluster-based Nearest Class Mean (C-NCM) classifier designed to reduce dependence on Softmax classifiers, which require the continuous storage and retraining of old task samples. The C-NCM classifier significantly cuts training resource consumption and enhances training efficiency by emulating human-like adaptive learning mechanisms.
The contributions of this paper can be summarized as follows:
  • We introduce IT-Prompt, a novel prompt-based continual learning method for pretrained CLIP models, which leverages prompts for both image and text modalities to harness the correlation between visual and textual modalities, effectively extending the pretrained CLIP models’ knowledge. IT-Prompt preserves the zero-shot capabilities of the pretrained CLIP models and mitigates the catastrophic forgetting problem inherent in previous CLIP-based continual learning methods.
  • Building on IT-Prompt, we further introduce a C-NCM classifier, which eliminates the need for Softmax classifiers to store and retrain old task samples, significantly improving training efficiency and reducing resource consumption.
  • The proposed method outperforms all state-of-the-art methods across multiple benchmark datasets, demonstrating a significant performance improvement of around 10% on various task sequences in CIFAR100 and TinyImageNet.
Comparison with Existing Works. In the rapidly evolving field of continual learning, recent research approaches can be broadly categorized into three streams: replay-based methods, regularization techniques, and parameter isolation strategies. Replay-based methods [8,9,10,11] store examples from previous tasks but face scalability challenges with large-scale models like CLIP. Regularization techniques [22,23] penalize updates to critical parameters but often compromise the model’s ability to learn new tasks effectively. Parameter isolation approaches, particularly prompt-based methods [2,16,17,19], have shown promise by introducing task-specific learnable parameters while freezing the pretrained backbone. However, existing prompt-based methods predominantly focus on single-modality knowledge and rely on computationally expensive Softmax classifiers that require the continuous storage of old samples. Our work addresses these fundamental limitations through IT-Prompt, which uniquely leverages both the visual and textual modalities of CLIP, and through our C-NCM classifier, which eliminates the need for storing previous task data; these innovations significantly advance the state of the art in multimodal continual learning.

2. Related Work

2.1. Continual Learning

2.1.1. Traditional Continual Learning Methods

Continual learning methods are commonly divided into four categories: regularization-based, distillation-based, architecture-based, and rehearsal-based. Regularization-based methods [22,24,25,26] impose constraints on the update of crucial network parameters for prior tasks by integrating regularization terms into the loss function. While effective, these methods often significantly inhibit the network’s ability to learn new information. Distillation-based methods [23,27,28] synchronize the output spaces of current and previous tasks, becoming integral to various incremental learning strategies as knowledge distillation techniques evolve. Architecture-based methods [29,30,31,32,33] mitigate catastrophic forgetting by dynamically expanding network architectures or modifying internal structures for new tasks. Rehearsal-based methods [8,9,34,35,36,37], on the other hand, integrate old and new knowledge by training with a mix of representative samples from both past and current tasks. However, these methods face limitations in privacy-sensitive and security-focused scenarios due to the need to store samples from previous tasks.

2.1.2. Prompt-Based Continual Learning Methods

Recent advances in continual learning have seen a significant increase in the adoption of prompting techniques from natural language processing (NLP) [6]. Prompt-based continual learning methods keep the original model parameters intact, thus preserving the zero-shot capabilities of pretrained models. These methods introduce learnable key-prompt parameters to adapt flexibly to the learning requirements of new tasks, thereby facilitating effective knowledge transfer [2,18,19,21,38]. However, current prompt-based approaches predominantly rely on exploiting single-modality knowledge within pretrained models to guide the learning process. For multimodal models such as CLIP, it is essential to explore strategies for effectively utilizing pretrained knowledge from both image and text modalities.

2.1.3. CLIP-Based Continual Learning Methods

For the continual learning of CLIP, Continual-CLIP [39] demonstrated impressive performance in downstream continual learning tasks by leveraging the IT classifiers from the pretrained CLIP model without employing any specific continual learning methods. Conversely, VR-LwF [14] and ZSCL [15] utilized mainstream continual learning approaches, such as replay and knowledge distillation for incremental learning in downstream tasks, achieving notable results. Additionally, AttriCLIP [17] employed a text-prompt-based approach for continual learning in downstream tasks, also yielding promising outcomes. However, these methods have not fully exploited the inherent transferability of the pretrained CLIP model. CLAP [40] enhances continual learning with CLIP through probabilistic finetuning, improving calibration, mitigating forgetting, and enabling better uncertainty estimation for tasks like novel data detection and exemplar selection. DDAS [41] enables parameter-efficient continual learning for vision–language models by integrating Mixture-of-Experts (MoE) adapters and a Distribution-Discriminative Auto-Selector (DDAS), mitigating forgetting, preserving zero-shot recognition, and reducing training costs by 60%. TMM-CLIP [42] enhances rehearsal-free class-incremental learning by preserving discriminative visual representations, using dynamic prompt tuning to reduce intra-task confusion, and employing contrastive inter-task learning to improve task separability, outperforming state-of-the-art methods on multiple benchmarks. In summary, while these approaches have shown effectiveness in continual learning for CLIP, there remains room for enhancing the utilization of CLIP’s pre-existing transfer capabilities to achieve more robust continual learning performance.

2.2. Classifier

The Softmax classifier paired with cross-entropy loss has traditionally been the preferred method for classification tasks in neural networks. While this combination is prevalent in continual learning (CL) for image classification, several studies [9,43] have noted that models using the Softmax classifier exhibit a pronounced bias towards more recent tasks, primarily due to the imbalance between new and old classes. This imbalance is a major contributor to catastrophic forgetting. To address this challenge, [18] employs representation statistics from previously learned classes to correct classifier bias and enhance the stability of prompt learning. Nonetheless, this strategy incurs additional training costs, particularly for lengthy task sequences.

3. Proposed Method

3.1. Problem Formulation

In this paper, we focus on the more representative and challenging class-incremental learning (CIL) scenario [44]. For a deep model $\Phi(f(\cdot))$ consisting of a classifier $\Phi(\cdot)$ and a backbone $f(\cdot)$, the goal of CIL is to train the model on tasks $\{T_t\}_{t=1}^{n}$ sequentially so that it can classify test samples from any task. Specifically, at the $t$-th task $T_t$, the training set is $D_t = \{x_t^m, y_t^m\}_{m=1}^{N_t}$, where $x_t^m$ is the $m$-th image of task $T_t$ with label $y_t^m$. The label space $Y_t$ of task $T_t$ is disjoint from those of the other tasks, i.e., $\bigcap_{t=1}^{n} Y_t = \emptyset$. Once the model finishes learning task $T_t$, the corresponding training set $D_t$ is dropped and becomes inaccessible when learning $T_{t+1}$.
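For concreteness, the protocol can be sketched as follows; this is our own minimal illustration, with hypothetical helpers `train_one_task` and `evaluate` rather than the authors' code.

```python
# Illustrative class-incremental learning protocol (our own sketch, not the paper's code).

def class_incremental_training(model, task_datasets, train_one_task, evaluate):
    """task_datasets: list of (train_set, label_space) pairs with mutually disjoint label spaces."""
    seen_labels = set()
    for t, (train_set, label_space) in enumerate(task_datasets, start=1):
        assert seen_labels.isdisjoint(label_space), "label spaces Y_t must be disjoint"
        seen_labels |= set(label_space)

        train_one_task(model, train_set)   # D_t is only accessible while learning T_t
        del train_set                      # afterwards D_t is dropped and never revisited

        # At test time the model must classify samples from all tasks seen so far.
        acc = evaluate(model, seen_labels)
        print(f"After task {t}: average accuracy over seen classes = {acc:.2f}")
```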

3.2. Overall Framework

As illustrated in Figure 2, the overall framework is composed of two primary components: prompt-based feature learning and NCM-based feature processing and classification. To effectively utilize the pretrained knowledge of multimodal models like CLIP, we introduce IT-Prompt, a novel method aimed at improving task-specific feature extraction. IT-Prompt harnesses the pretrained CLIP model’s image encoder to deliver comprehensive visual knowledge and its text encoder to provide category-specific guidance for prompt construction and optimization. This method capitalizes on the synergy between text and image modalities to enhance prompt development. Through the processes of inference, optimization, and construction of task-specific prompts, IT-Prompt develops a customized feature extractor for the specific task. Following feature extraction, the C-NCM algorithm is applied for further processing and classification. The C-NCM algorithm includes two stages: (1) feature preprocessing, which refines the features for the current task, and (2) the C-NCM incremental classifier, which develops an adaptive classification model suitable for incremental learning scenarios. This integrated framework ensures efficient feature extraction, processing, and classification, thus boosting the model’s performance and adaptability.

3.3. Task-Specific Prompt Learning

3.3.1. Task-Specific Prompt Inference

During the pretraining phase, CLIP’s image encoder accumulates extensive visual knowledge through image–text contrastive learning on large-scale datasets, forming a robust foundation for guiding the learning of task-specific keys. We freeze the parameters of the image encoder and construct the task classifier based on this foundation. The classifier is fine-tuned to accurately predict keys, enabling the selection of a task-specific prompt for each test sample.
As shown in Figure 3, given an input image $x$, we feed it into the image feature extractor $f(x; \theta_f)$ of CLIP with frozen parameters $\theta_f$. The extracted features are then passed into a classifier $g(z; \theta_g)$, where $z$ denotes the extracted features and $\theta_g$ the parameters of the classifier. Finally, we obtain a logits vector $y_{\mathrm{logits}} = g(f(x; \theta_f); \theta_g)$, which represents the predicted probabilities for the different classes.
Assuming the true labels are $y_{\mathrm{true}}$, we can quantify the difference between the predicted values and the true values using the cross-entropy loss function, given by
$$\mathcal{L}_K = -\sum_{i} y_{\mathrm{true},i}\,\log\big(\sigma(y_{\mathrm{logits}})_i\big) \qquad (1)$$
Here, $y_{\mathrm{true},i}$ represents the probability of the $i$-th class in the true labels (usually 0 or 1 in one-hot encoding), and $\sigma(y_{\mathrm{logits}})_i$ represents the predicted probability of the $i$-th class after applying the Softmax function to the logits vector.
After obtaining the logits vector $y_{\mathrm{logits}}$, we select the category with the highest probability as the predicted category, denoted as $\hat{y}$. Subsequently, utilizing Equation (2), we determine the corresponding key from the predicted category:
$$\mathrm{key} = \begin{cases} 1, & \text{if } \hat{y} \in Y_{T_1} \\ 2, & \text{if } \hat{y} \in Y_{T_2} \\ \ \vdots \\ n, & \text{if } \hat{y} \in Y_{T_n} \end{cases} \qquad (2)$$
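A minimal PyTorch sketch of this task-key inference step is given below; the names (`image_encoder`, `class_to_task`) and the module structure are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyPredictor(nn.Module):
    """Task-key inference: a frozen CLIP image encoder f(.; theta_f) plus a trainable
    linear classifier g(.; theta_g) over all classes seen so far. `class_to_task`
    maps a predicted class index to its task key, as in Equation (2)."""

    def __init__(self, image_encoder, feat_dim, num_seen_classes, class_to_task):
        super().__init__()
        self.encoder = image_encoder
        for p in self.encoder.parameters():      # keep theta_f frozen
            p.requires_grad_(False)
        self.classifier = nn.Linear(feat_dim, num_seen_classes)
        self.class_to_task = class_to_task

    def forward(self, x, y_true=None):
        with torch.no_grad():
            z = self.encoder(x)                  # frozen feature extraction
        logits = self.classifier(z)              # y_logits
        loss = F.cross_entropy(logits, y_true) if y_true is not None else None  # L_K, Equation (1)
        y_hat = logits.argmax(dim=-1)            # predicted category
        keys = [self.class_to_task[int(c)] for c in y_hat]   # task-specific keys
        return keys, loss
```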

3.3.2. Construction of the Prompt

Similar to most prompt-based continual learning methods [18,21], we establish a prompt pool shared by all tasks. As depicted in Figure 3, when training a new task, a set of task-specific prompts $P_T$ is constructed within this pool. Typically, prompt-based methods utilize the Vision Transformer (ViT) [45] as their backbone, which employs a transformer architecture featuring multi-head self-attention (MSA) and feedforward layers in each block. For the construction of prompts, a flexible number of MSA layers are selected to integrate $P_T$ for the current task. To implement Prefix Tuning (PreT) [46] in a single MSA layer, $P_k$ is concatenated with $h_k$ and $P_v$ is concatenated with $h_v$, where both $P_k$ and $P_v$ are derived from $P_T$ and $h = h_k = h_q = h_v$ in ViT. Consequently, the input $h_i$ to the subsequent MSA layer is defined by $h_i = \mathrm{Attention}(h_q W_{q,i}, [P_k; h_k W_{k,i}], [P_v; h_v W_{v,i}])$, $i = 1, \dots, m$, where $W_{q,i}$, $W_{k,i}$, and $W_{v,i}$ are projection matrices and $m$ is the number of heads.
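The prefix-tuning step can be illustrated with the single-head sketch below; the explicit projection matrices and tensor shapes are simplifying assumptions of ours, not the actual ViT/CLIP implementation.

```python
import torch

def prefix_attention(h, W_q, W_k, W_v, P_k, P_v):
    """Single-head sketch of Prefix Tuning in one MSA layer.
    h:            [B, L, d] input tokens (h_q = h_k = h_v in ViT)
    W_q/W_k/W_v:  [d, d] projection matrices
    P_k, P_v:     [prefix_len, d] halves of the task-specific prompt P_T
    The prompt is prepended to the projected keys and values only, matching
    Attention(h_q W_q, [P_k; h_k W_k], [P_v; h_v W_v])."""
    B = h.size(0)
    q = h @ W_q                                              # [B, L, d]
    k = torch.cat([P_k.expand(B, -1, -1), h @ W_k], dim=1)   # [B, prefix_len + L, d]
    v = torch.cat([P_v.expand(B, -1, -1), h @ W_v], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v                                          # output passed to the next layer
```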

3.3.3. Optimization of the Prompt

Recent advancements in vision-language models demonstrate that language typically provides information complementary to visual data [4,20]. In this context, we employ a pretrained text encoder to extract features pertinent to image categories, which guide the optimization of prompts. Due to the extensive textual data involved in the pretraining phase, the extracted category text features exhibit a degree of discretization. As a result, prompts influenced by these text features are somewhat discrete in nature. This discretization helps minimize knowledge interference across different tasks, promotes task-specific prompt learning, and aids in mitigating the phenomenon of catastrophic forgetting.
Upon receiving the key associated with the input image $x$, the model selects a task-specific prompt $P_{\mathrm{Task}}$. This prompt is then concatenated with several MSA modules within the image encoder of CLIP, yielding the image’s feature representation $F_I$ guided by $P_{\mathrm{Task}}$; $F_I$ encapsulates rich task-specific information. Simultaneously, based on the label name of $x$, a textual prompt $P_{\mathrm{text}}$ is constructed, such as “A photo of a dog”. $P_{\mathrm{text}}$ is fed into the text encoder of CLIP to obtain the textual feature representation $F_T$ of the image. Given the discrete nature of textual data, we compute a contrastive loss [47] between the prompt-guided $F_I$ and $F_T$, denoted as $\mathcal{L}_P$ and elaborated in Equation (3).
$$\mathcal{L}_P = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\left(F_I \cdot F_T^{\,i} / \tau\right)}{\sum_{j=1}^{M} \exp\!\left(F_I \cdot F_T^{\,j} / \tau\right)} \qquad (3)$$
where $N$ is the batch size, $M$ is the number of categories seen so far, and $\tau$ is a temperature parameter controlling the scale of the similarity scores.
In addition, for prompt updates, a prompt ensemble (PE) strategy is utilized to efficiently transfer knowledge from previously learned prompts (which encode task-specific knowledge) to the new task’s prompt. The new task’s prompt is optimized as follows:
$$P_t = \alpha \sum_{i=1}^{t-1} P_i + (1 - \alpha)\, P_t \qquad (4)$$
where $P_i$ represents the learned prompt of the $i$-th previous task, $P_t$ is the prompt for the new task, and $\alpha$ is a weighting factor that balances the contributions of the previous prompts and the new task’s prompt.
To ensure that the new task’s prompt does not overlap with the previously learned prompts, a contrastive learning loss is employed during the optimization of the new task’s prompt. The contrastive loss is defined as
$$\mathrm{sim}_{ij} = \frac{M_i \cdot M_j}{\lVert M_i \rVert\,\lVert M_j \rVert} \qquad (5)$$
where $M = [F_I; \mu_t]$ is the matrix formed by concatenating the feature of the current sample with the prototypes of the previous classes. The similarity values are normalized by a temperature factor (e.g., 0.8). Finally, the contrastive loss is computed as
$$\mathcal{L}_C = -\frac{1}{|M|}\sum_{i=1}^{|M|} \log \frac{\exp\left(\mathrm{sim}_{ii} / \tau\right)}{\sum_{j=1}^{|M|} \exp\left(\mathrm{sim}_{ij} / \tau\right)} \qquad (6)$$
where $\tau$ is the temperature factor. This loss encourages the new task’s prompt to remain distinct from the previously learned prompts, thereby preventing overlap and improving task-specific prompt optimization.
In this way, the overall optimization objective is defined in Equation (7):
$$\mathcal{L} = \mathcal{L}_C + \lambda\, \mathcal{L}_P \qquad (7)$$
where $\lambda$ denotes the balance factor.
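The sketch below shows one way the losses of Equations (3)-(7) could be assembled; apart from the 0.8 used for the similarity normalization, the default temperature, the weighting factor, and the assumption of pre-normalized features are ours, and the code is an illustration rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def text_guided_loss(F_I, F_T_all, labels, tau=0.07):
    """L_P (Equation (3)): contrast prompt-guided image features against the text
    features of all M categories seen so far. tau = 0.07 is an illustrative default.
    F_I: [N, d] image features, F_T_all: [M, d] text features (both L2-normalized),
    labels: [N] index of each image's category text feature."""
    logits = F_I @ F_T_all.t() / tau
    return F.cross_entropy(logits, labels)   # = -(1/N) sum_i log softmax at the true text

def ensemble_prompt(prev_prompts, new_prompt, alpha=0.1):
    """Prompt ensemble update (Equation (4)); alpha is an illustrative weighting factor."""
    if not prev_prompts:
        return new_prompt
    return alpha * torch.stack(prev_prompts).sum(dim=0) + (1 - alpha) * new_prompt

def separation_loss(F_I, prototypes, tau=0.8):
    """L_C (Equations (5) and (6)) over M = [F_I; mu_t], the concatenation of current
    features and previous-class prototypes, with temperature factor tau (e.g., 0.8)."""
    M = F.normalize(torch.cat([F_I, prototypes], dim=0), dim=-1)
    sim = M @ M.t()                                          # cosine similarities sim_ij
    targets = torch.arange(M.size(0), device=M.device)
    return F.cross_entropy(sim / tau, targets)

def overall_loss(F_I, F_T_all, labels, prototypes, lam=1.0):
    """L = L_C + lambda * L_P (Equation (7)); lam is the balance factor."""
    return separation_loss(F_I, prototypes) + lam * text_guided_loss(F_I, F_T_all, labels)
```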

3.4. C-NCM Classifier

3.4.1. Feature Preprocessing

During the feature preprocessing phase, this paper utilizes prompt-based methods to extract task-specific features and addresses natural intra-class variations by implementing a simple k-means clustering technique to organize features within each class. For instance, encountering an “apple pear” and categorizing it as a type of pear illustrates how the concept of “pear” can be expanded by creating distinct sub-categories, such as general pears and apple pears. Drawing inspiration from this example, k-means clustering partitions the extracted features into sub-clusters within each class, thereby enhancing the model’s ability to represent these variations accurately.
Sub-clusters containing a minimal number of samples are identified as outliers and are subsequently removed to ensure that the remaining clusters are more representative and meaningful. This preprocessing step optimizes the prepared features for the next phase of task-specific classifier learning, thereby improving efficiency and robustness in continual learning scenarios.
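A minimal sketch of this preprocessing step is shown below, assuming scikit-learn's KMeans; the number of sub-clusters per class and the outlier threshold are illustrative choices, as the paper does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_subclass_prototypes(features, labels, k=3, min_cluster_size=5):
    """Split each class into at most k sub-clusters with k-means and drop tiny
    (outlier) clusters; k and min_cluster_size are illustrative, not from the paper.
    Returns a list of (class_label, prototype_vector) pairs for the C-NCM classifier."""
    prototypes = []
    for c in np.unique(labels):
        feats_c = features[labels == c]
        n_clusters = min(k, len(feats_c))
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats_c)
        for s in range(n_clusters):
            members = feats_c[km.labels_ == s]
            if len(members) >= min_cluster_size:   # discard sparsely populated sub-clusters
                prototypes.append((c, members.mean(axis=0)))
    return prototypes
```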

3.4.2. C-NCM Classifier

The NCM classifier and its variants are extensively applied in few-shot and zero-shot learning scenarios [9]. Specifically, given an image x along with its corresponding prompt p, they are processed through a feature extraction network f to obtain an embedding vector. For each class, the embeddings of all samples are first obtained and then clustered into multiple subclasses using the feature preprocessing methods described in the previous sections. Each class is thus represented by its corresponding subclasses.
To classify the new sample x, the embedding of x is compared with all the subclasses obtained so far using cosine similarity. The classification label is then determined by identifying the subclass with the highest similarity. The cosine similarity is computed as follows:
$$\mathrm{sim}\big(f(x), \mu_s\big) = \frac{f(x) \cdot \mu_s}{\lVert f(x) \rVert\, \lVert \mu_s \rVert} \qquad (8)$$
where $f(x)$ is the embedding vector of the input sample $x$, $\mu_s$ is the prototype of subclass $s$, and $\lVert \cdot \rVert$ denotes the Euclidean norm. The class label is determined as
$$y = \arg\max_{s}\ \mathrm{sim}\big(f(x), \mu_s\big) \qquad (9)$$
where the maximization is performed over all subclasses $s$ across all classes. By considering all subclasses, this approach enables a more refined classification process that takes intra-class variations into account.
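The classification rule can be sketched with the simple NumPy routine below, operating on the (class, prototype) pairs produced by the preprocessing step; it is an illustration, not the authors' implementation.

```python
import numpy as np

def cncm_predict(embedding, prototypes):
    """Assign the class of the subclass prototype with the highest cosine similarity
    (Equations (8) and (9)). `prototypes` is the list of (class_label, prototype_vector)
    pairs produced during feature preprocessing."""
    best_label, best_sim = None, -np.inf
    for label, mu in prototypes:
        sim = embedding @ mu / (np.linalg.norm(embedding) * np.linalg.norm(mu))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```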

4. Experiments and Analysis

4.1. Experiment Settings

4.1.1. Datasets

We adopt four primary benchmarks to assess the CIL capabilities of CLIP. The ImageNet-R [48] benchmark randomly divides 200 classes into 5, 10, and 20 distinct tasks, with each task consisting of 40, 20, and 10 classes, respectively. Similarly, the CUB-200 benchmark is built on the CUB-200-2011 [49] dataset, which focuses on fine-grained bird classification; its 200 classes are randomly split into 5, 10, and 20 tasks of 40, 20, and 10 classes, respectively. The CIFAR-100 [50] benchmark randomly partitions the original CIFAR-100 dataset into 10, 20, and 50 separate tasks, denoted as T10, T20, and T50, each containing disjoint classes. For the TinyImageNet benchmark with 200 classes, we follow an implementation similar to [15]: the model is initially trained on 100 classes, and the remaining 100 classes are randomly split into 5, 10, and 20 tasks of 20, 10, and 5 classes, respectively.

4.1.2. Evaluation Metrics

To assess the continual learning performance of CLIP, we record the average classification accuracy over all seen classes at the end of each task’s training and denote the average accuracy on the $i$-th task after learning the $j$-th task as $A_i^j$. Following [15,18,21], we evaluate all methods with the metrics below.
  • Final average accuracy (FAA) refers to the final average accuracy after learning all the tasks:
    $$FAA = \frac{1}{T}\sum_{i=1}^{T} A_i^T$$
    where $A_i^T$ is the average accuracy of task $i$ after learning task $T$, and $T$ is the number of tasks. A larger FAA indicates greater learning capacity and less forgetting.
  • Cumulative Average Accuracy (CAA) is the average of the historical FAA values obtained after learning each task:
    $$CAA = \frac{1}{T}\sum_{j=1}^{T}\frac{1}{j}\sum_{i=1}^{j} A_i^j$$
    CAA reflects the overall performance after each incremental task and can also be denoted as “Inc-Acc”. A larger CAA indicates greater learning capacity and less forgetting.
  • Final Forgetting Maximum (FFM) quantifies catastrophic forgetting by averaging the maximum performance drop of each task after learning the final task. A lower FFM indicates greater learning capacity and reduced forgetting. The FFM is formulated as
    $$FFM = \frac{1}{T-1}\sum_{i=1}^{T-1}\max_{t\in\{i,\dots,T-1\}}\left(A_i^t - A_i^T\right)$$
    where $A_i^T$ is the average accuracy of task $i$ after learning task $T$, and $T$ is the number of tasks.
  • F1-score. This metric provides deeper insight into model performance; a higher F1-score signifies better overall effectiveness. It is defined as
    $$F_1 = \frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
    where
    $$\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}$$
    Here, TP (True Positive) denotes correctly predicted positive cases, FP (False Positive) denotes cases incorrectly predicted as positive, and FN (False Negative) denotes positive cases not identified by the model.
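As a reference, the three accuracy-based metrics can be computed from the accuracy matrix $A_i^j$ with a small utility such as the sketch below (our own helper, 0-indexed tasks); the F1-score can be obtained with standard tools such as sklearn.metrics.f1_score.

```python
import numpy as np

def continual_metrics(A):
    """Compute FAA, CAA, and FFM from an accuracy matrix A, where A[i][j] is the average
    accuracy on task i after learning task j (0-indexed, upper triangle filled)."""
    A = np.asarray(A, dtype=float)
    T = A.shape[0]
    faa = A[:, T - 1].mean()                                   # final average accuracy
    caa = np.mean([A[: j + 1, j].mean() for j in range(T)])    # cumulative average accuracy
    if T > 1:
        ffm = np.mean([A[i, i:T - 1].max() - A[i, T - 1] for i in range(T - 1)])
    else:
        ffm = 0.0
    return faa, caa, ffm
```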

4.1.3. Compared Methods

We compared CLIP with three distinct pretraining paradigms, iBOT [51], DINO [52], and MoCo V3 [53], to assess CLIP’s efficacy in the context of continual learning. Building on these findings, we compare the proposed IT-Prompt with existing CLIP-based continual learning methods, including Continual-CLIP [39], VR-LwF [14], ZSCL [15], and AttriCLIP [17]. AttriCLIP employs the ViT-L/14 architecture, whereas the other methods utilize the ViT-B/16 architecture. More training details can be found in Appendix A.2.

4.2. Results

We compare IT-Prompt with other CLIP-based methods on the CIFAR-100, TinyImageNet, ImageNet-R, and CUB-200 datasets with different task sequences. From Table 1 and Table 2, it is evident that our proposed approach offers substantial advantages over existing pretrained CLIP-based continual learning methods.
Specifically, when evaluated on the CIFAR-100 dataset across diverse task sequences, both the FAA and CAA metrics demonstrate significant performance improvements, with enhancements exceeding 10% compared to other methods. In addition, on the TinyImageNet dataset, our method outperforms the state-of-the-art ZSCL approach, achieving nearly a 10% improvement in FAA and a 5% improvement in CAA. These results underscore the robustness and efficacy of our approach in comparison to other methods in the domain. Upon evaluation on the challenging ImageNet-R dataset, which incorporates more complex samples from ImageNet, and the intricate CUB-200 dataset, renowned for its fine-grained distinctions, our method stands out with promising performance improvements. Specifically, on the ImageNet-R dataset, across various task sequences, both the FAA and CAA metrics show varying degrees of performance enhancement. Furthermore, on the CUB-200 dataset, our method achieves a remarkable increase of nearly 20% in FAA compared to the ZSCL method across different task sequences.
The superior performance of IT-Prompt is driven by its task-adaptive prompt tuning, which dynamically adjusts prompts to capture task-specific nuances while maintaining generalization, reducing task interference. Additionally, the C-NCM classifier eliminates the need for storing and retraining old task samples by leveraging class prototypes, maintaining strong classification boundaries, and preventing long-term forgetting. Unlike previous methods that overlook textual information, IT-Prompt preserves CLIP’s zero-shot capabilities by ensuring effective multi-modal synergy between visual and textual features, reducing intra-task and inter-task confusion. Furthermore, its efficient training resource utilization avoids costly rehearsal-based strategies and full model finetuning, enabling scalability to larger datasets while achieving superior continual learning performance. Notably, across various datasets and task sequences, the C-NCM classifier demonstrates performance comparable to that of the Softmax classifier with replay, even without utilizing replay. This advantage becomes particularly evident in longer task sequences, further highlighting the effectiveness of this method in conserving training resources.
To further demonstrate the effectiveness and superiority of our proposed IT-Prompt, we incorporated t-SNE visualizations, comparing our proposed method with baseline methods using 10 randomly selected categories from the CIFAR-100 dataset. These visualizations reduce the high-dimensional feature space to two dimensions while preserving neighborhood relationships, further demonstrating the effectiveness and superiority of our approach. The t-SNE visualization results (refer to Figure 4) demonstrate that our proposed IT-Prompt exhibits significant advantages over baselines, including ZSCL, LwF-VR, and iCaRL. IT-Prompt generates feature representations with well-defined class boundaries, robust intra-class cohesion, and enhanced inter-class separation. Data points form compact, structurally coherent clusters, whereas competing methods display various degrees of class overlap and boundary ambiguity. This limitation is particularly pronounced in the iCaRL method, where class clusters appear dispersed and largely indistinguishable from one another.
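The visualization procedure can be approximated with standard tooling, as in the sketch below (scikit-learn's TSNE with illustrative settings and matplotlib for plotting); this is our own reconstruction of the protocol, not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="IT-Prompt features"):
    """Project high-dimensional features to 2-D with t-SNE (neighborhood-preserving)
    and color points by class; perplexity and init are illustrative settings."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.title(title)
    plt.show()
```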
Additional Evaluation Results. Table 3 presents the performance comparison of different continual learning methods on the CIFAR100-T10 and ImageNet-R-T10 datasets using the newly introduced FFM and F1-score metrics. The results demonstrate the superiority of our proposed IT-Prompt approach. Specifically, IT-Prompt achieves the lowest FFM across both datasets, indicating significantly reduced catastrophic forgetting compared to other methods. Simultaneously, it attains the highest F1-score, highlighting its ability to maintain robust classification performance throughout incremental learning. These results reinforce the effectiveness of IT-Prompt in continual learning scenarios.

4.3. Analysis

4.3.1. Impact of Pretrained Paradigm

Experiments conducted in [18] have demonstrated a significant impact of the pretraining paradigm on the performance of prompt-based continual learning methods. We adopted different pretrained models and combined them with the HiDe-Prompt method proposed in [18] to conduct continual learning experiments. As illustrated in Table 4, when the three pretrained models iBOT, DINO, and MoCo, all pretrained on ImageNet-1k, are utilized, the FAA gap on the ImageNet-R and CUB-200 datasets can reach up to 10%. Moreover, the FAA and CAA of the CLIP model are over 10% higher than those of the other self-supervised models. On the CUB-200 dataset, CLIP outperforms the other models, reaching 78% FAA and 82% CAA. This performance indicates that CLIP’s pretraining paradigm is well suited for continual learning.

4.3.2. Impact of Pretrained Datasets

The choice of pretraining dataset may influence the performance of prompt-based continual learning methods. As shown in Table 5, the performance difference between the iBOT models pretrained on ImageNet-21k and ImageNet-1k is approximately 4% on the CUB-200 dataset. Similarly, for the three CLIP models pretrained on different datasets (see Appendix A.1), the gaps in FAA and CAA on the CUB-200 dataset are close to 2%. Overall, the experimental results reveal that different pretraining datasets indeed have a significant impact on the performance of prompt-based continual learning methods. Therefore, when comparing different models, it is imperative to use the same pretraining datasets to ensure a fair and accurate assessment of their capabilities.

4.3.3. Impact of Classifiers

To assess the impact of classifiers on continual learning methods, we conducted ablation experiments comparing the Softmax classifier and the C-NCM classifier. As shown in Figure 5a–c, the Softmax classifier suffers from severe catastrophic forgetting in the absence of replay, whereas the C-NCM classifier achieves performance nearly comparable to that of the Softmax classifier with replay, even without relying on replay. This demonstrates that the prototype-based nature of C-NCM enables more stable decision boundaries, reducing the dependency on replay buffers for knowledge retention.
To further analyze the advantages of the C-NCM classifier over the Softmax classifier, we conducted a comparative study under identical hardware conditions using an NVIDIA GeForce RTX 4090 GPU. Specifically, we measured the training time per task required for each method to achieve optimal performance across different task sequences. As illustrated in Figure 5d–f, the Softmax classifier with replay incurs significantly higher training costs, as it must continuously revisit past samples to mitigate forgetting. In contrast, the C-NCM classifier achieves comparable or superior performance without replay, significantly reducing training overhead. As the task sequence expands, the Softmax classifier’s reliance on replay leads to a continual increase in training time, whereas the C-NCM classifier remains computationally efficient, since only the classifier parameters related to the current task need updating, eliminating the need for revisiting past data.
These results highlight the dual benefits of the C-NCM classifier: it not only mitigates catastrophic forgetting but also substantially improves training efficiency by reducing computational overhead. By eliminating the need for replay while maintaining competitive performance, C-NCM provides a more scalable and resource-efficient solution for continual learning. These findings further validate its effectiveness in real-world scenarios, where memory constraints and training efficiency are critical considerations.

4.3.4. Hyperparameters for Prompt

The length of a single prompt, which dictates its capacity to encode specific facets of task knowledge, represents a critical hyperparameter in prompt-based continual learning methods. To investigate this, we conducted ablation experiments on the prompt length, denoted as L-N. As demonstrated in Table 6, our method exhibits robust performance, indicating that even concise prompts can effectively capture essential task knowledge.

4.3.5. Effectiveness of Text Guide Component

To evaluate the effectiveness of the text guide component in the IT-Prompt method, we divided the CIFAR-100 dataset into three task sequences, containing 10, 20, and 50 tasks, respectively. Comparative experiments, both with and without the text guide, are depicted in Figure 6. The text guide component significantly enhanced performance in both the FAA and CAA metrics across these task sequences. This demonstrates the component’s adaptability to different classifiers and task sequences.

5. Conclusions and Future Work

In this paper, we introduce IT-Prompt, a novel prompt-based continual learning framework designed for pretrained CLIP models. This framework capitalizes on the synergy between image and text modalities to enhance the models’ knowledge base while preserving their zero-shot capabilities and mitigating catastrophic forgetting. Furthermore, we have developed the C-NCM classifier, which obviates the need for storing and retraining samples from old tasks, thereby significantly enhancing training efficiency and reducing resource consumption. Experimental results indicate that our method consistently surpasses existing state-of-the-art approaches across multiple benchmark datasets, demonstrating superior adaptability and robustness in managing dynamic and evolving tasks. These findings underscore the potential of IT-Prompt and the C-NCM classifier to revolutionize continual learning methodologies, offering an efficient and scalable solution for forthcoming challenges.
However, IT-Prompt may still face limitations when dealing with highly imbalanced class distributions or domain shifts, as it relies on pre-trained CLIP representations. In incremental learning scenarios with more diverse and heterogeneous data, the fixed nature of CLIP’s pre-trained feature space may struggle to accommodate unseen distributions, leading to suboptimal generalization and increased task interference. Future work could explore adaptive prompt tuning strategies that dynamically adjust to varying data distributions and more robust feature alignment techniques to ensure stability across diverse continual learning settings. Additionally, integrating meta-learning or self-supervised adaptation mechanisms could further enhance IT-Prompt’s ability to handle evolving and diverse data streams more effectively.

Author Contributions

Conceptualization, L.J. and W.F.; methodology, L.J.; validation, L.J.; formal analysis, L.J. and W.F.; investigation, L.J. and X.C.; data curation, L.J. and X.C.; writing—original draft preparation, L.J.; writing—review and editing, L.J. and W.F.; visualization, X.C.; supervision, W.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Fund of China under grant 24BG143.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 10 March 2023); TinyImageNet: https://huggingface.co/datasets/zh-plus/tiny-imagenet/tree/main (accessed on 10 March 2023); ImageNet-R: https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar (accessed on 10 March 2023); CUB-200: https://data.caltech.edu/records/20098 (accessed on 10 March 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Implementation Details

In this section, we describe the implementation details of all experiments.

Appendix A.1. Pretrained Models

For the Section “Impact of Pretrained Datasets”, the CLIP model was pretrained on three different datasets, namely WIT, DataComp, and ImageNet-1k. The other pretrained models, such as iBOT, were pretrained on ImageNet-21k and ImageNet-1k, while DINO and MoCo were pretrained solely on ImageNet-1k. In the other sections, the CLIP model used was pretrained specifically on the ImageNet-1k dataset.

Appendix A.2. Training Regime

We employed a different number of training epochs for each dataset: 20 epochs on CIFAR-100 and 50 epochs on the remaining datasets. Adam is adopted as the optimizer with an initial learning rate of 0.03; the weight decay is 0 and the batch size is 24. On Tiny-ImageNet and CIFAR-100, n is set to 10; for CUB-200, n is set to 1; and for ImageNet-R, n is set to 5. Except for CIFAR-100, where the prompt length is set to 5, the prompt length is set to 20 for the remaining datasets. The number of prompts equals the task sequence length.
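As a reference, the optimizer setup implied by these hyperparameters might look like the following sketch; the split into trainable prompt and classifier parameters reflects our reading of the method, and the helper names are illustrative.

```python
import torch

def build_optimizer(prompt_params, classifier_params, lr=0.03, weight_decay=0.0):
    """Adam over the trainable parameters only (task-specific prompts and the lightweight
    classifier); the CLIP backbone stays frozen. Hyperparameters follow Appendix A.2."""
    trainable = list(prompt_params) + list(classifier_params)
    return torch.optim.Adam(trainable, lr=lr, weight_decay=weight_decay)

# Batch size 24, as stated above:
# loader = torch.utils.data.DataLoader(train_set, batch_size=24, shuffle=True)
```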

References

  1. Zhou, D.W.; Wang, Q.W.; Qi, Z.H.; Ye, H.J.; Zhan, D.C.; Liu, Z. Deep class-incremental learning: A survey. arXiv 2023, arXiv:2302.03648. [Google Scholar]
  2. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 139–149. [Google Scholar]
  3. Garg, S.; Farajtabar, M.; Pouransari, H.; Vemulapalli, R.; Mehta, S.; Tuzel, O.; Shankar, V.; Faghri, F. Tic-clip: Continual training of clip models. arXiv 2023, arXiv:2310.16226. [Google Scholar]
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  5. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  6. Ke, Z.; Liu, B. Continual learning of natural language processing tasks: A survey. arXiv 2022, arXiv:2211.12701. [Google Scholar]
  7. Lee, K.Y.; Zhong, Y.; Wang, Y.X. Do pre-trained models benefit equally in continual learning? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6485–6493. [Google Scholar]
  8. Hayes, T.L.; Krishnan, G.P.; Bazhenov, M.; Siegelmann, H.T.; Sejnowski, T.J.; Kanan, C. Replay in deep learning: Current approaches and missing biological elements. Neural Comput. 2021, 33, 2908–2950. [Google Scholar] [CrossRef] [PubMed]
  9. Mai, Z.; Li, R.; Kim, H.; Sanner, S. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3589–3599. [Google Scholar]
  10. Pellegrini, L.; Graffieti, G.; Lomonaco, V.; Maltoni, D. Latent replay for real-time continual learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10203–10209. [Google Scholar]
  11. Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; Wayne, G. Experience replay for continual learning. In Advances in Neural Information Processing Systems, Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  12. Hayes, T.L.; Kanan, C. Selective replay enhances learning in online continual analogical reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3502–3512. [Google Scholar]
  13. Ho, S.; Liu, M.; Du, L.; Gao, L.; Xiang, Y. Prototype-guided memory replay for continual learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10973–10983. [Google Scholar] [CrossRef] [PubMed]
  14. Ding, Y.; Liu, L.; Tian, C.; Yang, J.; Ding, H. Don’t stop learning: Towards continual learning for the clip model. arXiv 2022, arXiv:2207.09248. [Google Scholar]
  15. Zheng, Z.; Ma, M.; Wang, K.; Qin, Z.; Yue, X.; You, Y. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 19125–19136. [Google Scholar]
  16. Tang, Y.M.; Peng, Y.X.; Zheng, W.S. When prompt-based incremental learning does not meet strong pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 1706–1716. [Google Scholar]
  17. Wang, R.; Duan, X.; Kang, G.; Liu, J.; Lin, S.; Xu, S.; Lü, J.; Zhang, B. Attriclip: A non-incremental learner for incremental knowledge learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3654–3663. [Google Scholar]
  18. Wang, L.; Xie, J.; Zhang, X.; Huang, M.; Su, H.; Zhu, J. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. In Advances in Neural Information Processing Systems, Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2024; Curran Associates Inc.: Red Hook, NY, USA, 2024; Volume 36. [Google Scholar]
  19. Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 631–648. [Google Scholar]
  20. Khan, M.G.Z.A.; Naeem, M.F.; Van Gool, L.; Stricker, D.; Tombari, F.; Afzal, M.Z. Introducing language guidance in prompt-based continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 11463–11473. [Google Scholar]
  21. Smith, J.S.; Karlinsky, L.; Gutta, V.; Cascante-Bonilla, P.; Kim, D.; Arbelle, A.; Panda, R.; Feris, R.; Kira, Z. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Paris, France, 4–6 October 2023; pp. 11909–11919. [Google Scholar]
  22. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  23. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  24. Shi, Y.; Zhou, K.; Liang, J.; Jiang, Z.; Feng, J.; Torr, P.H.; Bai, S.; Tan, V.Y. Mimicking the oracle: An initial phase decorrelation approach for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16722–16731. [Google Scholar]
  25. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 139–154. [Google Scholar]
  26. Zeng, G.; Chen, Y.; Cui, B.; Yu, S. Continual learning of context-dependent processing in neural networks. Nat. Mach. Intell. 2019, 1, 364–372. [Google Scholar] [CrossRef]
  27. Zhao, H.; Fu, Y.; Kang, M.; Tian, Q.; Wu, F.; Li, X. MgSvF: Multi-Grained Slow versus Fast Framework for Few-Shot Class-Incremental Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 46, 1576–1588. [Google Scholar] [CrossRef] [PubMed]
  28. Arani, E.; Sarfraz, F.; Zonooz, B. Learning fast, learning slow: A general continual learning method based on complementary learning system. arXiv 2022, arXiv:2201.12604. [Google Scholar]
  29. Van De Ven, G.M.; Li, Z.; Tolias, A.S. Class-incremental learning with generative classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3611–3620. [Google Scholar]
  30. Zhou, D.W.; Wang, F.Y.; Ye, H.J.; Ma, L.; Pu, S.; Zhan, D.C. Forward compatible few-shot class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9046–9056. [Google Scholar]
  31. Wang, F.Y.; Zhou, D.W.; Liu, L.; Ye, H.J.; Bian, Y.; Zhan, D.C.; Zhao, P. Beef: Bi-compatible class-incremental learning via energy-based expansion and fusion. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  32. Li, X.; Zhou, Y.; Wu, T.; Socher, R.; Xiong, C. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3925–3934. [Google Scholar]
  33. Konishi, T.; Kurokawa, M.; Ono, C.; Ke, Z.; Kim, G.; Liu, B. Parameter-level soft-masking for continual learning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 17492–17505. [Google Scholar]
  34. Van de Ven, G.M.; Siegelmann, H.T.; Tolias, A.S. Brain-inspired replay for continual learning with artificial neural networks. Nat. Commun. 2020, 11, 4069. [Google Scholar] [CrossRef] [PubMed]
  35. Van de Ven, G.M.; Tolias, A.S. Generative replay with feedback connections as a general strategy for continual learning. arXiv 2018, arXiv:1809.10635. [Google Scholar]
  36. Bagus, B.; Gepperth, A. An investigation of replay-based approaches for continual learning. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–9. [Google Scholar]
  37. Tiwari, R.; Killamsetty, K.; Iyer, R.; Shenoy, P. Gcr: Gradient coreset based replay buffer selection for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 99–108. [Google Scholar]
  38. Wang, Y.; Huang, Z.; Hong, X. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. In Advances in Neural Information Processing Systems, Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 5682–5695. [Google Scholar]
  39. Thengane, V.; Khan, S.; Hayat, M.; Khan, F. Clip model is an efficient continual learner. arXiv 2022, arXiv:2210.03114. [Google Scholar]
  40. Jha, S.; Gong, D.; Yao, L. Clap4clip: Continual learning with probabilistic finetuning for vision-language models. In Advances in Neural Information Processing Systems, Proceedings of the 39th Annual Conference on Neural Information Processing Systems, Philadelphia, PA, USA, 25 February–4 March 2025; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2025; Volume 37, pp. 129146–129186. [Google Scholar]
  41. Yu, J.; Zhuge, Y.; Zhang, L.; Hu, P.; Wang, D.; Lu, H.; He, Y. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 23219–23230. [Google Scholar]
  42. Pan, Y.; Yuan, Z.; Wu, X.; Li, Z.; Xu, C. TMM-CLIP: Task-guided Multi-Modal Alignment for Rehearsal-Free Class Incremental Learning. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, Auckland, New Zealand, 3–6 December 2024; pp. 1–7. [Google Scholar]
  43. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 831–839. [Google Scholar]
  44. Van de Ven, G.M.; Tolias, A.S. Three scenarios for continual learning. arXiv 2019, arXiv:1904.07734. [Google Scholar]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  46. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021. [Google Scholar] [CrossRef]
  47. Lin, H.; Zhang, B.; Feng, S.; Li, X.; Ye, Y. PCR: Proxy-based contrastive replay for online class-incremental continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24246–24255. [Google Scholar]
  48. Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8340–8349. [Google Scholar]
  49. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  50. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  51. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. ibot: Image bert pre-training with online tokenizer. arXiv 2021, arXiv:2111.07832. [Google Scholar]
  52. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  53. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9640–9649. [Google Scholar]
Figure 1. For class incremental learning, the process is divided into two phases: a training phase and a testing phase. The model is trained solely on the current task’s category-specific data during training and evaluated comprehensively across all previous and current task categories during testing.
Figure 2. Illustration of the proposed IT-Prompt framework. Frozen components are indicated by a green background and a lock symbol, whereas the trainable parts are denoted by the unlock symbol. A task-specific prompt is concatenated with the current-task image and fed into the image encoder, generating prompt-guided image features. Simultaneously, a text prompt is constructed using the class names of all learned categories and fed into the text encoder, generating the corresponding text features.
Figure 3. Illustration of prompt-based learning. Whenever a new task is learned, a prompt corresponding to the task is added to the prompt pool. The prompt of the current task is combined with the MSA module in the ViT architecture using Prefix Tuning.
Figure 4. The t-SNE visualization among our proposed IT-Prompt and baseline methods (ZSCL, LwF-VR, and iCaRL) on the CIFAR100 dataset (10 randomly selected classes).
Figure 5. Performance of the two different classifiers (a–c). Comparison of training time per task between the C-NCM classifier and the Softmax classifier with replay (d–f). “Replay” indicates that the classifier is retrained on the stored samples from old tasks.
Figure 6. Effectiveness of the text guide component. The FAA performance of different experiments (left y-axis). The CAA performance of different experiments (right y-axis). The gain from the text guide is highlighted in orange. Performance on CIFAR100 with 10, 20, 50 steps.
Table 1. Overall performance of CLIP-based continual learning on CIFAR-100 and TinyImageNet with different task sequences. We present the final average accuracy (FAA↑) and cumulative average accuracy (CAA↑), reported as FAA/CAA. Bold values indicate the best-performing metrics in each comparison.

Method | CIFAR-100 T10 | CIFAR-100 T20 | CIFAR-100 T50 | TinyImageNet T5 | TinyImageNet T10 | TinyImageNet T20
Continual-CLIP | 66.72/75.17 | 66.72/75.95 | 66.72/76.49 | 66.43/70.49 | 66.43/70.55 | 66.43/70.51
LwF-VR | 70.75/78.81 | 63.54/74.54 | 59.45/71.02 | 70.89/77.56 | 67.05/74.12 | 63.89/69.94
ZSCL | 73.65/82.15 | 69.58/80.39 | 67.36/79.92 | 73.57/80.27 | 71.62/78.61 | 68.30/77.18
AttriCLIP | 81.40/– | – | – | – | – | –
IT-Prompt (Softmax+Replay) | 88.41/91.98 | 88.05/92.32 | 85.88/91.79 | 82.11/84.80 | 81.72/84.63 | 77.64/82.45
IT-Prompt (C-NCM) | 87.01/90.88 | 88.15/92.82 | 86.08/92.09 | 81.01/83.90 | 82.22/85.03 | 78.04/83.34
Table 2. Overall performance of CLIP-based continual learning on ImageNet-R and CUB-200 with different task sequences. We present the final average accuracy (FAA↑) and cumulative average accuracy (CAA↑), reported as FAA/CAA. Bold values indicate the best-performing metrics in each comparison.

Method | ImageNet-R T5 | ImageNet-R T10 | ImageNet-R T20 | CUB-200 T5 | CUB-200 T10 | CUB-200 T20
Continual-CLIP | 73.97/78.91 | 73.97/79.81 | 73.97/80.36 | 51.12/60.35 | 51.12/62.13 | 51.12/63.79
iCaRL | 80.67/86.46 | 78.03/84.66 | 73.70/81.74 | 58.37/70.46 | 48.39/64.57 | 49.48/64.52
LwF-VR | 80.35/86.93 | 76.47/84.41 | 75.57/81.25 | 58.84/70.36 | 48.69/63.23 | 48.88/63.85
ZSCL | 82.40/87.21 | 80.03/86.35 | 78.67/85.08 | 63.89/73.84 | 56.44/70.40 | 52.93/66.13
IT-Prompt (Softmax+Replay) | 84.12/84.89 | 84.84/86.94 | 84.76/85.77 | 81.00/83.25 | 75.54/80.16 | 74.72/79.75
IT-Prompt (C-NCM) | 82.22/82.99 | 83.44/85.89 | 85.66/85.87 | 80.07/82.77 | 78.20/80.36 | 77.66/79.15
Table 3. FFM and F1-score results on the CIFAR100-T10 and ImageNet-R-T10 datasets. Bold values indicate the best-performing metrics in each comparison.

Method | CIFAR100-T10 FFM | CIFAR100-T10 F1-Score | ImageNet-R-T10 FFM | ImageNet-R-T10 F1-Score
Continual-CLIP | 6.78 | 0.43 | 5.89 | 0.49
iCaRL | 5.66 | 0.52 | 3.32 | 0.61
LwF-VR | 5.23 | 0.56 | 3.12 | 0.65
ZSCL | 4.28 | 0.63 | 2.37 | 0.71
IT-Prompt | 3.16 | 0.84 | 1.58 | 0.79
Table 4. Overall performance of continual learning with different pretraining paradigms, reported as FAA/CAA. Bold values indicate the best-performing metrics in each comparison.

PTM | ImageNet-R-T10 | CUB-200-T10
iBOT | 71.33/73.62 | 75.90/78.44
DINO | 68.11/71.70 | 75.47/78.68
MoCo | 63.77/68.26 | 61.52/67.51
CLIP | 83.11/84.72 | 76.37/81.11
Table 5. Overall performance of continual learning with different pretraining datasets, reported as FAA/CAA. Bold values indicate the best-performing metrics in each comparison.

PTM | ImageNet-R-T10 | CUB-200-T10
iBOT-21k | 70.83/73.23 | 71.64/75.26
iBOT-1k | 71.33/73.62 | 75.90/78.44
CLIP-1k | 84.84/86.94 | 78.20/80.36
CLIP-DataComp | 84.63/86.95 | 77.28/79.37
CLIP-WIT | 83.11/84.72 | 77.37/80.12
Table 6. Overall performance of continual learning with different prompt lengths on CUB-200, reported as FAA/CAA. Bold values indicate the best-performing metrics in each comparison.

Prompt Length | CUB-200 T5 | CUB-200 T10 | CUB-200 T20
L-10 | 80.02/82.18 | 77.82/79.41 | 75.92/78.11
L-20 | 80.07/82.77 | 78.20/80.36 | 77.66/79.15
L-40 | 81.11/83.85 | 78.87/80.65 | 77.77/79.93

Share and Cite

Jiao, L.; Fu, W.; Chen, X. Image–Text (IT)-Prompt: Prompt-Based Learning Framework Empowered by the Cluster-Based Nearest Class Mean (C-NCM) for Rehearsal-Free Contrastive Language–Image Pretraining (CLIP)-Based Continual Learning. Appl. Sci. 2025, 15, 2966. https://doi.org/10.3390/app15062966