First published on Saturday, Apr 26, 2025 and last modified on Saturday, Apr 26, 2025 by François Chaplais.
University of Pisa, Largo Bruno Pontecorvo 3, 56127, Italy
University of Pisa, Largo Bruno Pontecorvo 3, 56127, Italy
Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, 10129, Italy
University of Auckland, 34 Princes Street, Auckland Central, Auckland 1010, New Zealand
Indian Institute of Technology, Main Gate Rd, IIT Area, Powai, Mumbai, Maharashtra 400076, India
University of Warwick, Coventry CV4 7AL, United Kingdom
University of Pisa, Largo Bruno Pontecorvo 3, 56127, Italy
Keywords: Parameter-Efficient Fine-Tuning, Continual Learning, Large Models
The emergence of large pre-trained networks has revolutionized the AI field, unlocking new possibilities and achieving unprecedented performance. However, these models inherit a fundamental limitation from traditional Machine Learning approaches: their strong dependence on the i.i.d. assumption hinders their adaptability to dynamic learning scenarios. We believe the next breakthrough in AI lies in enabling efficient adaptation to evolving environments, such as the real world, where new data and tasks arrive sequentially. This challenge defines the field of Continual Learning (CL), a Machine Learning paradigm focused on developing lifelong learning neural models. One alternative for efficiently adapting these large-scale models is known as Parameter-Efficient Fine-Tuning (PEFT). These methods adapt the model to particular data or scenarios by performing small and efficient modifications, achieving performance comparable to full fine-tuning. However, these techniques still lack the ability to adjust the model to multiple tasks continually, as they suffer from Catastrophic Forgetting. In this survey, we first provide an overview of CL algorithms and PEFT methods before reviewing the state of the art in Parameter-Efficient Continual Fine-Tuning (PECFT). We examine various approaches, discuss evaluation metrics, and explore potential future research directions. Our goal is to highlight the synergy between CL and PEFT, guide researchers in this field, and pave the way for novel future research directions.
Deep learning models achieve impressive results by scaling their parameters and the data used during training. However, these models often require substantial resources, which makes it impractical to update them continuously. This limitation is especially significant when resources are constrained. Continual Learning (CL) provides a solution by enabling these models to learn and adapt to new information while retaining previously acquired knowledge. Specifically, CL methods address the challenge of training a single model to learn and retain knowledge across a sequence of tasks over time, ensuring that previously learned information is not forgotten.
CL is crucial for Large Pre-Trained Models (PTMs), which are expensive to train from scratch and can suffer from catastrophic forgetting during traditional fine-tuning for new tasks. Parameter-Efficient Fine-Tuning (PEFT) mitigates this by strategically adjusting only a small subset of parameters in the pre-trained model, significantly reducing computational costs while achieving similar performance. When combined with CL, PEFT allows PTMs to continuously learn and integrate new information while retaining past knowledge, in a resource-efficient manner. This paves the way for a new generation of large PTMs that can continuously evolve and adapt without extensive retraining.
The intersection between CL and PEFT is a fascinating area of study. On one hand, CL focuses on learning from dynamic and evolving data streams, allowing models to adapt to new tasks without forgetting previously learned knowledge. This is a critical aspect of machine learning, as it enables models to stay relevant in the face of changing data landscapes. On the other hand, PEFT enhances pre-trained models by strategically adjusting a limited subset of parameters during fine-tuning.
The combination of these two principles gives rise to Parameter Efficient Continual Fine Tuning (PECFT). PECFT combines the adaptability of CL with the efficiency of PEFT, enabling models to progressively adapt to new tasks while retaining past knowledge. This convergence introduces a new paradigm for large PTMs, allowing continuous evolution without extensive retraining. Works like Learning to Prompt [1] and Dual Prompt [2] have already explored this idea of CL and PEFT. It represents an exciting convergence of efficiency and adaptability, two traits that are increasingly recognized as critical in the fast-paced, ever-evolving world of machine learning. By balancing these elements, PECFT paves the way for the development of more effective, adaptable, and sustainable machine learning models that are capable of meeting the complex and growing demands of the field.
This survey explores PECFT, aiming to provide a comprehensive overview of the area: the motivations behind CL and PEFT, how PECFT addresses limitations in traditional approaches, existing PECFT methods, a comparison of these methods, and future directions for PECFT development.
The organization of this paper is as follows: we introduce background concepts in Sec. 2, present a state-of-the-art and elaborated taxonomy of representative CL methods in Sec. 3, show an overview of current PEFT approaches in Sec. 4, describe the confluence of CL and PEFT in Sec. 5, and discuss future directions in Sec. 6. Finally, we conclude this paper in Sec. 7 by summarizing our findings.
Continual Learning, also known as incremental or lifelong learning [3, 4, 5, 6, 7], is a subfield of machine learning concerned with incrementally acquiring, updating, and retaining knowledge over a sequence of tasks presented over time. Unlike traditional machine learning, which works with static data distributions, continual learning tackles the challenge of learning in a dynamic and evolving environment where new data often come from different distributions. More formally, CL considers a stream of \( T \) tasks. Each task \( t \) comprises a new dataset \( D^t = (X^t,Y^t) \), where \( X^t \) denotes the input instances and \( Y^t \) denotes the instance labels. The objective is to train a model \( f_{\Theta}: X \longrightarrow Y \) using data from a sequence of \( T \) tasks: \( D = \{D^1, ..., D^T \} \), where each \( D^t \) can follow a different distribution. Here, \( \Theta \) are the learned parameters of the model. During each task, the model \( f_{\Theta} \) minimizes the objective function \( \mathcal{L} \) using data \( D^t \). Each task is presented sequentially to the model and trained for \( E \) epochs. The objective function is defined as follows:
\( \min_{\Theta} \; \mathbb{E}_{(x,y)\sim D^t} \left[ \mathcal{L}\left( f_{\Theta}(x), y \right) \right] \)   (1)
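As an illustration of this sequential setup, the following sketch (toy data and a simple least-squares model, invented here for illustration) trains one shared parameter vector over a sequence of tasks, each drawn from a different input distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(t, n=200, d=5):
    """Toy dataset D^t = (X^t, Y^t): each task draws inputs from a different distribution."""
    X = rng.normal(loc=t, scale=1.0, size=(n, d))  # input distribution shifts with t
    w_true = rng.normal(size=d)
    y = (X @ w_true > 0).astype(float)
    return X, y

def sgd_step(theta, X, y, lr=0.01):
    """One gradient step on a squared-error instance of L(f_Theta(x), y)."""
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    return theta - lr * grad

T, E = 3, 50                 # number of tasks and epochs per task
theta = np.zeros(5)          # the shared parameters Theta
for t in range(T):           # tasks arrive sequentially
    X, y = make_task(t)
    for _ in range(E):       # at task t, only D^t is available
        theta = sgd_step(theta, X, y)
```

Note that nothing in this naive loop protects performance on earlier tasks, which is exactly the gap CL methods address.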
Traditionally, machine learning models are trained on a finite and static data set, limiting their adaptability. Naively training a model on new data to perform well on novel tasks can degrade its performance on previously learned tasks. This problem is known as Catastrophic Forgetting (CF) [8] and is the core challenge of CL. Theoretically, consecutive re-training and adapting to new tasks can shift how features are represented within the model. This phenomenon, where the representation of features within a model evolves in a way that can negatively impact previous tasks, is known as Representation Drift [9, 10].
In CL, a model needs to learn the corresponding task(s) with no or limited access to old training samples and perform well on both new and previous sets. Because it must adapt to new tasks, a CL approach must strike a balance between plasticity, the model’s ability to adapt to new information, and stability, its capacity to retain past knowledge [11].
CL scenarios provide contexts for how the model should learn and adapt to a continuous stream of data over time; these scenarios are often categorized based on how information is presented and how the model is expected to perform. Table 1 presents an overview of such scenarios, with their main properties.
| Scenario | Description | Key Features |
| --- | --- | --- |
| Task Incremental Learning (TIL) | Each task has unique data distributions and non-overlapping labels, with \( D^t = (X^t, Y^t, t)\) | • Task IDs available in testing • Non-overlapping class sets between tasks |
| Class Incremental Learning (CIL) | Similar to TIL in training, but must predict among all classes without task information | • No task IDs in testing • Must classify across all previously seen classes |
| Domain Incremental Learning (DIL) | Same classes across different domains, with \( D^t = (X^t, Y, t)\) | • Shared label space • Different data distributions • No task IDs in testing |
| Online Continual Learning (OCL) | Real-time learning from a continuous data stream, with \( D = (X_t, Y_t)\) | • Single-pass learning • Immediate data disposal • Streaming setup |
| Class-Incremental with Repetition (CIR) | Allows both new classes and repetition of previously seen ones | • Natural class recurrence • Instance repetition • Varying frequencies |
| Rainbow Memory (RM) | Addresses “blurry” task boundaries with shared classes | • Diversity-aware memory • Enhanced augmentation • Uncertainty-based selection |
Some scenarios provide task identifiers during testing, allowing the model to know which task it is performing. One such scenario is Task Incremental Learning (TIL). In TIL, the model is trained on a sequence of tasks \( T \), where each task has a unique dataset and non-overlapping labels, represented as \( D^t = (X^t, Y^t, t) \) for \( t \) in \( T \), where \( p(X_i) \neq p(X_j) \) and \( Y_i \cap Y_j = \emptyset \) for \( i \neq j \). Task-specific information \( t \) is available during training and testing, represented as \( p(X^t) \) for \( t \) in \( T \) [12].
A second scenario is Class Incremental Learning (CIL), which, while similar to TIL in training, is more challenging at inference. During testing, the specific task information is unavailable, so the model must predict among all observed classes, not only those of task \( t \). This lack of task information requires the model to differentiate and classify all previously learned classes without additional context, making CIL significantly more challenging.
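The practical difference between TIL and CIL shows up at prediction time. A minimal sketch (toy logits and class splits, invented for illustration):

```python
import numpy as np

# Toy logits over all classes seen after three tasks of two classes each.
logits = np.array([0.1, 2.0, 1.5, 0.3, 0.9, 0.2])
task_classes = {0: [0, 1], 1: [2, 3], 2: [4, 5]}  # class set of each task

def predict_til(logits, task_id):
    """TIL: the task id is known at test time, so prediction is restricted
    to that task's classes."""
    cls = task_classes[task_id]
    return cls[int(np.argmax(logits[cls]))]

def predict_cil(logits):
    """CIL: no task id at test time, so prediction ranges over every class
    observed so far."""
    return int(np.argmax(logits))
```

With the task id, the TIL prediction for task 1 is restricted to classes {2, 3}; without it, the CIL prediction ranges over all six classes, which is why CIL is the harder setting.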
Domain Incremental Learning (DIL) [13] instead refers to the scenario where a model learns to adapt to a sequence of domains over time, each one representing a variation of the same task . The key challenge in DIL is that the model must perform well on new domains without forgetting what it has learned from previous ones, often without knowing which domain it is dealing with during testing. Formally, we can represent this as \( D^t = (X^t, Y, t)\) for \( t\) in \( T\) , where \( p(X_i) \neq p(X_j)\) for \( i \neq j\) , but \( Y\) remains constant across all tasks. Notably, \( Y_i = Y_j\) for all \( i, j\) in \( T\) . In DIL, task identities are not required during inference, which distinguishes it from TIL. This scenario is particularly relevant in applications where the same set of classes needs to be recognized across varying domains or conditions, such as object recognition under different lighting or environmental contexts [14].
Online Continual Learning (OCL) [15] is a CL scenario that simulates real-time learning by processing a continuous stream of data with temporally shifting distributions. In this approach, the model learns directly from the incoming data, adapting to changes over time while storing only a minimal amount of information from the stream. OCL presents a dynamic scenario where tasks have disjoint data label spaces and training samples arrive as a continuous, one-pass data stream. We can formulate this as \( D = (X_t, Y_t)\) for \( t\) in \( \mathbb{N}\) , where \( Y_i \cap Y_j = \emptyset\) for \( i \neq j\) . In OCL, the model must learn from each sample only once, as it becomes immediately unavailable after being processed. This constraint simulates real-world scenarios where data cannot be stored or revisited due to privacy concerns or storage limitations. The challenge in OCL lies in the model’s ability to continuously and quickly adapt to new classes while maintaining performance on previously learned ones, all within the constraints of a single-pass learning paradigm. This learning scenario is particularly applicable in systems that must adapt in real-time to evolving environments or user preferences, such as online recommendation systems.
While TIL, CIL, DIL and OCL scenarios dominate much of the continual learning research, less frequently explored scenarios offer valuable insights into more realistic learning environments. Two such scenarios are Class-Incremental Learning with Repetition and Rainbow Memory.
Class-Incremental Learning with Repetition (CIR) [16] represents a more flexible and realistic scenario in CL, where both the introduction of new classes and the repetition of previously seen classes are allowed. In CIR, the model \( f_\theta \), with \( \theta \) representing its parameters, learns from a stream of \( N \) experiences \( S = \{e_1, e_2, ..., e_N\} \), where each experience \( e_i \) brings a dataset of examples \( D_{e_i} = \{X_i, Y_i\} \). Unlike CIL scenarios, where \( Y_i \cap Y_j = \emptyset \) for \( i \neq j \), or DIL scenarios, where \( Y_1 = ... = Y_N = Y \), CIR allows \( |Y_i \cap Y_j| \geq 0 \). This flexibility enables both instance repetition (\( |X_i \cap X_j| \geq 0 \)) and concept repetition. Importantly, in CIR, repetition is a property of the environment and cannot be controlled by the CL agent, which distinguishes it from structured Replay strategies. CIR streams can be generated using methods like the Slot-Based Generator (\( G_{slot} \)) or the Sampling-Based Generator (\( G_{samp} \)), which allow for the creation of customized streams with varying degrees of repetition.
Unlike CIL or DIL scenarios, CIR better mimics real-world data streams where concepts naturally reoccur over time with varying frequencies. This property makes CIR particularly important for several reasons: (i) it challenges CL algorithms to balance stability and plasticity more dynamically, as they must retain knowledge of infrequent classes while adapting to frequent ones; (ii) CIR allows for the study of knowledge accumulation and refinement over time, which is critical for long-term learning systems. Potential applications of CIR are numerous and diverse. In computer vision, CIR could be applied to object recognition systems in dynamic environments, such as autonomous vehicles or surveillance systems, where particular objects may appear more frequently than others, but all must be recognized accurately.
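A stream with repetition can be sketched in the spirit of a sampling-based generator such as \( G_{samp} \); the per-class occurrence probabilities below are invented for illustration:

```python
import random

random.seed(0)

classes = list(range(10))
# Hypothetical per-class occurrence probabilities: a few classes recur often,
# the rest appear only occasionally (these numbers are illustrative).
p_occur = [0.8 if c < 3 else 0.3 for c in classes]

def sample_experience():
    """Draw the class set of one experience e_i; classes may reappear later."""
    present = [c for c in classes if random.random() < p_occur[c]]
    return present or [random.choice(classes)]  # never an empty experience

stream = [sample_experience() for _ in range(5)]
# Unlike disjoint CIL, the same class can occur in several experiences,
# with a class-dependent frequency the learner does not control.
```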
Rainbow Memory (RM) [17] is another CL scenario that addresses the challenges of more realistic, “blurry” task boundaries. In real-world scenarios, new tasks often share classes with previous tasks, creating a continuum rather than distinct, disjoint task boundaries. This mixed setup is more challenging and practical than the traditional disjoint CL scenario. RM focuses on two key strategies: a diversity-aware memory update and data augmentation to enhance the diversity of the stored samples.
Diversity-aware memory update: RM selects samples for episodic memory based on their classification uncertainty. This uncertainty is estimated using perturbed versions of the samples:
\( u(x) = 1 - \frac{1}{T} \max_{c} S_c \)   (2)
where \( u(x)\) is the uncertainty of sample \( x\) , \( T\) is the number of perturbations, and \( S_c\) is the number of times class \( c\) is predicted as the top class across perturbations. RM then selects samples across the uncertainty spectrum, ensuring a diverse representation that includes both easily classifiable and boundary samples.
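The uncertainty measure described above can be sketched directly; the toy predictions are invented for illustration:

```python
import numpy as np

def rm_uncertainty(pred_classes, num_perturbations):
    """u(x) = 1 - (1/T) * max_c S_c: S_c counts how often class c is the
    top prediction across T perturbed copies of the sample."""
    counts = np.bincount(pred_classes)
    return 1.0 - counts.max() / num_perturbations

# Top-1 predictions over 5 perturbed versions of a single sample:
preds = np.array([2, 2, 2, 1, 2])
u = rm_uncertainty(preds, num_perturbations=5)  # 1 - 4/5 = 0.2
```

Samples whose predictions flip under perturbation score high and sit near class boundaries; sampling across the whole uncertainty range yields the diverse memory RM aims for.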
The relevance of RM lies in its ability to address more realistic CL scenarios. By maintaining a diverse memory of past tasks and using augmentation to enhance this diversity, RM is better equipped to handle the gradual shift in class distributions in real-world applications. This approach significantly mitigates CF.
Large Pre-trained Models are massive neural networks trained on vast amounts of data to learn general-purpose representations [19, 20]. These models, predominantly based on the Transformer architectures [21], have significantly altered the machine learning landscape, particularly in natural language processing and increasingly in computer vision and multimodal applications.
The “pre-training” in PTMs refers to the initial training phase where the model learns to perform tasks on large corpora, such as predicting masked words or generating coherent text. This process imbues the model with a broad understanding of language structures, world knowledge, and even rudimentary reasoning capabilities [22]. The “large” aspect denotes the model’s size in terms of parameters and the scale of data and computation involved in their training [23].
A key advantage of PTMs is their capacity to learn robust representations that serve as excellent starting points for a wide range of downstream tasks, often requiring minimal task-specific training data [24]. In practice, pre-trained models are adapted to specific tasks through fine-tuning or by training new classifier layers, making them versatile tools across various domains.
Key examples of large PTMs showcase their diverse capabilities.
These models have pushed the boundaries of artificial intelligence, exhibiting capabilities that range from human-like text generation to solving complex reasoning tasks, often with little to no task-specific training [26]. Their effectiveness in capturing knowledge from large volumes of labeled and unlabeled data, combined with their adaptability, has made PTMs crucial in multiple research and application areas.
The fundamental architecture commonly used with large PTMs is the Transformer [21], built around the self-attention mechanism. Given an input sequence \( X = \{x_1, x_2, \ldots, x_n\} \), where \( x_i \) represents the \( i \)-th token or patch, the Transformer module computes attention scores using queries \( Q \), keys \( K \), and values \( V \), which are linear projections of \( X \). The self-attention mechanism is defined as:
\( \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^{T}}{\sqrt{d_k}} \right) V \)   (3)
where \( d_k \) is the dimensionality of the keys. This attention allows the model to weigh the importance of each token/patch relative to others, capturing contextual relationships efficiently. This architectural component makes language models excel in natural language understanding and generation by employing multiple layers of self-attention and feed-forward networks. These models are trained to generate text by predicting the next token or recovering a masked value in a sequence, maximizing the probability \( P(x_{i+1} | x_1, x_2, \ldots, x_i) \).
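The scaled dot-product attention above can be sketched in NumPy (single head, random toy projections, no masking):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise token-to-token scores
    return softmax(scores) @ V        # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)            # shape (4, 8)
```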
Vision models also exist, e.g., ViT [25], that apply the Transformer architecture to image data. An image is divided into patches, which are flattened and linearly embedded into a sequence of vectors. Given an image \( I \) divided into \( N \) patches, each patch \( p_i \) is embedded into a vector \( e_i \). The sequence of embedded patches \( E = \{e_1, e_2, \ldots, e_N\} \) is then processed through the Transformer architecture similarly to text tokens, allowing the model to capture spatial relationships within the image.
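The patch-embedding step can be sketched as follows (a dummy image and zero-initialized projection, for illustration only):

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            patches.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches)  # shape (N, patch * patch * C)

img = np.zeros((32, 32, 3))            # a dummy 32 x 32 RGB image
tokens = patchify(img, patch=8)        # N = (32 / 8)^2 = 16 patches
W_embed = np.zeros((8 * 8 * 3, 64))    # linear projection to model dim 64
E = tokens @ W_embed                   # the sequence {e_1, ..., e_N}
```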
Although PTMs have demonstrated remarkable success across various tasks, they have some limitations. One significant challenge is the domain gap between pre-training and target task data. Pre-trained models are typically trained on large-scale datasets; however, in many cases, these datasets will not fully represent the specific characteristics of the target task. This domain gap can lead to performance degradation when deploying pre-trained models directly to new tasks or domains [27].
One way of alleviating this problem is to employ Transfer Learning strategies [28]. Transfer learning involves leveraging knowledge from a source domain to improve performance on a target domain or task. By fine-tuning the model parameters on a smaller dataset related to the target task, the model can adapt to the specific characteristics of the task while using the pre-trained weights as a good starting point. Parameter-Efficient Fine-Tuning has emerged as a significant advancement in large PTM studies. It refines the capabilities of pre-trained models by strategically adjusting a limited subset of parameters during fine-tuning. This approach differs significantly from comprehensive fine-tuning, which requires the entire model to be retrained, often at considerable computational cost. PEFT achieves impressive results while minimizing parameter updates, ensuring efficient adaptation and preserving the model’s previously acquired knowledge [29].
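As a sketch of the parameter-efficiency idea, in the spirit of low-rank adapters (the dimensions and structure are illustrative assumptions, not a specific published recipe), the large pre-trained weight matrix is frozen and only a small low-rank correction is trained:

```python
import numpy as np

# The large pre-trained weight W is frozen; only the low-rank factors A and B
# (hypothetical adapter parameters) would be updated during fine-tuning.
d, r = 512, 4                                       # model dim and bottleneck rank
W = np.random.default_rng(0).normal(size=(d, d))    # frozen pre-trained weights
A = np.zeros((r, d))                                # trainable: r * d parameters
B = np.zeros((d, r))                                # trainable: d * r parameters

def forward(x):
    """Frozen path plus a tiny trainable correction."""
    return x @ (W + B @ A).T

full = W.size                    # parameters touched by full fine-tuning
peft = A.size + B.size           # parameters touched by the adapter
ratio = peft / full              # 2 * r / d, about 1.6% here
```

Only a small fraction of the weights ever changes, which is what keeps the adaptation cheap and the pre-trained knowledge largely intact.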
Continual learning is learning from dynamic data distributions arriving in sequence. As shown in Figure 1, CL is generally divided into four categories: regularization-based, replay-based, optimization-based, and architecture-based approaches, represented respectively in green, blue, yellow, and red. It is important to note that some methods can belong to multiple categories. The following is a brief introduction to each category; for more detail, see previous work that explains the differences in these types of approaches [14, 30, 31].
One alternative to naively training a model on a sequence of tasks is adding a constraint to the loss function to encourage the mitigation of forgetting. This regularization can be imposed on weights by estimating the parameters’ importance so that relevant weights do not drift significantly. One of the first approaches to propose this was Elastic Weight Consolidation (EWC) [32], which captures the prior importance using a diagonally approximated Fisher information matrix. EWC was later improved by finding a better approximation of the Fisher information matrix (sFIM).
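The EWC penalty can be sketched with its diagonal importance weights (toy parameter values, invented for illustration):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam):
    """EWC regularizer: (lam / 2) * sum_i F_i (theta_i - theta*_i)^2,
    where F_i is the diagonal Fisher importance of parameter i."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])   # parameters after the old task
fisher = np.array([10.0, 0.1, 1.0])       # important, unimportant, moderate
theta = np.array([1.1, -1.0, 0.5])        # current parameters

# A small drift on the important parameter costs as much as a large drift
# on the unimportant one, which is what protects old-task knowledge.
penalty = ewc_penalty(theta, theta_star, fisher, lam=1.0)
```

This penalty is simply added to the new task's loss, so gradient descent trades off new-task fit against drift on weights the old task relied on.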
Regularization can also be imposed on activations to prevent activation drift, which generally outperforms its weight-regularization counterpart. One work in this line is Learning without Forgetting (LwF) [33], which prevents activations of the old network from drifting while learning new tasks. Less-forgetting learning penalizes the difference in activations, except at the fully-connected layer. Riemannian Walk [34] extends EWC by incorporating KL-divergence-based regularization, path integral methods to assess parameter importance, and sample-based strategies for retaining past knowledge. Rotated EWC (R-EWC) [35] enhances EWC by reparameterizing the network to better align with the FIM, improving the diagonal approximation and reducing forgetting.
Similarly, Synaptic Intelligence (SI) [36] evaluates the importance of each synapse online during training and consolidates significant parameters by penalizing considerable changes to mitigate forgetting. Memory Aware Synapses (MAS) [37] determines the importance of weights based on the sensitivity of the learned function, enabling adaptive penalization of changes to significant weights without relying solely on the loss function. Efforts have also been directed towards enhancing the implementation of secondary penalties. Incremental Moment Matching (IMM) [38] integrates multiple models trained on different tasks by aligning their weight distributions, employing weight transfer and L2 regularization to sustain performance across tasks.
Function regularization, also known as Knowledge Distillation (KD) [39], instead targets the intermediate or final output of the prediction function. This approach typically employs the previously-learned model as the teacher and the currently-trained model as the student, leveraging knowledge distillation techniques to mitigate catastrophic forgetting. In ideal scenarios, KD would target all old training samples, but in continual learning settings, alternatives such as new training samples, a small fraction of old training samples, external unlabeled data, or generated data are used, albeit suffering from varying degrees of distribution shift.
Pioneering works like iCaRL [40] learn new training samples while utilizing predictions from the output heads of the old tasks to compute the distillation loss. LwM [41] exploits the attention maps of new training samples for KD, while EBLL [7] learns task-specific autoencoders to prevent changes in feature reconstruction. GD [42] further distills knowledge from the large stream of unlabeled data available.
Some works add three subgroups [30]: Logit Distillation, Feature Distillation, and Relational Distillation. Logit distillation focuses on transferring knowledge by aligning the final output logits of the neural network between the old and new models. LwF [33] is a pioneering application of KD in CL, guiding the learning of the new model by utilizing the outputs of the old model.
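A common formulation of logit distillation, cross-entropy against temperature-softened teacher outputs (exact losses vary across papers; the logits below are invented), can be sketched as:

```python
import numpy as np

def softened_probs(logits, T=2.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the old (teacher) and new (student) soft outputs."""
    p = softened_probs(teacher_logits, T)  # old model's soft targets
    q = softened_probs(student_logits, T)
    return -np.sum(p * np.log(q + 1e-12))

teacher = np.array([3.0, 1.0, 0.2])
loss_same = distill_loss(np.array([3.0, 1.0, 0.2]), teacher)  # student matches
loss_far = distill_loss(np.array([0.0, 0.0, 5.0]), teacher)   # student drifted
# Staying close to the old model's outputs yields the lower loss.
```

Minimizing this term alongside the new-task loss anchors the student's predictions on old classes to the teacher's behavior.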
Feature distillation ensures the protection of learned knowledge by distilling at the intermediate levels of the models. Methods like UCIR [43] force the features extracted by the new embedding module to be the same as the old one, providing a stronger regularization. Other works like LwM [44] and AFC [45] utilize attention maps and feature importance, respectively.
Relational distillation introduces a different perspective by distilling the relationships between multiple samples or instances. This level captures the structural information within the data, enhancing the retention of old knowledge. Methods like COIL [46] suggest bidirectional distillation with co-transport, utilizing semantic relationships between old and new models.
Replay-based methods store a small subset of data from previously accessed tasks to reinforce the network’s memory of old knowledge. Current memory-based methods have achieved promising results on many CL benchmarks. Saving examples helps mitigate forgetting by representing the past distributions used during training. For the memory to be sufficient, it must represent the previous distribution as fully as possible, considering all its classes and concepts.
For replay-based methods, together with minimizing Equation 1, the model \( f_{\Theta} \) needs to minimize \( \mathcal{L} \) using the data available in memory \( M \) at time \( t \). The buffer \( M^t \) comprises \( |M| \) samples from previous distributions, meaning that at task \( t \), the buffer contains samples only from tasks \( t' < t \), as shown in Equation 4.
\( \min_{\Theta} \; \mathbb{E}_{(x,y)\sim D^t} \left[ \mathcal{L}\left( f_{\Theta}(x), y \right) \right] + \mathbb{E}_{(x,y)\sim M^t} \left[ \mathcal{L}\left( f_{\Theta}(x), y \right) \right] \)   (4)
The function in charge of populating memory \( M\) is known as Storage Policy [47] and decides which elements go into the memory by sampling from set \( D^t\) given a function \( \mathbf{P}\) , as shown in Equation 5. An ideal policy function is the one that minimizes Equation 4 for evaluation stream \( D^1 ... D^T\) , restricted by the memory size \( |M|\) .
\( M^{t+1} = \mathbf{P}(M^{t} \cup D^{t}) \quad \text{s.t.} \quad |M^{t+1}| \leq |M| \)   (5)
In most cases, we assume that \( M^{t+1}\) will always contain \( |M|\) samples, and the storage policy will decide which samples to remove to add those from new task \( D^t\) . In practice, designing an effective storage policy involves striking a balance between preserving diversity in the memory, i.e. representing past distributions fully, and accommodating new information from the current task. This delicate balance ensures that the memory remains relevant and informative over time, contributing to the model’s overall performance across the entire sequence of tasks. The replay-based CIL branch quickly draws attention due to its appealing ability to resist catastrophic forgetting. For instance, PODNet [48] adopts an efficient spatial-based distillation loss to reduce forgetting, with a focus on the large-phase setting, achieving reasonably good results. AANets [49] employs a new architecture containing a stable block and a plastic block to balance the stability and plasticity.
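One classic storage policy of this kind is reservoir sampling, which keeps every stream sample in memory with equal probability while the buffer size stays fixed:

```python
import random

random.seed(0)

def reservoir_update(memory, mem_size, stream, seen):
    """A simple storage policy P: retain each stream sample in M with equal
    probability, without knowing the stream length in advance."""
    for x in stream:
        seen += 1
        if len(memory) < mem_size:
            memory.append(x)
        else:
            j = random.randrange(seen)
            if j < mem_size:
                memory[j] = x   # evict a stored sample for the new one
    return memory, seen

M, seen = [], 0
for task in range(3):                      # tasks arrive sequentially
    task_data = [(task, i) for i in range(100)]
    M, seen = reservoir_update(M, 10, task_data, seen)
# |M| stays fixed at 10 while every task remains represented in expectation.
```

Because each incoming sample displaces a stored one with probability |M|/seen, early tasks are not silently squeezed out as the stream grows.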
On top of the replay-based CIL, methods exploring exemplar storing techniques are also fruitful. For instance, Reinforced Memory Management (RMM) [50] seeks dynamic memory management using reinforcement learning. By plugging it into PODNet and AANets, RMM attains a state-of-the-art performance.
GAN-based CIL replays past samples by generating them with GANs. Deep Generative Replay generates synthetic samples using an unconditioned GAN; it was later improved by Memory Replay GAN, which adopts a label-conditional GAN. GAN-based CIL relies heavily on the GAN’s generative performance and has only been tested on relatively small datasets, such as MNIST. Bias correction-based CIL mainly tries to address the task-recency bias. End-to-end incremental learning reduces the bias by introducing a balanced training stage in which only an equal number of samples per class is used. Bias Correction (BiC) [51] includes an additional trainable layer that aims to correct the bias. LUCIR [52] fights the bias by replacing the softmax layer with a cosine normalization layer.
In architecture-based methods of CL, the network architecture is dynamically updated during the learning process to retain previously acquired knowledge. These methods are designed to adjust the model’s architecture to effectively handle new tasks while maintaining the knowledge from earlier tasks. Architecture-based methods can be categorized into fixed-capacity and capacity-increasing approaches based on the evolution of their model parameters as the number of tasks increases. Building on these foundations, we explore the implementation of task-specific parameters, extending these concepts through parameter allocation and dynamic network architectures.
The Parameter Allocation approach dedicates isolated parameter subspaces within the network to each task. Piggyback [53], HAT [54], WSN [55], and H2 [56] utilize binary masks to select specific neurons or parameters for each task, effectively freezing the old tasks’ masked regions to prevent forgetting. PackNet [57], CLNP [58], and AGS-CL [59] identify important neurons or parameters for the current task and release the rest for subsequent tasks. However, limited network capacity can lead to saturation of “free” parameters as the number of tasks increases. Dynamic architecture expansion can address this by allowing the network to grow when necessary, with methods like reinforcement learning, architecture search, and variational Bayes being employed for optimization.
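The mask-based allocation idea can be sketched as follows (a simplified toy version of the Piggyback/HAT-style mechanism; the mask sizes and random selection are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # a shared weight matrix
masks = {}                    # task id -> binary mask over W

def allocate(task_id, frac=0.25):
    """Reserve a random, still-free subset of parameters for a new task."""
    used = np.zeros_like(W, dtype=bool)
    for m in masks.values():
        used |= m
    free = np.argwhere(~used)
    k = int(frac * W.size)
    chosen = free[rng.permutation(len(free))[:k]]
    m = np.zeros_like(W, dtype=bool)
    m[tuple(chosen.T)] = True
    masks[task_id] = m
    return m

def masked_update(grad, task_id):
    """Zero the gradient on parameters owned by other (frozen) tasks."""
    frozen = np.zeros_like(W, dtype=bool)
    for t, m in masks.items():
        if t != task_id:
            frozen |= m
    return np.where(frozen, 0.0, grad)

m1 = allocate(1)
m2 = allocate(2)   # disjoint from m1: old regions stay frozen
```

Each new task only trains inside its own mask, so old tasks' parameters never move; the cost is that free capacity shrinks as tasks accumulate.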
Dynamic network architectures can be classified into model decomposition and modular networks. Model decomposition explicitly separates a model into task-sharing and task-specific components. Regular networks can incorporate task-specific components as parallel branches, adaptive layers, or masks of intermediate features. Additionally, network parameters themselves can be decomposed into task-sharing and task-specific elements. While this approach offers scalability, the number of task-specific components typically grows linearly with tasks, making resource efficiency a crucial factor.
Modular networks utilize parallel sub-networks or sub-modules to learn tasks independently. Early works like Progressive Networks [60] introduced an identical sub-network for each task, facilitating knowledge transfer through adapter connections. Expert Gate [4] employs a mixture of experts, expanding one for each new task. PathNet [61] pre-allocates parallel networks to construct candidate paths, selecting the best path for each task. MNTDP [62] seeks to find the optimal layout from existing and potentially new sub-modules. Similar to parameter allocation, these methods aim to construct task-specific models while enabling explicit knowledge reuse through sub-network combinations.
Unlike other directions, most architecture-based approaches aim to de-correlate tasks in network parameters, potentially sacrificing scalability and inter-task generalizability. Task identities are often required to determine which set of parameters to use. This limitation can be overcome by inferring task identities from the predictive uncertainty of task-specific models or by learning the function of task-identity prediction through continual learning strategies.
Beyond incorporating additional loss terms, e.g. regularization and replay, the optimization-based approach in CL explores alternative optimization strategies. This includes techniques like gradient projection to constrain model updates and prevent knowledge loss, as well as meta-learning approaches that aim to automatically acquire inductive biases suitable for CL scenarios. Additionally, some work focuses on optimizing the training process from a loss landscape perspective.
CL encompasses a diverse range of techniques that go beyond regularization and experience replay to address the challenge of learning from a continuous stream of data without forgetting previously acquired knowledge. Gradient projection methods, such as GEM [63], A-GEM [64], and LOGD [65], manipulate parameter updates to align with or remain orthogonal to specific directions, preserving past input and gradient spaces. OWM [66] and OGD [67] take contrasting approaches: OWM modifies updates to be orthogonal to the previous input space, while OGD preserves old gradient directions and rectifies current ones. OrthogSubspace [68] and GPM [69] leverage orthogonal subspaces for CL, with [69] focusing on maintaining the gradient subspace of old tasks. FS-DGPM [70] dynamically adjusts [69] by discarding unimportant bases to enhance learning plasticity and convergence. TRGP [71] defines a “trust region” based on gradient projection to selectively reuse frozen weights from prior tasks.
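To make the gradient-projection idea concrete, here is a minimal NumPy sketch of the A-GEM-style rule: when the current task’s gradient conflicts with a reference gradient computed on replayed samples, the conflicting component is projected out. The flattened-gradient setup and variable names are illustrative, not taken from the original papers.

```python
import numpy as np

def agem_project(grad: np.ndarray, grad_ref: np.ndarray) -> np.ndarray:
    """A-GEM-style projection: if the current gradient conflicts with the
    reference gradient (computed on replayed samples), project out the
    component along the reference direction."""
    dot = grad @ grad_ref
    if dot >= 0:  # no interference with past tasks: keep the gradient as-is
        return grad
    # remove the conflicting component along grad_ref
    return grad - (dot / (grad_ref @ grad_ref)) * grad_ref

# toy check: a conflicting gradient becomes orthogonal to the reference
g = np.array([1.0, -1.0])
g_ref = np.array([0.0, 1.0])
g_proj = agem_project(g, g_ref)
print(np.dot(g_proj, g_ref))  # 0.0 after projection
```

With a non-conflicting gradient (non-negative dot product) the update passes through unchanged, which is what keeps the method cheap compared to full GEM.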
Meta-learning approaches in CL aim to acquire a data-driven inductive bias suitable for various scenarios, eliminating the need for manual design. OML [72] employs meta-training to perform online updates while minimizing interference, promoting sparse representations. ANML [73] extends this idea by meta-learning a context-dependent gating function to activate relevant neurons for each task. Meta-learning can be combined with experience replay for better utilization of both old and new data. MER [74] aligns gradient directions, while iTAML [75] applies a meta-updating rule for balancing them. La-MAML [76] optimizes the OML objective with an adaptive learning rate and leverages replay for online training. OSAKA [77] proposes a hybrid objective for knowledge accumulation and fast adaptation, achieved through meta-training initialization and task incorporation. MERLIN [78] utilizes a metadistribution of model parameters to sample and ensemble task-specific models for inference, while MARK [79] maintains incrementally updated shared weights through meta-learning and selective masking for specific tasks.
Fine-tuning is a transfer learning technique that adapts a pre-trained model to a specific downstream task by further training it on task-specific data [19]. This process typically involves updating all or most of the model’s parameters to optimize performance on the new task. Fine-tuning has become crucial in modern ML pipelines, particularly for PTMs, as it allows leveraging the rich representations learned from vast amounts of data to solve specific tasks with relatively small datasets [24]. The importance of fine-tuning lies in its ability to significantly reduce training time and computational resources compared to training models from scratch while often achieving superior performance [80].
| Method | Key Mechanism and Features |
| --- | --- |
| Adapter [A] | Bottleneck layers (\( 2kd\) /layer) with down/up projections. Effective task adaptation with minimal overhead |
| AdaptFormer [A] | Specialized vision modules (\( <2\%\) ) enhancing ViT transferability across tasks |
| Adapter Fusion [A] | Two-stage fusion combining task adapters with knowledge distillation |
| Mera [A] | Efficient integration of existing adapters using same-track combination strategy |
| AdaMix [A] | Multiple parallel adapters providing diverse task perspectives |
| LoRA [R] | Low-rank decomposition (\( 2dr\) /layer) for weight updates. Zero inference overhead |
| QLoRA [R] | 4-bit NormalFloat quantization + LoRA (0.1%) for memory-efficient training |
| DoRA [R] | Separates magnitude and direction components for improved training dynamics |
| RoSA [R] | Combines low-rank and sparse components for better outlier handling |
| VeRA [R] | Shared frozen matrices with learnable scaling vectors |
| Prompt Tuning [A] | Continuous trainable prefix tokens (\( l \times d\) ) for task conditioning |
| P-Tuning [A] | Hybrid discrete and continuous prompts with LSTM/MLP encoder |
| Prefix Tuning [A] | Optimizable virtual tokens (0.1%) as prefix for generation |
| MPT [A] | Distillation-based transferable prompts (0.035%) for multi-task |
| BitFit [S] | Selective bias-term updates (\( <0.1\%\) ) preserving model knowledge |
| Masking [S] | Task-specific binary masks identifying crucial weights (3-10%) |
| [A]: Additive, [R]: Reparameterization, [S]: Selective | |
However, as PTMs grow in size, with some models containing billions of parameters, traditional fine-tuning becomes increasingly challenging and expensive. This approach often requires specialized hardware, substantial energy consumption [23, 20] and can lead to potential overfitting on smaller downstream datasets [81]. Moreover, full fine-tuning can result in catastrophic forgetting, where the model loses its ability to perform well on previously learned tasks [82].
In the context of large-scale PTMs, Parameter-Efficient Fine-Tuning has emerged as a resource-efficient approach for model adaptation [83]. PEFT methods aim to achieve performance comparable to, or even surpassing, full model fine-tuning while updating only a small number of trainable parameters, either by selectively updating a subset of the model’s parameters [84] or by introducing new task-specific parameters [85]. This approach significantly reduces computational costs and memory requirements, making it particularly effective when working with very large models or in scenarios with limited computational resources, as it enables efficient adaptation across various tasks without extensive retraining.
PEFT strategies can be broadly categorized based on how they modify or add parameters to the pre-trained model. These categories include: (i) additive methods, which introduce new task-specific parameters to the model; (ii) reparameterization methods, which reparameterize the existing model parameters in a more efficient form; and (iii) selective methods, which fine-tune a subset of the model parameters based on their importance to the task. Since this paper focuses on the intersection of Continual Learning and Parameter-Efficient Fine-tuning, we present only an overview of these methods, introducing and discussing a few of them — for a comprehensive comparison of different PEFT categories and methods, we refer readers to the recent surveys [31] and [86]. A summary of the PEFT methods we discuss can be seen in Table 2.
In additive methods, task-specific parameters and lightweight modules are added to the pre-trained model architecture.
In the original Adapter method [85], small neural network modules, the so-called “adapters”, are inserted into the transformer layers of the pre-trained model. These adapters are then trained on the downstream task while keeping the pre-trained model parameters frozen. Adapters have a bottleneck structure, and each adapter block comprises two fully connected layers: the first layer projects the input \( x \in \mathbb{R}^d\) into a lower-dimensional space \( z \in \mathbb{R}^k\) , with \( k \ll d\) , while the second layer restores this representation to the original dimension. The total number of parameters introduced is \( 2kd\) , significantly less than that of a single fully connected layer. Only the adapter block’s parameters are adjusted during fine-tuning, leaving the pre-trained model parameters frozen. This ensures effective task adaptation with minimal computational overhead.
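As an illustration of the bottleneck structure and the \( 2kd\) parameter count, here is a minimal NumPy sketch with toy dimensions; the zero-initialized up-projection, which makes the adapter start as an identity mapping, follows common practice but is our choice for this sketch (biases are omitted, matching the \( 2kd\) count in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 768, 64          # hidden size and bottleneck size, with k << d

# adapter bottleneck: down-projection, nonlinearity, up-projection
W_down = rng.normal(scale=0.02, size=(d, k))
W_up = np.zeros((k, d))  # zero init: the adapter starts as a no-op

def adapter(x: np.ndarray) -> np.ndarray:
    z = np.maximum(x @ W_down, 0.0)   # project input into R^k with ReLU
    return x + z @ W_up               # restore to R^d, residual connection

x = rng.normal(size=(1, d))
print(W_down.size + W_up.size)        # 2*k*d = 98304 trainable parameters
```

Only `W_down` and `W_up` would be updated during fine-tuning; the surrounding transformer weights stay frozen, which is exactly why the overhead stays at \( 2kd\) per adapter.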
AdaptFormer [87] enhances pre-trained Vision Transformers for various image and video tasks. It employs lightweight modules that add less than 2% extra parameters to a ViT, improving the ViT’s transferability without altering its original pre-trained parameters and allowing it to surpass fully fine-tuned models on action recognition benchmarks. The method also offers the flexibility to combine multiple adapters, usually for different downstream tasks.
In Adapter Fusion [88], a two-stage learning algorithm is used to combine knowledge from multiple tasks efficiently. This approach prevents catastrophic forgetting and avoids issues related to dataset balancing by clearly separating the stages of knowledge extraction and composition. This separation enables the classifier to effectively utilize the representations learned from various tasks, enhancing overall efficiency.
Similarly, Mera [89] integrates multiple pre-existing adapters into a unified model via model fusion, while markedly improving on the performance of [88]. This is achieved through the “same-track” strategy, which combines adapters drawn from the same pretraining task sequence.
AdaMix [90] presents an approach to fine-tuning that maximizes parameter efficiency by employing a mixture of adapter modules. This technique improves performance on downstream tasks while minimizing changes to PTM weights. Additionally, AdaMix is engineered to match the computational cost and trainable parameters of the underlying PEFT method. Unlike conventional PEFT methods, which typically employ a single adaptation module per Transformer layer, AdaMix incorporates multiple adaptation modules to capture diverse perspectives of the task. Its adaptable framework can seamlessly integrate into various PEFT methods, such as [85, 91], highlighting its versatility.
Another group of additive methods are soft prompts, which are trainable tensors that are concatenated to the inputs. The key idea is that such prompts can learn and adapt during fine-tuning, while the original weights of the pre-trained model remain intact. Several approaches have been developed to leverage this concept effectively.
Prompt Tuning [92] adapts frozen, pre-trained language models to perform specific downstream tasks by introducing additional learnable prompt tokens, represented as \( P = [P_1, P_2, …, P_l] \). These tokens are concatenated with the original input \( X \in \mathbb{R}^{n \times d} \) to construct the final input \( [P; X] \). In standard prompting, additional information is added to the input, allowing the model to maximize the likelihood of generating the correct output without altering the model parameters \(\theta\). This is typically achieved by selecting prompt tokens either manually or via non-differentiable search methods. Prompt tuning, however, decouples the prompt from the model’s fixed parameters. This decoupling allows the embeddings of these prompt tokens to be learned and optimized, making it easier to adapt to specific tasks.
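The mechanics of soft prompts are simple to sketch: a trainable matrix of \( l \times d\) prompt embeddings is prepended to the frozen input embeddings, and only that matrix receives gradients. A minimal NumPy illustration with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, d = 16, 8, 512   # input length, prompt length, embedding dimension

X = rng.normal(size=(n, d))          # frozen input embeddings
P = rng.normal(size=(l, d)) * 0.01   # trainable soft-prompt tokens

XP = np.concatenate([P, X], axis=0)  # [P; X] is fed to the frozen model
print(P.size)                        # only l*d = 4096 parameters are trained
```

The frozen model parameters \( \theta\) never change; task adaptation happens entirely through the learned rows of `P`.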
Building on this foundation, P-Tuning [93] was introduced to address a critical limitation — the inherent instability issues of discrete prompts. It introduces trainable continuous prompt embeddings that are concatenated with traditional discrete prompts. The core template structure is defined as:
\( T = \{[P_{0:i}],\; x,\; [P_{i+1:j}],\; y,\; [P_{j+1:k}]\} \)  (6)
where \( [P_i]\) represents continuous prompt embeddings that are optimized through backpropagation, while \( x\) and \( y\) represent the input and label, respectively. The trainable continuous embeddings are combined with the discrete prompts via a prompt encoder, i.e. an LSTM or MLP, which models the dependencies between the embeddings. This hybrid approach significantly reduces performance variance, maintaining consistent behavior where traditional methods showed drops of up to 20% from a single word change in the prompt.
Similarly, Prefix Tuning [94] prepends trainable continuous vectors to the input, allowing the transformer to attend those as “virtual tokens” while keeping the main model parameters frozen. The activations \( h_i\) are computed as:
\( h_i = \begin{cases} P_\theta[i,:] & \text{if } i \in \mathrm{P}_{\text{idx}} \\ \mathrm{LM}_\phi(z_i, h_{<i}) & \text{otherwise} \end{cases} \)  (7)
where \( P_\theta\) is parameterized using an MLP for stability: \( P_\theta[i, :] = \text{MLP}_\theta(P'_\theta[i, :])\) . This approach achieves comparable performance to full fine-tuning while only requiring 0.1% of the parameters.
Taking prompt tuning to a multi-task setting, Multitask Prompt Tuning (MPT) [95] innovates by learning a single transferable prompt from multiple source tasks that can be adapted to target tasks. By operating in two stages — source training and target adaptation — MPT learns a task-shared prompt matrix through knowledge distillation. This prompt is decomposed into a task-shared matrix \( P^*\) and low-rank task-specific matrices \( W_k = u_k \otimes v_k^T\) for each task \( k\) , with the task-specific prompt parameterized as \( \hat{P}_k = P^* \odot W_k\) . The method combines multiple loss functions:
\( \mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{PLM}} + \lambda \left( \mathcal{L}_{\text{Logits}} + \mathcal{L}_{\text{Hidden}} \right) \)  (8)
where \( \mathcal{L}_{\text{PLM}}\) represents task-specific losses, \( \mathcal{L}_{\text{Logits}}\) aligns probability distributions between teacher and student models via KL-divergence, and \( \mathcal{L}_{\text{Hidden}}\) minimizes differences in hidden states. This approach achieves impressive performance while tuning only 0.035% of task-specific parameters.
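The decomposition \( \hat{P}_k = P^* \odot W_k\) with \( W_k = u_k \otimes v_k^T\) can be sketched in a few lines of NumPy; the dimensions below are illustrative, and the per-task cost is just the \( l + d\) entries of the rank-1 factors.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 8, 512                        # prompt length and embedding dimension

P_star = rng.normal(size=(l, d))     # task-shared prompt matrix P*
u_k = rng.normal(size=(l, 1))        # task-specific rank-1 factors
v_k = rng.normal(size=(d, 1))

W_k = u_k @ v_k.T                    # rank-1 task matrix u_k ⊗ v_k^T
P_k = P_star * W_k                   # task prompt: P* ⊙ W_k (Hadamard)

print(u_k.size + v_k.size)           # per-task parameters: l + d = 520
```

Every target task reuses `P_star` and stores only its own `u_k` and `v_k`, which is what keeps the task-specific footprint tiny.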
Reparameterization methods provide an alternative way to represent and update a neural network’s parameters during training. Instead of directly modifying the original weight matrices, these methods express them through a transformation function using a smaller set of parameters. The key advantage is that similar model capability can be achieved while training far fewer parameters, making the process more efficient and controllable.
LoRA [91] addresses the challenge of adapting massive pre-trained models to specific tasks by using low-rank decomposition matrices to indirectly update the model’s weights. It introduces a low-rank update to the pre-trained model weights, expressed as:
\( h = W_0 x + \Delta W x = W_0 x + BA x \)  (9)
where \( W_0 \in \mathbb{R}^{d \times k}\) represents the initial pre-trained weight matrix, while \( B \in \mathbb{R}^{d \times r}\) and \( A \in \mathbb{R}^{r \times k}\) are the additional parameters, with rank \( r \ll \min(d,k)\) . During finetuning, \( W_0\) is kept frozen while the \( A\) and \( B\) matrices are updated. This formulation allows for efficient computation and storage, as the update matrices \( B\) and \( A\) are much smaller than the original weight matrix. LoRA is particularly efficient both for the low parameter count and for the small GPU usage during training. Furthermore, it introduces no additional latency during inference, as the update \( \Delta W = BA\) can be merged with \( W_0\) post-training.
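A minimal NumPy sketch of the LoRA update, including the post-training merge that removes any inference overhead; the shapes are toy values, and \( B\) is zero-initialized so that training starts from the unmodified pre-trained model, as in the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8               # weight shape and LoRA rank, r << min(d, k)

W0 = rng.normal(size=(d, k))        # frozen pre-trained weight
B = np.zeros((d, r))                # zero init: ΔW = BA starts at 0
A = rng.normal(scale=0.01, size=(r, k))

def forward(x: np.ndarray) -> np.ndarray:
    return x @ W0.T + x @ (B @ A).T  # h = W0 x + B A x

# post-training merge: ΔW folds into W0, so inference adds no latency
W_merged = W0 + B @ A
x = rng.normal(size=(1, k))
print(B.size + A.size, W0.size)     # 8192 trainable vs 262144 frozen
```

Here the trainable parameters are a factor of \( \min(d,k)/2r = 32\) smaller than the frozen matrix, which is the storage and optimizer-state saving LoRA exploits.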
QLoRA [96] is a reparameterization approach that enables training of LLMs on limited hardware while maintaining performance. It achieves this by combining quantization with [91] through several key innovations. QLoRA processes weight matrices through a dual approach:
\( Y^{\text{BF16}} = X^{\text{BF16}}\, \mathrm{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}) + X^{\text{BF16}} L_1^{\text{BF16}} L_2^{\text{BF16}} \)  (10)
where the base model weights are quantized to 4-bit precision using NormalFloat (NF4), while the LoRA adapters (\( L_1, L_2\) ) operate in 16-bit precision. QLoRA combines three components that together enable efficient training: 4-bit NormalFloat (NF4), an information-theoretically optimal quantization format for normally distributed weights that outperforms standard 4-bit quantization; Double Quantization, which further compresses the model by quantizing the quantization constants themselves, reducing memory requirements by approximately 0.37 bits per parameter; and Paged Optimizers, which manage memory spikes during training by utilizing NVIDIA unified memory for automatic CPU-GPU memory transfers. This combination enables fine-tuning of 65B-parameter models on a single 48GB GPU while preserving full 16-bit performance. In practice, QLoRA achieves results comparable to full fine-tuning while requiring only 0.1% of the trainable parameters, demonstrating particular effectiveness in low-data regimes and better extrapolation to unseen topics.
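The structure of the QLoRA forward pass can be sketched with a deliberately simplified blockwise absmax quantizer standing in for NF4 (the real NF4 format uses quantile-based levels plus double quantization, both omitted here); the LoRA adapters operate at full precision on top of the dequantized base. This is a rough sketch under our own toy shapes, not the bitsandbytes implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8

W = rng.normal(size=(d, d)).astype(np.float32)

# simplified blockwise absmax quantization to 4 bits (stand-in for NF4)
def quantize4(w, block=64):
    flat = w.reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True)   # per-block constants
    q = np.round(flat / scales * 7).astype(np.int8)    # 4-bit range [-7, 7]
    return q, scales

def dequantize4(q, scales):
    return q.astype(np.float32) / 7 * scales

q, scales = quantize4(W)
W_deq = dequantize4(q, scales).reshape(d, d)

# LoRA adapters stay in higher precision on top of the quantized base
L1 = np.zeros((d, r), dtype=np.float32)
L2 = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)

x = rng.normal(size=(1, d)).astype(np.float32)
y = x @ W_deq.T + x @ (L1 @ L2).T    # dequantized base + full-precision LoRA
```

Gradients flow only into `L1` and `L2`; the quantized base is dequantized on the fly for the matmul but never updated, which is where the memory savings come from.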
DoRA [97] also uses low-rank matrix decomposition, but its novelty lies in decomposing the pre-trained weight matrix \( W\) into magnitude \( m\) and direction \( V\) components:
\( W = m \frac{V}{\|V\|_c} \)  (11)
where \( \|\cdot\|_c\) denotes the vector-wise norm across each column.
It then applies LoRA specifically to the directional component:
\( W' = m \frac{W_0 + BA}{\|W_0 + BA\|_c} \)  (12)
where \( W_0\) is the pre-trained weight, and \( BA\) represents the low-rank update. This decomposition approach is inspired by Weight Normalization techniques and aims to simplify the learning task for the low-rank updates. By doing so, DoRA seeks to enhance the learning capacity and stability of LoRA, potentially bridging the performance gap between PEFT methods and full fine-tuning. Importantly, [97] maintains the inference efficiency of LoRA, introducing no additional latency during deployment.
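To see how the magnitude-direction decomposition behaves, a small NumPy sketch: with the low-rank update at zero, renormalizing the direction column-wise and rescaling by the magnitude vector reconstructs \( W_0\) exactly. Shapes and initialization are illustrative choices, not from the DoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4

W0 = rng.normal(size=(d, k))
m = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable magnitude: column norms
B = np.zeros((d, r))                            # zero init, as in LoRA
A = rng.normal(scale=0.01, size=(r, k))

# the direction receives the low-rank update, then is renormalized per column
V = W0 + B @ A
W_new = m * (V / np.linalg.norm(V, axis=0, keepdims=True))
```

Because magnitude and direction are trained as separate quantities, the low-rank update only has to steer column directions, which is the simplification of the learning task that DoRA argues for.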
On the other hand, RoSA (Robust Adaptation) [98] combines both low-rank and sparse adapters trained in parallel with frozen pretrained weights. Inspired by robust principal component analysis, RoSA recognizes that while fine-tuning updates can be approximated by low-rank matrices [91], adding a sparse component can capture important outlier components that LoRA might miss. The method achieves comparable or better accuracy than full fine-tuning while maintaining the memory and computational efficiency of other PEFT methods, making it particularly effective for complex tasks like mathematical reasoning and coding where traditional LoRA methods often fall short. Additionally, RoSA is compatible with quantized base weights, making it the first approach to successfully combine quantization, low-rank projections, and sparsity while maintaining high performance.
VeRA [99] is another reparameterization PEFT method for large language models that significantly reduces the number of trainable parameters compared to LoRA while maintaining comparable performance. It uses a single pair of shared random matrices across all layers and learns small scaling vectors for adaptation. Unlike LoRA, VeRA uses frozen random matrices \( A\) and \( B\) shared across layers, with trainable scaling vectors \( b\) and \( d\) :
\( h = W_0 x + \Lambda_b B \Lambda_d A x \)  (13)
where \( \Lambda_b\) and \( \Lambda_d\) are diagonal matrices formed from vectors \( b\) and \( d\) . Here, \( B \in \mathbb{R}^{m \times r}\) and \( A \in \mathbb{R}^{r \times n}\) are not required to be low-rank since they remain static. The method employs two initialization strategies: the shared matrices \( A\) and \( B\) use Kaiming initialization, while the scaling vectors are initialized with \( b = 0\) and \( d\) set to a single shared nonzero value, so that adaptation starts from the unmodified pre-trained model.
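A NumPy sketch of the VeRA parameterization under toy dimensions of our choosing: one frozen random pair \( (B, A)\) is shared by every layer, and each layer trains only the vectors \( b\) and \( d\) , applied as diagonal scalings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 512, 512, 64
num_layers = 12

# a single pair of frozen random matrices shared by ALL layers
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))

# per-layer trainable scaling vectors: b (size m) and d (size r)
layers = [(np.zeros(m), np.full(r, 0.1)) for _ in range(num_layers)]

def vera_delta(b, d_vec):
    # Λ_b B Λ_d A, with the diagonal matrices applied as elementwise scalings
    return b[:, None] * (B @ (d_vec[:, None] * A))

delta = vera_delta(*layers[0])
trainable = num_layers * (m + r)     # 12 * (512 + 64) = 6912 parameters
print(trainable, B.size + A.size)    # tiny next to the shared frozen matrices
```

Since \( b\) starts at zero, every layer’s update is initially the zero matrix, and training moves only \( m + r\) numbers per layer.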
The family of selective methods for PEFT comprises techniques that pick and finetune only a small subset of the model parameters, hence reducing computational requirements and memory footprint.
Masking [84] starts from the idea of selecting the weights that are important for the downstream task, instead of finetuning the entire model. Based on the lottery ticket hypothesis [100], they propose to select the parameters with a series of learned binary masks, one for each task. The authors experimented with Transformer-based architectures, such as BERT [19], RoBERTa [101] and DistilBERT [102], on different NLP tasks, including part-of-speech tagging, named-entity recognition, sequence classification, and reading comprehension. In this approach, for each weight matrix \( \bf{W}^l\) of the \( l\) -th transformer block, they randomly initialize a matrix \( \bf{M}^l\) , which is then binarized via an element-wise thresholding function. The binary mask is then applied to the weights by Hadamard product: \( \hat{\bf{W}}^l = \bf{W}^l \odot \bf{M}^l_\text{bin} \) .
Through their experiments, they show that this masking technique achieves results comparable to complete finetuning, while being more parameter-efficient and lightweight. They also show that an initial sparsity of 3% \( \sim\) 10% for \( \bf{M}^l_\text{bin}\) represents a good trade-off between retaining knowledge and flexibility. Additionally, they demonstrate that, in general, it is reasonable to select all the weights of the shallower layers, while learning masks for the higher layers, to reach optimal results.
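A minimal NumPy sketch of the mask mechanics: a real-valued score matrix is thresholded into a binary mask and applied by Hadamard product. The uniform scores and the ~5% density are illustrative choices of ours, and the straight-through training of the scores is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

W = rng.normal(size=(d, d))            # frozen pre-trained weights
M = rng.uniform(size=(d, d))           # real-valued mask scores (in the paper,
                                       # trained with a straight-through estimator)
tau = 0.95                             # threshold chosen for ~5% density
M_bin = (M >= tau).astype(W.dtype)     # element-wise binarization

W_masked = W * M_bin                   # Hadamard product selects the weights
density = M_bin.mean()
print(f"{density:.3f}")                # fraction of weights kept
```

Per task, only the binary mask needs to be stored, which is far cheaper than a full copy of the finetuned weights.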
BitFit [103] is a sparse finetuning method, in which the basic concept is to update only the bias terms, while keeping the rest of the network frozen. Such approach is very parameter-efficient, since very few weights are modified: for BERT architecture, they amount to less than 0.1% of the total number of the parameters. In particular, the authors consider the bias terms of key, query and value weights of each self-attention head, for each layer of the BERT encoder, plus the bias parameters from the MLP layers on top.
The BitFit approach was evaluated against the GLUE benchmark [104], obtaining comparable results to naive finetuning, but modifying a tiny portion of the weights. Moreover, they experimented with updating a subset of the bias parameters, achieving only slightly worse results.
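The parameter arithmetic behind BitFit is easy to check on a toy model: counting only the bias terms in a stack of \( d \times d\) weight matrices gives a ratio of roughly one in \( d+1\) . The exact BERT figure (<0.1%) differs because of embeddings and layer norms; the layout below is a simplification of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, layers = 768, 12

# toy transformer-like parameter dict: weights frozen, biases trainable
params = {}
for i in range(layers):
    for name in ("query", "key", "value", "mlp"):
        params[f"L{i}.{name}.weight"] = rng.normal(size=(d, d))
        params[f"L{i}.{name}.bias"] = np.zeros(d)

trainable = sum(v.size for k, v in params.items() if k.endswith("bias"))
total = sum(v.size for v in params.values())
print(f"{trainable / total:.4%}")  # ≈ 0.13% of all parameters in this toy model
```

In a BitFit training loop, gradient updates would simply be applied to the `*.bias` entries while every `*.weight` entry is left untouched.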
The intersection between PEFT and CL presents a promising avenue for developing more adaptive and efficient AI systems. The synergy between these approaches addresses the key limitations of each field individually and opens up new possibilities for scalable, lifelong learning systems.
In recent times, several works in CL [2, 1, 108, 109] have incorporated different PEFT methods, each offering a different complementary advantage over its CL counterpart. In this survey, we group these PECFT methods according to the kind of PEFT method they apply.
Continuous Adapter (C-ADA) [110] explores the potential of rehearsal-free continual learning by introducing an extensible Continual Adapter Layer (CAL) that allows better reuse of knowledge by adapting weights for new tasks while preserving old knowledge. It also employs a Scaling & Shifting module that reduces the divergence between the pre-training and downstream datasets by transferring the feature space of the pre-training data to the downstream dataset.
Similarly, Adapter-based Continual Learning (ACL) [111] introduces a framework that uses a feature adapter to combat catastrophic forgetting. It is designed with a task-specific head, which cleverly groups all previously learned classes into a single “out-of-distribution” category, enabling more effective feature discrimination. The approach also includes a collaborative fine-tuning mechanism, ensuring that the outputs of the different classifiers remain comparable, facilitating accurate head selection during inference. The framework’s effectiveness is validated through comprehensive testing on three image datasets, demonstrating ACL’s superior performance in continually learning new classes while preserving existing knowledge.
In adapter-based CL frameworks, routing mechanisms enable dynamic selection of specialized adapters, facilitating efficient knowledge accumulation and modular model expansion. [112] introduces a framework that dynamically expands pre-trained CLIP models using Mixture-of-Experts (MoE) adapters, featuring a Distribution Discriminative Auto-Selector that intelligently routes inputs between the MoE Adapter and original CLIP model. This architecture demonstrates remarkable efficiency, reducing parameter training overhead by 60% while maintaining competitive performance.
SEMA [113] takes a different approach to routing, employing an expandable weighting router that generates weighted combinations of adapter outputs. Its routing mechanism is enhanced by representation descriptors that monitor distribution shifts to trigger selective adapter expansion. This “soft” routing strategy, combined with on-demand expansion, enables SEMA to achieve sub-linear growth rates without relying on memory rehearsal.
The Continual Adapter Tuning (CAT) [114] tackles Aspect Sentiment Classification (ASC) by combining task-specific adapters with a frozen pre-trained backbone. CAT uses its continual adapter initialization technique for knowledge transfer between tasks and label-aware contrastive learning to jointly optimize features and classifiers. The framework eliminates the need for task IDs during inference by employing a majority voting strategy across adapter paths. Through validation on 19 ASC datasets, CAT demonstrates state-of-the-art performance in maintaining effectiveness across domains while enabling efficient knowledge transfer.
AdaPtive Adapter RouTing (APART) [115] presents an innovative solution to Long-Tailed Class-Incremental Learning (LTCIL), addressing the dual challenges of catastrophic forgetting and data imbalance without relying on stored exemplars. This approach leverages pre-trained models through a dual-pool adapter system: a primary pool for general knowledge retention, and an auxiliary pool specifically designed for minority classes. APART uses an adaptive instance routing mechanism, which dynamically combines information from both pools without fixed thresholds for minority class identification. By freezing the pre-trained model’s core parameters and utilizing trainable layer-wise adapters, the method enables effective adaptation while minimizing forgetting. APART demonstrates how pre-trained models can be effectively leveraged for LTCIL in real-world applications, where data storage and privacy concerns are paramount, with extensive experimental validation confirming its effectiveness across multiple benchmarks.
ATLAS [116] proposes a two-stage learning paradigm that effectively balances the preservation of existing knowledge with the acquisition of new capabilities by incorporating a vector learning method, which intelligently combines information from different adapters based on the cosine similarity between tasks. Unlike previous parameter-efficient solutions that created isolated modules for each task, ATLAS minimizes knowledge redundancy while expanding the model’s representational capacity. It employs both multi-modal and uni-modal tasks in upstream continual learning, providing valuable insights into how multi-modal model updates affect performance across different modalities. Its ability to enhance distribution richness and improve generalization capability makes it a promising solution for CL in Vision-Language models.
The Expand and Merge [117] framework introduces a parameter-efficient architecture that combines adapter layers with Vision-Language models. The approach is built on two key innovations: first, it employs specially designed adapter layers that expand for new tasks while keeping old knowledge intact through frozen parameters; second, it leverages a pretrained text encoder’s fixed embedding space to guide the vision encoder’s continual learning process through vision-language pretraining models like CLIP. To effectively manage knowledge integration, the framework implements an adaptive fusion mechanism using scaling weights at different network depths, complemented by a unique parameter merging stage that prevents performance degradation while controlling parameter growth. This design addresses two common limitations in the field: the parameter bloat, typical of expansion-based approaches, and the over-constraining of new learning found in regularization-based methods. Through extensive validation across three datasets, the framework demonstrates superior performance compared to current state-of-the-art methods, exhibiting robust performance maintenance on both old and new tasks while effectively managing parameter efficiency.
Recent works have leveraged Low-Rank Adaptation for CL scenarios, offering a parameter-efficient alternative to traditional prompt or adapter-based methods. By utilizing LoRA’s inherent ability to create compact task-specific updates, these approaches demonstrate strong performance in sequential learning while maintaining minimal memory overhead.
InfLoRA [118] is a reparameterization technique that confines weight updates to a carefully designed subspace, eliminating interference between new and old tasks. This subspace-constrained learning improves upon existing approaches that either reuse parameters or randomly expand them for new tasks. The method approximates the gradient space using the input matrix of new tasks, making it particularly effective in CIL scenarios where task IDs are unavailable during inference. Through this carefully designed subspace and branch expansion architecture, InfLoRA demonstrates superior performance compared to state-of-the-art methods, particularly in maintaining stability on old tasks while effectively adapting to new ones.
Prototype Guided Incremental LoRA (PILoRA) [119] addresses two key challenges in federated class incremental learning: catastrophic forgetting and data heterogeneity across clients. The method combines prototype learning with parameter-efficient fine-tuning through an innovative two-pronged approach. First, it introduces incremental LoRA, which mitigates forgetting by constraining different learning stages to orthogonal subspaces, allowing efficient knowledge accumulation through parameter summation during inference. Second, it implements a prototype re-weight module that leverages heuristic information between prototypes and class features to address classifier bias without retraining. Built on the ViT backbone, PILoRA performs well on standard benchmarks while maintaining strong performance even under extreme heterogeneous conditions where other methods falter. Its effectiveness stems from the unified approach to both continual learning and federated learning challenges, recognizing their shared need for robust feature representations and bias mitigation in classifiers. While InfLoRA focuses on confining weight updates to carefully designed subspaces to prevent interference between tasks, PILoRA expands on this idea by implementing orthogonal subspaces for different learning stages while adding prototype learning to handle data heterogeneity.
A slightly different approach comes from Dual Low-Rank Adaptation (DualLoRA) [120]. It employs two distinct low-rank adapters working in concert: an orthogonal adapter that operates in subspaces perpendicular to previous task features to preserve old knowledge, and a residual adapter that learns in task-specific subspaces to enable effective adaptation to new tasks. DualLoRA is quite efficient in feature subspace extraction; it uses Singular Value Decomposition, which eliminates the need for multiple forward passes that burden previous approaches like InfLoRA. During inference, DualLoRA uses its dynamic memory mechanism which modulates the residual adapter’s output based on task relevance computed from input samples. This dynamic adjustment not only improves feature embeddings but also enables accurate task identification without explicit task labels. By maintaining parameter efficiency while offering superior performance across multiple benchmarks, DualLoRA represents a significant advance over existing PEFT methods for CL.
Experience Replay Informative-Low Rank Adaptation (ERI-LoRA) [121] tackles CL in task-oriented dialogue systems by combining replay-based methods with parameter-efficient fine-tuning. The method improves upon standard experience replay by introducing a weighted distribution sampling strategy for previous domain examples, while leveraging LoRA for efficient parameter updates on LLaMA2. Instead of random sample selection, ERI-LoRA employs an informed sampling approach that considers class label distributions. Through extensive evaluation on intent detection and slot-filling tasks across multiple datasets and orderings, ERI-LoRA shows significant improvements over state-of-the-art methods, achieving a 13% higher F1 score in slot-filling and 0.85% better accuracy in intent detection.
Another replay-based method is Task Arithmetic with LoRA for CL [122]. Instead of sequential training, which carries forgetting risks, this approach trains individual LoRA modules for each task separately, then combines the task vectors using task-arithmetic rules before merging them back into the pre-trained Vision Transformer. The method’s innovation lies in its efficient use of resources: it only trains small LoRA modules and leverages a minimal memory buffer of 10 samples per class for final fine-tuning.
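As a rough sketch of this recipe (function names are ours, and real task vectors would be full weight deltas rather than toy matrices), each task's LoRA pair yields a dense delta, and the deltas are summed into the frozen backbone:

```python
# Hypothetical sketch of task arithmetic over LoRA task vectors: each task's
# adapter is trained in isolation, its weight delta B @ A is treated as a
# "task vector", and the vectors are summed into the frozen pre-trained weight.

def lora_delta(B, A):
    """Dense delta W = B @ A from a low-rank pair (plain nested lists)."""
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def merge_task_vectors(W0, deltas, alpha=1.0):
    """Task arithmetic: W = W0 + alpha * sum_t delta_t."""
    merged = [row[:] for row in W0]
    for d in deltas:
        for i in range(len(merged)):
            for j in range(len(merged[0])):
                merged[i][j] += alpha * d[i][j]
    return merged

# Two tasks, each with its own rank-1 LoRA pair on a 2x2 frozen weight.
W0 = [[1.0, 0.0], [0.0, 1.0]]
d1 = lora_delta([[1.0], [0.0]], [[0.5, 0.0]])   # task 1 delta
d2 = lora_delta([[0.0], [1.0]], [[0.0, 0.5]])   # task 2 delta
W = merge_task_vectors(W0, [d1, d2])
print(W)  # [[1.5, 0.0], [0.0, 1.5]]
```

In the actual method, the merged model is then briefly fine-tuned on the small per-class buffer.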
CL with STack-And-Mask INcremental Adapters (STAMINA) [123] enhances text-to-image diffusion models’ ability to learn long sequences of concepts without forgetting. The method combines LoRA with two key technical innovations: hard-attention masks parameterized by low-rank MLPs using Gumbel softmax, and learnable MLP tokens that replace traditional custom token embeddings, with all trainable parameters capable of being folded back into the base model after training, eliminating inference overhead. It is also able to maintain plasticity thanks to its sparse adaptation approach. Through comprehensive evaluation on a 50-concept benchmark of landmarks and faces, STAMINA demonstrates superior performance over existing methods while requiring fewer training steps to achieve concept mastery.
Interpolation-based LoRA [124] approaches CL in LLMs through mode connectivity, i.e. the observation that different optimization minima can be connected through low-loss valleys. Rather than viewing catastrophic forgetting as a binary trade-off, it implements a dual-memory architecture: a fast learner that rapidly adapts to new tasks — which provides plasticity — and a slow learner that consolidates long-term knowledge — maintaining stability. This design is inspired by empirical findings that mode connectivity exists in parameter-efficient fine-tuning of LLMs, and that linear interpolation between task-specific optima can achieve better plasticity-stability trade-offs than previous approaches. [124] tries to solve continual learning in a rather different way: instead of trying to preserve discrete task knowledge, it leverages the continuous nature of the parameter space to find optimal interpolation points that serve both past and present learning objectives.
Learning to Prompt (L2P) [1] is one of the first works to apply prompting to CL. In this work, a prompt is trained for each task on a pre-trained network and subsequently stored in a key-value prompt pool, where the key is learnable. To select the most suitable prompts for a given input at inference time, the authors introduce a deterministic query function, i.e. the pre-trained network, to extract features from the input. Prompts are then selected based on the similarity between the input query and the prompt keys. L2P was evaluated against the main CL strategies, showing enhanced performance in class-incremental, domain-incremental and task-agnostic scenarios. Similarly, DualPrompt [2] builds on the prompt pool concept by introducing two types of prompts: general prompts (for common features across tasks) and expert prompts (for task-specific instructions). These prompts are attached to two distinct groups of contiguous attention layers, selected through heuristic search. DualPrompt leverages the L2P workflow: each e-prompt is associated with a task-specific key that is learned to match the input features; at inference time, a query function on the test sample is used to retrieve the best e-prompt. Compared against leading memory-based and architecture-based CL methods, DualPrompt outperforms them all while remaining rehearsal-free. It also beats L2P by a modest margin.
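The key-query selection at the heart of L2P can be sketched as follows. This is a minimal illustration with hypothetical names; the real method operates on ViT features and learns the keys jointly with the prompts:

```python
# Illustrative L2P-style prompt selection: a frozen query function embeds the
# input, and the top-k prompt keys by cosine similarity decide which prompts
# from the pool are prepended to the input tokens.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_prompts(query, keys, k=1):
    """Return indices of the k prompt keys most similar to the query."""
    scores = [(cosine(query, key), i) for i, key in enumerate(keys)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # learnable keys, one per prompt
query = [0.9, 0.1]                            # frozen-backbone feature of input
print(select_prompts(query, keys, k=2))       # [0, 2]
```

During training, the selected keys are pulled toward the queries of the inputs that chose them, so the pool self-organizes by task.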
An additional technique for prompt-based approaches is called Language Guidance for Prompt-based Continual Learning (LGCL) [125]. In this method, the underlying assumption is that natural language can be used as a shared representation space for tasks with different visual features. Hence, the authors propose to employ a pre-trained text encoder to obtain a representation of both tasks and single classes. Then, the task-level representation is used to select the prompts via cosine similarity with each key, while the class-level embedding is encoded in the output feature of the model, to perform the classification. LGCL is shown to consistently outperform CL baselines on both Split-CIFAR100 [126] and Split-ImageNet-R [2].
Another study that builds on the prompt pool concept is the Prompt Gradient Projection (PGP) approach [127]. It represents the first work to analyze anti-forgetting mechanisms integrated with prompt tuning techniques. In particular, the authors employ the Gradient Projection Method [128], which demonstrates theoretically that forgetting is mitigated if the weights are updated in the direction orthogonal to the subspace spanned by the previous features. PGP was evaluated in combination with L2P and DualPrompt, showing improvements in both accuracy and forgetting.
A slight variation on L2P comes from S-Prompts [129]. This approach is specifically applied to Domain-Incremental Learning scenarios, without the need to keep a buffer of samples from previous tasks. In L2P and similar works the prompts are shared among tasks, meaning that the new knowledge shares the same feature space as the old tasks, limiting the learning capacity and possibly inducing interference. Conversely, S-Prompts instantiates a new prompt for each domain, so that each prompt can obtain optimal performance on its task independently of the other prompts. At inference time, K-NN is employed to find the closest domain to a given test sample, and the corresponding prompt is prepended to the input to perform classification. In the paper, the authors show an exhaustive comparison with both prompting methods and standard CL baselines, demonstrating substantial advances over the state of the art.
A limitation of the methods discussed so far lies in the fixed size of both the prompt pool and the prompts themselves, which makes it difficult to increase the learning capacity when needed and thus limits scalability. CODA-Prompt [108] overcomes this issue by defining a set of prompt components: for each input sample, the components are merged via weighted summation, forming a so-called decomposed prompt. By doing so, the method can automatically adjust the prompting capacity depending on the complexity of the task at hand. Furthermore, the authors introduce an attention mechanism on the input query, which allows the model to focus on relevant features only, while also reducing the input dimensionality. Unlike L2P and DualPrompt, CODA-Prompt is optimized in an end-to-end fashion, directly using the classification loss, which helps increase performance and establishes new SOTA results. Indeed, this method surpasses DualPrompt in both class-incremental and domain-incremental scenarios, especially when the number of tasks increases.
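A minimal sketch of the decomposed-prompt idea, under assumed names and toy dimensions (the actual CODA-Prompt uses learned keys, attention vectors, and components at multiple layers):

```python
# Illustrative decomposed prompt: an attention vector focuses the query,
# query-key similarities produce one weight per shared component, and the
# prompt is the weighted sum of the components.

def decomposed_prompt(query, attention, keys, components):
    # Attention focuses the query on the relevant feature dimensions.
    q = [qi * ai for qi, ai in zip(query, attention)]
    # One scalar weight per component, from query-key similarity.
    weights = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Decomposed prompt: weighted sum of the shared components.
    dim = len(components[0])
    return [sum(w * comp[j] for w, comp in zip(weights, components))
            for j in range(dim)]

query = [1.0, 2.0]
attention = [1.0, 0.0]            # ignore the second feature dimension
keys = [[1.0, 0.0], [0.0, 1.0]]
components = [[1.0, 1.0], [5.0, 5.0]]
print(decomposed_prompt(query, attention, keys, components))  # [1.0, 1.0]
```

Because every part of this pipeline is differentiable, the classification loss can train keys, attention vectors, and components end-to-end.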
A different approach comes from Progressive Prompts [109]. In this work, a new prompt is learned for each new task and is then concatenated with the previously learned ones. Each prompt \( P_k\) is trained only during the corresponding task \( T_k\) , while it is kept frozen during subsequent tasks. The concatenation of prompts is then prepended to the input embeddings. This approach was evaluated on two CL benchmarks for text classification using two widely used language models. The results demonstrate its effectiveness in improving performance and mitigating catastrophic forgetting by facilitating greater forward transfer across tasks.
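The input construction can be sketched as follows (illustrative token names; in practice these are embedding vectors, and only the current task's prompt receives gradients):

```python
# Illustrative Progressive Prompts input assembly: earlier tasks' prompts are
# frozen, the current task's prompt is trainable, and their concatenation is
# prepended to the input embeddings.
frozen_prompts = [["p1a", "p1b"], ["p2a"]]      # tasks 1 and 2, frozen
current_prompt = ["p3a", "p3b"]                 # task 3, trainable
input_tokens = ["x1", "x2"]

model_input = sum(frozen_prompts, []) + current_prompt + input_tokens
print(model_input)  # ['p1a', 'p1b', 'p2a', 'p3a', 'p3b', 'x1', 'x2']
```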
Following the idea of prompt concatenation, Prompt of Prompts (POP) [130] proposes an approach involving two sets of prompts. First, one or more prompts are learned for each task, with the objective of discriminating among the classes in the task. Such prompts are then kept frozen for subsequent tasks, but are still given as input to the model; in this way, the new prompts are forced to learn only the new information from the current task, without duplicating knowledge present in other prompts. Additionally, this method uses a second set of prompts, called Prompt of Prompts, to combine the representations from different tasks. While the task prompts are frozen after training on the corresponding task, the Prompt of Prompts set is learned continually, so as to integrate and update the features across all tasks.
The problem of correctly evaluating the performance of CL algorithms has been present since the early developments of the field. It is clear that accuracy alone is not enough: most of the time, standard “offline” learning models achieve a higher accuracy than CL solutions, but this drop may be acceptable when weighed against other, equally important benefits brought by CL algorithms.
This is particularly evident in the context of PECFT methods. Indeed, we mentioned multiple times throughout this work that the principal concern about PTMs lies in their training and inference inefficiency, and in the large resources needed for fine-tuning. This means that, when dealing with very large architectures, pure accuracy cannot be the only metric employed to evaluate such models.
Firstly, in dynamic scenarios the accuracy on the final task alone is not very informative, since we are interested in retaining knowledge and performance on all seen tasks. Formally, we define the Average Accuracy [5] as:
\( \mathrm{AA} = \frac{1}{N} \sum_{j=1}^{N} a_{N,j} \)   (14)
where \( a_{i,j}\) is the accuracy on task \( j\) after the model has been trained continually from task \( 1\) to task \( i\) , and \( N\) is the total number of tasks.
Apart from pure accuracy, the CL field principally studies approaches to diminish the catastrophic forgetting phenomenon; it follows naturally that forgetting itself is an important measure to take into consideration.
Average Forgetting [34] represents how much the model “forgets” about previous tasks. It is formally defined as:
\( \mathrm{AF} = \frac{1}{N-1} \sum_{j=1}^{N-1} \max_{i \in \{1, \dots, N-1\}} \left( a_{i,j} - a_{N,j} \right) \)   (15)
Given the sequential nature of the learning scenarios, it is useful to measure the influence of learning multiple tasks, i.e. how much training on a task affects the performance on other tasks. This is particularly interesting because it gives practical insights into how fast the model learns over time. For these reasons, forward transfer and backward transfer [131, 132] are introduced.
Forward Transfer (FWT) measures the influence that learning a task \( t\) has when later training on a future task \( k\) ; a positive FWT indicates that the model is able to exploit previously acquired knowledge during subsequent learning sessions. It is defined as:
\( \mathrm{FWT} = \frac{1}{N-1} \sum_{j=2}^{N} \left( a_{j-1,j} - \tilde{a}_{j} \right) \)   (16)
where \( \tilde{a}_{j}\) is the accuracy on task \( j\) of a reference model evaluated at random initialization.
Similarly, Backward Transfer (BWT) indicates the influence of a task \( t\) on a previously learned task \( k\) ; a positive BWT means the agent is able to improve, rather than degrade, performance on past tasks throughout its lifetime.
\( \mathrm{BWT} = \frac{1}{N-1} \sum_{j=1}^{N-1} \left( a_{N,j} - a_{j,j} \right) \)   (17)
Other meaningful metrics to keep track of are those related to model efficiency. An absolute measure of model size is given by the Number of Trainable Parameters [91], i.e. the number of parameters introduced by a PECFT method. This represents a straightforward yet effective measure of the computational requirements of a given approach, and a simple way to compare different techniques.
Along the same lines, the relative growth of the number of parameters over time is equally important. We define the Model Size Efficiency (MS) [132] as:
\( \mathrm{MS} = \min \left( 1, \frac{\sum_{i=1}^{N} \frac{Mem(\theta_{1})}{Mem(\theta_{i})}}{N} \right) \)   (18)
where \( Mem(\theta_{i})\) denotes the memory size of the model parameters \( \theta\) after task \( i\) . MS equals 1 when the model does not grow, and decreases as the model grows over time.
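As a concrete illustration, these metrics can be computed from the matrix of accuracies \( a_{i,j}\) (accuracy on task \( j\) after training up to task \( i\) ). The sketch below uses plain Python with our own function names; the forward-transfer baseline follows the standard definition from [131] and is an assumption here:

```python
# acc[i][j]: accuracy on task j after training up to task i (0-indexed here,
# 1-indexed in the text). baseline and mem are illustrative inputs.

def average_accuracy(acc):
    """Eq. (14): mean accuracy over all tasks after training on all N tasks."""
    N = len(acc)
    return sum(acc[N - 1]) / N

def average_forgetting(acc):
    """Eq. (15): average drop from each task's best earlier accuracy."""
    N = len(acc)
    return sum(max(acc[i][j] for i in range(N - 1)) - acc[N - 1][j]
               for j in range(N - 1)) / (N - 1)

def forward_transfer(acc, baseline):
    """Eq. (16): gain on each new task over a random-init baseline."""
    N = len(acc)
    return sum(acc[j - 1][j] - baseline[j] for j in range(1, N)) / (N - 1)

def backward_transfer(acc):
    """Eq. (17): effect of later training on earlier tasks."""
    N = len(acc)
    return sum(acc[N - 1][j] - acc[j][j] for j in range(N - 1)) / (N - 1)

def model_size_efficiency(mem):
    """Eq. (18): penalizes growth of parameter memory over tasks."""
    return min(1.0, sum(mem[0] / m for m in mem) / len(mem))

acc = [[0.9, 0.0, 0.0],
       [0.8, 0.9, 0.0],
       [0.7, 0.8, 0.9]]
print(average_accuracy(acc))                    # ~0.80
print(average_forgetting(acc))                  # ~0.15
print(backward_transfer(acc))                   # ~-0.15
print(model_size_efficiency([10, 12, 15]))      # ~0.83
```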
The influence of different pre-trained models on the effectiveness of CL methods requires deeper analysis. Research not only indicates variations in performance between models trained from scratch and those utilizing pre-trained architectures, but also shows that different pre-trained models benefit differently from existing continual learning approaches [133]. Thus, exploring the selection of the optimal pre-trained model, and its best-fitted continual learning strategy, for a given real-world task can lead to meaningful contributions guiding industry practice. Additionally, while the majority of experiments have been conducted on visual and natural language modalities, video and audio are domains where the application of continual learning is also valuable to explore [134, 135].
Multimodal pre-trained models pose challenges but also offer opportunities for developing cross-modality strategies for continual learning tasks. The key idea is that as models receive additional information from the task, they can develop a more accurate and robust representation, especially when the supplementary guidance comes from large pre-trained models, which are known for their ability to generate robust representations in the first place. For instance, [136] indicates that aligning visual features with semantic groups and leveraging semantic relations among categories can boost model robustness against distribution shifts. While language guidance has been extensively investigated in Transfer Learning [137] across different vision tasks, its utilization in continual learning has been relatively overlooked. With the rise of large pre-trained models in specific domains, exploring potential cross-modality guidance holds promise for further research.
Model merging presents an exciting opportunity in CL: by combining multiple expert models, each specialized in different aspects of a task, we can create a system that not only mitigates issues like catastrophic forgetting but also benefits from the diverse strengths of each model. This becomes particularly important in dynamic, evolving domains, as it allows the model to expand its knowledge over time without forgetting what it has previously learned. However, a common problem that model merging solutions, such as Task Arithmetic [138], often encounter is parameter interference, which leads to significant performance degradation when the expert models are merged. Works such as TIES-MERGING [139] and DARE [140] have led to significant improvements in model merging. [139] addresses interference by resetting parameters that have only changed minimally, resolving sign conflicts, and merging only those parameters that align with the final agreed-upon sign. [140], on the other hand, eliminates redundant delta parameters by randomly dropping them and rescaling the remaining ones, which has proven remarkably effective at sparsifying and merging multiple expert models without significant performance loss.
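The drop-and-rescale step of DARE can be sketched as follows (illustrative names; \( p\) is the drop probability, and survivors are rescaled by \( 1/(1-p)\) so the expected delta is preserved):

```python
# Illustrative DARE-style delta sparsification: each delta parameter
# (fine-tuned weight minus pre-trained weight) is dropped with probability p,
# and the survivors are rescaled by 1/(1-p) to keep the expected delta
# unchanged before the sparse deltas of several experts are merged.
import random

def dare_sparsify(delta, p, rng):
    """Randomly drop delta entries with prob p; rescale the rest by 1/(1-p)."""
    return [0.0 if rng.random() < p else d / (1.0 - p) for d in delta]

rng = random.Random(0)
delta = [0.2, -0.1, 0.4, 0.05]
sparse = dare_sparsify(delta, p=0.5, rng=rng)
# Surviving entries are doubled (1/(1-0.5)); dropped ones are exactly zero.
print(sparse)
```

Sparser deltas overlap less across experts, which is why this simple step reduces parameter interference when merging.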
Typical model merging scenarios often require combining pre-existing expert models, each specialized in a specific task, into one unified system. However, this static approach falls short in scenarios where new tasks emerge over time. In continual learning, we face the challenge of incrementally integrating new task-specific models without retraining the entire system. Recent advances in dynamic model merging address this by tackling issues such as parameter interference, memory efficiency, and sequential integration, enabling systems that adapt more effectively as new tasks are encountered. For instance, MagMax [141] introduces a framework that merges task-specific models using sequential fine-tuning combined with a maximum-magnitude weight selection strategy. This approach integrates new information effectively and preserves the integrity of earlier learning, helping to tackle CL. In contrast, Representation Surgery for Multitask Model Learning [142] addresses a different challenge: mitigating the representation bias that emerges when merging models trained on disparate tasks. By inserting a lightweight, task-specific module, dubbed “Surgery”, the method realigns the merged model’s internal representations with those of the individual models, thereby enhancing overall performance in multitask scenarios. Adaptive LoRA Merging for Domain Incremental Learning [143], in turn, highlights the limitations of fixed-weight merging by proposing an adaptive mechanism that dynamically computes merging coefficients. This flexibility allows the system to balance the contributions of new and old domains, ensuring robust performance in evolving environments while reducing manual tuning.
Lastly, [144] takes a sequential projection-based approach. By projecting new updates onto subspaces orthogonal to those of previously merged models and applying adaptive scaling, this method minimizes interference and maintains a constant memory footprint, making it highly scalable for continual learning applications.
Moving away from traditional continual learning approaches, recent trends focus on the adaptive combination of lightweight modules such as [145, 91] in dynamic environments. This enables the seamless integration of new tasks as they emerge, without the need for extensive retraining of large, monolithic models. By merging these modular components on the fly, systems can remain both efficient and practical in handling real-world challenges, making them ideally suited for large-scale models.
Large models have been considered to exhibit “emergent” phenomena, yet they still have a long way to go before they can effectively handle more complex reasoning tasks [26, 146]. CL is crucial in this regard and will also face new challenges.
The evolution of tasks from single-ability tasks, such as image classification, to more complex reasoning tasks demands a re-evaluation of existing continual learning methods or the proposal of new approaches. Classically, many methods in CL are evaluated using classification datasets such as CIFAR and ImageNet. There has been a recent push towards transitioning from hard split datasets to benchmarks that reflect more natural temporal and task shifts [147, 148]. Despite this, very few studies address the specific requirements of more complex reasoning tasks, such as visual question answering [149, 150, 151], decision-making [152], motion programming [153], and scene understanding [154].
Reasoning tasks can differ significantly in their requirements. For instance, in image classification, a model continually improves its ability to recognize objects within a single modality and primarily focuses on learning representations for each class. However, in reasoning tasks like visual question answering, a model in many cases needs to acquire multiple new abilities, such as attribute recognition and knowledge reasoning, based on changing demands [151, 155].
Representation methods that focus on retaining representative embeddings from historical tasks typically fail to handle this situation and thus cannot be applied directly. Traditional replay techniques may prove insufficient for addressing these more intricate tasks as well. Good examples that dive deeper into the specific needs of reasoning tasks are: [156], which maps sub-symbolic inputs to high-level concepts and explicitly replays concepts for CL in neuro-symbolic (NeSy) architectures such as DeepProbLog [157]; and [151], which utilizes scene graphs as prompts to represent previous images in the visual question answering task and replays pseudo scene graphs alongside corresponding QA pairs.
Therefore, a promising direction for future research efforts involves developing novel approaches to improve the reasoning ability of large models within the framework of continual learning.
Future research should explore Continual Learning in more practical and realistic settings. This includes scenarios characterized by limited computational budgets. A recent survey [158] pointed out that existing continual learning methods tend to consider memory constraints, but computational costs are not extensively considered. While the cost of memory and privacy concerns may seem prohibitive, in practice, storing large datasets for extended periods can be relatively inexpensive compared to the computational costs of training models on such datasets [159]. This is especially relevant for methods relying on replay mechanisms.
Additionally, investigating Online Learning [160, 15] paradigms where data arrives in small, incremental batches presents a significant frontier for research in this field. Inherently, the ability to detect incoming data shifts enables a more efficient model update process and can contribute to more robust model performance.
Task-agnostic [161] and unsupervised [162, 163] settings have been less explored but have started to gain more attention in recent years. Classic Continual Learning settings often assume the availability of data annotations such as class labels or task labels. However, data labeling efforts can be costly and even infeasible in some real-world industrial situations. Thus, research into unsupervised or task-agnostic settings can lead to more practical use cases of Continual Learning in real-world scenarios.
In this work, we tackled a key limitation of current foundation models: their restricted ability to adapt to specific downstream tasks over time. To address this, we focused on two research fields aimed at enhancing adaptation in dynamic scenarios: Continual Learning and Parameter-Efficient Fine-Tuning. CL enables models to incrementally learn from a continuous stream of tasks, while PEFT facilitates rapid adaptation to individual tasks. We explored the intersection of these two fields, which we call Parameter-Efficient Continual Fine-Tuning (PECFT), and examined how their techniques can be combined to improve the efficiency and sustainability of large pre-trained models in evolving environments. We believe that bridging CL and PEFT has the potential to drive significant advancements in AI research, shaping the future of large-scale models. Through this review, we aim to provide researchers with an overview of existing methods in this emerging area, fostering the development of novel and effective solutions.
During the preparation of this work the author(s) used ChatGPT in order to enhance readability and check grammar. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
Research partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 1 "Human-centered AI", funded by the European Commission under the NextGeneration EU programme and Leonardo Labs.
[1] Learning to prompt for continual learning Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2022 139–149
[2] Dualprompt: Complementary prompting for rehearsal-free continual learning European Conference on Computer Vision 2022 631–648 Springer
[3] Lifelong machine learning Springer 2018 1
[4] Expert gate: Lifelong learning with a network of experts Proceedings of the IEEE conference on computer vision and pattern recognition 2017 3366–3375
[5] Efficient lifelong learning with a-gem arXiv preprint arXiv:1812.00420 2018
[6] Continual lifelong learning with neural networks: A review Neural networks 2019 113 54–71
[7] Encoder based lifelong learning Proceedings of the IEEE international conference on computer vision 2017 1320–1328
[8] Catastrophic interference in connectionist networks: The sequential learning problem Psychology of learning and motivation Elsevier 1989 24 109–165
[9] New Insights on Reducing Abrupt Representation Change in Online Continual Learning 2022
[10] Continual learning for predictive maintenance: Overview and challenges Intelligent Systems with Applications 2023 200251
[11] On the Stability-Plasticity Dilemma of Class-Incremental Learning 2023
[12] Three scenarios for continual learning arXiv preprint arXiv:1904.07734 2019
[13] A Unified Approach to Domain Incremental Learning with Memory: Theory and Algorithm 2023
[14] A comprehensive survey of continual learning: Theory, method and application IEEE Transactions on Pattern Analysis and Machine Intelligence 2024
[15] A comprehensive empirical evaluation on online continual learning Proceedings of the IEEE/CVF International Conference on Computer Vision 2023 3518–3528
[16] Class-Incremental Learning with Repetition 2023
[17] Rainbow Memory: Continual Learning with a Memory of Diverse Samples 2021
[18] Cutmix: Regularization strategy to train strong classifiers with localizable features Proceedings of the IEEE/CVF international conference on computer vision 2019 6023–6032
[19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 2019 4171–4186
[20] Language models are few-shot learners Advances in neural information processing systems 2020 33 1877–1901
[21] Attention Is All You Need 2023
[22] Language Models as Knowledge Bases? 2019
[23] Scaling Laws for Neural Language Models 2020
[24] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer 2023
[25] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale 2021
[26] Emergent abilities of large language models arXiv preprint arXiv:2206.07682 2022
[27] On the Domain Adaptation and Generalization of Pretrained Language Models: A Survey 2022
[28] A Comprehensive Survey on Transfer Learning 2020
[29] On the effectiveness of parameter-efficient fine-tuning Proceedings of the AAAI Conference on Artificial Intelligence 2023 37 11 12799–12807
[30] Deep class-incremental learning: A survey arXiv preprint arXiv:2302.03648 2023
[31] Continual Learning with Pre-Trained Models: A Survey Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 2024 Kate Larson 8363–8371 International Joint Conferences on Artificial Intelligence Organization Survey Track 10.24963/ijcai.2024/924
[32] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017.
[33] Learning without forgetting IEEE transactions on pattern analysis and machine intelligence 2017 40 12 2935–2947
[34] Riemannian walk for incremental learning: Understanding forgetting and intransigence Proceedings of the European conference on computer vision (ECCV) 2018 532–547
[35] Rotate your networks: Better weight consolidation and less catastrophic forgetting 2018 24th International Conference on Pattern Recognition (ICPR) 2018 2262–2268 IEEE
[36] Continual learning through synaptic intelligence International conference on machine learning 2017 3987–3995 PMLR
[37] Memory aware synapses: Learning what (not) to forget Proceedings of the European conference on computer vision (ECCV) 2018 139–154
[38] Overcoming catastrophic forgetting by incremental moment matching Advances in neural information processing systems 2017 30
[39] Distilling the Knowledge in a Neural Network 2015
[40] icarl: Incremental classifier and representation learning Proceedings of the IEEE conference on Computer Vision and Pattern Recognition 2017 2001–2010
[41] Learning without Memorizing 2019
[42] Overcoming catastrophic forgetting with unlabeled data in the wild Proceedings of the IEEE/CVF International Conference on Computer Vision 2019 312–321
[43] Learning a unified classifier incrementally via rebalancing Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2019 831–839
[44] Learning without memorizing Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2019 5138–5146
[45] Minsoo Kang, Jaeyoo Park, and Bohyung Han. Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation. In CVPR, 2022.
[46] Co-transport for class-incremental learning Proceedings of the 29th ACM International Conference on Multimedia 2021 1645–1654
[47] Memory Population in Continual Learning via Outlier Elimination Proceedings of the IEEE/CVF International Conference on Computer Vision 2023 3481–3490
[48] PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning 2020
[49] Yaoyao Liu, Bernt Schiele, and Qianru Sun. Adaptive aggregation networks for class-incremental learning. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2021.
[50] RMM: Reinforced Memory Management for Class-Incremental Learning 2023
[51] Large Scale Incremental Learning 2019
[52] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[53] Piggyback: Adapting a single network to multiple tasks by learning to mask weights Proceedings of the European conference on computer vision (ECCV) 2018 67–82
[54] HAT-CL: A Hard-Attention-to-the-Task PyTorch Library for Continual Learning 2024
[55] Forget-free Continual Learning with Winning Subnetworks Proceedings of the 39th International Conference on Machine Learning 2022 Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan 162 Proceedings of Machine Learning Research 10734–10750 PMLR
[56] Helpful or Harmful: Inter-task Association in Continual Learning Computer Vision – ECCV 2022 2022 Avidan, Shai and Brostow, Gabriel and Cissé, Moustapha and Farinella, Giovanni Maria and Hassner, Tal 519–535 Springer Nature Switzerland
[57] PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning 2018
[58] Continual Learning via Neural Pruning 2019
[59] Continual Learning with Node-Importance based Adaptive Group Sparse Regularization 2021
[60] Progressive neural networks arXiv preprint arXiv:1606.04671 2016
[61] Pathnet: Evolution channels gradient descent in super neural networks arXiv preprint arXiv:1701.08734 2017
[62] Efficient Continual Learning with Modular Networks and Task-Driven Priors 2021
[63] Gradient Episodic Memory for Continual Learning 2022
[64] Efficient Lifelong Learning with A-GEM 2019
[65] Layerwise Optimization by Gradient Decomposition for Continual Learning 2021
[66] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, August 2019.
[67] Orthogonal Gradient Descent for Continual Learning, 2019.
[68] Orthogonal Subspace Learning for Language Model Continual Learning, 2023.
[69] Gradient Projection Memory for Continual Learning, 2021.
[70] Flattening Sharpness for Dynamic Gradient Projection Memory Benefits Continual Learning, 2021.
[71] TRGP: Trust Region Gradient Projection for Continual Learning, 2022.
[72] Meta-Learning Representations for Continual Learning, 2019.
[73] Learning to Continually Learn, 2020.
[74] Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference, 2019.
[75] iTAML: An Incremental Task-Agnostic Meta-learning Approach, 2020.
[76] La-MAML: Look-ahead Meta Learning for Continual Learning, 2020.
[77] Online Fast Adaptation and Knowledge Accumulation: A New Approach to Continual Learning, 2021.
[78] Meta-Consolidation for Continual Learning, 2020.
[79] Optimizing Reusable Knowledge for Continual Learning via Metalearning, 2021.
[80] Universal Language Model Fine-tuning for Text Classification, 2018.
[81] Energy and Policy Considerations for Deep Learning in NLP, 2019.
[82] Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[83] Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
[84] Masking as an efficient alternative to finetuning for pretrained language models. arXiv preprint arXiv:2004.12406, 2020.
[85] Parameter-Efficient Transfer Learning for NLP. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 2019.
[86] Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey, 2024.
[87] AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, 2022.
[88] AdapterFusion: Non-Destructive Task Composition for Transfer Learning, 2021.
[89] MerA: Merging Pretrained Adapters for Few-Shot Learning, 2023.
[90] AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning, 2022.
[91] LoRA: Low-Rank Adaptation of Large Language Models, 2021.
[92] The Power of Scale for Parameter-Efficient Prompt Tuning, 2021.
[93] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 5:208–215, 2024.
[94] Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021.
[95] Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning, 2023.
[96] QLoRA: Efficient Finetuning of Quantized LLMs, 2023.
[97] DoRA: Weight-Decomposed Low-Rank Adaptation, 2024.
[98] RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation, 2024.
[99] VeRA: Vector-based Random Matrix Adaptation, 2024.
[100] The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[101] RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[102] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[103] BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[104] GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[105] Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
[106] Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
[107] Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu, and Lei Shu. Achieving forgetting prevention and knowledge transfer in continual learning. arXiv preprint arXiv:2112.02706, 2021.
[108] CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023.
[109] Progressive Prompts: Continual Learning for Language Models. In International Conference on Learning Representations, 2023.
[110] Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning, 2024.
[111] Adapter Learning in Pretrained Feature Extractor for Continual Learning of Diseases, 2023.
[112] Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters, 2024.
[113] Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning, 2024.
[114] Qiangpu Chen, Jiahua Huang, Wushao Wen, Qingling Li, Rumin Zhang, and Jinghui Qin. Cat: Continual adapter tuning for aspect sentiment classification. Neurocomputing, 580:127423, 2024.
[115] Adaptive Adapter Routing for Long-Tailed Class-Incremental Learning, 2024.
[116] ATLAS: Adapter-Based Multi-Modal Continual Learning with a Two-Stage Learning Strategy, 2024.
[117] Expand and Merge: Continual Learning with the Guidance of Fixed Text Embedding Space. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024. doi:10.1109/IJCNN60899.2024.10650910.
[118] InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning, 2024.
[119] Federated Class-Incremental Learning with Prototype Guided Transformer. arXiv preprint arXiv:2401.02094, 2024.
[120] Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models, 2024.
[121] Zeinab Borhanifard, Heshaam Faili, and Yadollah Yaghoobzadeh. Combining replay and lora for continual learning in natural language understanding. Computer Speech & Language, 90:101737, 2025.
[122] Rajas Chitale, Ankit Vaidya, Aditya Kane, and Archana Santosh Ghotkar. Task arithmetic with LoRA for continual learning. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023.
[123] James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion with STAMINA: Stack-and-mask incremental adapters. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1744–1754, 2024.
[124] Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning, 2024.
[125] Introducing language guidance in prompt-based continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11463–11473, 2023.
[126] Learning multiple layers of features from tiny images, 2009.
[127] Prompt Gradient Projection for Continual Learning. In The Twelfth International Conference on Learning Representations, 2024.
[128] Gradient Projection Memory for Continual Learning. In International Conference on Learning Representations, 2021.
[129] S-Prompts learning with pre-trained transformers: An Occam’s razor for domain incremental learning. Advances in Neural Information Processing Systems, 35:5682–5695, 2022.
[130] POP: Prompt Of Prompts for Continual Learning. arXiv preprint arXiv:2306.08200, 2023.
[131] Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017.
[132] Don't forget, there is more than forgetting: New metrics for continual learning. arXiv preprint arXiv:1810.13166, 2018.
[133] Do pre-trained models benefit equally in continual learning? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6485–6493, 2023.
[134] BYOL for audio: Exploring pre-trained general-purpose audio representations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:137–151, 2022.
[135] UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[136] Language Semantic Graph Guided Data-Efficient Learning. Advances in Neural Information Processing Systems, 36, 2024.
[137] Zero-shot learning through cross-modal transfer. Advances in Neural Information Processing Systems, 26, 2013.
[138] Editing Models with Task Arithmetic, 2023.
[139] TIES-Merging: Resolving Interference When Merging Models, 2023.
[140] Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, 2024.
[141] MagMax: Leveraging Model Merging for Seamless Continual Learning, 2024.
[142] Representation Surgery for Multi-Task Model Merging, 2024.
[143] Eric Nuertey Coleman, Luigi Quarantiello, Julio Hurtado, and Vincenzo Lomonaco. Adaptive LoRA merging for efficient domain incremental learning. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024.
[144] Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging, 2025.
[145] Parameter-Efficient Transfer Learning for NLP, 2019.
[146] Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2024.
[147] The CLEAR benchmark: Continual learning on real-world imagery. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[148] CLAD: A realistic continual learning benchmark for autonomous driving. Neural Networks, 161:659–669, 2023.
[149] KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14111–14121, 2021.
[150] Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. Advances in Neural Information Processing Systems, 31, 2018.
[151] Symbolic replay: Scene graph as prompt for continual learning on VQA task. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):1250–1259, 2023.
[152] Neural logic machines. arXiv preprint arXiv:1904.11694, 2019.
[153] Hierarchical motion understanding via motion programs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6568–6576, 2021.
[154] Learning to describe scenes with programs. In International Conference on Learning Representations, 2019.
[155] VQACL: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023.
[156] Neuro-symbolic continual learning: Knowledge, reasoning shortcuts and concept rehearsal. arXiv preprint arXiv:2302.01242, 2023.
[157] DeepProbLog: Neural probabilistic logic programming. Advances in Neural Information Processing Systems, 31, 2018.
[158] Continual learning: Applications and the road forward. arXiv preprint arXiv:2311.11908, 2023.
[159] Online continual learning without the storage constraint. arXiv preprint arXiv:2305.09253, 2023.
[160] Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022.
[161] Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
[162] Beyond supervised continual learning: A review. arXiv preprint arXiv:2208.14307, 2022.
[163] Semi-supervised and unsupervised deep visual learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.