1. Introduction
Deep learning has emerged as a powerful and revolutionary technology, improving the quality of life across various fields. The demand for proficient deep learning models has surged, driven by the availability of vast datasets. However, this pursuit has raised concerns about data privacy during the training process. One attractive method to mitigate these concerns is federated learning [1]. In federated learning, each client trains its own model and shares only the trained model's parameters with the server, without sending any raw data. Since the server only receives model parameters, it cannot access the clients' data directly; federated learning therefore enables training that preserves the privacy of client data. While federated learning mitigates privacy risks, it is not without drawbacks. Compared to conventional learning methods, it often suffers performance degradation due to client heterogeneity, and model aggregation incurs data transfer costs. In this paper, we propose a novel methodology aimed at alleviating these transmission challenges, with a specific focus on reducing the amount of data that must be transmitted. Our approach uses adapters to improve transmission efficiency, thereby addressing a key limitation of federated learning.
In general, federated learning, which combines the results from multiple clients, often exhibits lower performance than traditional centralized learning methods. This performance degradation can be attributed to the heterogeneity of the systems and data involved. Since clients train models on their own devices (systems) and individual data, variations in device performance lead to differences in learning speed; in extreme cases, some clients cannot participate in the learning process at all because of their performance limitations. Furthermore, in federated learning each client trains on its own diverse data, and the quantity and quality of data held by each client may differ. For example, in tasks that require various labels, certain clients may lack data for some labels, which degrades the quality of the aggregated model.
Many methodologies have been proposed to address the heterogeneity of federated learning, typically by improving its aggregation and training procedures. Weight-based approaches such as FedProx [2] and FedDyn [3] address heterogeneity at the level of model weights, while feature-based solutions such as FedUFO [4] and MOON [5] have also been proposed. Others attempt to solve the heterogeneity problem by improving global model performance, which led to the APFL algorithm [6]. In addition, incorporating pre-trained models into federated learning has proven to mitigate the performance degradation caused by data heterogeneity [7].
In [7], it was experimentally confirmed that pre-trained models alleviate various problems of federated learning without requiring any special aggregation method. Pre-trained models are models trained on general, large-scale datasets. In natural language processing, fine-tuning these pre-trained language models for downstream tasks through transfer learning has achieved state-of-the-art performance in most areas. Moreover, in federated learning, pre-trained models consistently outperform non-pre-trained deep learning models [7].
In the field of natural language processing, pre-training generally relies on large-scale language models. The recent rapid development of deep learning is closely related to the increase in model capacity: model sizes grow every year, which leads to increased performance. For example, the BERT model [8] proposed in 2018, a large-scale language model commonly used in natural language processing, has up to about 340 M parameters in its large variant. The T5 model [9] proposed in 2019 scales up to 11 B parameters, and the GPT-3 [10] and Megatron-Turing NLG [11] models reach 175 B and 530 B parameters, respectively. In federated learning, however, transmitting such large-scale language models is burdensome, since the trained parameters must be sent over the network. During each global epoch, the trained models are collected from all clients, aggregated, and then sent back. Uploading and downloading large-scale language models in every global epoch therefore poses challenges in terms of time and network resources.
In this paper, we propose a novel methodology to address these issues and reduce network transmission time in federated learning. The proposed method applies adapters [12], originally introduced for efficient transfer learning, to federated learning, allowing training to proceed with much less model transmission. We conduct experiments in natural language processing and computer vision to demonstrate that the proposed methodology significantly reduces network transmission time compared to existing approaches.
The main contributions of our paper are three-fold. First, we identify that pre-trained models can mitigate the data heterogeneity problem in federated learning but introduce a new challenge of large data transmission requirements. Second, we introduce the adapter mechanism, which trains large language models through small adapter modules; this effectively addresses the excessive data transmission of federated learning with transformer-based pre-trained models. Finally, we conduct extensive experiments on diverse federated learning datasets in both the natural language processing and computer vision domains to demonstrate the efficiency and performance of our proposal. The evaluation results show a reduction in training time of approximately 20–40% and a reduction in transmission of more than 98% compared to previous approaches.
The remainder of this paper is organized as follows. Section 2 provides an overview of related work. Section 3 elaborates on the design and details of the proposed approach. Section 4 presents the evaluation results. Section 5 discusses limitations, and Section 6 concludes the paper.
3. Methodology
This paper proposes the use of adapters in federated learning to improve transmission efficiency when training with large transformer-based pre-trained language models. Ref. [7] showed that using a large language model can solve various problems of federated learning: the problems caused by heterogeneity were alleviated and the performance of the global model improved. However, federated learning is trained through model transmission between the server and clients, and using a large model causes network overload. The method proposed in this paper uses adapters to solve this transmission problem, because adapters allow the large language model to be trained with far fewer parameters. Experiments on NLP and CV tasks are conducted to confirm the efficiency and performance of the methodology, and the reduction in transmission is also measured.
One of the biggest issues in federated learning is performance degradation due to data heterogeneity and system heterogeneity. The study in [7] showed that pre-trained large language models can alleviate this problem. However, pre-trained large language models have a high model capacity, and using a large-capacity deep learning model in federated learning requires a very large amount of transmission: the parameters of all clients must be uploaded and downloaded at each global step, so the total transmission grows in proportion to the model size, the number of clients, and the number of global epochs.
For example, popular transformer models in natural language processing and computer vision, such as BERT-base [8] and ViT-base [17], have sizes of about 440 MB and 330 MB, respectively. In federated learning, the trained model parameters must be transmitted to and received from each client at every global epoch, so repeatedly uploading and downloading large model parameters becomes problematic. For instance, if 10 clients perform federated learning for 30 global epochs using BERT-base, the total transmission amounts to approximately 264,000 MB, or roughly 260 GB.
The amount of transmission in federated learning is calculated as shown in Equation (1). Federated learning must upload and download the model to and from all clients at every global step. Therefore, the total transmission amount T is obtained by multiplying the number of global epochs E, the number of clients N, and the transmitted model size C, and then doubling the result to account for both upload and download:

T = 2 × E × N × C.   (1)
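As a concrete illustration of Equation (1) and the BERT-base example above, the following minimal Python sketch (our own, purely illustrative; the adapter payload size used is a hypothetical placeholder, not a measured value) computes the total transmission for a full-model exchange and for an adapter-only exchange:

```python
def total_transmission_mb(global_epochs: int, num_clients: int, model_size_mb: float) -> float:
    """Equation (1): in every global epoch each client uploads and downloads
    a payload of size C, so T = 2 * E * N * C."""
    return 2 * global_epochs * num_clients * model_size_mb


# BERT-base example from the text: 30 global epochs, 10 clients, ~440 MB model.
full_model = total_transmission_mb(30, 10, 440.0)   # 264,000 MB
# With adapters, only the adapter layers and the classification head are exchanged
# after the one-time full-model download; 8 MB is an illustrative placeholder size.
adapter_only = total_transmission_mb(30, 10, 8.0)    # 4,800 MB
print(full_model, adapter_only)
```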
Hence, we propose using adapters to reduce the size of the model parameters that need to be transmitted.
The overall structure of the proposed methodology is illustrated in Figure 1 and consists of three main steps. The first step is the preparation and download of the pre-trained model: before the first global step of federated learning, the pre-trained model is downloaded so that every client has the same model structure and parameter values. The second step is client model training and upload: each client trains its local model and then uploads only the trained adapter and classification head to the server. The final step is the aggregation and download of the models: the parameter values uploaded in the second step are aggregated into the global model, and each client then downloads the aggregated global model to start the next global step. Each step is described in Section 3.2, Section 3.3 and Section 3.4, respectively, and a high-level sketch of the resulting training loop is given below.
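The following Python-style outline summarizes the three steps of one training run. All names (load_full_model, train_local, trainable_state, fedavg, and so on) are hypothetical placeholders used only to convey the control flow, not the API of any specific framework:

```python
def federated_training_with_adapters(server, clients, global_epochs):
    # Step 1 (Section 3.2): one-time download of the full pre-trained model.
    for client in clients:
        client.load_full_model(server.pretrained_model)

    for _ in range(global_epochs):
        updates = []
        # Step 2 (Section 3.3): local training; the backbone stays frozen, and
        # only the adapter layers and classification head are uploaded.
        for client in clients:
            client.train_local()
            updates.append(client.trainable_state())

        # Step 3 (Section 3.4): FedAvg aggregation, then download of the small
        # aggregated update to every client.
        global_update = server.fedavg(updates)
        for client in clients:
            client.load_trainable_state(global_update)
```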
3.1. Pre-Trained Model with Adapter
In this section, we describe the deep learning model used in this paper. Firstly, the paper employs a pre-trained large language model. A pre-trained language model is a model that has been trained on a large-scale dataset and typically consists of pre-trained transformer layers and embedding layers. The pre-trained large language model is fine-tuned to fit the downstream task, and a classification head is added to classify the labels of that task. The classification head is typically implemented as a one-layer feed-forward network and outputs the probability of each label in the downstream task. For example, in next-word prediction it outputs, for every vocabulary word, the probability that the word is the next word; in image classification it outputs probabilities for all candidate categories.
Ref. [12] confirms that training with adapter layers can achieve a performance similar to full fine-tuning. In this paper, we reduce the number of trainable parameters by using adapter layers. The adapter mechanism trains only the adapter layers and the classification head while freezing the pre-trained language model; as a result, it can achieve a similar performance with a small amount of training resources. The adapter layer, trained using adapter methods such as LoRA [16], Houlsby [12], and Pfeiffer [15], is an additional layer used to adapt the language model. In other words, the model used for federated learning in this paper consists of an embedding layer, transformer layers, adapter layers, and a classification head, and, following [12], the embedding and transformer layers are frozen during training.
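As one possible instantiation of such a model (not necessarily the exact configuration used in our experiments), the sketch below attaches LoRA [16] adapter layers and a trainable classification head to a frozen BERT backbone using the HuggingFace transformers and peft libraries; the rank, scaling, and target modules are illustrative assumptions:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Pre-trained backbone (embedding + transformer layers) with a classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4  # num_labels depends on the downstream task
)

# Attach LoRA adapter layers; r, lora_alpha, and target_modules are example choices.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT self-attention projections
    modules_to_save=["classifier"],     # keep the classification head trainable
)
model = get_peft_model(model, lora_config)

# The embedding and transformer weights are frozen; only the adapter layers and
# the classification head remain trainable.
model.print_trainable_parameters()
```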
3.2. Prepare Pre-Trained Language Model
In this section, we provide a detailed explanation of the first step shown in Figure 1, which prepares the model for client training. In our proposed federated learning approach, each client performs downstream-task training using a pre-trained large language model with adapters. Federated learning uses a global model and local models: the global model is the deep learning model owned by the server, and a local model is the deep learning model owned by each client. In federated learning, each client trains its local model on its own dataset, and the server aggregates the local models into the global model. Therefore, every local model and the global model must have the same model structure, and they must also share the same parameter values at the beginning of local training.
In summary, in the proposed methodology all clients receive the full model (the pre-trained large language model) from the server before federated learning starts. Note that this is a one-time download, which ensures that all clients begin federated learning with the same pre-trained parameters.
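A minimal sketch of this one-time initialization, assuming a PyTorch state-dict exchange (the network transport itself is left abstract, and the model and file names are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Server side: serialize the full pre-trained model once, before the first global step.
server_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)
torch.save(server_model.state_dict(), "initial_full_model.pt")
# ... the checkpoint is transferred to every client (transport not shown) ...

# Client side: every client loads the same checkpoint, so all local models start
# from identical pre-trained parameters.
client_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)
client_model.load_state_dict(torch.load("initial_full_model.pt"))
```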
3.3. Train Local Model
The next step involves clients training their local models and uploading them to the server. Each client fine-tunes its local model for the downstream task and, after the local training epochs, uploads the resulting model parameter values to the server. However, using a pre-trained large language model in conventional federated learning requires a large transmission capacity for this upload, because conventional federated learning transmits the full model between the server and clients. Pre-trained large language models typically occupy hundreds of MB or several GB, which requires excessive transmission. In addition, the server must receive the trained parameter values from all clients, which creates a network bottleneck.
In this paper, training is conducted with adapters to reduce transmission. When training with adapters, as explained in Section 3.1, the pre-trained model is frozen: the parameter values of the transformer and embedding layers do not change, because only the adapter layers and the classification head are trained. In this step, each client trains its local model on its own dataset for the configured number of local epochs, so only the adapter layers and the classification head are updated. Therefore, clients send only the adapter layers and the classification head to the server. Compared to the transformer and embedding layers, these components are very small, which greatly improves transmission efficiency.
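A simplified sketch of this client-side step is given below, assuming a PyTorch model in which only the adapter and classification-head parameters have requires_grad=True (the data loader, loss, and hyperparameters are placeholders):

```python
import torch

def train_and_extract_update(model, dataloader, local_epochs, lr=1e-4):
    """Train only the unfrozen parameters (adapter layers + classification head)
    and return just those parameters for upload to the server."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    model.train()
    for _ in range(local_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # HuggingFace-style models return .loss
            loss.backward()
            optimizer.step()

    # Upload only the small trainable part, not the frozen backbone.
    return {name: param.detach().cpu()
            for name, param in model.named_parameters() if param.requires_grad}
```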
3.4. Aggregation into Global Model
Lastly, the server aggregates the trained parameters, and the clients download them. The final step aggregates the parameter values of the adapter layers and classification heads uploaded by the clients in the second step, and the aggregated values are written into the global model. This completes one global step of global-model training. For aggregation, we use FedAvg [1], the most basic method, which averages the parameter values of the local models. At this point, the server aggregates only the adapter layers and classification heads, because the transformer and embedding layers did not change during the clients' training step.
After one global step is completed, all clients must synchronize their model parameter values before starting the next global step, so all clients download the global model learned on the server. Only the adapter layers and the classification head need to be downloaded, because the transformer and embedding layers did not change during aggregation. Downloading only these parameter values further improves transmission efficiency.
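A minimal FedAvg sketch over the uploaded parameter dictionaries is shown below; equal client weighting is assumed here, although weighting by local dataset size is a common variant:

```python
import torch

def fedavg(client_updates):
    """Average the adapter/classification-head parameters uploaded by the clients.
    `client_updates` is a list of {parameter_name: tensor} dictionaries."""
    averaged = {}
    for name in client_updates[0]:
        averaged[name] = torch.stack([u[name] for u in client_updates]).mean(dim=0)
    return averaged

# The server overwrites only these parameters in the global model; the frozen
# backbone keys are absent, so strict=False leaves them untouched, e.g.:
#   global_model.load_state_dict(fedavg(updates), strict=False)
```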
The proposed methodology solves the problem of increased network transmission when pre-trained large language models are used in federated learning. Instead of training the transformer and embedding layers, which account for most of the capacity of the pre-trained language model, the proposed approach trains and transmits only the much smaller classification head and adapter layers. As a result, the proposed method reduces the network transmission and the number of parameters to be trained, and it potentially decreases the overall training time.
5. Discussion and Limitations
The evaluation of the proposed method in this paper revolves around two primary issues. Firstly, it addresses whether the use of the adapter mechanism can effectively decrease both data transmission and learning time. Secondly, it investigates whether this reduction in training time and data transmission can be achieved without performance degradation. These aspects are rigorously examined through experiments conducted on federated learning datasets from both computer vision and natural language processing.
This paper validates the reduction in training time and data transmission, as detailed in Section 4.4.2. Adapters reduce the size of the model to be transferred by up to 98%. We mathematically calculated the decrease in transmission caused by this smaller model size, which translates into a reduction in transmission time of about 98% during training. Furthermore, the use of adapters reduces the training time of the local model by minimizing the number of layers that need to be trained. In Section 4.4.1, we experimentally measure the reduction in training time, excluding transmission time, in a local environment; this reduction is about 20%. These results confirm that the proposed methodology reduces both training time and transmission time. In addition, Section 4.3.1 shows that no significant performance degradation occurs despite the reduced training time: the NLP experiments generally show a slight performance degradation, while the CV experiments generally show performance improvements. We therefore conclude from the experimental results that the proposed method reduces training time without significant performance degradation.
In this paper, training-time experiments for CV tasks were not reported, which prevented us from confirming the reduction in training time for these applications. Despite our efforts, we were unable to observe a decrease in training time regardless of whether the adapter was used, because of the discrepancy between the speed of CPU-based image pre-processing and GPU-based training; our computational resources did not allow us to bridge this gap effectively. In addition, we note that the experiments were conducted solely in a local environment. Due to these limitations, we did not perform transmission-speed experiments on an actual network and instead calculated the reduction in transmission based on our experiments in the local setting. Addressing these constraints in future studies will provide a more comprehensive understanding of the proposed methodology's applicability and effectiveness across diverse scenarios.
Furthermore, note that our methodology operates exclusively during the fine-tuning process and is not applicable in the pre-training phase, because the adapter mechanism presupposes an already pre-trained large language model whose full parameters must be trained beforehand. Consequently, even though pre-training demands the most extensive dataset, the proposed method cannot be employed during this phase. Additionally, while this paper successfully reduces the training time, a slight performance degradation is observed in the NLP tasks. For future work, research is needed on reducing both the training time and the data transmission during the full training of large language models in federated learning, and on eliminating the performance degradation in NLP tasks through further advances in adapter mechanisms.
6. Conclusions
In this paper, we addressed the problem of increased transmission time caused by pre-trained large language models in federated learning. To overcome this issue, we proposed and evaluated a federated learning approach using adapters, which have previously been suggested as an efficient fine-tuning method. As a result, the transmission time was reduced by about 98% compared to the methodology using the pre-trained large language model without adapters. In addition, the training time was reduced by 20–40% as the number of parameters to be learned decreased, while the predictive performance remained similar. This confirms that time-efficient federated learning is possible without performance degradation when adapters are used in federated learning with a large language model, such as in [7]. The proposed methodology also showed lower transmission sizes than traditional federated learning without a large language model and, because it uses a large language model, achieved higher predictive performance than traditional federated learning. This confirms that the proposed methodology can yield performance improvements with the same or a lower transmission amount than traditional federated learning.
The significance of our proposed method lies in its ability to improve the transmission efficiency of federated learning, which enables the use of large language models, such as ChatGPT powered by the GPT-3 model, in real-world federated learning environments. While such large language models have shown impressive performance, it is practically challenging to use GPT-3 in an actual federated learning environment, mainly because of the substantial time and transmission costs incurred when clients with limited computational resources train and transmit the large GPT-3 model. In contrast, the proposed methodology offers an attractive way to significantly reduce the transmission costs, and our experiments showed that the training time is also partially reduced. In summary, our proposal stands as a key enabler for using large models in real-world federated learning environments.