#### *3.3. User Adaptation with Multiple Kernel Variant of Maximum Mean Discrepancies*

Federated learning addresses data availability and privacy, but personalization remains an important problem: even when the cloud model can be used directly, it may still perform poorly on a particular house. The weights of the network are pretrained by the federated learning process, and the user adaptation process then fine-tunes this pre-trained network. Since the network does not need to learn all weights from scratch for new tasks, it costs less computation and time, which is especially suitable for edge devices.

Figure 3 shows the architecture of the proposed network. It is a classic hybrid of a convolutional neural network (CNN) and Bi-directional Long Short-Term Memory (BiLSTM), referred to as CNN-LSTM; more details can be found in [31]. The network has a two-stream architecture, so the source data and the target data can be fed into it simultaneously. Both streams pass from the CNN layers to the BiLSTM layers and finally through the fully connected (FC) layers to compute the forward loss. The CNN sections extract low-level features from a series of load values, while the BiLSTM captures sequential relationships.

To minimize domain discrepancy, a domain loss is also introduced to optimize the network. MK-MMD is used to measure the domain loss, aligning the source data with the target data. Multiple kernels *k* are used to adapt to different feature domains, and the hidden representations of the higher layers are embedded in a reproducing kernel Hilbert space (RKHS), where the mean embeddings of the distributions of different users' data can be explicitly matched. The MK-MMD loss is defined in Formula (10) [23]:

**Figure 3.** The architecture of proposed network, from top to bottom, consists of CNN layers, BiLSTM layers and fully connected layers.

$$\mathcal{L}\_{MK-MMD}(X\_{\mathcal{S}}, X\_T) = \left\| \frac{1}{|X\_{\mathcal{S}}|} \sum\_{\mathbf{x}\_s \in X\_{\mathcal{S}}} \phi(\mathbf{x}\_s) - \frac{1}{|X\_T|} \sum\_{\mathbf{x}\_t \in X\_T} \phi(\mathbf{x}\_t) \right\|\_{\mathcal{H}}^2 \tag{10}$$

where *x<sub>s</sub>* ∈ *X<sub>S</sub>* denote data points from the source datasets and *x<sub>t</sub>* ∈ *X<sub>T</sub>* denote data points from the datasets of the houses to be adapted. Gaussian kernels are selected as the kernel function *k* in this paper since they can map features to infinite dimensions. We use a combination of Gaussian kernels obtained by varying the bandwidth *γ* with a multiplicative step size of 2<sup>1/2</sup>. The Gaussian kernel function with bandwidth *γ* is defined in Formula (11):

$$k(\mathbf{x}\_s, \mathbf{x}\_t) = \mathbf{e}^{-\frac{\|\mathbf{x}\_s - \mathbf{x}\_t\|^2}{\gamma}} \tag{11}$$
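As an illustration, the MK-MMD loss of Formulas (10) and (11) can be estimated with a bank of Gaussian kernels as sketched below; the bandwidth values, batch size and feature dimension are illustrative assumptions, not settings from this paper.

```python
import torch

def gaussian_kernel(a, b, gamma):
    # k(x_s, x_t) = exp(-||x_s - x_t||^2 / gamma), computed pairwise (Formula (11))
    dist = torch.cdist(a, b, p=2) ** 2
    return torch.exp(-dist / gamma)

def mk_mmd(source, target, gammas):
    """Multi-kernel MMD between two feature batches (Formula (10)),
    summed over a bank of Gaussian bandwidths."""
    loss = 0.0
    for g in gammas:
        k_ss = gaussian_kernel(source, source, g).mean()
        k_tt = gaussian_kernel(target, target, g).mean()
        k_st = gaussian_kernel(source, target, g).mean()
        # squared RKHS distance between the two mean embeddings
        loss = loss + k_ss + k_tt - 2 * k_st
    return loss

# bandwidths spaced by a multiplicative factor of 2**0.5 (base value assumed)
gammas = [1.0 * (2 ** 0.5) ** i for i in range(5)]
xs = torch.randn(32, 64)  # source-domain features
xt = torch.randn(32, 64)  # target-domain features
loss = mk_mmd(xs, xt, gammas)
```

Since each term is a squared distance between mean embeddings, the loss is nonnegative and vanishes when source and target batches coincide.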

Let *η* denote the trade-off parameter; the total loss function of the network during user adaptation is then computed by Formula (12):

$$\arg\min\_{\Theta\_U} \mathcal{L}\_U = \sum\_{i=1}^{n} \ell(y\_i, f\_u(\mathbf{x}\_i)) + \sum\_{i=1}^{n^u} \ell(y\_i^u, f\_u(\mathbf{x}\_i^u)) + \eta \mathcal{L}\_{MK\text{-}MMD}(X\_{\mathcal{S}}, X\_T) \tag{12}$$

Since learned features transition from general to specific along the network as domain discrepancies increase, the lower CNN and BiLSTM layers capture general features that transfer across different houses. Hence, the parameters in the first dashed box in Figure 3 are frozen during user adaptation, whereas the weights of the FC layers are updated by the total loss in Formula (12).
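The freezing step can be sketched as follows; the layer sizes and module names (`cnn`, `lstm`, `fc`) are toy stand-ins for the network in Figure 3, not the actual implementation.

```python
import torch
import torch.nn as nn

class CnnLstm(nn.Module):
    """Toy CNN-BiLSTM-FC model standing in for the network in Figure 3."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Conv1d(1, 8, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(8, 16, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(32, 1)

    def forward(self, x):               # x: (batch, 1, steps)
        h = self.cnn(x)
        h, _ = self.lstm(h.transpose(1, 2))
        feat = h[:, -1]                 # features usable for the MK-MMD term
        return self.fc(feat), feat

model = CnnLstm()  # in practice, weights would come from the federated model

# Freeze the CNN and BiLSTM layers (first dashed box in Figure 3);
# only the FC weights are updated during user adaptation.
for p in model.cnn.parameters():
    p.requires_grad = False
for p in model.lstm.parameters():
    p.requires_grad = False

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
```

Only the FC parameters reach the optimizer, so the total loss of Formula (12) updates the head while the frozen feature extractor is reused across houses.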

#### *3.4. Learning Process and Summary*

The learning procedures of DFA are summarized in Algorithm 1. Furthermore, we can consider the algorithm as a general process applied to STLF and separate the procedures into two sections. Section 1, steps 1 to 8, is a federated learning process, while Section 2, step 9, is for transfer learning. Other federated learning methods (e.g., vertical federated learning) can replace the horizontal federated learning method in Section 1 to deal with heterogeneous features from diverse organizations. Meanwhile, other effective transfer learning methods can be embedded in Section 2 for better personalization. The neural network used in this framework can also be replaced according to the computing power of real-world devices or the features of the datasets.


#### *4.1. Datasets*

The experiments use smart meter readings from London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. Readings were taken at half-hourly intervals. The customers in the trial were recruited as a balanced sample representative of the Greater London population. The dataset contains the electricity consumption in kWh (per half hour), a unique household identifier, and the date and time [32]. As an example, a period of records from 4 households is shown in Figure 4. The records show different patterns, which means a general model is not suitable for forecasting electricity consumption for a particular house. Meteorological variables recorded in London, collected from the Dark Sky API [33], are introduced to enrich our datasets. We merge the electricity consumption and meteorological datasets on their timestamps to generate a new feature table for each household.
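The timestamp merge can be sketched with pandas; the column names (`timestamp`, `kwh`, `temperature`) and values are illustrative, not the dataset's actual schema.

```python
import pandas as pd

# Toy frames standing in for the consumption and weather tables.
load = pd.DataFrame({
    "timestamp": pd.date_range("2013-06-01", periods=4, freq="30min"),
    "kwh": [0.12, 0.08, 0.10, 0.15],
})
weather = pd.DataFrame({
    "timestamp": pd.date_range("2013-06-01", periods=4, freq="30min"),
    "temperature": [14.2, 14.0, 13.8, 13.9],
})

# Join on the shared timestamp to build the per-household feature table.
features = load.merge(weather, on="timestamp", how="inner")
```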

Discrete features (e.g., 'weekday', 'icon') are encoded as embedding features. Then all features are normalized with min–max normalization, as shown in Formula (13):

$$\hat{x}\_{i}^{j} = \frac{x\_{i}^{j} - x\_{i,\min}}{x\_{i,\max} - x\_{i,\min}} \tag{13}$$

where *x<sub>i</sub><sup>j</sup>* denotes the value of feature *i* at time step *j*, *x<sub>i,min</sub>* and *x<sub>i,max</sub>* denote the minimal and maximal values of feature *i*, respectively, and *x̂<sub>i</sub><sup>j</sup>* is the value of *x<sub>i</sub><sup>j</sup>* after normalization.
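A minimal sketch of Formula (13) applied column-wise to a feature table (the example values are arbitrary):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization per feature column (Formula (13))."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

table = np.array([[0.1, 14.0],
                  [0.3, 16.0],
                  [0.5, 18.0]])
normed = min_max_normalize(table)
```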

We treat the feature table as time-series data ordered by timestamp; each row of the table is a record sampled at half-hourly intervals. We implement a sliding window that looks back over 24 records to forecast the next record. Hence, the proposed network predicts the load value half an hour ahead; one training sample consists of the features of 24 records and the electricity consumption of the next record. The input dimension is |X| × *L*, where |X| is the number of selected features in the merged table X and *L* denotes the width of the sliding window.
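The sliding-window sample construction can be sketched as follows; treating consumption as the first column is an illustrative convention, not the paper's stated layout.

```python
import numpy as np

def make_samples(table, window=24):
    """Slide a window of `window` records over the feature table; each sample
    pairs the window with the consumption value of the next record."""
    xs, ys = [], []
    for start in range(len(table) - window):
        xs.append(table[start:start + window])      # features of 24 records
        ys.append(table[start + window, 0])         # next consumption value
    return np.stack(xs), np.array(ys)

table = np.random.rand(100, 5)   # 100 half-hourly records, 5 features
X, y = make_samples(table)       # X: (samples, L, |X|), y: (samples,)
```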

**Figure 4.** Load data of four houses for one day from the used datasets.

### *4.2. Implementation Information*

The proposed network is composed of two convolutional layers, two pooling layers, two BiLSTM layers and two FC layers. The network adopts a convolution size of 1 × 17 and a kernel size of 3 for the pooling layers. It is trained with the MSE loss and optimized by stochastic gradient descent (SGD) with an initial learning rate of 0.01 and a momentum of 0.9. The batch size is set to 32. Training is early stopped within 10 epochs, and the dropout rate is set to 0.1 to prevent overfitting.
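A sketch of a network consistent with the stated hyperparameters is given below; the channel counts and hidden sizes are assumptions, since the paper does not report them.

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    """Two conv layers (1x17 kernels), two pooling layers (kernel size 3),
    two BiLSTM layers and two FC layers, with dropout 0.1."""
    def __init__(self, n_features=5, hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 16, kernel_size=17, padding=8), nn.ReLU(),
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(16, 32, kernel_size=17, padding=8), nn.ReLU(),
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
        )
        self.lstm = nn.LSTM(32, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                # x: (batch, steps, features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.fc(h[:, -1])         # predict the next load value

net = CnnBiLstm()
out = net(torch.randn(8, 24, 5))         # a batch of 24-step windows
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()
```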

In the following experiments, cross validation and grid search are used to select the hyperparameters, and the hyperparameters with the lowest average forecasting MAPE are used. During training, we use 70% of the data for training while the remaining 30% is for evaluation. All experiments are repeated five times to ensure reliability, implemented in PyTorch, and conducted on a single NVIDIA GeForce RTX 2080 GPU.

A single machine is used to simulate the federated learning process, and we set the number of user nodes *N<sub>nodes</sub>* according to the experimental requirements. Table 1 shows the symbol definitions used in the experiments. Since a single machine simulates the federated process, the training process is serial; however, this does not affect the comparison of forecasting accuracy and computation time between the federated and centralized architectures. Centralized learning means data are gathered from all devices to train a single model on the central server, which does not protect the privacy of users.

**Table 1.** Symbol definitions of the experiments.


To evaluate the forecasting performance of DFA, four baseline models are used for comparison purposes. The following are simple introductions for these models.


#### *4.3. Model Evaluation Indexes*

The mean absolute percentage error (MAPE) is used to evaluate forecasting accuracy. It is defined in Formula (14):

$$\text{MAPE} = \frac{100\%}{N} \sum\_{i=1}^{N} \left| \frac{\hat{y}\_i - y\_i}{y\_i} \right| \tag{14}$$

where *y*ˆ*<sup>i</sup>* is the forecast load consumption value, *yi* is the actual load consumption value and *N* is the total number of sampling points for evaluation.

To evaluate whether a particular model *m* has skill with respect to a baseline model *r*, we use the skill score based on the MAE ratio, as shown in Formula (15):

$$s = 1 - \frac{\text{MAE}\_{m}}{\text{MAE}\_{r}} \tag{15}$$

where MAE is the mean absolute error. MAE is calculated as shown in Formula (16):

$$\text{MAE} = \frac{1}{N} \sum\_{i=1}^{N} |\hat{y}\_i - y\_i| \tag{16}$$
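The three evaluation measures in Formulas (14)–(16) can be implemented directly; the example predictions below are arbitrary.

```python
import numpy as np

def mape(y_hat, y):
    """Mean absolute percentage error (Formula (14))."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return np.mean(np.abs((y_hat - y) / y)) * 100.0

def mae(y_hat, y):
    """Mean absolute error (Formula (16))."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return np.mean(np.abs(y_hat - y))

def skill_score(y_hat_m, y_hat_r, y):
    """Skill of model m relative to baseline r (Formula (15))."""
    return 1.0 - mae(y_hat_m, y) / mae(y_hat_r, y)

y      = [1.0, 2.0, 4.0]   # actual loads
pred_m = [1.1, 1.9, 4.2]   # model under evaluation
pred_r = [1.4, 2.4, 3.2]   # baseline
```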

#### *4.4. Experimental Forecasting Performance*

The proposed DFA and the four baseline models are evaluated on 10 randomly chosen target houses. For each target house, the load records from June 2012 to June 2013 are used as training data, and 720 load records from September 2013 are predicted to calculate MAPE values. DFA makes use of the datasets of all ten houses in the federated process and leverages the datasets from the target house for user adaptation. The baseline models are trained with the data from the target house only. Table 2 shows the MAPE values of DFA and the baseline models for the 10 houses, and Figure 5 shows the same MAPE values for direct observation.


**Table 2.** MAPE values of DFA and baseline models for 10 houses.

From Table 2, we can see that the proposed DFA consistently outperforms the baseline models on all ten houses. On average, it shows relative improvements of 38.28%, 69.83%, 58.81% and 63.65% over Transformer, DSHW, Encoder–Decoder and LSTM, respectively, based on skill scores. LSTM and Encoder–Decoder perform similarly to each other and worse than Transformer, since they have fewer parameters than Transformer. The performance of DSHW fluctuates widely and is inferior to the other, deep-learning-based models; we believe this is due to differences in the cyclical characteristics over different spans, which are influenced by many uncertain factors in residential loads. In summary, DFA has the best performance. One reason for this superiority is that DFA uses all the datasets from the ten houses to learn a model in the federated architecture. Additionally, we plot the MAPE values while varying the number of houses, as shown in Figure 6. The MAPE values of DFA gradually decrease as the number of houses increases, whereas those of the other models do not vary much. This means the model becomes more robust as more devices are connected to the system in practice. More discussion of the superior forecasting performance can be found in Section 4.6.

**Figure 5.** MAPE values of DFA and four baseline models for 10 houses.

**Figure 6.** MAPE values of four baseline models and DFA with different numbers of houses connected to the federated system for 10 houses.

To evaluate the persistence of DFA, we conduct day-ahead and week-ahead forecasting tasks with DFA and the four baseline models on one house; the results are shown in Table 3. Although the forecasting performance of DFA decreases as the period grows from one day to one week, DFA outperforms all baseline models regardless of the forecasting period. We attribute this decline to the fact that DFA uses a sliding window for training and forecasting: each value forecasted by DFA is appended to the end of the sliding window for the next forecast, so forecasting errors accumulate as the period grows.

**Table 3.** Forecasting MAPE values of DFA and baseline models for day-ahead and week-ahead.


#### *4.5. Performance of Federated and Centralized Architecture*

Table 4 shows the forecasting performance and computation time comparison of the federated and centralized architectures. CNN-LSTM, as shown in Figure 3, is chosen as the test model. The number of records in Table 4 indicates how many records from each node are used to train the model. For federated learning, the training time can only be estimated, as shown in Formula (17):

$$T\_{training} = \bar{T}\_{round} \cdot N\_{round} \tag{17}$$

where *T<sub>training</sub>* denotes the training time, *T̄<sub>round</sub>* indicates the average computation time across all devices involved in each round, and *N<sub>round</sub>* denotes the number of rounds.

From Table 4, it can be seen that the forecasting performance of the federated architecture is superior to that of the centralized one for STLF under different numbers of local records with *N<sub>nodes</sub>* = 10. As *N<sub>nodes</sub>* and the number of records increase, the federated architecture can use more data to train the model, and the forecasting performance improves.

The federated architecture also comprehensively outperforms the centralized architecture in computation time, because it lets the devices involved in each round train at the same time. The computation time fluctuates only slightly when *N<sub>nodes</sub>* increases: in each round, every device processes only its local data in parallel and is unaffected by data from the other devices in the system. Meanwhile, the computation time rises at a lower rate than that of the centralized method as the number of records increases, because in the centralized architecture the incremental data of every device must be collected for training, and this data increment is determined by *N<sub>nodes</sub>*.


**Table 4.** MAPE values and computation time of the federated and centralized architecture.

Figure 7 shows the relationship between accuracy and the number of rounds for the two architectures. The accuracy of federated learning rises faster at the beginning of the iterations, while the centralized accuracy rises slowly, because multiple devices compute simultaneously in each round of federated learning. As can be seen from the trend of the curves, the federated architecture needs fewer iterations to reach a satisfactory accuracy and converge, which is also reflected in the shorter computation times in Table 4.

**Figure 7.** Correlation between accuracy and number of rounds for the federated and centralized architecture.

Now, we analyze the communication overhead of these two architectures. For the federated architecture, the calculation of communication overhead is defined in Formula (18):

$$Trans\_{Fed} = 2N\_{round} \cdot N\_{nodes} \cdot S\_{model} \tag{18}$$

meanwhile, the communication overhead of the centralized architecture is defined as shown in Formula (19):

$$Trans\_{Cen} = N\_{nodes} \cdot S\_{data} \tag{19}$$

From the aforementioned formulas, the communication complexity of the federated architecture is O(*N<sub>round</sub>* · *N<sub>nodes</sub>* · *S<sub>model</sub>*) and that of the centralized architecture is O(*N<sub>nodes</sub>* · *S<sub>data</sub>*). Since *N<sub>nodes</sub>* appears in both expressions, it cancels in the comparison, leaving O(*N<sub>round</sub>* · *S<sub>model</sub>*) and O(*S<sub>data</sub>*), respectively. When *S<sub>data</sub>* is much larger than *N<sub>round</sub>* · *S<sub>model</sub>*, which is common in practical applications, the federated architecture has a lower communication burden than the centralized one. We can also infer that in reality the computation time will increase because of the additional communication overhead. In summary, DFA is scalable with increasing data and has lower computation time and communication bandwidth requirements.
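The overhead comparison in Formulas (18) and (19) can be checked numerically; the magnitudes below are assumed for illustration, not measured in the paper.

```python
def trans_fed(n_round, n_nodes, s_model):
    """Formula (18): each round, every node uploads and downloads the model."""
    return 2 * n_round * n_nodes * s_model

def trans_cen(n_nodes, s_data):
    """Formula (19): every node ships its raw data to the server once."""
    return n_nodes * s_data

# Illustrative magnitudes (assumed): a 5 MB model, 100 rounds, 10 nodes,
# and 10 GB of raw data per node.
fed = trans_fed(n_round=100, n_nodes=10, s_model=5e6)
cen = trans_cen(n_nodes=10, s_data=1e10)
```

Under these assumed magnitudes the model updates are an order of magnitude smaller than shipping the raw data, illustrating the S_data >> N_round · S_model regime.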

#### *4.6. Ablation and Extensibility Experiments*

To validate the superiority of DFA, we conduct ablation and extensibility experiments on the datasets of five houses; the other experimental settings are consistent with Section 4.4.

We use Fed to denote DFA without MK-MMD optimization, i.e., a CNN-LSTM network trained by the federated architecture. NoFed denotes the CNN-LSTM model trained by the centralized architecture with data only from the target house. We can see from Figure 8 that Fed achieves better performance than NoFed on each target house. This indicates that each target house benefits from the federated architecture, which makes it possible to leverage the datasets from other houses while simultaneously preserving privacy. It can also be seen that DFA shows remarkable performance improvements over Fed. We conclude that the transfer learning method successfully transfers knowledge from the federated model to the target houses to improve forecasting performance.

**Figure 8.** Ablation experiments of the federated architecture and MK-MMD optimization on 5 houses.

Furthermore, we extend DFA to variants in which the MK-MMD component is replaced by alternative transfer learning methods. Maximum mean discrepancy (MMD) is the single-kernel version of MK-MMD. CORAL [34] is a transfer learning method that uses the covariance matrices of the source and target features to compute the domain loss. As Figure 9 shows, DFA achieves satisfying forecasting performance with the different transfer learning methods. The results indicate that DFA can be extended with other transfer learning algorithms according to the real application.
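A common formulation of the CORAL loss is sketched below; this follows the standard covariance-alignment definition rather than any implementation detail reported in the paper, and the batch size and feature dimension are illustrative.

```python
import torch

def coral_loss(source, target):
    """CORAL domain loss: squared Frobenius distance between the
    covariance matrices of source and target feature batches."""
    d = source.size(1)

    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)

    return ((cov(source) - cov(target)) ** 2).sum() / (4 * d * d)

xs = torch.randn(32, 16)  # source-domain features
xt = torch.randn(32, 16)  # target-domain features
loss = coral_loss(xs, xt)
```

Like MK-MMD, this loss can replace the domain term in Formula (12) without changing the rest of the adaptation procedure.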

**Figure 9.** Extensibility experiments with alternative transfer learning methods on 5 houses.

#### **5. Conclusions**

In this paper, we propose a federated transfer learning approach for residential STLF. This approach addresses data availability and privacy by using a federated architecture. We implement a transfer learning method, the multiple kernel variant of maximum mean discrepancies, to adapt to the non-IID data among different houses. The experimental results show that DFA achieves a substantial improvement in forecasting performance over the other models. We also evaluate the federated architecture used by DFA; the results show that it is superior to the centralized architecture in computation time and imposes a small communication burden. In the future, it would be promising for subsequent studies to adopt state-of-the-art federated and transfer learning algorithms to achieve better forecasting performance within the DFA framework.

**Author Contributions:** Conceptualization, Y.S. and X.X.; methodology, Y.S.; software, Y.S.; validation, Y.S.; formal analysis, Y.S.; investigation, Y.S.; resources, Y.S.; data curation, Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, X.X.; visualization, Y.S.; supervision, X.X.; project administration, Y.S.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant number 51975422.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

