## **1. Introduction**

The International Energy Agency has identified energy efficiency in buildings as one of the five measures to secure long-term decarbonization of the energy sector [1]. In addition to environmental benefits, improving building energy efficiency also brings vast economic benefits: buildings with efficient energy systems and management strategies have much lower operating costs [2]. Human activities in residences account for a large portion of energy consumption and CO2 emissions [3]. Residential load forecasting can help the power sector balance the generation and consumption of electricity, which improves energy efficiency through the management and conservation of energy.

Several uncertain factors, such as historical load records, weather conditions, population mobility, social factors and emergencies, influence electricity usage. Due to the high volatility and uncertainty involved, short-term load forecasting (STLF) for a single residential unit can be more challenging than for an industrial building [4]. Data-driven, machine-learning-based methods are increasingly applied to mitigate these challenges. However, the scope of machine-learning-based applications is hindered by the privacy and security concerns raised by a growing number of supervision departments and users. In some countries, many users even refuse the installation of smart meters because they are reluctant to disclose their private data. In addition, newly built houses cannot provide sufficient data to build effective models. In summary, the data exist in the form of isolated islands, which makes it difficult to merge the data from different users to train a robust model. Hence, one of the problems we focus on in this paper is data availability and privacy.


A number of studies have achieved good results on STLF using, for example, support vector regression (SVR) [5], artificial neural networks [6] and boosted trees [7]. Additionally, some hybrid methods that combine artificial intelligence methods with traditional methods have been proposed to achieve better forecasting performance, such as hybridizing the extended Kalman filter and ELM [8]. Fan et al. [9] proposed an SVR model hybridized with differential empirical mode decomposition (DEMD) and auto-regression (AR) for electric load forecasting. The Transformer is a novel time-series prediction model based on the encoder–decoder structure. Building on this structure, many methods have yielded good results in the field of energy forecasting, such as STA-AED [10] and Informer [11]. However, these approaches consider neither user privacy nor modeling with limited data.

Many privacy-preserving solutions relying on data aggregation and obfuscation have been proposed to ensure privacy [12]. However, these solutions are not well suited to residential short-term energy forecasting, since they often introduce extra procedures to obfuscate and reconstruct the data [13]. In addition, because machine-learning solutions are computationally intensive during model training, most works consider only centralized training: clients' data are collected on a central server where the model is trained, which places a heavy burden on communication, especially when the model needs to be constantly updated with new data from millions of distributed meters. Under these circumstances, federated learning has been proposed to overcome these challenges. Federated learning is a distributed machine learning approach in which a shared global model is trained, under the coordination of a central entity, by a federation of participating devices [14]. The peculiarity of the approach is that each device trains a local model and the data never leave the local machine; only the model parameters are sent to the central computing server to update the shared global model. Hence, the federated architecture can protect privacy effectively. Federated learning has been shown to be effective for load forecasting: federated learning with clustered aggregation, proposed in [15], performs well for individual load forecasting, and federated learning applied to heating-load demand prediction of buildings is also highly capable of producing acceptable forecasts while preserving data privacy and eliminating the model's dependence on the training data [16]. Furthermore, federated learning has been applied successfully in several domains where privacy and scalability are essential, such as human–computer interaction [17], natural language processing [18], healthcare classification [19], and transportation [20,21].

Another critical problem for residential load forecasting is that a general model is not adapted to each house, since the datasets are non-IID, which neither the federated architecture nor conventional machine learning algorithms handle well [22]. The problem is particularly acute for newly built houses. Since dataset bias and imbalance are inevitable [23], many researchers classify users according to different attributes to address this challenge, but such classification does not fit well into a federated learning architecture [24]. This situation is particularly suitable for transfer learning, which aims to establish knowledge transfer that bridges domains with substantial distribution discrepancies. In other words, data from different houses exhibit domain discrepancies, which are a major obstacle to adapting the predictive model across users. STLF models based on transfer learning are discussed in [4,25,26].

A representative transfer learning method is domain adaptation, which can leverage the data in an information-rich source domain to enhance the performance of the model in a data-limited target domain. As a well-known tool for domain adaptation, the deep neural network [27] is capable of discovering the factors of variation underlying the houses' historical data and of grouping features hierarchically according to their relatedness to invariant factors, and it has been studied extensively. A large body of research has shown that deep neural networks can learn transferable features for domain adaptation [28]: deep features eventually transition from general to specific along the network, and the transferability of features decreases significantly in higher layers as domain discrepancies increase. In other words, the features common to different users are captured in the lower layers, while user-specific features hide in the higher layers, which depend greatly on the target datasets and are not safely transferable to another user.

In this article, we address the aforementioned challenges with a novel user-adaptive load forecasting approach that combines federated learning and transfer learning. The federated learning architecture in this approach builds a CNN-LSTM-based general model, which does not compromise privacy and works well with limited data. Then MK-MMD, a distance that measures domain discrepancy, is used to quantify the discrepancies between houses and to optimize the general network, which reduces the domain discrepancies and the forecasting error effectively. The contributions of this paper are summarized as follows:


## **2. Technical Background**

#### *2.1. Federated Learning Concepts*

Due to security and privacy concerns, data exist in the form of isolated islands, making it difficult for data-driven models to leverage big data. One possible approach is federated learning, which can train a machine learning model in a distributed way.

Let the matrix $\mathcal{D}_i$ denote the data held by partner $i$; each row of the matrix represents one sample and each column a feature. Since the feature and sample spaces of the data parties may not be identical, federated learning can be classified into three categories: horizontal federated learning, vertical federated learning and federated transfer learning.

Horizontal federated learning applies when different partners have the same or overlapping feature spaces but different sample spaces. This resembles dividing data horizontally in a tabular view, so horizontal federated learning is also known as sample-partitioned federated learning. It can be summarized as Formula (1):

$$\mathcal{X}_i = \mathcal{X}_j,\; \mathcal{Y}_i = \mathcal{Y}_j,\; \mathcal{I}_i \neq \mathcal{I}_j,\quad \forall \mathcal{D}_i, \mathcal{D}_j,\; i \neq j \tag{1}$$

where $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{I}$ denote the feature space, the label space and the sample ID space, respectively.

In contrast to horizontal federated learning, partners in vertical federated learning share the same sample space but have different feature spaces. Vertical federated learning can be summarized as shown in Formula (2):

$$\mathcal{X}_i \neq \mathcal{X}_j,\; \mathcal{Y}_i \neq \mathcal{Y}_j,\; \mathcal{I}_i = \mathcal{I}_j,\quad \forall \mathcal{D}_i, \mathcal{D}_j,\; i \neq j \tag{2}$$

Federated transfer learning applies when the datasets differ not only in sample space but also in feature space. For example, a common representation is learned across the different feature spaces and later used to make predictions for samples with features from only one side. Federated transfer learning is summarized as shown in Formula (3):

$$\mathcal{X}_i \neq \mathcal{X}_j,\; \mathcal{Y}_i \neq \mathcal{Y}_j,\; \mathcal{I}_i \neq \mathcal{I}_j,\quad \forall \mathcal{D}_i, \mathcal{D}_j,\; i \neq j \tag{3}$$

In this paper, the federated learning framework is a horizontal federated learning architecture, since the data collected by the devices lie in the same feature space. It uses a master–slave architecture, as shown in Figure 1. In this system, N participant devices collaborate to train a machine learning model with the help of the master server.

**Figure 1.** A horizontal federated learning architecture.

In step 1, each participant computes the model gradient locally, masks the gradient information using cryptographic techniques such as homomorphic encryption, and sends the result to the master server. In step 2, the master server performs a secure aggregation operation. In step 3, the server distributes the aggregated results to each participant. In step 4, each participant decrypts the received gradients and updates its model parameters accordingly. These steps iterate until the loss function converges or the maximum number of iterations is reached. Since the participants' data never move during training, federated learning offers a privacy guarantee that distributed machine learning models trained on platforms such as Hadoop do not provide. Moreover, an arbitrary number of devices can contribute to model training without transferring the collected data to a centralized location, and the federated model can accommodate growing data volumes with modest communication bandwidth, since only local gradients need to be sent.
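For concreteness, the sketch below implements one such round in PyTorch under simplifying assumptions: the cryptographic masking of steps 1, 2 and 4 is abstracted away (gradients are exchanged in the clear), and the function names are illustrative rather than taken from the paper.

```python
import torch

def local_gradient(model, loss_fn, x, y):
    """Step 1: a participant computes gradients on its own data only."""
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

def aggregate(grad_lists):
    """Step 2: the master server averages the participants' gradients
    (a plaintext stand-in for the secure aggregation described above)."""
    return [torch.stack(gs).mean(dim=0) for gs in zip(*grad_lists)]

def apply_update(model, avg_grads, lr=0.01):
    """Steps 3-4: each participant receives the aggregate and updates
    its local copy of the shared model."""
    with torch.no_grad():
        for p, g in zip(model.parameters(), avg_grads):
            p -= lr * g
```

In a real deployment, `aggregate` would operate on encrypted gradients so that the master server never sees any participant's plaintext update.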

#### *2.2. Transfer Learning Concepts and MK-MMD*

Firstly, it is often hard to collect sufficient data from the domains of interest, referred to as target domains, while abundant data may be available in some related domains, called source domains. Secondly, machine learning algorithms work well under a fundamental assumption: the training data and future data must lie in the same feature space and follow the same distribution. However, this assumption often does not hold in real-world applications. For these reasons, transfer learning is introduced: it leverages similarities between data, tasks, or models to transfer knowledge from the source domain to the target domain. These similarities serve as a representation of the distance between domains, so the key issue is to introduce a standard distribution distance metric and to minimize that distance.

MK-MMD is one such distance metric. The distance is computed with respect to a particular representation $\phi(\cdot)$, a feature map that sends the original data into a reproducing kernel Hilbert space (RKHS) endowed with a characteristic kernel $k$. The RKHS may be infinite-dimensional, which can render otherwise non-separable data linearly separable. The distance between the source domain with distribution $p$ and the target domain with distribution $q$ is denoted $d_k(p, q)$, and $p = q$ iff $d_k^2(p, q) = 0$. The squared MK-MMD distance [29] is given in Formula (4):

$$d_k^2(p,q) \triangleq \left\| \mathbb{E}_p[\phi(\mathbf{x}_S)] - \mathbb{E}_q[\phi(\mathbf{x}_T)] \right\|_{\mathcal{H}_k}^2 \tag{4}$$

where $\mathcal{H}_k$ denotes the RKHS endowed with the characteristic kernel $k$.

The kernel trick, shown in Formula (5), can be used to compute Formula (4): it converts the inner product of the feature maps $\phi(\cdot)$ into an evaluation of the kernel function $k(\cdot, \cdot)$ instead.

$$k(\mathbf{x}\_S, \mathbf{x}\_T) = \langle \boldsymbol{\phi}(\mathbf{x}\_S), \boldsymbol{\phi}(\mathbf{x}\_T) \rangle \tag{5}$$

As mean embedding matching is sensitive to the choice of kernel, MK-MMD uses a multi-kernel $k$ to provide better learning capability and to relieve the burden of designing a specific kernel for diverse multivariate data. The multi-kernel offers more flexibility in capturing different data characteristics and leads to a principled method for optimal kernel selection.

The multi-kernel $k$ is drawn from the family $\mathcal{K}$, defined as the set of convex combinations of the kernels $\{k_u\}$, as in Formula (6) [28]:

$$\mathcal{K} \triangleq \left\{ k = \sum_{u=1}^{m} \beta_u k_u : \sum_{u=1}^{m} \beta_u = 1,\ \beta_u \ge 0,\ \forall u \right\} \tag{6}$$

where the constraints on the coefficients $\{\beta_u\}$ ensure that the derived multi-kernel $k$ is characteristic.
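As an illustration, the sketch below estimates the squared MK-MMD of Formula (4) between two batches of features using a convex combination of Gaussian kernels with equal weights $\beta_u = 1/m$; the bandwidth values are illustrative defaults, not choices made in the paper.

```python
import torch

def multi_kernel(a, b, bandwidths=(1.0, 2.0, 4.0)):
    """Convex combination of Gaussian kernels, a simple instance of
    Formula (6) with equal weights beta_u = 1/m."""
    sq_dists = torch.cdist(a, b).pow(2)  # pairwise ||a_i - b_j||^2
    m = len(bandwidths)
    return sum(torch.exp(-sq_dists / (2.0 * s ** 2)) for s in bandwidths) / m

def mk_mmd2(xs, xt):
    """Biased empirical estimate of the squared distance in Formula (4),
    computed from source (xs) and target (xt) feature batches."""
    return (multi_kernel(xs, xs).mean()
            + multi_kernel(xt, xt).mean()
            - 2.0 * multi_kernel(xs, xt).mean())
```

In practice the kernel weights $\beta_u$ can themselves be optimized, which is what distinguishes MK-MMD from a fixed-kernel MMD.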

## **3. The Proposed Method**

#### *3.1. The Overview of the Proposed Approach*

The overview of the proposed approach is shown in Figure 2. Without loss of generality, we consider 3 households whose loads need to be predicted; the number can be extended with little extra work. Each household has a device for computing models and communicating with the master server. The approach consists of the following 6 steps (a high-level sketch in code follows the list):

Step 1: The master server constructs the initial global model with public datasets.

Step 2: The master server distributes the global model to all users.

Step 3: The master server selects a fraction of users, then the selected devices train models with their local data.

Step 4: The selected devices upload models to the master server.

Step 5: The master server updates the global model by aggregating the uploaded models. Steps 2 to 5 are repeated until the global model converges.

Step 6: Each device fine-tunes the convergent global model through user adaptation with its local data.
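Putting the steps together, the loop below sketches the overall procedure; `local_train`, `fed_avg` and `fine_tune_with_mkmmd` are placeholder names for the components detailed in the following subsections, not APIs defined in the paper.

```python
import copy
import random

def deep_federated_adaptation(global_model, devices, rounds, frac=0.5):
    # Steps 2-5: distribute the global model, train locally on a
    # selected fraction of devices, upload and aggregate, and repeat.
    for _ in range(rounds):
        selected = random.sample(devices, max(1, int(frac * len(devices))))
        local_models = [local_train(copy.deepcopy(global_model), d.data)
                        for d in selected]                    # Step 3
        global_model.load_state_dict(fed_avg(local_models))   # Steps 4-5
    # Step 6: each device adapts the convergent model to its own data.
    for d in devices:
        d.model = fine_tune_with_mkmmd(copy.deepcopy(global_model), d.data)
    return global_model
```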

#### *3.2. Federated Learning Process*

Deep neural networks are selected in the federated learning process since they are updated by gradient descent, on which the federated aggregation relies. The federated learning process produces a pre-trained model for the subsequent user adaptation. Firstly, the model is initialized on the master server with public datasets. The initial global model is denoted as $f_G$, and the learning objective function is defined as shown in Formula (7):

$$\arg\min_{\Theta_G} \mathcal{L} = \sum_{i=1}^{n} \ell(y_i, f_G(x_i)) \tag{7}$$

where $\ell(\cdot)$ denotes the loss of the neural network; the loss used in this paper is the mean squared error (MSE), since load forecasting is a regression problem. $\{x_i, y_i\}_{i=1}^{n}$ are the samples from the datasets, and $\Theta_G$ are the parameters to be learned.
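In code, the objective of Formula (7) with an MSE loss reduces to a few lines of PyTorch; `f_G`, `x` and `y` below are placeholders for the global model and a batch of samples.

```python
import torch.nn as nn

# Sum reduction matches the summation over samples in Formula (7);
# a 'mean' reduction would only rescale the objective.
loss_fn = nn.MSELoss(reduction="sum")

def global_objective(f_G, x, y):
    """L(Theta_G) = sum_i l(y_i, f_G(x_i)), minimized over Theta_G."""
    return loss_fn(f_G(x), y)
```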

**Figure 2.** Overview of the deep federated adaptation. The top box is the master server, while the 3 bottom boxes denote 3 houses. Each house contains one computing device connected to the master server for processing the data. The data collected by the smart meter are locked and cannot be transmitted to the master server.

After the initial global model is trained, the master server distributes it to all remote devices. Then, a subset of remote devices is chosen to train user models $f_u$ with local data. Let $\{x_i^u, y_i^u\}_{i=1}^{n_u}$ denote the samples from the dataset of user $u$. Technically, the learning objective function for each user is given in Formula (8):

$$\arg\min_{\Theta_u} \mathcal{L} = \sum_{i=1}^{n_u} \ell\left(y_i^u, f_u(x_i^u)\right) \tag{8}$$

Then, all the user models are uploaded to the master server for averaging based on the FedAVG algorithm [30]; the averaging is formulated in Formula (9):

$$f_G'(w) = \frac{1}{K} \sum_{k=1}^{K} f_{u_k}(w) \tag{9}$$

where $w$ are the parameters of the network and $K$ is the number of devices in the chosen subset. Then, let $f_G = f_G'$ on the master server; after adequate rounds of iterations, the updated server model $f_G$ achieves better generalization ability. When the devices of newly built houses connect to the federated system, the master server can distribute the global model to them so that the new devices take part in the next iteration; hence, federated learning can handle cold-start problems and is extensible. It is worth noting that the network is trained by federated learning on data from different houses, which enlarges the training data and makes the model more robust, with better generalization ability.
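A minimal sketch of this averaging step, assuming the uploaded models are PyTorch modules and using the equal weights $1/K$ of Formula (9) (the original FedAVG weights each model by its local sample count instead):

```python
import torch

def fed_avg(user_models):
    """Average the K uploaded user models parameter-wise with equal
    weights 1/K, matching Formula (9)."""
    states = [m.state_dict() for m in user_models]
    return {name: torch.stack([s[name].float() for s in states]).mean(dim=0)
            for name in states[0]}

# Usage on the master server, updating the global model f_G:
#   f_G.load_state_dict(fed_avg(uploaded_models))
```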
