**2. Modeling Methodology**

This section defines the synthetic data generator (SDG): a parametric model, together with its inputs, that can be used to generate synthetic samples of EV session data. We assume that each session can be described by three parameters: (i) arrival time (*ta*), (ii) connection time (*tc*) and (iii) required energy (*E*). The departure time then follows as *td* = *ta* + *tc*. *E* represents the charging load that an EV has requested (based on the measured charging power throughout the full session). Session parameters for date *d* can be generated using Equations (1)–(3).

$$t\_a = AM(d) \tag{1}$$

$$t\_c = MM\_c(t\_a, d) \tag{2}$$

$$E = MM\_e(t\_a, d) \tag{3}$$

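In what follows, we define (i) the arrival model (*AM*), (ii) the mixture model for connection times (*MMc*) and (iii) the mixture model for required energy (*MMe*). Trained SDG models can be used to generate a sample of data. Data generation is a two-step process: following Equations (1)–(3), arrival times are first generated with *AM*, and the connection time and required energy of each arrival are then sampled from *MMc* and *MMe*.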
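Equations (1)–(3) can be read as a small sampling pipeline. The sketch below is only an illustration of that reading, not the implementation used in this work: the `Session` container, the function `generate_sessions`, the toy stand-in models and the units (hours since midnight, kWh) are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Session:
    date: str
    t_a: float     # arrival time (assumed unit: hours since midnight)
    t_c: float     # connection time (assumed unit: hours)
    energy: float  # required energy E (assumed unit: kWh)

    @property
    def t_d(self) -> float:
        # Departure time, t_d = t_a + t_c
        return self.t_a + self.t_c


def generate_sessions(
    dates: List[str],
    arrival_model: Callable[[str], List[float]],
    mm_c: Callable[[float, str], float],
    mm_e: Callable[[float, str], float],
) -> List[Session]:
    """Step 1: arrival times from AM; step 2: t_c and E from MMc and MMe."""
    sessions = []
    for d in dates:
        for t_a in arrival_model(d):   # Equation (1)
            t_c = mm_c(t_a, d)         # Equation (2)
            energy = mm_e(t_a, d)      # Equation (3)
            sessions.append(Session(d, t_a, t_c, energy))
    return sessions


# Hypothetical deterministic stand-ins, used only to show the data flow.
def toy_am(d: str) -> List[float]:
    return [8.0, 8.5, 17.25]


def toy_mm_c(t_a: float, d: str) -> float:
    return 9.0 if t_a < 12 else 2.0


def toy_mm_e(t_a: float, d: str) -> float:
    return 20.0 if t_a < 12 else 7.5


sessions = generate_sessions(["2020-01-06"], toy_am, toy_mm_c, toy_mm_e)
```

In a trained SDG the three callables would be replaced by the stochastic models defined in Sections 2.1 and 2.2.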


*AM*, *MMc* and *MMe* are trained for a set of dates (**S**). Dates present in **S** are assumed to have similar daily properties (e.g., arrival profiles), and we define **S** by choosing a grouping criterion for days. For example, if we assume that arrival profiles are similar within each month, the grouping criterion is the month *m*, and for each month *m* all dates of that month form the set **S**. Details about defining **S** in practice, in particular for a real-world dataset, are included in Section 4.

## *2.1. Arrival Models*

Arrivals of EVs at a group of charging stations (poles) can be considered as events over time. For a large number of poles, we can assume that the inter-arrival times (IATs, Δ*t*) of EVs follow an exponential distribution (which we validate in Section 4.2). Based on this assumption, one method to model arrival times of EVs is to model the time between arrivals (Δ*t*). A second method is to model the total number of EV arrivals in a time interval. Both methods are defined below.

#### 2.1.1. Inter-Arrival Time Models

To model inter-arrival times (Δ*t*) we use the exponential distribution, which is characterized by a rate parameter *λ* (rate of EV arrivals). Inter-arrival time (IAT) models are defined as follows:

$$t\_i = t\_{i-1} + \Delta t \tag{4}$$

$$PDF(\Delta t) = \lambda\_{i-1} e^{-\lambda\_{i-1} \Delta t} \tag{5}$$

$$
\lambda = f\_{\mathbf{S}}(t) \tag{6}
$$

where the *i*th EV arrives at time *ti*, *PDF* represents the probability density function and *t* is the time of day. The rate parameter *λ* depends on time, and *f***S** defines the profile of *λ* with respect to *t* for the type of days present in **S**. We can use different methods to fit *f***S**. The **mean model** is based on average values of *λ* for a given timeslot *ts*. This results in a discontinuous mapping between *λ* and *t*, with a sudden change in *λ* at the boundaries of each timeslot *ts*. To obtain a continuous *λ* throughout the day, we use regression-based methods: either a **polynomial model** using polynomial regression, or a **localized regression model**. Training these models is explained in detail in Section 4.1.1.

In Algorithm 1, we outline the pseudocode to generate arrivals over a given horizon. We use the date (*d*) to retrieve the appropriate *f***S**, and predict *λ*. The IAT between the current and the new arrival is generated as a random sample from the exponential distribution with rate *λ*. Arrivals are generated throughout the horizon for each date.
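As an illustration of the mean model, the following sketch estimates a piecewise-constant *f***S** from historical arrival times. It is a minimal example under stated assumptions, not the training procedure of Section 4.1.1: we assume here that *λ* for a timeslot is estimated as the average number of arrivals per hour observed in that slot over the dates in **S**, and the function name `fit_mean_rate_model` and the toy `history` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List


def fit_mean_rate_model(
    arrivals_by_date: Dict[str, List[float]],
    slot_hours: float = 1.0,
) -> Callable[[float], float]:
    """Return f_S(t): average arrivals per hour in the timeslot containing t.

    arrivals_by_date maps each date in S to its arrival times
    (assumed unit: hours since midnight).
    """
    if not arrivals_by_date:
        raise ValueError("S must contain at least one date")
    n_slots = int(round(24 / slot_hours))
    counts = Counter()
    for times in arrivals_by_date.values():
        for t in times:
            counts[min(int(t // slot_hours), n_slots - 1)] += 1
    n_days = len(arrivals_by_date)
    # Assumed estimator: arrivals per hour, averaged over the days in S.
    rates = [counts[s] / (n_days * slot_hours) for s in range(n_slots)]

    def f_s(t: float) -> float:
        return rates[min(int(t // slot_hours), n_slots - 1)]

    return f_s


# Hypothetical two-day history for one set S.
history = {
    "2020-01-06": [8.1, 8.4, 8.9, 17.2],
    "2020-01-07": [8.3, 17.5, 17.8],
}
f_s = fit_mean_rate_model(history)
```

With this toy history, `f_s(8.5)` and `f_s(9.0)` differ sharply, which is the discontinuity at timeslot boundaries that motivates the regression-based alternatives.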

**Algorithm 1:** Inter-arrival time (IAT) model.

**Input:** *H* (horizon, initial to final date)

**Output:** *T* (list of EV arrival times in *H*)

- **for** *d* ∈ *H* **do**
  - *f***S** = get arrival rate model for *d*;
  - *t* = 0;
  - **while** *t* < 24 **do**
    - *λ* = *f***S**(*t*);
    - Δ*t* = sample from exponential distribution with rate *λ*;
    - *t* = *t* + Δ*t*;
    - append *t* to list *T*;
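Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the reference implementation. Two details are our own assumptions and are not part of the pseudocode: a sampled arrival that falls past the end of the day is discarded, and when the predicted rate is not positive the clock is advanced by a fixed step (`idle_step_hours`, a hypothetical parameter) instead of sampling. The names `generate_arrivals_iat` and `toy_rate` are also illustrative.

```python
import random
from typing import Callable, List, Tuple

# Maps time of day (hours) to an arrival rate (arrivals per hour).
RateModel = Callable[[float], float]


def generate_arrivals_iat(
    horizon: List[str],
    get_rate_model: Callable[[str], RateModel],
    rng: random.Random,
    idle_step_hours: float = 0.25,
) -> List[Tuple[str, float]]:
    """Generate (date, arrival time) pairs over the horizon."""
    arrivals = []
    for d in horizon:
        f_s = get_rate_model(d)
        t = 0.0
        while t < 24.0:
            lam = f_s(t)
            if lam <= 0.0:
                # Assumption: no arrivals can be sampled at zero rate,
                # so move the clock forward and re-evaluate the rate.
                t += idle_step_hours
                continue
            t += rng.expovariate(lam)  # exponential IAT with rate lam
            if t < 24.0:
                # Assumption: arrivals past midnight are dropped.
                arrivals.append((d, t))
    return arrivals


# Hypothetical rate profile: a morning peak on top of a low base rate.
def toy_rate(t: float) -> float:
    return 4.0 if 7.0 <= t < 10.0 else 0.5


rng = random.Random(0)
arrivals = generate_arrivals_iat(
    ["2020-01-06", "2020-01-07"], lambda d: toy_rate, rng
)
```

Note that, as in Algorithm 1, the rate used for each sample is the one predicted at the time of the previous arrival, so a long IAT drawn during a low-rate period can step over the start of a peak.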

#### 2.1.2. Arrival Count Models

Instead of generating the next EV arrival, here we generate the number of arrivals in a given timeslot *ts* (e.g., slots of 60 min). The number of arrivals *N* in *ts* is generated as a random sample from a discrete probability distribution, Equation (7). This distribution is characterized by parameters **P**, and Equation (6) is modified to Equation (8), which models these parameters. We distribute the *N* arrivals uniformly over the duration of timeslot *ts*. Arrival count (AC) models are defined as follows:

$$PDF(N) = f(\mathbf{P})\tag{7}$$

$$\mathbf{P} = f\_{\mathbf{S}}(t\_s) \tag{8}$$

We model the parameters **P** of the discrete distribution for each *ts* using the function *f***S**. Our underlying assumption that the IATs of EVs follow an exponential distribution amounts to assuming a Poisson distribution for the number of arrivals *N* in such a timeslot. Yet, for the Poisson distribution, the variance is equal to the mean, while the observed number of arrivals may have a larger variance. In such cases we need to consider other discrete probability distributions that describe count data [20]: we propose using the negative binomial model. In summary, we have two options to model the arrival counts (AC): the **Poisson model** and the **negative binomial model**.


Pseudocode for generation of arrivals of EVs using the Poisson model is given in Algorithm 2 (adaptation to the negative binomial model for sampling *N* is straightforward).

**Algorithm 2:** Arrival count (AC) model.

**Input:** *H* (horizon, initial to final date)

**Output:** *T* (list of EV arrival times in *H*)

- **for** *d* ∈ *H* **do**
  - *f***S** = get arrival rate model for *d*;
  - **for** *ts* = 1, 2, . . . 24 **do**
    - *λ* = *f***S**(*ts*);
    - *N* = sample from Poisson distribution with rate *λ*;
    - *A* = evenly space *N* points in *ts*;
    - append all *t* ∈ *A* to list *T*;
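A possible Python rendering of Algorithm 2, covering both the Poisson and the negative binomial variant, is sketched below using only the standard library. Several choices are assumptions made for this illustration: the Poisson sample is obtained by counting unit-rate exponential gaps, the negative binomial sample is drawn as a gamma-Poisson mixture parameterized by its mean and a hypothetical `dispersion` parameter *r* (variance = mean + mean²/*r*), timeslots are indexed from 0, and "evenly spaced" is interpreted as the midpoints of *N* equal sub-intervals of the timeslot.

```python
import random
from typing import Callable, List, Optional, Tuple


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Count unit-rate exponential gaps that fit into [0, lam)."""
    count = 0
    total = rng.expovariate(1.0)
    while total < lam:
        count += 1
        total += rng.expovariate(1.0)
    return count


def sample_negative_binomial(mean: float, dispersion: float,
                             rng: random.Random) -> int:
    """Gamma-Poisson mixture: variance = mean + mean**2 / dispersion."""
    if mean <= 0.0:
        return 0
    lam = rng.gammavariate(dispersion, mean / dispersion)  # shape, scale
    return sample_poisson(lam, rng)


def evenly_spaced(slot_start: float, slot_hours: float, n: int) -> List[float]:
    """Assumed placement: midpoints of n equal sub-intervals of the slot."""
    return [slot_start + slot_hours * (k + 0.5) / n for k in range(n)]


def generate_arrivals_ac(
    horizon: List[str],
    get_count_model: Callable[[str], Callable[[int], float]],
    rng: random.Random,
    dispersion: Optional[float] = None,  # None selects the Poisson model
    slot_hours: float = 1.0,
) -> List[Tuple[str, float]]:
    arrivals = []
    n_slots = int(round(24 / slot_hours))
    for d in horizon:
        f_s = get_count_model(d)
        for slot in range(n_slots):       # slot 0 covers [0, slot_hours)
            mean = f_s(slot)              # expected arrivals in this slot
            if dispersion is None:
                n = sample_poisson(mean, rng)
            else:
                n = sample_negative_binomial(mean, dispersion, rng)
            for t in evenly_spaced(slot * slot_hours, slot_hours, n):
                arrivals.append((d, t))
    return arrivals


rng = random.Random(0)
poisson_arrivals = generate_arrivals_ac(["2020-01-06"], lambda d: (lambda s: 2.0), rng)
nb_arrivals = generate_arrivals_ac(["2020-01-06"], lambda d: (lambda s: 2.0), rng,
                                   dispersion=1.0)
```

Smaller values of `dispersion` give more variable counts for the same mean, which is the overdispersion that the Poisson model cannot represent.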

## *2.2. Mixture Models (MMc, MMe)*

The connection time of each plugged-in EV depends on the time at which the EV arrived, i.e., its arrival time. We can model the probability distribution *PDF*<sub>*ta*</sub>(*tc*) using Gaussian mixture models (GMMs), where *tc* can be generated as a random sample from the probability distribution, Equation (9), once we know the value of *ta*. We can group the dates of a month (or day type) into the same type of day, for which we use the same model. These dates then form a set **S** (set of dates). Similarly to the connection times, GMMs can be fitted for the required energy (charging load), Equation (10).

$$MM\_c \colon PDF\_{t\_a, \mathbf{S}}(t\_c) \tag{9}$$

$$MM\_e \colon PDF\_{t\_a, \mathbf{S}}(E) \tag{10}$$
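To make Equations (9) and (10) concrete, the sketch below samples from a GMM selected by the arrival time. It is an illustration under assumptions, not the fitted models of this work: we assume here that the dependence on *ta* is handled by keeping one GMM per arrival timeslot, the component parameters are made up rather than fitted, and non-positive samples are redrawn. The names `sample_gmm`, `make_mixture_model` and `toy_gmms` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)


def sample_gmm(components: List[Component], rng: random.Random) -> float:
    """Pick a component by weight, then draw from its Gaussian."""
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)


def make_mixture_model(
    gmm_by_slot: Dict[int, List[Component]],
    rng: random.Random,
    slot_hours: float = 1.0,
) -> Callable[[float, Optional[str]], float]:
    """Build MM(t_a, d) for one set S (assumed: one GMM per arrival slot)."""

    def mm(t_a: float, d: Optional[str] = None) -> float:
        components = gmm_by_slot[int(t_a // slot_hours)]
        value = sample_gmm(components, rng)
        while value <= 0.0:
            # Assumption: durations and energies must be positive, so redraw.
            value = sample_gmm(components, rng)
        return value

    return mm


# Made-up parameters: morning arrivals mostly stay long, evening ones short.
toy_gmms = {
    8: [(0.7, 9.0, 1.0), (0.3, 2.0, 0.5)],
    17: [(1.0, 2.5, 0.8)],
}
mm_c = make_mixture_model(toy_gmms, random.Random(0))
t_c = mm_c(8.5, "2020-01-06")
```

In practice the component weights, means and standard deviations would be fitted to the training sessions, for example with an off-the-shelf GMM implementation such as scikit-learn's `GaussianMixture`; the same structure applies to *MMe* with energy values in place of connection times.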

The steps for data generation using SDG are summarized in Figure 1b: a trained SDG model and a horizon are used as inputs. Figure 1a shows the methodology to train the models from a raw dataset. In Section 3 we describe the data cleaning and preprocessing, and the session clustering steps. The details of training and evaluation follow in Section 4.

**Figure 1.** Modeling methodology for (**a**) training SDG models, and (**b**) generating synthetic samples.

In this section we defined the SDG and outlined its inputs, by defining *AM* for EV arrivals, and *MMc* and *MMe* for connection times and required energy. Inputs are simply the dates *d* (and arrival times *ta* in the case of *MMc* and *MMe*). We also summarized the parameters of the SDG by characterizing the models using the parameters of the underlying probability distributions.
