#### *2.1. Dataset*

The dataset consists of historical time series of natural gas consumption for multiple cities across Greece, together with the average daily temperature of each city's surrounding area. The data span from 1 March 2010, or later for some cities, to 31 October 2018. For some larger, centrally located cities such as Athens and Larissa, where the natural gas distribution system was installed early, data are available from 1 March 2010, as seen in Figure 1. For other large cities, such as Thessaloniki (also shown in Figure 1), and for smaller ones, such as Alexandroupoli (Figure 2), data collection started later. The exact starting date of data collection for each city is given in Section 3.

**Figure 1.** Natural gas energy consumption [MWh] for Athens, Thessaloniki, and Larissa over the years.

**Figure 2.** Natural gas energy consumption [MWh] for Alexandroupoli, Drama, Karditsa, and Trikala over the years.

As provided, the dataset contained the date and time, the natural gas consumption at each city's distribution point, and the daily average temperature of the area in degrees Celsius. Social indicators were then added to the existing data: a month indicator, a day indicator, a weekday/weekend indicator, and a bank holiday indicator. The proper addition of these social variables is key to the study, since the aim is to determine whether qualitative social traits can improve the performance of a forecasting model, and by how much compared to other methods.
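As an illustration of how such calendar-based indicators could be derived from a date, a minimal sketch is given below. The function name `social_indicators`, the field names, the 0/1 coding of the weekday/weekend flag, and the two-entry holiday set are assumptions made for this example only; the study does not list the holiday calendar or the exact coding it used.

```python
from datetime import date

# Illustrative holiday set only; the actual holiday calendar used in the
# study is not given in the text.
EXAMPLE_HOLIDAYS = {date(2018, 1, 1), date(2018, 3, 25)}


def social_indicators(d, holidays=EXAMPLE_HOLIDAYS):
    """Return the calendar-based indicators for a single date."""
    weekday = d.isoweekday()  # Monday = 1 ... Sunday = 7
    return {
        "month": d.month,                       # January = 1 ... December = 12
        "day": weekday,
        "is_weekend": int(weekday >= 6),        # assumed coding: 1 = Sat/Sun
        "is_bank_holiday": int(d in holidays),  # binary flag
    }


holiday_row = social_indicators(date(2018, 3, 25))
regular_row = social_indicators(date(2018, 10, 31))
```

Keeping the holiday set as a parameter makes it straightforward to substitute a complete calendar of public and religious holidays.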

#### *2.2. Feature Engineering*

A certain amount of feature engineering is required to bring the qualitative data into a form that the machine learning algorithms can read. This takes place during the preprocessing phase and is conducted as follows. Months and days are described by a name, e.g., September, Tuesday, etc., and need to be transformed into numerical categorical values, e.g., 9, 2, etc., in serial order. Therefore, the following association is considered: January-1, February-2, etc., and Monday-1, Tuesday-2, etc. Each of these values is then transformed into a binary indicator vector whose size equals the value range of the variable (one-hot encoding). In detail, the "month" variable takes 12 different values, one for each month, so the size of its vector is 1 × 12. Likewise, the "day" variable takes 7 values, one for each day, so the size of its vector is 1 × 7. Consequently, the "month" variable is transformed into 12 variables, one for each month, and the "day" variable is transformed into 7 variables, one for each day of the week. The "bank holiday" variable denotes a public or religious holiday that affects social behavior (businesses are closed, people are out celebrating, etc.) and is binary, so it needs no further transformation.
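The transformation of a serial value into a 1 × 12 or 1 × 7 vector can be sketched as follows. This is a minimal illustration of the encoding described above, not the authors' code; the helper name `one_hot` is hypothetical.

```python
def one_hot(value, size):
    """Encode an integer in 1..size as a 1 x size indicator vector."""
    if not 1 <= value <= size:
        raise ValueError(f"value {value} is outside the range 1..{size}")
    vec = [0] * size
    vec[value - 1] = 1  # position 0 corresponds to January / Monday
    return vec


month_vec = one_hot(9, 12)  # September -> 1 x 12 vector
day_vec = one_hot(2, 7)     # Tuesday   -> 1 x 7 vector
```

Each vector contains exactly one non-zero entry, so the models do not read an artificial ordering or distance into the month and day codes.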

Time and date data are combined into a single datetime variable, which is then used as an index. This leads to a total of 22 variables that are taken as inputs for modeling the energy forecast in the proposed methodology. The target variable of the forecast is the natural gas energy consumption at the specific distribution point, which is used as output. The correlation of energy consumption with the mean daily temperature for the city of Athens is shown in Figure 3.
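One way to arrive at 22 inputs is sketched below: 12 month indicators, 7 day indicators, the weekday/weekend flag, the bank holiday flag, and the mean temperature. This breakdown is an assumption that happens to match the stated total; the text does not enumerate the inputs explicitly. The function name `build_inputs` and the sample temperature are likewise hypothetical.

```python
from datetime import date, datetime, time

N_MONTHS = 12
N_WEEKDAYS = 7


def build_inputs(stamp, temperature_c, is_bank_holiday):
    """Assemble one input row (assumed layout: 12 + 7 + 1 + 1 + 1 values)."""
    month_vec = [int(stamp.month == m) for m in range(1, N_MONTHS + 1)]
    day_vec = [int(stamp.isoweekday() == d) for d in range(1, N_WEEKDAYS + 1)]
    is_weekend = int(stamp.isoweekday() >= 6)
    return month_vec + day_vec + [is_weekend, int(is_bank_holiday), float(temperature_c)]


# Combine separate date and time fields into a single datetime, used here
# as the index (dictionary key) of the dataset.
stamp = datetime.combine(date(2018, 10, 31), time(0, 0))

# 18.5 is a made-up temperature for illustration, not a value from the dataset.
dataset = {stamp: build_inputs(stamp, 18.5, False)}
n_inputs = len(dataset[stamp])
```

The consumption value for the same timestamp would be stored separately as the output, so it is not counted among the inputs.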

The correlation plots of energy consumption and mean temperature for all the investigated cities are given in Appendix A, where it is evident that the cities do not all share the same pattern of correlation between mean temperature and natural gas consumption. This variation in patterns is one of the reasons that the implemented models achieve different accuracies for different cities, as will be shown later.

**Figure 3.** Correlation of the energy demand and the mean temperature for the training set (**a**) and the test set (**b**) for the city of Athens.
