**Open Data and Energy Analytics**

Special Issue Editors

**Benedetto Nastasi Massimiliano Manfren Michel Noussan**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Special Issue Editors* Benedetto Nastasi Sapienza University of Rome Italy

Massimiliano Manfren University of Southampton UK

Michel Noussan Fondazione Eni Enrico Mattei Italy

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Energies* (ISSN 1996-1073) (available at: https://www.mdpi.com/journal/energies/special_issues/open_data_energy).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Article Number*, Page Range.

**ISBN 978-3-03936-218-9 (Pbk) ISBN 978-3-03936-219-6 (PDF)**

© 2020 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.



## **About the Special Issue Editors**

**Benedetto Nastasi** (PhD) is Senior Energy Planner and Lecturer at Sapienza University of Rome and Guest Researcher at TU Delft University of Technology. Previous affiliations include TU/e Eindhoven University of Technology, The Netherlands, and International Solar Energy Society and Guglielmo Marconi University, Italy. His work is related to Power-to-What solutions for energy systems design with a specific focus on the built environment. He has developed expertise on hydrogen technologies, energy efficiency, hybrid systems, energy efficiency in buildings, distributed generation, as well as micro and smart grids. He holds a PhD with Honors in Energy Systems Planning and Design at Sapienza University of Rome.

**Massimiliano Manfren** (PhD) is Lecturer in the Sustainable Energy Research Group (SERG), within the Faculty of Engineering and Physical Sciences of the University of Southampton (UK). His previous affiliations include Politecnico di Milano (IT) and University of Bologna (IT). His research focuses on analytics and predictive models for energy system design and operational optimization at multiple scales, from individual users to communities. His research aims to establish a convergence between scientific disciplinary knowledge in energy demand modelling at multiple levels; energy-efficient technologies; and advances in machine learning and operations research techniques, through an integrated use of simulation, optimization, statistics, and data mining on case studies. He holds a PhD in "Programming, Maintenance, and Rehabilitation of Buildings and Urban Systems" from Politecnico di Milano.

**Michel Noussan** (PhD) is Senior Research Fellow at Fondazione Eni Enrico Mattei (FEEM) Future Energy Research Program and Affiliate Professor of Sustainable Transport at Sciences Po's Paris School of International Affairs (PSIA). His current research activities are focused on the analysis and comparison of different mobility solutions in the framework of decarbonization and digitalization trends of the transport sector. He has developed expertise on energy systems analysis, combined heat and power, district heating, energy efficiency and local energy planning. He was a researcher and university lecturer at Politecnico di Torino in the domain of energy systems analysis, and he has a track record of several publications in international journals and conferences. He holds a PhD in Energy Engineering from Politecnico di Torino.

## *Editorial* **Open Data and Energy Analytics**

#### **Benedetto Nastasi <sup>1,2,\*</sup>, Massimiliano Manfren <sup>3</sup> and Michel Noussan <sup>4,5</sup>**


Received: 10 March 2020; Accepted: 30 April 2020; Published: 7 May 2020

**Abstract:** This pioneering Special Issue aims at providing the state-of-the-art on open energy data analytics; its availability in the different contexts, i.e., country peculiarities; and at different scales, i.e., building, district, and regional for data-aware planning and policy-making. Ten high-quality papers were published after a demanding peer review process and are commented on in this Editorial.

**Keywords:** open data analytics; energy planning; smart cities; open energy governance; urban database; energy mapping; building dataset; energy modelling; data mining; machine learning

#### **1. Overview**

Open data, and the policy implications of data-aware planning, make data collection, preprocessing, and postprocessing operations of primary interest. These procedures require that data be freely available to citizens and decision-makers; openness is, therefore, the best way forward. Referring to the relationship between data and energy, public administrations, governments, and research bodies are promoting the construction of reliable and robust datasets (i) to pursue policies coherent with the sustainable development goals, as well as (ii) to allow citizens to make informed choices. Energy engineers and planners must provide the simplest and most robust tools to collect, process, and analyze data, to offer solid data-based evidence for future projections at building, district, and regional scales for effective systems planning.

For all these reasons, researchers encouraged by the call for papers shared their original works in the field of "Open Data and Energy Analytics". Among the numerous submissions, the following 10 successfully passed the review process.

#### **2. A Short Review of the Contributions to This Issue**

Cutting-edge outcomes of ongoing and recently ended European research projects are published in this Special Issue. In detail, two H2020 projects, namely, PLANHEAT and HOTMAPS, are the sources of innovative results published in three original articles.

The paper authored by Fremouw et al. [1] deals with the role played by open data in supporting urban transition planning, thanks to the energy potential mapping within the H2020 project PLANHEAT. The aim of the paper is to identify the principal recurring issues in energy data acquisition and processing to overcome the existing barriers in data availability. The quality of energy mapping tools increases with the relevance and availability of energy data. Thanks to the activities of the HOTMAPS project, Pezzutto et al. [2] present the design of an open-source toolbox to support urban planners, energy agencies, and public administrations for planning the heating and cooling supply at different scales. A bottom-up approach is used to collect and analyze market data related to space heating and domestic hot water systems and their performance in Europe. Within the same HOTMAPS project, Müller et al. [3] face the challenge of uncertainties coming from different databases and from large differences in available datasets among EU countries. A top-down approach is proposed, and a comparison between country-level and municipal-level building stock data is made for gross floor area and energy demand for space heating and domestic hot water. Transparency and regular updating of datasets, fostered by the growing installation of smart meters, are crucial to support effective energy planning.

Moreover, this Special Issue also presents research works dealing with the potential of gathering useful information from available data in different fields, both for performance assessment and for the design of future scenarios.

Korkovelos et al. [4] illustrate an overview of open-access geo-spatial data and GIS-based electrification models aiming to support SDG7, with a detailed discussion on their role in answering complex policy questions. Their research work presents an updated version of the Open-Source Spatial Electrification Toolkit (OnSSET-2018), which is described in detail and applied to a case study in Malawi, comparing the cost of different electrification options by 2030. The results highlight that the optimal mix includes off-grid PV systems for two-thirds of the population, and power grid extension for the rest. The sensitivity analysis provides additional insights on the crucial role of electricity demand projections in the optimal electrification solution.

Electricity data can also support a better evaluation of the distributors' performance, as described by Ganhadeiro et al. [5] in a case study in Brazil. The authors propose an improved methodology to better assess how environmental variables affect the energy efficiency of electricity distribution companies. The methodology presented by the authors can be extended to other countries where there is at least some influence of private sector in energy distribution, or any other regulated service.

Another interesting case for the potential of data in supporting energy analyses is presented by De Kok et al. [6], who focus on the use of user-generated contents in social media to understand and improve the energy consumption behavior of individuals. The authors highlight the interesting potential of social media content as a complementary support to other sources, thanks to the massive amount of data and the low cost of analysis. Thanks to an image and text processing pipeline, relevant information can be extracted to describe different energy-consuming activities. The strengths and weaknesses of this approach are presented, by applying the method to two case studies in Amsterdam and Istanbul.

Zipperle and Orthofer [7] present an innovative open-source interface for the MESSAGEix model, named d2ix. MESSAGEix is an optimization model for strategic energy planning and integrated assessment of energy–engineering–economy–environment systems, including effects such as emissions, economic development, land and water use, and health implications. It can also be linked to the general-economy MACRO model to incorporate feedback between prices and demand levels for energy and commodities. The d2ix interface enables concise presentation and editing of model input data and increases the accessibility and transparency of the modelling processes, reducing barriers and simplifying collaborative working.

In the narrower field of energy efficiency in the built environment, Attanasio et al. [8] propose a methodology for the automatic estimation of building primary energy demand related to space heating and for the characterization of the relationship between the latter and the main building features. The methodology was tested using an energy performance certificate database with 90,000 flats in the Piedmont region (Italy) and four machine learning algorithms. The methodology can be used for quick estimation of expected building energy demand as well as for setting credible targets for improving building performance.

Another application of data analysis techniques in the built environment is presented by Manfren and Nastasi [9]. They describe an integrated workflow from parametric energy performance analysis to model calibration. A passive house building is a case study that seeks to show an effective and transparent way to link design and operation performance analysis together with reducing the efforts in modelling and monitoring by providing parametric performance boundaries. These performance boundaries are used to ease monitoring process and to identify insights in a simple, robust, and scalable way.

Finally, Vialetto and Noro [10] present an application of Internet of Things (IoT) and Industry 4.0 concepts to industrial energy efficiency. A clustering modelling approach for the short-term forecasting of energy demand in industrial facilities is shown. The forecasting model is applied to an industrial facility (wood processing industry) with simultaneous heat and electricity demand, where it proves to be effective, with a very small error in the order of 3%.

**Author Contributions:** Conceptualization, B.N.; writing—original draft preparation, B.N., M.M., and M.N.; writing—review and editing, B.N., M.M., and M.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Energy Potential Mapping: Open Data in Support of Urban Transition Planning**

#### **Michiel Fremouw <sup>1,\*</sup>, Annamaria Bagaini <sup>2</sup> and Paolo De Pascali <sup>2</sup>**


Received: 15 November 2019; Accepted: 25 February 2020; Published: 9 March 2020

**Abstract:** Cities play a key role in driving the transition to sustainable energy. Urban areas represent between 60% and 80% of global energy consumption and are a significant source of CO2 emissions, making energy management at the urban scale an important area of research. Urban energy systems have a strong influence on the environment, economy, social dimensions and urban spatial planning. Energy consumption affects the urban microclimate, urban comfort, and human health; conversely, urban physical, economic and social characteristics affect the urban energy profile. In order to improve the quality of energy strategies, policies, and plans, local authorities need decision support tools, such as energy potential mapping, which have gained significance in recent decades. Energy data are crucial for those tools. They can increase the quality and effectiveness of energy planning but also support the integration between energy and spatial planning. Energy data can also stimulate citizen engagement as well as encourage sustainable behaviours and CO2 emission reduction. This paper aims to advance the practice of data-aware planning, through the study of problems in energy data acquisition and processing observed in European projects focused on developing energy mapping tools. The problems observed fall into two main areas: technical and socio-economic issues. These were derived from a comparison of energy mapping tools and from the work conducted for the development of PLANHEAT. The scope of the research is to understand the main recurring issues in energy data acquisition and processing, in order to overcome the barriers in data availability. Increasing awareness of the relevance of energy data can foster the use of energy mapping tools, increasing the quality of energy policies and planning.

**Keywords:** energy planning; energy potential mapping; urban energy atlas; urban energy transition; energy data; data-aware planning; spatial planning

#### **1. Introduction**

Human society is facing an unprecedented challenge. The extended use of fossil fuels as a source of CO2 emissions has been the main driver of global climate change, and the built environment is playing a significant part in this. Changing urban planning practices is considered a primary component of the pathways towards climate mitigation [1,2]. In order to plan for the transition towards residual and renewable energy within this built environment, the nature of the urban fabric and local circumstances are highly important, yet often poorly mapped and quantified.

The availability of open data [3] contributes to the political, social and economic development of a country. Public administrations are often not aware of the value data can bring to societies, quality of life, environmental protection and the energy transition. In order to deliver this value, data must be available, accessible, user-friendly, and reusable [4,5]. In this sense, energy data (e.g., data about energy demand and local renewable energy sources) are crucial to innovate and increase the efficacy of urban energy policy and urban energy planning.

The urgency of countering climate change [1] spurs local governments to define their energy strategies, emission targets, and sustainability agendas, but in many cases these are lists of well-meant intentions rather than operative actions, and they are not easy to translate into specific interventions. The reason behind this is that urban energy policies are not directly based on real data and information, but on top-down, aggregated estimations of the city energy profile (i.e., the urban energy performance in terms of energy demand, local energy sources, and the future trend of energy demand and supply). The city energy profile is, however, strongly related to urban shape, geometry and physical characteristics [6–11]. Energy demand and energy sources are rarely distributed homogeneously within the city. Some areas show high energy consumption and waste, whereas others exhibit energy poverty issues; and there are areas with high resource availability, while in others, locally available resources cannot satisfy the demand [12].

Many studies [7,8,13–15] show that, in the long term, the most suitable opportunities to influence energy consumption and the related CO2 emissions are represented by the decisions taken in the field of urban planning. These decisions concern land use; the design of public and private mobility; the urban waste system (i.e., collection and recycling processes); the water system (i.e., supply and water treatment); energy production from renewable sources and its distribution; the green system; and, of course, the design of buildings. For renewable energy production, spatial planners must find optimal locations for windmills, biomass and solar power plants, as well as energy storage systems, while minimizing conflicts with competing uses and the environment. The irregular spatial distribution of energy demand and supply sources within cities shows a need for understanding the different relationships between energy and urban characteristics. In light of these considerations, the urban energy map is a tool able to give effective support to the decision-making and planning process [11,16,17].

Energy mapping tools can give a spatial dimension to the energy issue and provide a means of understanding the relationships between urban and energy factors. The energy map aims to clarify the characteristics of a specific energy pattern (demand, supply, production and distribution), through the spatial visualization of energy data, which can support "informed" interventions, suggest strategies, and identify priorities and the most suitable locations for energy district developments. The energy map may be also a good basis to trigger an informed debate on the choices to take: an ideal catalyst for discussions and for defining shared objectives.

The term "energy mapping" is not officially defined and, because of the complexity of the subject, encompasses a wide range of implementations. Over the last decade, cities have developed different tools to increase their capacity to evaluate the availability of renewable energy sources (such as solar maps and geothermal maps) in connection with energy consumption and demand (heating and cooling maps). Several techniques, methods and objectives have been developed, some only in empirical form, while others have been coded at the operational level, identifying steps of implementation [11,13,18].

All of these rely on sufficient availability of (geo) data. The literature on energy mapping instruments [11,13,16,19–23] offers a summary of the variables that an energy map should contain and integrate: the spatial distribution of the energy demand (electricity and heating/cooling energy demand); the spatial distribution of population; the land use; the characteristics of the building stock (use, year of construction, height of buildings, etc.); polluting emissions resulting from energy consumption; the spatial distribution of RES (Renewable Energy Sources, such as solar, solar thermal, wind, both in-land and off-shore, hydroelectric, geothermal, from biomass, from reconversion of excess heat, from waste-to-energy, etc.); heating degree hours (HDH); the presence and location of anchor loads; the design of energy grids; the location and size of expansion plans (residential, commercial, etc.); some specific barriers such as the presence of protected areas (frequently in the heritage and ecology categories); the mobility system; and possibly socio-economic indicators to identify the poorest or most degraded areas.

The innovative nature and the benefits of energy mapping tools have not necessarily resulted in swift implementation. One of the main difficulties is a lack of suitable data that are also available to the urban planner. Access to energy data is essential to developing the appropriate tools and improving the ability to make decisions that merge physical-spatial issues with energy-environmental ones. "Making urban data accessible" is therefore becoming a fundamental prerequisite for urban innovation [24], and for increasing the quality of energy-related policies. Energy data should be accurate and based on real acquisition (in order to reduce the inaccuracies associated with estimation), geospatially referenced, and measured at short temporal intervals over at least a year (in order to account for both peaks and seasonal fluctuations).

Fortunately, the digital revolution is offering significant perspectives in terms of data acquisition, processing and use, providing new opportunities for urban investigation. New technologies, such as smart meters (now also available for gas consumption), improve the ability to analyse the urban energy profile and can allow for a very detailed and real-time view of consumption. Mobile devices (mobile phones, tablets, etc.) can likewise be used to record consumption and citizen habits, and can also influence people's behaviour.

#### *Aims and Objectives*

This study builds upon the work conducted during the Horizon 2020 funded European PLANHEAT project, which started in 2016 [22]. The main objective of PLANHEAT is to develop an integrated tool, a potential energy map, which will empower public authorities (cities and regions) in the development of sustainable energy plans, with a special focus on distributed heat (cold) networks and energy district design. The PLANHEAT tool supports local authorities by providing:


Both the development of the PLANHEAT toolkit and its use require collecting energy-related data. Thus, an assessment was made early on in the project, in order to discover the main issues related to the availability and quality of these types of data (both open and internal/proprietary data), as experienced by the toolkit's intended end users [25]. Through questionnaires and interviews, 26 cities in 8 countries (France, Belgium, Italy, Greece, Netherlands, Hungary, Croatia, and Spain) were asked to evaluate the rate of data openness in 7 data categories: heat demand, heat supply, transport sector, census information, energy audits, knowledge and motivation, and finally, the connection of local and national plans. The results showed that every country involved has more than 50% of the types of data considered available publicly [18–25]; however, local authorities highlighted structural difficulties in reaching and using these data [25]. Even if datasets exist, getting the required data (for example when the data owner is another stakeholder) sometimes proves difficult, and the data may require processing or interpreting if they were recorded for different purposes.

These results bring forward the need to investigate in greater detail the reasons behind those difficulties, with the aim of increasing the capacity of PLANHEAT to provide a useful toolkit and increasing its usability by municipalities with varying levels of access to energy data. Thus, other projects with similar goals were analysed, with the aim of understanding which types of data they used, how they collected them and which problems they had to deal with. Comparing the difficulties that emerged in these projects with those that emerged in the PLANHEAT interviews made it possible to find similarities and common issues useful to determine the main barriers in developing and using energy maps.

The scope of this paper is to understand the nature of commonly recurring problems in accessing and processing energy data. These problems can limit the usability of energy mapping tools by local authorities. Identifying the main issues assists in making these instruments more adaptable and therefore more effective, in terms of supporting the energy transition and the decision-making process. Furthermore, this can also help improve the normative framework related to open data policies, and the elaboration of data standards and licenses for publication, which in turn may increase the future availability of suitable energy data.

#### **2. Materials and Methods**

This paper points out two main categories of problems related to energy data availability and usability: technical issues (spatial and temporal resolution) and socio-economic issues (privacy, financial costs, ownership, competition, etc.). These two categories of problems arise from (1) the interviews conducted during the preliminary phase of the PLANHEAT project [25] and (2) a comparative analysis of energy mapping projects and experiences, listed in Table 1.

In order to better understand the shared and common difficulties occurring in energy data access and processing for developing energy maps, both a literature review and a study of European projects were conducted. Only projects in Europe were studied, in order to remain within the same regulatory framework for energy efficiency and data. The literature review is based on material collected by searching scholarly databases, mainly Scopus.com and Sciencedirect.com, using keywords including: "Urban energy maps"; "Urban energy mapping tools"; "Urban energy atlas"; "Energy web maps"; "Energy decision-support tools". For the selection of EU (European Union) projects, the search was conducted on the European Commission web site, which collects all funded projects by topic. In the Intelligent Energy Europe section, projects were selected from the categories "Energy efficiency"; "Integrated initiatives"; "Heating and cooling" (https://ec.europa.eu/energy/intelligent/projects/). From this initial, wide range of energy mapping projects (EU projects and academic/institution ones) related to urban energy mapping tools, a selection was made of those deemed most suitable for comparison, from the perspective of evaluating the difficulties in collecting and processing data, as shown in Table 1. These projects provided sufficient information on the applied methodologies, the type of data used, and the results achieved for evaluation. Other projects found did not allow sufficient in-depth study for the analysis conducted, either because data collection was not a significant element, or because the underlying model is proprietary.

For each project, an analysis was made of the aim and type of the project; the steps of implementation; its status (ongoing, ended); the type of tool developed and its usability; the spatial scale(s) used; and the type of (geo) data used. The study focuses on the availability and processing of building-bound energy data, intending to identify the most frequently recurring problems, which can limit the implementation of energy potential maps by local authorities. The aim of the paper is to advance data-aware planning and policy-making in the field of energy planning and urban energy policies, by identifying opportunities and solutions for problems with data acquisition and processing. This helps to improve the effectiveness of energy mapping tools and makes their usage more affordable for the large number of smaller local authorities.

In the majority of these projects either open data is used, but the level of detail is low (city/region/country), or the detail level is high (below city level), but the user is required to input significant amounts of private data to get to the planning stage.


**Table 1.** List of Projects Focused on Energy Potential Mapping Development.

*Energies* **2020**, *13*, 1264

#### **3. Energy Planning Data: An Overview of Problems and Issues**

Although all these projects are intended for energy planning, and supporting local and regional authorities, their specific implementation varies. In some cases the end result was an energy atlas, in others a spatial and/or quantitative decision support tool for the built environment. They do all share a requirement for and ability to use (geo) data in order to provide their potentials and assessments.

Furthermore, they benefit from more accurate data to be able to represent the real energy profile of the city [12,26,27,40]. Open energy data can create both economic as well as social value [5]. Those are largely driven by the level of openness and the cost of availability.

The first step for increasing the relevance of energy in a planning process is defining which energy data are relevant at which steps and phases [26,41]. This allows a more directed collection of data and avoids loss of time and resources. The second step should be a general overview of data owners and stakeholders. It is necessary to clarify which stakeholders are crucial for which elements [42], to increase participation and understand which types of data are available and who can provide them (private bodies, public offices, European or international agencies). If the data needed are still not available, local authorities could consider building a new dataset. This process incurs a cost, however, which may be a significant hurdle for small organisations.

In most cases, energy-related datasets are available [43]. Municipalities, for example, have data about the floor space, year of construction, and building function (office, residential, etc.). Energy and infrastructure companies have billing data that refer to the energy consumption of their clients. Standard renewable energy potentials, such as solar (photovoltaic or thermal), are increasingly available at a high level of detail due to their relatively low input requirements. In this case it is possible to use information on roof surface, slope and orientation, all covered by a high-resolution digital elevation model (DEM) available as an open dataset, together with solar radiation data acquired from meteorological institutes (usually available as open data), to calculate roof potential [25].
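As a rough illustration of the roof potential calculation described above, the following Python sketch combines a roof footprint, slope and orientation with an annual irradiation figure. The function name, the orientation factors, the usable roof fraction and the system efficiency are all illustrative assumptions, not values taken from PLANHEAT or from any of the projects in Table 1.

```python
import math

# Hypothetical orientation factors: share of the irradiation received by an
# optimally oriented surface. Placeholder values for illustration only.
ORIENTATION_FACTOR = {"S": 1.00, "E": 0.85, "W": 0.85, "N": 0.60}


def roof_solar_potential_kwh(footprint_m2, slope_deg, orientation,
                             irradiation_kwh_m2,
                             usable_fraction=0.5, system_efficiency=0.15):
    """Rough annual yield (kWh) of a single roof plane.

    footprint_m2       -- horizontal footprint of the roof plane (e.g. from a DEM)
    slope_deg          -- roof slope in degrees, 0 <= slope < 90
    orientation        -- one of the keys of ORIENTATION_FACTOR
    irradiation_kwh_m2 -- annual irradiation per m2, assumed to refer to an
                          optimally oriented surface (meteorological data)
    """
    if not 0 <= slope_deg < 90:
        raise ValueError("slope_deg must be in [0, 90)")
    # Sloped surface area derived from the horizontal footprint.
    roof_area_m2 = footprint_m2 / math.cos(math.radians(slope_deg))
    factor = ORIENTATION_FACTOR[orientation]
    return (roof_area_m2 * usable_fraction * irradiation_kwh_m2
            * factor * system_efficiency)


if __name__ == "__main__":
    print(roof_solar_potential_kwh(100, 30, "S", 1000))
```

A real assessment would replace the fixed orientation factors with a proper irradiance model and account for shading; the sketch only shows how few inputs are needed, which is why such potentials are comparatively easy to publish at high detail.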

However, in many cases, the use of datasets faces several challenges. From the comparative study of energy mapping and energy analysis tools (Table 1) we discovered two main categories of data acquisition problems:


#### *3.1. Technical Issues*

As the process of urban energy planning is relatively new, most of the relevant data has been collected for other purposes. Because of this, the exact definition of the values represented in the dataset determines if, and if so, how suitable these are for energy planning purposes.

Furthermore, available data are usually only (publicly) released in an aggregated form, if at all. This applies both to spatial and temporal aspects.

#### 3.1.1. Spatial Resolution

An issue that was frequently encountered with geospatial dataset availability during the PLANHEAT project is that either the spatial resolution is too low for suitable projection or further analysis, or only a single figure is available for the area under consideration.

An example is open data on residential building gas consumption in the Netherlands, which is collected by energy supply companies (ESCOs) and made available publicly in aggregated form at the neighbourhood level, through Statistics Netherlands (CBS) [25]. As a result of privacy regulations, figures in some neighbourhoods were also occasionally reported as censored ('afgeschermd') when the number of houses or companies fell below a threshold. Experiences in other countries have been mixed, where sometimes the local ESCO did not provide even aggregated numbers at all, reportedly for competitive or political reasons. Section 3.2 goes into greater detail on this and other socio-economic issues.
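The censoring mechanism described above can be sketched as a simple threshold rule applied during aggregation. The following Python example is a minimal illustration: the function name and the threshold of five connections are hypothetical and do not reproduce the actual rule applied by Statistics Netherlands.

```python
def aggregate_with_suppression(records, min_connections=5):
    """Aggregate consumption per neighbourhood and censor small groups.

    records         -- iterable of (neighbourhood, consumption) pairs,
                       one pair per connection (house or company)
    min_connections -- hypothetical disclosure threshold; neighbourhoods
                       with fewer connections are reported as None
    """
    totals = {}
    counts = {}
    for neighbourhood, consumption in records:
        totals[neighbourhood] = totals.get(neighbourhood, 0.0) + consumption
        counts[neighbourhood] = counts.get(neighbourhood, 0) + 1
    # None plays the role of the 'afgeschermd' (censored) marker.
    return {
        name: (totals[name] if counts[name] >= min_connections else None)
        for name in totals
    }


if __name__ == "__main__":
    sample = [("A", 1200.0), ("A", 900.0), ("A", 1500.0), ("B", 2000.0)]
    print(aggregate_with_suppression(sample, min_connections=2))
```

For a planner, the censored entries are gaps in the map: the demand exists but cannot be read from the open dataset, so it has to be estimated or requested from the data owner.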

Data may also not be available in the same type of spatial division. Statistics agencies for example tend to collect their data using administrative divisions, whereas energy supply potential maps may be based on environmental data which uses rasters. Although GIS (Geographic Information System) software allows these to be projected over one another, using different source formats may introduce additional inaccuracies during both the analysis and subsequent processing steps.

A low spatial resolution may also frustrate the analysis of demand and supply geodata and matching of sources and sinks, especially if there is a strong relation with the urban fabric for the categories under consideration. Although a planner would be able to use low resolution data to do an initial, fast assessment of the possibilities in their city, designing energy transition plans requires not only (geo)data on demand, retrofitting (i.e., demand reduction), and (residual and renewable) supply potentials, but also on the possibilities of existing, and space for new, infrastructure.

In some cases, commercial organisations provide high resolution energy potential maps themselves. For the same competitive reasons mentioned above, however, they are rarely transparent about the exact calculation steps followed to produce the values displayed, and these may include internal assessments of suitability and cost that deviate from the considerations of other users. This lack of transparency in itself decreases the level of confidence in these datasets.

#### 3.1.2. Temporal Resolution

A second issue concerns the available temporal resolution. Datasets encountered during the PLANHEAT project that were suitable for energy planning were usually either geospatial (annual figures) or temporal (for one or a few buildings or areas) in nature, but rarely both (the one notable exception within the project being 1 to 5 km<sup>2</sup> resolution environmental satellite data, used for surface water potentials).

For mapping and planning purposes, annual data are usually sufficient, as at this stage of the planning process, quantities and concentrations are more important than temporal patterns. However, the fluctuating nature of both demand and some forms of residual and renewable energy supply might mean that the total potential figures require more than simply providing sufficient generation capacity, as there may be periods where supply vastly outpaces demand requirements (therefore effectively losing available energy), or conversely, demand outpaces supply.

An annual simulation can therefore be run to consider the impact of an energy transition plan. This, however, requires high resolution temporal (rather than spatial) data for the components that are part of the planned energy system, in order to determine whether these components are dimensioned to cope with both mismatches and peaks in demand and supply. As with high resolution spatial data, suitable high-resolution temporal data (at the hourly or finer level) are frequently not publicly available, and in some cases not available at all.
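The reason why annual totals alone are not sufficient can be shown with a minimal Python sketch. The four-step demand and supply series below are purely illustrative and stand in for a full year of hourly values; the function name is hypothetical and does not correspond to the PLANHEAT simulation module.

```python
def mismatch(demand_kwh, supply_kwh):
    """Compare demand and supply series of equal length.

    Returns the total surplus (supply that cannot be used directly),
    the total deficit (demand that is not covered) and the peak deficit
    in a single time step.
    """
    if len(demand_kwh) != len(supply_kwh):
        raise ValueError("series must have the same number of time steps")
    gaps = [d - s for d, s in zip(demand_kwh, supply_kwh)]
    surplus = sum(-g for g in gaps if g < 0)
    deficit = sum(g for g in gaps if g > 0)
    peak_deficit = max((g for g in gaps if g > 0), default=0.0)
    return surplus, deficit, peak_deficit


# Both series add up to 36 kWh, so the totals match exactly.
demand = [10.0, 12.0, 8.0, 6.0]
supply = [4.0, 9.0, 9.0, 14.0]

surplus, deficit, peak_deficit = mismatch(demand, supply)
```

Although total supply equals total demand in this example, a quarter of the demand is not covered at the moment it occurs, which is the kind of effect that only a time-resolved simulation can reveal.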

More than with the spatial dimension, the technical and privacy considerations of data acquisition can be an issue here. Registering an hourly profile for household electricity demand over a year firstly requires the presence of a smart meter, and secondly permission of the household to read and store more than just the annual balance (which is required for billing purposes, the primary reason an electricity meter is installed). Accurately assessing the residual heat potential of an industrial process requires both temperature and volume monitoring, especially when used in a simulation. Even if monitoring is already active, commercial entities may be reluctant to (publicly or privately) release these figures, because they might subsequently be analysed by competitors.

#### *3.2. Socio-Economic Issues*

The analysis of the PLANHEAT interview results and the literature review (European and academic research related to energy mapping tools) highlights several socio-economic difficulties in dealing with energy data: data ownership and privacy, market competition, irregular updating of datasets, discrepancies between data formats, problems of data aggregation and disaggregation, and the inability of public administrations to handle and process data. Privacy is always mentioned as one of the key challenges. Article 8 of the Charter of Fundamental Rights of the European Union and Article 16 of the Treaty on the Functioning of the European Union guarantee the protection of personal data, which means that the acquisition, transfer and publication of personally identifiable data are subject to restrictions [44]. The EU has several framework policies that protect the privacy of individuals (for instance the General Data Protection Regulation (GDPR), applicable since 2018 [45]), and these create barriers to the acquisition of data, their transfer between actors (sometimes even between departments within the same civil authority) and their publication [25]. When using datasets based on statistical divisions, even neighbourhoods and districts may sometimes raise confidentiality or privacy issues if very few addresses are located there. For privacy issues in publication, a simple solution is available: aggregation, which can provide acceptable anonymity. However, data aggregation may also reduce the quality of the final output. In many cases, local authorities possess data only at a low spatial resolution, for example, city-level energy consumption (or even regional-scale statistical data). In this case a disaggregation process, using spatial indicators (such as the city cadastre or land use), is needed to estimate the spatial distribution of energy consumption in the city.

Energy data are often in the hands of actors who may be unwilling to share them (energy companies, or other private companies that for some reason collect useful data). The competitive market is also a challenge when publishing open data is seen as potentially disadvantageous for companies. Sometimes (commercial) energy companies have relevant data, but either do not provide them at all or only share them at a low resolution, unsuitable for planning purposes. In this case, it is possible to run into problems of data combination and harmonization, because the liberalization of the energy production and sales market means that datasets come from different data owners, with different standards and aggregation methods.

Sometimes new data acquisition is needed. Creating and publishing datasets is costly and takes a long time. Understanding which types of data are available makes it easier to discover which data are missing, so that data collection efforts can be specifically targeted at these, avoiding extra costs. Interviewing suitable stakeholders is also useful, especially when data collection projects are already planned and the user simply has to wait for the data to become available. In this case, the issue is more one of cooperation than of cost.

Useful information may therefore sometimes be derived from unexpected sources, and communicating the benefits of data sharing helps to make cooperation more attractive for companies.

A lack of awareness is also a key challenge. The difficulty of accessing energy data is connected to the complexity of the energy production and supply system, which has increased with the liberalization of the energy market and the entry of different and new stakeholders. Commercial companies often have little interest in processing and sharing data. Public administrations have great difficulty in managing data, as a result of the differing standards and aggregation methods used by the operators. Municipalities and commercial companies are also neither motivated nor incentivised, and are sometimes not aware of the potential benefits of sharing their energy data. When data owners are not aware of the potential value and the possibilities that open energy data offer them, the step towards publishing their data is unlikely to be taken [5]. Currently, commercial companies (especially those related to the energy market) see the opening of energy data as an added cost without any benefit. The benefits of opening data would become much clearer if there were direct incentives (financial incentives, government support) for private parties to publish energy data.

Lastly, data quality (including age) and the completeness of datasets are also crucial. Inaccurate or incomplete datasets can lead to misinterpretation and liability issues. For example, final consumption data that are a decade old may no longer be representative, as not only the size and thermal efficiency of the building stock may have changed, but internal heat loads (for example, more electronics) and user habits will be different as well.

An interesting element of the Scotland heat map [28] is the methodology used to classify the quality of collected data: the confidence level. Data are categorized with respect to the level of detail provided. Confidence level 5, the most accurate one, represents data with high resolution, for example, data that come from household bills. At confidence level 1, by contrast, only the footprint of buildings is available, from which urban energy consumption has to be estimated or disaggregated. The goal of the Scotland heat map is to increase the quality of all the data collected over time, in order to represent the energy profile of the Scottish territory as accurately as possible [28].

Other problems linked to the quality of data are related to consumption habits and certain energy carriers. In some regions, part of the heat demand is fulfilled with unmetered sources, for example, wood (pellets) or fuel oil. The use of electric resistance heating means that some heat demand is effectively concealed as electricity consumption. This also applies to cooling, which is largely supplied by air conditioning. Cities where cooling demand plays a significant role, or is expected to do so in the future because of climate change, lack methods to extract actual cooling demand from the data they have [25].

Finally, the lack of a governance framework, guidance, and regulation specifically on open energy data may also form a problem.

#### **4. PLANHEAT: Using Public Open Data to Overcome Problems in Data Acquisition**

Addressing the issues with data availability forms the basis of the PLANHEAT project [25]. In this Horizon 2020 funded project, a toolkit was developed that both integrates a wide range of open datasets and allows the user to replace these with higher resolution private data, and to add their own data for which there is no public substitute yet.

The PLANHEAT integrated toolkit is open source and QGIS3-based. The use of GIS software is becoming commonplace in urban planning practice but requires significant amounts of high-resolution data in order to produce useful results. A Heat/Cold (HC) potential map, as in this case, needs a large amount of data (geospatial, temporal and other), which municipalities may not have or cannot use. The main innovation of the PLANHEAT toolkit is that it can produce useful results even if a local administration only has a limited amount of information available. Conversely, the same toolkit can also use rich datasets, which makes it attractive for large metropolitan areas, where detailed data are frequently available to produce more accurate maps, plans and simulations.

The objective of PLANHEAT is to offer a support tool based as much as possible on public and open databases, for example European ones, which are loaded automatically in the tool settings. Initiatives like the EU Open Data Portal [13] aim to unlock data already collected by governments and institutions, free of charge and without copyright. The PLANHEAT toolkit only asks municipalities for a relatively small amount of initial information, such as the municipal and district boundaries, the location of monuments or historic buildings with high cultural and aesthetic value, or particular consumption habits or resources used. Municipalities can replace the data provided in the toolkit when more detailed local data become available, in order to increase the accuracy of the results. The ability to use generic data at a lower level of detail makes it possible to get started quickly without having to spend significant effort on data collection. It also reveals on which layers additional data collection efforts should be concentrated, and which ones can be investigated later. For example, a map resulting from the disaggregation of NUTS3 datasets (Nomenclature of Territorial Units for Statistics) allows the peaks of demand or the greatest territorial potentials to be visualized, which represents a crucial starting point for increasing the quality of urban energy strategies and planning.

The PLANHEAT tool consists of three modules: the mapping module (mapping local demand and supply sources), the planning module (planning new scenarios based on residual and renewable energy sources), and the simulation module (simulation of the new scenarios and Key Performance Indicator (KPI) evaluation). Within the energy consumption and supply mapping module, the tool provides two approaches.

The first one is the City Mapping Module (CMM), which applies a top-down or disaggregation method (Figure 1) [16,22,23]. For this method, the input required from the user is very limited. Only aggregated final consumption figures per sector (in gigawatt-hours, GWh) are required, together with the boundary or boundaries to which they apply (city, districts). Subdivision within each boundary is made by applying the appropriate geospatial indicator, some of which are available directly through the PLANHEAT web database. Examples of geospatial indicators are the CORINE land use map (Coordination of Information on the Environment) and OpenStreetMap (specifically building footprints) [22].

**Figure 1.** 'City' (top-down or disaggregation) and 'District' (bottom-up or aggregation) methods, as used in PLANHEAT [14].
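The principle behind the top-down ('City') method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CMM implementation: the function name, the zone names, the use of building footprint area as the only indicator and all figures are hypothetical.

```python
def disaggregate(total_gwh, indicator_by_zone):
    """Distribute an aggregated figure over zones in proportion to an indicator."""
    indicator_sum = sum(indicator_by_zone.values())
    if indicator_sum <= 0:
        raise ValueError("the indicator must have a positive total")
    return {zone: total_gwh * value / indicator_sum
            for zone, value in indicator_by_zone.items()}


# Hypothetical geospatial indicator: residential building footprint area (m2) per zone.
footprint_m2 = {"zone_1": 50_000.0, "zone_2": 30_000.0, "zone_3": 20_000.0}

# Hypothetical city-level residential final consumption of 120 GWh per year.
residential_gwh = disaggregate(120.0, footprint_m2)
```

The zone values always add up to the aggregated input figure, so the accuracy of the resulting map depends entirely on how well the chosen indicator reflects the real distribution of demand.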

The second one, the District Mapping Module (DMM), uses a bottom-up approach (aggregation, [16,22,23], Figure 1), deriving heat demand from building characteristics. Although more complicated, it provides two calculation routes: 'complete', which requires 11 variables, and 'simplified', which is less accurate but requires only a small amount of input (a unique identifier for each building, its gross floor area, and its age and use). Table 2 [22] shows a comparison between the input data for the 'complete' and 'simplified' methods. Both methods result in individual building values for use by the simulation module, which can also be aggregated for mapping purposes.
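A bottom-up calculation in the spirit of the 'simplified' route can be sketched as follows. The lookup table of specific heat demand values, the two construction periods and the building records are all hypothetical placeholders; the actual DMM parameters are documented in the project reports [22].

```python
# Hypothetical specific heat demand in kWh per m2 of gross floor area per year,
# keyed by (building use, construction period). These are NOT PLANHEAT values.
SPECIFIC_DEMAND = {
    ("residential", "pre-1980"): 150.0,
    ("residential", "post-1980"): 90.0,
    ("office", "pre-1980"): 120.0,
    ("office", "post-1980"): 70.0,
}


def construction_period(year_built):
    """Map a construction year to one of two illustrative age classes."""
    return "pre-1980" if year_built < 1980 else "post-1980"


def building_demand_kwh(building):
    """Estimate annual heat demand from floor area, age and use."""
    key = (building["use"], construction_period(building["year_built"]))
    return building["gross_floor_area_m2"] * SPECIFIC_DEMAND[key]


buildings = [
    {"id": "b1", "gross_floor_area_m2": 200.0, "year_built": 1965, "use": "residential"},
    {"id": "b2", "gross_floor_area_m2": 1000.0, "year_built": 2005, "use": "office"},
]

per_building_kwh = {b["id"]: building_demand_kwh(b) for b in buildings}
district_total_kwh = sum(per_building_kwh.values())
```

The per-building values correspond to what a simulation step would consume, while the district total corresponds to the aggregated figure used for mapping.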

**Table 2.** Input data requirement comparison of the DMM calculation methods [22].


The DMM was tested on a district close to Lecce, Italy, where little data was available. With this tool, it was possible to determine the district energy demand and to identify opportunities for developing energy strategies based on RES and alternative energy resources. The validation test proved very useful for supporting the planning activity of the municipality.

On the supply side, both environmental and anthropogenic sources were included. Input data can be divided into technical parameters and geospatial indicators. An example is residual heat from wastewater treatment plants (WWTPs), for which the European Environment Agency had a base dataset that covered the EU28 [46,47]. Input data used are shown in Table 3.


**Table 3.** Geospatial indicators and technical variables used for residual heat potential from WWTPs.

Although explicit levels of confidence were not implemented project-wide, because in many cases a single data flow and a corresponding continentally uniform dataset was chosen, the sewage heat recovery dataset contains two Confidence Levels (CLs). Measured flow is not available for some facilities; in these cases, residual heat potential was estimated from capacity multiplied by the average national load (see Table 3) and assigned a lower CL [14]. In future versions of the tool, which may contain a wider range of calculation methods (with higher or lower input data resolution) catering to specific European regions, this feature will likely be expanded upon.
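The fallback logic described above can be sketched in Python. This is a rough illustration only: the function name, the confidence labels, the temperature drop of 5 K, the interpretation of capacity as an annual design flow and all facility figures are assumptions, and the heat formula is the generic sensible-heat relation rather than the project's documented method.

```python
# Volumetric heat capacity of water: about 4186 kJ/(m3 K) / 3600 = 1.163 kWh/(m3 K).
WATER_HEAT_CAPACITY_KWH_PER_M3_K = 1.163


def wwtp_heat_potential(measured_flow_m3, capacity_m3, national_load, delta_t_k=5.0):
    """Return (annual heat potential in kWh, confidence label) for one facility.

    Uses the measured annual flow when available; otherwise estimates the flow
    as capacity times the average national load and lowers the confidence.
    """
    if measured_flow_m3 is not None:
        flow_m3, confidence = measured_flow_m3, "higher"
    else:
        flow_m3, confidence = capacity_m3 * national_load, "lower"
    heat_kwh = flow_m3 * WATER_HEAT_CAPACITY_KWH_PER_M3_K * delta_t_k
    return heat_kwh, confidence


# Hypothetical facility with a design capacity of 2 million m3 per year.
with_measurement = wwtp_heat_potential(1_000_000.0, 2_000_000.0, 0.7)
without_measurement = wwtp_heat_potential(None, 2_000_000.0, 0.7)
```

Carrying the confidence label alongside each value lets a user see which facilities would benefit most from replacing the estimate with measured data.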

The full list of input data used in the PLANHEAT project is extensive, and can be found in the reports released by the project [16,18,22,24,46,47]. These represent a balance between accuracy, public availability, and full coverage of the EU28, and can be accessed through the PLANHEAT web database.

Experiences gained during the PLANHEAT project show that it is possible to start an energy mapping process with relatively little data input. This opportunity is crucial for data-aware planning and for supporting policy-making from both an energy and a spatial perspective.

#### **5. Conclusions and Recommendations**

The energy transition will only be successful if it is integrated into the urban planning process [13,20]. As the need to increase the sustainability of the built environment is widely acknowledged, there is a clear need to rethink and implement new urban planning procedures in order to meet these expectations. A better understanding of, and closer interaction between, urban planning and energy issues is useful not only for the planners themselves but also for the private sector, local communities and citizens, who may take appropriate decisions and benefit from the related economic and social added value [42].

The economic value could come from increased economic activity and employment, while the social value could come from improved social conditions within the different communities of a city. Sharing responsibility also strengthens the involvement of these communities (for example, through local energy communities).

Alhamwi et al. [21] and Manfren et al. [48] argue that the most innovative and advanced planning practices include communication strategies and actions for activating the participation of all urban actors. Supporting tools for visualizing urban phenomena and simulating future trends will play an increasing role in the coming years. The use of these tools supports inclusion and participation, because it allows the sharing of goals, shows the advantages of decisions, finds which actors will be involved in the processes, enables community participation (by the creation of local energy cooperatives), influences behaviour and increases overall awareness.

The energy sector produces economic benefits and attracts investments. In the transition from a centralized energy system to a distributed one, local economic opportunities increase. Energy potential maps are a useful tool to increase investment in clean energy and energy efficiency, providing the geospatially quantified information required to guide urban transformation processes. An energy potential map can also influence investor choices, attract external financing and demonstrate potential economic income, increasing the attractiveness and competitiveness of the city. The renewable energy sector is an expanding market, which rewards the most virtuous cities, communities and companies. In the long term, a city's ability to deal with climate change and offer high-quality, healthy environments will depend on its capacity to understand these complex phenomena. Supporting tools can represent a winning means to face new urban challenges and boost sustainable development.

The opportunity of applying innovative methodologies (and the tools that facilitate these) does not necessarily result in immediate concrete application of the opportunities identified. One of the causes is a lack of availability of required and suitable data. Using appropriate energy data in the planning process (both spatial and energy related) allows for a more effective analysis, as well as more effective interventions and strategies. Data collection issues can be divided into two main categories: technical issues and socio-economic issues. Research on open data maturity in Europe (EU28+) [5] shows that countries have completed over 55% of their open data journey through the development of basic open data policies and open data portals. Even if suitable datasets exist, the development of supporting tools is complicated. For this reason, the development of an incremental tool like PLANHEAT, applicable in every context and based on public open data, permits a quick start and increases usability.

From an open data perspective, the strategy should be the creation of open data policies, the increase of policy quality, and the setup of standards and licenses for the publication, all of which will in turn improve the availability of suitable energy data. Sometimes, a lack of standardization is related to a desire to facilitate data publication from data owners. Setting up standards would require a bigger effort from them, which might also translate to higher related costs. However, standardization is crucial for making the interoperability of datasets possible and increasing their use (or re-use) by local authorities.

Energy issues and spatial planning are tightly connected. For this reason, facilitating the vertical and horizontal coordination between urban and energy stakeholders is very important, in order to provide energy demand and supply data, to invest in integrated projects and to actively participate in the energy transition [42].

**Author Contributions:** Conceptualization, M.F. and A.B.; Methodology, A.B. and M.F.; Validation, M.F., P.D.P.; Formal Analysis A.B.; Resources, M.F. and A.B.; Data Curation, M.F. and A.B.; Writing-Original Draft Preparation, A.B., M.F.; Writing—Review & Editing, A.B., M.F.; Supervision, M.F. and P.D.P.; Project Administration M.F. and A.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** Part of the research leading to this paper has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n° 723757).

**Acknowledgments:** The authors would like to thank all PLANHEAT partners: RINA Consulting; Delft University of Technology; University of Zagreb; National Observatory of Athens; Tecnalia; Vito; Lecce local authority; Antwerp local authority; Velika Gorica local authority; the Regional environmental center for central and eastern Europe; Euroheat & Power; Geonardo; Artelys.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Assessment of the Space Heating and Domestic Hot Water Market in Europe—Open Data and Results**

## **Simon Pezzutto <sup>1,\*</sup>, Silvia Croce <sup>1,2</sup>, Stefano Zambotti <sup>1</sup>, Lukas Kranzl <sup>3</sup>, Antonio Novelli <sup>1</sup> and Pietro Zambelli <sup>1</sup>**


Received: 13 April 2019; Accepted: 8 May 2019; Published: 9 May 2019

**Abstract:** The paper investigates the European space heating (SH) and domestic hot water (DHW) market in order to close knowledge gaps concerning its size. The stimulus for this research arises from inconsistencies found in SH and DHW market data in spite of over two decades of scientific research. The investigation has been carried out in the framework of the Hotmaps project (Horizon 2020, H2020), which aims at designing an open source toolbox to support urban planners, energy agencies, and public authorities in heating and cooling (H&C) planning at country, regional, and local levels. Our research collects and analyzes SH and DHW market data in the European Union (EU), specifically the number of operative units, installed capacities, energy efficiency coefficients, and equivalent full-load hours per equipment type and country, with a bottom-up approach. The analysis indicates that SH and DHW account for a significant portion of the total EU energy utilization (more than 20%), amounting to almost 3900 TWh/y. At the same time, the energy consumption supplied by district heating (DH) systems exceeds that of condensing boilers. While DH system applications are growing throughout the EU, the replacement of old, conventional boilers progresses at a slower pace.

**Keywords:** space heating; domestic hot water; market assessment; EU28; district heating; open data

#### **1. Introduction**

While the member states (MS) of the European Union (EU) aim at reaching an integrated policy framework, especially directed at delivering sound market regulations to investors, national policies have focused on a permanent improvement in efficiency and on bringing the share of energy generated by renewable energy sources (RES) to 27% by 2030. Moreover, EU MS explicitly set a goal of reducing greenhouse gas (GHG) emissions by 40% compared to 1990 levels by 2030 [1], aiming at 80–95% by 2050 [2]. Fulfilling the agreement reached at the 21st Conference of the Parties in Paris (COP21) will require reaching at least the upper bound of this range [3].

In 2015, the EU's primary energy consumption amounted to about 1600 Mtoe/y, of which the major contribution is made by heating and cooling (H&C) applications (about 800 Mtoe/y, including industrial heat), followed by transport and electricity (about 490 Mtoe/y and 310 Mtoe/y, respectively) [4–8]. Buildings are responsible for approximately 640 Mtoe/y, which corresponds to 40% of the whole EU primary energy consumption [9,10]. The largest shares of energy utilization within the EU building stock (~75% in total) are, in order of magnitude, space heating (SH), domestic hot water (DHW), and space cooling (SC) [11]. In the past three decades, EU MS invested massively in assessing the energy used by the different sectors [12–17]. In contrast to SC, the SH and DHW field has been well researched in the scientific literature for more than two decades [7,18].
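The figures quoted above can be cross-checked with a short Python calculation. The only value not taken from the text is the conversion factor of 11.63 TWh per Mtoe (1 toe = 41.868 GJ). Comparing the almost 3900 TWh/y from the abstract with primary energy consumption is an assumption made here for illustration; the paper may use a different reference quantity for its 'more than 20%' share.

```python
TWH_PER_MTOE = 11.63  # 1 toe = 41.868 GJ = 11.63 MWh

total_mtoe = 1600.0
sectors_mtoe = {"heating_and_cooling": 800.0, "transport": 490.0, "electricity": 310.0}
buildings_mtoe = 640.0

# The three sector contributions should add up to the stated total.
sector_sum_mtoe = sum(sectors_mtoe.values())

# Share of buildings in primary energy consumption (stated as 40%).
buildings_share = buildings_mtoe / total_mtoe

# Total primary energy consumption expressed in TWh per year.
total_twh = total_mtoe * TWH_PER_MTOE

# Assumed comparison: SH and DHW figure from the abstract against the total.
sh_dhw_share = 3900.0 / total_twh
```

Under this assumption the SH and DHW share comes out at roughly 21%, which is at least consistent with the order of magnitude stated in the abstract.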

In particular, the EC supported a number of studies to provide quantitative data in this area, and inform related energy strategies and roadmaps (e.g., EC 2019 [19], Pardo et al. 2012 [20], and EC 2011 [21]). Further notable studies on the SH and DHW market in Europe were the result of various projects, such as the H2020 HRE4 [22], IEE STRATEGO [23], and Seventh Framework Programme (FP7) iNSPiRe [24]. Moreover, the following reports provide valuable insights into the investigated market: Patronen et al. 2012 [25], Boermans et al. 2012 [26], Von Manteuffel et al. 2016 [27], and Sanner et al. 2011 [28]. Finally, also scientific journal papers like Scoccia et al. 2018 [29], Balaras et al. 2007 [30], and Leurent et al. 2018 [31] contribute to the understanding of the SH and DHW market in Europe.

Carried out in the framework of the Horizon 2020 (H2020) Hotmaps project [32], our study generated a data repository for the SH and DHW market [33] (sources are available in the respective csv file [34]). The data are released as open data under the Creative Commons license CC-BY 4.0 [35]. This EU-funded project aims at designing an open source toolbox (released under the Apache 2.0 license [36]) to support urban planners, energy agencies, and public authorities in heating and cooling (H&C) planning at different scales (national, regional, and local), and in line with EU policies. As part of the project, the data analyzed in the present paper have been collected through a bottom-up approach. Based on the analysis carried out in [37], a more comprehensive investigation has been performed. In particular, we took a closer look at the uncertainty of generated results, provided an interpretation of the main outcomes, compared the main result with related findings of scientific literature as well as discussed its implications.

In order to create a high-quality data set, characterized by completeness, accuracy, and reliability, our analysis places a special focus on the following aspects: data inventory, data reliability, and data definition and comparability.


#### *1.1. Data Inventory*

One of the main challenges of creating an inventory of SH and DHW market technologies consists of preparing an exhaustive list of all existing data. Generally, the use of data collected at EU-wide level offers unique advantages due to their extensive territorial scope (e.g., EurObserv'ER [40], EUROHEAT&POWER [41], and EHPA [42]). However, data completeness can never be fully ensured.

Closing data gaps requires more than extrapolating and assembling data from large data sets available online (e.g., EU Buildings Database [43], EHPA's Online Stats Tool [44], and IGA [45]). To ensure a rigorous approach and address the lack of data, it also involves searching for data source by source, especially in individual scientific literature sources such as journal papers (e.g., Bertoldi et al. 2012 [46], Martinopoulos et al. 2018 [47], and Clay 2015 [48]).

One important aspect of the data inventory is to ensure that the information can be understood and interpreted correctly by any user. This requires compiling a clear metadata description, annotation, contextual information, and documentation. The data documentation provides standardized structured information, indicating the creator, title, time references, access conditions, and terms of use of the data collection (please see Pezzutto and Zambelli 2019 [34] and Pezzutto and Zambelli 2019 [49]). The data repository is structured following the Frictionless Data standards [50], which encode and describe the metadata and the main data set information in a datapackage.json file that is readable by both humans and computers. A more detailed insight into the methodology that produced the data set is provided by the respective README.md file [51]. The license of the data repository is encoded using the Software Package Data Exchange (SPDX) format [52], in order to identify the license unambiguously. The Hotmaps project uses a git repository to publish the data set instead of, for example, a File Transfer Protocol (FTP) service, because a git service makes it possible to: (i) preserve the history of the changes; (ii) perform automatic versioning of the data, so that each version of the data set is uniquely identified; and (iii) provide the functionality to manage and discuss external contributions (e.g., open/assign/close issues, accept/comment/reject data modifications, etc.). Further details on the data inventory can be found in the Hotmaps Data Management Plan (DMP) [53].
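To give an impression of what such a descriptor looks like, the following Python sketch builds and reads back a minimal data package descriptor using only the standard library. The property names (`name`, `title`, `licenses`, `resources`, `schema`, `fields`) follow the Frictionless Data Package specification, and `CC-BY-4.0` is the SPDX identifier of the license named in the text. The package name, file path and column names are hypothetical and do not reproduce the actual Hotmaps descriptor.

```python
import json

# Minimal, hypothetical descriptor in the style of a Frictionless data package.
descriptor = {
    "name": "sh-dhw-market-example",
    "title": "Example space heating and domestic hot water market data",
    "licenses": [
        {"name": "CC-BY-4.0", "title": "Creative Commons Attribution 4.0"}
    ],
    "resources": [
        {
            "name": "market-data",
            "path": "data/market.csv",  # hypothetical file location
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "technology", "type": "string"},
                    {"name": "operative_units", "type": "integer"},
                ]
            },
        }
    ],
}

# Serialize as it would be stored in datapackage.json, then parse it again.
text = json.dumps(descriptor, indent=2)
loaded = json.loads(text)

license_ids = [lic["name"] for lic in loaded["licenses"]]
field_names = [f["name"] for f in loaded["resources"][0]["schema"]["fields"]]
```

Because the descriptor is plain JSON, both a person reading the repository and a script validating the data can retrieve the license and the column definitions from the same file.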

#### *1.2. Data Reliability*

Much effort has been dedicated to analyzing sources, assessing the reliability of the gathered data, and filling existing gaps through in-depth investigations. We distinguish between various types of information by analyzing the different approaches applied for the collection of the identified data (e.g., number of SH and DHW units sold vs. operative units). Where documentation was lacking or uncertain, the data have not been considered for the development of the database.

All information collected on SH and DHW (i.e., number of operative units, installed capacities, energy efficiency coefficients, and equivalent full-load hours per equipment type and country) has been filtered and evaluated statistically; the methodology adopted is described in Section 2, Materials and Methods. Moreover, additional sources and types of information have been used to validate the outcomes obtained for the EU28 and assess their reliability (see Section 4, Discussion).
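How quantities of this kind are commonly combined in a bottom-up estimate can be sketched as follows. This is a generic illustration, not the calculation used in the paper (which is described in Section 2): the function name and all input figures are hypothetical, and the sketch assumes that capacity refers to thermal output and that the efficiency coefficient relates delivered heat to final energy input.

```python
def annual_energy_twh(units, capacity_kw, full_load_hours, efficiency):
    """Return (delivered heat, final energy input) in TWh per year.

    delivered heat = units * capacity * equivalent full-load hours
    final energy   = delivered heat / efficiency
    """
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    delivered_kwh = units * capacity_kw * full_load_hours
    final_kwh = delivered_kwh / efficiency
    return delivered_kwh / 1e9, final_kwh / 1e9  # 1 TWh = 1e9 kWh


# Hypothetical equipment type in one country: one million 20 kW boilers
# running 1500 equivalent full-load hours at an efficiency of 0.9.
delivered_twh, final_twh = annual_energy_twh(1_000_000, 20.0, 1500.0, 0.9)
```

Summing such estimates over equipment types and countries would give an EU-wide figure, which is why the reliability of each of the four inputs matters for the overall result.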

#### *1.3. Data Definition and Comparability*

Although most data providers use standardized data formats and units, this does not necessarily mean that data are entirely comparable. In order to increase data comparability, the entire process of data elaboration requires adjusting differences and inconsistencies resulting from different methods, assumptions, measures, time references, and specifications [54].

Data have been collected for each EU MS using the most recent year available, while data over a decade old have been excluded (please see [34]). The developed data sets including the documentation are expected to improve data quality, add value to already existing data and provide data needed to monitor the progress of the SH and DHW field in Europe.

#### **2. Materials and Methods**

Our main data sources were derived from previous works, in particular those elaborated by AALBORG UNIVERSITY, HALMSTAD UNIVERSITY, and EUROPA-UNIVERSITÄT FLENSBURG in the context of several projects dedicated to the topic, including the data sets of the H2020 project Heat Roadmap Europe 4 (HRE4) [55], and the Intelligent Energy Europe (IEE) project STRATEGO (Multi level actions for enhanced Heating and Cooling plans) [56]. Another source, relevant for the data set compilation of the present investigation, is the data collection of the tender "Mapping and analyses of the current and future (2020–2030) heating/cooling fuel deployment (fossil/renewables), ENER/C2/2014-641" led by the Fraunhofer Institute for Systems and Innovation Research (FH ISI) [57].

Besides the deliverables achieved through the projects above (such as [58–60]), "Deliverable 2.1 Intermediate analysis of the heating and cooling industry" [61] was key in carrying out our work. The deliverable was produced within the tender "Support to key activities of the European technology platform on renewable heating and cooling"—PP-2041/2014.

Additional important information is provided by reports of Solar Heat Worldwide (e.g., [62,63]), EUROSTAT [64], and the TABULA WebTool [65]. Scientific publications have also been used as data sources, e.g., [66–68]. Given the large number of references, only the major ones are indicated in Section 3. Results and Table 1. Table 1 summarizes the most relevant data sources per type of information researched.


**Table 1.** Key data sources for amount of operative units, installed capacities, energy efficiency coefficients, and equivalent full-load hours per space heating (SH) and domestic hot water (DHW) equipment type and country (EU28), and information on public availability of data.

The analysis started by considering different SH and DHW technologies installed throughout Europe. The data were collected for each MS—as sources mainly provide information at country level—and were not subdivided by sector. The equipment typologies were categorized as found in [58,62,69]:

	- Non-condensing;
	- Condensing;
	- Aerothermal;
	- Geothermal;
	- Unglazed collectors;
	- Flat-plate collectors;
	- Evacuated tube collectors;

In the list above, furnaces were classified in the category "Boilers, Non-condensing".

For each MS and type of equipment, data regarding number of units, installed capacity, yearly equivalent full-load hours, and energy efficiency coefficients were collected. With regard to energy efficiency coefficients, the absolute majority of the technologies identified were characterized by thermal efficiency. These include condensing and non-condensing boilers, stoves, electric radiators, CHP-IC units, and various solar thermal systems (unglazed, flat-plate, and evacuated tube collectors). Aerothermal and geothermal HPs were instead described by the coefficient of performance (COP). In order to estimate the efficiency of DH systems, mean losses were included by considering DH network heat losses [70].

We also researched information on system types—in percentage at country level—as well as resources used to fuel each equipment considered. The latter are classified as proposed in [58,64,69]:


The category "Other fuels" included less dispersed combustibles (e.g., coke, peat, etc.) [64].

The data analysis was based on a bottom-up approach, which included an extensive literature analysis aimed at deriving reliable values. Data collected from scientific literature sources were filtered and statistically analyzed.

As a first step, for each MS and category of information (i.e., number of units, installed capacity, yearly equivalent full-load hours, and energy efficiency coefficients), at least three data points were collected from different sources when possible. Then, their mean value was calculated. Depending on the number of references, data points that fell outside a range of plus or minus one standard deviation around the mean of the respective data pool were excluded. The remaining values were used to calculate a more robust mean.
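The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to generate the published data set: the function name `robust_mean`, the use of the sample standard deviation, the minimum pool size of three, and the example values are all assumptions introduced here.

```python
from statistics import mean, stdev

def robust_mean(values, k=1.0):
    """Return (mean, kept_values) after dropping values outside mean ± k·SD.

    Assumption: the sample standard deviation is used, and pools with fewer
    than three values are left unfiltered.
    """
    values = list(values)
    if len(values) < 3:
        return mean(values), values
    m, s = mean(values), stdev(values)
    # Keep only values within k standard deviations of the pool mean
    kept = [v for v in values if abs(v - m) <= k * s]
    return mean(kept), kept

# Hypothetical installed capacities (kW) reported by five sources for one MS
capacities_kw = [18.0, 20.0, 21.0, 22.0, 35.0]
filtered_mean, kept_values = robust_mean(capacities_kw)
```

In this hypothetical pool the value of 35 kW lies more than one standard deviation above the mean and is discarded, so the mean is recalculated from the four remaining values.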

The sources of the data used as input cannot always be classified as open data, but the results of the statistical elaboration were released as open data. The published data set in [34] explicitly specifies when a value is the result of the statistical elaboration using more than one source (tagged as "Own calculation") or derived by a single source solely; for the latter, the source is specified.

Moreover, the coefficient of variation (CV) was used as a statistical indicator of uncertainty for the generated values. The CV is the ratio of the standard deviation to the mean: the higher the CV, the higher the dispersion around the mean. It is generally indicated as a percentage [71,72] (displayed at the top of the columns in Figures 1–4). Unfortunately, due to missing data, it was not always possible to retrieve two or more data points for each investigated value; in these cases, no statistical elaborations were carried out. In a small number of cases, data were extrapolated from one country, where data were available, to another, where data were missing, provided that the two shared geographical, socio-economic, and historical similarities. Extrapolation of data was applied to the following countries:


This approach was only applied with regard to mean installed capacities, efficiencies, and equivalent full-load hours. We did not put forward any specific assumption on mean installed capacities for DH systems. Zero values are present only when this was supported by one or more references, e.g., showing that no DH systems are available in Malta so far [73]. In a few specific cases, when information was available only at aggregated EU28 level, data were applied to all MS equally. As an example, this was the case of values regarding the mean installed capacity of stoves [61].
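The coefficient of variation used in this section as an uncertainty indicator can be computed as in the short sketch below. The function name, the choice of the sample standard deviation, and the example values are assumptions made for illustration only.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: standard deviation divided by the mean, times 100.

    Returns None when fewer than two values are available, mirroring the
    cases in which no statistical elaboration was possible.
    """
    values = list(values)
    if len(values) < 2:
        return None
    return 100.0 * stdev(values) / mean(values)

# Hypothetical equivalent full-load hours from three sources
hours = [900.0, 1000.0, 1100.0]
cv_percent = coefficient_of_variation(hours)
```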

Based on the methodology proposed by Pezzutto et al. 2017 [7], once the data collection has been concluded, mean capacities installed per technology have been divided by their respective energy efficiency coefficients to obtain the work input (*W*) per equipment type. Then, in order to obtain energy consumption values per equipment type and sector, the number (*Nr.*) of units was multiplied by the equivalent full-load hours (*T*—time) in a year and by its work input (*W*) using the following Equation (1):

$$\text{Energy Consumption}_{SH\,\&\,DHW} = Nr._{\text{units}} \times T_{\text{equivalent full-load hours}} \times W \tag{1}$$

The utilized formula represents a simplified method to assess the energy consumption given by SH and DHW equipment at EU28 level, not differentiating between modulation and on-off equipment, as well as not considering partial load operation, efficiency of sources depending on its level of load, and accumulation of energy in buildings. Mainly due to not taking into consideration partial load operation, the used methodology thus might underestimate the assessed energy consumption.
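A worked instance of Equation (1) may help to make the units explicit. The sketch below first derives the work input *W* as mean installed capacity divided by the energy efficiency coefficient (thermal efficiency, or COP for heat pumps) and then multiplies by the number of units and the equivalent full-load hours. The function name and the input values are hypothetical and are not taken from the data set.

```python
def energy_consumption_twh(n_units, full_load_hours, capacity_kw, efficiency):
    """Equation (1): Nr. units x equivalent full-load hours x work input W.

    W (kW) is the mean installed capacity divided by the efficiency
    coefficient. The result is converted from kWh to TWh.
    """
    work_input_kw = capacity_kw / efficiency
    energy_kwh = n_units * full_load_hours * work_input_kw
    return energy_kwh / 1e9  # 1 TWh = 1e9 kWh

# Hypothetical equipment type: 1 Mil. units, 1000 h/y, 20 kW, efficiency 0.8
example_twh = energy_consumption_twh(1_000_000, 1000.0, 20.0, 0.8)
```

With these placeholder inputs, *W* is 25 kW per unit and the resulting consumption is 25 TWh/y.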

To compare the outcome of this investigation with others in the scientific literature (please see Section 4. Discussion), we converted energy demand into energy consumption values by multiplying by 1.15. An important distinction must be made between energy demand and energy consumption. The first is the net energy necessary to satisfy both SH and DHW needs. The second instead represents the energy input at the level of devices necessary to cover the demand. On the basis of these definitions, the two quantities differ by a conversion factor. Since a boiler's efficiency is <1 (about 0.8–0.9 for those currently installed in Europe), energy consumption values are always higher than demand values [37,66,69].
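The conversion can be illustrated with a few lines of code. The sketch assumes that the factor of 1.15 acts as the reciprocal of an EU-wide mean boiler efficiency; under that assumption the implied efficiency is about 0.87, which lies within the 0.8–0.9 range quoted above. The function name and the demand value are hypothetical.

```python
CONVERSION_FACTOR = 1.15  # demand -> consumption, as stated in the text

def demand_to_consumption(demand_twh, factor=CONVERSION_FACTOR):
    """Convert net energy demand into energy consumption at device level."""
    return demand_twh * factor

# Assumption: the factor is the reciprocal of a mean boiler efficiency
implied_efficiency = 1.0 / CONVERSION_FACTOR  # about 0.87

# Hypothetical demand of 100 TWh/y
consumption_twh = demand_to_consumption(100.0)
```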

As the applied methodology relies on a number of assumptions, the main uncertainties associated with the final results were the following:


At this point, it has to be stressed that the amount of data subject to assumptions accounted for approximately 4% of those needed to generate the results of the present investigation.

• The utilization of an EU-wide mean value to turn SH and DHW demand into energy consumption leads to imprecisions, given the energy efficiency level taken into consideration refers to boilers only. However, the considered equipment is the most diffused in Europe [57,58].

#### **3. Results**

The present paper displays the main results aggregated at the whole EU28 level and not for each MS individually. The entire data set, with detailed data for each MS, including sources, is available as open data in the Hotmaps git repository [33] under [34]. The results at EU28 scale, for each type of equipment, cover all the main data categories used for the estimation of the final energy consumption, i.e., number of installed units, equivalent full-load hours, mean installed capacity, and energy efficiency coefficients. Finally, the distribution of energy consumption per equipment type at EU28 level is presented and discussed.

In the column charts of Figure 1, the error bars indicate standard deviations, and the percentages positioned above them indicate the coefficient of variation (CV).

With regard to installed units, Figure 1 shows the amount of SH and DHW units per equipment type at EU28 level (in millions, Mil.). Non-condensing boilers had the greatest diffusion, with about 80 Mil. installed devices, followed by stoves (60 Mil.). Other technologies, in order of diffusion, were electric radiators (approximately 30 Mil. units), condensing boilers, and aerothermal HPs, the latter two with about 10 Mil. units each. They were followed by geothermal HPs (2 Mil. units) and STS flat-plate collectors (about 1 Mil. units). STS-evacuated tube collectors, CHP-IC, STS-unglazed collectors, and DH were less diffused equipment, with 0.14, 0.05, 0.03, and 0.02 Mil. units, respectively.

Looking at the average CV percentages per equipment type related to SH and DHW at EU28 level (indicated on the top of the columns over the bars in Figure 1) we inferred that the data building these bars were highly unequally distributed. The overall CV percentage was 34%. The highest variation was the one of condensing boilers (~66%), followed by aerothermal HPs (~52%). The lowest variation was the one of stoves (~8%). Other equipment types were characterized by variations around 30%.

**Figure 1.** Number of operative units for SH and DHW per equipment type, EU28 [40,41,55–58,62,63,65] (As not visible in Figure 1: HP geothermal = 1.93, STS unglazed, flat-plate, and evacuated tube collectors = 0.03, 0.97, and 0.14, CHP-IC = 0.05, DH = 0.02 Mil. units.).

Figure 2 displays the annual distribution of equivalent full-load hours per equipment type. CHP-IC units had the highest mean value of full-load hours per year, nearly 1900 hours (h). Boilers (condensing and non) were second with over 1000 h. Equivalent full-load hours of electric radiators and DH amounted to 900 h each, closely followed by aerothermal HPs, with more than 700 h, and STS-flat-plate and STS-unglazed collectors with 400 h each. Geothermal HPs presented about 300 h, while stoves and STS-evacuated tube collectors were positioned last, with approximately 200 h each.

CV percentages included in Figure 2 indicated that the obtained data was rather dispersed. The mean value amounted to roughly 26%. The highest variation was given for STS-evacuated tube collectors (~53%), followed by the variation of STS-unglazed (~40%). The lowest variations related to condensing and non-condensing boilers (~6%). Other equipment types were characterized by variations between about 18% and 34%.

**Figure 2.** Distribution of mean SH and DHW units' equivalent full-load hours per equipment typology, EU28 [57,59–63,69].

The mean installed capacity per equipment type in kW is presented in Figure 3. DH's mean value exceeded Figure 3's axis indication, reaching a nominal number of almost 75,000 kW. CHP-ICs were characterized by means of about 200 kW. Next were STS-unglazed collectors with over 140 kW, followed by other STS types (i.e., flat-plate and evacuated tube collectors) with approximately 40 kW. Boilers (condensing and non) had a mean installed capacity of about 20 kW. Geothermal HPs and electric radiators came next with approximately 10 kW each. Conclusively, aerothermal HPs and stoves were positioned last with around 5 kW each.

In the case of average installed capacity per SH and DHW equipment type (EU28), the mean CV percentage was quite high, with an average of 38%. The highest variation related to stoves and aerothermal HPs (around 70%), followed by electric radiators (~60%). Geothermal HPs and CHP-IC followed with variations of about 40%. The lowest variations related to condensing and non-condensing boilers (~3%). Other equipment types were characterized by variations around 30%.

**Figure 3.** Mean installed capacity per equipment, EU28 [55,56,58–63,69].

Efficiency values at full-load (Figure 4) were evaluated by means of different indicators depending on the technology of the equipment type considered. Looking at technologies characterized by a thermal efficiency coefficient, we found that condensing boilers and electric radiators had mean efficiency values near 100%, with STS-unglazed collectors and non-condensing boilers placed second and third with around 90% and 85%, respectively. Other STS systems, i.e., flat-plate and evacuated tube collectors, presented values around 60%, while CHP-IC units and stoves reached 58% and 50%, respectively. For technologies characterized by a COP, geothermal HPs were significantly more efficient than aerothermal ones, with the COP of the former amounting to 4.5 and that of the latter to 3.5. The indicated values refer to nominal COPs and are not related to real operating conditions for building uses. To fully consider DH systems' efficiency, we included in the mean losses those deriving from DH network heat losses. The mean value of heat losses for EU28 was found to be 13.70% [70].

For SH and DHW equipment, energy efficiency coefficients at full-load (EU28) had CV mean percentages with values around 10%. The highest variation was found for stoves (~36%). The lowest one for electric radiators (~2%). Other equipment types were characterized by variations between 5% and 12%.

**Figure 4.** Energy efficiency coefficients at full-load per equipment type, EU28 [55–60,65,68–70].

Finally, including the data presented in previous figures in Equation (1), the results in terms of energy consumption per equipment type (Figure 5) were obtained. The entire EU28 energy use for SH and DHW technologies amounted to approximately 3880 TWh/y, and its largest share went to non-condensing boilers (over 2600 TWh/y, equaling to 67% of total). DH technologies came second with an energy use of about 500 TWh/y (13% of total). Condensing boilers' energy consumption corresponded to 350 TWh/y (i.e., 9% of total), while electric radiators consumed nearly 250 TWh/y (approximately 6% of total). These were followed by stoves, with about 130 TWh/y (approximately 3% of the above indicated 3880 TWh/y). CHP-IC, STS (flat-plate collectors), aerothermal HPs, STS (unglazed collectors), geothermal HPs, and STS (evacuated tube collectors) were last, accounting together for about 2% of total. A particularly striking feature was that the energy consumption deriving from DH systems exceeded the one of condensing boilers. The indicated difference was significant as the value for DH systems was approximately 25% higher.

**Figure 5.** Energy consumption per type in TWh/y, EU28 [40,41,55–63,65,68–70].

Additionally, Table 2 displays the results in percentage with regard to various fuels utilization for SH and DHW equipment in the 28 EU MS.


**Table 2.** Fuels utilization at EU28 level for various SH and DHW equipment in percentage (NA = not available) [57,64,69,74,75] <sup>1</sup>.

<sup>1</sup> Electric radiators, aerothermal, and geothermal HPs, as well as STS—unglazed, flat-plate, and evacuated tube collectors were not considered in Table 2 due to not being fuel powered. A minor amount of coal and renewables driven condensing boilers, as well as not solely renewables (biomass) powered stoves are operative in Europe too [58,69,76].

As per Table 2, condensing boilers were mostly gas (natural gas) driven (~66%), followed by oil, with about 34%. The same ranking applies for non-condensing boilers. Gas (natural gas) was the first energy vector with nearly 54%. Oil was second (~38%), followed by RES and coal with about 5% and 2%, respectively.

Stoves were estimated to be 100% RES (biomass) driven.

Regarding CHP-IC, gas (natural gas) again was the first (~43%). RES and coal followed with about 23% and 19%, respectively. In the last place we found "other fuels" and oil, with approximately 7% each.

DH systems were found to be mainly powered by gas (natural gas). Coal came next, with about 29%, closely followed by renewables, with nearly 26%. Oil and other fuels were positioned last, with approximately 4% and 2%, respectively.

In conclusion, the outcome concerning the utilization of centralized and individual boilers showed that individual technologies account for slightly over half of those present in the EU28, nearly 54% [59,60,69].

#### **4. Discussion**

SH and DHW equipment's total energy consumption at EU28 level reached nearly 3900 TWh/y (approximately 3880 TWh/y), of which over 85% (about 3315 TWh/y) was attributable to SH. Thus, only about 600 TWh/y (~580 TWh/y) was accounted for by DHW use. SH and DHW total consumption accounted for more than 20% of the EU's entire energy consumption [5]. If SC consumption was included, the latter amounted to only about 3% of the total energy consumption [37].

While notable studies in the field report SH and DHW values very close to ours (the H2020 project HRE4 [22], a publication of Patronen et al. 2012 [25], as well as the IEE STRATEGO project [23]), other studies differ greatly reporting values both falling short and exceeding ours. The three studies reporting values very close to ours differ by 3% to 6%.

The numbers falling short are provided by Boermans et al. 2012 [26], a report of the Seventh Framework Programme (FP7) iNSPiRe project [24], and by Von Manteuffel et al. 2016 [27]. In this case, differences with respect to our work range from 12% to 47%.

The detected values exceeding ours are given by Sanner et al. 2011 [28], Scoccia et al. 2018 [29], and Balaras et al. 2007 [30]. The indications provided by these authors vary compared to our own by 13% to 35%. Table 3 summarizes and analyzes in greater detail similarities and differences to our outcome found in scientific literature:


**Table 3.** Comparison of SH and DHW market quantifications at the EU level found in scientific literature and the present investigation's result.

We wish to emphasize that even though only ~11% of the input data for our investigation derived from the HRE4 project, our result showed only a minor deviation, 3%.

Furthermore, concerning the IEE STRATEGO project result, we must recall that the output would be even lower if taking into consideration the decrease of energy use for SH and DHW in EU buildings in the past decade.

To complete our study, we calculated energy consumption for DHW per MS, using population and household data—by means of energy per person [77], number of inhabitants [78], and dwellings [79]. The values, expressed in TWh/y, were found to be approximately 510 TWh/y and 540 TWh/y, respectively. Thus, the differences with respect to the results shown above were 12% and 7%, respectively. Please find respective data set under [80] (details on all sources used are available in the respective csv file [81]).
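The percentage differences quoted above can be checked with a short calculation. The sketch assumes that both alternative DHW estimates are compared against the value of roughly 580 TWh/y reported earlier in this section; the variable names are introduced here for illustration.

```python
def relative_difference_pct(reference, other):
    """Absolute difference between two values as a percentage of the reference."""
    return 100.0 * abs(reference - other) / reference

dhw_reference_twh = 580.0   # DHW consumption from the main analysis (approx.)
estimate_a_twh = 510.0      # first alternative estimate quoted in the text
estimate_b_twh = 540.0      # second alternative estimate quoted in the text

diff_a = relative_difference_pct(dhw_reference_twh, estimate_a_twh)
diff_b = relative_difference_pct(dhw_reference_twh, estimate_b_twh)
```

Rounded to whole percentages, the two differences agree with the 12% and 7% stated above.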

As already mentioned in the Section 3. Results, a particularly striking result is given by the fact that energy consumption provided by DH systems exceeded the one of condensing boilers by 25%. Furthermore, comparing non-condensing boilers with condensing ones, we found that the replacement of conventional boilers with better performing SH and DHW technologies seemed to progress very slowly. As a plausibility check, we should first consider the share of biomass boilers (~2%), oil boilers (~19%), and natural gas boilers (~28%) [23]. In fact, while Regulation EU 813/2013 does not impose the use of condensing boilers over biomass boilers, it only came into place at the end of 2015, with significant exemptions. As a further confirmation of our results, if we suppose that all gas and oil boilers installed in the period 2015–2016 are substituted by condensing boilers, these would reach a total number of around 10.7 Mil. (assuming non-condensing boilers installed prior to 2015 and assuming a lifetime of 15 years) [82]. This is very much in line with the data presented in Figure 1. However, in the event of an enforcement of the Eco-design Regulation EU 813/2013, the share of condensing boilers should grow significantly in the coming years. This evidence is further reinforced by the indication that the thermal efficiency of currently installed boilers in Europe is approximately 85% [66], while condensing boilers are characterized by declared efficiency levels of approximately 99% [37]. On the other hand, a number of scientific resources confirm the steady growth of DH applications within the EU28 [83–85].

Concerning the fuels utilization at EU28 level for SH and DHW equipment (Table 2), it has to be stressed that gas (natural gas) dominated the ranking, while renewables were positioned lower, except in the case of stoves. However, it is worth noting that the indicated renewables value for stoves had been estimated due to a lack of sources. RES were mostly found in lower positions (once again, this was not valid for stoves and CHP-IC). Especially with regard to DH, the utilization of RES was characterized by a high potential [86–88].

#### **5. Conclusions**

This work presented a collection, statistical elaboration, calculation, and comparison of data that offered insights on the SH and DHW market for the EU28. The main aspects were the following:


The collected data per MS and for the entire EU are also available as an open source data set, which allows users to freely access and retrieve information on SH and DHW consumption.

The data collection at the basis of the investigation and its insights presented certain limitations, which resulted from the assumptions indicated in Section 2. Materials and Methods.

All types of collected data (amount of operative units, installed capacities, energy efficiency coefficients as well as equivalent full-load hours per SH and DHW equipment type and country) were subject to not negligible variations, which resulted in pertinent CV values. This was especially true for collected data concerning the number of operative units for condensing boilers and aerothermal HPs (CV = 66% and 52%, respectively), the amount of equivalent full-load hours of STS—evacuated tube collectors (CV = 53%), and mean installed capacities of electric radiators (CV = 60%). Consequently, the performed analysis represented an assessment, and respective outcomes are to be interpreted with care.

The data collection and results of this work can form the basis for the collection and analysis of further data regarding Europe's building stock. Finally, our research indicates room for improvement in terms of data quality and completeness, as well as for extending the scope to areas such as industry and transportation.

**Author Contributions:** Conceptualization, S.P.; Data curation, S.C. and S.Z.; Supervision, L.K.; Validation, S.C., S.Z., L.K., A.N. and P.Z.

**Funding:** We are thankful to the Horizon 2020 Hotmaps project (Grant Agreement number 723677), for providing partial funding for the performance of the work.

**Acknowledgments:** We are thankful to Amy Segata (Eurac Research) for designing the figures of the study. In addition, our gratitude goes to Sonja Gantioler (Eurac Research) for the language editing of the text.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Open Source Data for Gross Floor Area and Heat Demand Density on the Hectare Level for EU 28**

**Andreas Müller <sup>1,2,\*</sup>, Marcus Hummel <sup>1</sup>, Lukas Kranzl <sup>2</sup>, Mostafa Fallahnejad <sup>2</sup> and Richard Büchele <sup>2</sup>**


Received: 4 November 2019; Accepted: 6 December 2019; Published: 16 December 2019

**Abstract:** The planning of heating and cooling supply and demand is key to reaching climate and sustainability targets. At the same time, data for planning are scarce for many places in Europe. In this study, we developed an open source dataset of gross floor area and energy demand for space heating and hot water in residential and tertiary buildings at the hectare level for EU28 + Norway, Iceland, and Switzerland. This methodology is based on a top-down approach, starting from a consistent dataset at the country level (NUTS 0), breaking this down to the NUTS 3 level and further to the hectare level by means of a series of regional indicators. We compare this dataset with data from other sources for 20 places in Europe. This process shows that the data for some places fit well, while for others, large differences up to 45% occur. The discussion of these results shows that the other data sources used for this comparison are also subject to considerable uncertainties. A comparison of the developed data with maps based on municipal building stock data for three cities shows that the developed dataset systematically overestimates the gross floor area and heat demand in low density areas and vice versa. We conclude that these data are useful for strategic purposes on aggregated level of larger regions and municipalities. It is especially valuable in locations where no detailed data is available. For detailed planning of heating and cooling infrastructure, local data should be used instead. We believe our work contributes towards a transparent, open source dataset for heating and cooling planning that can be regularly updated and is easily accessible and usable for further research and planning activities.

**Keywords:** open data; heating; building stock; heat map; spatial analysis; heat density map

#### **1. Introduction**

About 50% of the final energy consumption in Europe is spent on heating and cooling, including space heating and cooling, domestic hot water, and process heat and cold [1]. The largest share of this demand is covered by fossil energy carriers [2]. Thus, the heating and cooling sector needs to be radically transformed in order to be in line with decarbonization targets. In contrast to other energy carriers (e.g., electricity and fuels), which are carried over hundreds (electricity) to thousands (oil, gas) of kilometers, the transmission of thermal energy (heat or cold) is still limited to local or regional systems [3]. Hence, the heating and cooling sector, per se, has a strong spatial dimension, which needs to be carefully considered in this transformation process. The European Energy Efficiency Directive [4,5] takes these considerations into account by requesting Member States of the European Union to provide a so-called "comprehensive assessment of the potential for the application of high-efficiency cogeneration and efficient district heating and cooling". According to Annex VIII of the directive, this should include, among other things, "a map of the national territory, identifying (...) heating and cooling demand points". A preliminary version of these reports had to be delivered by the end of 2015, and an update will be due by the end of 2020. This comprehensive assessment is considered to support the strategic heating and cooling planning and mapping process on different spatial levels [6], which has a long tradition in countries like Denmark [7]. The existence of such datasets provides valuable support for assessments in this area of research. However, as pointed out by Noussan and Nastasi [8], Tronchin et al. [9], and others, the quality of the input data used for the analysis impacts the accuracy of the results. Therefore, an almost equally important step forward is ensuring the availability of such datasets to a broader group of users, such as policy makers and researchers, who foster increased data quality by identifying the weaknesses of datasets and subsequently improving their methodologies and underlying data foundations.

#### *1.1. Spatial Levels of Heating and Cooling Planning*

In general, three different spatial levels of analyses and regional detail may be distinguished for heating and cooling planning. These levels are very different based on the aims of the analyses, the required data, and the potential (research) questions that can be answered. At the first, or top level, analyses are performed at the strategic level of heat planning, either at a national, regional, or municipal scale, with the objective to identify possible areas of interest for district heating, excess heat integration, and the overall potentials for different decarbonisation options in the sector.

On a European level, the heat density map of Heat Roadmap Europe [10] (see also, e.g., Connolly et al. [11]; Persson et al. [12]; Möller et al. [13]) is one of the most relevant examples of this kind of analysis. Their methodology also uses a top-down approach but differs from our work insofar as they chose an econometric approach for the break-down of the demand data [14]. This project provides maps for downloading and use in the project platform. However, the datasets are not publicly available at the time of the submission of this paper. Beyond the Heat Roadmap Europe project, a robust body of scientific literature (e.g., [15–25]) on similar yet more locally focused projects is available. The heat density maps developed in the frame of the comprehensive assessments at the national level also belong to this category. While these maps are valuable because they visualize heat demands at a regional level, the synthesis report on the evaluation of the Comprehensive Assessments from the Joint Research Centre [26] finds that most of the maps developed and provided for the first round of Comprehensive Assessments are only choropleth maps at the level of predefined areas based on administrative borders (e.g., a county, region, or municipality) or other statistical sectors, which are shaded/coloured to indicate the heat demands within this area. This kind of map does not allow further planning and only indicates which areas to look at more closely. Only a few of the Member States provided interactive isopleth maps that allow one to zoom in or that show a disaggregation of the heat demands at a high resolution raster level with public access (e.g., Austria [27], Scotland [28], and the Netherlands [29]). Even fewer maps provide data or allow downloading of the heat demand density data for further assessments.

At the second level, analyses are done at the city or (larger) district level, aiming at developing regional development plans or evaluating the (pre-) feasibility of district heating. For example, a study performed by Brocklebank et al. [30] considers the initial stage of designing a district heating network, with energy mapping in the local area, using the case study of Darley Dale, England. The results of the mapping technique are compared to the heat mapping work carried out by the UK Government and are shown to be accurate enough for further analyses. Dorotić [31] presents an economic analysis to determine the actual demands and the potential energy supply by using GIS based heat demand mapping methods and applies the results to an analysis of district heating network expansion in the city of Velika Gorica. Wyrwa and Yi-kuang [32] developed and applied a methodology to generate the data needed for the development of district heating systems. Their article presents a combined bottom-up–top-down approach applied for the city of Krakow as a case study and calculates the useful heat demand for space heating and hot water preparation. Further examples for this kind of analysis include the progRESsHEAT project [33], Čižman et al. [34], and Dochev et al. [35], which analyses the spatial heat demands for the City of Hamburg.

Detailed technical and local analyses at the district, building block, or even individual building levels constitute the third spatial level. At this level, different technological alternatives, technical designs, and planning options are assessed and compared in detail. For example, Törnros et al. [36] simulated the DH demand for a medium-sized DH network in a city in southern Germany and used a spatially explicit approach for the analysis by first geo-locating the buildings and their attributes obtained from various sources. Based on these results, the authors calculated the annual primary energy demands for heating and domestic hot water for all individual buildings and then aggregated these demands at the segment level of an existing DH network and simulated the water flow through the system to cover the demand.

#### *1.2. Aims and Objectives of This Work*

The objective of this paper, which builds on the work performed within the Hotmaps project, is to provide an open-data top-down derived dataset of the heated gross floor area and final energy demands for space heating and domestic hot water preparation for all EU28 countries (plus Iceland, Norway, and Switzerland) on a hectare (100 × 100 m) scale. The data are available as an open dataset and thus may be used by anybody to start analysing areas of their interest in the EU. By providing these data, we believe that we can substantially contribute to fulfilling the requirements defined by Annex VIII and support public authorities, energy agencies, and planners in strategic heating and cooling planning at the local, regional, and national levels.

Through the work and dataset presented in this paper, we contribute to the first two levels. The maps cover the (heated) gross floor area, as well as energy needs and final energy demands, for space heating and domestic hot water preparation in residential and non-residential buildings. The methodology can be classified as a top-down approach: Energy consumption data at the country level (NUTS (Nomenclature of Territorial Units for Statistics) 0) is broken down to the NUTS 3 level and subsequently to the hectare level based on different spatial indicators (for details, see the description of the methodology in Section 2). This generic top-down approach has very specific strengths and weaknesses. Its key strengths are the comprehensive availability of data for every region within the EU-28 (+ IS, NO, CH), the transparency of the applied method, and the consistency of the aggregated results with national statistics. This approach's weakness lies in the deviations between the developed data and the data generated based on a bottom-up approach using detailed data at the local level. This weakness becomes apparent at a very high spatial resolution: The smaller the spatial selection of the heat density map, the higher the deviation between the actual demand and building stock data found on the ground and the results of the statistical approach chosen for breaking down the data to the hectare level. Thus, it is important to be aware of this limitation and to apply the data for the purpose proposed earlier, i.e., for the strategic level of heat planning and regional energy planning but not for detailed technical design and planning (e.g., of district heating infrastructure).

After this introduction, we present in Section 2 the methods and approaches for developing EU-wide gross floor area and heat density maps. Section 3 presents the results, including a validation and comparison of the data for selected municipalities across Europe. We discuss the results in Section 4 and derive conclusions and provide an outlook for further work in this field in Section 5.

#### **2. Materials and Methods**

#### *2.1. General Approach*

The top-down heat density map developed in the Hotmaps project builds on a three-stage approach (see Figure 1). In the first stage (top-level), we derived the final energy demand (FED) and energy needs (EN) for space heating (SH) and domestic hot water preparation (DHW) based on extensive literature research for the individual countries considered in the study. These data sources include the following: energy consumption data from energy statistics (e.g., data derived from Eurostat and national energy balances [37]), statistical data on the number of buildings, households, and shares of those per construction period [38], statistical and project related data on the typical properties of different building types and construction periods per country, and average climate data for different regions. From these data we built a building dataset for each country [39], which consistently combines these different types of input data, starting with the U-values for different building components and associated typical heat transmitting areas on the one hand and, on the other, the energy consumption reported in the national energy balances for the energy services focused upon by this analysis. This step is done by applying the Invert/EE-Lab model ([40], see also [41]). The Invert/EE-Lab model is a dynamic building stock model that calculates the energy needs and final energy consumption for SH and DHW, as well as the space cooling for regions or countries based on an underlying building physics model described by national and international norms [42–46] alongside the energetic properties of archetype buildings and their building components, climate, and usage data [40].

**Figure 1.** Schematic process of how we derived data maps at the hectare level for the EU-28 countries.

In the second stage, we distributed the EN for SH and DHW from the country level (NUTS 0) to the level of the third-tier territorial units for statistics (NUTS 3) based on an approach that we developed within the study "Territories and low-carbon economy" (ESPON Locate) [47]. At this stage, we combine several indicators to estimate the share of EN for SH and DHW for the different NUTS 3 regions within each country.

For residential buildings, we consider the following indicators. Additional information on this approach is presented in the final report of the Espon Locate project [47].

	- Persons (population);
	- Number of dwellings;
	- Useful floor space per dwelling;

The following indicators are used for non-residential buildings (only services, excluding large industrial production facilities, etc.):


The third level constitutes the distribution of the NUTS 3 results at the hectare level and thus derives a heat density map. We developed an approach that correlates information from the locally built environment with its EN for SH and DHW preparation within the Hotmaps project and applied it to the EU28 countries, plus Norway, Iceland, and Switzerland. This was done by using a spatial distribution function based on indicators similar to those used to transfer country data to the NUTS 3 level. This approach builds on the central idea that the EN for space heating and domestic hot water preparation correlates with the population number within a plot area, as well as its economic activity, climatic conditions, and some building properties, such as the average construction period and volume-to-surface ratios of the buildings. The final energy consumption (delivered energy plus thermal energy provided by on-site renewable energy sources) is then calculated from the energy needs by applying the country-specific national conversion factor between the energy needs and final energy consumption.
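The core of such a top-down break-down can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `distribute_top_down` and the example weights are hypothetical, and the actual distribution function combines several indicators that are not reproduced here.

```python
def distribute_top_down(region_total, cell_weights):
    """Split a regional total across cells in proportion to non-negative weights.

    The weights stand in for a combined spatial indicator (hypothetical values);
    the sum over all cells reproduces the regional total by construction.
    """
    weight_sum = sum(cell_weights)
    if weight_sum <= 0:
        raise ValueError("at least one cell needs a positive weight")
    return [region_total * w / weight_sum for w in cell_weights]


# Hypothetical NUTS 3 region with 1000 MWh of energy needs and four hectare cells.
cells = distribute_top_down(1000.0, [4.0, 3.0, 2.0, 1.0])
print(cells)
```

Because every cell receives a share of the same total, the aggregated result stays consistent with the regional statistics, which is the consistency property highlighted in Section 1.2.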

#### *2.2. Population Distribution at the Hectare Level*

Our main source for local population data is a work published by Gallego [52], where a dataset for the European population in 2006 at the level of 1 km<sup>2</sup> is given. In addition, we considered a dataset for the population in 2014 at the level of 250 × 250 m, developed by JRC [53], which is also publicly available. Although the latter source is newer, we decided to use the older population data as the primary input data source for the population distribution because a comparison of the population based on these data sources and data from the human settlement project (i.e., the share of the plot area sealed by buildings at a 10 × 10 m level) [54] revealed that a significant share of the population in rural regions is distributed in areas with no buildings (see Figure 2).

**Figure 2.** Comparison of the plot area covered by buildings on a 10 × 10 m level (blue) and the population in 2014 per 250 × 250 m for a small town in Carinthia, Austria (46°42′; 13°39′) with about 3000 inhabitants. Sources: [53,54].

An advantage of the new JRC population dataset [53] over the older dataset of Gallego [52] is that the new dataset partly covers areas that do not feature data in [52]. To combine these two datasets, we therefore calculated the population distribution on a 1 km<sup>2</sup> raster for Europe based on the following rules. If the primary population layer [52] does not contain data for a 1 km<sup>2</sup> grid cell, we use the data from a 1 km<sup>2</sup> grid cell layer derived from [53] as a fall-back option. An analysis of the resulting quality of the combined layer indicated that:


Consequently, we obtain an overestimation of the population in rural areas when we distribute the population of the NUTS 3/ local administrative units (LAU) regions at the hectare level. To reduce this undesired effect, we give the population of the JRC layer a weight factor of 30% (by multiplying the population data with a factor of 0.3). We have chosen this value by assessing the results that we derived for different weighting factors (in the range of 10–100%) for different regions (see Figure 3). This value, we believe, offers a good compromise that balances the two effects that accompany the data: first, the described effect of overestimating the population in rural areas and secondly, the underestimation of population in areas not covered by Gallego. We, however, have not performed any systematic analysis on the optimal level for the applied weighting factor.
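The fall-back rule with the 30% weight can be illustrated as follows. This is a minimal sketch, assuming that missing cells in the primary layer are marked with `None`; the function name and the example values are hypothetical.

```python
JRC_WEIGHT = 0.3  # weight factor for the fall-back layer, as stated in the text


def combine_population(primary, fallback, fallback_weight=JRC_WEIGHT):
    """Combine two 1 km2 population layers cell by cell.

    Assumption: None marks a cell without data in the primary layer.
    Cells with primary data keep their value; all other cells receive
    the down-weighted value of the fall-back layer.
    """
    combined = []
    for p, f in zip(primary, fallback):
        if p is not None:
            combined.append(float(p))
        else:
            combined.append(fallback_weight * f)
    return combined


# Hypothetical values for four grid cells.
gallego = [120.0, None, 0.0, None]
jrc = [100.0, 50.0, 10.0, 0.0]
combined = combine_population(gallego, jrc)
print(combined)
```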

**Figure 3.** Schematic depiction of the method for combining the two data sources for population: Gallego [52] is the primary data source and places the population at the correct position but does not cover all areas (marked by red border lines), and JRC is secondary data source [53], which is shifted but covers additional areas.

Within the 1 km<sup>2</sup> grid cells, we distributed the population by considering the land usage type at the hectare level using the Corine land cover data [55] and the European Settlement Map layer [54], which depicts the share of the surface that is sealed by buildings on a 10 × 10 m grid. After distributing the population at the hectare level, we sum up the population for the local administrative units (LAU). This is done for the ~115 thousand regions (Eurostat [56], using the LAU 2, except for Greece and Denmark, where we used LAU 1, since the Census 2011 was performed at the LAU 1 level only). We then compared the population values with the population data in the local administrative units stated in the statistical data sources of Eurostat [48,57] and JRC [58]. To reduce the deviations, we then adjusted the population distribution to determine a compromise between the populations at the square kilometre level [52,53], the population per LAU region, and the upper limit for the population density per hectare. For this upper limit, we analysed the distribution of the following indicator: the population per hectare divided by the population in the corresponding 1 × 1 km grid cell. For this analysis, we calculated the ratio between the population per hectare and the average population within the same square kilometre grid for each hectare cell. We then clustered the outcome by the population densities at the 1 km<sup>2</sup> grid level and removed all data points that did not exceed the average results for the density by a factor of 2. For the remaining data points, we calculated the 95% and 99% percentiles for different population densities and defined an upper limit for the population at the hectare level (Figure 4), which is in the range of the 95–99% percentiles. The corresponding figure is read as follows: If 10 people live within a certain 1 × 1 km grid cell, then the population of each hectare cell within this 1 × 1 km grid cell must not exceed ~5 inhabitants (50% of the total population of that square kilometre). If a 1 × 1 km grid cell is populated by 10,000 people, the upper limit for inhabitants per hectare within that square kilometre is ~700 people (7%), and, for a population of 100,000 people per km<sup>2</sup>, the upper limit for the population per hectare in that area is 2000 people (2%).

**Figure 4.** Implemented upper limit for population density at the hectare level.
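The three reference points quoted in the text (50% at 10, 7% at 10,000, and 2% at 100,000 inhabitants per km<sup>2</sup>) can be turned into a simple capping rule. The sketch below interpolates between these points in log-log space, which is purely our assumption for illustration; the actual curve in Figure 4 may have a different shape, and the function names are hypothetical.

```python
import math

# Reference points taken from the text: (population per km2, maximum share per hectare).
ANCHORS = [(10.0, 0.50), (10_000.0, 0.07), (100_000.0, 0.02)]


def max_share_per_hectare(pop_km2):
    """Upper limit of the hectare population as a share of the km2 population.

    Assumption: log-log interpolation between the anchors, constant outside them.
    """
    if pop_km2 <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if pop_km2 >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= pop_km2 <= x1:
            t = (math.log(pop_km2) - math.log(x0)) / (math.log(x1) - math.log(x0))
            return math.exp(math.log(y0) + t * (math.log(y1) - math.log(y0)))
    raise ValueError("population outside the interpolation range")


def cap_hectare_population(pop_ha, pop_km2):
    """Limit the population of one hectare cell to the upper bound."""
    return min(pop_ha, max_share_per_hectare(pop_km2) * pop_km2)


print(cap_hectare_population(8, 10))         # capped at 50% of 10 inhabitants
print(cap_hectare_population(1000, 10_000))  # capped at about 7% of 10,000 inhabitants
```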

#### *2.3. Gross Floor Area of Buildings at the Hectare Level*

Our methodology estimates the heated gross floor area of buildings at the hectare level based on two independent approaches. The first approach, which we call the "population-based" approach, builds on the population grid at the hectare level and estimates the gross floor area by multiplying the population at the hectare level with the average gross floor area per person in the corresponding NUTS 3 region. We derived this indicator from the data provided by the European Census Hub [48] for most European NUTS 3 regions, namely the average floor area per dwelling and the average persons per household. This approach delivers reasonable results for the residential building stock. However, its predictive quality is poor in areas with a high share of non-residential buildings. To overcome this problem, we developed a second independent layer for the gross floor area of buildings.

The second approach derives the heated gross floor area by considering the building footprints and the estimated number of floors per building. We extracted the building footprints from the European Settlement Map [54] and data from the building layer of the OpenStreetMap (OSM) database [59]. The European Settlement layer contains the share of plot area that is sealed by buildings on a 10 × 10 m grid, while the OSM database contains the buildings as 2-dimensional vector data.

To calculate the gross floor area from the building's footprint, we required information on the building height or number of floors. In order to obtain estimates for these parameters, we developed a building height model that estimates the average building height from the footprint of the buildings using the OSM-building data. The developed approach applies a generic building height model (Figure 5), which is refined by the average regional (municipality-specific) relationship between building height and building footprint derived from buildings whose building height information is stored in the OSM database (~5 million buildings spread over Europe).

As can be seen from Figure 5, the generic model for the number of floors matches well for buildings with a 60 to 1000 m<sup>2</sup> building footprint. It is assumed that buildings with footprints below 30 m<sup>2</sup> are mostly unheated (no floors) and are partially unheated if the footprint is between 30–45 m<sup>2</sup> (number of floors of 0.5). For buildings larger than 1000 m<sup>2</sup>, the OSM-data show a further increase in the building height, while we, in contrast, keep the height constant until reaching buildings of 2500 m<sup>2</sup> and gradually reduce the number of floors to three for buildings with a footprint of 10,000 m<sup>2</sup> or more. The underlying reason for this method is as follows. The OSM-data mostly contain the building height, while the number of floors is calculated by us based on a constant floor height of 3 m (including structural elements). Based on personal experience, we believe that the ceiling height of very large buildings (industrial production halls, shopping centres, etc.) will be higher compared to smaller buildings and that the number of floors of such facilities rarely exceeds four. In the case of the OSM-data point for the largest building cluster (i.e., buildings with a footprint of more than 2500 m<sup>2</sup> (~20,000 buildings with a median footprint of about 6500 m<sup>2</sup>)), the average floor height would correspond to about 5 m if we consider an average number of floors as 3.5. To us, this number appears plausible.

**Figure 5.** Generic building height model applied to the buildings covered in the OpenStreetMap database.

In order to estimate the average relationship between the building footprint and the building height, we calculated the number of floors for different building footprint sizes per municipality (~18,000 municipalities out of ~115,000 in the covered region, see Figure 6).

**Figure 6.** Number of buildings in the OpenStreetMap (OSM) database with information on the building height per municipality [59].

While each individual building from the OSM database with some information on their building height is given a weight of 1, the generic model for the number of floors is given a weight of 20. The resulting relationship between the footprint and number of floors for 150 randomly chosen municipalities (for which building height data are available) is shown in Figure 7.

**Figure 7.** Calculated relationship (based on OSM data) between the average number of floors and the building footprint for 150 randomly chosen municipalities across Europe, as well as their generic functions (red line).
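The weighting of the generic model against the OSM observations can be read as a weighted mean. The following sketch assumes that the blend is computed per municipality and footprint class; the function name and the example values are hypothetical.

```python
GENERIC_WEIGHT = 20.0  # weight of the generic model, as stated in the text


def blended_floors(generic_floors, observed_floors):
    """Weighted mean of the generic model and the OSM observations.

    Each observed building counts with a weight of 1, the generic model
    with a weight of 20 (assumption: one footprint class in one municipality).
    """
    n = len(observed_floors)
    return (GENERIC_WEIGHT * generic_floors + sum(observed_floors)) / (GENERIC_WEIGHT + n)


# With 20 observed buildings the generic model and the observations weigh equally.
print(blended_floors(2.0, [3.0] * 20))
# Without observations the generic model is returned unchanged.
print(blended_floors(2.0, []))
```

With this reading, municipalities with only a handful of buildings carrying height information stay close to the generic curve, while well-mapped municipalities are dominated by their own data.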

In the next step, we compare the gross floor areas derived from the OSM data with those from the population-based approach. If the outcome of the OSM-based approach is lower, then we scale up the OSM data accordingly so that they match the outcome of the population-based approach. This is done up to a factor of four. If the gross floor area per inhabitant in a hectare cell is less than 15 m<sup>2</sup> according to the OSM-based data, then the OSM-quality indicator, which estimates the completeness of the OSM data (see Figure 8), is reduced. This process leads to a lower weight of the OSM-based heated floor area data in the final calculation of the heated gross floor area (see Table 1).
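The scaling rule described above can be sketched as follows; the function name is hypothetical, and the treatment of cells without any OSM floor area is our assumption.

```python
MAX_SCALE = 4.0  # upper limit for the scale-up factor, as stated in the text


def scale_osm_gfa(gfa_osm, gfa_pop):
    """Scale the OSM-based gross floor area up towards the population-based value.

    The OSM value is only ever increased, and by no more than a factor of four.
    Assumption: cells without OSM floor area are left unchanged.
    """
    if gfa_osm <= 0 or gfa_osm >= gfa_pop:
        return gfa_osm
    factor = min(gfa_pop / gfa_osm, MAX_SCALE)
    return gfa_osm * factor


print(scale_osm_gfa(500.0, 1000.0))   # scaled up to match the population-based value
print(scale_osm_gfa(100.0, 1000.0))   # limited by the factor of four
print(scale_osm_gfa(1200.0, 1000.0))  # OSM value is higher and stays unchanged
```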

**Figure 8.** Completeness of the OpenStreetMap building stock data: Comparison of the OpenStreetMap data (yellow) with the European Settlement Map (blue) for the region of Athens (left map) and Vienna (right map). Sources: [54,59] (OSM: Planet dump May 2018).


**Table 1.** Weighting of the population and value added (VA) based approach versus the OSM based approach for areas in different Corine land cover classes used to calculate the heated gross floor area at the hectare level.

\* The actual weighting factor is calculated, e.g., as w<sub>pop</sub>/(w<sub>pop</sub> + w<sub>OSM</sub>).

Finally, the heated gross floor areas of the residential and non-residential buildings are calculated by combining the results of both methods. For residential buildings, we used a weight factor between 0.015 and 1 for the population-based approach (depending on the Corine land cover class; see Table 1), while the OSM-based approach is given a weight of 5%, if there is no indication that the OSM does not fully cover the given grid cell.
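The normalisation of the two weights can be illustrated with a short sketch. The weights in the example are hypothetical placeholders and are not taken from Table 1.

```python
def combine_gfa(gfa_pop, gfa_osm, w_pop, w_osm):
    """Weighted combination of the population-based and OSM-based floor areas.

    The effective weight of the population-based value is w_pop / (w_pop + w_osm);
    the OSM-based value receives the remainder.
    """
    total = w_pop + w_osm
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    share_pop = w_pop / total
    return share_pop * gfa_pop + (1.0 - share_pop) * gfa_osm


# Hypothetical weights: the OSM-based value counts three times as much.
print(combine_gfa(800.0, 1200.0, w_pop=1.0, w_osm=3.0))
```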

To estimate the heated floor area of non-residential buildings, we use (a) the value added per LAU region [45] instead of the residential floor area per inhabitant indicator and (b) the OSM-based approach. We calculate the non-residential heated gross floor area by subtracting the residential gross floor area from the calculated gross floor area using the OSM-building information. Compared to the residential gross floor area, we used a comparatively higher weight for the OSM approach for non-residential buildings since the quality of the OSM data (degree of completeness) is estimated to be high. Table 1 depicts the detailed set of weighting factors, based on our own estimations, that we used in our approach for different Corine land cover classes. These weighting factors are taken for areas where the OSM data quality is estimated to be high. If the data quality of the OSM data is considered to be low (see Figure 8), then the weight of the OSM approach is reduced accordingly.

#### *2.4. Heating Degree Days at the Hectare Level*

The starting point for the calculation of the local heating degree days is the observed average daily temperatures on the 25 × 25 km raster [50] for the period from 2002 to 2012. With a resolution of more than 600 km<sup>2</sup> per raster cell, this layer is too coarse to derive meaningful local heating degree days, as shown in Figure 9 (left figure) for the Alpine region, northern Italy, and Croatia. To refine these data, we included in the calculation information on the local elevation using the digital elevation model over Europe (EU-DEM) layer at the 30 × 30 m grid level [60] and applied a temperature lapse rate of 6.5 °C per 1000 m elevation gain according to the specifications of the International Standard Atmosphere model [61] (see Figure 9, right).

**Figure 9.** Heating degree days for the 25 × 25 km grid (left side) and the refined grid at the hectare level. Sources: [50,60] and our own calculations.
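The elevation refinement can be sketched as follows. Two points are our assumptions and are not stated in the text: the coarse raster temperature is taken to refer to a known reference elevation of the 25 × 25 km cell, and the heating degree days follow a Eurostat-style definition with a base temperature of 18 °C and a heating threshold of 15 °C. All function names are hypothetical.

```python
LAPSE_RATE_C_PER_M = 6.5 / 1000.0  # 6.5 degrees C per 1000 m, as stated in the text


def local_temperature(coarse_temp_c, coarse_elev_m, local_elev_m):
    """Shift a coarse-grid temperature to the elevation of a local cell.

    Assumption: coarse_elev_m is the reference elevation of the coarse cell.
    """
    return coarse_temp_c - LAPSE_RATE_C_PER_M * (local_elev_m - coarse_elev_m)


def heating_degree_days(daily_temps_c, base_c=18.0, threshold_c=15.0):
    """Sum of (base - T) over all days with T at or below the threshold.

    Assumption: Eurostat-style definition; the paper does not state the one it uses.
    """
    return sum(base_c - t for t in daily_temps_c if t <= threshold_c)


# A cell 1000 m above the reference elevation is 6.5 degrees C colder.
t_local = local_temperature(5.0, 500.0, 1500.0)
print(t_local)
print(heating_degree_days([t_local, 16.0, 10.0]))
```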

The energy needs for space heating (SH) on a local level are corrected by applying the ratio between the calculated site-specific HDD and the HDD at the NUTS 3 level using an elasticity of 0.5. We purposefully corrected the energy needs for the local climate conservatively, as we tried carefully to avoid "overshooting" our corrections and instead preferred results that are "too" uniform for different areas. With such low elasticity, we also covered the (plausible) assumption that buildings in colder areas (e.g., those at higher elevations), in general, might already have higher energy performance than similar buildings in warmer (lower) areas, even if they have to fulfil the same energy performance standards.
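One common way to read an elasticity of 0.5 is as an exponent on the HDD ratio. The sketch below uses this power-law form, which is our assumption; the text does not state the functional form, and the function name is hypothetical.

```python
def climate_corrected_en(en_nuts3_avg, hdd_local, hdd_nuts3, elasticity=0.5):
    """Correct specific energy needs for the local climate.

    Assumption: power-law form EN_local = EN_avg * (HDD_local / HDD_NUTS3) ** elasticity.
    """
    return en_nuts3_avg * (hdd_local / hdd_nuts3) ** elasticity


# A site with 44% more heating degree days receives only 20% higher energy needs.
print(climate_corrected_en(100.0, hdd_local=3600.0, hdd_nuts3=2500.0))
```

An elasticity below one dampens the correction, which matches the stated intention of preferring results that are rather too uniform than overcorrected.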

#### *2.5. Surface-To-Volume Ratio of Buildings and Historical Construction Periods*

In order to calculate the spatial distribution of EN and the final energy demand (FED) for SH from the heated floor area, we furthermore considered the surface-to-volume ratio of buildings and the share of buildings per construction period. To obtain data on the surface-to-volume ratio, we built on the data from the OSM building layer: the building footprint (area and perimeter) and the estimated building height. For the share of buildings per construction period, we extracted information on the soil sealing data provided by the Global Human Settlement project [53] for 1975, 1990, 2000, and 2014 on a 38 × 38 m grid. Besides comparing the soil sealing ratio per grid cell for the time slots, we tried to correct the data for the soil sealed by other elements, such as roads, by considering the current share of soil sealed by buildings against the total share of sealed soil per grid cell (at the hectare level). We also generically considered building demolition. For the period after 2000, we considered an annual demolition rate of 0.2% for buildings constructed before 1975 and 0.1% for buildings constructed between 1975 and 1990. Thus, we assume that the stock of buildings constructed before 1975 is only 97% of that shown in the soil sealing map due to building demolitions between 2000 and 2014. Furthermore, we assume that at least 0.75% of the soil sealing share in each period (1975/1990/2000/2014) must stem from buildings constructed in the latest construction period. This means that if the soil sealing is 40% for a given grid cell in 1990, then the share of soil sealed by buildings constructed from 1975 to 1990 must be at least 40% × 0.75% = 0.3%. If the soil sealing map of 1975 already depicts soil sealing of 40%, we reduce that value by 0.3% and add that value to the construction period from 1975 to 1990. As an example, the outcome of this process is visualised in Figure 10 for the region of Vienna.
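The two numerical rules in this paragraph can be checked with a short worked example. The function names are hypothetical, and compounding the annual demolition rate over 14 years is our assumption about how the 97% figure is obtained.

```python
MIN_NEW_SHARE = 0.0075  # at least 0.75% of the sealing stems from the latest period


def surviving_share(annual_rate, years):
    """Share of a building stock that survives a constant annual demolition rate.

    Assumption: the rate compounds annually.
    """
    return (1.0 - annual_rate) ** years


def split_sealing(sealing_prev, sealing_now):
    """Split the current sealing share into (older stock, latest construction period)."""
    min_new = sealing_now * MIN_NEW_SHARE
    new = max(sealing_now - sealing_prev, min_new)
    return sealing_now - new, new


# 0.2% per year between 2000 and 2014 leaves about 97% of the pre-1975 stock.
print(round(surviving_share(0.002, 14), 2))

# Sealing of 40% in both 1975 and 1990: 0.3 percentage points move to 1975-1990.
older, newer = split_sealing(0.40, 0.40)
print(older, newer)
```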

As with the local heating degree days, we tried to conservatively correct the energy needs for the surface-to-volume ratio and construction periods. Again, we manually performed checks for different regions to indicate that the outcomes are plausible on a general level. We needed to remain cautious of the significant uncertainties of this methodology. Therefore, we gave these two factors a rather low weight and considered an elasticity of only 33% for the surface-to-volume ratio. To buildings constructed after 2000, we assigned a specific EN of 80%, and to buildings constructed before 1990, we assigned a specific EN of 125%, both relative to the specific energy needs of buildings constructed from 1990 to 2000.
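The construction-period factors can be combined into a single correction factor per cell. The sketch below uses a share-weighted average, which is our assumption; the text gives the factors but not the aggregation rule, and the names are hypothetical.

```python
# Specific EN relative to buildings constructed from 1990 to 2000, as stated in the text.
PERIOD_FACTOR = {"before_1990": 1.25, "1990_2000": 1.00, "after_2000": 0.80}


def period_factor(shares):
    """Share-weighted average of the period factors for one hectare cell.

    Assumption: shares are non-negative building shares per construction period;
    they are normalised here so that they need not sum to one.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one construction period needs a positive share")
    return sum(PERIOD_FACTOR[period] * share for period, share in shares.items()) / total


# Hypothetical cell: half of the buildings date from before 1990.
factor = period_factor({"before_1990": 0.5, "1990_2000": 0.25, "after_2000": 0.25})
print(factor)
```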

**Figure 10.** Estimated shares of buildings per construction period at the hectare level for Vienna and its surroundings. Red is used to color-code the high shares, whereas low shares are shown in beige.

#### *2.6. Comparison of the Resulting Data with Data from Other Sources*

In order to better understand the characteristics and quality of the developed gross floor area (GFA) density and heat demand (HD) density layers, we compared these with data from other sources. We did this at the following regional levels:

	- The basis for the bottom-up calculation of the GFA and the EN of the buildings in the three cities are shape files of the buildings containing the following information: shape and location of the building footprint, number of floors, building height and type, as well as age of the building. If data for certain buildings were missing, we filled these gaps using the average values of the other buildings with similar characteristics in the database.
	- In the second step, we joined these building stock data with data on specific EN values for space heating and for hot water generation from the Invert/EE-Lab model [40]. With this model, we calculated these values for typical buildings in the countries according to the type and construction period of the buildings. The resulting values applied to the overall building stock in the countries match the national energy balances. In joining the values with the building stock databases of the cities, we performed a climate correction from the average heating degree days (HDD) in the countries to the HDD in the cities. For this, we used the HDD data from the Hotmaps database (see Figure 9, available at [62]).

For the comparison, we used data from published and unpublished sources. At the NUTS 3 or LAU level for several locations, data can be found in the published literature. However, for many regions and municipalities, data on the energy demands of space heating and hot water generation, or on the number of buildings, are only available in unpublished databases or reports. Because public authorities usually do not publish their building inventories, for our comparison at the hectare level, we were only able to use unpublished data.

#### **3. Results**

#### *3.1. Resulting Maps and Data*

The developed raster files for gross floor area and heat demand for space heating and hot water preparation in residential and non-residential buildings are integrated into the Hotmaps database and toolbox and are accessible on their respective webpages [63]. On the website, these data can be visualised and analysed for selected locations and can be used in the different integrated calculation modules. Figure 11 shows a screenshot of the heat demand density total layer and the provided indicators for the NUTS 3 region of Vienna.

**Figure 11.** Screenshot of the developed heat demand density layer for the NUTS 3 region of Vienna, accessed via the Hotmaps database and toolbox. Source: [63].

The layers are also available for download from the Hotmaps database and toolbox or directly from a GitHub repository. The corresponding links are given in the supplementary materials section at the end of the manuscript.

#### *3.2. Comparison of Results with Data from Other Sources at the NUTS 3, LAU 1, and LAU 2 Levels*

To understand the quality of the developed gross floor area (GFA) and heat demand (HD) data, we first compare these at the levels of the NUTS 3, LAU 1 (local administrative units), and LAU 2 regions with data from other sources. We collected data on population, residential GFA, total GFA, and HD for space heating and hot water generation in residential and tertiary buildings for 20 different regions. Table 2 lists the data sources we used, and Figure 12 shows the results of this comparison.


**Table 2.** Data sources of the reference values for comparison at the LAU/NUTS level.

**Figure 12.** Comparison of the calculated values for the population, the residential gross floor area (GFA), the total GFA, and the heat demand with the values stated in other sources for selected locations.

This comparison of the developed data and other sources for the 20 selected regions shows a mean absolute difference of 12%, with a standard deviation of 10% and a median absolute difference of 8%. For some values, we found nearly no difference between the developed and other data, whereas other values show remarkable differences of up to 45%. The values for the population and total GFA seem to match better between the developed data and other sources, whereas residential GFA and HD show greater differences. Residential GFA, on average, is 9% higher in the developed data, and the HD, on average, is 8% lower. Both show a standard deviation of 16%. For regions with larger populations, the differences seem to be lower. However, these statistics on the difference between the developed data and other data have a high uncertainty mainly due to the limited number of regions under comparison and the limited certainty of data from other sources. We discuss these uncertainties in Section 4.2.

#### *3.3. Comparison of the Results with Data from Other Sources at the Hectare Level*

For the cities of Bistrita, San Sebastian, and Frankfurt, we compared the developed gross floor area (GFA) and heat demand (HD) density maps (top-down) with maps developed from municipal building stock databases (bottom-up). Both types of maps were created with the same projection and raster size. For the comparison, we scaled the values in each hectare element of the top-down maps using the following ratio: the sum of the values of all hectare elements in the bottom-up map divided by the sum of values of all hectare elements in the top-down map. This allows us to focus on the difference of the distribution of the GFA and HD in the territories between the top-down and the bottom-up maps. For HD, we compare two maps: the top-down map and the bottom-up map. For GFA, we compare three maps: the top-down map, the bottom-up map of the GFA for all buildings in the region, and the bottom-up map showing only the heated GFA (HA) in residential and tertiary buildings.
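The scaling step described above amounts to one ratio applied to all cells. A minimal sketch with hypothetical values:

```python
def scale_to_reference(top_down, bottom_up):
    """Scale the top-down cell values so that their sum matches the bottom-up sum.

    Both inputs hold one value per hectare element of the same raster.
    """
    ratio = sum(bottom_up) / sum(top_down)
    return [value * ratio for value in top_down]


# Hypothetical hectare values for a small area.
top_down = [20.0, 30.0, 50.0]
bottom_up = [10.0, 40.0, 70.0]
scaled = scale_to_reference(top_down, bottom_up)
print(scaled)
```

After scaling, any remaining difference between the two maps reflects only how the totals are distributed across the cells, not the level of the totals.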

In the first step, we compared the distribution of the GFA and HD over the GFA and HD density in the top-down and bottom-up maps. Figure 13 presents these distributions, showing the cumulated GFA and HD values for all hectare elements, from the elements with low density to the elements with high density. In order to compare the distributions for the three cities, we normalized both axes.

**Figure 13.** Comparison of the hectare data from the bottom-up and top-down maps: cumulated floor area per total floor area over the floor area density per maximum floor area density in each region (**left side**) and the cumulated heat demand per total heat demand over heat demand density per maximum heat demand density in each region (**right side**).
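Curves of the kind shown in Figure 13 can be built by sorting the hectare elements by density and accumulating their values. The sketch below is our reading of the axis definitions in the caption; the function name is hypothetical, and empty cells are skipped by assumption.

```python
def cumulative_curve(cell_values):
    """Return (normalised density, cumulated share) pairs, sorted by density.

    x: density of the cell divided by the maximum density in the region.
    y: cumulated value of all cells up to this density divided by the regional total.
    Assumption: cells with a value of zero are skipped.
    """
    values = sorted(v for v in cell_values if v > 0)
    total = sum(values)
    peak = values[-1]
    curve = []
    running = 0.0
    for v in values:
        running += v
        curve.append((v / peak, running / total))
    return curve


# Hypothetical region with three populated hectare elements.
curve = cumulative_curve([60.0, 0.0, 10.0, 30.0])
print(curve)
```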

The figures show that, in the top-down maps, the GFA and the HD are distributed over a smaller range of GFA and HD density compared to the bottom-up maps. The maximum density in the bottom-up maps is remarkably higher than the maximum density in the top-down maps: e.g., in Frankfurt, in the top-down map, the highest GFA density is only 13% of the highest GFA density in the bottom-up map; for HD in Frankfurt, the highest HD density in the top-down map is even lower at 11%. This is the same for all analysed cities but with a lower difference between the maximum values. Furthermore, the GFA and the HD are more evenly distributed in the top-down maps than in the bottom-up maps. The strong increase of density for the 5–10% of GFA and HD in the areas of highest density visible in the bottom-up maps is remarkably underestimated in the top-down maps. Because these characteristics can be found for all comparisons between the top-down and bottom-up maps, this seems to be a systematic difference between the top-down and bottom-up results.

The figure also shows that the form of the distribution of the GFA and HD over the GFA and HD density is similar in the bottom-up and top-down maps. That is, the slope of the distribution curve for Frankfurt is steeper than that for San Sebastian, and the slope of the curve for San Sebastian is steeper than that for Bistrita. This is the same for both the bottom-up and top-down maps, as well as for the GFA and HD comparison. Finally, we also see in the comparison of the GFA maps that, for the three cities, the bottom-up maps showing only the heated area (HA) match the top-down maps better than the bottom-up maps showing the entire GFA for all buildings in the region do. This result seems logical, as the top-down maps reflect the heated area only.

For the second analysis, we compared the values from the top-down and bottom-up maps for each hectare element. Figure 14 shows the difference between the bottom-up and the top-down value in each hectare element for the three cities. In this way, the following maps are compared: (a) a top-down GFA map reflecting the heated area (HA) in the region, with the bottom-up GFA map containing all buildings in the region (including industrial and non-energy relevant buildings); (b) a top-down GFA map with the bottom-up HA map only containing the HA of the residential and tertiary buildings; and (c) a top-down HD map with a bottom-up HD map. For this comparison, the top-down values have been scaled so that the overall GFA, heated area, and HD in the area are the same. In the figure, each hectare cell of the selected municipality is represented by a single dot. Blue dots indicate that the Hotmaps top-down data distribute a lower share of energy or gross floor area to a specific hectare cell than the bottom-up data do. The hectare cells are shown as red dots if the bottom-up maps allocate a lower share.

**Figure 14.** Difference between the top-down and bottom-up values for each hectare element in three cities: (**a**) gross floor area (GFA) of all buildings (including industrial and non-energy relevant buildings) in the bottom-up data vs. heated area (HA) in the top-down data (left column), (**b**) HA in the bottom-up data vs. HA in the top-down data (middle column), and (**c**) heat demand in the bottom-up and in the top-down data (right column).
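The per-hectare comparison behind Figure 14 can be illustrated with the following sketch. It assumes that both maps are given as arrays on the same grid and compares the share of the total that each approach allocates to a cell; the function names, the tolerance, and the example values are hypothetical.

```python
import numpy as np

def share_difference(top_down: np.ndarray, bottom_up: np.ndarray) -> np.ndarray:
    """Bottom-up share minus top-down share for each hectare cell."""
    td_share = top_down / top_down.sum()
    bu_share = bottom_up / bottom_up.sum()
    return bu_share - td_share

def dot_colour(diff: np.ndarray, tol: float = 1e-12) -> np.ndarray:
    """Blue: top-down allocates the lower share; red: bottom-up allocates the lower share."""
    return np.where(diff > tol, "blue", np.where(diff < -tol, "red", "equal"))

# Hypothetical hectare values (illustrative only).
top_down = np.array([10.0, 20.0, 30.0, 40.0])
bottom_up = np.array([5.0, 15.0, 60.0, 120.0])
diff = share_difference(top_down, bottom_up)
colours = dot_colour(diff)
```

Working with shares rather than absolute values is equivalent to scaling the top-down map to the bottom-up total before taking the difference.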

#### **4. Discussion**

In this paper, we explained how we developed a gross floor area (GFA) density map and a heat demand (HD) density map at the level of 100 m × 100 m for the entire EU 28 (+ Norway, Iceland, and Switzerland) within the Hotmaps project. We also showed a comparison of the developed maps at the NUTS 3, LAU, and hectare levels with data and maps (developed) from other sources. In the following, we discuss the limitations of available data for the exercise and their effects on the results, as well as the uncertainty in the comparison of the results with data from other sources.

#### *4.1. Limitations of the Data*

The presented datasets and maps build on a statistical approach. This approach limits the accuracy of the data, as site-specific or local conditions are not taken into account. Considering the input data, we believe that the population data are accurate up to a level between 250 × 250 m and 500 × 500 m. However, we must acknowledge that the input data are, on average, about 10 years old. In addition, the data are consistent with statistical data at the municipal level; given the limitation that statistical population data on LAU regions are not available for all census years and LAU regions (or contain inconsistencies), we used the average population data for the years 2008 to 2016. Checks in which we estimated the population of a given area using satellite images and estimates of the average number of persons per building confirmed that the data are also plausible for higher resolutions.

Statistical data on the residential heated (net) floor area are available for most NUTS 3 regions. We again performed manual data quality checks, which indicated that our results are plausible at the hectare level for most regions. However, in the current dataset, we did not consider the observation that the heated area per inhabitant often decreases with increasing population density. For NUTS 3 regions with a strong urban versus rural area gradient, this might result in an overestimation of the heated residential gross floor area in urban areas. The heated gross floor area of non-residential buildings, however, remains very uncertain at the country level. Data quality checks indicate that the sum of the residential and non-residential heated gross floor area is in a plausible range. Also, the ratio between residential and non-residential gross floor area is plausible, although this indicator might not hold for grid cells that contain few buildings. Furthermore, the comparison of regional building stock data indicates that we likely underestimated the floor area of non-residential buildings at the country level, which is an input in our model. The final energy demand for non-residential buildings, however, is in the correct order of magnitude, as well as the data for the heated gross floor area at the national level for building categories, such as offices, health and education, restaurants and hotels, retail and wholesale, and others. One possible reason for this result is that we overestimated the area-specific energy demand (e.g., due to the geometry of the buildings (small surface-to-volume ratio) and/or higher-than-estimated internal gains) or that a significant share of non-residential buildings is not fully heated (industrial production halls, warehouses, etc.). In order to systematically investigate this gap, further high-quality data on the existing non-residential building stock are needed.

For the heat demand density map, we derived the local data from statistical data on energy consumption at the country level and national building stock characteristics, such as the average specific energy needs per construction period. To calculate the grid cell specific energy demand-per-floor area data, we assessed the surface-to-volume ratio of buildings based on the OpenStreetMap database, the share of floor area per construction period, and the heating and cooling degree days. The impacts of the first two indicators are plausible but highly uncertain. We, therefore, give these indicators a low weight in our calculations. We believe that the last indicators, the heating and cooling degree days, are of higher accuracy, although we used a simple atmospheric temperature lapse rate model, which cannot account for local site-specific weather and climate conditions. Additional uncertainties exist, as we do not know if and how planners already considered colder (or warmer) local climate conditions when the buildings were constructed in the past. Since we assume that this might be the case to some extent, we lowered the weight of the climate indicator compared to what is usually considered to be the actual thermodynamic degree of influence. Again, data quality checks indicate that our results are plausible. However, we recommend using individual data on the heated area-specific energy needs or final energy demands, whenever local data are available. Another uncertainty regarding the final energy consumption (which is not the case for the energy needs) arises from the lack of information on the applied heating systems and the corresponding efficiency. If, in a region or grid cell, biomass-based stoves are widely applied, then the final energy consumption will be higher than if electricity-based systems are commonly used, even if the actual energy needs of the buildings are identical.
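The effect of the heating system on final energy consumption can be illustrated with a simple relation, assuming that final energy equals the energy needs divided by an overall system efficiency. The efficiency values below are hypothetical placeholders chosen for illustration, not values used in the Hotmaps dataset.

```python
def final_energy_demand(energy_needs_kwh: float, system_efficiency: float) -> float:
    """Final energy consumption for given energy needs and overall system efficiency."""
    if system_efficiency <= 0:
        raise ValueError("system efficiency must be positive")
    return energy_needs_kwh / system_efficiency

# Identical energy needs, different heating systems (hypothetical efficiencies).
needs_kwh = 10_000.0
biomass_stove_kwh = final_energy_demand(needs_kwh, 0.5)  # assumed stove efficiency
electric_kwh = final_energy_demand(needs_kwh, 1.0)       # assumed direct electric heating
```

With these assumed efficiencies, the same building would show twice the final energy consumption when heated with stoves, which is why missing information on heating systems affects final energy but not energy needs.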

#### *4.2. Uncertainty in the Comparison of the Results with Data from Other Sources*

Comparing the developed maps with data and maps from other sources at different regional levels is important for understanding the potential use and the credibility of the dataset. Although it was possible to find values for population, residential gross floor area (GFA), total GFA, and heat demand (HD) for 20 regions, the uncertainty in the statistics for the difference between the developed data and the data from other sources is high. There are two main reasons for this uncertainty. First, the regions for which we compared developed data with data from other sources make up only a very small share of the overall territory covered by the developed data. The 20 regions in the presented comparison cover 1.4% of the population of the entire analysed territory. Second, the reliability of the data from other sources is often unknown. Descriptions of the data stated in reports often lack the detail needed to understand what exactly is being represented in the data, e.g., what type of heat demand is reflected, what types of buildings are taken into account, the year of reference, whether the demand data are climate corrected, or what the regional borders of the analysis are. Therefore, no quantitative conclusions for the differences between the developed and other data at the NUTS and LAU level are possible.

We also compared the top-down developed maps with maps based on municipal building stock data (bottom-up) for three cities at the hectare level. The bottom-up maps were developed by estimating the HD via the average HD in typical buildings in the countries, calibrated at the national level and climate corrected to the location of analysis. We found that the overall GFA and HD, as well as the split between residential and tertiary buildings, in the developed bottom-up maps match well with the data for buildings and the energy statistics from the cities. However, the data in the local statistics also imply uncertainty. Notably, the energy demand for space heating and hot water generation is generally not measured but derived from the estimated shares of energy carriers used for different purposes. Bottom-up estimations, on the other hand, strongly depend on the input data, most importantly energy demand per m², service factors, and building occupation. Furthermore, user behaviour is an important uncertainty when analysing HD at a very detailed regional level: Are people leaving houses for longer periods, are they used to having lower or higher indoor temperatures, or is a building or a flat really occupied or not?

#### **5. Conclusions and Outlook**

#### *5.1. Conclusions*

The developed GFA and HD density maps cover the entire territory of EU 28 + Norway, Iceland, and Switzerland. In addition, they are fully open source and therefore usable by everyone for every purpose. This is the first such dataset we know of.

A comparison of the developed data with data stated in other sources for selected cities and regions showed differences from very low up to 45%. The average difference of all compared values was 12% (median 8%), with a standard deviation of 10%. Differences in this range are also often experienced in comparing data from bottom-up estimations of GFA and HD based on local building stock data with values from buildings and energy statistics.

A comparison of the developed maps with maps based on municipal building stock datasets for three cities shows that, for these locations, the overall tendency of the distribution of GFA and HD over the GFA and HD density is similar in both approaches. This comparison also reveals the following systematic difference: The developed datasets seem to systematically overestimate the GFA and HD in low density areas and underestimate the GFA and HD in high density areas.

We conclude that the developed GFA and HD density maps allow a first analysis of GFA and HD distribution in all locations in Europe. Also, they can be used to identify areas that might be suitable for district heating. Especially for locations in Europe where detailed GFA and HD density maps are not available, the developed maps provide valuable data for initial and quick analyses. For the detailed planning of supply infrastructure, however, more detailed data from the local level should be used.

#### *5.2. Outlook*

Although the approach of developing heat density maps is not entirely new, and despite the achieved progress regarding the transparency, accessibility, and quality of the data presented in this paper, there is still a considerable need to enhance the work on heat density maps.

In the course of the Hotmaps project, the representatives of additional follower areas—beyond the cases presented above—will use the Hotmaps toolbox (www.hotmaps.eu). The creation of detailed, bottom-up heat density maps will provide the grounds for more data to be compared with the EU-28 default data map presented in this paper. The authors intend to use this process to continuously develop further calibration, a better understanding of possible deviations and biases, and regularly update the database via the toolbox and on the mentioned Git-Repository. This model, which creates the described heat density maps, will be published as open source at the end of the Hotmaps project; until then, the model is available on request.

High quality data on heating and cooling energy demands and consumption for entire regions or areas are rare and usually subject to substantial uncertainty. Thus, reference values for the plausibility checks of heat density maps are also rare and partly uncertain. In addition, even the assessment of the reliability of a source is often difficult since geographic system boundaries are not always clear, and the definitions and indicators are often not fully documented. Thus, the improvement of data availability, reliability, documentation, and know-how on the municipal, regional, national, and European levels regarding energy demands and, in particular, heating and cooling, should be given higher priority. On the municipal level, this prioritization should also be integrated in the process of establishing strategic heating and cooling planning and mapping processes. We are convinced that a better data foundation and correspondingly trained persons are essential preconditions for more effective planning and mapping processes and thus a decarbonisation of the heating and cooling sector.

Cooling density maps have also been developed by the authors in the frame of the Hotmaps project. Due to the restricted space in this paper, we decided to limit the scope to heating only. The elaboration of cooling density maps has led to other types of uncertainties and issues that need to be further analysed in future research work.

The validation and subsequent reliability of the heat density maps could be further improved by integrating other data sources. This refers, for example, to the data from the EPC databases. The project Enerfund (http://enerfund.eu/) has provided a rich map of EPC data, at least for some countries. Another important source and step to improve the quality of heat density maps is using real and measured energy consumption data. By increasing the roll-out of smart meters and devices that enhance the smart-readiness of buildings, a substantial amount of data will be available in the future. These data will have good potential to improve the reliability and real-time updates, adding higher time resolution and quality to the heat density maps. However, the availability of these data does not mean they are also accessible for the improvement of heat density maps. Clarifying data protection rules and the ownership of consumers' energy consumption data will be one of the key prerequisites.

Overall, we are convinced that with an increasingly stronger focus on the heating and cooling sector in achieving energy and climate policy targets, local work on decarbonising this sector will gain relevance. In this way, a better understanding of the spatial dimension of heating and cooling supply and demand will become progressively more important.

*Energies* **2019**, *12*, 4789

**Supplementary Materials:** The following developed datasets (maps) are available online at https://gitlab.com/hotmaps: Heated gross floor area density maps of buildings in EU28 + Switzerland, Norway and Iceland for the year 2015: Residential buildings: https://gitlab.com/hotmaps/gfa\_res\_curr\_density, Non-residential buildings: https://gitlab.com/hotmaps/gfa\_nonres\_curr\_density, All buildings: https://gitlab.com/hotmaps/gfa\_tot\_curr\_density. Heat density (final energy demand for space heating and DHW) on a hectare level (100 × 100 m) for EU28, Norway, Iceland and Switzerland in 2015: Residential buildings: https://gitlab.com/hotmaps/heat/heat\_res\_curr\_density, Non-residential buildings: https://gitlab.com/hotmaps/heat/heat\_nonres\_curr\_density, All buildings: https://gitlab.com/hotmaps/heat/heat\_tot\_curr\_density.

**Author Contributions:** Conceptualization, A.M. and L.K.; methodology, A.M.; software, A.M., M.H., and M.F.; validation, A.M., M.H., and M.F.; formal analysis, A.M. and M.H.; data curation, A.M., M.H., and R.B.; writing—original draft preparation, A.M., M.H., and L.K.; writing—review and editing, A.M., M.H., L.K., M.F., and R.B.; visualization, A.M., M.H., and M.F.

**Funding:** This research was funded by the Horizon 2020 programme of the European Union, grant number 723677—Hotmaps.

**Acknowledgments:** This work has benefited greatly from the discussions with the Hotmaps project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations and Variables**


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Role of Open Access Data in Geospatial Electrification Planning and the Achievement of SDG7. An OnSSET-Based Case Study for Malawi**

## **Alexandros Korkovelos <sup>1,\*</sup>, Babak Khavari <sup>1</sup>, Andreas Sahlberg <sup>1</sup>, Mark Howells <sup>1</sup> and Christopher Arderne <sup>2</sup>**


Received: 10 March 2019; Accepted: 8 April 2019; Published: 11 April 2019

**Abstract:** Achieving universal access to electricity is a development challenge many countries are currently battling with. The advancement of information technology has, among other things, vastly improved the availability of geographic data and information. That, in turn, has had a considerable impact on tracking progress as well as on better informing decision making in the field of electrification. This paper provides an overview of open access geospatial data and GIS based electrification models aiming to support SDG7, while discussing their role in answering difficult policy questions. Building on those, an updated version of the Open Source Spatial Electrification Toolkit (OnSSET-2018) is introduced and tested against the case study of Malawi. At a cost of \$1.83 billion, the baseline scenario indicates that off-grid PV is the least cost electrification option for 67.4% of Malawians, while grid extension can connect about 32.6% of the population in 2030. Sensitivity analysis, however, indicates that the electricity demand projection significantly determines both the least cost technology mix and the investment required, with the latter ranging between \$1.65 and \$7.78 billion.

**Keywords:** open data; electrification modelling; Malawi; OnSSET

#### **1. Introduction**

The 2030 Agenda for Sustainable Development has set the goal of universal access to electricity by 2030 (SDG7) [1]. The challenge is significant. It involves reaching populations with limited income, often living in sparsely populated areas, mostly in developing and least developed countries [2]. Selecting the optimal electrification approach is also difficult; grid vs. off-grid, fossil fuel vs. renewable, public vs. private investment are just a few examples. Coping with dilemmas of this nature, which involve the deployment of big technological systems, requires thorough analysis of the social, technical, economic and political characteristics of the studied area or country [3]. This, in turn, requires access to reliable data and information [4,5]; e.g., distribution and density of population settlements, electricity demand levels, resource availability, poverty rate and economic activity, and distance from functional infrastructure (e.g., transmission and distribution network, roads, power stations), to name a few.

Despite progress, in most countries where universal electrification is still to be achieved, such official information is still difficult to access [6]; these data are typically not covered by standard national energy statistics. The paucity of such information is one reason hampering electrification progress [7,8]. However, this situation is gradually being overcome with the increasing availability of new data and analytical tools, especially in the field of geospatial analysis. Geographic Information Systems (GIS) and remote sensing techniques are becoming openly available and can now provide a range of location-specific information that has not been previously accessible.

In the energy sector, the use of GIS data and associated analytical tools to conduct strategic planning remains at an early stage, yet such efforts have multiplied in recent years to further support both public and private stakeholders in prioritizing and rationalizing energy infrastructure investments [9]. From a public-sector perspective, GIS analytics are increasingly being used by governments and utilities to prioritize and sequence their grid extension efforts, as well as integrate off-grid solutions within national strategies aiming to achieve universal electricity access in a given timeframe (e.g., Tanzania [10], Afghanistan [11], Zambia [12], Madagascar [13]). From a private-sector perspective, similar analytics are used to demonstrate the opportunity for supplying off-grid customers with decentralized energy services (market opportunity identification) and support subsequent operational roll-outs (business models).

With this paper we aim to: (a) provide an overview of the main GIS data and modelling efforts aiming to support electrification planning and the achievement of SDG7; (b) discuss their role (especially if open) as providers of useful insights to difficult policy questions; (c) illustrate this narrative through a case study of Malawi using an open-data-based and updated version of the Open Source Spatial Electrification Toolkit (OnSSET 2018); and (d) identify critical data/methodological gaps and suggest actions for future development.

#### **2. GIS Based Electrification Planning**

#### *2.1. Open Access Data*

The availability and quality of open access, publicly available GIS datasets has improved significantly over the past years; new datasets emerge conveying useful information regarding resource availability, status of infrastructure, social and economic characteristics of global populations. The following paragraphs present GIS datasets that have been (or can be) used in geospatial electrification analysis. A summarized list of useful GIS data for geospatial electrification modelling, providing status and gaps, is available in Appendix A.

#### 2.1.1. Energy Infrastructure

The development of effective GIS-based electrification plans depends greatly on the availability of credible and up-to-date records of existing infrastructure in the area of interest. The distribution of the grid network, for example, is an important input parameter. To illustrate, unelectrified settlements might find it more economical to connect to the national grid if they are in close proximity to service transformers or medium voltage (MV) lines. In contrast, areas that are located far from the grid network might find that off-grid technologies (mini-grids or solar home systems) are a better alternative. Therefore, low quality (erroneous or inadequate) datasets of the grid network may have a considerable impact on the results of electrification models. Other infrastructure, such as the road network, and thus the datasets describing it, is equally important for planning; take, for example, remote villages without access to proper roads. They might experience high logistic costs for certain technologies, e.g., high diesel prices. Several efforts have been recorded over the past few years aiming at reducing the infrastructure data gap; a few of them are briefly described below.
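The role of grid proximity can be illustrated with a small distance calculation. The sketch below assumes that settlements and MV line vertices are given in a projected coordinate system with units of metres, and it approximates the lines by their vertices; a full implementation would compute distances to line geometries with a GIS library. All names and coordinates are hypothetical.

```python
import numpy as np

def distance_to_nearest(points_xy: np.ndarray, grid_xy: np.ndarray) -> np.ndarray:
    """Euclidean distance (m) from each settlement to its nearest grid vertex."""
    diff = points_xy[:, None, :] - grid_xy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

# Hypothetical projected coordinates in metres.
settlements = np.array([[0.0, 0.0], [3000.0, 4000.0], [30000.0, 40000.0]])
mv_vertices = np.array([[0.0, 1000.0], [6000.0, 8000.0]])
dist_m = distance_to_nearest(settlements, mv_vertices)
```

An error in the recorded position of a line changes these distances directly, which is one way in which low quality grid datasets propagate into model results.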

A noteworthy initiative recording power plants worldwide is the Global Power Plant Database by the World Resources Institute [14]; the dataset contains geo-located entries of 28,500 power plants from 164 countries, including information on capacity, generation, ownership, and fuel type. It is open and frequently updated. The Global Roads Open Access Data Set (gROADS), v1 [15] provides a range of road data from the 1980s to 2010. Unfortunately, most country data is not 'date stamped' and spatial accuracy varies. OpenStreetMap (OSM) fills data gaps in several instances [16]; OSM is a large and growing repository of open geospatial data covering various elements of infrastructure, including roads [17]. The World Bank has developed an online data explorer that records existing and planned transmission and distribution lines over Sub-Saharan Africa and the Middle East [18]. The explorer draws from a comprehensive dataset [19] including power lines ranging from sub-kV to 700 kV. It should be noted, however, that there is large variation in the completeness of data by country. The ECOWAS Centre for Renewable Energy and Energy Efficiency (ECREEE) has provided a similar dataset for West Africa [20].

These efforts collect, organize and redistribute existing data. However, for many countries datasets are incomplete and in some cases of uncertain quality; for example, metadata describing the content, its source and how it was derived is often incomplete or missing. In order to overcome selected barriers, new methodologies have been developed. The energy access team at Facebook has released a remote-sensing-based predictive model for more accurate MV network mapping; the model is open source and, as of the time of writing, its output is available for six countries in Sub-Saharan Africa [21]. Note that [22] provides an adaptation of this work available as executable, open source code. Other initiatives consider the use of machine learning techniques and artificial intelligence. For example, Development Seed has developed an open source pipeline to efficiently map the high-voltage (HV) grid at a country-wide scale. The method uses high resolution satellite maps (0.5 m/pixel) from the DigitalGlobe Platform to identify HV towers. Then, machine learning algorithms are applied to predict the distribution of transmission lines between the towers. Results are available for Nigeria and Zambia [23]. Finally, other initiatives [24–26] have also been developed in this area; not all of them focus on Sub-Saharan Africa. As they are open, however, they can be applied globally and have the potential to overcome important data shortages.

#### 2.1.2. Resource Mapping

Natural resource availability, such as sunlight for PV panels, is a significant decision parameter when choosing electrification options. Electrification solutions should take into account local conditions in order to achieve long-term sustainability [27]. Remote areas with abundant solar irradiance that are far from oil supply, for example, might be better served by photovoltaic systems than by diesel generators. Similar logic applies to other resources. So-called 'big data' from Earth-orbiting satellites have enabled scientists to better assess resource availability on a global scale. This body of data, if processed properly, can provide useful information for electrification projects as well. For example, a Global Horizontal Irradiation (GHI) dataset is available from [28]. It provides information regarding solar availability at a location (usually in kWh/m²/year). Other datasets such as wind speed [29,30], Digital Elevation Models (DEM) [31], land cover [32–35], river network [36], drainage basins [37], and water discharge flows [38,39] are also highly useful. Combinations of those can yield very useful outputs such as wind power density [29], capacity factors [40], or hydro potential maps [41], which in turn can provide insights for the development of successful electrification projects.
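As an example of such a derived output, wind power density can be estimated from wind speed with the standard relation P/A = 0.5 · ρ · v³. The sketch below assumes a constant air density of 1.225 kg/m³ (standard sea-level conditions) and a hypothetical wind speed raster; it is not the method used to produce the dataset in [29].

```python
import numpy as np

AIR_DENSITY_KG_M3 = 1.225  # assumed standard sea-level air density

def wind_power_density(speed_m_s, air_density: float = AIR_DENSITY_KG_M3) -> np.ndarray:
    """Wind power density in W/m² for a given wind speed in m/s."""
    v = np.asarray(speed_m_s, dtype=float)
    return 0.5 * air_density * v ** 3

# Hypothetical wind speed raster in m/s (illustrative values only).
speeds = np.array([[4.0, 6.0], [8.0, 10.0]])
wpd = wind_power_density(speeds)
```

Because of the cubic dependence on speed, applying this relation to a mean wind speed generally underestimates the mean power density, so time series or speed distributions are preferable where available.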

#### 2.1.3. Socio-Economic

A critical challenge in current electrification efforts is to construct sustainable business models. That is, electrification projects (both private and public) need to be able to recover investment and operational costs and be profitable, or at least break even [42]. Information regarding the socio-economic context under which such projects are developed is thus important during the design phase. The following paragraphs describe how GIS can help identify some of these characteristics and incorporate them into electrification modelling.

#### 2.1.4. Population Density & Distribution

Population density and distribution maps are used to indicate where the population resides and thus where there is potential residential demand [43]. The map type as well as the spatial resolution determine the detail (and sometimes the accuracy) of the information. Gridded population datasets (similarly to any other raster layer) represent information in the form of grid cells. In this case, grid cell values indicate population headcounts or density at a specific time.

Worldpop [44,45] has developed gridded population layers for many Sub-Saharan African countries at 100 m spatial resolution; 1 km resolution layers are available at continental level. These layers use interpolation techniques, which may give rise to inaccurate population estimates in certain cells; for example, some cells indicate population headcounts that have no physical meaning (e.g., less than 1). The Global Human Settlement layer (GHS) [46] suggests an alternative approach by indicating population values only in urban, peri-urban or rural areas; locations without population are eliminated. In a similar manner, the Global Urban Footprint (GUF) [47–49] layer specifies in high spatial resolution (12 or 75 m) where settlements are located. The High Resolution Settlements Layer (HRSL) [50] provides population density maps in very high spatial resolution (30 m), but only for a selected number of countries in Sub-Saharan Africa. Finally, [51] has developed a methodology that further processes the above datasets in order to provide more accurate vector type settlement layers for the case study of Tanzania.
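One simple way to handle cells with fractional headcounts is sketched below. It is an assumption made for illustration and not the procedure used by any of the datasets above: cells below a minimum headcount are set to zero, and the remaining cells are rescaled so that the total population is preserved. The function name and the example values are hypothetical.

```python
import numpy as np

def drop_fractional_cells(population: np.ndarray, min_headcount: float = 1.0) -> np.ndarray:
    """Zero out cells below min_headcount and rescale the rest to keep the total."""
    pop = np.where(np.isfinite(population), population, 0.0)
    total = pop.sum()
    kept = np.where(pop >= min_headcount, pop, 0.0)
    if kept.sum() == 0:
        return kept
    return kept * (total / kept.sum())

# Hypothetical gridded headcounts (illustrative values only).
pop = np.array([[0.2, 0.3], [4.5, 5.0]])
cleaned = drop_fractional_cells(pop)
```

This keeps national or regional totals consistent with the source layer while removing cells whose values have no physical meaning.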

#### 2.1.5. Night-Time Lights

Night-time light (NTL) maps capture light sources on the surface of the Earth using satellite imagery. These can be a good proxy for assessing where electrified human settlements are, as they indicate light pollution. The Visible Infrared Imaging Radiometer Suite (VIIRS) dataset is available in raster format at 250 m spatial resolution; it provides the luminosity value in every cell, where low values indicate little visible light and higher values indicate high luminosity [52]. As of 2018, VIIRS provides annual composites for 2015 and 2016 as well as monthly composites for all years between 2012 and 2018. Its availability in monthly composites allows for detailed analysis of light sources and reduces the occurrence of false positives, i.e., areas that seem to be lit but in reality are not. Note that DMSP-OLS V4 [53] is the predecessor of VIIRS; it is available in raster format at 1 km spatial resolution, with composites available until 2012. It should be noted that DMSP-OLS V4 composites have been processed in order to provide stable light values over time series. Finally, Earth Observatory [54] also provides night light data at various spatial resolutions but without providing stable light composites.
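The use of monthly composites to reduce false positives can be sketched as follows. The rule below is an assumption for illustration only: a populated cell is flagged as likely electrified if it is lit in at least a given share of the monthly composites. The thresholds, the function name, and the example values are hypothetical.

```python
import numpy as np

def likely_electrified(monthly_ntl: np.ndarray, population: np.ndarray,
                       light_threshold: float = 0.0,
                       min_lit_share: float = 0.5) -> np.ndarray:
    """monthly_ntl has shape (months, rows, cols); population has shape (rows, cols)."""
    lit_share = (monthly_ntl > light_threshold).mean(axis=0)
    return (lit_share >= min_lit_share) & (population > 0)

# Three hypothetical monthly composites for a 1 x 3 raster.
monthly_ntl = np.array([[[0.0, 2.0, 1.0]],
                        [[0.0, 3.0, 0.0]],
                        [[0.0, 2.0, 0.0]]])
population = np.array([[10.0, 50.0, 20.0]])
electrified = likely_electrified(monthly_ntl, population)
```

In this example the third cell is lit in only one of three months and is therefore treated as a probable false positive.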

#### 2.1.6. GDP—Poverty Maps

Reference [55] developed a GIS layer presenting the Gross Domestic Product (GDP) in gridded format and on a global scale for three intervals between 1990 and 2015 at 1 sq. km spatial resolution. The study primarily uses national GDP, PPP (purchasing-power-parity) values in constant 2011 international U.S. dollars (\$). In this instance, GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. In addition, the study uses sub-national GDP, PPP values (for 82 countries) where available. These values were also converted to constant 2011 international U.S. dollars (\$) and adjusted so that, when weighted by population, they total the GDP, PPP at the country level. By combining national and sub-national data, the global gridded GDP, PPP per capita maps were created. Poverty maps indicate the headcount ratio of the population that lives below the poverty line (the threshold usually being \$1.25 or \$2 per day) in an administrative area that can range from high level districts to lower level wards and municipalities. High resolution poverty maps (1 sq. km) have used geo-statistics in combination with GPS-located household survey data; such maps are, however, limited to a few countries. A combination of the aforementioned maps can provide very useful insights in electrification planning activities; they can be used as a proxy for economic activity or well-being in an area, or to create some sort of "heat map" indicating a better suited electricity access target per location.
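The population-weighted adjustment of sub-national values can be illustrated as follows. This is a minimal sketch assuming a single uniform scaling factor per country; the exact procedure in [55] may differ, and the function name and all numbers are hypothetical.

```python
import numpy as np

def rescale_subnational_gdp_pc(gdp_pc: np.ndarray, population: np.ndarray,
                               national_gdp: float) -> np.ndarray:
    """Scale sub-national GDP per capita so that sum(gdp_pc * population) equals national GDP."""
    implied_total = np.sum(gdp_pc * population)
    if implied_total == 0:
        raise ValueError("implied national total is zero; cannot rescale")
    return gdp_pc * (national_gdp / implied_total)

# Hypothetical regions (illustrative values only).
gdp_pc = np.array([1000.0, 2000.0, 4000.0])   # GDP per capita by region
population = np.array([100.0, 50.0, 25.0])    # inhabitants by region
national_gdp = 450_000.0
adjusted = rescale_subnational_gdp_pc(gdp_pc, population, national_gdp)
```

The relative differences between regions are preserved, while the population-weighted sum matches the national figure.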

#### 2.1.7. Other

The list of open, energy related geospatial data is big and growing together with online GIS data platforms, map catalogues and repositories that make such datasets publicly available such as Energydata.info [56], OpenStreetMap [17], Google Earth Engine [57], IRENA global atlas [58], World Resource Institute [59], UN biodiversity lab [60], NREL GIS data & OpenEI [61,62], Earth Data [63]. Country specific GIS platforms have also been developed to support open data dissemination such as in Bolivia [64], Brazil [65], Kenya [66], Malawi [67], Uganda [68] and Namibia [69].

#### *2.2. GIS Based Electrification Modelling Frameworks*

The advent of geospatial information stimulated the development of modelling tools, methodologies and user interfaces that leverage it so as to better support electrification planning decisions. The first tools that used GIS information in order to assess local resources and support techno-economic optimization included HOMER, RETScreen, SWERA UNEP, PVGIS, HOGA and DER-CAM [9]. Such tools are outside the scope of this paper, as they mostly focus on the assessment of individual projects. The focus here is on what we shall term the "second generation" of GIS modelling frameworks. These utilize geospatial information and GIS software in order to support higher level electrification planning efforts. A short description of those most commonly used in electrification planning efforts is presented below. We also turn much of our attention to open access efforts, as they allow for reproducibility and thus can form the basis for scientific expansion.

The IMPROVES-RE program (2007–2009) is one of the first efforts to support rural electrification activities with the use of geospatial information. It is an open web-based platform aiming to support rural electrification projects and increase their impact on sustainable development and poverty alleviation in Burkina Faso [70]. It should be noted that IMPROVES-RE can be considered the predecessor of GEOSIM, a commercial tool that has been used for rural electrification planning in some countries (e.g., Tanzania [10]). GEOSIM is only marginally included in this review since it is not open source and relies on proprietary GIS software (Manifold).

Network Planner is a GIS based, open source [71] modelling framework for planning electricity infrastructure projects. Its underlying model identifies the optimal electrification technology mix for currently unserved demand centers; those include demand for households and other productive uses of electricity. Network Planner uses a modified version of Kruskal's algorithm (minimum spanning tree) in order to find the maximum length of medium voltage lines for which grid extension is cheaper than the available off-grid options (solar home systems, diesel mini-grids) [72]; it does not, however, include biomass, wind and hydro as potential energy sources. Network Planner has been applied to Liberia [73], Ghana [74], Kenya [75] and Senegal [76].
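To illustrate the kind of minimum spanning tree computation referred to above, the following is a minimal sketch of the textbook, unmodified Kruskal's algorithm. It is not taken from Network Planner's code base: the function name, the example coordinates and the use of straight-line distance are all illustrative assumptions, and the cost comparison against off-grid options is not included.

```python
from itertools import combinations
from math import dist


def kruskal_mst(points):
    """Return the MST edges (length, i, j) for a list of (x, y) points."""
    parent = list(range(len(points)))

    def find(i):
        # Find the root of i's component, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate edges: every pair of points, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((length, i, j))
    return tree


# Hypothetical demand centres on a plane, coordinates in km (assumption).
demand_centres = [(0, 0), (3, 0), (3, 4), (10, 4)]
mst = kruskal_mst(demand_centres)
total_length_km = sum(length for length, _, _ in mst)
print(f"Network length: {total_length_km:.1f} km over {len(mst)} segments")
```

A planning tool would typically replace the straight-line distance with a terrain- or cost-weighted distance and stop adding segments once a line becomes more expensive than the off-grid alternative.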

RE2nAF is an open access web mapping application that enables geographically based exploratory analysis for off-grid electricity systems in the African continent. It overlays population settlements, infrastructure features (grid network, power plants and roads) and solar resources indicators (kWh/m2) aiming to provide a comparison between diesel and PV based electricity costs for electrification [77]. All underlying GIS datasets have been made publicly available; results have been analysed and discussed in [27,78,79]. It shall be noted however that the underlying model is not available in the form of a customizable tool that could allow replication or modification by a broader user base, thus less capable of capturing specificities associated with individual projects.

The Reference Electrification Model (REM) [80,81] is an optimization tool designed to provide detailed engineering designs for electrification projects. It combines geospatial information with electricity demand and technology costs in order to estimate and compare different combinations of electrification modes (grid, mini-grids and stand-alone systems). Using satellite imagery, deep learning-based computer vision and clustering algorithms, REM can provide a high level of granularity, ranging from country to village level analysis. The model also offers the possibility to assess the impact of various factors such as demand levels, grid reliability, fuel and technology costs and the cost of non-served energy. REM has been used for electrification planning in India [82], regions of Rwanda and Uganda [83], Kenya and Colombia. REM (in liaison with its sibling tools GridForm and uLink [84]) offers a comprehensive modelling approach to rural electrification challenges. However, at the time of writing, the model is not yet open source.

IntiGIS is a plug-in application for ArcGIS that uses geospatial information in order to assess and compare the techno-economic performance of several electrification technologies; these include (a) stand-alone systems (PV, wind, diesel), (b) mini-grid systems (diesel, hybrid wind/PV/diesel) or (c) connection to grid MV lines. Results include numerical and cartographic values for each of the selected technologies, including the optimal levelized cost of electricity at each point of demand and the sensitivity to various technical parameters. IntiGIS is distributed freely; however, its operation is dependent on proprietary software (ArcGIS). Results of its application are available for Ghana [85].

The Open Source Spatial Electrification Toolkit (OnSSET) [85] is a GIS based tool developed to identify the least-cost electrification option(s) among seven alternative configurations: grid connection/extension, mini-grid systems (solar PV, wind turbines, diesel gensets, small scale hydropower) or stand-alone systems (solar PV, diesel gensets). OnSSET combines geospatial information related to infrastructure, resources, topology and socio-economic characteristics over a modelled area, in order to inform a tree search algorithm. The algorithm traverses iteratively through a sub-set of the tree nodes (un-electrified population settlements) using Locality-Sensitive Hashing (LSH) to identify the nearest neighbor and optimal electrification technology. Results indicate the optimal technology mix, capacity and investment requirements for achieving electricity access goals over pre-defined time series (these may include multiple time steps; the minimum duration of a time step is one year). The model also includes a prioritization algorithm, which defines how electrification progresses over time. Findings can be presented in various GIS compatible formats such as interactive maps, graphs and tables. OnSSET has informed IEA's energy access outlook publications [2,86], UN estimates for all Latin American and African countries [87], as well as country studies for Ethiopia [88], Nigeria [89], Kenya [90], Afghanistan [11], Madagascar [13], Tanzania [91], Zambia [91] and Benin [92]. Electrification investment scenarios for 56 countries also feature in open access web-based platforms [87,91].

Other geospatial web-based applications are also available. The Off-grid Market Opportunities Tool uses geospatial information (such as population density, proximity to transmission and road network and others) to help private companies, governments, academia and civil society to develop a high-level view of where markets for off-grid electrification may exist to better inform decision-making [93]. The Nigeria Rural Electrification Plans (NESP) [94] web platform provides least-cost geo-spatial electrification plans for five Nigerian States including detailed standalone and mini-grid assessments together with grid extension modelling [95,96]. Myanmar off-grid analytics [97] is a web tool that maps village location in Myanmar and based on available GIS data (local resources and nearby infrastructure) provides information for potential investment in off-grid electrification technologies. Ghana Energy Access Toolkit (GhEA) [98] and ECOWREX GIS [99] are mapping tools used to monitor and evaluate renewable energy resources and energy access progress in the country using geospatial datasets.

Finally, there are a few noteworthy methodologies that utilize geospatial information to inform electrification plans. They have not led to functional tools; however, they may be replicable. Tiba et al. [100] proposed, and applied in the case study of northeast Brazil, a GIS-based methodology that supports rural electrification. Kaijuka et al. [101] used GIS information to identify patterns of electricity demand in Uganda and suggest priority areas for energy investment in the country. Teske et al. [102] developed a comprehensive multi-sectoral approach aiming to provide universal access in Tanzania only through renewable energy based technologies. They used open access data and maps in order to visualize and analyze key parameters for the analysis of Tanzania's future energy situation. These included solar and wind resources, population density, access to electricity via the central power grid or mini-grids, the distribution of wealth and economic development projections, as well as energy demand projections for each settlement.

#### *2.3. The Role of Open Access Data and Modelling Frameworks in Electrification Planning*

Naturally, data and tools are designed in different contexts and may serve specific purposes. Even though their capabilities and objectives may vary per case, we find that most electrification efforts follow a conceptual framework as illustrated in Figure 1.

**Figure 1.** Conceptual flowchart of GIS -electrification modelling frameworks.

The flowchart in Figure 1 is far from exhaustive; it captures however the main components that we feel are crucial in GIS based electrification modelling. These include data collection, data transformation, model selection and configuration, result analysis and dissemination. Each component can be useful for policy making towards SDG 7; and here is where, we believe, open access can have the highest impact.

If open, such frameworks can enable the replicability and reproducibility of embedded processes as well as the reusability of input/output data. In this way they can yield rapid techno-economic screening analyses, usually at low cost, in order to delineate the high level spatial contours of immediate (or intermediate) investment plans for electrification. They can also provide a test bed for cumbersome, long-term implementation roadmaps; support the decision making process; facilitate investment mobilization and speed up the implementation process. Finally, if transparently designed, they can be audited by third parties; this is critical for assuring quality and control, and for demonstrating due diligence in administering public funds. With this in mind, we try to answer a set of questions commonly encountered in SDG 7 related planning and policy development activities. These may involve, among others, the following:

1. Population distribution and characteristics
	- a. What is the population density and how are settlements distributed in the country?
	- b. What are the settlements' characteristics?
2. Current electrification status
	- a. What is the level of access and use?
	- b. What is the expected/targeted electricity demand for different locations or types of settlements?
3. Optimal technology mix
	- a. What equipment capacity is required?
	- b. What is the potential role of different types of electricity supply technology?
4. Electrification rollout plan
	- a. Where can the national grid reach?
	- b. Where do off-grid systems step up to provide access?
	- c. Which areas may get access to electricity first?
5. Cost of electrification
	- a. What is the total investment required to achieve full access by 2030?
	- b. Where is investment most needed and in what form?
	- c. Where can households afford electricity and where should subsidization be considered?

The case of Malawi is selected since it is one of the countries with the lowest electrification rate in Sub-Saharan Africa. Following the conceptual flowchart presented in Figure 1 we set up an electrification investment scenario (EIS) using an updated version of the OnSSET modelling framework. We use entirely open access data, software and methods. It is to be noted that our findings are illustrative only. The aim is primarily to highlight the power of open access information and the positive impact they might have in supporting sustainable electrification policies.

#### **3. Electrification Policy Insights for Malawi**

#### *3.1. Data Collection and Transformation*

#### 3.1.1. Question 1 on Population Distribution & Characteristics

Malawi is a south-eastern African country with a population of about 18.62 million people [103]. The annual population growth is 2.83%, leading to an estimated population of 26.03 million in 2030 [104], and the urbanization rate is 4.41% per year [104]. The average estimated household size is 4.3 and 4.5 people for urban and rural settlements respectively [105]. Identifying the location and type of settlements is a very important first step in the geospatial electrification analysis, as it can help denote several other characteristics.
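The 2030 projection quoted above can be reproduced as simple compound growth. The sketch below assumes a 2018 base year (taken from the 2018–2030 scenario horizon rather than stated explicitly for this figure) and a constant growth rate; the function name is illustrative.

```python
def project_population(base_population, annual_growth, years):
    """Compound a population forward at a constant annual growth rate."""
    return base_population * (1 + annual_growth) ** years


# Figures quoted in the text; the 2018 base year is an assumption.
population_2018 = 18.62  # million people [103]
growth_rate = 0.0283  # 2.83% growth per year
population_2030 = project_population(population_2018, growth_rate, 2030 - 2018)
print(f"Projected 2030 population: {population_2030:.2f} million")
```

Under these assumptions the result is consistent with the 26.03 million reported for 2030.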

Population-based datasets exist mostly in the form of grids. A grid comprises a number of spatially identical cells. The size of the cell determines the spatial resolution, that is, the area it represents. Each cell is used to represent a settlement and usually comes with an attribute that specifies either the total number of people or the population density. Gridded population datasets often use interpolation or extrapolation techniques in order to fill data gaps [106]. This can cause false positives or negatives (areas that seem to be populated but in reality are not, or the reverse) and skew the electrification results. In reality, human settlements have various geometries. In a perfect modelling world, human settlements would be spatially represented by delineated vector polygons (referred to hereafter as population clusters) with a full description of the settlement's characteristics (e.g., acreage, population, number and size of households). However, datasets of this nature are available only for limited locations. To overcome this, we introduce a new methodology aiming to delineate and attribute population clusters. This is achieved by using existing gridded population datasets and a set of open source geospatial processing tools. A step by step description of the methodology is presented in Appendix B. The methodology was tested on the case study of Malawi. The derivative dataset yielded ~198,900 population clusters of various geometries and sizes, as shown in Figure 2.

**Figure 2.** Characterization and spatial distribution of population clusters in Malawi as identified by the OnSSET model.

The aggregated population in the clusters was estimated as 17.19 million people, 7.65% lower than national statistics provide. The difference can likely be attributed to compounding uncertainty in geospatial processing and was mitigated through a calibration process. Once calibrated, each cluster was then characterized as either urban or rural based on information available at the GHS (S-MOD) layer. The layer provides a standardized distinction between: (a) urban centers, (b) urban clusters (peri-urban) and (c) rural settlements. For simplification, both (b) and (c) were considered as rural in this study. The process yielded 16 big urban clusters with an aggregated population of about 3 million people (in line with national statistics). The rest were identified as rural clusters.

Urban population settlements are often located closer to the existing grid network; they show higher population density, increased economic activity and (usually) higher electricity access rates and demand; the opposite applies to rural settlements [43]. In order to capture this dynamic, poverty and GDP data [55] were extracted to each cluster as shown in Figure 3a,b, respectively. The poverty map indicates the headcount poverty rate in each cluster; the GDP map indicates the estimated total gross domestic product in each cluster. Information regarding settlements' socio-economic characteristics can be an important indicator for the selection of an "appropriate" electrification technology that will assure the long-term sustainability of this solution.

**Figure 3.** Poverty rates (**a**) and estimated purchasing power parity Gross Domestic Product (GDP-PPP) in 2011 USD values (**b**) as distributed over population clusters in Malawi.

#### 3.1.2. Question 2 on Current Electrification Status

The electrification rate in Malawi is among the lowest in the continent; it is estimated that about 49.2% of the population living in urban areas has access to electricity, while the rate is merely 3.2% in rural areas [107]. With the urban ratio in Malawi being roughly 17% [97], the national electrification rate stands at ~11%. Knowing where currently electrified clusters are located is an initial step needed for the electrification analysis. With OnSSET, already electrified clusters are used as anchor points for the electrification model. Once identified, they are, together with the known existing and planned grid lines, considered as starting points from which the grid network can be further extended. The location of electrified clusters, and the access rate within them, is information that is often not easily accessed. Thus, in order to identify already electrified settlements rapidly, a heuristic is added to OnSSET. That heuristic relies on a GIS-based multi-criteria evaluation. Note that this can easily be updated with actual figures when, and if, available (it can be the case that, with informal connections, national statistics are unhelpful in determining the extent of electrification, since they may count only formal connections). The evaluation is based on five spatial attributes, for each of which a default threshold is defined (the suggested values were reflective for Malawi; threshold values may vary per country), as shown below:


Priority factors can be assigned according to data availability and the level of confidence in the quality of the datasets. Independently, (A) serves as a priority proxy for identifying electrified locations. In case (A) is insufficient or not available, (B) is considered a useful alternative. Finally, (C) might be used if none of the above is available. Yet, the use of (A), (B) or (C) alone might cause the selection of locations that are close to a line or transformer but not necessarily electrified; therefore, these layers shall be used in combination with (D) and/or (E). Note that in the absence of all of (A)–(C), the combination of solely (D) and (E) can yield alternative proxies. In fact, for the case of Malawi, 86.7% of all clusters with night-time light greater than zero are located within 1 km of a service transformer, and 96.7% are located within 2 km. Ideally, as detailed surveys and measurements become available, the validity of (and even the need for) this heuristic might be assessed.

While the authors had access to (A) and (B), these datasets were not openly available at the time of writing. For consistency with the narrative of this paper, we relied only on the use of (D) and (E) and identified 814 electrified population clusters. Then, for each one of these clusters, we calculated the ratio between lit and non-lit area (using NTL) and provided an estimate of the electrification rate within the cluster. Finally, we used an iterative routine in Python where the electrification rate in each cluster was calibrated so that the aggregated electrified urban and rural population matches the values indicated by national statistics. Results for Malawi are illustrated in Figure 4.
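The calibration step described above can be sketched as follows. This is not the authors' routine: it is a minimal illustration that assumes a simple proportional rescaling of the NTL-derived rates, capped at 100%, repeated until the electrified total matches the target. The cluster populations and initial rates are hypothetical; only the 49.2% urban access rate comes from the text.

```python
def calibrate_rates(populations, initial_rates, target_electrified,
                    tolerance=1e-6, max_iterations=1000):
    """Rescale per-cluster rates until the electrified total hits the target."""
    rates = list(initial_rates)
    for _ in range(max_iterations):
        electrified = sum(p * r for p, r in zip(populations, rates))
        if electrified == 0:
            raise ValueError("no electrified population to rescale")
        if abs(electrified - target_electrified) <= tolerance * target_electrified:
            break
        factor = target_electrified / electrified
        # A rate cannot exceed 100%; capped clusters force another pass.
        rates = [min(1.0, r * factor) for r in rates]
    return rates


# Hypothetical urban clusters: population and lit-to-total area ratio from NTL.
urban_populations = [1000, 500, 200]
lit_area_ratios = [0.45, 0.20, 0.05]
# Target: 49.2% of the urban population has access, as quoted in the text.
urban_target = 0.492 * sum(urban_populations)

calibrated = calibrate_rates(urban_populations, lit_area_ratios, urban_target)
electrified_total = sum(p * r for p, r in zip(urban_populations, calibrated))
```

The same function would be called separately for the rural clusters with the rural target, so that both aggregates match national statistics.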

**Figure 4.** Distribution of settlements that indicate current access to electricity in Malawi. The multi-criteria evaluation yielded 16 urban and 798 rural electrified settlements with average electrification rates of 46.3% and 21.2% respectively.

According to the country's SE4ALL Action Agenda [108], the government in Malawi envisions that it will provide affordable and sustainable electricity services to all households at a level at least equivalent to Tier 1 (~38.7 kWh/household/year [109]) by 2030. Stimulated by this target and building upon the previous geospatial information, we prepare a map indicating targeted electricity levels per settlement as expected in Malawi by 2030. It should be noted that the current average household electricity consumption in Malawi is approximately 1072 kWh/year [108]. That is, all currently electrified settlements in Malawi were assigned a demand target equivalent to Tier 4 as in [109]. As illustrated in Figure 5, average electricity demand is expected to be higher in big urban clusters (Lilongwe, Blantyre, Zomba, Mzuzu); for the urban clusters identified in this analysis the median value of electricity demand was estimated at 32.4 GWh/year. For rural clusters the average electricity demand was estimated at 763 kWh/year.
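As a rough illustration of how a per-cluster demand target can be derived from the figures above, the sketch below multiplies the number of households in a cluster by an annual consumption target. The cluster population is hypothetical, and using the current average consumption of 1072 kWh/year as a stand-in for the Tier 4 target is an assumption made here for illustration only.

```python
TIER_1_KWH = 38.7  # kWh/household/year, Tier 1 value quoted in the text [109]
TIER_4_STAND_IN_KWH = 1072.0  # current average consumption [108]; assumed stand-in
HOUSEHOLD_SIZE = {"urban": 4.3, "rural": 4.5}  # people per household [105]


def residential_demand_kwh(population, settlement_type, kwh_per_household):
    """Annual residential demand of one cluster in kWh."""
    households = population / HOUSEHOLD_SIZE[settlement_type]
    return households * kwh_per_household


# Hypothetical rural cluster of 90 people targeting Tier 1.
rural_demand = residential_demand_kwh(90, "rural", TIER_1_KWH)
# Hypothetical urban cluster of 43,000 people at the Tier 4 stand-in value.
urban_demand = residential_demand_kwh(43_000, "urban", TIER_4_STAND_IN_KWH)
print(f"Rural cluster: {rural_demand:.0f} kWh/year")
print(f"Urban cluster: {urban_demand / 1e6:.2f} GWh/year")
```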

**Figure 5.** Distribution of the expected residential electricity demand per population cluster based on specified access targets (Tier 4 for urban and Tier 1 for rural clusters) in Malawi.

#### 3.1.3. Additional Background Information

Malawi's current power system has a total installed generation capacity of about 361 MW, with import capacity estimated at less than 30 MW [110], whereas the country's current (actual and latent) demand is estimated to be as much as 700 MW, leading to supply deficits [110]. According to the Master Plan and the rural electrification plan (MAREP), grid generation capacity will gradually increase to 1500 MW in 2020, 1859 MW in 2025 and 2519 MW in 2030 [108]. Grid extension is currently planned to provide electricity to 31.6% of the rural population by 2030 [108]. Beyond grid expansion, the government plans to electrify approximately 29.3% of the rural population through solar home systems, and to provide pico-solar systems to all the remaining (~39%) rural households by 2030. Other mini-grids are expected to electrify less than 0.1% of the rural population [108]. Ramping up electricity access is a capital intensive process, especially at the rate at which this is expected to take place in Malawi. According to the SE4ALL Action Agenda, the cost of the suggested interventions for Malawi is estimated at \$5.3 billion [108]. It shall be noted that similar electrification targets have been established in many developing countries nowadays [111]. Also, as a historical parallel, the electrification of 1.7 million farms in the 1930s in the USA came at a cost of \$321 million [112] (or ~\$5.7 billion in 2018 values). The development of a cost effective and sustainable rollout plan for Malawi is therefore essential in order to avoid unnecessary sunk costs and sub-optimal investment portfolios.

#### *3.2. Geospatial Modelling Framework Configuration*

The electrification investment scenario was developed so as to reflect the background information presented in the previous section. Therefore, we assumed that urban settlements target achieving Tier 4 by 2030 while rural settlements aim at Tier 1. We assumed that all currently electrified settlements are grid-connected; in these settlements full access is achieved through grid intensification only. In contrast, the un-electrified settlements are assessed for electrification using all electrification technologies. The selection of electrification technology is based on the lowest cost required to meet the specified Tier in each cluster.

It was assumed that electrification progress is gradual. That is, the national access rate was set to reach 50% in 2023 and 100% in 2030 [108]. This was achieved with the introduction of a time step function in OnSSET that allows the definition of explicit access targets per time interval. In this case we selected two time intervals, representing the first five-year investment perspective (2018–2023) and the overall target up to 2030. The time step function relies on a prioritization algorithm developed to first pick "low hanging fruit" sites. That is, the algorithm prioritizes grid intensification first; it then continues electrifying other settlements in order of increasing investment cost per capita (either grid or off-grid).
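The prioritization logic described above can be sketched in a few lines. This is a simplified illustration and not the OnSSET implementation: the settlement records, field names and the rule of electrifying whole settlements until the target share is met are all assumptions.

```python
def prioritise(settlements, target_share):
    """Select settlements to electrify until target_share of people is served."""
    total_population = sum(s["population"] for s in settlements)
    # Grid intensification sites first, then lowest investment cost per capita.
    ranked = sorted(
        settlements,
        key=lambda s: (not s["intensification"], s["cost_per_capita"]),
    )
    selected, served = [], 0
    for s in ranked:
        if served >= target_share * total_population:
            break
        selected.append(s["name"])
        served += s["population"]
    return selected, served


# Hypothetical settlements; costs are in $ per capita.
settlements = [
    {"name": "A", "population": 400, "intensification": True, "cost_per_capita": 200},
    {"name": "B", "population": 300, "intensification": False, "cost_per_capita": 30},
    {"name": "C", "population": 200, "intensification": False, "cost_per_capita": 25},
    {"name": "D", "population": 100, "intensification": False, "cost_per_capita": 120},
]
# First time step: reach a 50% access rate.
selected_2023, served_2023 = prioritise(settlements, target_share=0.5)
```

Because whole settlements are selected, the served population can overshoot the target; a finer-grained version would electrify only part of the last settlement.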

From a techno-economic standpoint the following assumptions were made. For the centralized grid generation, the average investment cost was assumed to be 1874 \$/kW, based on the generation mix expected in the country by 2030 (this might include: large hydro at 1471.5 MW (58.4%), small hydro at 103.4 MW (4.1%), solar at 550 MW (21.8%), biomass (bagasse) at 46 MW (1.8%), coal at 300 MW (12%) and diesel at 48 MW (1.9%) [108]). Similarly, the grid generating cost of electricity was assumed to be 0.076 \$/kWh. It should be noted that this value does not reflect the customer tariff but the estimated cost of producing 1 kWh of electricity. (It is assumed that taxes and subsidies are applied ex-ante. Indeed, this needs to be the case in order to rationalise the level of subsidy required for electrification.) Other costs related to grid extension (T&D costs, losses and connection costs) were also considered. Techno-economic parameters for the off-grid electrification technologies included (a) investment cost (\$/kW of installed capacity, including batteries), (b) operation and maintenance cost (% of investment cost per year), (c) capacity factor and (d) expected technology lifetime. Further, efficiency values and fuel costs were included for the diesel-based technologies. Finally, the discount rate was set at 8%. A more detailed description of all assumptions is available in Appendix C. It should be noted that, for the purpose of this paper, namely introducing a cluster based approach to geospatial electrification, several parts of OnSSET were modified considerably. One of these refers to the modelling of essential power components (e.g., type and size of substations, transformers, conductors). A more elaborate explanation of these modifications is available in Appendix D.
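To show how the off-grid parameters listed above can be combined into a single comparable figure, the sketch below uses the conventional levelized cost of electricity (discounted lifetime cost divided by discounted lifetime energy). This is a generic formulation and not necessarily the one implemented in OnSSET. Only the 8% discount rate comes from the text; all technology values are hypothetical placeholders (the actual assumptions are in Appendix C).

```python
def lcoe(capex_per_kw, om_share, capacity_factor, lifetime_years,
         discount_rate, fuel_cost_per_kwh=0.0):
    """Levelized cost of electricity in $/kWh for 1 kW of installed capacity."""
    annual_energy = capacity_factor * 8760  # kWh generated per kW per year
    annual_cost = om_share * capex_per_kw + fuel_cost_per_kwh * annual_energy
    # Present value of one unit received every year over the lifetime.
    annuity = sum(
        1 / (1 + discount_rate) ** year
        for year in range(1, lifetime_years + 1)
    )
    return (capex_per_kw + annual_cost * annuity) / (annual_energy * annuity)


DISCOUNT_RATE = 0.08  # as stated in the text

# Hypothetical technology parameters, for illustration only.
options = {
    "stand-alone PV": lcoe(4000, 0.02, 0.18, 15, DISCOUNT_RATE),
    "stand-alone diesel": lcoe(900, 0.10, 0.30, 10, DISCOUNT_RATE,
                               fuel_cost_per_kwh=0.35),
}
least_cost = min(options, key=options.get)
for name, value in options.items():
    print(f"{name}: {value:.3f} $/kWh")
print("Least-cost option:", least_cost)
```

In a geospatial model the same comparison is repeated per cluster, with the capacity factor and fuel cost varying by location.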

#### *3.3. Output, Analysis and Sensitivity*

The following section provides a brief analysis and visualization of key findings from the electrification investment scenario in relation to the policy questions posed in Section 2.3.

#### 3.3.1. Question 3 on Optimal Technology Mix

The model suggests that the national grid may electrify 32.6% of the population in 2030. More specifically, grid extension may provide electricity to 6.3 million people by 2023, increasing to 8.5 million by 2030. It is also noteworthy that new grid connections derive mainly from intensification, that is, ramping up connections in already electrified locations. Extension of the grid network was only observed in a limited number of areas due to the low access target levels set in this scenario. In contrast, off-grid technologies do play a very important role in this scenario. As indicated, the majority (67.4%) of the population in Malawi is expected to get access to electricity through off-grid stand-alone PV systems (Figure 6a).

**Figure 6.** Least cost technology split (**a**) and additional capacity (**b**) required per province to reach universal access to electricity in Malawi by 2030.

A few (<0.03%) stand-alone diesel systems were identified with no mini-grids (PV, wind, hydro or diesel) being included in the electrification mix in this scenario. In total, the country will need to increase the generating capacity by 351.8 MW by 2030 (168.1 by 2023 and 183.7 between 2023–2030) in order to meet the increased residential demand indicated by this scenario. From these, 23.9% shall derive from the deployment of stand-alone PV systems. That said and by assuming that grid generating capacity mix will be as described in Section 3.1, it is estimated that renewable technologies in Malawi can account for up to 89.5% of the additional generating capacity needed to achieve universal access goals by 2030.
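The shares quoted above can be cross-checked with a short calculation. The sketch assumes that the additional grid capacity has the same technology mix as the expected 2030 grid (the mix listed with the techno-economic assumptions) and uses the 267.5 MW grid and 84.2 MW off-grid split reported in Section 3.3.4; the negligible stand-alone diesel capacity is ignored. This is one plausible reading of how the figures arise, not a statement of the authors' exact method.

```python
# Expected 2030 grid generation mix in MW, as listed in the text [108].
grid_mix_mw = {
    "large hydro": 1471.5,
    "small hydro": 103.4,
    "solar": 550.0,
    "biomass": 46.0,
    "coal": 300.0,
    "diesel": 48.0,
}
renewable_sources = {"large hydro", "small hydro", "solar", "biomass"}

grid_renewable_share = (
    sum(mw for source, mw in grid_mix_mw.items() if source in renewable_sources)
    / sum(grid_mix_mw.values())
)

additional_grid_mw = 267.5  # additional grid capacity (Section 3.3.4)
additional_pv_mw = 84.2  # additional stand-alone PV capacity (Section 3.3.4)
total_additional_mw = additional_grid_mw + additional_pv_mw

pv_share = additional_pv_mw / total_additional_mw
renewable_share = (
    additional_grid_mw * grid_renewable_share + additional_pv_mw
) / total_additional_mw
print(f"Stand-alone PV share: {pv_share:.1%}")
print(f"Renewable share: {renewable_share:.1%}")
```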

#### 3.3.2. Question 4 on Electrification Rollout Plan

From a geospatial perspective, the national grid is expected to cover 8817 km<sup>2</sup>, or about 14.4% of the populated land in Malawi. The majority of these areas are in close proximity to the existing network; in particular, 99.5% of the grid electrified population in 2030 (as per this scenario) is located within 5 km of the current grid network. Stand-alone PV systems have been identified as the least cost electrification option in the rest of the country. In the districts of Nkhata Bay, Dedza, Ntcheu and Neno, the share of the population electrified by off-grid PV systems is expected to surpass 95%. Based on the time step function presented in Section 3.2, it was estimated that in the first five years of the analysis electricity service will reach about 8.76 million newly electrified people (Figure 7); of those, about 51.5% will get access via off-grid systems and the rest through new grid connections (Figure 8). That is, grid connections will need to increase at a rate of 198,000 households per year until 2023 and slow down to 72,000 households per year between 2023 and 2030.

**Figure 7.** Percentage (%) of electrified population per province as in 2023.

**Figure 8.** Grid vs. Off-grid split as per 2023 rollout plan estimated to electrify 50% of Malawians.

#### 3.3.3. Question 5 on Cost of Electrification

The total investment required to achieve full electrification in Malawi by 2030, is \$1.83 billion. New grid connections will require \$1.48 billion. The investment cost per household varies depending on the distance to the transmission lines as well as the population in each settlement (Figure 9). The average cost of connecting to the grid amounted to \$228.2 per person or else about \$981 per household. It should be noted that these costs reflect mainly intensification of network; grid extension to new settlements even though slightly observed in this scenario might induce higher connection costs. Investment for decentralized technologies (stand-alone PV systems) is estimated to reach \$351.9 million. The average connection cost for stand-alone PV systems was estimated at \$26.3 per person or about \$118 per household. Finally, for the few stand-alone diesel systems identified the average connection cost was estimated at \$28.5 per person or about \$128 per household. The distribution of required investment over Malawi is presented in Figure 10.
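The per-household figures above follow from the per-person costs and the average household sizes given in Section 3.1.1. The sketch below makes that conversion explicit; pairing grid connections with the urban household size and stand-alone systems with the rural one is an inference from the reported numbers, not something stated in the text.

```python
HOUSEHOLD_SIZE = {"urban": 4.3, "rural": 4.5}  # people per household [105]

# Average connection cost in $ per person, with the assumed settlement type.
cost_per_person = {
    "grid": (228.2, "urban"),
    "stand-alone PV": (26.3, "rural"),
    "stand-alone diesel": (28.5, "rural"),
}

cost_per_household = {
    technology: cost * HOUSEHOLD_SIZE[settlement_type]
    for technology, (cost, settlement_type) in cost_per_person.items()
}
for technology, cost in cost_per_household.items():
    print(f"{technology}: ${cost:.0f} per household")
```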

**Figure 9.** Connection cost per capita based on the least cost option identified in the selected electrification scenario.

**Figure 10.** Investment requirements for the achievement of universal access as defined in the selected electrification scenario for Malawi by 2030. Results are aggregated per province.

#### 3.3.4. Synthesis and Sensitivity Analysis

Comparing the results of this analysis with the government's estimates on achieving universal access in Malawi (Section 3.1), some noteworthy observations stand out. In both cases, grid connection is expected to provide electricity to approximately one third (31–33%) of the population. Also, the role of off-grid PV systems is crucial; solar home and pico-solar systems are expected to provide electricity to two thirds (67–69%) of the population by 2030. The additional capacity needed to achieve universal access based on the above statements is 267.5 MW for the grid and 84.2 MW for off-grid systems. However, government generation expansion plans reflect a general expansion vision that includes electricity demand beyond the residential sector. That also explains the disparity detected in terms of the total investment requirements. The electrification model estimated that \$1.83 billion is needed to achieve the access target specified, about a third of the government's estimate (~\$5.3 billion). The lack of more detailed information on the rollout and investment plan from the government of Malawi limits the possibility of a more in-depth comparison. Otherwise, based on these results, the least cost electrification plan and the government's plan seem to be in alignment.

It is important at this point to highlight that any electrification analysis is subject to certain assumptions on the decision parameters. In this study we ran a sensitivity analysis for six input parameters: population growth, electricity demand target, electrification rate in 2023, grid generation cost of electricity, PV cost and diesel cost. In total, ninety-six scenarios were generated and analysed, indicating that the total investment requirements to achieve universal access to electricity in Malawi range between \$1.65–7.78 billion. We find that the electricity demand target is the strongest determinant of both electrification investment and grid penetration in the total mix in comparison to the rest of the parameters studied. A more detailed description of the findings is available in Appendix E.

#### **4. Discussion**

Open GIS data and modelling tools are increasingly being used in project development and planning in the energy sector. Their adoption and use can bring considerable advantages. They can provide a fast and cost-effective way to map information that has a strong geospatial nature, such as grid infrastructure, energy resources and settlement patterns. This can, consequently, empower governments to effectively monitor progress, rationalize policy making and better inform strategic decisions in the energy field.

Electrification planning is no exception. The achievement of universal access to electricity is a crucial yet challenging task. It requires the mobilization of significant financial resources in a timely and well-coordinated manner. This, given the rapid socio-economic changes and development particularly in currently unelectrified areas, makes the availability of good, up-to-date and consistent energy related information very important. This paper attempted to map existing data, tools and methods that have been commonly used to support SDG7 implementation efforts. It was observed that their number and importance have been progressively increasing over the past few years. Building on this, we provided key additions to the OnSSET methodology to form OnSSET 2018. Specifically, with this paper we have added an updated grid extension algorithm, a time step functionality and a new prioritization algorithm that allows the development of dynamic roll out plans for electrification. In addition, we introduce a restructured code base that allows for a vector based representation of population settlements and the integration of new or upcoming geospatial datasets (MV lines, service transformers, poverty data, electricity demand for residential use, e.g., Appendix F, as well as other productive activities). Despite that, limitations still exist and should be highlighted in the context of this analysis.

For geospatial data, the level of granularity is a key concern. Usually, open access data are available at low spatial or temporal resolution. Higher granularities are available either at a premium or under special agreement with the provider. Take for example T&D infrastructure; while at high level (e.g., HV lines) data have long been open and available for public consumption, at lower level (e.g., MV or LV lines) openly available data are scattered and inconsistent. This often leads to generalized assumptions, which in turn increase uncertainty in geospatial analysis. Reliability is another common concern. Open access geospatial data can be of unknown origin or questionable quality, poorly maintained, lacking proper metadata or, in some cases, purposefully false. This makes quality assessment processes necessary for most practitioners before use, which can be time and resource consuming. Furthermore, despite progress, many socio-economic datasets potentially useful to electrification planning (e.g., energy demand data, income level and distribution, energy expenditure, location of schools, health clinics and other productive nodes, as well as mobile phone coverage) are still limited or unavailable in an open, geospatial format.

Similarly, the available GIS-based planning tools and methods have one or more limitations: they are partially or fully proprietary; they focus only on rural areas and do not provide an overall electrification expansion indication for an entire country; they deploy a limited number of electrification technologies; they have restricted representation of demand; they lack a grid expansion algorithm; or they do not account for a dynamic change of the bulk grid electricity supply. In OnSSET, for example, the electricity demand is exogenous (layers imported from external calculations) and provides only an educated estimate. In addition, demand currently reflects only residential electrification targets. The model considers a set of static end-states (myopic optimization); thus, it does not use perfect foresight. The load profile is also represented on the basis of peak-to-average demand; that is, reliability is incorporated but not optimized for. Despite its limitations, the basic OnSSET model is simple and open, allowing for analysis tailored to specific needs, including improvements on all of the above. Looking forward, we identify a clear need for synergies between the existing initiatives in the geospatial electrification field. The development of a single tool that incorporates all dimensions mentioned above could be theoretically feasible. Yet, we suggest that the development of a collaborative, open-source environment including interoperable data and tools with different characteristics might be more desirable. It could conceivably cover a wider range of applications and solutions, as well as harness a greater volume of analysts and communities. This would consequently require that both data and tools be democratized so that electrification analytics become accessible to more actors. Thus, global partnerships that promote collaboration between stakeholders who collect, create, manage or use geospatial data are particularly needed. A notable effort in this regard is the multi-country, multi-agency 'round-table' effort championed by DFID, as well as the fledgling Open Tools, Integrated Modelling and Upskilling for Sustainable-development (OpTIMUS) community of practice. These might be enhanced through an inclusive, open and scalable platform that allows universal access to global data layers as well as customizable modelling solutions. Such a platform would help build spatial literacy in the field of energy access and enable better decision making to deliver SDG7.

#### **5. Conclusions and Final Remarks**

As elements in a growing energy planning ecosystem, open access geospatial data and models have started a paradigm shift; a shift that constitutes a significant improvement over conventional planning efforts. Their availability and accessibility can help policy makers, government agencies, investors and project developers to overcome paucity of information and better inform decision making mechanisms. Despite its limitations, we hope that this study will help break new ground in the field of geospatial electrification planning and accelerate progress towards the achievement of SDG7. To that end, the code base of the updated electrification toolkit (OnSSET 2018) as well as input/output files for all electrification scenarios included in this paper are publicly available in [85] and open to review, update and/or reproduction.

**Author Contributions:** Conceptualization, A.K. and M.H.; Methodology, A.K., A.S. and B.K.; Software, A.K., B.K., A.S. and C.A.; Validation, M.H.; Formal Analysis, A.K.; Investigation, A.K.; Resources, A.K. and B.K.; Data Curation, A.K., B.K. and A.S.; Writing—Original Draft Preparation, A.K.; Writing—Review & Editing, A.K., M.H., C.A.; Visualization, A.K. and B.K.; Supervision, M.H.; Project Administration, M.H.; Funding Acquisition, M.H. and A.K.

**Funding:** This research was funded by the World Bank under the contract number 7185716.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Listing and Gaps of GIS Data in Geospatial Electrification Modelling**


#### **Table A1.** GIS data gap analysis in electrification modelling.



#### **Appendix B. Methodology to Generate Population Clusters Using the High Resolution Settlement Layer and GIS Processing**

The following methodology (Figure A1) has been developed in order to create population clusters based on open access population datasets from the HRSL and a series of processes developed in QGIS, an open source desktop geographic information system application.

**Figure A1.** Methodological flowchart of creating population clusters using the High Resolution Settlements Layer and Geographic Information Systems processing.

#### *Appendix B.1. Resampling Population Layer*

The original spatial resolution of HRSL is 900 m². In the case of Malawi this translates to 3.2 million grid cells that have to be processed in the GIS environment. This is problematic due to (a) computational limitations of the GIS software used (QGIS) and (b) memory/running time complications of the electrification model used (OnSSET). Therefore, reducing the spatial resolution (resampling) of HRSL is a sensible, and highly suggested, first step in the process. A final resolution of 0.1 km²/10,000 m² is a good compromise, as it significantly reduces the computational burden while maintaining a good level of granularity. A resolution lower than 0.1 km² (or 10,000 m²) will cause undesirable distortion of the layer's values and is therefore not considered a viable option.

*Suggested tool in QGIS: "r.resamp.stats".*

Notes/Comments: This tool is part of GRASS GIS and enables the user to resample raster datasets. As of the time of writing, this is the only tool included in QGIS 3.2 that allows for increasing the cell size while automatically aggregating the raster values.
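The aggregation performed during resampling can be illustrated outside QGIS. The following is a minimal sketch in plain Python, assuming that cell values are population counts and that coarse cells are formed by summing non-overlapping blocks; the aggregation method, the function name `aggregate_sum` and the sample grid are illustrative assumptions and are not taken from the workflow above.

```python
def aggregate_sum(grid, factor):
    """Coarsen a 2-D grid by summing non-overlapping factor x factor blocks.

    Assumption: values are counts, so summing preserves the total population.
    """
    rows, cols = len(grid), len(grid[0])
    coarse = []
    for r0 in range(0, rows, factor):
        coarse_row = []
        for c0 in range(0, cols, factor):
            # Edge blocks are clipped to the grid extent.
            block_sum = sum(
                grid[r][c]
                for r in range(r0, min(r0 + factor, rows))
                for c in range(c0, min(c0 + factor, cols))
            )
            coarse_row.append(block_sum)
        coarse.append(coarse_row)
    return coarse


# Hypothetical 4 x 4 fine-resolution population grid.
fine = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 6, 0],
    [0, 7, 0, 8],
]
coarse = aggregate_sum(fine, 2)
print(coarse)
```

Summation keeps the total population unchanged, which is a convenient property for the population check in Appendix B.2.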

#### *Appendix B.2. Removing Redundant Cells*

HRSL population density values derive from interpolating recent census data [113]. This creates grid cell "neighborhoods" in the raster that have the exact same value to the 16th digit. These grid cells are considered false positives and thus shall be removed. In order to eliminate falsely populated grid cells, a threshold value is defined through an iterative process described below.

Step 1. Calculate the total population in the area of interest.

*Suggested tool in QGIS: "Zonal statistics" from the QGIS package.*

Step 2. Initialize the threshold value; the initial value can be anything within the density range in the area of interest.

Notes/Comments: The threshold value can be determined by examining the distribution of pixel values for the raster dataset. Also, removing low populated grid cells increases the share of coinciding built-up areas in comparison to Google map tiles.

Step 3. Zero out all grid cells with raster value below the threshold.

*Suggested tool in QGIS: "Raster calculator" e.g., (HRSL > 6) \* HRSL removes all values below 6.*

Notes/Comments: The "Raster calculator" rounds the coordinates for the raster to the first six digits. Therefore, there might be a slight offset between the datasets after using the tool. Since this raster is the base of the clusters the raster calculator should be used twice; once to multiply by one and once to carry out the operation described above. This way there will not be any offset between the different datasets used in the analysis.

Step 4. Re-calculate the total population in the area of interest and compare it with the total from Step 1. If the population loss is larger than 10%, repeat from Step 2 using a lower threshold value. Repeat until the loss is acceptable.
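Steps 1 to 4 amount to a simple search for the highest threshold that keeps the population loss within 10%. A minimal sketch, assuming a flat list of cell values, a fixed decrement and hypothetical sample data (the function names are illustrative and do not correspond to QGIS tools):

```python
def apply_threshold(cells, threshold):
    # Same effect as the raster calculator expression (HRSL > t) * HRSL.
    return [value if value > threshold else 0.0 for value in cells]


def find_threshold(cells, start, max_loss=0.10, step=1.0):
    """Lower the threshold from `start` until the population loss is acceptable."""
    baseline = sum(cells)                              # Step 1
    threshold = start                                  # Step 2
    while threshold > 0:
        kept = sum(apply_threshold(cells, threshold))  # Step 3
        loss = (baseline - kept) / baseline            # Step 4
        if loss <= max_loss:
            return threshold, loss
        threshold -= step                              # retry with a lower value
    return 0.0, 0.0                                    # nothing removed


# Hypothetical cell values; a real run would read them from the raster.
cells = [0.5, 1.0, 2.0, 3.0, 8.0, 20.0, 40.0, 60.0]
threshold, loss = find_threshold(cells, start=25.0)
print(threshold, round(loss, 4))
```

The decrement `step` and the starting value are free choices; the procedure above only requires that each retry uses a lower threshold.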

#### *Appendix B.3. Reclassify HRSL*

The re-classification of the HRSL is necessary for the population clusters to be formed uniformly during the next step. This process creates the conditions for all adjacent cells to become part of the same cluster (Figure A2-left). If not re-classified, the clusters will be composed of multi-part polygons as shown in Figure A2-right.

*Suggested tool in QGIS: "Reclassify by table" from the QGIS package.*

Notes/Comments: There are a number of tools that can be used to reclassify a raster layer in QGIS. This specific tool is from the same package as the raster calculator and therefore does not create any further distortion or offset.

**Figure A2.** Uniform (**left**) and multi-part (**right**) population clusters created from HRSL.

#### *Appendix B.4. Convert the HRSL Raster to Vector Polygons*

In this process QGIS is used in order to convert the format of the processed HRSL from raster to vector polygons.

*Suggested tool in QGIS: "Polygonize" from the GDAL package.*

#### *Appendix B.5. Buffering Polygons*

A buffer of 10 m is applied to the polygons. This is due to QGIS treating polygons with one common corner as separate even in cases in which they touch. By applying a small buffer, it is ensured that these polygons are overlapping.

*Suggested tool in QGIS: "Buffering vectors" from the GDAL package.*

#### *Appendix B.6. Dissolving Polygons*

Dissolving the polygons ensures that overlapping polygons from the previous step are all merged.

*Suggested tool in QGIS: "v.dissolve" from the GRASS package.*
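The combined effect of reclassifying, polygonizing, buffering and dissolving is that populated cells sharing an edge or a corner end up in the same cluster. On the raster itself, this is comparable to labelling 8-connected groups of cells. The sketch below illustrates that idea in plain Python; it is an analogy under that assumption and not the vector-based QGIS procedure described above, and the sample grid is hypothetical.

```python
from collections import deque


def label_clusters(populated):
    """Label 8-connected groups of True cells; 0 marks unpopulated cells."""
    rows, cols = len(populated), len(populated[0])
    labels = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if not populated[r][c] or labels[r][c]:
                continue
            next_id += 1
            labels[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and populated[nr][nc] and not labels[nr][nc]):
                            labels[nr][nc] = next_id
                            queue.append((nr, nc))
    return labels, next_id


population = [
    [5, 0, 0, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 0, 4],
]
mask = [[value > 0 for value in row] for row in population]
labels, n_clusters = label_clusters(mask)
print(n_clusters, labels)
```

The two cells that touch only at a corner receive the same label, which mirrors the purpose of the 10 m buffer in Appendix B.5.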

#### *Appendix B.7. Remove Gaps and/or Slivers inside Polygons*

Converting a raster layer to vector polygons, as in the previous steps, can generate gaps and slivers in some of the polygons due to holes in the raster layer and due to the buffering process. These need to be removed/dissolved so that uniform population clusters are created.

*Suggested tool in QGIS: "Delete holes" from the QGIS package.*

Notes/Comments: It is important to only cover holes and slivers caused by the clustering process and not naturally occurring holes (e.g., lakes, forests etc.). Therefore, a maximum area is specified in the tool and all smaller holes are deleted.

#### *Appendix B.8. Assigning Population Values to Clusters*

Due to the population being reclassified when generating the clusters there is no population value connected to the clusters. In order to assign population values the raster values are aggregated for every cluster.

*Suggested tool in QGIS: "Zonal statistics" from the QGIS package.*
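The aggregation behind this step can be sketched as a zonal sum: every cell value is added to the total of the cluster that contains it. The example below assumes a raster of cluster identifiers aligned with the population raster (0 meaning outside all clusters); the data and the function name `zonal_sum` are hypothetical.

```python
def zonal_sum(values, zones):
    """Sum raster values per zone id; zone 0 is treated as 'no cluster'."""
    totals = {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone:
                totals[zone] = totals.get(zone, 0) + value
    return totals


population = [
    [5, 0, 0, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 0, 4],
]
cluster_ids = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 2],
]
totals = zonal_sum(population, cluster_ids)
print(totals)
```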

#### **Appendix C. Techno-Economic Input Parameters in OnSSET**

**Table A2.** Techno-economic parameters for off-grid technologies included in the electrification analysis.


\* An indicative capacity factor was specified externally for diesel based technologies and hydro; capacity factor values for solar and wind were estimated by the model based on natural resource availability at each location; \*\* The diesel pump price was assumed at ~1.2 \$/liter (900 MWK) [114,115]; exchange ratio used as \$1 to 714.3 MWK.


**Table A3.** Techno-economic parameters related to the operation of the centralized grid and its extension process.

\* The cost of grid components adopted in this study was primarily based on reference values provided by [118,119]; the selected values reflect authors' best estimate for the case of Malawi and they are only indicative.


**Table A4.** Expected generation mix, investment costs and generating costs for the expected centralized grid technologies in Malawi in 2030.

\* Estimated overnight capital costs were retrieved from [117,120]; \*\* Estimates for Hydro, Solar and Biomass were retrieved from [121]; estimates for coal from [120] and for diesel from [122].

#### **Appendix D. Updated Grid Extension Algorithm**

The following paragraphs describe the modifications made to the grid extension algorithm in OnSSET 2018. In the previous version of the tool, the grid extension algorithm was based on the square geometry of a grid mesh with equally sized, adjacent grid cells [123]. The integration of population clusters in the analysis required modifying the algorithm so that it is able to process vector data (polygons) of various geometry, size and spatial orientation. We describe the updated process in five distinct steps.

Step 1. Sizing transmission lines (HV or MV)

As a first step, the algorithm decides the type of extension line (HV or MV) to be used to connect a settlement; the decision is based on two parameters as presented in (A1):

$$\text{transmission\_line\_type} = \begin{cases} \text{MV}, & \text{grid\_distance} \le \text{MV\_max\_reach} \;||\; \text{peak\_load} \le \text{max\_MV\_load} \\ \text{HV}, & \text{otherwise} \end{cases} \tag{A1}$$

where:

$$\text{peak\_load} = \frac{\dfrac{\text{cluster\_electricity\_demand} \div (1 - \text{T\&D losses})}{8760}}{\text{Base to peak load ratio}} \tag{A2}$$

$$\text{max\\_MV\\_load} = \text{MV}\_{type} \times \text{MV}\_{amp\\_limit} \times \frac{\text{HVcost}}{\text{MVcost}} \tag{A3}$$
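Equations (A1)–(A3) translate directly into a few lines of code. The sketch below is only an illustration of the decision rule: the `||` in (A1) is read as a logical OR, units are not checked, and all numerical inputs are hypothetical placeholders rather than values from Appendix C.

```python
HOURS_PER_YEAR = 8760


def peak_load(cluster_electricity_demand, td_losses, base_to_peak_load_ratio):
    """Eq. (A2): annual demand grossed up for T&D losses, averaged, then scaled to peak."""
    average_load = cluster_electricity_demand / (1 - td_losses) / HOURS_PER_YEAR
    return average_load / base_to_peak_load_ratio


def max_mv_load(mv_type, mv_amp_limit, hv_cost, mv_cost):
    """Eq. (A3)."""
    return mv_type * mv_amp_limit * hv_cost / mv_cost


def transmission_line_type(grid_distance, mv_max_reach, load, max_load):
    """Eq. (A1), with '||' interpreted as a logical OR (an assumption)."""
    if grid_distance <= mv_max_reach or load <= max_load:
        return "MV"
    return "HV"


# Hypothetical inputs, for illustration only.
load = peak_load(876_000, td_losses=0.2, base_to_peak_load_ratio=0.5)
limit = max_mv_load(mv_type=33, mv_amp_limit=10, hv_cost=53_000, mv_cost=9_000)
print(round(load, 1), round(limit, 1), transmission_line_type(80, 50, load, limit))
```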

**Figure A3.** Estimating transmission line length from existing grid network.

Then, the mileage of additional transmission lines required to reach the cluster is estimated using (A4)–(A7) as follows:

$$\text{transmission\\_line} = \text{grid\\_distance} \times \text{No\\_of\\_transmission\\_lines} \tag{A4}$$

where:

$$\text{No\_of\_transmission\_lines} = \frac{\text{peak\_load}}{\text{line\_amperage} \times \text{line\_type}} \tag{A5}$$

$$\text{line\_amperage} = \frac{\text{substation\_type}}{\text{transmission\_line\_type}} \tag{A6}$$

$$\text{substation\_type} = \begin{cases} \text{MV to MV}, & \text{line\_type}: \text{MV} \\ \text{HV to MV}, & \text{line\_type}: \text{HV} \end{cases} \tag{A7}$$
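The chain (A4)–(A7) can be sketched as follows. Two readings are assumed here and are not confirmed by the text: `substation_type` is taken to be the capacity of the substation associated with the chosen line type, and the line type entering (A5) and (A6) is taken to be a numeric rating of that line. The numerical inputs are hypothetical.

```python
def line_amperage(substation_type, line_type):
    """Eq. (A6): substation capacity divided by the line rating (assumed numeric)."""
    return substation_type / line_type


def no_of_transmission_lines(peak_load, amperage, line_type):
    """Eq. (A5)."""
    return peak_load / (amperage * line_type)


def transmission_line(grid_distance, n_lines):
    """Eq. (A4): total line length needed to reach the cluster."""
    return grid_distance * n_lines


# Hypothetical values: 2500 units of peak load, a 10,000-unit substation,
# a line rating of 33 and a distance of 20 km to the existing grid.
amperage = line_amperage(substation_type=10_000, line_type=33)
n_lines = no_of_transmission_lines(peak_load=2_500, amperage=amperage, line_type=33)
length = transmission_line(grid_distance=20, n_lines=n_lines)
print(round(n_lines, 3), round(length, 3))
```

The equations as printed return a fractional number of lines; whether the model rounds this up to whole lines is not stated in this appendix.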

Step 2. Sizing transformers and connection to sub-station

Then, the algorithm estimates the number of service transformers required to provide full coverage of the population cluster:

$$\text{No\_of\_service\_transformers} = \max\left\{ \frac{S_{\max}}{\text{service\_transformer\_type}},\; \frac{\text{total\_nodes}}{\text{nodes\_per\_transformer}} \right\} \tag{A8}$$

where:

$$S\_{\max} = \frac{peak\\_load}{power\\_factor} \tag{A9}$$

$$\text{transformer\_area\_coverage}_{\max} = \pi \times \text{LV\_line\_length}_{\max}^{2} \tag{A10}$$

$$\text{total\_nodes} = \frac{\text{cluster\_population}}{\text{No\_of\_people\_per\_household}} + \text{production\_nodes} \tag{A11}$$

$$\text{No\_of\_people\_per\_household} = \begin{cases} 4.5, & \text{cluster type}: \text{Urban} \\ 4.3, & \text{cluster type}: \text{Rural} \end{cases} \tag{A12}$$
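A compact sketch of the transformer sizing in (A8), (A9), (A11) and (A12) is given below. The household sizes are those of (A12); every other number is a hypothetical placeholder, and the result is left unrounded because the equations as printed do not state a rounding rule.

```python
def people_per_household(cluster_type):
    """Eq. (A12)."""
    return 4.5 if cluster_type == "Urban" else 4.3


def total_nodes(cluster_population, cluster_type, production_nodes=0):
    """Eq. (A11): households plus productive nodes."""
    return cluster_population / people_per_household(cluster_type) + production_nodes


def no_of_service_transformers(peak_load, power_factor, service_transformer_type,
                               nodes, nodes_per_transformer):
    """Eq. (A8), using S_max from Eq. (A9)."""
    s_max = peak_load / power_factor
    return max(s_max / service_transformer_type, nodes / nodes_per_transformer)


# Hypothetical rural cluster: 4300 people, 90 units of peak load.
rural_nodes = total_nodes(4_300, "Rural")
n_rural = no_of_service_transformers(90, 0.9, 50, rural_nodes, 300)

# Hypothetical urban cluster where the capacity term dominates.
urban_nodes = total_nodes(450, "Urban")
n_urban = no_of_service_transformers(900, 0.9, 50, urban_nodes, 300)
print(round(n_rural, 2), round(n_urban, 2))
```

In the rural example the node limit is binding, whereas in the urban example the transformer capacity is; the `max` in (A8) selects whichever constraint requires more transformers.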

The transformer load is the sum of the load of all households connected to a single transformer:

$$\text{transformer\_load} = \frac{\text{peak\_load}}{\text{No\_of\_service\_transformers}} \tag{A13}$$

It should be noted that the transformers are assumed to be evenly spaced within a cluster, thus the average distance from the service transformer to the substation is 2/3 of the cluster's radius, and the average distance between two service transformers is twice the transformer radius:

$$
\text{transformer\\_distance} = \frac{2}{3} \times \text{cluster\\_radius} \tag{A14}
$$

$$\text{cluster\\_radius} = \sqrt{\frac{\text{cluster\\_area}}{\pi}}\tag{A15}$$

$$\text{transformer\_radius} = \sqrt{\frac{\dfrac{\text{cluster\_area}}{\text{No\_of\_service\_transformers}}}{\pi}} \tag{A16}$$

If the estimated load moment is larger than 9643 (see Appendix C) an MV line is used to connect the service transformer to the substation; if not, a LV line is used. If connected by LV lines, each service transformer is assumed to have its own connection to the substation. With MV lines, multiple transformers may be connected in series:

$$\text{load\_moment} = \text{transformer\_distance}_{\text{average}} \times \text{transformer\_load} \tag{A17}$$

$$\text{connection\_line} = \begin{cases} \frac{2}{3} \times \text{cluster\_radius} \times \text{No\_of\_service\_transformers}, & \text{load\_moment} \le 9643 \text{ (LV)} \\ 2 \times \text{transformer\_radius} \times \text{No\_of\_service\_transformers}, & \text{load\_moment} > 9643 \text{ (MV)} \end{cases} \tag{A18}$$
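The choice between LV and MV connections in (A13)–(A18) can be sketched as below. The threshold 9643 is taken from the text; the function name, the units and the example inputs are hypothetical, and the average transformer distance in (A17) is taken to be the distance given by (A14).

```python
import math

LOAD_MOMENT_LIMIT = 9643  # threshold quoted in the text


def cluster_radius(cluster_area):
    """Eq. (A15)."""
    return math.sqrt(cluster_area / math.pi)


def transformer_radius(cluster_area, n_transformers):
    """Eq. (A16)."""
    return math.sqrt(cluster_area / n_transformers / math.pi)


def connection_to_substation(cluster_area, peak_load, n_transformers):
    """Return the line class and total length according to Eqs. (A13)-(A18)."""
    transformer_distance = 2 / 3 * cluster_radius(cluster_area)  # Eq. (A14)
    transformer_load = peak_load / n_transformers                # Eq. (A13)
    load_moment = transformer_distance * transformer_load        # Eq. (A17)
    if load_moment > LOAD_MOMENT_LIMIT:
        # MV: transformers connected in series, spaced two radii apart.
        return "MV", 2 * transformer_radius(cluster_area, n_transformers) * n_transformers
    # LV: each transformer has its own connection to the substation.
    return "LV", transformer_distance * n_transformers


area = math.pi * 9  # hypothetical cluster with a radius of 3 length units
low = connection_to_substation(area, peak_load=100, n_transformers=4)
high = connection_to_substation(area, peak_load=40_000, n_transformers=4)
print(low, high)
```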

**Figure A4.** Estimating the size of transformers and their connection to sub-station.

#### Step 3. Sizing distribution lines (LV)

The area of each service transformer is then divided into a number of smaller circles (Figure A5) each one representing a demand node, assumed to be equally spaced within the larger circle. The distance between two demand nodes is defined as twice the radius of one of the smaller circles. The calculations do not consider the routing of LV lines from the transformer.

**Figure A5.** Sizing the LV network for each transformer in the population cluster.

The total length of LV lines per transformer is defined as described in (A19):

$$\text{LV\_km\_per\_transformer} = 2 \times r_{\text{demand\_node}} \times \text{total\_nodes} \tag{A19}$$

where:

$$r_{\text{demand\_node}} = \sqrt{\frac{\text{demand\_node\_area}}{\pi}} \tag{A20}$$

$$\text{demand\_node\_area} = \frac{\text{transformer\_area\_coverage}_{\max}}{\text{total\_nodes}} \tag{A21}$$

Finally, the total length of distribution (LV) lines per cluster is estimated by (A22):

$$\text{distribution\_line} = \text{LV\_km\_per\_transformer} \times \text{No\_of\_service\_transformers} \tag{A22}$$
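A short sketch of (A19)–(A22) follows. The coverage area, node count and number of transformers are hypothetical inputs chosen so that the arithmetic is easy to follow; they do not come from the Malawi case.

```python
import math


def lv_km_per_transformer(transformer_area_coverage_max, total_nodes):
    """Eqs. (A19)-(A21): LV length served by one transformer."""
    demand_node_area = transformer_area_coverage_max / total_nodes  # Eq. (A21)
    r_demand_node = math.sqrt(demand_node_area / math.pi)           # Eq. (A20)
    return 2 * r_demand_node * total_nodes                          # Eq. (A19)


def distribution_line(lv_km, n_transformers):
    """Eq. (A22): total LV length in the cluster."""
    return lv_km * n_transformers


# Hypothetical: coverage of pi km^2 (1 km reach), 100 nodes, 3 transformers.
lv_km = lv_km_per_transformer(math.pi, 100)
total_lv = distribution_line(lv_km, 3)
print(round(lv_km, 3), round(total_lv, 3))
```

With these inputs each demand node has a radius of 0.1 km, so 100 nodes spaced two radii apart give 20 km of LV line per transformer.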

Step 4. Estimating the total investment cost for grid extension per cluster

In the last step, the total cost of grid extension per cluster is estimated by taking into account all partial costs as described in (A23):


#### **Appendix E. Detailed Results of Sensitivity Analysis**

The sensitivity analysis in this study was conducted in order to identify the most critical parameters and how they affect the least cost electrification mix and investment requirements. Six parameters were selected as shown in Table A5. Option 1 (or Baseline) includes the values as presented in previous paragraphs and used in the analysis so far. Option 2 modifies these values; for parameters 1–3 the modifications reflect a more aggressive electrification strategy; for parameters 4–6 they suggest a cost increase in selected technologies. Finally, option 3 suggests an alternative approach to the electricity demand targeted for each population cluster, based on available poverty and GDP data (described in detail in Appendix F). In total, ninety-six scenarios were generated and analysed.


**Table A5.** List of parameters used in the sensitivity analysis and their selected available options.

\* Based on the highest variant of population growth as in [104].
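The scenario count is consistent with a full factorial design in which the electricity demand target (parameter 2) has three options and the remaining five parameters have two options each, since 3 × 2⁵ = 96. The sketch below enumerates such a design; the number of options per parameter is inferred from the description above and should be read as an assumption, and the parameter labels are illustrative.

```python
from itertools import product

# Assumed number of options per parameter (parameter 2 has a third option).
options_per_parameter = {
    "population_growth": 2,
    "electricity_demand_target": 3,
    "electrification_rate_2023": 2,
    "grid_generation_cost": 2,
    "pv_cost": 2,
    "diesel_cost": 2,
}

# Each scenario is a tuple of option numbers, one per parameter.
scenarios = list(product(*(range(1, n + 1) for n in options_per_parameter.values())))
print(len(scenarios))
```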

Across all scenarios, the total investment requirements to achieve universal access to electricity in Malawi ranged between \$1.65–7.78 billion. As seen in Figure A6, parameter 2 shows very low variance in all options studied. That is, parameter 2 is quite a strong determinant of electrification investment in comparison to the rest of the parameters studied. A higher level of targeted electricity demand in population clusters significantly raises the total cost of electrification. Parameters 1, 3 and 6 do have a noticeable, yet not as strong, impact on the total investment; option 2 of these parameters indicates a higher median value. For parameter 1 this is naturally explained by higher population growth, which also causes the min/max values to shift upwards. The second option for parameter 3 mandates the electrification of a bigger part of the population in the first five years; this results in higher penetration of off-grid systems, which in turn are more capital intensive per unit of capacity (\$/kW). A higher diesel price leads to lower penetration of diesel based systems, which are replaced either by other off-grid systems or by grid connection; both alternatives have a higher cost per capacity unit, explaining the variation observed in parameter 6. Finally, only minor changes in total investment were observed when varying parameters 4 and 5.

The share of grid connected population ranges between 32.6–80.1%. Parameter 2 is the strongest determinant of grid penetration in the total mix, therefore defining the above limits. Parameters 3, 4 and 5 can induce a maximum increase in grid share of 1.3%, 1.2% and 3%, respectively, between options 1 and 2. No effect on grid share was observed for parameter 6. The share of stand-alone systems varies inversely, ranging between 10.2–64.7%. The mini-grid share ranges between 0–0.7%, with the upper limit observed only when parameters 1, 2, 4 and 5 are set to option 2. The interplay between decentralized technologies is notably affected by parameter 5. Higher PV costs allow the penetration of other renewable off-grid technologies in the optimal mix; the cost of diesel affects the optimal mix only when parameter 5 is set to option 2, otherwise its impact is negligible.

**Figure A6.** Investment variation for the achievement of universal electrification in Malawi as retrieved by the 96 scenarios developed in this study. The scenarios reflect the modification of six selected parameters and represent the impact of each one on the total investment.

#### **Appendix F. The Custom Residential Electricity Demand Indicative Target (CREDIT) Layer**

A customized raster layer indicating residential electricity demand target over Malawi has been developed by using open access poverty and GDP maps as described in Section 2.1. First, an equal interval classification technique using five classes was applied on the poverty map; the breaking values indicated intervals between 0–100% of headcount poverty rate. The GDP map was classified based on geometric intervals since this technique is particularly useful for datasets that are not normally distributed; it creates a balance between highlighting changes in the middle values and the extreme values; therefore, a good fit for the GDP data available in this case. Then, the two layers were reclassified as shown in Table A6 and added under equal weighted factors (0.5) using raster calculation.


**Table A6.** Re-classification of GDP and poverty layers into five classes. I1–5 are the geometric intervals of the classification process.

The output provided an indicative demand target index ranging from 0 to 5; 0 indicating the lowest potential target and 5 the highest. Finally, using 1-D linear interpolation the above target index was translated into kWh/capita/year as shown in Figure A7. The interpolation was based on the multi-tier framework for energy access adapted to reflect the situation in Malawi; that is, the lowest and highest values were set at 8.8 and 680.2 kWh/capita/year for Malawi.
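The two operations described in this appendix, an equally weighted sum of class scores followed by 1-D linear interpolation, can be sketched as follows. Only the weights (0.5), the index range (0 to 5) and the end values (8.8 and 680.2 kWh/capita/year) come from the text; the class scores are hypothetical because Table A6 is not reproduced here, and interpolating between the two end points only, without intermediate tier breakpoints, is a simplifying assumption.

```python
def demand_index(poverty_score, gdp_score, weight=0.5):
    """Equally weighted sum of the two reclassified layers."""
    return weight * poverty_score + weight * gdp_score


def index_to_kwh(index, low=8.8, high=680.2, index_min=0.0, index_max=5.0):
    """Linear interpolation of the target index into kWh/capita/year."""
    fraction = (index - index_min) / (index_max - index_min)
    return low + fraction * (high - low)


# Hypothetical cell with a poverty score of 2 and a GDP score of 3.
index = demand_index(2, 3)
target = index_to_kwh(index)
print(index, round(target, 1))
```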

**Figure A7.** Customized layer indicating electricity demand target levels (in kWh/capita/year) over Malawi, based on openly available poverty and GDP maps.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Evaluation of Energy Distribution Using Network Data Envelopment Analysis and Kohonen Self Organizing Maps**

## **Thiago Gomes Leal Ganhadeiro <sup>1</sup>, Eliane da Silva Christo <sup>1,</sup>\*, Lidia Angulo Meza <sup>2</sup>, Kelly Alonso Costa <sup>3</sup> and Danilo Pinto Moreira de Souza <sup>1</sup>**


Received: 1 August 2018; Accepted: 5 October 2018; Published: 9 October 2018

**Abstract:** This article presents an alternative way of evaluating the efficiency of the electric distribution companies in Brazil. This assessment is currently performed and designed by the National Electric Energy Agency (ANEEL), a Brazilian regulatory agency, to regulate energy prices. This involves calculating the *X*-factor, which represents the efficiency evolution in the price-cap regulation model. The proposed model aims to use a network Data Envelopment Analysis (DEA) model with the network dimension as an intermediate variable and to use Kohonen Self-Organizing Maps (SOM) to correct the difficulties presented by environmental variables. In order to find which environmental variables influence the efficiency, factor analysis was used to reduce the dimensionality of the model. The analysis still uses multiple regression with the previous efficiency as the dependent variable and the four factors extracted from factor analysis as independent variables. The SOM generated four clusters based on the environment and the efficiency for each distributor in each group. This allows for a better evaluation of the correction in the *X*-factor, since it can be conducted inside each cluster with a maintained margin for comparison. It is expected that the use of this model will reduce the margin of questioning by distributors about the evaluation.

**Keywords:** data envelopment analysis; Kohonen self-organizing maps; factor analysis; multiple regression; energy efficiency

#### **1. Introduction**

In Brazil, the electricity sector is still treated as a natural monopoly, which is regulated by the government through its own regulatory agency, the National Electric Energy Agency (ANEEL), in order to prevent possible abuses of market power.

In order to stimulate the search for efficiency [1,2], ANEEL currently uses the price-cap model, which provides periodic price corrections based on several factors, such as the service quality and the productivity of the energy distributor.

The efficiency in the price-cap regulation model is calculated by the *X*-factor, which is related to increases in productivity. This leads to changes in the price that distributors can charge the consumer. As it affects the profitability of the regulated entities, it is important that the calculation of the *X*-factor be grounded in a consistent explanation in order to convince society and industry that the calculated value is fair to everyone. Inconsistencies can serve as a basis for discussions and may generate points for modifying the calculation method used [3].

The model used by ANEEL allows an objective evaluation of the efficiency of each company, since each is compared with the others, and considers the influence of the environment where the distributors operate, including factors that can affect their efficiency. However, there are some points in ANEEL's analysis that need attention.

First, in the DEA model used, operational expenditure (OPEX) is considered a process input, while network extension, number of consumers and consumption are outputs of the process. However, the network extension variable is peculiar: depending on the analysis, it can behave as an input, since it is used to generate consumption and to serve the consumers, or as an output, since maintaining it demands operational cost.

Another point is the use of multiple regression to correct the efficiencies found. It can be argued that using multiple regression implies that only environmental factors influence the efficiency of the distributor, so that the intrinsic factors related to the operation and management of the distributors themselves would not be essential for efficiency, which runs counter to the very purpose of the price-cap model.

The evaluation of efficiency in electric energy distributors in Brazil is addressed in several articles. The difference between them is mainly in the combination of the techniques used. The common goal is always to improve the results with the reality of the country.

In this work, some of the most relevant articles are highlighted. The article [4] deals with an evaluation of the electric power distributors using Kohonen self-organizing maps (SOM) and DEA. The article [5] studies the use of undesirable outputs in DEA with application in the electric power sector. The article [6] studies the application of a DEA network model with shared inputs to analyze the efficiency of Brazilian energy distributors. Article [7] used game theory applied to the DEA and later used mode clustering to evaluate the Brazilian energy distributors in the year. Finally, [8] reviews the main applications of DEA in the energy field.

This work proposes an alternative to the approach currently used by ANEEL to evaluate the efficiency of energy distributors. In the first phase of the proposal, the DEA model is modified: instead of a non-decreasing returns to scale model with one input and three outputs, a non-decreasing returns to scale network DEA model is used, which takes the network as an intermediate variable and keeps OPEX as the input and consumption and number of consumers as the outputs. Such an approach avoids the question of whether the network is an input or an output of the process.

In the second phase, since some environmental variables are correlated with each other, a factor analysis is carried out to avoid multicollinearity. After that, a multiple regression is performed with the environmental factors to verify which of them affect efficiency.

In the third phase, Kohonen maps are used to group the distributors based on the results of phase 2. The efficiency within each group is then calculated with a non-decreasing additive DEA model. In the final phase, these results are normalized.

#### **2. Materials and Methods**

#### *2.1. Price-Cap Model*

The price-cap model is a regulation model that intends to support the search for efficiency while also preventing monopoly-associated practices, in particular overpricing. As stated in a previous reference [3], the price-cap model assumes that the price charged must cover the total costs and contain a margin that generates an attractive internal rate of return for the investor. This is done by setting an initial price and correcting it at prefixed time periods by analyzing some factors. The general formula of the price-cap model is given by Equation (1):

$$P\_t = P\_{t-1} + \pi \pm X \pm Q \tag{1}$$

where *P*<sub>*t*</sub> is the price in period *t*; *P*<sub>*t*−1</sub> is the price in period *t* − 1; π is the inflation of the period; *X* is the factor related to the productivity of the company; and *Q* is the factor related to the quality of the services provided.

The *X* factor is related to advances in methods used by the company to increase its productivity, which would lead to decreases in prices due to the competitive market. Since there is basically no competition in the regulated sector, there is the need to adjust prices by adding this variable in the model. The *Q* factor is related to the quality of the service in a way that better service allows higher prices to be charged.
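As a minimal sketch of Equation (1), the adjustment for one period can be written as a single function. The sign convention is an assumption based on the description above (the productivity factor lowers the price, the quality factor raises it), all terms are assumed to be expressed in the same monetary unit, and the function name and numbers are hypothetical.

```python
def price_cap(previous_price, inflation, x_factor, q_factor):
    """Apply Equation (1) once: P_t = P_{t-1} + pi - X + Q.

    Assumed sign convention for this sketch: productivity gains (X) lower
    the price and the quality factor (Q) raises it. All arguments are in
    the same monetary unit as the price.
    """
    return previous_price + inflation - x_factor + q_factor


# Hypothetical values, for illustration only.
new_price = price_cap(previous_price=100.0, inflation=5.0, x_factor=2.0, q_factor=1.0)
print(new_price)
```

In a real tariff review the regulator may apply either sign to *X* and *Q*, which is what the ± in Equation (1) expresses.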

#### *2.2. DEA BCC Model and Non-Decreasing Returns to Scale*

A concept of DEA was given by a previous reference [9] as follows: "DEA evaluates the relative efficiencies of a homogeneous set of decision making units (DMUs) having multiple inputs and outputs."

This means that DEA is an approach used to evaluate efficiency by comparing Decision Making Units (DMUs) in a way that each DMU tries to maximize its own efficiency, but with the restriction that no DMU can be more than 100% efficient. A DMU is an individual unit that performs a process that is similar in its entries (inputs) and exits (outputs) to other DMUs. In this work, an example of a DMU is an energy distributor in the year of 2012.

One of the benefits of using DEA is that "it provides a non-parametric estimate of the efficiency of each DMU compared to the best practice frontier constructed by the best-performing DMUs" [10].

Furthermore, one study [11] states that "DEA showed great promise to be a good evaluative tool for future analysis on energy efficiency" as a conclusion. One of the reasons for this great promise is the facility of multiple inputs and multiple outputs of the DEA model.

The BCC (Banker, Charnes and Cooper) model [12] is a DEA model that uses variable returns to scale (VRS). This means that the proportion between inputs and outputs is not assumed to be constant. The BCC model considers a DMU to be efficient if it uses the smallest input or produces the maximum value of an output. If the returns to scale were constant, this would not necessarily be true: under constant returns to scale (CRS), a DMU is considered efficient only if its output/input ratio is maximal. The CRS model is also called the CCR model, which was the first DEA model to be introduced [13].

The BCC model is described in Equations (2)–(6):

$$\text{Max } Eff\_o = \sum\_{j=1}^{s} u\_j y\_{jo} + \eta^\* \tag{2}$$

$$\text{subject to } \sum\_{i=1}^{r} v\_i x\_{io} = 1 \tag{3}$$

$$\sum\_{j=1}^{s} u\_j y\_{jk} - \sum\_{i=1}^{r} v\_i x\_{ik} + \eta^\* \le 0, \; \forall \; k \tag{4}$$

$$u\_j, v\_i \ge 0, \; \forall \; j, i \tag{5}$$

$$\eta^\* \text{ free} \tag{6}$$

where *u* are the weights associated with the outputs; *y* are the outputs; *j* is the index of the output; *v* are the weights associated with the inputs; *x* are the inputs; *i* is the index of the input; *s* is the number of outputs; *r* is the number of inputs; *Eff*<sub>*o*</sub> is the efficiency of the DMU under evaluation, which is denoted by the subscript *o*; *k* is the DMU identifier; and *η*∗ is a variable indicating the type of returns to scale.

With a modification in the restriction on *η*∗, it is possible to adapt the model to non-decreasing returns to scale. This is done by restricting *η*∗ to non-negative values, so that Equations (5) and (6) become Equation (7):

$$u\_j, v\_i, \eta^\* \ge 0, \; \forall \; j, i \tag{7}$$

Using Equation (7) in the model implies that some of the DMUs will be evaluated using variable returns to scale (specifically those that have a smaller input variable) while others will be evaluated using constant returns to scale.
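The effect of Equation (7) can be illustrated with a small sketch for the special case of one input and one output. In that case Equation (3) fixes the input weight, so only the output weight and *η*∗ remain, and the linear program can be solved by checking the vertices of the feasible region. The function name, the solution method and the four DMUs are hypothetical and serve only as an illustration; a real application with several outputs would need a general linear programming solver.

```python
from itertools import combinations


def dea_efficiency(dmus, o, eta_nonneg=False, tol=1e-9):
    """Efficiency of DMU `o` from Equations (2)-(5), one input and one output.

    dmus: list of (x, y) pairs (input, output); o: index of the evaluated DMU.
    The unknowns are (u, eta); Equation (3) fixes the input weight v = 1 / x_o.
    With eta_nonneg=True, Equation (7) replaces Equations (5) and (6).
    """
    x_o, y_o = dmus[o]
    # Each constraint is (a_u, a_eta, b), meaning a_u * u + a_eta * eta <= b.
    cons = [(y, 1.0, x / x_o) for x, y in dmus]   # Equation (4) with v = 1 / x_o
    cons.append((-1.0, 0.0, 0.0))                 # u >= 0
    if eta_nonneg:
        cons.append((0.0, -1.0, 0.0))             # eta >= 0
    best = None
    # A bounded two-variable linear program attains its optimum at a vertex,
    # i.e. at a feasible intersection of two constraint boundaries.
    for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue                              # parallel boundaries
        u = (c1 * b2 - c2 * b1) / det
        eta = (a1 * c2 - a2 * c1) / det
        if all(a * u + b * eta <= c + tol for a, b, c in cons):
            value = u * y_o + eta                 # objective, Equation (2)
            if best is None or value > best:
                best = value
    return best


# Hypothetical (input, output) data for four DMUs.
dmus = [(2.0, 2.0), (4.0, 6.0), (8.0, 8.0), (6.0, 4.0)]
bcc = [dea_efficiency(dmus, o) for o in range(len(dmus))]
ndrs = [dea_efficiency(dmus, o, eta_nonneg=True) for o in range(len(dmus))]
```

In this toy data set the largest DMU is efficient under the free *η*∗ of the BCC model but loses efficiency once *η*∗ is restricted to be non-negative, because it is then compared with the constant-returns ray, while the smaller DMUs keep their scores.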

Since the objective of the price-cap model is to adjust distributor incomes to reflect the efficiency gains obtained through the proper use of resources, it is natural that the input of the model represents the costs associated with the service. For that reason, operational expenses (OPEX) are used as the input in the present work. The outputs should represent the amount of service delivered in return for the OPEX. Three variables could therefore be used as outputs: the extension of the network, which represents the extension of land covered by the distributor; consumption, which is a direct output of the process and one of the major sources of the company's variable costs; and the number of consumers, which represents the final clients of the process. However, the nature of the extension of the network is debatable, since it is not one of the final outputs of the process but a means of achieving the other two. Moreover, Technical Note nº 101/2011-SRE/ANEEL shows that the non-decreasing returns to scale hypothesis cannot be rejected. Therefore, distributors that use less OPEX should have greater benefits under the efficiency evaluation.

The network extension, the number of consumers and the OPEX were taken from Public Audience 23/2014 from ANEEL. Consumption data were obtained from the Associação Brasileira de Distribuidores de Energia Elétrica (in English, Brazilian Association of Electric Energy Distributors, known by its Portuguese acronym ABRADEE).

#### *2.3. Network DEA Models*

A problem identified in the classic DEA models is the fact that there is no clarification about what happens within the process. A previous study [14] proposed the first DEA network model to solve this question.

These models divide the process into two parts: the first with the objective of transforming the inputs into intermediate variables, which will be used in the process, and the second one with the objective of transforming these intermediate variables into outputs of the process. As stated by a previous study [15], "the division of the production process makes it easier to identify the sources of inefficiency in the process as a whole".

One network DEA model of high importance for this work is the additive Network DEA model, which was first proposed by reference [16]. The major relevance of this model to the present work is the fact that it incorporates variable returns to scale and the characteristics of the Network DEA models, which makes it possible to incorporate the network model in the non-decreasing returns to scale model used by ANEEL. To calculate the overall efficiency of the DMU, this model uses a weighted sum of the stages' efficiency. This model is shown in Equations (8)–(13):

$$\text{Max } Eff\_o = \sum\_{j=1}^{s} u\_j y\_{jo} + \sum\_{t=1}^{T} w\_t z\_{to} + \eta\_1 + \eta\_2 \tag{8}$$

$$\text{subject to } \sum\_{t=1}^{T} w\_t z\_{to} + \sum\_{i=1}^{r} v\_i x\_{io} = 1 \tag{9}$$

$$\sum\_{j=1}^{s} u\_j y\_{jk} - \sum\_{t=1}^{T} w\_t z\_{tk} + \eta\_2 \le 0, \; \forall \; k \tag{10}$$

$$\sum\_{t=1}^{T} w\_t z\_{tk} - \sum\_{i=1}^{r} v\_i x\_{ik} + \eta\_1 \le 0, \; \forall \; k \tag{11}$$

$$u\_j, v\_i, w\_t \ge 0, \; \forall \; j, i, t \tag{12}$$

$$\eta\_1, \eta\_2 \in \mathbb{R} \tag{13}$$

In this model, the objective function in Equation (8) is the weighted sum of the stage efficiencies. There is a new variable, *z*, which represents the intermediate variable, and a new weight, *w*, which is associated with this variable; *t* is the index of the intermediate variable and *T* is the number of intermediate variables. Equation (10) ensures that the second-stage efficiency is less than or equal to 1, while Equation (11) ensures this condition for the first stage. As discussed previously, in this model the OPEX is used as the input, while the number of consumers and the consumption are used as outputs. Since the extension of the network can be understood as an output of the OPEX and as an input for consumption and number of consumers, it is used as an intermediate variable. For the DEA model proposed in this work, Equations (15)–(17) are added to the additive model, while Equation (13) becomes Equation (14) to ensure non-decreasing returns to scale:

$$
\eta\_1, \eta\_2 \ge 0 \tag{14}
$$

$$\frac{w\_{consumers}}{u} \ge 30\tag{15}$$

$$\frac{v\_{network}}{u} \ge 580\tag{16}$$

$$\frac{v\_{\text{consumption}}}{u} \ge 1\tag{17}$$

Equations (15)–(17) are restrictions on the weights; they are added because they are also part of the model currently used by ANEEL [17].
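To make the "weighted sum of the stages" concrete, the sketch below evaluates the two stage efficiencies and the overall score of one DMU for a given set of multipliers. It assumes that each stage is weighted by its share of the total weighted resources, a common choice in additive two-stage DEA; under the normalization of Equation (9) this weighted sum equals the objective of Equation (8). The function name and all numbers are hypothetical, and the multipliers are simply given rather than optimized.

```python
def additive_efficiency(x, z, y, v, w, u, eta1, eta2):
    """Stage and overall efficiencies of one DMU for given multipliers.

    x, z, y: input, intermediate and output values (lists of floats).
    v, w, u: the corresponding non-negative weights; eta1, eta2: scale terms.
    """
    vx = sum(vi * xi for vi, xi in zip(v, x))   # weighted input
    wz = sum(wt * zt for wt, zt in zip(w, z))   # weighted intermediate
    uy = sum(uj * yj for uj, yj in zip(u, y))   # weighted output
    stage1 = (wz + eta1) / vx                   # input -> intermediate
    stage2 = (uy + eta2) / wz                   # intermediate -> outputs
    share1 = vx / (vx + wz)                     # assumed stage weights: share of
    share2 = wz / (vx + wz)                     # the resources each stage uses
    overall = share1 * stage1 + share2 * stage2
    return stage1, stage2, overall


# Hypothetical DMU: OPEX as input, network as intermediate, two outputs.
stage1, stage2, overall = additive_efficiency(
    x=[10.0], z=[5.0], y=[4.0, 2.0],
    v=[0.05], w=[0.1], u=[0.05, 0.05],
    eta1=0.0, eta2=0.1,
)
# Objective of Equation (8) for the same multipliers (Equation (9) holds: 0.5 + 0.5 = 1).
objective = 0.05 * 4.0 + 0.05 * 2.0 + 0.1 * 5.0 + 0.0 + 0.1
```

With these values the first stage is fully efficient and the inefficiency comes from the second stage, which is the kind of diagnosis the network model is meant to allow.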

#### *2.4. Neural Networks and Kohonen Self-Organizing Maps (SOMs)*

Neural networks can be understood as a mathematical way of trying to simulate the physical functioning of the brain. To do so, they are composed of computational units called neurons, which are responsible for acting on the received data.

Kohonen self-organizing maps (SOMs) are neural networks whose main purpose is to find similarities in a group of elements. In this model, the input data is distributed to all neurons, but only one neuron is assigned to each element. SOMs have been used in several applications, such as the modeling of hippocampal dynamics [18]. The SOM algorithm iterates over the processes of competition, cooperation and synaptic adaptation [19].


The competition process finds the nearest neuron. After this, the cooperation process finds the neighborhood of the winning neuron. The synaptic adaptation adjusts the weights of the winning neuron, based on the neighborhood function and the number of iterations, to make the neuron closer to the input element [20].

By the end of the iterative process, each element is related to a single neuron in a way that similar elements tend to be in the same neuron. In this way, each neuron is attached to a cluster of elements.
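The three processes described above can be sketched in a few lines of plain Python. This is only an illustrative toy implementation, not the MATLAB implementation used in this work: it assumes a one-dimensional map, a Gaussian neighborhood function, linearly decaying learning rate and neighborhood width, and a deterministic initialization. As is usual in SOM implementations, every neuron is updated, with a step that shrinks with its grid distance from the winner. The data are hypothetical.

```python
import math


def best_matching_unit(weights, sample):
    """Competition: index of the neuron whose weights are closest to the sample."""
    dists = [sum((s - w) ** 2 for s, w in zip(sample, wvec)) for wvec in weights]
    return dists.index(min(dists))


def train_som(data, n_neurons=2, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a one-dimensional SOM (n_neurons >= 2) and return the neuron weights."""
    # Deterministic initialization: pick evenly spaced samples as starting weights.
    step = (len(data) - 1) / (n_neurons - 1)
    weights = [list(data[round(i * step)]) for i in range(n_neurons)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                          # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.01)         # neighborhood width shrinks too
        for sample in data:
            winner = best_matching_unit(weights, sample)           # competition
            for idx, wvec in enumerate(weights):
                grid_dist = abs(idx - winner)
                h = math.exp(-grid_dist ** 2 / (2 * sigma ** 2))   # cooperation
                for d in range(len(wvec)):                         # adaptation
                    wvec[d] += lr * h * (sample[d] - wvec[d])
    return weights


# Hypothetical data: two well-separated groups of elements.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (9.0, 10.0), (10.0, 9.0)]
weights = train_som(data)
labels = [best_matching_unit(weights, sample) for sample in data]
```

After training, `labels` holds the neuron assigned to each element, so elements with the same label form one cluster.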

In the scope of this work, the SOM is used to group energy distributors by environmental similarity. Since Brazil covers a large territory, the environment changes greatly within the country. Thus, there are significant differences in the way each company can use its resources, or in how its efficiency is affected by the environment. Therefore, it could be unfair to compare all the distributors without accounting for these effects. One way of keeping the treatment of distributors equal is to group them based on the environmental similarity of their areas of activity. By doing so, it is possible to obtain a cluster efficiency and to adjust the previously found efficiency by seeing how close the distributor is to the other distributors in the cluster. The data for the environmental variables were extracted from Public Audience 23/2014 from ANEEL.

#### *2.5. Multiple Regression*

According to reference [21], the "main application [of multiple regression], after finding the mathematical relationship, is to produce values for the dependent variable when the independent variables are present."

As in the simple regression analysis, the objective of this analysis is to verify if there is a correlation between the dependent variable and the independent variables in such a way that a change in a value in the independent variables may cause a proportional change in the dependent variable. If such correlation exists, the objective is to find a linear expression that defines the dependent variable in terms of the independent ones.

In this work, this technique is used to verify which environmental variables actually influence the efficiency of a DMU. The general form of the multiple regression is given by Equation (18):

$$Y = b\_0 + \sum\_{i=1}^{k} b\_i X\_i + e \tag{18}$$

where *Y* is the dependent variable; *b*<sub>0</sub> is the intercept of the equation, which is the value that the dependent variable would assume if there were no errors in the analysis and all the independent variables were equal to zero; *i* is the index of the independent variable; *k* is the number of independent variables; *b*<sub>*i*</sub> is the change in the dependent variable per unit change in the independent variable *X*<sub>*i*</sub>; *X*<sub>*i*</sub> is the independent variable *i*; and *e* is the associated error, which is also called the residual.

We used the least squares method to estimate the values of *b*<sub>*i*</sub>. It is also possible to obtain the *t* value associated with each *b*<sub>*i*</sub>, which can be used to test statistically whether the independent variable is in fact correlated with the dependent variable.

According to a previous reference [22], "residuals in a regression are obtained from the difference between the observed value of the response variable and that forecast by the regression model". Therefore, in order to obtain adequate results from a regression and possibly to use it to forecast the results of new elements, it is important that the residuals are as small as possible.
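A minimal sketch of the least squares fit of Equation (18) is given below, using the normal equations and a small Gaussian elimination so that no external library is needed. The function names and the data are hypothetical; the example data are generated from an exact linear relation, so the residuals are essentially zero.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least squares coefficients [b0, b1, ..., bk] of Equation (18)."""
    rows = [[1.0] + list(r) for r in X]          # leading 1 for the intercept b0
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)                       # normal equations (X'X) b = X'y


def r_squared(X, y, coef):
    """Share of the variance of y explained by the fitted regression."""
    fitted = [coef[0] + sum(b * x for b, x in zip(coef[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # squared residuals
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 without noise.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
coef = ols(X, y)
fit_quality = r_squared(X, y, coef)
```

The normal equations become ill-conditioned when the independent variables are strongly correlated, which is one practical reason for reducing multicollinearity before running the regression.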

#### *2.6. Factor Analysis*

According to the concept given by a previous study [23], factor analysis aims to explain the correlations between a large set of variables in terms of a small set of unobservable random variables, which are called factors.

One of the advantages of factor analysis, and the main reason for using this technique in this work, is its ability to reduce the number of variables and eliminate multicollinearity while maintaining the explanatory power of these variables to an adequate extent.

The technique of factor analysis consists of finding the values of the coefficients such that they reproduce the variables from the factors with the highest degree of confidence, according to Equation (19):

$$Y\_i = \sum\_j a\_{ij} F\_j + a\_i U\_i \tag{19}$$

where *Y* is the input variable; *F* is the factor; *i* is the index of the input variable; *j* is the index of the factor; *U* is the associated specific factor; and *a* is the factor loading.

In this work, principal component analysis was conducted, which aims to find factors that have little error or unique variance but still explain most of the variance of the original variables.

The Varimax rotation was also used, which is an approach that aims to make factor loadings close to 0 or 1 in order to facilitate interpretation.
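The idea of replacing correlated variables by a smaller number of components can be illustrated with a short sketch. It computes the correlation matrix of three hypothetical variables, two of which are perfectly collinear, and extracts the first principal component by power iteration. This is a simplified illustration only: it does not perform the Varimax rotation, it extracts a single component, and it assumes that the largest eigenvalue is unique.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long variables."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([value - mean for value in col])

    def corr(a, b):
        cov = sum(p * q for p, q in zip(a, b))
        return cov / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))

    return [[corr(a, b) for b in centered] for a in centered]


def first_component(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector of a symmetric matrix (power iteration)."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(c * c for c in nxt))
        vec = [c / norm for c in nxt]
    eigenvalue = sum(
        vec[i] * matrix[i][j] * vec[j] for i in range(size) for j in range(size)
    )
    return eigenvalue, vec


# Hypothetical variables: x2 is a multiple of x1, x3 is uncorrelated with both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
x3 = [1.0, -1.0, 0.0, -1.0, 1.0]
R = correlation_matrix([x1, x2, x3])
eigenvalue, loadings = first_component(R)
explained_share = eigenvalue / len(R)   # share of total variance in the component
```

Here the two collinear variables collapse into one component that carries two thirds of the total variance, while the third variable does not load on it.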

#### **3. Methodology**

In this work, the data from the year of 2012 and previous years were used. However, the focus was to find the efficiency of the distributors in 2012. Previous data was used for historical analysis and to enhance DEA's results. Distributors that did not have data for a certain year were excluded from the analysis in the given year. For the year of 2012, the distributors analyzed were distributed by states in the following way:


In order to understand the model proposed, a diagram is presented in Figure 1. The first stage is to calculate the non-decreasing returns to scale DEA model with the restrictions presented in Section 2.3, using the OPEX as input and the extension of the distribution network, the number of consumers and the consumed amount of electricity as outputs. This model is called the Retorno Contábil Médio (RCM, or in English, Average Accounting Return) model, which is the model used in the 2-phase analysis by ANEEL. The second phase of ANEEL's analysis is a multiple regression using the environmental variables as independent variables and the efficiencies as the dependent variable.

The RCM model was used for every distributor with data from 2012 and previous years in order to avoid false correlations between environmental variables and efficiency. It is important to note that due to practical issues, the effects of inflation were not considered in this step, which may cause some variations in the final result if such consideration is made in the future.

The second stage of Figure 1 was needed because the environmental variables presented had multicollinearity, which could negatively impact the analysis. The factor analysis reduced the number of variables to a number of factors that would still represent the environment properly, but would not affect the regression.

The third stage of Figure 1 involves a regression analysis with the previously found factors as the independent variables and the RCM efficiency as the dependent variable. This allows us to discover which factors actually influenced the efficiency of the distributors. Following this, these factors were used as the inputs for the SOM in stage 4.

The fourth stage of the model involves the execution of SOM with the factors that influence efficiency in order to cluster the distributors by environment. This was conducted only with the distributors in 2012 in order to avoid clustering one distributor with itself in previous years. Since the objective is to find the efficiencies in 2012, it is more logical to use only the distributors in this year.

**Figure 1.** Diagram of proposed model.

The fifth stage calculates the non-decreasing additive model, with the restrictions presented in Section 2.3. Here, since the final result will be normalized by clustering using only data from 2012, all the DMUs were used with one DMU being the data of one distributor in a given year. The OPEX was used as the input, the extension of the distribution network was used as an intermediate variable, while the consumed amount and number of consumers were used as outputs. In order to understand the model proposed in this stage, a diagram is presented in Figure 2.

**Figure 2.** Diagram of DEA model.

The assumption of non-decreasing returns to scale is based on Technical Note nº 101/2011–SRE/ANEEL, which performs the Banker test [23] on several configurations of inputs and outputs. The results show that, for the configuration used in this work, the non-decreasing returns to scale hypothesis cannot be rejected. Finally, the sixth stage normalizes the efficiency from stage 5 within the clusters of stage 4. The data used were obtained from ANEEL's public audience no. 023/2014 and from ABRADEE.
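The normalization rule of the sixth stage is not spelled out in the text. As one possible reading, stated here purely as an assumption, the sketch below divides each efficiency by the highest efficiency found in the same cluster, so that the best distributor of every cluster receives a score of 1. The function name, distributor names, efficiencies and cluster labels are all hypothetical.

```python
def normalize_within_clusters(efficiency, cluster):
    """Divide each efficiency by the highest efficiency in the same cluster.

    Assumed normalization rule for illustration; the source does not state it.
    efficiency: {distributor: stage 5 efficiency}
    cluster:    {distributor: cluster label from stage 4}
    """
    best = {}
    for name, value in efficiency.items():
        group = cluster[name]
        best[group] = max(best.get(group, 0.0), value)
    return {name: value / best[cluster[name]] for name, value in efficiency.items()}


# Hypothetical distributors, efficiencies and cluster labels.
efficiency = {"A": 0.8, "B": 0.4, "C": 0.5, "D": 0.25}
cluster = {"A": 1, "B": 1, "C": 2, "D": 2}
normalized = normalize_within_clusters(efficiency, cluster)
```

Under this reading, a distributor is only compared with distributors operating in a similar environment.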

#### **4. Results**

For the sake of space and relevancy, 411 DMUs were considered in stages 1–3 of Figure 1, with some specific data from these analyses having been omitted. These include the inputs and outputs used in the DEA model as well as the environmental variables and the efficiency found in the DEA model. However, these data are used primarily for statistical purposes in stage 3, with no other purpose in this article.

In stage 2 of the analysis, two variables were excluded for the purpose of fixing the Kaiser-Meyer-Olkin (KMO) test value. These variables were area and low vegetation. Subsequently, four factors were found in the analysis. The participation of each environmental variable in each factor is shown in Figure 3.

**Figure 3.** Participation of variables in each factor.

From Figure 3, it is possible to observe that:


In stage 3, *R*<sup>2</sup> was found to be 0.356. This means that there is some correlation between the factors and the RCM efficiency, but this relationship is not enough for the factors alone to explain the efficiency of the distributors. Whether each factor really influences the efficiency must be assessed through its significance level, which is given in Table 1.


**Table 1.** Significance levels for factors in regression.

From the analysis of Table 1, it can be seen that all factors are significant. Therefore, all four factors should be included for the creation of SOM in stage 4. In order to find the number of clusters needed, the technique of observing the U matrix and the weighted maps was used. This strategy is explained in further detail in reference [24]. For the purposes of this work, a larger than required map is created and the distances between the neurons are plotted with a color scale. The brighter, connected regions are usually linked to the same cluster. This technique displays a natural number of clusters formed by the network. It is mostly a visual technique, but provides a good estimate of the number of clusters that should be used.

The U matrix (unified distances matrix) for this work is shown in Figure 4 and the weight distances are plotted in Figure 5. Both were generated with help of MATLAB® (version R2017b).
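For readers unfamiliar with the U matrix, the sketch below shows how it can be computed from the trained neuron weights: each cell holds the average distance between a neuron and its neighbors on the map. This is an illustrative simplification that assumes a rectangular grid with four neighbors per neuron (toolboxes often use hexagonal grids) and a map with at least two neurons; the function name and weights are hypothetical.

```python
import math


def u_matrix(weights):
    """Average distance from each neuron to its neighbors on a rectangular grid.

    weights[r][c] is the weight vector of the neuron at row r, column c.
    """
    rows, cols = len(weights), len(weights[0])
    result = []
    for r in range(rows):
        line = []
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    diff = zip(weights[r][c], weights[nr][nc])
                    dists.append(math.sqrt(sum((a - b) ** 2 for a, b in diff)))
            line.append(sum(dists) / len(dists))
        result.append(line)
    return result


# Hypothetical 1 x 3 map: the third neuron lies far from the other two.
grid = [[(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]]
umat = u_matrix(grid)
```

Small values mark neurons that lie close to their neighbors and therefore belong to the same region of the map, while large values mark the borders between clusters.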

**Figure 4.** U-Matrix (unified distances matrix).

From the U-matrix in Figure 4 alone, it is not possible to obtain enough information. However, the input weights in Figure 5 provide some clues. It is possible to observe a cluster in the upper left part of the map, while there are two others in the lower left and right extremes of the map. Further, there is a lighter area in the top central part of the map that can be interpreted as another cluster.

Maps ranging from three to six clusters were created. The map with six clusters presented a cluster without any elements, so it was discarded. The map with three clusters did not represent well the ecological and economic diversity of the area under analysis. Both the four-cluster and the five-cluster maps could explain the similarities of the environments reasonably well. In short, some distributors that belong to clusters 1 and 4 in the four-cluster map form a new cluster in the five-cluster map. In this work, the four-cluster map was used for the analysis, since it provides a greater number of distributors per cluster, therefore improving efficiency comparability in DEA.

Thus, a map with a size of 2 rows × 2 columns was used for clustering, with the results shown in Table 2. For better visualization, these results are shown graphically in Figure 6.

**Figure 6.** Cluster representation by area.

In Figure 6, there are areas of blank spaces. This is caused by the absence of data relative to the electric distributor that is located in these areas. Furthermore, some states are divided in color. This is due to there being more than one distributor serving these states, with these distributors being allocated in different clusters. However, it is possible to see that there is at least some consistency to the clustering. The final results and comparison with the ANEEL results are given in Table 2.

From Table 2, it is possible to say that there is no direct relationship between the results of the proposed model and those of the ANEEL model. Some distributors gain efficiency, while others lose it. In fact, the correlation between the models is 24%, with an *R*<sup>2</sup> of only 0.06 (0.24<sup>2</sup> ≈ 0.06). Such divergence is due to the great changes made, especially in relation to the network DEA model.

Figure 7 presents the efficiencies of each stage. Generally, distributors are more efficient in Stage 1. However, Figure 7 shows that seven distributors, such as the Energy Company of Brasília (CEB) and Eletropaulo, are more efficient in Stage 2.

**Figure 7.** Efficiency of each concessionaire by stage.


**Table 2.** Significance levels for factors in regression.

#### **5. Conclusions**

A different model was proposed for the evaluation of the efficiency of Brazilian energy distributors, which aims to reduce the margin for doubt by using the environmental variables in the regression and by treating the extension of the distribution network as an intermediate variable in the DEA model.

It must be considered that the distributors were divided into clusters for the calculation of the *X* factor. Therefore, it becomes possible to perform the comparison within each cluster in the final calculation of the *X* factor. Despite this division, there was comparison between the distributors, so that some proportionality can be achieved in the corrections of the defined energy tariffs that should be charged, which is a positive point of the presented model.

The proposed model presented differences compared to the model currently used by ANEEL, with some distributors gaining efficiency while others had their efficiency reduced. This indicates the need for adjustments to the data used. In addition, the proposed model presents results with high sensitivity to variable specifications.

It is reasonable to assume that there may be distortions due to the limitations presented in this paper, such as the lack of use of shared inputs or the absence of correction of inflationary effects.

This work can be used in countries of great territorial extension that have at least some private-sector participation in energy distribution, or in any other regulated service. It allows an objective evaluation of efficiency, which is relevant when the discussion involves economic rights, as in the present case. Moreover, the resulting efficiency accounts for environmental aspects, but is not defined by them.

**Author Contributions:** T.G.L.G. carried out the computational implementations and the writing of the article. E.d.S.C. and L.A.M. contributed to the analysis of DEA, SOM, factor analysis and multiple regression. K.A.C. and D.P.M.d.S. revised the text and guided the construction of the work.

**Funding:** This research was funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior and Conselho Nacional de Desenvolvimento Científico e Tecnológico.

**Acknowledgments:** The authors thank the Federal University Fluminense (UFF) for its teachers, and the Coordination for the Improvement of Higher Education Personnel (CAPES) for the financial assistance during the postgraduate course.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Automatic Processing of User-Generated Content for the Description of Energy-Consuming Activities at Individual and Group Level**

#### **Roos de Kok, Andrea Mauri \* and Alessandro Bozzon**

Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 XE Delft, The Netherlands; R.E.deKok@student.tudelft.nl (R.d.K.); a.bozzon@tudelft.nl (A.B.)

Received: 31 October 2018; Accepted: 18 December 2018; Published: 21 December 2018

**\*** Correspondence: a.mauri@tudelft.nl

**Abstract:** Understanding and improving the energy consumption behavior of individuals is considered a powerful approach to improve energy conservation and stimulate energy efficiency. To motivate people to change their energy consumption behavior, we need to have a thorough understanding of which energy-consuming activities they perform and how these are performed. Traditional sources of information about energy consumption, such as smart sensor devices and surveys, can be costly to set up, may lack contextual information, have infrequent updates, or are not publicly accessible. In this paper, we propose to use social media as a complementary source of information for understanding energy-consuming activities. A huge number of social media posts are generated by hundreds of millions of people every day; they are publicly available and provide real-time data, often tagged with location and time. We design an ontology to get a better understanding of the energy-consuming activities domain and develop a text and image processing pipeline to extract the description of energy-consuming activities from social media. We run a case study on Istanbul and Amsterdam. We highlight the strengths and weaknesses of our approach, showing that social media data has the potential to be a complementary source of information for describing energy-consuming activities.

**Keywords:** social media; energy-consuming activities; energy consumption; machine learning; ontology

## **1. Introduction**

Europe's 2030 Energy Strategy targets a 40% cut in greenhouse gas emissions compared to 1990 levels, at least a 27% share of renewable energy consumption and at least 27% energy savings compared with the business-as-usual scenario (https://ec.europa.eu/energy/en/topics/energy-strategy-and-energy-union/2030-energy-strategy). To meet this target, energy policies and programs should be formed and individuals should be motivated to change their energy consumption behavior [1], both in terms of energy conservation and energy efficiency. Energy efficiency involves using less energy to provide the same service; for instance, replacing a single-pane window in the house with an energy-efficient one. On the other hand, energy conservation involves saving energy by reducing or omitting an activity; for instance, turning a light off or reducing the time one watches television.

Multiple studies have examined how energy efficiency and conservation could be motivated among policy makers and citizens. In [2] the author explains how comparative feedback on energy usage with others can generate feelings of competition, social comparison, or social pressure, which appears to be more effective in motivating energy conservation than temporal self-comparisons. The author of [3] endorses this in his Social Electricity case study, which "allows people to compare their energy footprint with other online peers or with the consumption at their neighborhood, village or town, to perceive if their own consumption is low, average or high". Multiple energy saving applications [4] have been developed, using visualized consumption feedback and gamified social interactions to motivate people to adopt energy-efficient lifestyles.

Before we can motivate individuals to change their energy consumption behavior, we need a thorough understanding of why and how they consume energy. To do so, insights into the individual's activities behind the energy consumption should be gathered at a high-granular level.

Multiple data sources are used to provide insights into energy-consuming activities (i.e., an activity that have a direct or indirect impact on energy consumption). Smart meters and smart plugs give insights into domestic energy consumption by providing aggregated energy consumption data. Techniques have been developed to isolate the signal of each appliance by looking at the total power consumed, the different current waveform and the voltage signature [5–7]. Surveys and interviews are used to break down the energy consumption into different end-uses through several questions (e.g., how much time you watch TV at home? How often do you use public transportation?) [8–10]. While being the most reliable source of quantitative data and qualitative information, the aforementioned sources come with drawbacks: surveys are costly to perform, they do not scale and are done infrequently; while smart sensors and smart plugs are costly, the data obtained lack of contextual information and is often not accessible. Moreover, smart sensor devices neglect indirect energy usage [11] (i.e., related to the production, transportation, and disposal of a variety of consumer goods and services [12]) and the disaggregation process is far from perfect [5].

On the other hand, hundreds of millions of people frequently use social media to share, communicate, connect, and interact. Although social media data are noisy and biased (i.e., produced by a subset of the population), they are publicly available and provide real-time and semantically rich information.

For these reasons, social media has proven to be a good source for human activity recognition [13–15], including, but not limited to, travel behavior [16–18], mode of transportation [16] and nutrition patterns [19–21].

This work puts the following intuition to the test: since social media posts relate to different aspects of daily activities, they may either directly refer to energy-consuming activities, or contain relevant information about energy-consuming activities in their semantic signature. Therefore, by processing the content of social media posts, we aim to extract information about the energy-consuming activities they refer to.

Hence, we aim to answer the following research question:

RQ How can we automatically process user-generated content to describe energy-consuming activities at individual and group level?

We focus on four categories of energy-consuming activities: dwelling, mobility, food consumption, and leisure. Based on the literature [22–24], they cover a considerable spectrum of the activities impacting on the energy footprint of an individual's lifestyle.

Dwelling refers to the consumption of energy due to the usage of home appliances (e.g., washing machine, gaming console); mobility includes the energy required for moving from one place to another; food consumption refers to the use of resources associated with the preparation and processing of food; and leisure indicates the energy required for performing recreational activities (e.g., watching TV, playing video games, partying). Activities related to industry (e.g., the individual being at work) are not taken into account.

Figure 1 illustrates the intuition behind this work. The message (*Great dinner at Hotel de Goudfazant [...]*) suggests that the picture was taken by the user during dinner. In addition, in the image we can indeed identify some kind of cooked fish and vegetables. Furthermore, the hashtags and the location where the user has checked in indicate that the dinner took place in the Hotel de Goudfazant. By looking at the place properties, we discover that the restaurant is located in Amsterdam, the Netherlands. Moreover, we can suppose that the person travelled to the restaurant either by car or by public transportation. To conclude, this post discloses information about food (i.e., the dinner was cooked), leisure (i.e., the activity takes place in a restaurant) and mobility (i.e., the individual had to travel to get to the venue) energy-consuming activities.

**Figure 1.** Example of social media post on Instagram.

**Contribution**: The objective of this work is to automatically extract information about energy-consuming activities from social media posts. To do so, we (1) create an *ontology* of the domain to identify relevant and important concepts and how these are interrelated. It provides terms for describing our knowledge about the energy consumption domain in a structured manner and it helps to draw the link between the social media post and the activity performed in the physical world. Then (2), we design a *data processing pipeline* that extracts the characteristics of energy-consuming activities from the social media data. This pipeline includes multiple components: (i) the data collection (and pre-processing) from the social media data sources; (ii) different steps of data enrichment; (iii) a dictionary and rule-based classification model that outputs the categories of energy-consuming activities to which social media posts are assigned; and (iv) a linked data publisher that uses the information gathered by the previous modules to create instances of the ontology and outputs them using the JSON-LD format (https://json-ld.org/).

The pipeline is evaluated through a case study performed on the social media activity in the cities of Amsterdam and Istanbul.

#### **2. Materials and Methods**

#### *2.1. The Social Smart Meter Ontology*

In this section, we present the Social Smart Meter ontology (SSMO). We create this ontology with two objectives in mind: (i) understand the domain of energy-consuming activities and (ii) identify relevant and important concepts and how these are interrelated, by providing terms for describing and representing our knowledge about this domain in a structured manner [25].

In addition, the ontology allows for an unambiguous conceptual description of the targeted domain and can be also used to enable better interaction among different fields of studies concerned with energy consumption.

Since social media data refer to individuals' daily activities [15], we include social media concepts in the definition as well, by linking them to the relevant concepts of energy-consuming activities. Adding meaning to a user's social media data helps us understand to what extent these data sources reflect the individual's energy-consuming activities.

The design of the ontology has been performed according to the *Methontology* guidelines [26]. We follow the methodological guidelines for specifying ontology requirements presented in [27] to compose a set of functional requirements for the SSMO ontology, which are presented in Table A1 in Appendix A.

#### 2.1.1. The Ontology Definition

As depicted in Figure 2, an *Individual* consumes energy by performing an *Activity* at a certain *Location*, at a certain time, and for a certain period of time. That activity can be of multiple types: *Dwelling*, *Mobility*, *Food Consumption*, and/or *Leisure*.

A *Location* can either be a *Path* or *Place*. A *Place* can be a geographical location (e.g., a town or country) or a venue (e.g., a restaurant or airport) and is characterized by its corresponding coordinates and a category. A *Path* is composed of multiple (at least two) places, among which the origin and destination.

In case of a domestic activity, generally, one or more *Appliance*s are used. Among appliances, *Brown Goods* (small household electrical entertainment appliances) and *White Goods* (major household appliances) are distinguished [28].

In food consumption-related activities (having breakfast or lunch, dining, cooking, etc.), the *Food* product itself and its *Ingredient*s, the *Tableware* used for consumption, the food *Source*, and the (cooking) *Process* are relevant entities. Among processes, cooking and *Modification* are distinguished. Modification involves a technique used to modify raw food into food that is ready for cooking.

In leisure, several subcategories can be distinguished, among which: culture, event, gastronomy, playful, relaxation, social interaction, etc. In general, leisure activities require the use of one or more *Artifact*s, for instance, an appliance.

An activity that involves mobility is characterized by the transportation along a path. People travel by a certain *Mode of transport*, for which the type indicates whether the mode of transport is public or private.

**Figure 2.** Conceptual data model of energy-consuming activities.
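The core of this conceptual model can be sketched as a handful of Python data classes. This is only an illustrative reading of the description above, not the OWL implementation: the class and field names, the example identifiers, and the coordinates are assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Union


class ActivityType(Enum):
    DWELLING = "dwelling"
    MOBILITY = "mobility"
    FOOD_CONSUMPTION = "food consumption"
    LEISURE = "leisure"


@dataclass
class Place:
    name: str
    latitude: float
    longitude: float
    category: str  # e.g., "restaurant", "airport", "town"


@dataclass
class Path:
    places: List[Place]  # ordered: first is the origin, last the destination

    def __post_init__(self) -> None:
        # The text requires a path to be composed of at least two places.
        if len(self.places) < 2:
            raise ValueError("a Path needs at least two places")

    @property
    def origin(self) -> Place:
        return self.places[0]

    @property
    def destination(self) -> Place:
        return self.places[-1]


@dataclass
class Activity:
    individual: str                # identifier of the Individual (hypothetical)
    types: List[ActivityType]      # an activity can be of multiple types
    location: Union[Place, Path]   # a Location is either a Place or a Path
    start: datetime
    duration: timedelta


# Illustrative values only (coordinates are made up for the example).
home = Place("home", 52.37, 4.89, "residence")
venue = Place("restaurant", 52.38, 4.92, "restaurant")
dinner = Activity("user-1", [ActivityType.FOOD_CONSUMPTION, ActivityType.LEISURE],
                  venue, datetime(2018, 6, 24, 19, 0), timedelta(hours=2))
trip = Activity("user-1", [ActivityType.MOBILITY], Path([home, venue]),
                datetime(2018, 6, 24, 18, 30), timedelta(minutes=30))
```

Modelling `types` as a list mirrors the "and/or" in the definition: a single dinner can count as both food consumption and leisure.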

For our ontology it is also important to include social media data. Therefore, based on the existing ontologies and studies [29,30], we created a conceptual data model, depicted in Figure 3, including the following elements:


Then the two parts are linked by the following relations: a *User* is an *Individual* and *Post* may reflect an *Activity*.

**Figure 3.** Conceptualization of social media activity.

#### 2.1.2. Implementation of the Ontology

To prevent a proliferation of ontologies covering the same entities and relationships, it is important to determine which existing ontologies can be integrated and extended to develop ours. For this reason, we looked at existing ontologies about energy consumption, travel, food, and social media.

The Suggested Upper Merged Ontology (SUMO) [31] has been designed as a foundation ontology and is the largest formal public ontology today, used for research and applications in search, linguistics, and reasoning (in computer information processing systems). Since it covers most of the concepts of our conceptual data model of energy-consuming activities, it is used as the foundation to be extended for our SSMO ontology.

The Semantic Tools for Carbon Reduction (SEMANCO) Energy Model [32] focuses on terms and attributes describing energy consumption and CO2 emission indicators for regions, cities, neighborhoods, and buildings, along with climate and socioeconomic factors affecting energy consumption. We include it to model the energy consumption part of our ontology.

The EnergyUse (EU) platform [33] is built upon the PowerOnt [28] ontology that provides information of energy consumption for numerous household appliances and extends the DogOnt [34] ontology, which aims to model intelligent domotic environments. We integrate this ontology to cover the concepts related to appliances.

The Food Ontology (FO) [35] encompasses information about recipes, their ingredients, along with suitable diets, menus, seasons, courses, and occasions. Also, entities about food chain (i.e., methods and techniques used to process the food) are promising for the integration in the SSMO ontology. FO does not cover the tableware entities; yet, this is not problematic since the SUMO ontology covers them. Finally, the Travel Ontology (TO) by Stevens [36], covers most of the relevant entities within the mobility concept, except for the actual mobility activity itself.

Table 1 indicates, for each ontology, to what extent the entities within the high-level concepts (energy activity, location, dwelling, food consumption, leisure, and mobility) are covered. A "+" indicates the entity occurs in the ontology, a "+/−" indicates the entity is covered to some extent, and a "−" indicates the ontology does not include the entity.



Regarding the social media activity, we reuse the Friend of a Friend (FOAF) [30] and the Semantically-Interlinked Online Communities (SIOC) [29] ontologies. In general, both cover the concepts of user account, post, and item; but the *mention* entity only occurs in the SIOC ontology, whereas the location entity can only be found in the FOAF ontology.

To a great extent, the SSMO ontology can be built upon existing ontologies, as can be deduced from the overview in Table 1; many classes can be reused. Table 2 summarizes the classes that are reused from existing ontologies.

On the other hand, the existing ontologies serve other purposes than identifying and describing energy-consuming activities, so even though some concepts are already covered (e.g., the mobility activity by the *SUMO:Motion* class), the exact semantics of the class are slightly different. For these cases, we create new entities for those classes and we draw the equivalence relationship between them (e.g., our *ssmo:MobilityActivity* class and the *SUMO:Motion* class). Table 3 summarizes the entities created in this way.

*Energies* **2019**, *12*, 15

In addition, not all entities from the conceptual data models can be covered by existing ontologies. The new entities that had to be created for the SSMO are listed in Table 4.

The ontology was then implemented using the Web Ontology Language (OWL) [37] with Protégé (https://protege.stanford.edu), Stanford University's free, open-source ontology editor.

Finally, the ontology is available on the companion website (http://social-glass.tudelft.nl/socialsmart-meter/#ontology).


**Table 2.** Overview of the entities in the SSMO ontology reused from existing ontologies.

**Table 3.** Overview of the new entities equivalent to reused entities in the SSMO ontology.



**Table 4.** Overview of the new entities in the SSMO ontology.

#### *2.2. Data Processing Pipeline*

The data processing pipeline, shown in Figure 4, is composed of four modules: *Data Collection*, *Data Enrichment*, *Classifier* and *Linked Data Publisher*.

**Figure 4.** Overview of the data processing pipeline.

During the first stage, the data is collected through the APIs of the selected data sources. Both data (image and text data) and metadata (user, time, and place data) are collected.

In the second stage, different enrichment steps are performed. First, for each social media post, computer vision and natural language processing techniques are applied to respectively the image and text. For the images, we use both object and scene recognition models to extract information regarding the items present in the picture and the context where the photo was taken, while for the text we apply state-of-the-art processing methods and word disambiguation techniques. We enrich the information about the place by looking for its category on external data sources such as Foursquare and Google Places.

Using the enriched data, the social media post is classified to one or more of the energy-consuming activity categories using a hybrid rule and dictionary-based approach.

Finally, the publisher module combines the output of the other modules and publishes the information about the energy-consuming activity as linked data (http://linkeddata.org/) conforming to the Social Smart Meter ontology.

#### 2.2.1. Data Collection and Pre-Processing

The pipeline collects data from Twitter and Instagram. These sources were chosen because they are widely used and provide public APIs to retrieve the data (text, images, places, time, user) we are interested in.

Since a social media post is very noisy and contains slang, hashtags, or mentions, we apply text pre-processing techniques (stopword removal, removal of hashtags and other special characters, stemming) before the tokenization (word segmentation of the message). This results in a set of tokens that might refer to an energy-consuming activity. To perform this task, we use the Python-based Natural Language Toolkit (NLTK (https://www.nltk.org/)) module.
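The pre-processing steps can be illustrated with a small, standard-library-only sketch. It is not the NLTK-based implementation used in the pipeline: the stopword list is a tiny illustrative subset, `crude_stem` is a deliberately naive stand-in for a real stemmer, and the choice to keep the hashtag word while dropping mentions is an assumption.

```python
import re
from typing import List

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "at", "in", "on", "of", "and", "to", "is", "was"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a naive stand-in for a stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message: str) -> List[str]:
    text = message.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # drop '#', digits, punctuation
    tokens = text.split()                       # whitespace tokenization
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("Great dinner at Hotel de Goudfazant! #amsterdam @friend")
```

For the example message, the sketch keeps the content words (including the hashtag word) and drops the stopword "at", the mention, and the punctuation.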

#### 2.2.2. Data Enrichment

In this section, we describe the enrichment steps performed by our pipeline. Each step aims at extracting additional data from the text, image, and place of the social media post.

#### Text Enrichment

To overcome the ambiguity of words, we use the Lesk algorithm [38] for word sense disambiguation. Assuming that words in a particular text section (i.e., a message in our case) are likely to share a common topic, it compares the definitions of each term in the section to determine the most likely sense of the word. In particular, we use the Adapted Lesk algorithm [39], implemented in the NLTK library, which incorporates WordNet (https://wordnet.princeton.edu/)'s lexical database. For each term in the social media post, this phase outputs its WordNet sense and the list of synonyms.
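The idea behind Lesk-style disambiguation can be shown with a minimal sketch of the *simplified* Lesk variant, which picks the sense whose definition shares the most words with the message. This is not the Adapted Lesk implementation used in the pipeline, and the sense identifiers and glosses below are a hypothetical two-entry inventory rather than WordNet data.

```python
from typing import Dict, List, Optional

# Hypothetical mini sense inventory: word -> {sense id -> gloss}.
SENSES: Dict[str, Dict[str, str]] = {
    "tram": {
        "tram.vehicle": "a wheeled vehicle that runs on rails through city "
                        "streets and carries passengers",
        "tram.cable": "a conveyance that transports passengers or freight "
                      "in carriers suspended from cables",
    }
}


def simplified_lesk(word: str, context_tokens: List[str]) -> Optional[str]:
    """Return the sense whose gloss overlaps most with the context words."""
    context = set(context_tokens) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


sense = simplified_lesk(
    "tram", ["took", "the", "tram", "through", "the", "city", "streets"]
)
```

Here the context shares "through", "city", and "streets" with the first gloss and nothing with the second, so the vehicle sense is selected.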

#### Image Enrichment

In this phase, state-of-the-art image processing techniques are applied to provide annotations on objects and scenes that are recognized in the images.

We include both object and scene recognition models, because they provide complementary information. For instance, the objects recognized in the example in Figure 5a (e.g., various tableware) may indicate a food consumption activity. The scene recognition in Figure 5b, on the other hand, recognizes a cafeteria scenario, suggesting a leisure activity.

**Figure 5.** Differences in computer vision techniques applied to the same image; (**a**) uses an object recognition method that detects person, dining table, cup (2*x*), knife (2*x*), bowl (5*x*), while (**b**) uses a scene recognition method that extracts dining hall, cafeteria, and delicatessen annotations.

For the image object recognition, we use a state-of-the-art pre-trained model based on the regional convolutional neural network Mask R-CNN [40] trained on the Microsoft Common Objects in Context (MS COCO) dataset using the mask\_rcnn\_coco.h5 weights (https://github.com/matterport/Mask\_RCNN/releases).

For the scene recognition, we incorporated the neural network model based on the ResNet50 backbone (https://github.com/CSAILVision/places365), which is pre-trained on the Places (http://places2.csail.mit.edu/index.html) data set.

#### Place Enrichment

In this phase, we extract the category of the place where the post was published, because it could be an indicator for the category of the energy-consuming activity. We also compute the distance from the previous post created by the user to infer how far they have traveled, and thus whether the post also refers to an energy-consuming activity related to mobility.

For the first case, we look to retrieve more information by matching the location of the social media post with the venues in Google Places and Foursquare. Numerous studies have investigated place matching; [41] found that the mean great circle distance between two matched Points of Interest (POIs) was equal to 62.8 m and in [42] a buffer area with a radius of 25 m (per POI) was used to reduce geocoding errors. Based on these values, we use a radius of 50 m. If a match is found, the corresponding place details are requested to collect one or more place categories.
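The radius-based matching can be sketched with the haversine great-circle distance. The 50 m radius comes from the text; the venue records, field names, and the "nearest venue wins" tie-breaking are assumptions for illustration and do not reflect the actual Google Places or Foursquare responses.

```python
import math
from typing import Dict, List, Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_place(lat: float, lon: float, venues: List[Dict],
                radius_m: float = 50.0) -> Optional[Dict]:
    """Return the nearest venue within radius_m of the post, or None."""
    best, best_dist = None, radius_m
    for venue in venues:
        dist = haversine_m(lat, lon, venue["lat"], venue["lon"])
        if dist <= best_dist:
            best, best_dist = venue, dist
    return best


# Hypothetical venue records (coordinates are illustrative).
venues = [
    {"name": "cafe", "lat": 52.3703, "lon": 4.8900, "categories": ["cafe"]},
    {"name": "station", "lat": 52.3800, "lon": 4.9000, "categories": ["train station"]},
]
match = match_place(52.3700, 4.8900, venues)
```

In the example, the cafe lies about 33 m from the post and is matched, while the station is more than a kilometre away and is ignored.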

Moreover, once we have an overview of all the places a user has checked in, we infer the user's home location by using spatial clustering. Then, we estimate the distances between the home and other location check-ins. To estimate the home, we use the density-based spatial clustering of applications with noise (DBSCAN, [43]). It separates high-density clusters from low-density ones and marks outlier points lying alone in low-density areas (whose nearest neighbors are too far away). We assume that the location of a user's home will be a relatively small-sized, high-density area, whereas at other places fewer check-ins take place, resulting in areas of low density.
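A compact, standard-library-only sketch of this step is given below. The DBSCAN logic follows the usual definition of the algorithm, but the parameter values (`eps_m`, `min_pts`) and the rule "home is the centroid of the largest cluster" are assumptions made for the example; the text does not specify them.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi, dlmb = math.radians(b[0] - a[0]), math.radians(b[1] - a[1])
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6_371_000.0 * math.asin(math.sqrt(h))


def dbscan(points: List[Point], eps_m: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise becomes a border point
            if labels[j] is not None:
                continue                # already assigned, do not expand
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: expand cluster
    return [label if label is not None else -1 for label in labels]


def estimate_home(points: List[Point], eps_m: float = 100.0,
                  min_pts: int = 3) -> Optional[Point]:
    """Centroid of the largest cluster (assumed to be the home location)."""
    labels = dbscan(points, eps_m, min_pts)
    clustered = [label for label in labels if label != -1]
    if not clustered:
        return None
    biggest = max(set(clustered), key=clustered.count)
    members = [p for p, label in zip(points, labels) if label == biggest]
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))


# Illustrative check-ins: four close together, two isolated ones.
check_ins = [(52.3700, 4.8900), (52.3701, 4.8901), (52.3699, 4.8900),
             (52.3700, 4.8899), (52.3900, 4.9200), (52.3400, 4.8500)]
cluster_labels = dbscan(check_ins, eps_m=100.0, min_pts=3)
home = estimate_home(check_ins)
```

The four nearby check-ins form one dense cluster, whereas the two isolated ones are marked as noise, which matches the assumption that check-ins away from home are sparse.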

#### 2.2.3. Classification

We apply a hybrid dictionary and rule-based classification approach to determine whether a social media post refers to one or more energy-consuming activities.

We used a custom rule/dictionary-based approach instead of a state-of-the-art classifier for two main reasons. First, traditional classification approaches need a large set of manually annotated data for training; to the best of our knowledge, such a dataset does not exist, and its creation is beyond the scope of this work. Second, while lacking generalization, a rule-based approach performs better in a narrow domain.

We define a dictionary as a set of terms related to a specific energy-consuming activity type (e.g., ingredients or cooking utensils are associated with the food consumption category). Thus, each category of energy-consuming activities has a distinct dictionary. The basic idea is to compare the terms extracted from the message (text tokens), image (annotations), and place (categories) to the terms in the dictionary. For now, a distinct dictionary for each of these types of data is constructed. Undoubtedly, this comes with some hassle, but it also rules out ambiguity to some extent: for example, the text token "tram" might imply a mobility activity, whereas the image annotation "tram" could also point at a tram in the background that is unrelated to the user's activity.

For the text dictionaries, we reuse the ones created in [44], where the authors use a hybrid dictionary-similarity distant supervision with the purpose of classifying Twitter content as energy consumption-related content. We further expand the dictionaries by adding the corresponding synonyms.

The image dictionary is composed of the predefined list of classes of the pre-trained models. The classes are manually assigned to none, one, or more of the different categories of energy-consuming activities. For instance, "television" relates to both dwelling and leisure and is part of both dictionaries, whereas "person" does not indicate any energy-consuming activity and is therefore not included in any dictionary.

Like the image annotations, the sets of place categories are also predefined. As all place categories that could possibly be assigned to a place are known, these can be categorized in the same manner as the image annotation classes, by manually linking the place category to the energy-consuming category (e.g., a "restaurant" place category is part of both the food consumption and leisure dictionaries).

The dictionaries are available on the companion website (http://social-glass.tudelft.nl/socialsmart-meter/#dictionary).

Then, the post is classified according to the rules illustrated in Figure 6. For each term, we identify if it is evidence (i.e., it appears in one of the dictionaries) for one or more energy-consuming activities. In case a leisure or food consumption activity is performed at home, we can classify it to dwelling as well. Furthermore, if a food consumption activity is performed at some place other than home, we classify it as a leisure activity.

**Figure 6.** Illustration of the rule-based approach.
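The dictionary lookup and the two location-dependent rules described above can be sketched as follows. The miniature dictionaries are hypothetical stand-ins for the published ones, and the function signature (including the `at_home` flag) is an assumption for illustration.

```python
from typing import Dict, Iterable, Set

# Hypothetical miniature dictionaries, one per data type:
# term -> categories of energy-consuming activities it is evidence for.
DICTIONARIES: Dict[str, Dict[str, Set[str]]] = {
    "text": {"dinner": {"food"}, "tram": {"mobility"}, "movie": {"leisure"}},
    "image": {"bowl": {"food"}, "television": {"dwelling", "leisure"}},
    "place": {"restaurant": {"food", "leisure"}, "airport": {"mobility"}},
}


def classify(text_tokens: Iterable[str], image_annotations: Iterable[str],
             place_categories: Iterable[str], at_home: bool) -> Set[str]:
    """Return the set of activity categories a post gives evidence for."""
    categories: Set[str] = set()
    sources = (("text", text_tokens), ("image", image_annotations),
               ("place", place_categories))
    for data_type, terms in sources:
        for term in terms:
            categories |= DICTIONARIES[data_type].get(term, set())
    # Rule: leisure or food consumption performed at home is also dwelling.
    if at_home and categories & {"leisure", "food"}:
        categories.add("dwelling")
    # Rule: food consumption away from home is also a leisure activity.
    if "food" in categories and not at_home:
        categories.add("leisure")
    return categories


away = classify(["dinner"], ["bowl"], [], at_home=False)
indoors = classify(["movie"], [], [], at_home=True)
```

Note that the image annotation "person" is simply absent from every dictionary, so it contributes no evidence.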

Then, we look at the user's distance to his or her previous post. If it exceeds the threshold of 0.2 km (This value was found after several test iterations of our pipeline. It seems to provide the best trade-off between precision and recall in our context), we consider it to be a mobility activity. Along with that, we analyze whether a vehicle was required to bridge this distance. If so, the mode of transport can be inferred—e.g., if the distance traveled in a day is more than 5000 km, it is very likely the individual traveled by aircraft to cover that distance.
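A minimal sketch of this distance rule is shown below. The 0.2 km threshold and the 5000 km per day example come from the text; the function name, its return structure, and the "unknown" fallback for other modes of transport are assumptions.

```python
from typing import Dict, Optional

MOBILITY_THRESHOLD_KM = 0.2   # distance to previous post, from the text
AIRCRAFT_DAILY_KM = 5000.0    # daily distance suggesting air travel, from the text


def infer_mobility(distance_km: float,
                   distance_in_day_km: float) -> Dict[str, Optional[object]]:
    """Flag a mobility activity and, where possible, guess the mode."""
    if distance_km <= MOBILITY_THRESHOLD_KM:
        return {"mobility": False, "mode": None}
    # Only the aircraft rule is given as an example in the text; other
    # modes would need further rules, so they are left as "unknown" here.
    mode = "aircraft" if distance_in_day_km > AIRCRAFT_DAILY_KM else "unknown"
    return {"mobility": True, "mode": mode}


short_walk = infer_mobility(0.1, 0.1)
flight = infer_mobility(7000.0, 7000.0)
```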

Given the noisy nature of social media posts we tried to model the confidence of our classifier based on three parameters: (i) the ratio of relevant tokens, distinguished on type of data (text, image, place), (ii) for each term a score indicating its relevance to the category of energy-consuming activities, and (iii) a weighted factor that represents to what extent the type of data is informative for this category of energy-consuming activities. For instance, it is hard to recognize a mobility activity from an image, since individuals do not often post images of objects such as a transportation means while traveling. A check-in which is based on a mobility-related place such as an airport or train station would be far more indicative in that situation. On the contrary, if individuals perform a food consumption activity, they are more likely to post images in which food objects can be recognized.

Taking all the above into account, the calculation of our classification confidence is formulated as follows:

$$\begin{aligned} \text{confidence}_{x} &= \sum_{y} \left(\frac{N_{\text{relevant},x,y}}{N_{\text{relevant},y}} \cdot w_{x,y} \cdot \frac{1}{N_{\text{relevant},x,y}} \sum_{x} \text{scores}_{x,y}\right) \\ &= \sum_{y} \left(\frac{1}{N_{\text{relevant},y}} \cdot w_{x,y} \cdot \sum_{x} \text{scores}_{x,y}\right) \end{aligned} \tag{1}$$

where $N_{\text{relevant}}$ is the number of relevant terms, $w$ is the weighted factor, $x$ is the type of energy-consuming activity, $y$ is the type of data (text, image, or place), and scores is the vector of the scores (∈ [0, 1]) of all relevant terms.

The relevance scores of the terms ($\text{scores}_{x,y}$) are determined separately for each type of data. For a text token, the relevance is computed as the similarity between the term vectors and the word vectors included in the dictionaries obtained using Word2Vec [45] (a model used for learning vector representations of words, called "word embeddings"), whereas for an image annotation this is equal to the annotation score assigned by the object or scene recognition model.

For a place category, this score is binary (either 0 or 1), depending on whether the place category occurs in the dictionary.
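The simplified form of Equation (1) can be sketched in a few lines. The sketch assumes that, for each data type, the denominator counts all relevant terms of that type and the inner sum runs over the scores of the terms relevant to the category at hand; the weights and scores in the example are made-up values, not the ones from Table 5.

```python
from typing import Dict, List


def confidence(scores: Dict[str, List[float]],
               n_relevant: Dict[str, int],
               weights: Dict[str, float]) -> float:
    """Confidence for one category x, following the second line of Eq. (1).

    scores[y]     -- relevance scores (in [0, 1]) of the terms of data type y
                     that are relevant to category x
    n_relevant[y] -- number of relevant terms of data type y
    weights[y]    -- weighted factor w_{x,y} for category x and data type y
    """
    total = 0.0
    for data_type, term_scores in scores.items():
        n = n_relevant.get(data_type, 0)
        if n == 0:
            continue  # no relevant terms of this type: no contribution
        total += weights[data_type] * sum(term_scores) / n
    return total


# Hypothetical inputs for the food consumption category.
food_scores = {"text": [0.8], "image": [0.9, 0.7], "place": [1.0]}
relevant_counts = {"text": 2, "image": 2, "place": 1}
food_weights = {"text": 0.3, "image": 0.4, "place": 0.3}
food_confidence = confidence(food_scores, relevant_counts, food_weights)
```

With these inputs the three data types contribute 0.12, 0.32, and 0.30, respectively, so the post would pass an initial classification threshold of 0.5.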

To avoid possible bias due to our personal opinion, we decided to use an online survey to tune the weights ($w_{x,y}$). We showed social media posts and asked the participants to rate each data type according to its informativeness on a scale from 0 to 10 (*Not informative at all* to *Very Informative*). Figure A1a in Appendix B shows an example of a question that was asked.

The users' average rankings are displayed in Table 5 and were adopted as data type weights in the classification module in the data processing pipeline for our case study. The weight values do not deviate a lot from each other. Yet, we observe that the users find images most and places least informative to describe dwelling activities. The same applies to food consumption activities.

Finally, the classifier confidence for a category *x* is the average of the contribution of each *y* data type. In future work, we will examine whether other strategies (such as taking the maximum or minimum instead of the average) provide better results.

**Table 5.** The weighted factors obtained by asking the user opinions.


Hereafter, an initial threshold of 0.5 is applied to determine to which categories of energy-consuming activities the social media post is classified. This threshold value is then tuned to optimize the framework's performance.

#### 2.2.4. Linked Data Publishing

In this final step, the label obtained by the classifier and the data extracted from the enrichment module are combined to create instances of the SSMO Ontology from the social media posts.

To do so we use Triplewave [46], an open-source, reusable and generic tool for publishing linked data streams on the web using the JSON-LD format.

Listing 1 shows an example of an SSMO ontology instance created by our pipeline. This instance was created by processing the social media post shown as an example in Figure 1. Our pipeline determined that the post refers to three kinds of activities (i.e., *ssmo:leisure activity*, *ssmo:food activity* and *ssmo:mobility activity*), that they all take place in the venue (i.e., *ssmo:location*) of *Hotel de Goudfazant*, and that the post involves the consumption of cooked *fish*.

Listing 1: Example of JSON-LD created with Triplewave.

```
{
  "@context": {
    "ssmo": "http://www.semanticweb.org/roosdekok/ontologies/2018/1/ssm",
    "sioc": "http://rdfs.org/sioc/ns#",
    "sem": "http://semanco02.hs-albsig.de/repository/ontology-releases/eu/semanco/ontology/SEMANCO/HEAD/SEMANCO-HEAD.owl",
    "eu": "http://socsem.open.ac.uk/ontologies/eu#",
    "to": "http://www.co-ode.org/roberts/travel.owl",
    "foaf": "http://xmlns.com/foaf/0.1/"
  },
  "@id": "http://smm/i1",
  "ssmo:individual": {
    "@id": "http://instagram.com/userId",
    "ssmo:nickname": "username"
  },
  "sioc:post": {
    "@id": "http://instagram.com/postId",
    "dcterms:created": "2018-06-24",
    "sioc:content": "Great dinner at Hotel de Goudfazant in a old factory on north side of Amsterdam...",
    "sioc:hasCreator": "http://instagram.com/userId"
  },
  "ssmo:location": {
    "ssmo:categoryOfPlace": "Restaurant",
    "ssmo:address": "Aambeeldstraat 10, 1021 KB Amsterdam",
    "ssmo:name": "Hotel de Goudfazant",
    "@id": "https://www.google.nl/maps/place/Hotel+De+Goudfazant/"
  },
  "ssmo:leisure activity": {
    "@id": "http://ssm/lo1",
    "ssmo:isOfferedAt": "https://www.google.nl/maps/place/Hotel+De+Goudfazant/",
    "ssmo:reflectedBy": "http://instagram.com/postId",
    "ssmo:time": "2018-06-24"
  },
  "fo:food": {
    "ssmo:isConsumedIn": "http://ssm/fo1",
    "fo:ingridents": "fish"
  },
  "ssmo:food activity": {
    "@id": "http://ssm/fo1",
    "ssmo:isOfferedAt": "https://www.google.nl/maps/place/Hotel+De+Goudfazant/",
    "ssmo:reflectedBy": "http://instagram.com/postId",
    "ssmo:time": "2018-06-24"
  },
  "ssmo:mobility activity": {
    "@id": "http://ssm/mo1",
    "ssmo:isOfferedAt": "https://www.google.nl/maps/place/Hotel+De+Goudfazant/",
    "ssmo:reflectedBy": "http://instagram.com/postId",
    "ssmo:time": "2018-06-24"
  }
}
```
By publishing the data as linked data we allow interoperability with other services by sharing a common understanding of the energy-consuming activities domain. In this way, others can define custom queries in a standard language (e.g., the SPARQL Protocol and RDF Query Language (https://www.w3.org/TR/rdf-sparql-query/)) and perform ad-hoc aggregations to satisfy their own research needs.
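As a simple illustration of such an ad-hoc aggregation, the sketch below counts how often each activity type occurs across a set of published instances. It uses plain JSON parsing from the Python standard library as a stand-in for a SPARQL query over an RDF store; the example documents, their identifiers, and the key-matching rule are assumptions for the sketch.

```python
import json
from collections import Counter
from typing import Iterable


def count_activity_types(documents: Iterable[str]) -> Counter:
    """Count top-level activity keys across serialized JSON-LD instances."""
    counts: Counter = Counter()
    for raw in documents:
        instance = json.loads(raw)
        for key in instance:
            # Assumption: activity nodes use keys such as "ssmo:food activity".
            if key.startswith("ssmo:") and key.endswith("activity"):
                counts[key] += 1
    return counts


# Hypothetical published instances, reduced to the keys that matter here.
docs = [
    json.dumps({"@id": "http://example.org/i1",
                "ssmo:food activity": {}, "ssmo:leisure activity": {}}),
    json.dumps({"@id": "http://example.org/i2",
                "ssmo:mobility activity": {}}),
    json.dumps({"@id": "http://example.org/i3",
                "ssmo:food activity": {}}),
]
activity_counts = count_activity_types(docs)
```

A production setup would more likely load the instances into a triple store and express the same aggregation as a SPARQL `GROUP BY` query.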

#### **3. Evaluation**

Since the behavior regarding creating social media posts might differ between cities with a different culture, for our evaluation we conducted a study on the cities of Amsterdam and Istanbul.

#### *3.1. Dataset Collection*

We collected data from 22 June until 27 June, and 27 July until 28 July 2018. At first, only social media posts created in Amsterdam were collected to provide the first round of insights and tuning of our pipeline. Hereafter, social media posts created in Istanbul were collected as well to compare the results between the two cities. An overview of the numbers of collected social media posts is provided in Table 6.


**Table 6.** Number of collected social media posts per day.

We observe that, in general, more social media posts are created in Istanbul than in Amsterdam. Given that Istanbul's population is more than 15 times as large as Amsterdam's population, this is expected. In both cities, Instagram yielded more posts than Twitter.

#### *3.2. Performance Analysis*

The performance of the framework was evaluated using the standard metrics of precision, recall, accuracy, and F1-score. Precision is the ratio between the posts classified correctly in one of the categories and all the classified posts; recall is the ratio between the posts classified correctly in one of the categories and the full set of relevant posts. Accuracy is the fraction of posts correctly classified, also taking into account the true negatives (i.e., the posts correctly not classified in any category). Finally, the F1-score is the harmonic mean of the precision and recall.
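These definitions translate directly into code. The sketch below computes the four metrics from confusion-matrix counts; the counts in the example are invented, and the conventions for empty denominators (precision and recall set to zero, F1 left undefined) are assumptions chosen to mirror how such cases are reported later in this section.

```python
from typing import Dict, Optional


def classification_metrics(tp: int, fp: int, fn: int,
                           tn: int) -> Dict[str, Optional[float]]:
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean; undefined when precision + recall is zero.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else None)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Invented counts for illustration.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
```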

The ground truth was created through an online survey. We asked the participants to assess whether a social media post relates to an energy-consuming activity. We used a random sample of 100 social media posts and balanced the representation of each energy-consuming activity category. We collected 9 responses for each post, and the final categories were decided by majority vote.

Figure A1b in Appendix B shows an example of a question asked in the survey.
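The majority vote can be sketched as follows, assuming that each participant may select several categories for a post and that each category is decided independently by a strict majority (at least 5 of 9 responses). Both assumptions, as well as the example responses, are illustrative; the survey's exact aggregation rule is not detailed here.

```python
from collections import Counter
from typing import Iterable, List, Set

CATEGORIES = ("dwelling", "mobility", "food", "leisure")


def majority_labels(responses: List[Set[str]],
                    categories: Iterable[str] = CATEGORIES) -> Set[str]:
    """Keep every category selected by a strict majority of respondents."""
    needed = len(responses) // 2 + 1
    counts = Counter(cat for response in responses for cat in response)
    return {cat for cat in categories if counts[cat] >= needed}


# Nine invented responses for one post.
responses = [
    {"food", "leisure"}, {"food"}, {"food", "leisure"},
    {"leisure"}, {"food"}, {"food", "mobility"},
    {"food", "leisure"}, set(), {"leisure"},
]
ground_truth = majority_labels(responses)
```

In this example "food" is chosen six times and "leisure" five times, so both are kept, while "mobility" (one vote) is discarded.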

Tables 7–9 summarize the evaluation metric values for each category of energy-consuming activities individually, as well as for the total. The evaluation metrics are calculated for different classification thresholds (from 0.3 to 0.7), to find the best-performing one. The framework's overall accuracy varies from 0.69 to 0.78. The accuracy for the classification of leisure activities is relatively low compared to the other categories due to many false negatives—i.e., social media posts that are not classified to leisure while, based on ground truth, they should be. Furthermore, the precision for dwelling activities is rather low whereas the accuracy is relatively high due to many true negatives—i.e., social media posts that (based on ground truth) do not refer to dwelling activities and are indeed not classified to this category by our classification model.

In Figure 7, the evaluation metric scores are plotted for the different threshold values. As expected, the recall scores decrease as the threshold increases, i.e., fewer relevant social media posts have confidence scores high enough to exceed the threshold. As for the precision, we observe that the scores fluctuate across threshold values. Increasing the threshold results in fewer true positives, as well as fewer false positives. However, the numbers of true and false positives do not decrease proportionally. Also, there are very few social media posts with a high confidence score for dwelling. For a threshold greater than 0.4, the precision is zero for dwelling because no post was classified as such.
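The effect of the threshold on precision and recall can be reproduced on toy data with a short sweep. The confidence scores and ground-truth labels below are invented, and treating a score equal to the threshold as a positive is an assumption.

```python
from typing import Dict, List


def sweep(confidences: List[float], truth: List[bool],
          thresholds: List[float]) -> List[Dict[str, float]]:
    """Precision and recall of one category at each threshold."""
    rows = []
    for threshold in thresholds:
        predicted = [c >= threshold for c in confidences]
        tp = sum(p and t for p, t in zip(predicted, truth))
        fp = sum(p and not t for p, t in zip(predicted, truth))
        fn = sum((not p) and t for p, t in zip(predicted, truth))
        rows.append({
            "threshold": threshold,
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        })
    return rows


# Invented confidence scores and ground-truth labels for six posts.
confidences = [0.9, 0.6, 0.45, 0.35, 0.32, 0.2]
truth = [True, True, False, True, False, False]
results = sweep(confidences, truth, [0.3, 0.4, 0.5])
```

On this toy sample, recall never increases as the threshold grows, while precision depends on how many false positives are removed at each step, which is why it need not change smoothly.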





**Table 8.** Precision and recall values for each energy-consuming activity category at varying threshold values. The precision and recall values for the Dwelling category for thresholds greater than 0.4 are 0 because no posts were classified in that category.

Based on Figure 7, a threshold of either 0.30 or 0.35 appears to result in the best performance. For a threshold of 0.30, a precision of 0.59 is obtained, whereas a threshold of 0.35 results in a precision of 0.73. Furthermore, these thresholds (0.30 and 0.35) respectively result in recall scores of 0.63 and 0.47 and in F1-scores of 0.60 and 0.54. Based on the F1-score, a threshold of 0.30 seems to perform better. Yet, whether a higher precision or a higher recall score is more important depends on the context, i.e., whether it is more important to classify as many social media posts as possible correctly or to discover as many as possible that refer to energy-consuming activities. In case the quantity of energy (in terms of kWh consumption or CO2 emission) during an activity is analyzed, a higher precision is considered more beneficial. However, when a qualitative overview of all energy-consuming activities performed by an individual is required, it is more advantageous to have a higher recall score. For our case study, a threshold of 0.35 was selected.
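The trade-off between precision and recall across thresholds can be sketched for a single category as follows. The confidence scores and labels are invented for illustration, and the function name `precision_recall_f1` is hypothetical. Unlike Table 8, which reports 0 when no post is classified, this sketch returns `None` for any metric whose denominator is zero.

```python
def precision_recall_f1(scores, truth, threshold):
    """Precision, recall and F1 for one category at one threshold.

    A metric is None when it is undefined (its denominator is zero).
    """
    predicted = [s > threshold for s in scores]  # a post must exceed the threshold
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum((not p) and t for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confidence scores and ground-truth labels for six posts.
scores = [0.82, 0.41, 0.36, 0.33, 0.28, 0.10]
truth = [True, True, False, True, False, False]

for threshold in (0.30, 0.35, 0.50, 0.90):
    print(threshold, precision_recall_f1(scores, truth, threshold))
```

In this toy data, raising the threshold from 0.30 to 0.50 lowers recall from 1.0 to 1/3 while precision rises from 0.75 to 1.0; at 0.90 no post is classified, so precision and F1 become undefined.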


**Table 9.** The F1-score value for each energy-consuming activity category at varying threshold values. The values for the Dwelling category for thresholds greater than 0.4 are undefined because no posts were classified in that category.

#### *3.3. Use Case*

In this section, we take a closer look at the posts that were classified into any of the four energy-consuming activity categories.

We collected the posts regardless of the language. In the analysis, for Amsterdam we consider the terms in English and Dutch, while for Istanbul we consider the terms in English and Turkish. Notice that the terms in different languages are needed only for the textual part of the social media posts, and not for the image labels and place categories.

For the text processing, we used three pre-trained embeddings: for English, we used the model trained on the Google News corpus (https://github.com/mmihaltz/word2vec-GoogleNews-vectors); for Dutch, we used a model trained on the combined dataset of Wikipedia (https://dumps.wikimedia.org/nlwiki/20150703), Sonar500 (http://hdl.handle.net/2066/151880) and the Roularta corpus (a set of articles from the publishing consortium http://www.roularta.be/en) [47]; and for Turkish, we used a model trained on the Turkish Wikipedia dataset (https://github.com/akoksal/Turkish-Word2Vec).

Table 10 shows the percentage of each category of energy-consuming activities for both cities. In general, we observe that few social media posts are classified to dwelling. Our rule-based classification approach demands evidence for the user being at home before it classifies a post to dwelling. It is very difficult to derive this evidence from the social media post because people rarely check in at their own home.


**Table 10.** Percentage of classified social media posts per category of energy-consuming activity.

For both Amsterdam and Istanbul, the leisure category has the largest share (approximately 40%) compared to the other categories. The mobility category has the second largest share (approximately 30%). The category of food consumption has a rather small share (approximately 20%). However, nearly all social media posts that are classified to food consumption are also classified to leisure based on the rule-based approach—a food consumption activity that is performed at some other place than home is also considered a leisure activity. This explains why the share of the leisure category is more than twice as large as the share of the food consumption category.

The spatial distribution of social media posts classified to energy-consuming activities differs between the two cities. For Amsterdam (Figure 8a), most social media posts are created around the city center; the neighborhood with the highest density (Burgwallen-Nieuwe Zijde) also includes the city center. For Istanbul (Figure 8b), multiple neighborhoods share a high number of energy-consuming activities: Başakşehir and Beşiktaş on the European part of the city and Kadıköy on the Asian part.

**Figure 8.** Overall distribution of energy-consuming activities of Amsterdam (**a**) and Istanbul (**b**).

#### 3.3.1. Dwelling

For both cities, few social media posts are classified to dwelling. For Amsterdam (Figure 9a), the posts in this category were mainly created in the city center, while in Istanbul (Figure 9b), the posts are more evenly distributed with a higher concentration in the European part of the city (especially in the Başakşehir district).

**Figure 9.** Map visualizing the distribution of social media posts; (**a**,**b**) refer to dwelling, (**c**,**d**) refer to food consumption, (**e**,**f**) refer to leisure and (**g**,**h**) refer to mobility.

As shown in Figure 10, the text terms that are most informative for a dwelling activity in Amsterdam are "House", "TV", and "gaming". In images, "tv", "laptop", and "keyboard" are the most frequently recognized objects that indicate a dwelling activity for both cities. These seem to indicate either recreational or work activities.

There are no place terms related to this type of activity because houses do not have a category in the sources used in the data enrichment phase.

**Figure 10.** Bar charts visualizing the most occurring terms in social media posts classified to dwelling activities in Amsterdam (**a**) and Istanbul (**b**). For readability purposes, in the figures we show only English terms.

#### 3.3.2. Food Consumption

As shown in Figure 9c, the city of Amsterdam shows the highest concentration of food energy-consuming activities in the city center. On the other hand, Istanbul, as shown in Figure 9d, shows peaks in the Beşiktaş district and in the northern neighborhoods.

Based on the most frequent terms in Figure 11a,b, images seem to be most informative for identifying food consumption activities. Furthermore, "food" and "coffee" were the most frequent text terms indicating a food consumption activity in both cities. Besides that, individuals appear to create food consumption-related posts most often while checking in at a "Bar" (Amsterdam), "Cafe" (both cities) or "Restaurant" (both cities).

#### 3.3.3. Leisure

In Figure 9e, the social media posts in Amsterdam classified to leisure activities seem to be more evenly distributed over the different neighborhoods. When zooming in on a few neighborhoods (Burgwallen-Nieuwe Zijde, Museumkwartier, and Amstel III/Bullewijk), some interesting observations can be made.

In general, the city center (Burgwallen-Nieuwe Zijde) is characterized by many tourists, who are partying, visiting the flower markets, going to museums, or enjoying the canals, among other things. This is reflected in the most frequent terms: "night", "holiday", "party" (text), "Flower Shop", "Art Museum", and "Hotel" (place) are some terms that match these activities.

Museumkwartier is the neighborhood where many of Amsterdam's most famous museums are situated. In fact, we find that the top occurring terms are related to these museums: "museum" (text), "art\_gallery" and "museum/indoor" (image), and "Art Museum" (place).

Amstel III/Bullewijk is known for Amsterdam's soccer stadium and the major concert halls. As expected, the top occurring terms are: "concert" and "music" (text), "arena/performance" and "stage/indoor" (image), and "Concert Hall" and "Soccer Stadium" (place).

The distribution of the leisure-related social media posts over Istanbul's neighborhoods (Figure 9f) is rather similar to the food consumption-related one: most dense in the center and west of it (the Başakşehir district, where the stadium of the homonymous soccer team is also located). Interestingly, as shown in Figure 12, it seems that in Istanbul the majority of leisure activities take place in shopping malls.

**Figure 11.** Bar charts visualizing the most occurring terms in social media posts classified to food consumption activities in Amsterdam (**a**) and Istanbul (**b**). For readability purposes, in the figures we show only English terms.

**Figure 12.** Bar charts visualizing the most occurring terms in social media posts classified to leisure activities in Amsterdam (**a**) and Istanbul (**b**). For readability purposes, in the figures we show only English terms.

#### 3.3.4. Mobility

Since Amsterdam's train station is situated in the city center, it makes sense that this neighborhood is most dense regarding the count of social media posts classified to mobility (Figure 9g). This is also due to the canal trips in the city center that individuals (mainly tourists) tend to post about.

In Figure 9h, two of the western neighborhoods (Başakşehir and Eyüp) are the densest regarding mobility activities. These neighborhoods contain multiple highways as well as a large highway junction (Eyüp, in particular, connects the Black Sea to the Golden Horn). If we look at the terms (Figure 13), we notice that more terms related to transportation by car are present in Istanbul (e.g., Gas Station, Car Wash, parking\_lot, car, etc.).

**Figure 13.** Bar charts visualizing the most occurring terms in social media posts classified to mobility activities in Amsterdam (**a**) and Istanbul (**b**). For readability purposes, in the figures we show only English terms.

If we compare the frequencies of displacements of both cities (Figure 14) we can observe that while in Amsterdam people tend to travel for short distances (between 1 and 5 km), in Istanbul the chart shows a long tail distribution. Since Istanbul is significantly larger in size than Amsterdam, this is in line with our expectations.

**Figure 14.** Bar chart visualizing the frequency of displacements (average distance between posts in kilometers) in Amsterdam and Istanbul.

#### 3.3.5. Discussion

In both cities, few social media posts referring to dwelling activities were captured by the framework. This may be because social media users do not consider their regular domestic activities interesting enough to be shared with other social media users.

More posts related to food consumption were captured, but, by looking at the most occurring terms, they seem to occur out of home.

Then, as expected from the typical usage of social media, we detected many posts related to leisure energy-consuming activities. Moreover, they seem to reflect the types of venue present in a particular district. For instance, in the Museumkwartier neighborhood in Amsterdam, we identified many social media posts referring to museums and art.

Finally, people do not create explicit social media content about their mobility activities. When they are traveling, they are more likely to create content about the activities they performed before. However, we can use the distance between posts to detect if a transportation activity occurred.
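Detecting a transportation activity from the distance between posts can be sketched with the haversine formula. This is an illustrative sketch, not the pipeline's implementation: the post structure, the approximate coordinates, and the 1 km cut-off `MIN_TRIP_KM` are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def displacements_km(posts):
    """Distances between consecutive posts of one user, ordered by time."""
    ordered = sorted(posts, key=lambda p: p["time"])
    return [
        haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
        for a, b in zip(ordered, ordered[1:])
    ]


# Hypothetical geotagged posts of one user (approximate Amsterdam locations).
posts = [
    {"time": 1, "lat": 52.3791, "lon": 4.9003},  # near the central station
    {"time": 2, "lat": 52.3600, "lon": 4.8852},  # near the museum quarter
    {"time": 3, "lat": 52.3143, "lon": 4.9419},  # near the soccer stadium
]

MIN_TRIP_KM = 1.0  # assumed cut-off below which no trip is inferred
distances = displacements_km(posts)
trips = [d for d in distances if d >= MIN_TRIP_KM]
print([round(d, 1) for d in distances])
```

The great-circle distance underestimates the distance actually travelled, so it is best read as a lower bound on the displacement.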

Even if the two cities present the same ratio of energy-consuming activities, they show a different geographical distribution: while in Amsterdam the activities are localized near the city center and in Amstel III/Bullewijk (where the soccer stadium and the major concert halls are located), in Istanbul the activities are distributed over different neighborhoods, mainly Başakşehir, Beşiktaş, and Kadıköy. This is probably due to the different features of the two cities: Amsterdam has a well-defined center, where the main venues are located, while Istanbul, also given its different size, has them scattered in various parts of the city.

By looking at the most occurring terms, we notice a small difference between the characterization of the energy-consuming activities in the two cities. In the food category, we can see place categories more related to Turkish cuisine (e.g., Turkish restaurant and kebab restaurant), and many leisure activities in Istanbul seem to take place in shopping malls. Finally, for the mobility category, in Istanbul, we notice a higher occurrence of terms related to transportation by car.

Summarizing, our pipeline can detect more activities that fall in the broad category of indirect energy-consuming activities, that is, as mentioned in Section 1, activities related to the production, transportation, and disposal of a variety of consumer goods and services [12]. As expected from the typical usage of social media, people post when they are partying or having a fancy dinner out; they share their domestic activities more rarely. Nevertheless, this should not be seen as a flaw of our approach; rather, it suggests that social media can indeed be used as a **complementary** source of information regarding energy-consuming activities. In fact, domestic activities are already partially captured by traditional data sources, while the indirect ones are either neglected [11] or the methods used for collecting them have low temporal resolution and are costly (e.g., surveys).

Moreover, our coverage of activity types can be improved by including additional data sources. For instance, the Steam (https://steamcommunity.com/) gaming community or the Spotify (https://www.spotify.com/nl/) music streaming provider are more likely to be used for sharing data on dwelling activities, such as gaming or playing music.

#### 3.3.6. Limitations

We acknowledge that our approach is not free from limitations. Social media are inherently biased: they are used by only a subset of the population (e.g., youths, tourists, etc.) and for purposes different from sharing energy-consuming activities. Moreover, the information shared on social media is often ambiguous and noisy (e.g., a picture of a tram does not mean that the user is traveling). The issue of ambiguity and noise is partially mitigated by our rule-based approach, which shows promising performance. However, the goal of this work is to investigate to what extent social media can be used as a complementary source of information for energy-consuming activities. A study of demographic representation is left to future work. Language can be an issue when applying our method in areas where English is not the native language. However, this is addressed with multi-language dictionaries and by the use of embeddings trained on the main language spoken in the considered area (e.g., Dutch for Amsterdam). In addition, this issue only concerns the analysis of the text of the social media post, and not the image or the location.

#### **4. Conclusions**

In this paper, we proposed a framework to automatically identify and describe energy-consuming activities from social media posts. This framework is composed of an ontology that provides a better understanding of the domain of energy-consuming activities and a data processing pipeline that classifies social media posts into the different categories.

Future work will focus on the improvement of the enrichment module of the framework. For instance, entity extraction can be employed to understand whether a word refers to a place (instead of only taking the place check-in into account) to increase the number of geolocated posts processed by the pipeline.

Moreover, our rule-based approach could be used to generate large training sets for a classifier in a distant-supervision fashion.

As mentioned in the previous section, other data sources will be investigated to increase the coverage of types of energy-consuming activity, with a focus on dwelling.

A further validation will be performed by looking at the correspondence with more traditional sources (e.g., surveys, smart meter data, etc.).

We will also investigate methods to link the information extracted from the social media post to concrete values of energy consumption (in terms of e.g., kWh or CO2 emissions).

**Author Contributions:** R.d.K. carried out the design of the framework, the evaluation, and the writing. A.M. helped with the design and contributed to the writing of the article. A.B. supervised all the steps of this work and revised the text.

**Funding:** This work is partially funded by the JPI Urban Europe Project CODALoop (Project no. 646453).

**Acknowledgments:** This work is supported by the Amsterdam Institute for Advanced Metropolitan Solutions (AMS Institute) and it was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Ontology Requirements**

**Table A1.** Competency Questions that form the set of functional requirements for the SSMO ontology.


## **Appendix B. User Online Survey**

**Figure A1.** Example of question for tuning the weights (**a**) and creating the groundtruth (**b**).

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article d2ix***: A Model Input-Data Management and Analysis Tool for MESSAGE***ix*

#### **Thomas Zipperle \*,† and Clara Luisa Orthofer †**

Chair of Energy Economy and Application Technology, Department of Electrical and Computer Engineering, Technical University of Munich, 80333 Munich, Germany; clara.orthofer@tum.de

**\*** Correspondence: thomas.zipperle@tum.de

† These authors contributed equally to this work.

Received: 7 March 2019; Accepted: 15 April 2019; Published: 18 April 2019

**Abstract:** Bottom-up integrated assessment models, like MESSAGE*ix*, depend on the description of the capabilities and limitations of technological, economical and ecological parameters, and their development over long-time horizons. Even small models of a few nodes, technologies and model years require input-data sets involving several hundred thousand data points. Such data sets quickly become incomprehensible, which makes error detection, collaborative working and the interpretation of results challenging, especially for non-self-created models. In response to the resulting need for manageable, comprehensible, and traceable representation of input-data, we developed a Python-based spreadsheet interface (*d2ix*) that enables presentation and editing of model input-data in a concise form. By increasing accessibility and transparency of the model input-data, *d2ix* reduces barriers to entry for new modellers and simplifies collaborative working. This paper describes the methodology and introduces the open-source Python-package *d2ix*. The package is available under the Apache License, Version 2.0 on GitHub.

**Keywords:** MESSAGE*ix*; reproducibility; collaborative work; open modelling and data; data-handling; integrated assessment modelling; data pre- and post-processing

#### **1. Introduction**

The software package described in the following —*d2ix*— is freely available under the Apache License, Version 2.0 on GitHub under: https://github.com/tum-ewk/d2ix.

#### *1.1. Input-Data-Handling—The Underrated Modelling Challenge*

Technology-based integrated assessment models, such as MESSAGE*ix* (formerly known as MESSAGE) have a long history in energy and environmental systems modelling [1,2]. Despite having been developed in times of relatively low computing power, over the last forty years these models have grown in line with multiplying computing capacity, expanding models in dimensions such as coverage and detail [3]. Until the 1990s, the models focused on the energy-system only [4]; however, today's energy-engineering-economic-environment optimising models are designed to describe the full extent of energy-system dynamics, including effects such as polluting greenhouse gas emissions, economic development, land and water use and health implications [5,6]. At the same time, rising computing power allows not only increasing coverage but also magnifies the level of detail represented in models, such as the number of model years, nodes, technologies and technology parameters.

In line with this structural change, big data has also become a challenge in energy-systems modelling: here too, the amount of input-data has skyrocketed [7]. Today, one technology in MESSAGE*ix* is described by forty parameters, of which fourteen are defined not only by the installation year of the technology but also by the age of the technology. Thus, even in very simple input-data sets (e.g., describing one node over ten model years), each technology is defined by approximately one thousand input parameters, each again defined by up to twelve sets. Therefore, even a small model of one node over ten years has an input-data set per technology of more than twelve thousand data points, not including the input-data for describing the ecology or economy (Figure 1).
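One way to arrive at these orders of magnitude is sketched below. The counting scheme is an assumption made for illustration, not the authors' stated method: the 26 parameters without an age dimension are counted once per model year, the 14 age-dependent parameters once per pair of installation year and later-or-equal activity year, and technology lifetime is ignored.

```python
def parameter_entries(n_years, n_params=40, n_vintage_params=14, n_nodes=1):
    """Estimate parameter entries per technology under the assumed scheme."""
    single_year = (n_params - n_vintage_params) * n_years
    # Pairs (installation year, activity year) with activity >= installation.
    vintage_pairs = n_years * (n_years + 1) // 2
    return n_nodes * (single_year + n_vintage_params * vintage_pairs)


entries = parameter_entries(n_years=10)
# Upper bound: every entry is indexed by twelve sets.
data_points = entries * 12
print(entries, data_points)
```

Under these assumptions, one node and ten model years give 1030 entries and at most 12,360 data points, which is consistent with the "approximately one thousand" and "more than twelve thousand" figures quoted above.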

**Figure 1.** Average number of data points per technology in the input-data set of a MESSAGE*ix* model in dependence of the number of modelled years and nodes.

#### *1.2. d2ix—Combining Benefits of Non-Binary and Binary Data Formats*

Currently, most models handle input-data using vast spreadsheets (e.g., MS Excel), csv (comma-separated value) or plain text-files to organise, pre-process and document the model-data. While on the one hand, binary ('higher') formats, such as spreadsheets, provide support with data-handling, (un)intentional changes made to the input files are not trackable and are difficult to retrace. On the other hand, non-binary ('lower') but trackable formats such as csv and text-files lack visual clarity and data-handling support. To ensure transparency in data-handling and reproducibility of model results, the modelling-platform (ixmp) supplies MESSAGE*ix* users with tools for (i) database communication for version-controlled data management, (ii) a Python/R interface for efficient input-data and results processing and (iii) a web-browser based tool for drag and drop results visualisation [8]. The newly developed 'data to MESSAGE*ix*' (*d2ix*) package adds to this functionality by providing the user with a visually comprehensible overview of the input-data by reducing the dimensions of the input-data set, thereby reducing the number of data points to be handled by the user (Figure 2).

**Figure 2.** Integration and interlinkages of *d2ix* to the *ixmp* modelling-platform (adapted from [8]).

This model input-data-handling approach, as such, is novel as it is the first to combine reduced form MS Excel spreadsheet data and lucidly change-trackable .yaml files for input-data documentation. By following the FAIR principles of scientific data-handling and analysis, *d2ix* makes data findable, accessible, interoperable and reusable, and thus facilitates collaborative working and can therefore support the energy-modelling community [9]. By enabling new users to quickly become acquainted with existing models, and by simplifying the generation of new scenarios, *d2ix* reduces the barriers to entry into energy- and climate-policy modelling. Furthermore, the synoptic organisation of the input-data set can reduce the risk of errors that tend to occur when organising big data sets (Figure 3). Such errors can have detrimental effects, such as the data and coding mistakes causing the infamous Reinhart-Rogoff spreadsheet error [10]. Lastly, the interface will be equipped with a unit test that can inspect the model for commodity 'dead ends' and overly restrictive bounds, a feature that can prevent infeasibilities, undesired exceedingly restricted scenarios and the misinterpretation of results. Overall, *d2ix* is a well-suited data-handling tool for large energy-system models such as MESSAGE*ix*. The easy change-trackable framework for transparent model input-data preparation is the first of its kind to be introduced as a standardised model-creation workflow.

**Figure 3.** Reduced spreadsheet input for technology specification for the tutorial.

#### **2. Related Work**

While, in the light of good scientific practice, model transparency and reproducibility have received wide academic attention, the focus has remained on how to deal with and how to publish raw-data and model code [11]. However, the important link between these two much noted components (the raw-data and the model), namely the input-data-handling, has so far not been dealt with scientifically [12,13]. In contrast, the major strategies of input-data-handling which established themselves as go-to solutions in energy-system modelling have never been subject to publication but have rather remained research-institution internal, customised, single-user solutions. Thus, most models now provide different data-handling strategies. Four major types can be identified among the most commonly used input-data-handling methods. They are:


appreciate the high flexibility of the input-data-handler in combination with the low requirements regarding the programming skills of the modeller.


Table 1 summarises the strengths and shortcomings of those four strategies and compares them to the newly developed *d2ix* workflow. It shows that by filling the gaps in documentation, standardisation and transparency, frameworks such as *d2ix* can help improve energy-system modelling by combining the strengths of binary and non-binary input-data storage and handling formats.



#### **3. Methodology**

The Python-package we have created, *d2ix*, supports the user in creating new MESSAGE*ix* models as well as adapting and analysing existing input-data sets and scenarios. The support consists of four main tasks. First, *d2ix* supports the user in organising the input-data for MESSAGE*ix*. For this task, we created an abstracted data model, summarising the reduced model input-data in two spreadsheet files. Second, *d2ix* functions as a standardised interface between the spreadsheets and the MESSAGE*ix* Python API. Third, *d2ix* documents the pre-processed model input-data in yaml text-files. This allows systematic and visual change-tracking of the spreadsheet-based scenario-data using automated change-tracking services such as Git. Lastly, several unit tests implemented in *d2ix* will allow an automated structured inspection of input-data sets to identify commodity 'dead ends' and overly restrictive constraints.

#### *3.1. Class Structure and Definition*

The *d2ix* package supports researchers who want to create a MESSAGE*ix* model, either from scratch or by modifying existing models (Figure 4). This support is supplied by the means of four different classes which handle the data input. In the following, the classes are described in their functionality and structure.

**Figure 4.** Class hierarchy diagram of the *d2ix* package.

#### 3.1.1. MessageInterface—Communication with the Ix Modelling-Platform

The MessageInterface class acts as the interface between *d2ix* and the ix modelling-platform (ixmp). To communicate with the ix modelling-platform, MessageInterface applies the MESSAGE*ix* classes ixmp.Platform and message\_ix.Scenario. While the Platform instance contains the connection to the database, the Scenario class predefines the format and indexation of the model in- and output-data (parameters, sets and variables) required for running the MESSAGE*ix* model. The database with which MessageInterface communicates is defined in the run-config file provided in the config folder (..\d2ix\config\run\_config.yaml.template). The unique identification of the established Scenario instance is defined by the user input (Section 4.2), as is the logger setting of the *d2ix* module.

#### 3.1.2. DBInterface—Data-Handling in *d2ix*

The DBInterface class enables data-handling in the *d2ix* package. The DBInterface class holds the model input-data in the form of a dictionary containing all model sets and parameters which can be accessed and modified before being transferred to the database via the MessageInterface class. The central tasks of this class are (i) to hand over the final input-data created in *d2ix* to the database, (ii) to write the final input-data into text-files for transparency and change-tracking, and (iii) to collect the model results from the database after a model run. Furthermore, the DBInterface class will check whether the units used in the input-data are already stored in the database and will add them if they are not.
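The data-holding role and the unit check described above can be illustrated with a minimal stand-in. The class `InputDataStore`, its methods, and the example rows are hypothetical and do not reproduce the actual DBInterface API; the "database" is reduced to a plain set of known units.

```python
class InputDataStore:
    """Hypothetical stand-in for a dictionary-based input-data holder."""

    def __init__(self, known_units):
        # Units assumed to be stored in the database already.
        self.known_units = set(known_units)
        self.data = {"sets": {}, "parameters": {}}

    def add_parameter(self, name, rows):
        """Store rows (dicts with a 'unit' key) under a parameter name."""
        self.data["parameters"].setdefault(name, []).extend(rows)

    def register_missing_units(self):
        """Add units used in the input-data that the database lacks."""
        used = {
            row["unit"]
            for rows in self.data["parameters"].values()
            for row in rows
        }
        missing = sorted(used - self.known_units)
        self.known_units.update(missing)
        return missing


# Illustrative rows only; names and values are not taken from a real model.
store = InputDataStore(known_units=["GWa", "USD/kWa"])
store.add_parameter(
    "inv_cost",
    [{"technology": "wind_ppl", "year": 700, "value": 1500, "unit": "USD/kWa"}],
)
store.add_parameter(
    "emission_factor",
    [{"technology": "coal_ppl", "year": 700, "value": 7.4, "unit": "tCO2/kWa"}],
)
added = store.register_missing_units()
print(added)
```

Registering units before the hand-over means that an unknown unit is added once rather than causing the transfer to fail.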

#### 3.1.3. Model—Data Transformation from Reduced Spreadsheet to Database Format

The Model class constitutes the core of the *d2ix* package. Its main task is the pre-processing of the input-data from the reduced *d2ix* spreadsheet format to the expanded final input-data format required by MESSAGE*ix*. Apart from creating all required sets and parameters, the Model class automatically adds one slack technology for each demand provided in the input-data set, in order to prevent the model from running into infeasibilities during calibration, and to simplify debugging. After each successful scenario-run in MESSAGE*ix*, the Model class reformats the results from database tables into time-series elements optimised for post-processing, applying the TimeSeries class from the ixmp package [8].
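The idea of expanding a reduced input and adding slack technologies can be sketched as follows. The row layout, the function names, and the `slack_` naming are assumptions for illustration; the actual *d2ix* spreadsheet format and Model implementation are not reproduced here.

```python
def expand_rows(reduced_rows, model_years):
    """Expand one-value-per-technology rows to one row per model year."""
    return [
        {
            "technology": row["technology"],
            "parameter": row["parameter"],
            "year": year,
            "value": row["value"],
        }
        for row in reduced_rows
        for year in model_years
    ]


def add_slack_technologies(technologies, demands):
    """Append one slack technology per demand commodity."""
    return technologies + [f"slack_{commodity}" for commodity in demands]


# Hypothetical reduced input: one constant value per technology.
reduced = [
    {"technology": "coal_ppl", "parameter": "var_cost", "value": 30},
    {"technology": "wind_ppl", "parameter": "var_cost", "value": 0},
]
expanded = expand_rows(reduced, model_years=[700, 710, 720])
technologies = add_slack_technologies(["coal_ppl", "wind_ppl"], demands=["light"])
print(len(expanded), technologies)
```

Two reduced rows become six expanded rows here, which shows in miniature why the reduced form is easier to review than the final input-data set.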

#### 3.1.4. ModifyModel—From Database to Spreadsheet and Back

The ModifyModel class is used to enable the analysis and modification of existing MESSAGE*ix* models, i.e., models readily available in the database. To do so, the ModifyModel class has two main functions: (a) ModifyModel allows users to choose a specific MESSAGE*ix* scenario-run, which is first collected from the database, then written to an Excel sheet and lastly made accessible to the user as a Python dictionary. The data can then be analysed and modified either in the spreadsheet or through scientific computing (e.g., Python). (b) The modified data can be returned to the database as a new scenario containing the changes applied by the user.

#### *3.2. Testing and User Experience*

In accordance with best-practice collaborative programming [21], we set up a Continuous Integration implementation, with CircleCI and Docker each executing several tasks. Additionally, two linters, i.e., static code analysis tools, are configured for basic code quality checks to ensure long-term code maintainability: the coding style is checked with Flake8, and static types are checked with MyPy. Furthermore, the API functionality is tested in a defined environment inside a Docker container using the *d2ix* tutorial and some basic examples.

We tested the functionality of *d2ix* together with various beta users. In a first step, the data transfer from the spreadsheet to the *ixmp* platform and the git-tracked text-files was evaluated, thus proving the data-model functionality of *d2ix*. In a second step, we created three models of different sizes, in order to analyse and improve the runtime performance. The model descriptions and runtime performance are documented in Table 2. Finally, we tested the tool's intuitiveness with users without programming skills. By having a user without programming experience recreate an existing MESSAGE*ix* model, we succeeded in proving the data-model functionality as well as the coherence of the API. As a test model to recreate, we used the standalone country model of South Africa, which is available under the GNU General Public License, Version 3 on GitHub (https://github.com/tum-ewk/message\_ix\_south\_africa) [22]. Two further MESSAGE*ix* country models are currently being developed for energy-research purposes.


**Table 2.** *d2ix* model-creation performance tests with different models. Calculations were performed on an Intel(R) Core(TM) i7 CPU with 3.2 GHz and 64 GB RAM.

\* average of 10 runs.

#### **4. Tutorial**

#### *4.1. Installation*

To start using the open source Python-package *d2ix*, you must ensure that your environment is equipped with the requirements as described in the README instructions found alongside the *d2ix* repository (https://github.com/tum-ewk/d2ix).

#### *4.2. Running d2ix—Creating a Model from Scratch*

The core functionality of the *d2ix* tool is to create a model from scratch. The bases for model creation are two reduced spreadsheets (Figure 3). In this example, we create a new MESSAGE*ix* scenario—in this case the replica of the 'Westeros' tutorial from the MESSAGE*ix* repository—using the *d2ix* MS Excel templates. The required parameters, configurations and files with the corresponding path are shown in Listing 1. The code creating the scenario is shown in Listing 2 and is explained below.

Furthermore, an introductory tutorial is provided in the *d2ix* repository under tutorial.ipynb.

Listing 1: Defining the *d2ix* model-creation parameters.

```
CONFIG = 'config/run_config.yaml'
BASE_XLS = 'input/modell_data_westeros.xlsx'
MANUAL_PARAMETER_XLS = 'input/manual_input_parameter_westeros.xlsx'
MODEL = 'MESSAGE_Westeros'
```
#### 4.2.1. Creating a Model Instance

The Model class provides the functionality to create a model from scratch. The class instance is specified by thirteen parameters which are described in Table 3. Furthermore, the code to create a new instance is provided in Listing 2.


**Table 3.** Parameters used for creating a model instance.

<sup>1</sup> ..\d2ix\config\run\_config.yaml.template; <sup>2</sup> ..\d2ix\input\model\_data.xlsx.

In the example shown in Listing 2, we create an instance of the dummy model 'MESSAGE Westeros', which comes as a tutorial in the *d2ix* repository. The run configurations required for scenario creation with *d2ix* as well as the model input-data paths and the model name are defined in Listing 1. The newly created instance is named 'baseline' and spans a time horizon from the year 690 to the year 720. The first model year is defined as the year 700. The resulting model-year vector is equal to [690, 700, 710, 720], wherein 690 is a historical year and is thus not considered in the optimisation.
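The way the year vector follows from the time-horizon parameters can be illustrated with a short sketch. It assumes that historical\_range\_year and model\_range\_year act as step sizes between consecutive years, which is consistent with the vector given above but is not confirmed here; the helper `build_year_vector` is hypothetical and is not part of the *d2ix* API.

```python
def build_year_vector(first_historical_year, first_model_year, last_model_year,
                      historical_range_year, model_range_year):
    """Return (historical_years, model_years) for the given time horizon.

    Hypothetical helper for illustration only; it is not part of d2ix.
    Assumption: the two *_range_year arguments are step sizes in years.
    """
    # Historical years run from the first historical year up to, but not
    # including, the first model year.
    historical_years = list(range(first_historical_year, first_model_year,
                                  historical_range_year))
    # Model years run from the first to the last model year, inclusive.
    model_years = list(range(first_model_year, last_model_year + 1,
                             model_range_year))
    return historical_years, model_years


# Values taken from Listing 2.
historical, optimised = build_year_vector(690, 700, 720, 10, 10)
print(historical + optimised)  # [690, 700, 710, 720]
print(historical)              # [690]
```

Only the years in the second list would enter the optimisation; the first list holds the historical year.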

By setting verbose to true, the log-level is set to debug mode which allows for more information to pass from the creation process to the user. Setting the yaml export parameter to true permits the creation of git-trackable yaml files of the input-data. It is recommended to only set it to false during calibration, as this shortens the model creation runtime, though it disables the git-trackability of the input-data set.

Listing 2: Creating a new MESSAGE*ix* scenario using the *d2ix* spreadsheet templates.

```
1 from d2ix import Model
2
3 # Create a Model instance from the data provided in base_ & manual_parameter_xls
4 d2ix_model = Model(run_config=CONFIG, base_xls=BASE_XLS,
5     manual_parameter_xls=MANUAL_PARAMETER_XLS, model=MODEL, scen='baseline',
6     historical_data=True, first_historical_year=690, first_model_year=700,
7     last_model_year=720, historical_range_year=10, model_range_year=10,
8     verbose=True, yaml_export=True)
9
10 # write data from 'model' dictionary to the database and solve
11 scenario = d2ix_model.model2db()
12 scenario.solve(model='MESSAGE')
13 d2ix_model.close_db()
```
#### 4.2.2. Transferring a Scenario from *d2ix* to the Database—*model2db()*

When the input-data is ready, it can be passed to the database using the *model2db* function, which returns an instance of the messageix.Scenario class (Listing 2, line 11).

#### 4.2.3. Solving a Scenario

Using the solve function (from the messageix.Scenario class), the database model is exported to a structured input-gdx file, which is passed on via a solve command to the mathematical model formulation of MESSAGE*ix*. After a successful model run, an output-gdx file is created containing all input and output data. The content of this file is automatically passed on to and stored in the database. Further details on the *solve()* function can be found in the MESSAGE*ix* documentation [23]. Sample results of the baseline scenario from the Westeros example are shown in Figure 5.

**Figure 5.** Power plant activity and capacity in the 'baseline' scenario (Listing 2).

#### 4.2.4. Modifying the Input-Data—*get\_parameter()*, *set\_parameter()*

After creating the class instance, model contains a dictionary of all parameters and sets of the expanded input-data, which can now be accessed (Listing 3, line 11), modified (line 12) and returned to the dictionary (line 13). In this scenario we introduce an emission tax using the *d2ix get\_parameter*, *set\_parameter* procedure. The comparison between Figures 5 and 6 visualises the change in results induced by the introduction of the tax.

Listing 3: Creating a MESSAGE*ix* scenario—with a carbon tax—using the *d2ix get\_parameter()* and

**Figure 6.** Power plant activity and capacity in the 'tax-emission' scenario (Listing 3).

#### *4.3. Running d2ix—Modifying Existing Models*

The *d2ix* package can also be used to modify existing models. The code required for retrieving, modifying and returning input-data sets to the database is shown in Listing 4, and is explained below.

#### 4.3.1. Creating a ModifyModel Instance

The ModifyModel class provides the functionality of collecting models from the database, writing them into a structured spreadsheet file for user modification and returning the modified model to the database. The parameters run\_config, model, scen and verbose were introduced in Section 4.2.1; the remaining parameters that specify the ModifyModel instance are described in Table 4.

**Table 4.** Parameters used for creating a modified model instance.


#### 4.3.2. From Database to ModifyModel Instance & Excel Sheet—*scen2xls()*

The *scen2xls()* function (Listing 4, line 8) searches the database for the scenario defined by the model and scenario name given in the mod\_model instance. If a scenario with the defined model and scenario name is available, all parameters and sets from the most recent (default) version of the scenario are written to the spreadsheet. If a version is specified, that version of the scenario is copied instead.

#### 4.3.3. From the Excel File to ModifyModel Instance—*xls2model()*

The *xls2model()* function (Listing 4, line 10) reads the spreadsheet file specified in the mod\_model instance and stores the data as a structured dictionary in the instance. The data is then available to modify, analyse and visualise using the Python functionality.

Listing 4: Modifying an existing MESSAGE*ix* model using spreadsheet inputs.

```
1 from d2ix import ModifyModel
2
3 # Create a ModifyModel instance
4 mod_model = ModifyModel(run_config=CONFIG, model=MODEL, scen=SCEN, xls_dir='xls_folder',
5     file_name='db_data.xlsx', verbose=False)
6
7 # Collect a scenario from the database and save it to an MS Excel file
8 mod_model.scen2xls(version=None)
9 # Read the scenario from the MS Excel file into the 'mod_model' instance
10 mod_model.xls2model()
11
12 # Write data from the 'mod_model' dictionary to the database and solve
13 scenario = mod_model.model2db()
14 scenario.solve(model='MESSAGE')
15 mod_model.close_db()
```
#### *4.4. Post-Processing a MESSAGEix Scenario*

The *ixmp* package supplies tools for standardised reporting of reference data and results. These tools are documented and described in [8] as well as in the online documentation [24].

#### **5. Conclusions**

In *d2ix*, we built a package that supports users in creating, modifying, and analysing MESSAGE*ix* scenarios. The main benefits of using *d2ix* for scenario creation are threefold. (i) The synoptic input-data supports the transparency and reproducibility of even large models and can thus reduce errors. It further encourages collaborative modelling attempts by making it easier to understand and review model parameters and assumptions implemented by other researchers. (ii) By reducing the dimensions of the input-data, the researchers can easily handle the data using two MS Excel sheets. Hence, *d2ix* reduces barriers to access by reducing input-data complexity and allowing scenario creation without programming knowledge. (iii) *d2ix* permits the combination of the benefits of 'higher' (easy and synoptic data-handling) and 'lower' (change-trackability) data formats. To put it succinctly: by providing a synoptic and easy input-data-handling workflow *d2ix* can support the efforts of the open data movement within the MESSAGE*ix* modeller community and can serve as an example for data-handling frameworks built for other model types.

However, simplification of the input-data does reduce the flexibility of the model; e.g., currently a maximum of two outputs is supplied for each technology. This limitation can be bypassed by either adapting the model parameter 'output' using the *get\_* and *set\_parameter* functionality, or by adapting the input spreadsheet and the underlying code to supply as many outputs as required. An expansion of *d2ix* towards increased flexibility could be the subject of future work; however, the decision on the specific balance between flexibility and simplicity requires practical experience, which still remains to be collected.

**Author Contributions:** T.Z. and C.L.O. have cooperated in the development of *d2ix*. While T.Z. is the creator of most of the code and software concept, C.L.O. developed the work flow and the overlying package concept, which are described in this paper. Both authors read and approved the final manuscript.

**Funding:** This work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

## **Towards an Automated, Fast and Interpretable Estimation Model of Heating Energy Demand: A Data-Driven Approach Exploiting Building Energy Certificates**

**Antonio Attanasio 1,†, Marco Savino Piscitelli 2,†, Silvia Chiusano 3,\*,†, Alfonso Capozzoli 2,† and Tania Cerquitelli 1,†**


Received: 22 February 2019; Accepted: 27 March 2019; Published: 2 April 2019

**Abstract:** Energy performance certification is an important tool for the assessment and improvement of energy efficiency in buildings. In this context, estimating building energy demand in a quick and reliable way, for different combinations of building features, is a key issue for architects and engineers who wish, for example, to benchmark the performance of a stock of buildings or optimise a refurbishment strategy. This paper proposes a methodology for (i) the *automatic estimation* of the building *Primary Energy Demand* for space heating (*PEDh*) and (ii) the *characterization of the relationship* between the *PEDh* value and the main building features reported by Energy Performance Certificates (EPCs). The proposed methodology relies on a two-layer approach and was developed on a database of almost 90,000 EPCs of flats in the Piedmont region of Italy. First, the *classification layer* estimates the segment of energy demand for a flat. Then, the *regression layer* estimates the *PEDh* value for the same flat. A different regression model is built for each segment of energy demand. Four different machine learning algorithms (Decision Tree, Support Vector Machine, Random Forest, Artificial Neural Network) are used and compared in both layers. Compared to the current state-of-the-art, this paper brings a contribution in the use of data mining techniques for the asset rating of building performance, introducing a novel approach based on the use of independent data-driven models. Such a configuration makes the methodology flexible and adaptable to different EPC datasets. Experimental results demonstrate that the proposed methodology can estimate the energy demand with reasonable errors, using a small set of building features. Moreover, the use of the Decision Tree algorithm enables a concise interpretation of the quantitative rules used for the estimation of the energy demand. The methodology can be useful during both the design and the refurbishment of buildings, to quickly estimate the expected building energy demand and set credible targets for improving performance.

**Keywords:** energy performance certificate; heating energy demand; buildings; data mining; classification; regression; decision tree; support vector machine; random forest; artificial neural network

#### **1. Introduction**

Energy efficiency is a growing policy priority for many countries around the world, for both economic and environmental reasons. In the 28 countries that are part of the International Energy Agency (IEA), buildings are responsible for about 21% of total final energy consumption (26% in Italy) [1]. The share of this energy used for heating and cooling systems is about 55% in the residential sector (74% in Italy) [1]. Regulatory bodies in several countries have taken action to reduce wasteful energy consumption and greenhouse gas emissions and to encourage the use of renewable sources and the design of energy-efficient buildings [2].

In most cases, the building energy performance rating has been indicated as a cornerstone to pursue the aforementioned aims. For instance, the *Energy Performance of Buildings Directive* (EPBD), issued by the European Commission, makes the evaluation of energy performance compulsory for new and existing buildings [2].

The EPBD provides member states with guidelines for the building *energy performance certification* process, which includes *energy performance rating* and *energy labeling*. The former is based on a scale of values referring to one or more significant parameters such as Energy Use Intensity (EUI) and Primary Energy Demand (PED), while the latter consists of the assignment of an energy performance class (or label) to the building, based on the energy performance rating value. The EPBD lets member states define the actual implementation of its directives. In Italy the EPBD is currently implemented by various national legislative decrees and technical standards, but there are different rating schemes developed in local areas (regions and autonomous provinces) [3].

Among the existing rating systems worldwide, the *Building Research Establishment's Environmental Assessment Method* (BREEAM), developed in the United Kingdom in 1990, is the first and leading assessment method. *Leadership in Energy and Environmental Design* (LEED), developed in the United States in 1998, is close to being the dominant building assessment system (implemented in more than 40 countries). Other well-known methods include the *Comprehensive Assessment System for Building Environmental Efficiency* (CASBEE) of Japan, the *National Australian Built Environment Rating System* (NABERS), the *Building Environmental Assessment Method of Hong Kong* (HK-BEAM), *Green Mark* of Singapore, *EcoProfile* of Norway, the *Deutsche Gesellschaft für Nachhaltiges Bauen* (DGNB) of Germany, and the *Green Building Label* (GBL) of China [4–6].

Interest in building energy performance assessment has increased in recent years, especially in estimating how different features affect building efficiency. Indeed, from a design perspective, it is very important to determine the effect of the building features on its future energy performance in the early design phase [7]. Similarly, for existing buildings, it could be useful to evaluate the suitability of a refurbishment plan [8,9]. Whatever the approach used, estimating building energy performance in a quick and reliable way, for different combinations of building features, is a key issue for different actors including public authorities [10]. In this context, the Energy Performance Certificate (EPC) provides a theoretical measure of how efficient a building could be if operated in standard conditions. However, the performance gap, i.e., the difference between estimated and actual energy performance, could be significant. For instance, it is stated in [11] that for the Swedish EPC dataset the assessed performance gap is about 20% for energy consumption assessments. An EPC is therefore not fully representative of the actual performance during operation, but it makes it possible to perform comparisons and benchmarking analyses between buildings.

In this paper we propose the *Heating Energy Demand Estimation for Building Asset Rating* (HEDEBAR) methodology, which provides the following features. (i) HEDEBAR allows the *automatic estimation* of the *Primary Energy Demand for space heating* (*PEDh*) reported by Energy Performance Certificates (EPCs) (calculated in "standard rating" conditions, according to EN ISO 13790 [12], UNI TS 11300-1 [13], and UNI TS 11300-2 [14]). (ii) Moreover, HEDEBAR makes it possible to unfold the criteria adopted during the asset rating of real buildings, through the extraction of the *principal building features* that contribute *to estimating the building energy demand*. The purpose is twofold: (i) *predictive*, as we define models for the robust energy rating of residential buildings, through the estimation of their *PEDh*; (ii) *descriptive*, as we provide an interpretation of the method used to issue EPCs, by highlighting the main features that determine the energy demand of buildings.

The HEDEBAR methodology uses data from EPCs to learn the criteria used by the rating system to issue them. It is based on the hypothesis that building features affect the energy demand in different ways for different classes of building energy efficiency. Therefore, a *two-layer approach* is defined to differentiate the analysis of buildings that belong to distinct *segments of energy demand* (i.e., distinct ranges of *PEDh* value) and to eventually increase the precision in predicting the *PEDh* value. In the first layer a *classification* problem is considered to estimate the segment of energy demand of the building to be analyzed. Then, in the second layer a *regression* problem is considered to estimate the *PEDh* value for the same building. We build a different regression model for each segment of energy demand. The proposed two-layer approach allows us to increase the prediction accuracy with respect to a single layer model, which disregards the possible segment of energy demand of the building.

As a case study, the HEDEBAR methodology has been validated on a dataset of real EPCs of almost 90,000 flats in the Piedmont region of Italy [15–17], released as open data by the Piedmont region. These data are available on a Web platform developed by *CSI Piemonte* (the Information System Consortium) and are regulated by the *Piedmont Region* authority (Sustainable Energy Development Sector). Experimental results obtained on such open data demonstrated that HEDEBAR allows estimating *PEDh* with a reasonable error by analyzing only a small set of 10 building features. The extracted, human-readable knowledge can be easily exploited by different stakeholders during the decision-making process; e.g., public authorities and regulatory bodies can plan future energy policies that leverage specific building features [18].

The proposed methodology can be useful for designers and building stakeholders to estimate *PEDh* and to set reference threshold values for physical input variables. Due to the large size of the adopted dataset, the information provided can be considered representative of the residential dwelling stock in Piedmont. Moreover, the proposed models are based on statistical variables that can easily be adapted to different datasets. The developed models can also be profitably used by local authorities for a preliminary and quick estimation of *PEDh* as a function of different values of a few influencing attributes, in order to perform benchmarking or energy-saving scenario analyses.

The paper is organized as follows: Section 2 analyses relevant works in the analysis of data from energy performance assessment; Section 3 describes the HEDEBAR methodology adopted to find a model for the characterization of heating energy demand; Section 4 shows the experimental results, which are then discussed in Section 5.

#### **2. Related Work**

Three main types of buildings *energy performance assessment* are commonly acknowledged [19]: *Energy benchmarking*, i.e., the comparison of Energy Performance Indicators (EPIs) of a building with a sample representative of similar buildings; *Energy rating*, i.e., the evaluation and classification of the building energy performance according to predefined criteria; and *Energy labeling*, i.e., the assignment of an energy performance class (or label) to the building, according to a scale of values defined for some relevant parameter (e.g., EUI, PED).

Energy rating can be implemented in the following ways: (i) *measured* (or *operational*) *rating*, based on real metering on-site [20] and (ii) *calculated rating*, based on ideal energy use. Measured rating is mostly used in the operation and maintenance phases of existing buildings [21]. Calculated rating is more suitable in the design phase of new buildings, in particular with the aid of Building Energy Simulation (BES) software like in the case of LEED and BREEAM rating systems [6,22]. Calculated rating is further divided into *asset rating* and *tailored rating*. While asset rating methods consider standard usage patterns and climatic conditions and can be shaped either to building designs or to existing buildings, tailored ratings consider actual conditions and usage patterns for the buildings under analysis.

Within the scientific context, several research activities have been carried out on building energy performance assessment, for: (i) predicting energy demand [7,10,23] and energy class [24], (ii) rating and benchmarking [25–28], (iii) identifying representative buildings for different classes of energy performance [29–31], (iv) characterizing the relationship between energy demand and relevant building features [32–34], and (v) improving existing methods, also using new models based on data mining algorithms such as regression models, decision trees, neural networks, and clustering [24,32,35–38].

Several works have proposed a benchmarking of different types of buildings. Dall'O' et al. [25] analyse a real data set of energy certificates to assess the energy performance, to detect anomalies in the registered certificates and to quantify the energy retrofit potential in existing buildings. Chung et al. [26] developed a benchmarking process for energy efficiency of commercial buildings by means of Multiple Regression Analysis (MRA). Gao and Malkawi [29] use clustering to classify buildings according to multiple features, such as physical properties, environmental conditions and occupancy. Lara et al. [30] adopt cluster analysis to identify a few samples representative of about 60 buildings, in order to optimize the energy retrofit measures. Hong et al. [27] use an approach based on case-based reasoning, MRA, ANN and GA, to produce a methodology for operational rating with higher explanatory power and higher prediction accuracy at the same time. A parallel research effort by Acquaviva et al. [39] has been devoted to efficiently computing inter- and intra-building performance indicators on fine-grained thermal energy consumption data for a large set of buildings located in a major Italian city. Tso and Yau [37] compared the accuracy of linear regression, ANN, and decision tree in predicting average weekly electricity consumption during both summer and winter in Hong Kong. Koo et al. [7] use the finite element method to estimate the heating and cooling energy demand of buildings, using data about building envelope design. In [10] a decision tree is used to model the real consumption of residential buildings in order to predict the energy use of newly designed buildings. Melo et al. [24] use ANN to improve the accuracy of surrogate models for labeling purposes, based on simulation results. Khayatian et al. [35] tackle the problem of uniformity of criteria among different certificates; to this end, they use ANNs to predict the heating energy demand and to validate a dataset of energy certificates.

The analysis of real data from EPC databases has been performed in various countries [11]. Fabbri et al. [40] discuss the effects of the EPBD and the Italian EPC system on the real estate market. The study presented in Hjortling et al. [41] provides an energy consumption baseline for buildings in Sweden, using data from 186k energy performance certificates issued for commercial buildings and based on energy bills rather than on theoretical calculations. The paper shows that real energy consumption is often higher than the one stipulated by the building code. The methodology presented in Xiao et al. [42] exploits a cluster analysis of the energy consumption (EUI excluding District Heating) of office buildings in China, to study its statistical distribution characteristics. It was found that the distribution of energy consumption has quite different characteristics from those in Japan and the US. Other analyses of EPCs aimed at defining the current energy consumption baseline of existing buildings in Greece and Spain are presented in Dascalaki et al. [43] and Gangolells et al. [44], respectively.

Compared to the current state-of-the-art, this paper brings a contribution in the use of data mining techniques for the asset rating of buildings, both in methodological and analytical terms. From the *methodological* perspective, the paper proposes a novel approach to characterize the heating energy demand of buildings using multiple independent models for different building segments. From the *analytical* perspective, the proposed approach estimates the heating energy demand with reasonable errors, using a small set of building features and generating interpretable models that provide useful information about the most relevant features affecting energy demand.

#### **3. Data Analysis Methodology**

The HEDEBAR (*Heating Energy Demand Estimation for Building Asset Rating*) methodology estimates the *heating energy demand* of residential flats as a function of a few influencing features available within Energy Performance Certificates (EPCs).

HEDEBAR considers different building features that affect the energy demand. It is based on the hypothesis that the impact of each feature over the energy demand varies for different *segments* of values of the same energy demand. Hence, a *two-layer approach* has been defined to model this aspect. The logical components of HEDEBAR are represented in Figure 1 and they are briefly described below.

**Figure 1.** The proposed HEDEBAR methodology for automatic asset rating.

*Data collection and preprocessing* includes all the preliminary tasks necessary to provide the proper data set to the algorithms that operate in the later phases. Specifically, the *Data collection* component takes data from the energy certificates and other contextual information. *Data preprocessing* includes removing records with errors and missing values; discarding features that are not useful for energy demand modeling; and enriching the resulting data set with contextual information not included in EPCs. These steps are described in more detail in Section 3.2.

The *Segment estimation* is the first phase of the two-layer approach. Different classification algorithms have been trained during this step, to learn a classification model that properly assigns flats to different predefined segments of energy demand, considering only the selected features.

The *Local energy demand prediction* is the second phase of the two-layer approach. It uses regression algorithms to learn a regression model for estimating the *heating energy demand* considering only the selected features. An independent regression model for each segment of the first layer has been trained and tested.
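The interplay of the two layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: a nearest-centroid rule and a one-variable least-squares line stand in for the Decision Tree, Support Vector Machine, Random Forest and Artificial Neural Network models compared in the paper, and the single feature, the segment boundary and all numerical values are made up.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one explanatory variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


class TwoLayerModel:
    """Layer 1 assigns a demand segment; layer 2 applies that segment's regression."""

    def __init__(self, bounds):
        # Upper bounds of the demand segments (hypothetical values).
        self.bounds = sorted(bounds)

    def _segment_of(self, y):
        # Index of the segment that contains the target value y.
        return sum(y > b for b in self.bounds)

    def fit(self, xs, ys):
        groups = {}
        for x, y in zip(xs, ys):
            groups.setdefault(self._segment_of(y), []).append((x, y))
        # Layer 1: nearest-centroid classifier from the feature to the segment.
        self.centroids = {s: mean(x for x, _ in pts) for s, pts in groups.items()}
        # Layer 2: one independent regression model per segment.
        self.lines = {s: fit_line([x for x, _ in pts], [y for _, y in pts])
                      for s, pts in groups.items()}
        return self

    def predict(self, x):
        segment = min(self.centroids, key=lambda s: abs(x - self.centroids[s]))
        a, b = self.lines[segment]
        return segment, a + b * x


# Made-up training data: one feature (e.g., an envelope transmittance) and a demand value.
features = [0.3, 0.4, 0.5, 1.0, 1.2, 1.4]
demand = [40, 50, 60, 150, 190, 230]
model = TwoLayerModel(bounds=[100.0]).fit(features, demand)
print(model.predict(0.45))  # segment 0, estimated with the low-demand regression
print(model.predict(1.3))   # segment 1, estimated with the high-demand regression
```

Because each segment has its own regression, the slope relating the feature to the demand can differ between segments, which is the hypothesis the two-layer approach is built on.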

During the two phases of the two-layer approach, the performance of each algorithm has been assessed in order to select the best one. When two or more algorithms have similar prediction performances, the one generating the most interpretable, i.e., human-readable model is preferred.
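The selection rule can be made concrete with a small sketch. The tolerance that defines "similar" performance, the interpretability ranking and the error values below are assumptions chosen for illustration; they are neither results nor settings from the study.

```python
def select_model(candidates, tolerance=0.10):
    """Pick a model name from (name, error, interpretability_rank) tuples.

    Lower error is better; a lower rank means a more human-readable model.
    `tolerance` is the relative error margin within which two models are
    treated as performing similarly (hypothetical value).
    """
    best_error = min(error for _, error, _ in candidates)
    # Keep every candidate whose error is within the tolerance of the best one.
    shortlist = [c for c in candidates if c[1] <= best_error * (1 + tolerance)]
    # Among near-equivalent candidates, prefer the most interpretable one,
    # using the error only to break remaining ties.
    return min(shortlist, key=lambda c: (c[2], c[1]))[0]


# Made-up error values and ranks, for illustration only.
candidates = [
    ("decision_tree", 0.21, 1),
    ("random_forest", 0.20, 3),
    ("support_vector_machine", 0.26, 4),
    ("neural_network", 0.205, 5),
]
print(select_model(candidates))                 # decision_tree
print(select_model(candidates, tolerance=0.0))  # random_forest
```

With a zero tolerance the rule reduces to picking the lowest error, so the tolerance is the parameter that trades accuracy against readability.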

The two-layer approach provides a twofold output: the *classification and regression models* for the analyzed flats, useful to understand the features with the highest explanatory power with respect to the energy demand and to highlight the differences among the segments; the *heating energy demand prediction* for new flats.

#### *3.1. Flat Characterization*

The EPC includes the different features of a building affecting its energy performance as well as the variables used to quantify its energy demand. The feature selection process has been driven by previous experience with EPC datasets analysed by the authors [15,16], with the aim of using few input variables that are also easy to collect. The following four main categories of input variables were identified for the purpose of the analysis: (i) *geometry*, (ii) *envelope*, (iii) *time*, and (iv) *system*. The categories are briefly described below, while Table 1 reports the relevant features for each of them.

*Geometry.* The variables in this category describe the different geometric features of the flat, which have an impact on its energy performance. The category includes variables such as average ceiling height, heat transfer surface and heated gross volume of the flat.

*Envelope.* The features in this category are related to the physical properties of the building (i.e., the thermal transmittance values of the opaque and transparent building envelope). The dynamic characteristics of the building envelope are also considered in this category, through the variable *qenv*. This variable is expressed as an ordinal attribute that ranges from 1 to 5. The five quality classes are related to specific numerical ranges of time lag and decrement factor that can be extracted from a table provided in DM 26/6/2009 [45].

*Time.* This category includes time variables such as the building construction year.

*System.* This category includes features related to the heating system (i.e., the average system global efficiency for space heating). The average global efficiency of the heating system is calculated on the basis of the standard values of efficiency for each sub-system (generation, distribution, control, emission) according to UNI TS 11300-2 [14].

Among all the variables considered in this study, the *Primary Energy Demand for space heating PEDh* has been selected as the *target variable* of the analysis. *PEDh* (expressed in kWh/m²y) is an energy-related variable defined for benchmarking purposes. It is an estimation of the amount of real energy consumption of a flat in standard use conditions and it contributes to assigning an energy class label to the flat. The *PEDh* value is estimated starting from the remaining *explanatory variables* included in Table 1 and can be used to compare different flats. In particular, similar pools of input variables proved to be robust enough for modeling the building energy demand effectively [15,16]. The *PEDh* value refers to the period of a heating season and it is normalized by the flat floor area. *PEDh* contributes to the evaluation of the overall Primary Energy Demand of flats (*PED*) together with the Primary Energy Demand for domestic hot water (*PEDw*). The heating energy demand is evaluated considering a building energy balance. The modelling of the building geometry considers real shapes and self-shading or overshading by other buildings. The quasi steady-state calculation method is based on the monthly balance of heat losses (transmission and ventilation) and heat gains (solar and internal) evaluated in monthly average conditions. Transmission heat losses are estimated taking into account opaque and transparent surfaces as well as the thermal bridging effect. In "Standard Rating", parametric values depending on floor area or heated net volume are taken into account when evaluating the ventilation rate and internal heat gains. The dynamic effects on the net heating energy demand are taken into account by introducing the dynamic parameters, i.e., a utilization factor and an adjustment of the set-point temperature for intermittent heating/cooling or set-back. These parameters depend on the thermal inertia of the building, on the ratio of heat gains to heat losses and on the occupancy/system management schedules. The annual PED for space heating is calculated from the net energy demand through different system efficiencies (emission, control, distribution, generation) considering the thermal losses in the various sub-systems. For the heating season, the average system efficiency is defined as the ratio between the annual net energy and the annual PED for heating. The PED also includes the electrical energy demand of auxiliary systems.


**Table 1.** List of features selected to characterize and estimate the heating energy demand with the HEDEBAR asset rating methodology.

#### *3.2. Data Preprocessing*

The whole raw data set gathered from EPCs usually includes many building features, represented through variables of different data types such as numeric (integer or real), nominal, textual, and boolean. However, some features may not be relevant for the subsequent data analysis, and their inclusion in the feature set would increase the complexity of the generated models. Most of the discarded variables are either poorly related to *PEDh* (e.g., textual descriptions, address of the flat) or have a high explanatory potential but cannot easily be assessed without running a simulation in advance (e.g., heat losses for transmission, ventilation and infiltration). Moreover, data sets derived from energy certificates filled in by auditors may contain imputation errors, which can badly affect the quality of the extracted knowledge.

To address the above issues and to improve both accuracy and usefulness of the data analytics phase, HEDEBAR includes a preprocessing step. This step aims to (i) *clean* the original data collection to remove outliers and errors in data and (ii) *enrich* data with additional *contextual information* to cope with external environmental conditions that could differently affect the estimation of the *PEDh* value for each flat. These steps are better described below.

**Data cleaning**. The whole data set is first inspected based on the advice of domain experts to remove the less relevant features. In addition, a data cleaning analysis was performed on the selected input variables. The data cleaning phase is crucial to ensure the robustness of the analysis. In fact, EPC data sets can be characterized by low quality (in terms of attribute inconsistencies) [11]. However, domain expertise in the energy and buildings field can prevent or at least limit inconsistency issues. According to [11], the consistency checks considered in this study are:

(i) Constraint rules for columns (e.g., area or volume cannot be negative); (ii) Domain expert analysis of values of the attributes (e.g., physical thresholds of system efficiency or thermal transmittance); (iii) Statistical checks (e.g., outlier detection through box plots).

**Data enrichment**. Data collected from the energy certificates are enriched with additional contextual information acquired from external data sources. To cope with external environmental conditions that could differently affect the estimation of the *PEDh* value for each flat, *PEDh* has been recalculated according to a reference standard climatic condition. In particular, all the EPCs issued in the Piedmont region are evaluated for both the standard climatic conditions of the actual city (in which the building is located) and those of Turin. The *PEDh* considered as target variable in this study is then expressed for all flats as if they were located in Turin, considering the same standard monthly outdoor temperature and solar radiation. Therefore, comparisons among flats can be made regardless of their actual location. However, if it is necessary to assess the performance of a flat in a city different from Turin, a data scaling based on standard Degree Days (DD) can be considered a valuable procedure. Specifically, the estimated *PEDh* can be scaled by multiplying it by the ratio between the standard DD value of the city where the flat is located and that of Turin.
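The degree-day scaling described above can be sketched in a few lines of Python. This is only an illustration: the function name and the degree-day figures used in the example are assumptions chosen for readability, not values taken from the study.

```python
def scale_ped_by_degree_days(ped_turin: float, dd_city: float, dd_turin: float) -> float:
    """Rescale a PEDh value expressed for Turin to another city.

    The scaling factor is the ratio between the standard degree days of the
    target city and those of Turin.
    """
    if dd_turin <= 0 or dd_city < 0:
        raise ValueError("degree days must be positive for Turin and non-negative for the city")
    return ped_turin * dd_city / dd_turin


# Hypothetical inputs, for illustration only (not real climatic data).
ped_city = scale_ped_by_degree_days(ped_turin=120.0, dd_city=3000.0, dd_turin=2500.0)
print(ped_city)  # 144.0 kWh/m2y for these illustrative inputs
```

A colder city (more degree days than Turin) therefore gets a proportionally higher estimate, and a milder one a lower estimate.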

#### *3.3. Two-Layer Approach for the Estimation of Heating Energy Demand*

The HEDEBAR methodology makes use of the features from energy certificates as explanatory variables to predict the *PEDh* value of a flat.

The impact of each feature on the *PEDh* value can vary over different classes of energy efficiency. To cope with this aspect, distinct ranges of *PEDh* value, called *segments of energy demand* or simply *segments*, can be defined to partition the data set into groups of flats with more uniform energy efficiency. This segmentation allows HEDEBAR to analyze independently the different classes of flat energy demand (e.g., low, medium, and high).

The estimation of the *PEDh* value is structured in HEDEBAR as a *two-layer approach*, including two phases named *Segment estimation* and *Local energy demand prediction*. The two phases are applied in sequence to accurately predict the *PEDh* value of a flat.


Thus, in HEDEBAR a new flat (with unknown energy demand *PEDh*) is first classified into a segment of energy demand through the *Segment estimation* phase. Then, the *PEDh* value of the flat is estimated through the *Local energy demand prediction* phase, using the regression model assigned to that segment.
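The control flow of the two layers can be illustrated with a minimal Python sketch. The classifier and the per-segment regressors below are toy stand-ins with invented thresholds and coefficients; they are not the trained HEDEBAR models, and the function names are hypothetical.

```python
from typing import Callable, Dict, Mapping, Tuple

Flat = Mapping[str, float]


def predict_ped_two_layer(
    flat: Flat,
    classify_segment: Callable[[Flat], str],
    regressors: Dict[str, Callable[[Flat], float]],
) -> Tuple[str, float]:
    """Layer 1 picks the segment; layer 2 applies that segment's regressor."""
    segment = classify_segment(flat)
    return segment, regressors[segment](flat)


# Toy stand-ins, NOT the trained models: thresholds and coefficients
# are invented purely to make the example runnable.
def toy_classifier(flat: Flat) -> str:
    if flat["Uo"] < 0.4:
        return "s1"
    return "s2" if flat["Uo"] < 0.8 else "s3"


toy_regressors: Dict[str, Callable[[Flat], float]] = {
    "s1": lambda f: 40.0 + 50.0 * f["Uo"],
    "s2": lambda f: 100.0 + 150.0 * f["Uo"],
    "s3": lambda f: 250.0 + 200.0 * f["Uo"],
}

segment, ped_h = predict_ped_two_layer({"Uo": 0.3}, toy_classifier, toy_regressors)
print(segment, ped_h)
```

Keeping the regressors in a dictionary keyed by segment label makes it explicit that a misclassification in the first layer selects a different model in the second.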

To generate the classification and regression models used in the two phases, the HEDEBAR system can easily integrate most classification and regression algorithms currently available in the literature. To select the most appropriate algorithms, two complementary aspects were considered: (i) the ability of the algorithm to *accurately predict* the *segment of energy demand* and the *PEDh* value for a flat, and (ii) the *interpretability of the model* it generates. Based on these criteria, we selected four reference algorithms to be evaluated for integration in the two phases of HEDEBAR: *Artificial Neural Network* (ANN), *Support Vector Machine* (SVM) [46], *Reduced Error Pruning Tree* (REPT), and *Random Forest* (RF). ANN and SVM methods have provided good performance for both classification and regression tasks in several applications. However, these methods generate non-interpretable models and are usually characterized by a high computational cost for building the model. REPT and RF methods perform well too, but with overall lower computational costs. Moreover, the REPT algorithm generates an interpretable model, which allows a better understanding of the relationship between the features and the energy demand. Finally, all four algorithms have a good degree of robustness to outliers and missing values in the data set, even if in HEDEBAR these issues are handled in advance in the data preprocessing phase. The open source Rapid Miner v5.3.0 toolkit [47] and the statistical software R [48] have been used for the development of the classification and regression algorithms. The following paragraphs provide an overview of the main characteristics of the four algorithms.

**Artificial Neural Network (ANN).** Inspired by the structure and behavior of biological neural networks, *Artificial Neural Networks* (ANNs) are often used to model complex relationships between input and output variables or to find patterns in data. An ANN consists of an interconnected group of nodes (neurons), organized in different layers, which receive inputs from other nodes and return as output a value computed as a function of suitably weighted inputs. A very popular type of ANN is the *feed-forward* neural network, where information moves through neurons only in forward direction, from the input to the output nodes.

The training of an ANN is usually performed through the *back-propagation* algorithm: the final outputs are compared with the correct values of training samples to compute the value of a predefined error function. The error is then fed back through the network to adjust the weights of each connection in order to reduce the value of the error function. After repeating this process for a sufficiently large number of training cycles, the network usually converges to some state where the error of the calculations is small [49].
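As a rough illustration of the forward pass, error computation and weight update described above, the following pure-Python sketch trains a very small feed-forward network with one hidden layer. The network size, learning rate, activation function (tanh) and toy data are all assumptions made for the example; they do not reflect the configuration used in the study.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_tiny_network(
    samples: Sequence[Tuple[float, float]],
    hidden: int = 3,
    epochs: int = 200,
    lr: float = 0.05,
    seed: int = 0,
) -> List[float]:
    """Train a 1-input, one-hidden-layer network; return the mean loss per epoch."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0
    n = len(samples)
    losses = []
    for _ in range(epochs):
        g_w1, g_b1, g_w2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x, target in samples:
            # Forward pass: tanh hidden units, linear output unit.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            y = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = y - target
            loss += 0.5 * err * err
            # Backward pass: propagate the output error to every weight.
            g_b2 += err
            for j in range(hidden):
                g_w2[j] += err * h[j]
                delta = err * w2[j] * (1.0 - h[j] * h[j])  # tanh derivative
                g_w1[j] += delta * x
                g_b1[j] += delta
        losses.append(loss / n)
        # Gradient-descent update with the averaged gradients.
        for j in range(hidden):
            w1[j] -= lr * g_w1[j] / n
            b1[j] -= lr * g_b1[j] / n
            w2[j] -= lr * g_w2[j] / n
        b2 -= lr * g_b2 / n
    return losses


# Toy data: learn y = 2x on five points in [0, 1].
samples = [(x / 4.0, 2.0 * x / 4.0) for x in range(5)]
losses = train_tiny_network(samples)
print(losses[0], losses[-1])
```

The recorded loss should shrink over the training cycles, which is the convergence behaviour the paragraph refers to.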

**Support Vector Machine (SVM).** Based on the work of Vladimir Vapnik in statistical learning theory [50], *Support Vector Machines* (SVMs) are a set of supervised learning methods which can be used for classification or regression. An SVM model represents data samples as points in space, separated by a set of hyperplanes, so that the samples of the different categories are divided by a clear gap that is as wide as possible. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (*functional margin*), since, in general, the larger the margin, the lower the generalization error of the classifier. When the samples are not linearly separable, *soft-margin* SVMs allow for classification errors during training, to produce a more general model for new data [51].

SVMs map samples into a higher-dimensional space, where presumably the separation is easier. However, the computational and storage requirements of SVMs increase rapidly with the number of training vectors and with the space dimension. To keep the computational load reasonable, SVMs use a kernel function K(x,y) that expresses dot products in the higher-dimensional space in terms of the variables in the original space. The kernel function can be of different types, such as linear, polynomial, or sigmoid [49].
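The role of the kernel function can be made concrete with a small Python sketch. It defines the three kernel types mentioned above and checks, for a degree-2 polynomial kernel on two-dimensional inputs, that the kernel value equals the dot product of an explicit feature map. The parameter names (`degree`, `coef0`, `gamma`) and the sample vectors are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    return dot(x, y)


def polynomial_kernel(x: Sequence[float], y: Sequence[float],
                      degree: int = 2, coef0: float = 0.0) -> float:
    return (dot(x, y) + coef0) ** degree


def sigmoid_kernel(x: Sequence[float], y: Sequence[float],
                   gamma: float = 1.0, coef0: float = 0.0) -> float:
    return math.tanh(gamma * dot(x, y) + coef0)


def phi_quadratic(x: Tuple[float, float]) -> Tuple[float, float, float]:
    """Explicit feature map whose dot product equals (x . y)^2 in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
k_implicit = polynomial_kernel(x, y)                   # computed in the original 2-D space
k_explicit = dot(phi_quadratic(x), phi_quadratic(y))   # computed in the mapped 3-D space
print(k_implicit, k_explicit)
```

Both values agree (up to floating-point rounding), but the kernel never builds the higher-dimensional vectors, which is what keeps the computation tractable.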

**Reduced Error Pruning Tree (REPT).** *Reduced Error Pruning Tree* (REPT) [52] is a fast decision tree learning algorithm that builds classification or regression trees using information gain or variance reduction as splitting criterion. More specifically, it generates multiple trees and picks the best one, which is taken as the representative. REPT uses *reduced error pruning* with a *back fitting* method to prune the tree. At each iteration, a validation subset is used to estimate the Mean Square Error (MSE) of the predictions made by the tree. Starting at the leaves, each node is replaced with its most popular class, and if the prediction accuracy is not affected the change is kept.

Optimized for speed, REPT sorts the values of numeric attributes only once, at the beginning of the model preparation. Reduced error pruning has the advantage of simplicity and speed. Moreover, the representation of the data in the form of a tree has the advantage, compared with other approaches, of being meaningful and easy to interpret.

**Random Forest (RF).** *Random Forest* is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [53]. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large. RF is based on *bagging*, a technique for reducing the variance of an estimated prediction function. Indeed, RF fits a number of decision tree classifiers on various sub-samples of the data set (and also on various subsets of features) and uses averaging to improve the predictive accuracy and to control over-fitting. The resulting model is a voting model of all the random trees in the forest.
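The two ingredients of bagging mentioned above, bootstrap sub-sampling and aggregation of the individual tree outputs, can be sketched as follows. The sketch covers only these two steps, not tree induction: the three "trees" are hand-written threshold rules with invented values, used as stand-ins for trained trees.

```python
import random
from collections import Counter
from typing import Callable, List, Mapping, Sequence, TypeVar

T = TypeVar("T")
Flat = Mapping[str, float]


def bootstrap_sample(rows: Sequence[T], rng: random.Random) -> List[T]:
    """Draw len(rows) items with replacement (the sub-sample a tree is fitted on)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent class label among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees: Sequence[Callable[[Flat], str]], flat: Flat) -> str:
    return majority_vote([tree(flat) for tree in trees])


# Hypothetical stand-ins for trained trees; thresholds are illustrative only.
toy_trees = [
    lambda f: "s1" if f["Uo"] < 0.40 else "s2",
    lambda f: "s1" if f["Uo"] < 0.50 else "s2",
    lambda f: "s1" if f["yc"] >= 2007 else "s2",
]

vote = forest_predict(toy_trees, {"Uo": 0.45, "yc": 2010})
sample = bootstrap_sample(list(range(10)), random.Random(0))
print(vote, sample)
```

For regression the same structure applies, with the vote replaced by the mean of the tree outputs.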

#### **4. Case Study**

In this section we validate the effectiveness and the usability of the proposed HEDEBAR methodology focusing on the following aspects: (i) the ability to correctly estimate the segment of energy demand for each flat, and (ii) the ability to accurately predict the *PEDh* value for each flat. The experimental analysis also addresses (iii) the selection of the classification and regression algorithms integrated in the two layers of the system, (iv) the comparison with a single layer approach in terms of prediction error and overall execution time, (v) the impact of the system configuration parameters, and (vi) the explanation of the main variables that determine the membership of flats into segments and their *PEDh* values.

We experimentally evaluated HEDEBAR on a real data collection of EPCs issued in 2013 for buildings located in the Piedmont region, North West of Italy. The data set includes approximately 90,000 energy certificates of flats located across the 8 provinces of the Piedmont region.

#### *4.1. Characterization of Flat Segments*

As explained in the methodology section, the data set has been partitioned into different segments according to the values of variable *PEDh* with the aim of grouping together flats with similar energy efficiency.

Specifically, three reference segments have been considered, representing respectively *low energy demand flats* (segment *s*1), *high energy demand flats* (*s*2), and *very high energy demand flats* (*s*3). The data set has been split into segments considering also the reference value range of *PEDh* specified in [15,16]. Segment *s*1 includes flats with *PEDh* values between 0 and 100 kWh/m²y, while flats in segment *s*2 have 100 kWh/m²y ≤ *PEDh* ≤ 300 kWh/m²y, and flats in segment *s*3 have *PEDh* ≥ 300 kWh/m²y.
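The segmentation can be expressed as a simple threshold function. Since the stated ranges share their endpoints (100 and 300 kWh/m²y), the sketch below assumes half-open intervals, assigning a boundary value to the higher segment; this tie-breaking choice is an assumption, not something specified in the text.

```python
def assign_segment(ped_h: float) -> str:
    """Map a PEDh value (kWh/m2y) to an energy demand segment.

    Assumption: boundaries are treated as half-open intervals,
    [0, 100) -> s1, [100, 300) -> s2, [300, inf) -> s3.
    """
    if ped_h < 0:
        raise ValueError("PEDh cannot be negative")
    if ped_h < 100:
        return "s1"
    if ped_h < 300:
        return "s2"
    return "s3"


print([assign_segment(v) for v in (50.0, 100.0, 299.9, 300.0)])
```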

The three segments have the following cardinalities. The largest segment is *s*2 with 39,003 flats, followed by *s*1 with 25,930 flats and *s*3 with 21,176 flats.

The data set has been split into three segments to identify representative groups of energy performance certificates describing flats with similar performance. Specifically, the first group represents flats with low energy demand, the second includes flats with medium-high energy demand, and the last one includes flats with very high energy demand. Three segments also guarantee a significant number of flats in each group together with a variable distribution for each feature under analysis. A number of segments higher than three would lead to very small groups of energy performance certificates with limited data variability for each variable. In this case the estimation model for a segment would not generalize (i.e., it would overfit the data). A smaller number of segments would lead to complex estimation models of heating energy demand. In this case the derived models could not be easily understood and quickly exploited by a domain expert.

Box plots in Figure 2 show the distribution of some relevant variables (i.e., average U-values of the vertical opaque and transparent envelope, average global efficiency for space heating, construction year, and aspect ratio) separately for each segment under analysis. In general, all segments present a good variability range for each variable under analysis. Specifically, segment *s*1 includes a set of residential flats characterized by a low energy demand. In fact, flats in this group are characterized by the lowest values of *Uw* (median 2.11, IQR [1.75, 2.76]), *Uo* (median 0.45, IQR [0.33, 0.67]) and *R* (median 0.6, IQR [0.4, 0.7]), and the highest values of *ηh* (median 0.81, IQR [0.73, 0.87]) and *yc* (median 2004, IQR [1970, 2009]). On the other hand, segment *s*3 includes flats characterized by a very high energy demand, represented by the highest values of *Uw* (median 3.66, IQR [2.80, 4.62]), *Uo* (median 0.98, IQR [0.83, 1.04]) and *R* (median 0.9, IQR [0.7, 1.0]), and the lowest values of *ηh* (median 0.68, IQR [0.60, 0.73]) and of *yc* (median 1962, IQR [1940, 1973]). Finally, segment *s*2 is characterized by median values and IQRs of the five variables that lie between those of the two previous segments.

**Figure 2.** Box plots of the values of 5 input variables evaluated for each of the three different segments of energy demand.

Figure 3 shows the distribution of the certificates across the 8 Piedmont provinces, separately for each segment. The three charts are quite similar to each other, demonstrating that the geographical distribution is very similar across the three segments.

**Figure 3.** Distribution of the buildings across the 8 provinces of the Piedmont region for each of the three different segments of energy demand.

#### *4.2. Segment Estimation*

The classification task aims at assigning each new flat to the correct segment of energy demand. The classes of the classification task are the three segments presented in Section 4.1, identified by the nominal labels *s*1, *s*2, and *s*3. All four classification algorithms integrated in HEDEBAR (i.e., ANN, REPT, RF and SVM) have been experimentally evaluated for the classification of flats. The algorithm providing the classification model with the best classification performance has been selected as reference for this phase.

To validate the results of the classification process, four established performance measures [54] have been considered. The overall quality of the classification model is evaluated in terms of *accuracy*. This measure is the fraction of flats correctly assigned to their corresponding segment. However, the unbalanced distribution of flats in the three segments could lead to a biased value of accuracy, as it could be mostly influenced by bigger segments. Therefore, other measures have also been used for a more accurate evaluation of the classification model. Per-class classifier predictions were evaluated according to *precision*, *recall*, and *F1-measure*. *Precision*(*si*) indicates the percentage of flats assigned to segment *si* that actually belong to *si*. *Recall*(*si*) indicates the number of flats correctly assigned to segment *si* with respect to the total number of flats actually in *si*. The *F1-measure*(*si*), which is computed as the harmonic mean of *Precision*(*si*) and *Recall*(*si*), quantitatively estimates the balance between *Recall*(*si*) and *Precision*(*si*). In the experimental evaluation, we computed the precision, recall, and F1-measure values for each class label corresponding to each of the three segments.
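The four measures can be computed directly from the lists of actual and predicted segment labels, as in the following sketch. The function name and the six-flat toy data are illustrative assumptions; they are unrelated to the results reported in Table 2.

```python
from typing import Dict, Sequence, Tuple


def classification_report(
    actual: Sequence[str], predicted: Sequence[str], labels: Sequence[str]
) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Return overall accuracy and per-segment precision, recall and F1."""
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    report: Dict[str, Dict[str, float]] = {}
    for label in labels:
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        n_predicted = sum(p == label for p in predicted)  # flats assigned to the segment
        n_actual = sum(a == label for a in actual)        # flats truly in the segment
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_actual if n_actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report


# Toy labels for six flats, for illustration only.
actual = ["s1", "s1", "s2", "s2", "s2", "s3"]
predicted = ["s1", "s2", "s2", "s2", "s3", "s3"]
accuracy, report = classification_report(actual, predicted, ["s1", "s2", "s3"])
print(accuracy, report)
```

In this toy case segment s1 has perfect precision but only half of its flats are recovered, which is exactly the kind of imbalance that accuracy alone would hide.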

A good trade-off between recall and precision is needed to properly predict the *PEDh* values for a new flat. On the one hand, high precision values on most (ideally all) segments are crucial to foster an accurate prediction of the *PEDh* values in the subsequent regression task. Indeed, the correct classification of a flat into the corresponding segment facilitates the subsequent prediction of the *PEDh* value for the flat, since this prediction is performed through a model trained using data of flats with similar energy performance. A low *Precision*(*si*) value indicates that many flats were mistakenly classified into segment *si*. This would result in erroneous predictions of *PEDh* values in the second step. On the other hand, achieving high recall values on most segments is desirable as well. A low *Recall*(*si*) indicates that few flats of segment *si* are correctly classified into *si*, while the others have been wrongly assigned to a segment other than *si*. This wrong assignment would result in erroneous predictions of *PEDh* values due to the selection of a less appropriate prediction model in the second step.

Table 2 reports the results achieved by the four classification algorithms integrated into HEDEBAR. It shows the accuracy on the overall data set as well as precision, recall, and F1-measure for the three segments.


**Table 2.** Overall classification accuracy and precision, recall and F1-measure for each segment of ANN, REPT, RF and SVM algorithms.

The RF classifier provides the highest accuracy value (85.67%), followed by REPT (82.03%), ANN (67.51%) and SVM (67.24%). Moreover, RF also achieves the best F1-measure on all segments (88.87%, 84.05%, and 82.76% in segments *s*1, *s*2 and *s*3 respectively). In more detail, RF obtains the highest precision value for all segments (90%, 82.65%, and 83.58% for segments *s*1, *s*2, and *s*3 respectively). RF also provides the highest recall values for two segments (87.27% and 85.49% for segments *s*1 and *s*2 respectively), while the recall obtained on segment *s*3 (81.96%) is very close to the value provided by the REPT algorithm (82.53%), which is the highest recall value over the four algorithms. Since the RF classifier achieves the highest values for almost all performance parameters, we chose it as the reference algorithm for creating the model which classifies a new flat into the corresponding segment.

REPT is the second best algorithm for almost all performance parameters, providing accuracy, precision and recall values lower than those of RF, but still more than acceptable. An additional key point of REPT is the fact that this algorithm builds an interpretable classification model. This model is a decision tree from which human-readable classification rules can be extracted. Thus, domain experts can use the model not only to automatically classify a flat into the corresponding segment but also to analyze the most relevant properties that characterize each segment as well as to understand why a flat has been classified into a segment (see Section 4.5.1).

The SVM and ANN algorithms provide the worst values for all performance parameters, which are significantly lower than those obtained with RF and REPT algorithms.

Therefore, according to the experimental evaluation, we decided to include two different classification models in the *Segment estimation* layer of the HEDEBAR framework. The RF classifier is used to automatically label a new flat with the corresponding segment. Based on the assigned segment, the proper regression model is selected in the subsequent layer (*Local energy demand prediction*) to predict the *PEDh* value for the flat. Instead, the REPT model is used to provide domain experts with a qualitative analysis of the impact of variables characterizing flats on the primary heating energy demand. This aspect is further discussed in Section 4.5.1.

#### *4.3. Local Energy Demand Prediction (PEDh)*

The regression task aims at estimating the value of *PEDh* for a flat. In HEDEBAR a different regression model for *PEDh* prediction is created for each of the three segments *s*1, *s*2, and *s*3. The ANN, REPT, RF and SVM algorithms have been experimentally evaluated for the creation of the regression model for each segment.

Table 3 displays the mean prediction errors of the four algorithms in predicting *PEDh* for each segment as well as the mean errors averaged over the three segments. The *prediction error* is the difference between the real value and the predicted value of *PEDh*. Three different measures of prediction error, among those commonly used in literature, have been calculated: (i) *Mean Absolute Error* (MAE) is the mean of all the absolute values of the errors obtained with the test samples; (ii) *Mean Absolute Percentage Error* (MAPE) expresses the mean absolute error in percentage terms; (iii) *Root Mean Square Error* (RMSE) is the square root of the mean of the square of all the errors obtained with the test samples. While MAE refers only to the mean value of the distribution of absolute errors, RMSE is affected also by the standard deviation of such distribution. Compared to MAE, RMSE amplifies and severely punishes large errors.
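The three error measures can be written down in a few lines of Python. The sample values are invented for illustration and are unrelated to Table 3; the sketch also assumes that no actual value is zero, since MAPE is undefined in that case.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative PEDh values in kWh/m2y (not data from the study).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

Here the absolute errors are 10, 20 and 0, so MAE is 10 while RMSE is about 12.9: the single larger error weighs more heavily in RMSE, as noted above.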


**Table 3.** Errors in predicting *PEDh* for ANN, REPT, RF, and SVM algorithms and for each flat segment.

The REPT algorithm produces the overall lowest error values for the three measures (MAPE = 16.64%, RMSE = 33.12 kWh/m²y, MAE = 22.21 kWh/m²y) and it also has the best performance in each segment. In relative terms, REPT performs better in segments *s*2 and *s*3, where MAPE is 14.75% and 15.90% respectively, while it performs substantially worse in segment *s*1, where MAPE = 20.25%. The second best algorithm is RF, with an overall MAPE of 16.89%, while SVM and ANN provide higher error values (MAPE = 21.52% and MAPE = 27.02% respectively). Therefore, the REPT algorithm has been selected for local energy demand prediction, in order to better characterize groups of flats with similar features.

Figure 4 analyzes in more depth the distribution of prediction errors, by reporting the box plots for *absolute error* and *percentage error* of the four algorithms over the three segments. The difference between REPT and the other algorithms is clear, especially in segments *s*1 and *s*2.

**Figure 4.** Box plots of absolute error and percentage error of estimation of energy demand for each algorithm and for the three different flat segments.

#### *4.4. Performance Comparison with a Single Layer Approach for PEDh Prediction*

In this section we compare the performance in the prediction of the *PEDh* value between the *two-layer approach* used in HEDEBAR and a *single layer approach*. This latter approach exploits a unique regression model for all three segments, instead of building different models tailored to each segment. The ANN, REPT, RF and SVM algorithms have been evaluated to build the regression model for *PEDh* prediction with the single layer approach. The configuration setting for the single layer approach is discussed in Section 4.6.

Results for the two-layer and single layer approaches are reported in Tables 3 and 4, respectively. The experimental evaluation showed that, as for the two-layer approach, the best performance for *PEDh* prediction with the single layer approach is also obtained using the REPT algorithm. However, the REPT algorithm applied to the overall data set provides a model with a MAPE value equal to 21.26% (see Table 4). Instead, using the two-layer approach, the REPT models tailored to each segment result in a significantly lower overall MAPE value, equal to 16.64% (see Table 3). The RMSE and MAE values are also significantly higher with the single layer approach (respectively, 37.37 kWh/m²y and 26.10 kWh/m²y) than with the two-layer approach (respectively, 33.12 kWh/m²y and 22.21 kWh/m²y). These results demonstrate the suitability of the two-layer approach used in HEDEBAR. In fact, the segmentation of the entire data set into groups of flats with similar energy demand allows building differentiated models, which can more precisely predict the *PEDh* value for a flat in the segment.


**Table 4.** Errors in predicting *PEDh* for ANN, REPT, RF and SVM algorithms using a single step regression.

#### *4.5. Interpretation of the Energy Demand Estimation Models*

This section provides a qualitative analysis of the impact of explanatory variables (building features) on the dependent variable (heating energy demand). The analysis makes use of the REPT model, which has the advantage of providing interpretable decision trees. To better understand how the REPT algorithm models the relationship between input variables and the heating energy demand, we illustrate the first levels of the obtained decision trees.

#### 4.5.1. Segment Estimation Model

The descriptive power of the REPT model comes from its capacity to highlight the features that most affect the energy demand, according to the analyzed certification system.

The REPT model is represented by a tree graph, made of nodes and leaves connected by edges. In the REPT model built in HEDEBAR for segment estimation, each path of the tree includes a subset of building features. The leaf node of a path represents the predicted class label, corresponding to the energy demand segment *s*1, *s*2 or *s*3 in this study. Therefore, each tree path includes a subset of features describing the buildings in one of the three segments.

A common way to build such trees is based on a recursive partitioning method. It consists of a forward step-wise approach where, at each node, the best split (in terms of split variable and split value) is automatically selected by the algorithm to maximize homogeneity in the child nodes. In this way the selection of split variables and split values is a data-driven process that does not require a manual selection by the analyst. As an example, the node including the *construction year* feature (*yc*) can use the value 2007 as splitting value. The two outgoing edges of the node are then associated with two distinct sets of values for *yc*, for example *yc* < 2007 and *yc* ≥ 2007. Thus, each path includes a subset of variables, together with their corresponding ranges of values, describing the buildings associated with the segment label appearing in the leaf node of the path. For the classification of a new flat, the tree path composed of all the edges with splitting rules satisfying the features of the flat is selected. The segment label appearing in the leaf node of the path is used to estimate the segment of energy demand for the flat.
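A single step of recursive partitioning, i.e., the data-driven choice of a split value for one variable, can be sketched as follows. The example uses information gain (entropy reduction) as the homogeneity criterion and a toy set of six flats with invented construction years and segment labels; a full tree learner would repeat this search over all variables and recurse on the child nodes.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values: Sequence[float], labels: Sequence[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) for the split 'value < t' vs 'value >= t' with maximum gain."""
    parent = entropy(labels)
    n = len(values)
    best_threshold, best_gain = None, 0.0
    # Candidate thresholds: every distinct value except the smallest,
    # so that both child nodes are always non-empty.
    for t in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        gain = parent - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_threshold, best_gain = t, gain
    return best_threshold, best_gain


# Toy data, for illustration only: construction year and segment label.
years = [1960, 1975, 1990, 2007, 2010, 2015]
segments = ["s3", "s3", "s2", "s1", "s1", "s1"]
threshold, gain = best_split(years, segments)
print(threshold, gain)
```

On this toy data the search selects 2007, since the split *yc* ≥ 2007 isolates a child node containing only flats of one segment.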

The first four levels of the REPT model are illustrated in Figure 5 (please refer to Table 1 for the interpretation of input variable symbols). It is possible to observe that the *average U-value of vertical opaque envelope* parameter (*Uo*) is the one mostly affecting the energy demand. Also the *aspect ratio* (*R*) and the *construction year* (*yc*) appear at the first three levels of the tree. *Average U-value of the windows* (*Uw*) and *average global efficiency for space heating* (*ηh*) appear only at the fourth level. In general, the splits closest to the root node are the most important ones. This is the reason why only the upper portion of the classification tree is shown in Figure 5.

**Figure 5.** REPT model of the classification phase. The first four levels of the tree are illustrated and, for each path, the histogram illustrates the number of leaves assigned to each segment.

To further facilitate the interpretation of the tree model and to highlight the characteristics of each segment, the classification rules that summarize the main paths of the tree were extracted. The model developed for segment estimation has an overall size of 342 nodes with a maximum depth of 20 levels. Identifying the most significant paths of the tree means extracting from the set of decision rules the ones that involve a significant number of records and reach high values of accuracy. These rules bring out the most representative building properties of each segment together with their ranges of values. Rules are extracted by traversing tree paths and they are structured in two parts: (i) the *rule antecedent* includes the building features and the corresponding ranges of values; (ii) the *rule consequent* includes the energy demand segment associated with flats that satisfy the conditions of the rule antecedent. Table 5 summarizes the subset of rules selected as reference examples from the REPT model. Specifically, for each segment we selected the rules with the highest classification accuracy among those that classify at least 500 flats. For the selected paths, the classification accuracy, i.e., the percentage of flats classified into the correct segment, ranges from 74.7% to 93.7%.


**Table 5.** Main rules of the REPT model for classification. For each row, intervals are specified only for the variables used by the corresponding rule. The last column contains the segment assigned by the rule.

Rules like those in Table 5 are an important source of information about the classification model. Therefore, by examining these decision rules, the significant factors influencing *PEDh* can be identified also by a non-expert user and it is possible to roughly estimate the segment of a new flat.

For instance, the rule for segment *s*1 is based on the average U-values of vertical opaque envelope (*Uo*) and of the windows (*Uw*) and on the construction year (*yc*). More specifically, the rule states that, if *Uo* < 0.37 W/m²K and *Uw* < 2.15 W/m²K, the building envelope guarantees a very high level of thermal insulation and low heat dissipation. Moreover, flats that satisfy this rule were built with construction standards adopted from 2007 onwards, thus guaranteeing an overall energy efficiency that is classified into segment *s*1.

The rule for segment *s*<sup>2</sup> also includes the aspect ratio (*R*) and the average global efficiency for space heating (*ηh*). This rule shows that, for high energy demand flats, *R* has intermediate values, while *ηh* is always lower than 0.77. The average U-value of vertical opaque envelope (*Uo*) has a minimum value of 0.56 W/m<sup>2</sup>K, which is higher than the maximum value used in the previous rule for *s*<sup>1</sup> (0.37 W/m<sup>2</sup>K), thus always implying a higher thermal transmittance. Moreover, the rule includes high energy demand flats constructed since 1992, i.e., the minimum construction year for this rule is 15 years earlier than the one for the previous rule (2007).

The rule selected for segment *s*<sup>3</sup> has very high values of aspect ratio (*R*), starting from a minimum of 0.63 m<sup>−1</sup>, which is almost equal to the maximum value for *s*<sup>2</sup> (0.68 m<sup>−1</sup>). Additional negative factors are the high lower bounds for the U-values (*Uo*, *Uw*) and a construction year (*yc*) always before 1991.

#### 4.5.2. Local Energy Demand Prediction Models

Figure 6 depicts the first three levels of the REPT regression models of *Local energy demand estimation* for the three flat segments. The variables of the splitting rules associated with the tree nodes are almost the same as those of the classification model represented in Figure 5; however, their importance varies according to the segment. The tree for segment *s*<sup>1</sup> has a single variable for each level, i.e., *U-value of vertical opaque envelope* (*Uo*) at the first, *aspect ratio* (*R*) at the second, and *U-value of the windows* (*Uw*) at the third, thus providing a simple and easily interpretable model. In segment *s*<sup>2</sup> the *average global efficiency for space heating* (*ηh*) has a higher importance than in *s*1, as it appears at the third level of the tree. The same variable appears in most of the rules of the same level in segment *s*3. Here *average U-value of the windows* (*Uw*) is considered only for the most efficient flats (with *Uo* < 0.76 W/m<sup>2</sup>K and *R* < 0.89 m<sup>−1</sup>), while for those with higher energy demand, the *average global efficiency for space heating* (*ηh*) becomes more significant.

The splitting value of *average U-value of vertical opaque envelope* (*Uo*) increases from segment *s*<sup>1</sup> to segment *s*3, meaning that flats belonging to the first segment are characterized by better thermally insulated walls.

**Figure 6.** REPT models for each of the 3 flat segments.

#### *4.6. Parameter Tuning of Algorithms*

This section describes how the main parameters of the four algorithms considered in this study were tuned in order to reach the lowest values of prediction error both in the *Segment estimation* and *Local energy demand prediction* phases in the HEDEBAR framework. The same tuning procedure has been used also for the configuration of the single layer approach considered for performance comparison and described in Section 4.4.

For both phases, the prediction error was assessed using the *k*-fold cross-validation method, with *k* = 10. The input dataset for the target phase is split into *k* subsets of the same size. In turn, one subset is used for testing and the remaining *k* − 1 are used for training. Hence, *k* independent training and test iterations are performed. For each iteration, the training set is used by the four algorithms to generate the classification or regression models, according to the target phase in the HEDEBAR framework. Then, the test set is used to evaluate the capacity of each classification and regression model to predict, respectively, the segment of energy demand and the *PEDh* value of new flats. The overall error value after the *k* iterations is computed as the mean of the errors of the *k* tests.
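The *k*-fold procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the tooling used in the study: the function names, the stand-in learner (which always predicts the training mean), and the synthetic data are all hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k folds of (almost) equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, error, k=10, seed=0):
    """Return the mean test error over k independent train/test iterations."""
    folds = k_fold_indices(len(X), k, seed)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Fold i is held out for testing; the other k - 1 folds form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        predictions = [predict(model, X[j]) for j in test_idx]
        fold_errors.append(error([y[j] for j in test_idx], predictions))
    return mean(fold_errors)


# Stand-in learner (an assumption): always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, x):
    return model


def mean_abs_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Synthetic data, not taken from the EPC dataset.
X = [[float(i)] for i in range(100)]
y = [50.0 + (i % 5) for i in range(100)]
cv_error = cross_validate(X, y, fit_mean, predict_mean, mean_abs_error, k=10)
print(f"10-fold mean absolute error: {cv_error:.3f}")
```

Any of the four algorithms could take the place of the stand-in learner by supplying its own `fit` and `predict` callables.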

The tuning procedure produced similar parameter settings for each of the four algorithms used in HEDEBAR when creating the classification and the regression models. These parameter settings also turned out to be the optimal configuration for the single layer approach. As an example, this section describes the results of parameter tuning for the creation of the regression model used in the *Local energy demand prediction* phase. The parameter tuning procedure is aimed at minimizing the values of the prediction errors MAPE, MAE, and RMSE (Figure 7).
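For reference, the three error measures can be computed as follows. The sketch uses the standard textbook definitions, which are assumed here to match those adopted in the study; the sample values are invented for illustration.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target variable."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large deviations more than MAE."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if a true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values, for illustration only.
y_true = [100.0, 200.0, 50.0]
y_pred = [110.0, 180.0, 50.0]
print(f"MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}  "
      f"MAPE={mape(y_true, y_pred):.2f}%")
```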

(**a**) ANN algorithm with respect to the size of the hidden layer.

(**b**) REPT algorithm with respect to the minimum number of instances per leaf *M*.

(**c**) RF algorithm with respect to the number of trees.

(**d**) SVM algorithm with respect to the complexity constant *C*.

**Figure 7.** Overall Local energy demand prediction errors of the algorithms for different values of their parameters.

For the ANN algorithm, a single hidden layer of variable size was considered, since using more than one layer did not provide any significant improvement in accuracy. Some common rules of thumb for the size of the hidden layer in the ANN are suggested by different works such as [55], where the number of neurons is related to the number of input and output variables. Overall, the size of the hidden layer should be high enough to let the ANN model the problem correctly, but also low enough to ensure generalization. An increasing number of neurons in the interval [4, 100] was used during the tests, until the prediction error started to grow due to over-fitting. The other parameters of the ANN are: *learning*\_*rate* = 0.3, *training*\_*cycles* = 10<sup>3</sup>, *ε* = 1 × 10<sup>−5</sup>. The values of RMSE, MAE and MAPE for different sizes of the hidden layer are reported in Figure 7a. A hidden layer of 16 neurons provides the lowest values of the three errors.

In the REPT algorithm, the dimension of the pruning subset was set to one third of the training set, hence with three folds in the algorithm (*N* = 3). No maximum tree depth was set. The *information gain* was used as the splitting criterion. The REPT algorithm was tuned by varying the minimum number of instances per leaf (*M* ∈ {10, 20, 30, 40, 50}). The values of RMSE, MAE and MAPE are reported in Figure 7b. The three error measures slightly, yet steadily, increase with *M*. Therefore *M* was set equal to 10.

In the RF algorithm, the previous settings of REPT were used for all the decision trees. The variation of the prediction error was assessed with respect to the number of trees *I* in the range [10, 100]. The values of RMSE, MAE and MAPE are reported in Figure 7c. *I* = 70 provides the lowest error values.

For SVM regression, a linear kernel function was considered and the variation of prediction errors, with respect to the complexity constant *C*, was assessed. This variable is used to set a degree of tolerance for misclassification of training samples. A too large value of the complexity constant can lead to over-fitting, while too small values may result in over-generalization. Values for *C* have been selected in the range [0, 10]. The other parameter settings of the SVM are: *max*\_*iterations* = 10<sup>4</sup>, convergence = 1 × 10<sup>−3</sup>. The values of RMSE, MAE and MAPE are reported in Figure 7d. The trends of the three error measures are nearly constant with a slightly lower value of RMSE for *C* = 0.
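The tuning loop shared by the four algorithms amounts to a one-dimensional parameter sweep. The sketch below is a generic illustration, not the procedure as implemented in the study: both helper functions are hypothetical, and the toy error curves are invented stand-ins for a cross-validated error, not values read from Figure 7.

```python
def sweep(candidates, evaluate):
    """Evaluate every candidate value; return (best_value, errors_by_value)."""
    errors = {value: evaluate(value) for value in candidates}
    best_value = min(errors, key=errors.get)
    return best_value, errors


def sweep_until_worse(candidates, evaluate, patience=2):
    """Scan candidates in order and stop early.

    The scan ends after `patience` consecutive candidates fail to improve
    on the best error seen so far; the best value found is returned.
    """
    best_value, best_error, misses = None, float("inf"), 0
    for value in candidates:
        err = evaluate(value)
        if err < best_error:
            best_value, best_error, misses = value, err, 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best_value


# Invented error curves (assumptions, for illustration only).
def rising_error(m):
    return 10.0 + 0.05 * m       # error grows steadily with the parameter


def u_shaped_error(h):
    return abs(h - 12) + 1.0     # error falls, then rises again


best_m, _ = sweep([10, 20, 30, 40, 50], rising_error)
best_h = sweep_until_worse([4, 8, 12, 16, 20, 24], u_shaped_error)
print(best_m, best_h)
```

The exhaustive variant suits small grids such as the five values of *M*, while the early-stopping variant mirrors the idea of growing the hidden layer only until the error starts to rise.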

#### **5. Discussion and Conclusions**

In this paper, the HEDEBAR methodology for the automatic asset rating of the energy efficiency of flats has been described. We recall that the analysis has been possible thanks to the availability of open data of Energy Performance Certificates. HEDEBAR proposes a two-layer approach to compute the ideal *Primary Energy Demand for space heating* (*PEDh*) of flats according to the certification scheme used to issue their EPCs. In this section we discuss the results achieved with the proposed two-layer approach, together with the interpretation and possible exploitation of the extracted knowledge.

**Accurate estimation of the flat energy demand with a reduced features set.** Experimental results demonstrated the ability of the HEDEBAR methodology to estimate the *PEDh* value for a flat. *PEDh* is not the actual energy consumption of a flat, but its primary energy demand calculated in standard conditions. It is a significant parameter for the comparison of flats based on their features. The estimated values of *PEDh* are precise enough to provide a dependable assessment of flat energy efficiency for different values of the features characterizing flats.

From a methodological perspective, the experimental evaluation demonstrated that the two-layer approach used in HEDEBAR performs significantly better than a single layer algorithm in estimating the *PEDh* (MAPE values are 16.64% and 29.82%, respectively). Therefore, segmenting the initial data collection into groups of flats with similar energy demand makes it possible to produce differentiated models, which better fit the specific features of the respective segments.
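The two-layer idea, a segment classifier that routes each flat to a segment-specific regression model, can be sketched as follows. This is a structural illustration only: the class and helper names are hypothetical, the stand-in learners (a midpoint threshold on *Uo* and a per-segment mean) replace the RF and REPT models of the study, and the four training flats are invented.

```python
from collections import defaultdict
from statistics import mean


class TwoLayerEstimator:
    """Layer 1 assigns a segment; layer 2 applies that segment's regressor."""

    def __init__(self, fit_classifier, fit_regressor):
        self.fit_classifier = fit_classifier
        self.fit_regressor = fit_regressor

    def fit(self, X, segments, y):
        self.classifier = self.fit_classifier(X, segments)
        grouped = defaultdict(lambda: ([], []))
        for x, segment, target in zip(X, segments, y):
            grouped[segment][0].append(x)
            grouped[segment][1].append(target)
        # One regression model per segment, trained only on that segment's flats.
        self.regressors = {s: self.fit_regressor(Xs, ys)
                           for s, (Xs, ys) in grouped.items()}
        return self

    def predict(self, x):
        segment = self.classifier(x)
        return segment, self.regressors[segment](x)


def fit_threshold_classifier(X, segments):
    """Stand-in classifier for exactly two segments: split at the midpoint of mean Uo."""
    means = {s: mean(x["Uo"] for x, seg in zip(X, segments) if seg == s)
             for s in set(segments)}
    low, high = sorted(means, key=means.get)
    cut = (means[low] + means[high]) / 2
    return lambda x: low if x["Uo"] < cut else high


def fit_mean_regressor(X, y):
    """Stand-in regressor: predicts the mean target of its segment."""
    value = mean(y)
    return lambda x: value


# Invented training flats (Uo in W/m2K, target in arbitrary energy units).
X = [{"Uo": 0.30}, {"Uo": 0.35}, {"Uo": 0.90}, {"Uo": 1.10}]
segments = ["s1", "s1", "s2", "s2"]
y = [40.0, 50.0, 150.0, 170.0]

model = TwoLayerEstimator(fit_threshold_classifier, fit_mean_regressor).fit(X, segments, y)
print(model.predict({"Uo": 0.40}))
print(model.predict({"Uo": 1.00}))
```

Because the classifier and regressor are passed in as callables, different algorithms can be combined in the two layers without changing the routing logic.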

The predictive performance of the HEDEBAR methodology is similar to that of Khayatian et al. [35], where ANNs are used to predict the *PEDh* value using EPCs related to the Lombardy region. Indeed, even though the experimental evaluations were conducted on different datasets, HEDEBAR and the approach in [35] provide comparable results (MAPE equal to 16.64% for HEDEBAR and to 14.44% in [35]). However, differently from [35], HEDEBAR estimates the value of *PEDh* in two steps using the REPT algorithm, which provides an interpretable model.

**Modular approach able to integrate various algorithms and applicable to EPCs from other certification schemes**. The HEDEBAR approach can make use of various classification and regression algorithms and can be used also to analyze data of EPCs issued according to other certification schemes.

The experiments highlight the best-performing algorithms among those tested. In the *Segment estimation* phase, the RF algorithm has the highest classification accuracy, while, in the *Local energy demand prediction* phase, the REPT algorithm has the lowest error values in predicting *PEDh*. REPT also has a good classification accuracy. Therefore, RF in the first phase and REPT in the second phase turned out to be the most suitable combination of algorithms for the estimation of *PEDh* from the variables included in the EPC data set.

**Interpretation of the energy demand estimation models.** A key advantage of HEDEBAR is the use of the REPT algorithm, whose decision tree models make results understandable and exploitable also for non-domain experts. Useful information can be obtained from this model, as it helps to discover energy patterns in large datasets in a straightforward way. The algorithm automatically selects the attributes for generating split rules, and the ones closest to the root node can be assumed to be the most influential attributes. Therefore, the performance improvement brought by the two-layer approach, especially to the REPT algorithm, provides the HEDEBAR methodology with both a good estimation precision and a set of interpretable models of energy demand. The resulting models pointed out the most relevant features according to the considered rating system.

In the *Segment estimation* layer, 5 features out of 10 (*average U-values of opaque envelope and of the windows*, *aspect ratio*, *construction year*, and *average global efficiency for space heating*) appear in the first four levels of the decision tree and can be considered the most relevant ones of the model. Indeed, they were preferred to other variables for splitting the initial flat set since they generate more homogeneous subsets in terms of *PEDh* value, thus allowing the overall model to reach a more accurate segmentation of the flat set. The characteristics of the three segments of energy demand are also summarized by means of short *decision rules*, which bring out the most representative building properties and their ranges of values for each segment. With a view to improving the efficiency of a flat, the model makes it possible to identify the features that contribute most to its membership of a specific energy demand segment. A proper change of their values, when possible (e.g., by means of targeted refurbishment actions), can substantially increase the energy efficiency of the flat. For some flats, bringing the values of a few features within the appropriate ranges causes their reassignment to a lower segment.

In the *Local energy demand prediction* layer, 4 features out of 10 appear in the first three levels of the three decision tree models (the same as in Segment estimation except *construction year*). The differentiated analysis highlighted the main features affecting *PEDh* for different segments of energy demand. In this case, the *U-value of vertical opaque envelope* (*Uo*) proved to be one of the most important variables for all segments. Indeed, *Uo* is at the first level of all three REPT models, with increasing splitting values from *s*<sup>1</sup> to *s*3. The aspect ratio (*R*) is also a significant variable, as it appears in the second level of all three REPT models. The *average U-value of windows* (*Uw*) is more important for low levels of energy demand (segment *s*1), where the contribution of heat loss through windows can make the difference. On the other hand, the relevance of the *overall efficiency of the heating system* (*ηh*) is evident only for *high* and *very high* energy demand flats (segments *s*<sup>2</sup> and *s*3).

**Possible exploitation of HEDEBAR findings.** Energy demand estimation is crucial to assess the energy performance of buildings and represents the first step in any decision aimed at enhancing their efficiency. The proposed approach has the advantage of learning a model from data about previous certificates, which is then applied to new flats. The methodology can concretely help domain experts to evaluate the possible improvements in the energy efficiency of flats. To this purpose, data-driven models are useful for quickly estimating the expected building energy demand and for setting credible targets for improving performance [56]. In general, designers and authority planners should exploit such tools, which can suggest where to focus their efforts among large stocks of buildings and which retrofitting strategies could be the most convenient. In this way it is possible to plan future financial investment policies that leverage specific building features and help devise more targeted actions to improve energy efficiency for different segments of buildings. Moreover, the proposed methodological process makes it possible to extract, by means of interpretable models (i.e., decision trees), useful and understandable knowledge about the expected energy performance of buildings according to a few physical driving variables. Such benchmarks should be the reference for building owners to improve the energy performance when it is poor and for technicians to identify the optimal cost-effective energy saving opportunities.

**Author Contributions:** The research presented in this paper was a collaborative effort made by all the authors. All the authors contributed to the literature review, methodology, implementation and experimental analyses, as well as to the writing and reviewing of the paper.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors express their gratitude to Giovanni Nuvoli (Settore Sviluppo Energetico Sostenibile Regione Piemonte) and to CSI Piemonte.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
