1. Introduction
As nations around the world struggle with the technical aspects of energy transitions, i.e., the need to deploy new, clean energy technologies to meet their future energy needs, policy makers and governments are wrestling with policy and energy system designs that meet energy, economic and social needs. These challenges can be summarized as the need for equitable and sustainable energy system design, often described as the just transition [1]. The just transition deals with a number of critical issues associated with rapidly changing energy systems, including impacts on employment [2], pressures on energy prices [3], health impacts [4], the allocation of subsidies [5], and a number of potential trade-offs depending on stakeholder preferences [6,7,8].
In order to better understand these trade-offs, policy makers and researchers utilize a number of tools to undertake stakeholder engagement and, ideally, to propose energy systems which can meet the needs of all stakeholders, and which can achieve common energy transition goals including sustainability, equitability and the ability to meet carbon reduction targets. These stakeholder engagement tools include public focus groups, interviews, surveys, social media and a variety of interactive engagement methodologies, among others [
9,
10,
11]. In terms of research approaches, in addition to literature reviews, survey analysis is often used to understand stakeholders’ opinions and to improve outcomes across a range of fields [
12]. Focal energy issues have shifted over time, i.e., from environmental justice, concerned with unrestricted stakeholder involvement in environmental policy development [
13], toward climate justice, more concerned with sharing the burdens of climate change [
14,
15] and to the more recent ‘energy justice’ which is concerned with the distribution, recognition and procedural justice aspects of energy [
16,
17]. For this reason, surveys, over time, have become a useful tool to elicit the critical issues surrounding energy systems. With a recent focus on the just transition, i.e., a transition which achieves carbon reduction goals, while upholding the tenets of environmental justice, climate justice and energy justice [
16], targeted surveys and their analysis can complement efforts toward designing energy systems which are both equitable and sustainable.
One major challenge faced by researchers and energy policy makers is the successful design of surveys that can meet these needs, followed by their application to energy system design. Although some unique approaches exist, such as responsive survey design to aid in developing surveys that improve response rates and reduce bias [18], there is also a significant body of survey work in the energy and energy transition space that can be drawn upon for retrospective analysis to determine which responses most strongly influence energy technology and system preferences. This paper applies statistical analysis and innovative machine learning methods to recent, topical surveys regarding energy system preferences, opinions and design priorities in an attempt to extract the factors that are most influential in the design of desirable (i.e., equitable and sustainable) energy systems.
The aim of this research is to provide an evidence base for the streamlining of stakeholder engagement, which can be utilized to underpin bottom-up, desirable future energy system design. Four energy-related surveys from Japan and the United States (US) are utilized to derive the key factors that should be explored as a priority in order to best inform desirable energy system design and the prediction of stakeholder responses.
The study is presented as follows:
Section 2 details recent literature regarding stakeholder engagement, survey design challenges and approaches to energy system design;
Section 3 details the machine learning methods used in this study to identify influential survey responses which are useful to predict energy factor and system design preferences and their importance regarding survey design;
Section 4 describes the results and variation across investigated surveys and nations as well as the efficacy of various machine learning approaches;
Section 5 provides the discussion and implications for future energy survey design; and
Section 6 provides the conclusions.
2. Literature Review
The challenge of designing appropriate surveys to elicit desired responses while being conscious of cost, timeliness and effective response elicitation is not unique to the energy field. For example, in the design of a national survey in the Netherlands, an increasing need to provide more targeted and detailed information and to improve analysis was identified. To meet these needs, researchers undertook a fundamental change in their surveying processes in order to achieve an integrated design and to move away from single-purpose surveys [19]. In the realm of information and communications technology, researchers sought to design a web survey that would identify travel behaviors. To design this survey, they reviewed previous work to identify gaps and necessary questions (i.e., a literature review) and attempted to design a survey that would elicit the desired data points while limiting respondent fatigue and cognitive burden [20]. In the case of a US survey to assess social life, health and aging, the design of the survey instrument and the definition of measurement domains took several months. In addition, a pretest of the questionnaire identified issues with the time required to administer it, leading to a prioritization of the factors to be extracted and the removal of non-essential factors from the final survey instrument [21].
Cognizant of the need to reduce both the cost and respondent burden of surveys, a modular design was proposed for survey design using a random search algorithm which attempts to maintain precision requirements in light of other constraints. One shortcoming of such an approach is the identified need to estimate required design effects for the algorithm, or to undertake a pilot sample [
22]. Recognizing that innovation in survey design can often yield unexpected results or have a negative impact on outcomes such as bias or survey costs, the idea of ‘responsive survey design’ has been proposed. Utilized in Germany, researchers proposed a self-administered survey and experimented with incentives for completion, question ordering (sequential or simultaneous mode) and cost, finding that response rates can be improved through prepaid incentives, and that mode choice sequence had only a small impact on cost overall, while response bias was not largely impacted [
18]. Another source of response bias was reported to occur in the use of different devices for responding to surveys, particularly for mobile devices (anticipated to account for up to 60% of respondents) and the need to ensure consistent display of response ranges to avoid prioritization of visible elements [
23].
In terms of general survey design ideals, the Pew Research Center offers guidance in terms of keeping question numbers to a minimum to ascertain the defined goals of the survey, using closed-ended questions to compare specific traits, and being conscious of the order of questions, placing open-ended questions before closed-ended ones to avoid order effects and leading bias [
24]. The key considerations of brevity, consistency, avoiding leading questions and having appropriate categorical response ranges are also extolled by the General Medical Council in their National Training Survey best practice guidelines [
25]. Having a clear purpose for the survey is also critical, along with only asking questions that add value and relate specifically to the research goals. Testing a survey is also identified as a critical step prior to large scale deployment [
26]. Further, the use of literature review to establish if model questions have been implemented and tested in previous surveys which could meet the needs of new research is also identified as valuable by Harvard University, with the additional benefit of comparative analysis across surveys [
27].
Although no specific machine learning approaches are proposed in the literature to retrospectively design surveys, the idea itself is consistent with literature review processes, in that it utilizes existing survey data. On the other hand, the use of machine learning techniques to analyze survey results is becoming increasingly common, with multiple benefits expected in addition to the computing power on offer [
28]. For example, machine learning algorithms are expected to offer advantages in terms of quality assurance and real-time data monitoring and trend identification [
29]. Furthermore, neural networks have been identified as more capable of modeling non-linear relationships than traditional analysis approaches such as statistical regression [
30]. Other machine learning approaches such as decision tree classification algorithms have also been shown to perform better than other classifiers for dependent variable classification related to stakeholder intent [
31]. Research has also been undertaken regarding the broad range of computational methods applicable to response prediction, including data hierarchy, collaborative filtering, supervised, and semi- and unsupervised learning-based approaches, with a focus on online advertising responses [
32]. Considering the energy-related fields of study which employ machine learning and artificial intelligence, energy and eco-efficiency have been investigated in some depth recently. For example, machine learning approaches were combined with optimization methods to evaluate cement companies’ eco-efficiency, detailing model efficiencies [
33]. Further, an evaluation of machine learning employing recommender systems was undertaken to show how these could be used to improve building energy efficiency, as well as the possibility of combining these approaches with smart meters and Internet of Things sensors [
34]. Additionally, a review-based study on enhancing energy efficiency among other aspects via occupancy prediction was undertaken, demonstrating the value of both machine learning and neural network approaches in combination with sensors [
35]. An additional review was undertaken, considering machine learning toward achieving thermal comfort, finding that machine learning-based controls can improve indoor air quality while reducing carbon dioxide levels and energy consumption [
36].
Toward energy system design, neural networks are often utilized in the fields of evaluation, control, and operation or in the application of artificial intelligence for the realization of a modern smart grid or renewable energy systems, with applications increasing rapidly over time [
37]. To support energy system optimization, a modeling approach was developed using neural networks and a hybrid optimization algorithm (HOA) which can reach Pareto optimal solutions approximately 17 times faster than by using an actual engineering model (AEM) alone [
38]. Further, it was identified that a data-driven approach based on reinforcement learning to design distributed energy systems by altering the number of hidden layers in the neural network can effectively predict system needs, and reinforcement learning with deep neural networks can help energy systems operate more efficiently with minimal impacts on the grid itself [
39]. Research utilizing metamodels which can significantly reduce implementation effort and runtime per scenario for artificial intelligence-based modeling when compared to traditional white box modeling approaches has been explored in Germany, achieving a mitigation of complexity while maintaining a high level of accuracy [
40].
Recognizing the challenges toward survey design, the existing approaches toward determining ideal questions and response ranges, and the recent utilization of machine learning for survey response analysis and prediction, this study proposes a methodology for the streamlining of survey design, using energy as an example. Specifically, the aim of this research is to leverage existing survey data (with similar applicability to consumption or behavioral data) and machine learning to reduce survey burden. Burden here refers to three key aspects: (1) the burden imposed by the high cost of the deployment of surveys; (2) the burden imposed on researchers in designing and testing survey instruments; and (3) the burden on respondents, in terms of fatigue in having to respond to an excessive number of questions. By reducing these burdens, it is anticipated that the deployment costs can be reduced, enabling a larger sample to be obtained for the same budget expenditure. In addition, the time and effort required to develop surveys can be reduced, allowing for the allocation of limited resources toward analyses and energy system design. Finally, it is anticipated that respondent fatigue can also be reduced, improving the quality and accuracy of responses while reducing response bias. By reducing these burdens and improving research outcomes, a flow-on benefit will be the usefulness of obtained data to inform future energy system design and analysis activities. Overall, we are aiming to apply machine learning to streamline survey design, improve deployment and data analysis, and to provide prediction-based insights toward energy system design.
3. Methodology
The methodology employed in this study utilizes machine learning evaluation of previously administered energy system-related surveys to establish the demographic factors or response types which are most influential toward our variables of interest, i.e., those related to energy factor preference and system design. Further, machine learning algorithms are compared to identify those which exhibit optimal prediction (classification) accuracy in eliciting critical demographics and response types.
3.1. Data Sources
The surveys utilized in this study include three conducted in Japan and one conducted in the US. The deployment date, original use case, number of questions asked, samples gathered, and common factors analyzed across surveys are detailed in Table 1.
In addition to the common factor questions identified in Table 1, each survey captures demographic data and, in some cases, additional specific enquiries about educational achievement, region, race, etc. A list of all factors analyzed is presented in Appendix A. All surveys are used to assess energy system design factors, while only surveys 3 and 4 are used to analyze key future energy system preferences due to the different design specifications of surveys utilized in this study.
3.2. Analysis Methods
Survey data are prepared by coding responses as numerical ranges (for levels of knowledge, preferences, etc.) or categorical inputs (importance, increase or decrease in specific energy sources in the future energy mix, etc.). Predictive machine learning models, including naïve Bayes, generalized linear model (GLM), logistic regression, large margin, deep learning, decision tree, random forest, gradient boosted trees and support vector machine (SVM), are run for each of the target factors of interest for this study, as detailed in Table 2.
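As a rough illustration of the comparison step, the sketch below scores each model's predicted categories against the observed responses and reports the most accurate one. It is not the RapidMiner workflow used in this study: the function names `accuracy` and `best_model`, the model labels and the toy data are all hypothetical, and a hold-out set of already-predicted labels is assumed.

```python
from typing import Dict, List, Tuple


def accuracy(predicted: List[str], actual: List[str]) -> float:
    """Share of respondents whose category was predicted correctly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def best_model(predictions: Dict[str, List[str]],
               actual: List[str]) -> Tuple[str, float]:
    """Return the (model name, accuracy) pair with the highest accuracy."""
    scores = {name: accuracy(pred, actual)
              for name, pred in predictions.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Made-up hold-out labels and predictions, for illustration only.
actual = ["increase", "increase", "neutral", "decrease", "increase"]
predictions = {
    "naive_bayes":
        ["increase", "neutral", "neutral", "increase", "increase"],
    "gradient_boosted_trees":
        ["increase", "increase", "neutral", "decrease", "neutral"],
}
winner, score = best_model(predictions, actual)
print(winner, round(score, 2))
```

In practice the predictions would come from models trained on the coded survey responses; only the scoring and selection logic is shown here.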
For energy system preferences, the survey respondents were asked to indicate how important they felt each factor was using a Likert scale response ranging from 1 to 5. These responses were then summarized into the three categories of unimportant (responses in the range of 1–2), neutral (a response of 3), and important (responses in the range of 4–5). Similarly, for the energy system design factors, where respondents were asked to indicate their preference toward increasing or decreasing specific energy sources, we used the categories of decrease, neutral and increase, with a composite score for fossil fuels (oil, coal, natural gas, etc.) and renewables (solar, wind, hydro, etc.).
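The recoding described above can be sketched as follows. The 1–2, 3 and 4–5 cut points come from the text; the composite score, however, is an assumption made only for illustration (each source is coded −1, 0 or +1 and the sign of the mean is taken), since the exact composite rule is not specified here. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict


def recode_importance(score: int) -> str:
    """Collapse a 1-5 Likert response into three categories."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("Likert score must be an integer from 1 to 5")
    if score <= 2:
        return "unimportant"
    if score == 3:
        return "neutral"
    return "important"


def composite_direction(per_source: Dict[str, int]) -> str:
    """Combine per-source preferences into one category.

    ASSUMPTION: each source is coded -1 (decrease), 0 (neutral) or
    +1 (increase), and the composite is the sign of the mean.
    """
    avg = mean(per_source.values())
    if avg < 0:
        return "decrease"
    if avg > 0:
        return "increase"
    return "neutral"


print([recode_importance(s) for s in (1, 3, 5)])
print(composite_direction({"oil": -1, "coal": -1, "natural_gas": 0}))
```

Other composite rules (for example, a majority vote across sources) would fit the same interface.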
Machine learning modeling and algorithm performance comparisons are undertaken using RapidMiner Studio v9.10.010 on a 64-bit architecture. The data preparation, modeling, comparison, and influential factor extraction process flow is summarized in Figure 1.
The results are summarized to detail influential factors toward predicting desired targets for each survey, optimal machine learning models and their predictive ability (i.e., accuracy). Finally, sensitivity analysis is conducted to test whether combined survey samples improve prediction ability where survey variables and ranges allow.
4. Results
The results include machine learning algorithm predictive accuracy results, followed by the weight of the most influential inputs toward target variables. Hereafter, surveys are referred to by number, as identified in Table 1.
4.1. Energy System Design
Energy system design factors are evaluated based on whether the respondents indicated a preference to increase or decrease fossil fuels (encoded FF), nuclear energy (NE) and renewable energy (RE) within the energy mix. Machine learning algorithm predictive accuracy for energy system design target factors is summarized in Table 3, with the best performing algorithms being identified in bold text.
The predictive model accuracy ranges from a low of 54.9% for predicting nuclear energy preferences in survey 3 to a high of 78.9% for predicting preferences toward renewable energy in survey 1. In all surveys, the prediction accuracy for renewable energy preferences was the highest. The sample size and the number of questions posed did not appear to heavily affect prediction accuracy for single survey analysis. No one machine learning approach was consistently superior, with GLM, large margin and gradient boosted trees models each having the best predictive ability three times (25% of the time, respectively), decision tree twice, and random forest once. Naïve Bayes, logistic regression, deep learning and SVM models did not demonstrate superior predictive ability for any of the survey targets.
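The tally above can be checked with a few lines of arithmetic. The counts are those reported in the text; the total of 12 prediction targets follows from four surveys with three energy types each.

```python
from collections import Counter

# Number of times each algorithm was the most accurate (from the text).
best_counts = Counter({
    "GLM": 3,
    "large margin": 3,
    "gradient boosted trees": 3,
    "decision tree": 2,
    "random forest": 1,
})

# Four surveys x three targets (FF, NE, RE) = 12 prediction tasks.
total_targets = sum(best_counts.values())
shares = {name: count / total_targets
          for name, count in best_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```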
Utilizing the outcomes of the best performing models for each survey and factor, the most influential factors underpinning predictions can be extracted, as shown in Figure 2, according to their comparative weights. Response variables are categorized as preferences (P), knowledge levels (K), demographics (D), and behavior (B).
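One way to picture this extraction step is to rank input variables by weight and count how often each response category appears among the top entries. The sketch below assumes the weights have already been exported from the best performing model; the weights shown are invented for illustration and are not results from this study, and the helper names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# (variable name, category code P/K/D/B, weight)
Factor = Tuple[str, str, float]


def top_factors(factors: List[Factor], k: int) -> List[Factor]:
    """Return the k factors with the largest weights."""
    return sorted(factors, key=lambda f: f[2], reverse=True)[:k]


def category_counts(factors: List[Factor]) -> Counter:
    """Count how many factors fall into each response category."""
    return Counter(category for _, category, _ in factors)


# Invented weights, for illustration only.
weights: List[Factor] = [
    ("age", "D", 0.08),
    ("nuclear preference", "P", 0.40),
    ("FCV ownership", "B", 0.05),
    ("solar knowledge", "K", 0.12),
    ("renewable preference", "P", 0.35),
]

top3 = top_factors(weights, k=3)
print([name for name, _, _ in top3])
print(category_counts(top3))
```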
Energy system design preferences toward fossil fuels appear to be highly influenced by people’s preferences toward nuclear and renewable energy types. Other commonly influential factors identified were knowledge of solar energy and the respondents’ age. In terms of types of predictors, knowledge levels were the most commonly influential responses, followed by preferences and demographics, and behavior. For nuclear energy, renewable preferences were highly influential, followed, though to a lower degree, by fossil fuel preferences. Commonly influential factors included knowledge of wind and the demographic factor of sex. Again, knowledge-based responses were most commonly influential on the predictability of energy system design preferences. For renewable energy preference prediction, the most accurate among energy types, fossil and nuclear energy preferences were influential but at a relatively lower level than for other energy types. Solar knowledge levels and the age demographic factor were commonly influential across surveys.
4.2. Energy System Preferences
The energy system preferences tested include environmental protection (coded as EP), climate change response (CC), and social equity (SE); these acted as proxies for equitable and sustainable energy systems (i.e., desirable in accordance with a just transition). Machine learning algorithm predictive accuracy is detailed for each energy system preference target variable for surveys 3 and 4 (the two surveys exploring these factors) in Table 4. The best performing algorithms are identified in bold text.
The results show that the predictability of people’s preference toward environmental protection is higher than that of climate change response and social equity. The gradient boosted trees algorithm was the most accurate in four out of six predictions, having the second highest performance in the remainder of cases.
The six most influential response variables toward the prediction of energy system preferences from each analyzed survey are detailed in Figure 3. As was the case for energy system design, the response variables are categorized as preferences (P), knowledge levels (K), demographics (D), or behavior (B).
As shown in Figure 3, preferences regarding the energy system and energy sources are the most influential toward predicting respondents’ overall views toward the importance of environmental protection, climate change response and social equity. Some preferences are common across surveys, notably respondents’ preferences toward solar power in the future energy mix, and the desire for energy availability. Knowledge of energy types was influential toward environmental protection and climate change response; however, this was not the case for social equity importance. In terms of influential demographics, age was influential toward environmental protection importance, sex was influential on climate change response and social equity importance, and in the case of the US survey (Survey 3), ethnicity was also influential.
4.3. Sensitivity Analysis
As the addition of data points has been empirically shown to improve machine learning performance, albeit with some caveats such as introducing samples from different time periods [46], our sensitivity analysis combines compatible surveys to develop larger datasets, each within 2 years of each other.
Following the combination of compatible surveys, i.e., surveys 1 and 2 for energy system design factors and surveys 3 and 4 for energy system preferences, we repeat the machine learning multi-model prediction analysis and contrast the accuracy results. As the number of questions (input variables) is limited to the survey with the smallest number of usable questions, predictive accuracy is contrasted with survey 2 for energy system design factors, and survey 3 for energy system preferences, with the results detailed in Figure 4.
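The restriction to shared questions when pooling surveys can be sketched as below. This is a minimal illustration rather than the procedure used in the study: it assumes each respondent is stored as a dictionary of coded answers and that identically named questions use the same response ranges. The question names and values are made up.

```python
from typing import Dict, List

Record = Dict[str, int]  # question name -> coded response


def combine_surveys(a: List[Record], b: List[Record]) -> List[Record]:
    """Pool two surveys, keeping only questions present in every record."""
    records = a + b
    if not records:
        return []
    shared = set(records[0])
    for record in records[1:]:
        shared &= set(record)
    # Sort keys so every pooled record has the same column order.
    return [{q: r[q] for q in sorted(shared)} for r in records]


# Made-up coded responses, for illustration only.
survey_a = [
    {"age": 3, "solar_knowledge": 4, "fcv_owner": 0},
    {"age": 5, "solar_knowledge": 2, "fcv_owner": 1},
]
survey_b = [{"age": 2, "solar_knowledge": 5}]

pooled = combine_surveys(survey_a, survey_b)
print(len(pooled), sorted(pooled[0]))
```

The pooled sample is larger, but the input variables shrink to those of the survey with the fewest usable questions, which is why accuracy is contrasted against that survey.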
For fossil fuel and nuclear energy preference predictability, the combined survey samples offered consistently higher accuracy across all models, with improvements ranging from 1.7 to 3.7% for fossil fuels and from 0.9 to 2.6% for nuclear energy. In the case of renewable energy preference predictability, which was predicted at a much higher level of accuracy than that for fossil fuels or nuclear energy, the combined sample did not improve predictability in all cases, with changes ranging in a tight band from −0.9% to 0.6%. The only models which improved predictive accuracy were the deep learning (0.3%) and gradient boosted trees (0.6%) algorithms, suggesting an upper limit to the predictive accuracy for this dataset. For fossil fuel preference predictability, the top influential factors of preferences toward nuclear and renewables remain relevant, along with age and technology knowledge. For nuclear, preferences toward renewable energy deployment remained overwhelmingly the top response. Finally, for renewable energy, attitudes toward nuclear and fossil fuel deployment remain influential, along with technology knowledge.
For energy system preference predictability, an increase in accuracy was again observed for all factors, ranging from 5.1 to 7.5% for environmental protection, 11.2 to 21% for climate change response and 8.4 to 12.7% for social equity. Within energy system preferences, the combined survey samples identified that for environmental protection, preferences toward the energy system (resource preservation, energy cost) and technology knowledge and energy mix preferences toward solar were important, consistent with single sample findings. For climate change response, preserving limited resources and thus ensuring energy for the future was most influential toward predictability, along with technology and policy knowledge. For social equity, preferences toward the future energy system were the most important, particularly regarding energy availability and cost.
Generally speaking, a higher number of samples improves the model’s overall prediction accuracy, with a maximum accuracy of 78.2% being achieved for environmental protection preferences; however, it should be noted that before combining samples, a maximum predictive accuracy of 78.1% had been achieved for social equity, suggesting that an upper limit on predictability for energy system preferences is being reached. The results also suggest that the ratio of sample increase may be linked to the increase in model predictive accuracy.
5. Discussion and Implications
Previous research has explored how people’s preferences can influence energy system design; for example, in Vietnam, a site selection tool for the future deployment of solar and other renewables was developed based on experts’ opinions, knowledge and judgements [47]. In the US, factors such as ethnicity, region and lived experience were found to be influential on the bottom-up-based development of a desirable energy mix [43]. People’s aversion to risk-taking, curiosity, environmental awareness and overall activeness toward energy system participation were found to be influential on future energy provider selection in Japan [42]. In Spain, householders’ propensity toward improving their homes’ insulation was found to be linked to demographic factors such as age and income level, as well as the heating technologies employed in their homes [48]. Likewise, in India, socio-demographic factors such as age, gender, educational achievement, job type and vehicle ownership influence people’s propensity toward investing in solar PV or car-charging infrastructure [49]. In a multi-nation study across Romania, Hungary and Serbia, it was identified that people’s preference toward participation in demand response initiatives is largely contingent on a perceived increase in renewable energy deployment as a result, reducing carbon emissions and global warming, and with the aim of reducing energy bills [50]. In light of this, recent research has suggested that both end-users and experts can influence the future energy system through their preferences, behavior and perceived benefits; this research offers a methodology for the rapid acquisition of these preferences through previously conducted national surveys (the method used in the majority of these studies), utilizing machine learning to uncover critical preferences, knowledge and demographics toward the prediction of target variables.
Examining the influence weightings of energy system design preferences toward fossil, nuclear and renewables in the future energy mix, we found that preferences related to generation technologies, followed by policy factors, were the most important. Demographics were influential at a much lower level, and only consistently for age, toward renewables and fossil fuels. The only influential behavior was the ownership of an FCV toward fossil and nuclear energy at a similar level to the influence of demographic factors. These findings have some commonality with those of previous research, specifically for the socio-demographic factors of vehicle ownership and age. For energy system preferences toward environmental protection, climate change response and social equity, preferences toward technologies and lifestyle factors were most influential, significantly more so than demographics or knowledge levels. Although demographics appear to play a small role toward people’s preferences with regard to energy systems, they were not as influential as hoped, as these kinds of data can be elicited from other sources, including census records or consumer survey data.
In terms of machine learning algorithms utilized, gradient boosted trees showed excellent performance overall for training and scoring times, in addition to being consistently the most accurate. This algorithm has proven to be popular in social science research and has multiple applications toward classification, including human behavior-related factors [51,52]. On the other hand, the SVM algorithm, although intensive in terms of computing resources and solution times, was not superior in our classification models; however, as has been noted in previous sentiment-based research, it generally offers accuracy advantages over the less intensive, simpler naïve Bayes classification algorithm [53].
The availability of additional data appears to improve overall accuracy, as shown by the combination of survey samples, and, although not conclusive with the limited investigation offered in this research, it appears to scale somewhat according to overall sample sizes in line with expectations [54]. Overall, irrespective of the size of the samples processed, ranging from 4148 to 9000 in our study, the prediction accuracy never exceeds 79.8%. Although this accuracy is sufficient for our purposes, further investigation is required to uncover related factors that may increase accuracy in the future. Machine learning has proven to be accessible and useful for the prediction of influential factors of energy system preferences and design; however, some factors have proven to be more difficult to predict than others. In this study, for energy system design factors, opinions toward renewable energy proved easier to predict than those toward fossil fuels and nuclear power. This may be due to an overwhelming preference toward the increased deployment of these types of energy. For energy system preferences, environmental protection opinions were the most accurately predicted. In this case, the ease of understanding of climate change and social equity concepts may play a role, and the relationship between other demographic factors (education, age, income, etc.) and these concepts may require further investigation.
Although some work has been undertaken regarding energy and eco-efficiency utilizing machine learning, this study is unique in that it focuses specifically on energy system preferences and underpinning demographics to aid in energy system design. Through the provision of a methodology for the rapid acquisition of preferences that can be utilized for energy system design, as well as appropriate weightings of energy system technology deployment preferences (i.e., stakeholder desires for future deployment of fossil fuels, renewable and nuclear), survey design can be improved toward deriving sustainable, desirable energy systems.
The findings of this research provide a framework for the investigation of the linkage between people’s preferences, knowledge, demographics and behaviors and desirable energy system design based on survey data over a period of time and in two discrete jurisdictions. This approach could be generalized to other research areas, and could be utilized in future survey design, streamlining survey design processes and allowing for a targeted selection of questions, reducing survey-related burdens.
6. Conclusions
In the age of big data and increased data availability, the operations described in this research are likely useful toward reducing a significant portion of menial work involved in survey preparation, analysis and application. In addition, it is hoped that cost-prohibitive portions of survey design and implementation, including workshops and question-testing, can be streamlined, and that the predictive ability of surveys can be improved by utilizing machine learning in research workflows.
A lower number of questions ultimately results in less demanding surveys, and in increased response rates and sample sizes achievable from available budgets, while reducing respondent fatigue. Considering that machine learning is relatively accessible and, given our finding that resource-intensive algorithms such as SVM do not necessarily offer predictability advantages, less resource-intensive algorithms that can be comfortably deployed on mid-range hardware have been identified as sufficient and suitable to the task. Among these algorithms, gradient boosted trees showed excellent performance in predicting influential factors of energy system preferences and design, and offered accuracy advantages over simpler algorithms such as naïve Bayes classification. Within predictions, people’s preferences related to generation technologies and policy factors were identified as the most important factors influencing energy system design preferences, followed by demographic factors.
Furthermore, by employing a human-guided artificial intelligence-based workflow, we do not seek to automate our research processes; instead, we ensure human oversight and the provision of expertise, in line with the concept of making artificial intelligence engagement trustworthy [55]. This research identifies insights toward the rapid, fit-for-purpose deployment of surveys and their utilization toward energy system design.
Our proposed methodology has identified key factors to be investigated in future surveys, and thus meets our stated goals of reducing survey-related burdens for both researchers and respondents alike. With regard to future work, investigating whether publicly available data, including big data and open access databases, can be leveraged to improve predictive ability would be a useful endeavor. Ideally, the design of desirable energy systems can be a positive outcome based on this work, leading to the meeting of energy system goals at a national level while deriving desirable and equitable outcomes for individuals.
The findings of this research provide a framework for investigating the linkages between people’s preferences, knowledge, demographics, behaviors, and desirable energy system design, with potential applications in survey design and targeted question selection to streamline processes and reduce survey-related burdens, which is broadly applicable to a number of research fields.