Comparison of Functionality and Evaluation of Results in Different Prediction Models

Kazolis, Dimitrios; Fotakis, Christos Dionyshs; Tramantzas, Konstantinos

doi:10.3390/engproc2024070031

Open Access Proceeding Paper

by Dimitrios Kazolis *, Christos Dionyshs Fotakis and Konstantinos Tramantzas

Department of Physics, Democritus University of Thrace, 65404 Kavala, Greece

* Author to whom correspondence should be addressed.

Presented at the International Conference on Electronics, Engineering Physics and Earth Science (EEPES’24), Kavala, Greece, 19–21 June 2024.

Eng. Proc. 2024, 70(1), 31; https://doi.org/10.3390/engproc2024070031

Published: 7 August 2024

(This article belongs to the Proceedings of International Conference on Electronics, Engineering Physics and Earth Science (EEPES 2024))

Abstract

This article represents a further step in the continuously developing process of improving the prediction capabilities using databases. Its aim is to compare and evaluate the operation, performance and validity of knowledge extraction techniques related to prediction. The innovative part of this study concerns the selection, enrichment and processing of the database used. In particular, the database contains consumption data for an entire city over the course of a year. These data were then enriched with elements concerning the determination of the time and the environmental conditions, in order to take into consideration the correlation of the data with these parameters. Subsequently, after being converted into an editable format, they were processed using techniques such as normalization and factor analysis, which finally led to the prediction process. At this stage, different methods, such as decision trees, deep learning and generalized linear models, were applied and thoroughly analyzed, and both their operation and their effectiveness were compared and evaluated. The present effort, therefore, intends to provide a useful tool that will contribute to future efforts to improve predictions from existing data.

Keywords:

data mining; prediction models; evaluation; decision trees; generalized linear model; deep learning method; data correlation; normalization; factor analysis

1. Introduction

The implementation of databases is, nowadays, an integral part of information technology.

Their growing relevance stems from the fact that they serve not only to archive data but also to extract knowledge from them, in the form of understandable correlations [1,2].

Moreover, technological developments have made it possible to derive predictions from the analysis of recorded data [3]. As a result, effective data mining techniques are now essential, given the complexity of data patterns and the growing importance of precise forecasts. These techniques take different approaches to predicting the future as accurately as possible, and each works most effectively under certain conditions [4,5].

Thus, the present study aims to evaluate these processes by applying the different prediction methodologies to a specific type of database. This database consists of numeric data and reflects the electricity consumption in the wider region of the city of Kavala, Greece. In attempting to forecast trends in electricity consumption, it was found that different data mining methods can lead to slightly different numerical predictions. Therefore, in order to increase the accuracy of the forecasts and the understanding of the elements influencing consumption, the database was enriched with additional data relating to time and environmental conditions. Finally, through experimentation and performance evaluation, the advantages and disadvantages of each method in terms of prediction are highlighted.

The results of this study indicate which of the applied methods are most appropriate for specific types of databases. Furthermore, the knowledge gathered opens the opportunity for the creation of more precise prediction models, which can provide assistance to and have a significant impact on decision-making processes.

2. Materials and Methods

2.1. Data Processing

As mentioned above, the initial data were obtained from the public electricity company of Greece and covered the consumption of the city of Kavala over two years, 2022 and 2023. The data were the loads, in amperes, of twelve transformers in the wider area of the city, recorded at half-hourly intervals. To allow the prediction results to be compared, the first archive, that of 2022, was used to build the forecasting models, while the second, that of 2023, was used to evaluate their results.

After both files were examined and cleared of missing or incorrect data [6], the 2023 file was left unchanged for use in the final evaluation. The 2022 archive, however, was enriched with additional data relating the consumption to the factors that influence it [7]. Columns for the days of the week were added alongside the half-hourly consumption for the whole of 2022. The daily temperatures were recorded as maximum, minimum and average values. The average humidity, the monthly rainfall, the rainy days, the wind intensity, the barometric pressure, whether the day was sunny or not and the monthly amount of solar energy measured with a photovoltaic panel were also recorded. The remaining values were obtained from the official website of the Hellenic National Meteorological Service [8]. Data that were not in numeric form were transformed by substitution. For example, the days of the week were represented by seven corresponding columns: each column took the value 1 only for the rows that fell on its day and 0 otherwise. Finally, both archives were normalized in order to place them on the same scale.
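The substitution and normalization steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the load values are invented, and min-max scaling is assumed because the text does not say which normalization was applied.

```python
from datetime import datetime, timedelta


def half_hour_index(year):
    """Return every half-hourly timestamp of the given year."""
    t = datetime(year, 1, 1)
    step = timedelta(minutes=30)
    stamps = []
    while t.year == year:
        stamps.append(t)
        t += step
    return stamps


def one_hot_weekday(stamp):
    """Seven 0/1 columns, Monday first; only the matching day holds 1."""
    return [1 if stamp.weekday() == d else 0 for d in range(7)]


def min_max(values):
    """Rescale values to [0, 1] (assumed normalization); a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# One row per half hour of 2022, each with its seven day-of-week columns.
stamps = half_hour_index(2022)
weekday_columns = [one_hot_weekday(s) for s in stamps]

# Hypothetical load readings in amperes, for illustration only.
loads = [410.0, 455.0, 620.0, 580.0]
scaled = min_max(loads)
```

A non-leap year sampled every half hour gives 365 × 48 = 17,520 rows, which matches the row count reported for the 2022 archive.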

After the completion of the above processes, the two files were ready for the continuation of the procedure.

2.2. Factor Analysis

As described previously, two archives were created. The enriched 2022 archive contained, in its final form, 32 columns and 17,520 rows. The columns corresponded to variables such as consumption, temperatures, etc., and the rows to the half-hourly subdivision of the whole year. This led to a total of 543,151 records. Therefore, in order to reduce the amount of data to be examined and to make the subsequent methods practical to apply, factor analysis was used. With this method, the data contained in 31 columns were replaced by a small number of factors. To determine this number, the Kaiser criterion and a scree plot were used. As the scree plot in Figure 1 shows, the number of factors is 3.
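The Kaiser criterion keeps the factors whose eigenvalue exceeds 1. A minimal sketch follows, assuming the eigenvalues of the correlation matrix are already available; the values used here are hypothetical and are not those of the study, and the helper names are illustrative.

```python
def kaiser_count(eigenvalues):
    """Number of factors whose eigenvalue exceeds 1 (Kaiser criterion)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def explained_share(eigenvalues, k):
    """Fraction of the total variance carried by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical eigenvalues, NOT taken from the study's scree plot.
eigenvalues = [9.8, 4.1, 2.3, 0.9, 0.7, 0.4]
n_factors = kaiser_count(eigenvalues)
share = explained_share(eigenvalues, n_factors)
```

On a scree plot the same decision appears as the point where the curve of sorted eigenvalues drops below 1 and flattens.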

Additionally, it should be mentioned that, for factor rotation, Varimax raw was used, with principal components for the extraction [9,10].

Following the completion of the above methodology, the set of data in the initial file was replaced by three factors, resulting in a significant reduction in the amount of data in the original file. Only the column containing the total consumption was preserved, and the prediction procedures that follow were applied to it.

2.3. Prediction Process

With the files processed, the three chosen forecasting methodologies, i.e., decision trees, deep learning and a generalized linear model, are applied in sequence in this section. Five methodologies were initially considered: random forest and gradient boosted trees in addition to the three above. In the end, to limit the duration of the task, only the three with the best forecasting performance were selected for detailed analysis.

The RapidMiner software was used for the implementation of the procedure. Figure 2 below depicts the column selected for the final prediction, which is the sum of the electrical consumption, by half-hour intervals, for the year 2022.

The consumption column was then predicted from the three factors created in the previous process. As a result, a general graph of the performance of all methods used for the prediction was obtained. The performance was measured by the runtime (in ms) and the relative error of each method individually. This includes the model’s prediction accuracy and other performance criteria, depending on the type of prediction problem. The performance was calculated on a 40% hold-out set, which had not been used for any of the performed model optimizations. This hold-out set was then used as input for a multi-hold-out-set validation, in which the performance was calculated for 7 disjoint subsets. The largest and the smallest performance values were removed, and the average of the remaining five is reported here. Although this type of validation is not as thorough as full cross-validation, it strikes a good balance between runtime and validation quality. Some examples are illustrated in Figure 3.
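The multi-hold-out-set validation can be illustrated with a short sketch. It assumes that the hold-out set is a list of (prediction, actual) pairs and that the seven subsets are formed round-robin; how the software actually partitions the hold-out set is not stated in the text, and all names and data below are hypothetical.

```python
def split_disjoint(items, n_parts):
    """Deal items round-robin into n_parts disjoint subsets (assumed scheme)."""
    return [items[i::n_parts] for i in range(n_parts)]


def relative_error(pairs):
    """Mean of |prediction - actual| / |actual| over (prediction, actual) pairs."""
    return sum(abs(p - a) / abs(a) for p, a in pairs) / len(pairs)


def trimmed_mean_performance(holdout_pairs, n_parts=7):
    """Score each subset, drop the largest and smallest score, average the rest."""
    scores = sorted(relative_error(part)
                    for part in split_disjoint(holdout_pairs, n_parts))
    kept = scores[1:-1]
    return sum(kept) / len(kept)


# Hypothetical hold-out data: every prediction is 5% above the actual value.
actuals = [600.0 + i for i in range(70)]
holdout = [(a * 1.05, a) for a in actuals]
reported = trimmed_mean_performance(holdout)
```

Dropping the two extreme subsets makes the reported figure less sensitive to one unusually easy or unusually hard slice of the hold-out data.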

For the reasons mentioned above, only the generalized linear model, deep learning and decision tree methods were selected for the comparison in this study, because they had relatively close performance metrics. The analysis of the results of these methods follows below.

3. Results

In this section, the predictions of the methods chosen will be described. These results will then be compared with the actual consumption that occurred in the following year, 2023. At this point, therefore, the second archive will be used, i.e., the 2023 consumption data, in order to evaluate the results.

3.1. Generalized Linear Model

The first method analyzed is the generalized linear model. The numerical model used in the method is presented in Table 1.
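To show how such a numerical model turns factor scores into a consumption value, the following sketch evaluates a linear predictor. It rests on two assumptions that the text does not confirm: that the first row of Table 1 holds the fitted coefficients, and that the link function is the identity. The function name is hypothetical.

```python
# First row of Table 1; treating it as the fitted coefficients is an assumption.
COEFFICIENTS = (13.05454258043458, 227.56971119220478, -11.952557035061462)
INTERCEPT = 728.2974408976228


def glm_predict(factors):
    """Intercept plus the weighted sum of the three factor scores (identity link assumed)."""
    return INTERCEPT + sum(c * f for c, f in zip(COEFFICIENTS, factors))


# With all factor scores at zero the prediction equals the intercept.
baseline = glm_predict((0.0, 0.0, 0.0))
```

Under these assumptions, a one-unit increase in Factor 2 changes the prediction far more than the same increase in Factor 1 or Factor 3, since its coefficient is the largest in magnitude.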

The relative error of this method is calculated to be roughly 7%, and the runtime is nearly 0 due to the simplicity of the model. Furthermore, the graph of the generalized linear method with all of the prediction values of the consumption is depicted below in Figure 4.

The main result of this method is that the value of the average electrical consumption predicted for the year 2023 is 627.8678. In comparison with the actual value of 629, which was derived from the actual electricity consumption file of 2023, the percentage error was only 0.18%.
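The percentage error quoted above follows from the two averages. A minimal sketch, assuming the error is taken relative to the actual value (the function name is illustrative):

```python
def percentage_error(predicted, actual):
    """Absolute difference between prediction and actual, as a percentage of the actual value."""
    return abs(predicted - actual) / abs(actual) * 100.0


# Averages reported in the text for the generalized linear model and for 2023.
glm_error = percentage_error(627.8678, 629.0)
```

Here the absolute difference is 629 − 627.8678 = 1.1322, and 1.1322 / 629 × 100 gives the reported 0.18%.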

The chart of the final prediction of this approach, compared with the actual data and the initial data, can be seen in Figure 5.

3.2. Decision Tree

The second prediction methodology examined is the decision tree method. This non-parametric algorithm can deal efficiently with large, complex data sets. Furthermore, it is widely used both in data mining, to create classification systems, and in the development of prediction algorithms for a target variable, as in our case.

The decision tree classifies data into branch-like blocks and creates an inverted tree-like structure, part of which is shown in Figure 6.
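The way a tree divides numeric data into blocks can be illustrated with a single split. The sketch below searches one feature for the threshold that minimizes the total squared error of the two resulting blocks. The least-squares criterion, the function names and the data are assumptions for illustration; the split criterion used by the software in the study is not stated.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean of the block."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest total squared error,
    or None when all feature values are equal and no split exists."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, mean(left), mean(right))
    return None if best is None else best[1:]


# Hypothetical factor scores and consumption values, for illustration only.
factor = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
consumption = [500.0, 510.0, 505.0, 700.0, 710.0, 705.0]
split = best_split(factor, consumption)
```

A full tree repeats this search inside each block until a stopping rule is met, and each leaf predicts the mean of the target values it contains.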

Moreover, the chart with all of the prediction values of the decision tree method can be seen in the following Figure 7.

By comparing, in a similar way, the results of this approach with the actual data available, it is concluded that this method is not so effective in this particular forecasting process.

In particular, 628.4339 is the average value predicted with the decision tree method, which indicates a 0.5661% difference from the actual target value of 629. The results are illustrated in the following Figure 8.

3.3. Deep Learning

Finally, the last method to be examined is the deep learning method. The model is displayed in Table 2.
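Two figures in Table 2 can be checked directly from the reported architecture and metrics. The sketch assumes a fully connected network with one bias per unit, which is consistent with the layer listing in the table; the names are illustrative.

```python
import math

# Units per layer as listed in Table 2: input, two Rectifier layers, linear output.
LAYER_UNITS = [3, 50, 50, 1]
REPORTED_MSE = 3069.163


def count_parameters(units):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(units, units[1:]))


n_parameters = count_parameters(LAYER_UNITS)

# RMSE is the square root of MSE.
rmse = math.sqrt(REPORTED_MSE)
```

The count works out as (3 × 50 + 50) + (50 × 50 + 50) + (50 × 1 + 1) = 2801, the number of weights and biases given in the table, and the square root of the reported MSE agrees with the reported RMSE of 55.40003.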

Moreover, the chart with all of the prediction values for the deep learning method is depicted in the following Figure 9.

While the previous methods investigated predicted a decrease in consumption with varying degrees of accuracy, the results of this method predict an increase to 637.0455. This of course contradicts reality, as the actual value is 629, and leads to an error of 1.2638%. These results are presented below in Figure 10.

4. Conclusions and Proposals

After the completion of the above procedures and the analysis of the results, it is clear that data mining processes are very important in creating valid prediction models. Applying different data mining techniques makes it possible to produce predictions, to compare how the techniques function and to identify the relative error in each case. In this particular implementation, all of the examined methods gave satisfactory results regarding the correctness of the prediction. The generalized linear model and the decision tree, with different degrees of accuracy, predicted a reduction in electricity consumption for the year 2023, whereas the deep learning method predicted an increase. The generalized linear model had the highest accuracy, with a percentage error of only 0.18%, while the deep learning method had the lowest, with an error of 1.26%.

Generally, all of the forecasts were very accurate, which can be explained by the way in which the database was constructed and processed. The positive contribution of enriching the database with additional relevant data is evident, as is that of using factor analysis to reduce the size of the database. Overall, knowledge extraction from large data sets is vital in every aspect of science, technology and economics, because analyzing patterns in the data makes it possible to prepare for different potential outcomes, depending on the conditions and parameters taken into account.

As a complement to the present study, the same methods of extracting predictions could be used in different types and sizes of databases. It would also be possible to repeat the whole procedure without using statistical methods such as factor analysis and normalization, in order to establish the ways in which these methodologies contribute to the forecasting processes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/engproc2024070031/s1, Data.

Author Contributions

Conceptualization, D.K. and C.D.F.; methodology, D.K.; software, C.D.F.; validation, D.K., K.T. and C.D.F.; formal analysis, D.K. and C.D.F.; investigation, D.K., K.T. and C.D.F.; data curation, K.T.; writing—original draft preparation, K.T. and C.D.F.; writing—review and editing, D.K., K.T. and C.D.F.; supervision, D.K.; project administration, D.K., K.T. and C.D.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, C.; Ma, X.; Chen, J. Information extraction and knowledge graph construction from geoscience literature. Comput. Geosci. 2018, 112, 112–120.
2. Kazolis, D.; Kogias, P.; Roumeliotis, N. Energy cluster analysis based on consumption data in different weather condition. E3S Web Conf. 2023, 404, 01005.
3. Stoyanov, I.S.; Iliev, T.B.; Mihaylov, G.Y.; Evstatiev, B.I.; Sokolov, S.A. Analysis of the Cybersecurity Threats in Smart Grid. In Proceedings of the IEEE 24th International Symposium for Design and Technology in Electronic Packaging, Iasi, Romania, 25–28 October 2018; pp. 90–93.
4. Singh, S.; Yassine, A. Big data mining of energy time series for behavioral analytics and energy consumption forecasting. Energies 2018, 11, 452.
5. Saaty, T. Decision making with the analytic hierarchy process. Int. J. Serv. Sci. 2008, 1, 83.
6. Vlahavas, I.; Kefalas, I.; Bassiliades, P.; Kokkoras, N.; Sakellariou, F. Artificial Intelligence, 4th ed.; University of Macedonia Press: Thessaloniki, Greece, 2020.
7. Kazolis, D.; Gerontidis, I. Knowledge Mining from Student Data. In Proceedings of the Conference in Telecommunications, Informatics, Energy and Management (TIEM 2019), Kavala, Greece, 12–14 September 2019; pp. 299–302.
8. Hellenic National Meteorological Service. Available online: http://www.emy.gr/emy/el/climatology/climatology_city?perifereia=East%20Macedonia%20and%20Thrace&poli=Kavala_Chryssoupoli (accessed on 5 April 2024).
9. Horn, J. A rationale and test for the number of factors in factor analysis. Psychometrika 1965, 30, 179–185.
10. Yong, A.; Pearce, S. A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutor. Quant. Methods Psychol. 2013, 9, 79–94.

Figure 1. Scree plot.

Figure 2. Final prediction column.

Figure 3. General graph of the performance of each method.

Figure 4. Graph of the generalized linear method with the prediction values.

Figure 5. Generalized linear model final prediction.

Figure 6. Part of the decision tree model.

Figure 7. Graph of the decision tree method with the prediction values.

Figure 8. Decision tree final prediction.

Figure 9. Graph of the deep learning method with the prediction values.

Figure 10. Deep learning final prediction.

Table 1. The numerical model of the generalized linear model.

Factor 1	Factor 2	Factor 3	Intercept
13.05454258043458	227.56971119220478	−11.952557035061462	728.2974408976228
13.122805792387307	228.70571266678215	−11.858938707365871	727.6560121765601

Table 2. The deep learning model.

Model Metric Type: Regression
Description: Metrics reported on temporary training frame with 10,014 samples
Model ID: rm-h2o-model-model-3
Frame ID: rm-h2o-frame-model-3.temporary.sample.95.13%
MSE: 3069.163
RMSE: 55.40003
R²: 0.946572
Mean residual deviance: 3069.163
Mean absolute error: 39.31098
Root mean squared log error: 0.3759962
Status of Neuron Layers (predicting sum, regression, Gaussian distribution, quadratic loss, 2801 weights/biases, 37.5 KB, 105,120 training samples, mini-batch size 1):
Layer	Units	Type	Dropout	L1	L2	Mean Rate	Rate RMS	Momentum	Mean Weight	Weight RMS	Mean Bias	Bias RMS
1	3	Input	0.00%
2	50	Rectifier	0	0.000010	0.000000	0.009010	0.008097	0.000000	0.018053	0.180657	0.293302	0.158455
3	50	Rectifier	0	0.000010	0.000000	0.250662	0.316809	0.000000	−0.039544	0.154008	0.862177	0.103999
4	1	Linear		0.000010	0.000000	0.005598	0.006922	0.000000	0.009856	0.141613	−0.023132	0.000000
Scoring History:
Timestamp	Duration	Training Speed	Epochs	Iterations	Samples	Training RMSE	Training Deviance	Training MAE	Training r2
2024-04-27 19:47:20	0.000 s		0.00000	0	0.000000	NaN	NaN	NaN	NaN
2024-04-27 19:47:21	0.740 s	17261 obs/s	1.00000	1	10512.000000	60.61666	3674.37925	42.40557	0.93604
2024-04-27 19:47:22	1.260 s	19164 obs/s	2.00000	2	21024.000000	65.79757	4329.32059	56.35478	0.92464
2024-04-27 19:47:22	1.746 s	20491 obs/s	3.00000	3	31536.000000	57.05335	3255.08441	39.49963	0.94334
2024-04-27 19:47:22	2.175 s	21674 obs/s	4.00000	4	42048.000000	59.01743	3483.05675	38.97498	0.93937
2024-04-27 19:47:23	2.566 s	22822 obs/s	5.00000	5	52560.000000	56.47749	3189.70682	38.95827	0.94447
2024-04-27 19:47:23	2.931 s	23899 obs/s	6.00000	6	63072.000000	59.50469	3540.80758	47.82509	0.93836
2024-04-27 19:47:24	3.306 s	24626 obs/s	7.00000	7	73584.000000	58.24340	3392.29414	46.47275	0.94095
2024-04-27 19:47:24	3.643 s	25499 obs/s	8.00000	8	84096.000000	55.40003	3069.16301	39.31098	0.94657
2024-04-27 19:47:24	3.950 s	26434 obs/s	9.00000	9	94608.000000	60.44436	3653.52103	49.86756	0.93640
2024-04-27 19:47:25	4.246 s	27310 obs/s	10.00000	10	105120.000000	60.07049	3608.46377	50.90866	0.93718
2024-04-27 19:47:25	4.276 s	27296 obs/s	10.00000	10	105120.000000	55.40003	3069.16301	39.31098	0.94657

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
