1. Introduction
The implementation of databases is, nowadays, an integral part of information technology.
Their growing relevance stems from the fact that they serve not only to archive data but also to extract knowledge from them, in the form of understandable correlations [1,2].
Moreover, technological developments have made it possible to produce predictions based on the analysis of recorded data [3]. As a result, effective data mining techniques are now essential, owing to the complexity of data patterns and the growing importance of precise forecasts. These techniques take different approaches to predicting the future as accurately as possible, and each of them works most effectively under certain conditions [4,5].
Thus, the present study aims at the evaluation of these processes. This is performed by applying these different prediction methodologies to a specific type of database. This base consists of data in numeric form and reflects the electricity consumption in the wider region of the city of Kavala in Greece. In attempting to forecast trends in electricity consumption, it was found that different methods of data extraction can lead to slightly different numerical prediction results. Thus, in order to increase the accuracy of the forecasts and the understanding of the elements influencing the consumption, the database was enriched with additional data relating to time and environmental conditions. Finally, through the use of experimentation and performance evaluations, the advantages and disadvantages of each method in terms of prediction are highlighted.
The results of this study indicate which of the applied methods are most appropriate for specific types of databases. Furthermore, the knowledge gathered opens the opportunity for the creation of more precise prediction models, which can provide assistance to and have a significant impact on decision-making processes.
2. Materials and Methods
2.1. Data Processing
As mentioned above, the initial data were obtained from the public electricity company of Greece and covered the consumption of the city of Kavala in the years 2022 and 2023. These data were the loads, in amperes, of twelve transformers in the wider area of the city, recorded at half-hourly intervals over the two years. To allow the prediction results to be compared, it was decided that the first archive, that of 2022, would be used for the implementation of the forecasting techniques, while the other, that of 2023, would be used to evaluate the results of these methods.
Thus, after both files were examined and cleared of missing or incorrect data [6], the 2023 file was left as it was so that it could be used in the final evaluation of the results. The 2022 archive, however, was enriched with additional data on the factors that influence consumption [7]. First, columns relating to the days of the week were added to the half-hourly consumption for the whole of 2022. In addition, the daily temperatures were recorded as maximum, minimum and average values. The average humidity, the monthly rainfall, the number of rainy days, the intensity of the wind, the barometric pressure, whether the particular day was sunny or not and, finally, the monthly amount of solar energy produced by a photovoltaic panel were also recorded. The remaining values were obtained from the official website of the Hellenic National Meteorological Service [8]. It should be pointed out that data that were not in numeric form were transformed by substitution. For example, the days of the week were represented in seven corresponding columns: each column took the value 1 only for the half-hours that fell on its day, and the value 0 otherwise. Finally, both archives were subjected to a normalization process with the aim of placing them on the same scale.
After the completion of the above processes, the two files were ready for the continuation of the procedure.
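The substitution and normalization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the study does not say which normalization was applied, so min-max scaling to [0, 1] is assumed here, and the sample readings, the function names `one_hot_weekday` and `min_max_normalize` and the column order are hypothetical.

```python
from datetime import datetime


def one_hot_weekday(timestamp):
    """Return seven 0/1 values (Monday first); the 1 marks the reading's weekday."""
    return [1 if timestamp.weekday() == i else 0 for i in range(7)]


def min_max_normalize(values):
    """Rescale a column to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical half-hourly readings: (timestamp, load in amperes).
readings = [
    (datetime(2022, 1, 1, 0, 0), 410.0),   # Saturday
    (datetime(2022, 1, 1, 0, 30), 395.0),  # Saturday
    (datetime(2022, 1, 2, 0, 0), 380.0),   # Sunday
    (datetime(2022, 1, 3, 0, 0), 440.0),   # Monday
]

# Normalize the load column (assumed min-max scaling).
scaled_loads = min_max_normalize([load for _, load in readings])

# Each output row: seven weekday indicator columns followed by the scaled load.
rows = [one_hot_weekday(ts) + [scaled]
        for (ts, _), scaled in zip(readings, scaled_loads)]

for row in rows:
    print(row)
```

Each row carries exactly one 1 among its seven weekday columns, which matches the substitution scheme described in the text; any other rescaling (for example, z-scores) could be swapped in for `min_max_normalize` without changing the rest of the sketch.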
2.2. Factor Analysis
As described previously, two archives were created. The one for the year 2022, which was enriched, contained, in its final form, 32 columns and 17,520 lines. The columns corresponded to variables such as consumption, temperatures, etc., and the lines to the temporal subdivision of the whole year by half hours. This led to a total of 543,151 records. Therefore, in order to reduce the volume of data to be examined, and to enable the methods to be applied subsequently, it was decided to use factor analysis. With this method, the data contained in 31 columns were replaced with a smaller number of factors. To determine this number, the Kaiser criterion and a scree plot were used. The result can be seen in Figure 1, which shows that the number of factors is 3.
Additionally, it should be mentioned that, for factor rotation, Varimax raw was used, with principal components for the extraction [9,10].
Following the completion of the above methodology, the set of data in the initial file was replaced by three single factors, resulting in a significant reduction in the number of data in the original file. Only the column containing the total consumption has been preserved, on which the following prediction procedures will be applied.
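As a rough illustration of the Kaiser criterion mentioned above, the sketch below counts the eigenvalues of a correlation matrix that exceed 1. It is not the procedure used in the study, which relied on a statistics package: the 4 x 4 correlation matrix is invented, and the names `jacobi_eigenvalues` and `kaiser_factor_count` are hypothetical. The eigenvalues are obtained with the classical Jacobi rotation method so that the example needs only the standard library.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        # Stop once the off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_factor_count(correlation_matrix):
    """Kaiser criterion: retain factors whose eigenvalue is greater than 1."""
    return sum(1 for ev in jacobi_eigenvalues(correlation_matrix) if ev > 1.0)


# Invented correlation matrix: two pairs of strongly related variables.
corr = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

print(jacobi_eigenvalues(corr))
print(kaiser_factor_count(corr))
```

For this toy matrix the eigenvalues are approximately 1.8, 1.6, 0.4 and 0.2, so two factors would be retained. A scree plot is simply these sorted eigenvalues plotted against their rank.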
2.3. Prediction Process
With the processing of the files complete, the three forecasting methodologies chosen, i.e., decision trees, deep learning and a generalized linear model, are applied in sequence in this section. It is worth noting that five methodologies were initially considered: random forest and gradient boosted trees, in addition to the three mentioned above. In the end, however, to limit the length of this work, only the three with the best forecasting performance were chosen to be analyzed in detail.
The RapidMiner software was used for the implementation of the procedure.
Figure 2 below depicts the column selected for the final prediction, which is the sum of the electrical consumption, by half-hour intervals, for the year 2022.
The consumption column was then compared with the three factors created from the previous process. As a result, a general graph with the performance of all methods used for the prediction was provided. The performance was measured by the runtime (in ms) and the relative error of each method individually. This includes the model's prediction accuracy and other performance criteria, depending on the type of problem. The performance was calculated on a 40% hold-out set, which had not been used for any of the performed model optimizations. This hold-out set was then used as input for a multi-hold-out-set validation, in which the performance was calculated for 7 disjoint subsets. The lowest and the highest performance were removed, and the average of the remaining five is reported here. Although this type of validation is not as thorough as full cross-validation, it strikes a good balance between runtime and validation quality. Some examples are illustrated in Figure 3.
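The multi-hold-out-set validation described above can be sketched as follows: the hold-out set is divided into 7 disjoint subsets, a relative error is computed on each, the two extreme scores are dropped and the remaining five are averaged. How the software actually forms the subsets is not stated in the text, so the strided split below is an assumption, and all function names and sample scores are hypothetical.

```python
def split_disjoint(items, k=7):
    """Split a list into k disjoint subsets (assumed strided assignment)."""
    return [items[i::k] for i in range(k)]


def relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual| over paired values."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def trimmed_performance(scores):
    """Drop the single lowest and highest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both extremes")
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)


# Hypothetical relative errors measured on the 7 hold-out subsets.
subset_errors = [0.07, 0.05, 0.09, 0.06, 0.08, 0.20, 0.01]

print(trimmed_performance(subset_errors))
```

With these invented scores, the values 0.01 and 0.20 are discarded and the reported figure is the mean of the remaining five. Trimming in this way makes the reported figure less sensitive to a single unusually easy or unusually hard subset.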
For the reasons mentioned above, in this study, only the generalized linear model, deep learning and decision tree methods were selected for the comparison, due to the fact that they had relatively close performance metrics. The analysis of the results of these methods follows below.
3. Results
In this section, the predictions of the methods chosen will be described. These results will then be compared with the actual consumption that occurred in the following year, 2023. At this point, therefore, the second archive will be used, i.e., the 2023 consumption data, in order to evaluate the results.
3.1. Generalized Linear Model
The first method analyzed is the generalized linear model. The numerical model used in the method is presented in Table 1.
The relative error of this method is calculated to be roughly 7%, and the runtime is close to 0 ms owing to the simplicity of the model. Furthermore, the graph of the generalized linear model with all of the predicted consumption values is depicted in Figure 4.
The main result of this method is that the value of the average electrical consumption predicted for the year 2023 is 627.8678. In comparison with the actual value of 629, which was derived from the actual electricity consumption file of 2023, the percentage error was only 0.18%.
The chart of the final prediction of this approach, compared with the actual data and the initial data, can be seen in Figure 5.
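The percentage error quoted above follows directly from the two averages. A minimal sketch of the arithmetic is given below; the function name `percentage_error` is hypothetical, and the error is taken relative to the actual value, which is an assumption about how the figure was computed.

```python
def percentage_error(predicted, actual):
    """Absolute difference between forecast and actual, as a percentage of actual."""
    return abs(predicted - actual) / abs(actual) * 100.0


# Average consumption for 2023: forecast from the model vs. the recorded value.
glm_forecast = 627.8678
actual_2023 = 629.0

print(round(percentage_error(glm_forecast, actual_2023), 2))  # 0.18
```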
3.2. Decision Tree
The second prediction methodology that will be examined is the decision tree method. This non-parametric algorithm can efficiently deal with large, complex data sets. Furthermore, this methodology is widely used for both data mining, to create classification systems, and also for the development of prediction algorithms for a target variable, as in our case.
The decision tree classifies data into branch-like blocks and creates an inverted tree-like structure, part of which is shown in Figure 6.
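To illustrate how such a tree divides the data, the sketch below finds a single split on one input that minimizes the squared error of the two resulting branches, which is the basic step that a regression tree repeats recursively. It is not the configuration used in the study: the factor scores and loads are invented, and the names `best_split` and `predict` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean of the branch."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on xs that minimizes the total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical inputs
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best["cost"]:
            best = {
                "cost": cost,
                "threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
                "left_value": mean(left),    # prediction when x <= threshold
                "right_value": mean(right),  # prediction when x > threshold
            }
    if best is None:
        raise ValueError("all inputs are identical; no split is possible")
    return best


def predict(split, x):
    """Follow the single branch that x falls into."""
    return split["left_value"] if x <= split["threshold"] else split["right_value"]


# Invented data: one factor score per reading and the matching total load.
factor = [-1.2, -0.8, -0.5, 0.4, 0.9, 1.3]
load = [590.0, 600.0, 595.0, 650.0, 660.0, 655.0]

split = best_split(factor, load)
print(split["threshold"], predict(split, 0.7))
```

A full tree applies `best_split` again inside each branch, over all available inputs, until a stopping rule such as a maximum depth is reached.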
Moreover, the chart with all of the prediction values of the decision tree method can be seen in Figure 7.
By comparing, in a similar way, the results of this approach with the actual data available, it is concluded that this method is not so effective in this particular forecasting process.
In particular, 628.4339 is the average value predicted with the decision tree method, which is an absolute difference of 0.5661 from the actual target value of 629. The results are illustrated in Figure 8.
3.3. Deep Learning
Finally, the last method to be examined is the deep learning method. The model is displayed in Table 2.
Moreover, the chart with all of the prediction values for the deep learning method is depicted in Figure 9.
While the previous methods predicted a decrease in consumption with varying degrees of accuracy, the results of this method predict an increase to 637.0455. This contradicts reality, as the actual value is 629, and leads to an error of 1.2638%. These results are presented in Figure 10.
4. Conclusions and Proposals
After the completion of the above procedures and the analysis of the results, it is clear that data mining processes are very important in creating valid prediction models. Applying different data mining techniques makes it possible to produce predictions, to compare how the techniques perform and to identify the relative error in each case. In this particular implementation, all of the examined methods gave satisfactory results in terms of prediction accuracy. The generalized linear model and decision tree methods predicted the reduction in electricity consumption for the year 2023, whereas the deep learning method predicted an increase. It should be noted that the generalized linear model method had a percentage error of only 0.18%, while the deep learning method had the largest error, at 1.26%.
Generally, all of the forecasts were very accurate, which can be explained by the way in which the database was constructed and processed. The positive contribution of enriching the database with additional relevant data is evident, as is that of using factor analysis to reduce the size of the database. Overall, knowledge extraction from large data sets is vital in every aspect of science, technology and economics, because it makes it possible to prepare for different potential outcomes, depending on the conditions and parameters taken into account each time patterns in the data are analyzed.
As a complement to the present study, the same methods of extracting predictions could be used in different types and sizes of databases. It would also be possible to repeat the whole procedure without using statistical methods such as factor analysis and normalization, in order to establish the ways in which these methodologies contribute to the forecasting processes.
Author Contributions
Conceptualization, D.K. and C.D.F.; methodology, D.K.; software, C.D.F.; validation, D.K., K.T. and C.D.F.; formal analysis, D.K. and C.D.F.; investigation, D.K., K.T. and C.D.F.; data curation, K.T.; writing—original draft preparation, K.T. and C.D.F.; writing—review and editing, D.K., K.T. and C.D.F.; supervision, D.K.; project administration, D.K., K.T. and C.D.F. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
This study did not require ethical approval.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Wang, C.; Ma, X.; Chen, J. Information extraction and knowledge graph construction from geoscience literature. Comput. Geosci. 2018, 112, 112–120.
2. Kazolis, D.; Kogias, P.; Roumeliotis, N. Energy cluster analysis based on consumption data in different weather condition. E3S Web Conf. 2023, 404, 01005.
3. Stoyanov, I.S.; Iliev, T.B.; Mihaylov, G.Y.; Evstatiev, B.I.; Sokolov, S.A. Analysis of the Cybersecurity Threats in Smart Grid University of Telecommunications and Post, Sofia, Bulgaria. In Proceedings of the IEEE 24th International Symposium for Design and Technology in Electronic Packaging, Iasi, Romania, 25–28 October 2018; pp. 90–93.
4. Singh, S.; Yassine, A. Big data mining of energy time series for behavioral analytics and energy consumption forecasting. Energies 2018, 11, 452.
5. Saaty, T. Decision making with the analytic hierarchy process. Int. J. Serv. Sci. 2008, 1, 83.
6. Vlahavas, I.; Kefalas, I.; Bassiliades, P.; Kokkoras, N.; Sakellariou, F. Artificial Intelligence, 4th ed.; University of Macedonia Press: Thessaloniki, Greece, 2020.
7. Kazolis, D.; Gerontidis, I. Knowledge Mining from Student Data. In Proceedings of the Conference in Telecommunications, Informatics, Energy and Management (TIEM 2019), Kavala, Greece, 12–14 September 2019; pp. 299–302.
8. Hellenic National Meteorological Service. Available online: http://www.emy.gr/emy/el/climatology/climatology_city?perifereia=East%20Macedonia%20and%20Thrace&poli=Kavala_Chryssoupoli (accessed on 5 April 2024).
9. Horn, J. A rationale and test for the number of factors in factor analysis. Psychometrika 1965, 30, 179–185.
10. Yong, A.; Pearce, S. A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutor. Quant. Methods Psychol. 2013, 9, 79–94.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).