1. Introduction
Energy consumption is continually increasing globally, in parallel with the advancement of science and technology. To maintain a modern and appropriate technology level, nations must improve and sustain their energy resources. Today’s principal challenge facing the energy sector is maintaining the balance between supply and demand. Furthermore, as the world population grows, the per capita consumption rate also increases, driven by technological advancements [
1]. Thus, there exists a direct correlation between an individual’s daily consumption rate in a country and the level of development of that country.
Energy resources are classified based on their consumption and convertibility: as renewable or non-renewable sources, and as primary or secondary sources. Non-renewable energy sources are finite, unchanging and discontinuous in nature and include fossil fuels such as oil, natural gas and coal. Renewable energy resources, on the other hand, can be replenished over time and are available for a prolonged period; they include solar, wind, geothermal, biomass and hydro-power [
2].
The economic feasibility and popularity of solar energy are increasing daily. However, regular solar energy monitoring is essential to ensure high efficiency and prevent problems. The importance of research in this field is directly related to the increase in the solar energy market share. The global solar energy market, which was valued at USD 86 billion in 2015, is projected to reach USD 422 billion by the end of 2022 [
3]. It is estimated that approximately 2% of photovoltaic panels will fail after 11–12 years [
4], and losses from dust collection (contamination) can be greater than losses from cell disruption. Therefore, regular production monitoring and the reporting of possible losses are essential to ensure early diagnosis and regular maintenance.
In the article by Rahman et al. [
5], Artificial Neural Network (ANN) systems were explored for predicting renewable energy generation from solar, turbine and hydro-power sources. Similarly, Zheng et al. [
6] utilized particle swarm optimization combined with long short-term memory (LSTM) techniques to predict energy output from photovoltaic (PV) systems.
Various methodologies are reported in the literature for predicting the energy generated by photovoltaic systems. For example, some studies [
7,
8,
9] have employed neural network techniques to predict energy output. A similar analysis has also been applied to forecasting the temperature of photovoltaic modules. Other studies [10, 11] focus on feature selection, on the premise that a well-configured feature set lets ML models predict solar power better.
However, due to the non-linear and chaotic nature of solar power plants, the prediction models must be chosen carefully. This study therefore uses three of the most popular machine learning algorithms; some of these models can work with few features, while others prefer a larger set of variables. This is important because the type of stage and the number of features differ across the system.
Several articles describe the use of Digital Twin (DT) technology in various renewable energy systems. In [
12], modules were designed to store, map and process data from a solar power plant to develop life-cycle management with DT. In [
13], the authors designed an architecture, mathematical model and big data analytic engine to monitor the state of solar panels using DT. In [
14], the authors proposed using DT for optimum control, virtual modeling and pre-diagnosis in production processes. In [
15], it was suggested to use DT to monitor decentralized renewable energy sources in the electricity grid. In [
16], the authors used DT to observe wind turbine fatigue failure and evaluate alternative processes for a floating wind turbine.
The articles reviewed in this study propose the use of Digital Twin (DT) technology to monitor and optimize various aspects of solar and wind power plants. The studies involve designing modules to collect and process data, creating virtual models of the physical systems and implementing AI algorithms and big data analytics to improve performance.
The implementation of Digital Twin (DT) technology faces a challenge in detecting errors or abnormalities: it requires waiting for the entire cycle to complete, which slows down the system and reduces sensitivity. To overcome this, the authors suggest dividing the power plant into subsystems and using multiple models, each representing a specific component of the solar PV plant. This approach provides detailed insights into the performance and health of individual components, enabling the identification of potential failures or degradation. Compared to the standard single-model version, the new approach, with three Digital Twins (DTs) inside one system, provides a comprehensive understanding of the overall power plant, facilitating proactive maintenance and optimizing performance.
Considering the aforementioned factors, this study aims to predict production and detect errors in the system with machine learning models while creating the Digital Twins of the photovoltaic systems and transferring them to the virtual system. In summary, the paper’s key contribution is the development of an innovative DT for solar PV, built from DTs of the different components, which allows the faulty component to be identified and can be easily integrated into an online platform for real-time monitoring of a real SCADA system. The study is structured as follows: in
Section 2, the methods used in this study and their working methods are explained.
Section 3 is used to compare the results. Finally,
Section 4 shows the conclusion. Although the paper provides results, the presented values come from simulations and lack real data due to the access restrictions of real environments.
3. Methodology of Digital Twin
The fundamental objective of the DT concept is to enable real-time monitoring and detailed analysis of a solar panel system through a virtual model (
Figure 4). Fed with data from the real PV plant, the trained model predicts the system’s behavior using one of the machine learning methods explained earlier. The results from the ML model can then be compared with those from the real plant to detect any deviation and warn the responsible party so that countermeasures can be taken.
All the data can come from different sources, such as different IoT devices. The only requirement is that each record carries a time stamp and the minimum data needed to correlate all the inputs. This approach simplifies the process by consolidating all necessary information on a single platform and removes the need for intricate, burdensome systems that handle an abundance of data. The details of the internal architecture are depicted in
Figure 5. The platform uses Docker to split the components into small modules that are easy to manage and maintain. To interact with external elements and receive the data, we created a REST API built on the FastAPI framework. The machine learning component uses the scikit-learn and Keras frameworks, and it receives orders to perform prediction or re-training through the Redis database. The data are stored in InfluxDB, a tool that manages time series data efficiently and is fast enough to hold all the needed data. Finally, Grafana can be used to visualize the data from the different sources as well as the Digital Twin.
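The time-stamp requirement described above can be illustrated with a minimal sketch. It is not the platform's code: the source names, field names and values are hypothetical, and the rule of keeping only time stamps reported by every source is an assumption made for the example.

```python
from collections import defaultdict

def align_by_timestamp(sources):
    """Merge per-source readings into one row per time stamp.

    `sources` maps a source name to a list of (timestamp, {field: value}) pairs.
    Only time stamps reported by every source are kept, so each row is complete.
    """
    merged = defaultdict(dict)
    seen = defaultdict(set)
    for name, readings in sources.items():
        for ts, fields in readings:
            merged[ts].update(fields)
            seen[ts].add(name)
    return {ts: merged[ts] for ts in sorted(merged) if len(seen[ts]) == len(sources)}

# Hypothetical readings from two IoT devices (names and values are illustrative).
sources = {
    "weather_station": [
        ("2020-06-03T10:00", {"irradiance": 810.0, "temperature": 24.1}),
        ("2020-06-03T11:00", {"irradiance": 905.0, "temperature": 25.3}),
    ],
    "inverter": [
        ("2020-06-03T11:00", {"grid_power": 118.2}),
        ("2020-06-03T12:00", {"grid_power": 121.7}),
    ],
}
rows = align_by_timestamp(sources)
```

Only the 11:00 record survives here, because it is the only time stamp that both hypothetical devices report.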
The methodology used to obtain the results follows four steps. First, several years of weather data are collected from the PVGIS system. Second, part of these data is used to run the power plant model in Matlab/Simulink and obtain experimental data as if they came from a real plant. Third, the generated data are used to train the ML models. Finally, the DT is run using the same plant model again, this time with the weather data that were not used in the previous steps. Because no data from a real plant were available for this study, the first two steps were needed. All these steps used Docker to build the containers: InfluxDB, Grafana, Redis, the ML models and FastAPI.
4. Results
Figure 6 compares the original DT concept, in which a single model predicts the whole system, with the proposed “box of boxes” concept, which has the same inputs and outputs but also exposes intermediate variables that provide more information about the twin system.
The data used for these results are generated using a Matlab/Simulink model of a solar system with 150 kW of power as a real installation (
Figure 7) [
32]. In addition, two years of weather input data from PVGIS are used, the first part for training and the second for evaluation [
33]. An installation with fixed panels at 30 degrees has been used to evaluate this research. The idea of this concept is to have a DT designed for a specific installation. If it changes—for example, the slope of the panels or the system tracks the sun—then the models have to be retrained with the data of this new configuration.
Solar energy plants consist of many complex parts, and they work intertwined with each other; with the proposed design in this study (
Figure 8), we aim to reduce the complexity by examining the whole system in three separate parts. Nevertheless, the whole system behaves like a chain reaction, and it should be emphasized that there is a natural bond between the parts.
This study used various variables for training and estimation, selected by examining solar panel systems and identifying key system characteristics. The relationships between these variables were analyzed thoroughly to structure the data for the best results (
Figure 8).
4.1. Machine Learning
This section presents, for each part of the PV plant, the results obtained with the three chosen machine learning methods and compares their performance in the three different situations. The data cover a period of two years, with data points recorded every hour, giving a dataset of 17,544 rows for each measured parameter. This amount of data allowed a thorough analysis of trends and patterns in the measured parameters over time. For partitioning, 30% of the available data were reserved for testing, and the remaining 70% were divided into training and validation sets: 80% of that remaining portion was allocated to training and 20% to validation.
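The partitioning can be sketched as follows. The row count of 17,544 is consistent with 731 days of hourly records (731 × 24). The sketch assumes that the 80/20 split applies to the 70% left after the test set is removed and that the rows are shuffled before splitting; the function name, the seed and the rounding rule are illustrative choices, not taken from the study.

```python
import random

def split_indices(n_rows, test_frac=0.30, val_frac=0.20, seed=0):
    """Return (train, val, test) index lists.

    test_frac is taken from the full dataset; val_frac is taken from
    what remains after the test rows are removed (assumption).
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption
    n_test = round(n_rows * test_frac)
    n_val = round((n_rows - n_test) * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(17_544)
print(len(train), len(val), len(test))
```

Under these assumptions the training set holds roughly 56% of all rows, the validation set 14% and the test set 30%.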
4.1.1. PV Panel Part
The solar panel component is the central focus of this study. It is heavily dependent on various input variables and requires preprocessing for estimation, except for electricity generation. Furthermore, given that its performance is affected by weather conditions and environmental factors, it requires continuous monitoring and protection.
The first stage of this study focuses on the direct effect of various types of radiation and temperature on solar panel performance. If
Figure 9 is examined, there is a linear relationship between power generation and both the irradiation types and temperature. In contrast, this link is less clear for the variable that specifies the height of the sun (degrees). While the literature states that wind speed does not have a significant impact, this study found that it still has an effect. This effect is indirect: with the wind, the panels can be covered with sand or cleared of it.
Figure 10 compares the real electrical energy output with the predictions of the machine learning algorithms. To evaluate the models, four random test days were selected, one in each season: 1 January, 7 March, 3 June and 3 September. The data for each test day run from 3 am until 4 am the next day, providing enough data to test the models’ accuracy and generalizability. The comparison shows that the actual generation curve (blue) and the prediction curve (red) are highly similar on all days for every method, which implies that the predictions capture the trends in the data. Notably, CatBoost is the only method whose estimates fall below the peak values of the actual data on each observed day.
CatBoost does not perform as well as RF or DNN: it is the fastest algorithm to train and predict, but its error rate cannot be considered acceptable. DNN and RF gave very good results, but it is hard to say from these plots which one performed better. The prediction error must therefore be calculated to measure the forecasts’ accuracy. The RMSE and MAE values for the compared approaches are shown in
Table 1.
As depicted in the aforementioned
Table 1, the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values for each of the proposed machine learning techniques are small, indicating that the estimation for the test dataset is satisfactory. Both metrics indicate that the DNN model yields the lowest prediction error, while CatBoost yields the highest RMSE and MAE values. Despite slight variations, the errors of all methods are of a similar magnitude. This can be attributed to the lack of significant variations in the meteorological conditions of the location under test.
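For reference, the two metrics can be computed as in the sketch below, which uses their standard definitions. The sample values are invented for illustration and are not taken from Table 1.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large residuals more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative power values in kW (hypothetical, not from the study).
actual = [0.0, 50.0, 100.0, 150.0]
predicted = [0.0, 48.0, 104.0, 150.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same data, and a large gap between the two points to a few large errors rather than many small ones.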
4.1.2. DC–DC Converter Part
Unlike PV panels, DC–DC converters exhibit nonlinear behavior and are dynamic systems that may adapt quickly to changes in the system. The semiconductor devices utilized in the converter, as well as the nonlinear phenomena generated by parasitic capacitances and inductances in the system, are principally responsible for this non-linearity [
34]. As a result, DC–DC converters are intrinsically nonlinear, making accurate modeling by machine learning techniques difficult. Furthermore, because of the semiconductor architectures utilized, the quick reaction time of DC–DC converters offers extra challenges for machine learning algorithms with large gaps between sample periods [
35]. To address this challenge, the machine learning algorithm employs a MIMO (Multiple Input Multiple Output) architecture, which allows for a more detailed analysis of the system by estimating both current and voltage as output.
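The MIMO idea, one model that returns current and voltage together, can be sketched with a deliberately simple stand-in. The example uses a k-nearest-neighbour average instead of the RF, DNN or CatBoost models evaluated in the study, and the input features and sample values are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict all outputs jointly as the mean of the k nearest training targets."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(range(len(train_x)), key=lambda i: dist2(train_x[i], query))[:k]
    n_out = len(train_y[0])
    return [sum(train_y[i][j] for i in nearest) / len(nearest) for j in range(n_out)]

# Hypothetical samples: inputs (PV power in kW, PV voltage in V)
# mapped to outputs (DC current in A, DC voltage in V).
train_x = [(10.0, 480.0), (50.0, 490.0), (90.0, 495.0), (120.0, 498.0)]
train_y = [(20.0, 499.6), (100.0, 500.1), (180.0, 500.3), (240.0, 500.4)]

current, voltage = knn_predict(train_x, train_y, (55.0, 491.0), k=2)
```

A single call yields both outputs from the same neighbours, so the predicted current and voltage stay mutually consistent, which is the property a multiple-output model offers over two independent single-output models.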
Figure 11 shows that all models predict the DC current very well, which is not an easy task considering that the current values fluctuate sporadically from 0 to 250 A. However, because the results are so close, it is difficult to choose among the models on this basis, so it is necessary to check the voltage estimation results in
Figure 12.
In
Figure 12, the situation is completely different: the voltage values only range from 499 V to 501 V, and most are clustered above 500 V. ML models have difficulty predicting such small variations. As shown in the graph, DNN usually predicts above or below the true value, while CatBoost was unable to predict the upper and lower values accurately, with its average remaining around 500 V. The graph therefore shows that RF predicted with the best accuracy.
A comparison of the models in terms of MAE and RMSE in
Table 2 confirms that RF had the lowest errors, while DNN had the highest. This suggests that tree-based models give better results here because their predictions cannot go beyond the maximum and minimum values seen in the training data.
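The bounded-prediction property of tree-based models can be demonstrated with a toy one-dimensional regression tree. This is a hand-written sketch, not the RF or CatBoost implementation used in the study, and the data points are invented. Each leaf stores the mean of the training targets that reach it, so no prediction can fall outside the range of those targets.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(points, depth=3):
    """points: list of (x, y). Returns (threshold, left, right) or a leaf mean."""
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return mean([y for _, y in points])

    def cost(t):
        left = [y for x, y in points if x < t]
        right = [y for x, y in points if x >= t]
        return sse(left) + sse(right)

    t = min(xs[1:], key=cost)  # split that minimizes the squared error
    left = [(x, y) for x, y in points if x < t]
    right = [(x, y) for x, y in points if x >= t]
    return (t, fit_tree(left, depth - 1), fit_tree(right, depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree

# Hypothetical voltage-like targets confined to a narrow band.
points = [(0.0, 499.2), (1.0, 499.8), (2.0, 500.3), (3.0, 500.6), (4.0, 500.9)]
tree = fit_tree(points)
far_below, far_above = predict(tree, -100.0), predict(tree, 100.0)
```

Even for inputs far outside the training range, the two predictions remain within the band of training targets, whereas a model that extrapolates, such as a neural network, carries no such guarantee.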
4.1.3. Grid Part
In this section, variables from the DC part have been used as input (
Figure 8), and we have tried to determine whether there is any loss in the system due to extra resistors, cables or malfunctioning electronic devices. It should be noted that ML models cannot say anything about the type or cause of a loss; they only show the effect these losses have on the system.
When looking at
Figure 13, it is clear that all the models give very good results. A graph can usually indicate which one performs better, but here again all the points lie on top of each other, which means that a near-perfect fit has been reached. This raises the question of over-fitting; to guard against it, the data were shuffled across different months and different hours of the day and night. Moreover, as explained in the PV part, the weather conditions at the test location are fairly stable throughout the year, so predicting the grid part is also quite straightforward.
Figure 13 shows that all the models predict the same values, so it is necessary to check
Table 3 to see the details and decide which one is the best algorithm. Thus, based on the RMSE and MSE, we can choose one model to use for the grid part.
It is evident that all models exhibit excellent performance, with Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) values below one. It is therefore justifiable to select a model on other grounds. Three criteria were used to make the decision: the size, speed and complexity of the model. CatBoost was the best on these criteria, so it was preferred as the main model, even though the RF model showed the best performance in terms of error rates.
4.2. Digital Twin
Integrating Digital Twin technology and Internet of Things (IoT) devices is a promising approach for the real-time monitoring and analysis of power grid systems. Utilizing a Digital Twin concept, a virtual replica of the physical system allows for a seamless connection between IoT devices and data analytics, enabling rapid assessment and real-time decision-making based on reliable data.
One example of such an application is the use of Digital Twin technology to predict the electricity that a solar power plant delivers to the grid. These predictions can be compared with the plant, the real twin, to detect any deviation from the expected results, increasing the system’s reliability.
Figure 14 depicts the Grafana charts that allow for real-time system examination and alerting. The top graph shows the machine learning prediction of the electricity passing through the grid. The central dashboard displays the error rate between the estimated and actual power. This study uses a tolerance of 20% as an example: if the error rate exceeds this threshold, the system sends an alert to the user. The bottom portion of the figure displays the types of irradiation, with data obtained and sent to the system via IoT devices. This aspect of the system can be further supported with additional sensors to facilitate monitoring.
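The 20% tolerance check can be sketched as below. The exact error-rate formula is not given in the text, so the sketch assumes a relative deviation with respect to the twin's prediction. The `floor_kw` guard against division by near-zero power at night, along with the function names, is a hypothetical addition.

```python
def error_rate(predicted_kw, actual_kw, floor_kw=1.0):
    """Relative deviation of the plant from its twin's prediction.

    floor_kw keeps the ratio finite when the predicted power is close to zero.
    """
    return abs(predicted_kw - actual_kw) / max(abs(predicted_kw), floor_kw)

def check(predicted_kw, actual_kw, tolerance=0.20):
    """Return the error rate and whether it exceeds the tolerance."""
    rate = error_rate(predicted_kw, actual_kw)
    return {"error_rate": rate, "alert": rate > tolerance}

healthy = check(100.0, 95.0)  # small deviation, stays within tolerance
faulty = check(100.0, 70.0)   # large deviation, exceeds tolerance
```

In a deployment like the one described, a rule of this kind would typically be evaluated by the dashboard's alerting layer rather than in application code; the sketch only makes the threshold logic explicit.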
The idea to split the model into small models starts with the thesis that if there is a deviation from the Digital Twin, something is not correct, but there is no more information. Dividing the system into small pieces allows us to see the location of the issue and facilitates decision making. However, this method needs more data not only for the training but also for monitoring. It needs the status of intermediate points of the system and forces digitalization, adding more sensors to the parts of the system that require more information.
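The localization idea can be sketched as follows. The sketch assumes that each sub-model is fed with the measured output of the stage before it, so that a deviation at one stage points to that stage rather than to an upstream fault. The stage names, the numbers and the reuse of the 20% tolerance are illustrative assumptions.

```python
STAGES = ("pv_panel", "dc_dc_converter", "grid")

def locate_fault(predicted, measured, tolerance=0.20):
    """Return the first stage whose measurement deviates from its twin, or None."""
    for stage in STAGES:
        expected = predicted[stage]
        deviation = abs(expected - measured[stage]) / max(abs(expected), 1e-9)
        if deviation > tolerance:
            return stage
    return None

# Hypothetical power values in kW at the output of each stage.
predicted = {"pv_panel": 120.0, "dc_dc_converter": 117.0, "grid": 114.0}
measured_ok = {"pv_panel": 119.0, "dc_dc_converter": 116.0, "grid": 113.0}
measured_bad = {"pv_panel": 119.0, "dc_dc_converter": 88.0, "grid": 86.0}

print(locate_fault(predicted, measured_ok))
print(locate_fault(predicted, measured_bad))
```

A single whole-plant model would only see the deviation at the grid output, while the per-stage comparison narrows it to the converter. The cost, as noted above, is that each intermediate point must be measured.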
5. Conclusions
This paper has presented the concept of an AI-based Digital Twin as a “box of black boxes”. The concept differs from previous research in this field: instead of focusing on one big model, the authors designed it like a puzzle, creating small parts and connecting them. The research culminated in the development of three AI models, each representing a different component of the overall solar PV power plant system, with a global accuracy of 98.3%. These models provide granular insights for detecting and evaluating possible faults or performance deterioration in specific components of the overall system. Each model was assessed in depth using three machine learning algorithms to find the most appropriate approach. Notably, the findings revealed that the performance of the algorithms differed depending on the individual system component and its distinct characteristics.
The next steps of this research are to test this development in a real installation and to apply the concept to other energy generation systems that need a DT, such as wind turbines and other renewable resource systems. Furthermore, future work should not rely only on the error and its variations to judge a good fit; the amount of data and the training time could also be important, depending on the application.