1. Introduction
The rapid development of intelligent power systems presents a unique opportunity to optimize energy management and enhance energy security, thereby promoting sustainable development [1]. However, this growth also introduces challenges, particularly in maintaining the integrity of the ever-increasing volume of load data, which is crucial for efficient and reliable power grid operation [2]. Factors such as sensor malfunctions [3], communication disruptions [4], smart meter anomalies [5], and transmission bottlenecks [6] can lead to irregular and unpredictable data loss. Additionally, the intermittency of renewable energy generation and the complexity of user interactions and dynamic behaviors pose further challenges to system stability [7]. These issues threaten not only the stability of power systems but also broader sustainability goals, including energy efficiency and the transition to green energy. Addressing data loss is therefore essential to maintaining the resilience and sustainability of power grids, which underpin a green and secure energy future.
Generally, in power systems, the redundancy of measurement configurations provides a buffer against data loss. A traditional approach to handling missing data is to simply delete records containing missing values from the dataset. Although this method preserves the quality of the dataset by avoiding the risks associated with inaccurate estimates, it has a significant drawback: it reduces the size of the dataset, leaving less data available for training models and impairing the efficiency and effectiveness of the learning process [8]. When only a few data points are missing, pseudo-measurements can be substituted for them. Provided that the observability required for state estimation is satisfied, this maintains the accuracy of state estimation and can even be used to fill in the missing data [9].
However, once the proportion of missing data grows beyond a certain level, relying solely on pseudo-measurements or simply deleting the affected records becomes ineffective. Deleting missing data causes significant information loss [10], which not only reduces the usefulness of the data but may also leave models insufficiently trained, biasing their outputs and complicating subsequent data analysis and decision making [11]. Therefore, when data are missing on a large scale, more sophisticated data imputation techniques are needed to restore the lost information and ensure accurate and reliable data analysis.
To address this issue, data imputation techniques have emerged. Data imputation involves estimating missing data based on observed data [12], and the methods for filling missing values can primarily be categorized into three types: statistical learning-based methods, traditional machine learning-based methods, and deep learning-based methods.
Statistical learning-based imputation methods typically include forward fill, mean imputation, polynomial interpolation, mode imputation, regression imputation, nearest neighbor algorithms, and hot-deck and cold-deck imputation [13,14]. However, these approaches analyze the data only from the perspective of its distribution and overlook the time series characteristics and correlations in power system measurements, which leads to significant imputation errors and a suboptimal reconstruction of missing data. Sim et al. [8] proposed a missing data imputation method for transmission systems using the Principal Component Analysis Iterative Algorithm (PCA-IA), which improves imputation accuracy compared to traditional statistical methods. Although it considers the impact of other variables on power loads, this method assumes linear correlations, so it may be limited when dealing with nonlinear or complexly correlated datasets, and it is highly dependent on data quality. Furthermore, very large datasets or a high volume of missing data can degrade the performance of the algorithm. Kamisan et al. [15] introduced an imputation technique based on seasonal patterns and missing data localization. By rearranging the data and calculating the mean, the mean plus standard deviation, and the third quartile of the decomposed subsets, it provides a more rational estimate for missing values. This forms an effective data reconstruction strategy that is particularly suitable for load data with distinct seasonal patterns. However, it still assumes random missingness and depends on specific imputation models.
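Three of the simplest statistical strategies named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in any of the cited works; the function names and the sample `load` series are hypothetical.

```python
from statistics import mean


def forward_fill(series):
    """Replace each None with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def mean_impute(series):
    """Replace each None with the mean of all observed values."""
    observed_mean = mean(v for v in series if v is not None)
    return [observed_mean if v is None else v for v in series]


def linear_interpolate(series):
    """Fill interior gaps on a straight line between the neighbouring observations."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical hourly load readings (MW) with a two-sample gap.
load = [100.0, 110.0, None, None, 140.0, 130.0]
ffilled = forward_fill(load)
mean_filled = mean_impute(load)
interpolated = linear_interpolate(load)
```

None of these strategies uses other variables such as weather, and only interpolation uses the ordering of the samples at all, which is consistent with the limitation described above.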
Given the limitations of traditional statistical methods and some improved algorithms in dealing with complex datasets, modern machine learning offers a new perspective. To further enhance imputation precision, traditional machine learning-based methods, such as random forest, k-nearest neighbors (KNN), and Support Vector Machines (SVMs), are gaining attention. Farrugia et al. [16] proposed a KNN method that analyzes the past consumption patterns of consumers, identifies the patterns most similar to those around the missing data, and estimates the missing data by calculating their average, providing an effective filling strategy for missing load profile data in smart meters. Turrado et al. [17] introduced a missing data imputation algorithm based on Multivariate Adaptive Regression Splines (MARS), which builds a model of basis functions dependent on existing data to predict missing power data. Compared with the widely used Multivariate Imputation by Chained Equations (MICE) technique, it has proven its efficiency and superiority in estimating missing main electrical variables in power networks. Smola et al. [18] proposed an SVM-based method for missing data imputation that incorporates temporal information during the imputation process. As power grid environments become increasingly complex, with load changes showing strong randomness [19], nonlinearity [20], and conditionality, relying solely on traditional machine learning methods may be insufficient for the high-precision imputation of missing data.
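The pattern-matching idea behind KNN imputation of load profiles can be illustrated with a short sketch. It is written in the spirit of the approach described above rather than as a reproduction of the algorithm in [16]; the Euclidean distance, the function name `knn_impute`, and the sample profiles are all assumptions made for illustration.

```python
import math


def knn_impute(target, history, k=2):
    """Fill None entries in `target` with the average of the k most similar past profiles."""
    # Positions where the target profile was actually observed.
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(profile):
        # Euclidean distance over the observed positions only.
        return math.sqrt(sum((profile[i] - target[i]) ** 2 for i in observed))

    neighbours = sorted(history, key=distance)[:k]
    return [
        sum(p[i] for p in neighbours) / len(neighbours) if v is None else v
        for i, v in enumerate(target)
    ]


# Hypothetical complete past daily profiles (four readings per day)
# and a day with one missing reading.
history = [
    [10.0, 20.0, 30.0, 20.0],
    [12.0, 22.0, 34.0, 22.0],
    [50.0, 60.0, 70.0, 60.0],
]
target = [11.0, 21.0, None, 21.0]
filled = knn_impute(target, history, k=2)
```

Here the two low-consumption days are selected as neighbours and the high-consumption day is ignored, so the gap is filled from profiles that resemble the observed part of the target day.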
In recent years, deep learning methods have been widely applied to missing data imputation. Existing studies, such as those by Lotfipoor et al. [21] with Transformer networks, Ryu et al. [22] with denoising autoencoders (DAEs), and Liu et al. [23] with Generative Adversarial Imputation Networks (GAINs), have made certain advances but still face limitations in handling multivariate impacts and correlations among features. Although each of these methods performs well in data imputation, they often do not take into account the complex relationships between power load values and other variables (such as meteorological conditions) or the intrinsic connections among features.
In response to these limitations, this study proposes an innovative composite machine learning method aimed at more comprehensively considering the multiple factors and time series characteristics that influence power load. This method considers both meteorological features and short-term and long-term factors of time series to enhance the accuracy and efficiency of imputation. This study then details three machine learning models: random forest (RF), k-nearest neighbors based on Spearman’s rank correlation coefficient (SW-KNN), and a back propagation neural network optimized by Levenberg–Marquardt (LM-BP). Further, we introduce a variance–covariance weighted mixing imputation model that combines the predictive results of the three machine learning methods and dynamically allocates weights based on the variance and covariance of each model, synthesizing more accurate predictive outcomes. Overall, the contributions of this study are as follows:
The meteorological features and the short-term and long-term factors of time series are fully considered, making imputation more accurate and efficient.
Multiple machine learning models are introduced, which can predict the power grid load from different perspectives, capturing the complex dependencies among features in the data.
By introducing a variance–covariance weighting method, the combined model becomes more stable and accurate, providing effective data support for optimizing the scheduling, safe operation, and reasonable pricing of power systems.
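One common way to derive combination weights from the variance and covariance of model errors is the minimum-variance rule w = S⁻¹1 / (1ᵀS⁻¹1), where S is the covariance matrix of the models' errors on held-out data. The sketch below assumes this rule purely for illustration; the exact weighting formula used in this study is not stated in this section, and the function names and error values are hypothetical.

```python
def covariance_matrix(errors):
    """Sample covariance of the error series (one list of errors per model)."""
    n = len(errors[0])
    means = [sum(e) / n for e in errors]
    return [
        [sum((a[t] - ma) * (b[t] - mb) for t in range(n)) / (n - 1)
         for b, mb in zip(errors, means)]
        for a, ma in zip(errors, means)
    ]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def min_variance_weights(errors):
    """Assumed rule: w = S^-1 1 / (1' S^-1 1), so the weights sum to one."""
    raw = solve(covariance_matrix(errors), [1.0] * len(errors))
    total = sum(raw)
    return [w / total for w in raw]


def combine(predictions, weights):
    """Weighted sum of the individual model predictions for one missing value."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical validation errors for three models (e.g. RF, SW-KNN, LM-BP).
errors = [
    [1.0, -1.0, 1.0, -1.0],
    [2.0, -2.0, -2.0, 2.0],
    [0.5, 0.5, -0.5, -0.5],
]
weights = min_variance_weights(errors)
estimate = combine([100.0, 104.0, 101.0], weights)
```

In this toy case the error series are uncorrelated, so each weight is simply proportional to the inverse of that model's error variance and the model with the smallest errors dominates. With strongly correlated errors this rule can produce negative weights, which a practical implementation might clip or constrain.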
5. Conclusions and Outlook
In this study, we assessed the performance of a hybrid model constructed from three machine learning methods and designed to impute feature values that are missing consecutively over extended periods, thereby enhancing the resilience and sustainability of smart grid operations. We also analyzed how performance differs between single and multiple models and across model variants, training set sizes, seasons (summer and winter), and the inclusion of relevant factors for imputation, providing a comprehensive overview of how these variables affect model efficacy. In tests on five metrics, the hybrid model reduced the RMSE, MSE, MAE, and MAPE by approximately 12.3–23.5% compared to the best single model. When different factors were considered, the reductions in RMSE, MSE, and MAE ranged from 8.3 to 38.1%, while R² increased slightly and trended towards 1. Moreover, the machine learning hybrid model demonstrated superior imputation accuracy compared to different variants and mainstream imputation methods. This demonstrates the value of considering multiple factors and applying the variance–covariance weighted method, and it highlights the significant potential of our model in enhancing data accuracy and ensuring reliable energy management.
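For reference, the error metrics reported above can be computed as in the following sketch. The standard textbook definitions are used; the fifth metric is assumed here to be the coefficient of determination (R²), and the function name and sample values are hypothetical.

```python
import math


def imputation_metrics(actual, imputed):
    """Return MSE, RMSE, MAE, MAPE (in percent), and R2 for imputed values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, imputed)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is zero; load data are assumed positive.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    total_sum_of_squares = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - squared_error_sum / total_sum_of_squares
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape, "R2": r2}


# Hypothetical true load values and their imputed counterparts.
actual = [100.0, 200.0, 300.0, 400.0]
imputed = [110.0, 190.0, 310.0, 390.0]
scores = imputation_metrics(actual, imputed)
```

Lower values are better for the four error metrics, whereas R² approaches 1 as the imputed values explain more of the variation in the true values.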
Looking ahead, future research should focus on refining this hybrid approach to better align with the goals of sustainable development and energy security. Our method considers both meteorological features and the short-term and long-term factors of time series to enhance the accuracy and efficiency of data imputation. This study detailed the application of three machine learning models: random forest (RF), k-nearest neighbors based on Spearman’s rank correlation coefficient (SW-KNN), and a back propagation neural network optimized by the Levenberg–Marquardt method (LM-BP). Moreover, we introduced a variance–covariance weighted mixing imputation model that combines the predictive results of these models, dynamically allocating weights based on their variance and covariance to synthesize more accurate outcomes. These advances support the transition to a green energy future by providing reliable data for optimizing power grid operations, and they offer a pathway to practical applications that enhance operational efficiency and promote environmental sustainability. Additionally, real-time data integration and dynamic prediction updates will be critical for maintaining grid stability and efficiency, ensuring that smart grids continue to meet the evolving demands of a sustainable energy landscape. This study lays a foundation for future advances in sustainable energy management within intelligent power systems, offering an approach that addresses complex dependencies among features and improves the stability and accuracy of power grid operations.