Article

Multiple Types of Missing Precipitation Data Filling Based on Ensemble Artificial Intelligence Models

1 School of Hydraulic Engineering, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
2 International Science and Technology Cooperation Base for Utilization and Sustainable Development of Water Resources, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
3 Nanxun Innovation Institute, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Water 2024, 16(22), 3192; https://doi.org/10.3390/w16223192
Submission received: 25 September 2024 / Revised: 5 November 2024 / Accepted: 6 November 2024 / Published: 7 November 2024
(This article belongs to the Section Water Resources Management, Policy and Governance)

Abstract

The completeness of precipitation observation data is a crucial foundation for hydrological simulation, water resource analysis, and environmental assessment. Traditional data imputation methods suffer from poor adaptability, lack of precision, and limited model diversity. Rapid and accurate imputation using available data is a key challenge in precipitation monitoring. This study selected precipitation data from the Jiaojiang River basin in the southeastern Zhejiang Province of China from 1991 to 2020. The data were categorized based on various missing rates and scenarios, namely MCR (Missing Completely Random), MR (Missing Random), and MNR (Missing Not Random). Imputation of precipitation data was conducted using three types of Artificial Intelligence (AI) methods (Backpropagation Neural Network (BPNN), Random Forest (RF), and Support Vector Regression (SVR)), along with a novel Multiple Linear Regression (MLR) imputation method built upon these algorithms. The results indicate that the constructed MLR imputation method achieves an average Pearson’s correlation coefficient (PCC) of 0.9455, an average Nash–Sutcliffe Efficiency (NSE) of 0.8329, and an average Percent Bias (Pbias) of 10.5043% across different missing rates. MLR simulation results in higher NSE and lower Pbias than the other three single AI models, thus effectively improving the estimation performance. The proposed methods in this study can be applied to other river basins to improve the quality of precipitation data and support water resource management.

1. Introduction

The completeness of precipitation observation data is indispensable for conducting relevant calculations in hydrological simulation, water resources management, and water environment protection. Filling in the missing data from precipitation observations is a core aspect of data integration and processing [1]. With the widespread application of various monitoring equipment in the water conservancy industry, established open-channel water diversion projects have formed a comprehensive monitoring system covering various hydrological elements such as precipitation, evaporation, and flow. However, due to various factors such as equipment malfunctions, transmission interruptions, human disturbances, and environmental changes, missing values can easily occur during collecting, storing, and organizing observation data [2]. From a statistical analysis perspective, missing data represent a measurement error that reduces the sample size, potentially leading to bias or significant distortion. This can cause deviations in the analysis results based on such data, further affecting the accuracy, effectiveness, and scientific nature of downstream tasks such as meteorological analysis and prediction [3].
Research addressing missing data in meteorological observations is extensive globally, with traditional imputation methods primarily categorized into mathematical and physical models. For mathematical models, methods such as linear interpolation suffer from low data accuracy, while approaches like Lagrange interpolation, despite their advantages, exhibit the “Runge phenomenon” near the ends of the interpolation interval, thereby limiting their application in data processing and fitting [4]. When simulating complex hydrological environments, physical models require more parameter support, involve significant computational effort, and pose greater challenges in the modeling process [5,6]. For instance, Angkool et al. [7] employed arithmetic averaging (AA), multiple linear regression (MLR), normal ratio (NR), and nonlinear iterative partial least squares (NIPALS) algorithms to impute missing daily precipitation data in northern Thailand, and the models demonstrated good performance. Similarly, Yagci et al. [8] used ArcGIS to construct a physical model for imputing missing data and achieved favorable results.
With the gradual formation of Big Earth Data and the rapid development of 3S technology (Remote Sensing, Geographical Information System, and Global Positioning System), machine learning models have been applied in multiple fields [9]. In the field of hydrology and meteorology, machine learning algorithms such as Back Propagation Neural Network (BPNN), Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost) have been used to train large amounts of historical data, learn complex relationships between data, and predict values for unknown data points based on these relationships. Compared to traditional methods, they offer higher accuracy, greater adaptability, and broader application prospects [10,11]. For example, Shortridge et al. [12] utilized AI methods such as Artificial Neural Networks (ANN), RF, and Support Vector Machines (SVM) to predict monthly runoff in Ethiopia and found that, compared to physical models, data-driven models could reduce errors. Lee et al. [13] simulated hydrological and meteorological elements using the Long Short-Term Memory (LSTM) model and discovered its potential in hydrological and meteorological simulations, with results showing that the LSTM model reproduced the variability, correlation structure of the larger timescale, and key statistics of the original time domain more effectively than traditional models. Ángel et al. [14] used the machine learning model to estimate missing ozone values through five other pollutant variables contained in air quality information and found that AI algorithms demonstrated high accuracy in data imputation. Muhammad et al. [15] used the Nearest Neighbor Method (NNM) and Expectation Maximization (EM) to find that machine learning algorithms have strong applicability in data imputation when imputing missing air quality data from five monitoring stations in Sabah, Malaysia.
However, most of the studies mentioned above are constructed based on a single data type, lacking an in-depth exploration of the integrated application of multiple models and overlooking valuable information present in other data types. Stacking, as an important method in ensemble learning, has undergone several stages of development since David H. Wolpert first conceptualized “stacked generalization” in 1992 [16,17]. Wolpert pointed out that combining the predictions of multiple base learners can enhance the accuracy of the final prediction, laying the foundation for the development of stacking models. Subsequently, Leo Breiman [18] elaborated on stacking models’ principles and training processes in 1996, emphasizing their potential to improve prediction accuracy. As machine learning technologies advanced in the 21st century, stacking models were widely applied and integrated with ensemble learning methods to form more complex and effective learning frameworks. These were applied in various domains, such as data mining, recommendation systems, financial risk assessment, and medical diagnosis. For example, Dai et al. [19] proposed a two-level stacking regression model based on abdominal CT, achieving more accurate predictions with lower parameter costs. Lin et al. [20] enhanced the prediction capabilities of brain activity by stacking different encoding models for brain mapping.
Understanding the causes of missing data is crucial for effective imputation, as it helps identify the primary reasons for data incompleteness and the relationships between the measured variables and the missingness [21]. Missing data can be classified into three types: MCR (Missing Completely Random), MR (Missing Random), and MNR (Missing Not Random). Missing meteorological records are typically classified as MCR, where the probability of any specific value being missing is unrelated to observed and unobserved data and to any variables within the dataset [22,23]. Nevertheless, given the constraints and uncertainties associated with missing data that cannot be fully elucidated by observed variables alone, considering imputation methods for all types of missingness may provide more accurate precipitation estimates and forecasts.
To address these issues, this study selects precipitation observational data from the Jiaojiang River basin from 1991 to 2020 as a case study, focusing primarily on the MCR type while also considering the MR and MNR types. The objectives are as follows: (1) to construct three machine learning algorithms, namely BPNN, RF, and SVR, and reveal their respective advantages and limitations; (2) to provide a scientific basis for selecting the most appropriate imputation strategies for different application scenarios; and (3) to develop the MLR imputation method to improve the accuracy and reliability of imputing missing hydrological and meteorological data. This study aims to promote the intelligent development of hydrological and meteorological data processing and provide more reliable and high-quality data support.

2. Study Area and Data

The Jiaojiang River basin, located in the central coastal region of Zhejiang Province in southeast China, is one of the eight major river basins in Zhejiang Province. It lies between 120°17′6″ and 121°41′00″ east longitude and 28°32′2″ and 29°20′29″ north latitude, covering an area of 6603 km² (Figure 1). The terrain of the Jiaojiang River basin slopes from west to east, with continuous mountains in the central, western, and northern regions. The coastal plain is interspersed with low mountains and hills, and the river channels are densely distributed [24].
The core of this study is to develop a novel method for imputing missing data, with a primary focus on evaluating the performance of data imputation models within specific application contexts. It does not aim to thoroughly analyze the inter-annual climatic and hydrological variations in a particular watershed, so the requirement for data timeliness is relatively less stringent. Consequently, this study selected precipitation data observed at 24 meteorological stations in the Jiaojiang River basin. The data, which cover all meteorological information from 1991 to 2020, are recorded on a daily scale, with careful consideration given to both the reliability and accessibility of the data. The 24 stations providing the precipitation data used in this study are Caodian (CD), Hengliao (HL), Linshan (LS), Longtantou (LTT), Shangzhang (SZ), Xianju (XJ), Xianju Meiao (MA), Xiahuitou (XHT), Xishang (XS), Miaoliao (ML), Fengshugang (FSG), Baizhiai (BZA), Jietou (JT), Lishimen (LSM), Tiantai Yanxia (YX), Tianzhu (TZ), Baihedian (BHD), Feshu (FS), Hutanggang (HTG), Longhuangtang (LHT), Shanzhengtou (SZT), Tiantai (TT), Baibu (BB), and Shaduan (SD).
To assess whether the four methods employed in this study significantly impact the water balance within the watershed, the precipitation data were used to construct Thiessen polygons that allocate weights to the annual precipitation across the watershed (Figure 2). Subsequently, the total precipitation within the watershed from 1991 to 2020 was calculated.

3. Methodology

The process of this research is illustrated in Figure 3. The overall research framework revolves around constructing and evaluating machine learning models, encompassing four core steps: data preparation, construction and training, establishment of the MLR model, and accuracy assessment.
Data missing rates of 1%, 5%, 10%, 20%, and 30% were established to simulate real-world data missingness, and precipitation data from the Shaduan station were designated as the target for imputation. Following these missing rates, we deleted the original data from the designated station to mimic realistic data gaps (the deletion rates mentioned all refer to the ratios of data that are intentionally and randomly removed to simulate realistic missing scenarios). This study designed three missing-data patterns. The first two patterns simulated MCR by randomly deleting N% of the data or by randomly deleting N% of consecutive whole years. The third pattern simulated MR by deleting the largest N% of the data. To simulate MNR, the study deleted the x-th data point whenever the (x + 1)-th data point exceeded a threshold A, ensuring that the deleted data accounted for N%. Training and validation datasets were then set up (Figure 4). Next, the three machine learning algorithms BPNN, RF, and SVR were trained. Because of the inherent uncertainty in machine learning algorithms, and to both quantify this uncertainty and enhance the accuracy of the newly developed MLR imputation method, each of the three algorithms was trained 1000 times, and the training run with the highest Nash–Sutcliffe Efficiency (NSE) was selected as the final version of each model. The three trained models were then used to construct the novel MLR filling method, which was applied to fill meteorological data of the three missing types and multiple missing rates; the Pearson's correlation coefficient (PCC), NSE, and Percent Bias (Pbias) of the filled results were calculated. Finally, a comparative analysis and evaluation of the four methods were conducted.
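The three deletion patterns above can be sketched as simple masking routines. The following is a minimal illustration on synthetic data; the gamma-distributed series, the threshold value, and the function names are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical daily precipitation series (gamma-distributed, heavily skewed)
precip = rng.gamma(shape=0.5, scale=10.0, size=1000)

def delete_mcr(x, rate):
    """MCR: delete a random rate-fraction of points."""
    x = x.copy()
    n_del = int(len(x) * rate)
    idx = rng.choice(len(x), size=n_del, replace=False)
    x[idx] = np.nan
    return x

def delete_mr(x, rate):
    """MR pattern used in this study: delete the largest rate-fraction of values."""
    x = x.copy()
    n_del = int(len(x) * rate)
    idx = np.argsort(x)[-n_del:]  # indices of the largest values
    x[idx] = np.nan
    return x

def delete_mnr(x, rate, threshold):
    """MNR: delete point x_t whenever x_{t+1} exceeds the threshold A,
    until roughly a rate-fraction of the data has been removed."""
    x = x.copy()
    n_del = int(len(x) * rate)
    deleted = 0
    for t in range(len(x) - 1):
        if deleted >= n_del:
            break
        if x[t + 1] > threshold:
            x[t] = np.nan
            deleted += 1
    return x

mcr = delete_mcr(precip, 0.10)  # 10% missing completely at random
```

The deleted entries are marked as NaN so that any downstream model can treat them as imputation targets.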

3.1. Model Validation Criteria

This study employs three quantitative indicators: the Pearson’s correlation coefficient (PCC), the Nash–Sutcliffe Efficiency (NSE), and the Percent Bias (Pbias) to quantify the accuracy of various methods in filling gaps in precipitation data [25]. The formulas are as follows:
$$PCC = \frac{\sum_{i=1}^{n}\left(Q_{s_i}-\overline{Q_{s}}\right)\left(Q_{o_i}-\overline{Q_{o}}\right)}{\sqrt{\sum_{i=1}^{n}\left(Q_{s_i}-\overline{Q_{s}}\right)^{2}}\cdot\sqrt{\sum_{i=1}^{n}\left(Q_{o_i}-\overline{Q_{o}}\right)^{2}}}$$
where $Q_{s_i}$ represents the simulated precipitation for the i-th data point and $Q_{o_i}$ represents the observed precipitation for the i-th data point. The range of PCC is from −1 to 1, and a value closer to 1 indicates better simulation performance [26].
$$NSE = 1 - \frac{\sum_{i=1}^{n}\left(Q_{s_i}-Q_{o_i}\right)^{2}}{\sum_{i=1}^{n}\left(Q_{o_i}-\overline{Q_{o}}\right)^{2}}$$
where $Q_{s_i}$ represents the simulated precipitation for the i-th data point and $Q_{o_i}$ represents the observed precipitation for the i-th data point. The range of NSE is from negative infinity to 1, and a value closer to 1 indicates higher simulation quality [27].
$$Pbias\ (\%) = \frac{\sum_{i=1}^{n}\left(Q_{s_i}-Q_{o_i}\right)}{\sum_{i=1}^{n}Q_{o_i}}\times 100$$
where $Q_{s_i}$ represents the simulated precipitation for the i-th data point and $Q_{o_i}$ represents the observed precipitation for the i-th data point. A Pbias value closer to 0 indicates better simulation performance [28].
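The three validation criteria can be computed directly from their definitions. Below is a minimal NumPy sketch (the function names are our own, not from the paper):

```python
import numpy as np

def pcc(sim, obs):
    """Pearson's correlation coefficient between simulated and observed series."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return np.corrcoef(sim, obs)[0, 1]

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of residual variance
    to the variance of the observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(sim, obs):
    """Percent Bias: relative total deviation of simulated from observed, in %."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return np.sum(sim - obs) / np.sum(obs) * 100.0
```

A perfect simulation gives PCC = 1, NSE = 1, and Pbias = 0; note that Pbias can also be 0 for a biased series whose totals happen to cancel, which is why the three metrics are used together.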

3.2. Machine Learning Methods

3.2.1. Backpropagation Neural Network (BPNN)

An artificial neural network is a widely parallel-connected network composed of simple and adaptive units [29,30]. BPNN, an improvement based on artificial neural networks, is also the most widely used among all artificial neural networks [31]. When the activation function of all neurons adopts the Sigmoid function, a single hidden layer BPNN model can solve most classification problems [32,33,34,35]. Therefore, this study chooses a single hidden layer structure as the BPNN model structure (Figure 5).
Figure 5 shows n input units and q output units. To simulate the nonlinear characteristics of biological neurons, the output function of BPNN adopts the Sigmoid function:
$$f(x) = \frac{1}{1+e^{-x}}$$
where $x$ represents the feature value of the input layer.
The output of each node in each layer is calculated as follows:
$$y = f\left(\sum_{j=1}^{n} w_{ij}x_{ij} - \theta\right)$$
where $x_{ij}$ is the information the i-th unit receives in the calculation layer, $w_{ij}$ is the connection weight between the i-th unit and the j-th unit in the previous layer, $\theta$ is the threshold, and $y$ is the output. BPNN learning is a cyclic iterative process that proceeds through four steps (Figure 6).
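The Sigmoid activation and the per-node output above can be illustrated in a few lines (a sketch of the formulas, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def node_output(x, w, theta):
    """Output of a single unit: y = f(sum_j w_j * x_j - theta)."""
    return sigmoid(np.dot(w, x) - theta)
```

For example, when the weighted input exactly equals the threshold, the net input is 0 and the unit outputs 0.5, the midpoint of the Sigmoid.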
The global error or energy function of the network generally adopts a specific form:
$$E = \sum_{k=1}^{m}E_k = \frac{1}{2}\sum_{k=1}^{m}\sum_{t=1}^{q}\left(y_t^{k} - c_t\right)^{2}$$
where $E_k$ represents the network learning error corresponding to the k-th pattern pair; $m$ denotes the number of pattern pairs, i.e., the count of pairings between input data and their corresponding target output data in the training dataset (each pattern pair consists of an input sample and its desired output); $q$ represents the number of output units in the network; $y_t^k$ denotes the output value of the t-th output unit for the k-th pattern pair; and $c_t$ represents the desired output value of the t-th output unit [36,37,38].
The correction formula for the connection weights is as follows:
$$w_{ij}(n+1) = w_{ij}(n) + \Delta w_{ij}(n) = w_{ij}(n) - \alpha\frac{\partial E_k}{\partial w_{ij}(n)}$$
where $w_{ij}(n+1)$ denotes the corrected connection weight in the (n + 1)-th iteration, $w_{ij}(n)$ denotes the connection weight in the n-th iteration, $\partial E_k/\partial w_{ij}$ represents the gradient of the error function, and $\alpha$ represents the learning rate [39].
This study comprehensively considers the prediction results of the model and the computation time, refers to Ravindra’s parameter settings [40], and adopts a trial algorithm to determine the hyperparameters of the neural network model. The model is trained and optimized using the Adaptive Moment Estimation (Adam) algorithm, with the optimization process targeting the Root Mean Square Error (RMSE) as the objective function. After trial calculations, the learning rate is set to 0.0001, the maximum number of iterations is set to 1000, and the number of neurons in the hidden layer is set to 6.
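A single-hidden-layer network with the reported hyperparameters (6 hidden neurons, Sigmoid activation, the Adam optimizer, a learning rate of 0.0001, and up to 1000 iterations) might be set up as follows. This is an assumed implementation with synthetic data: the paper does not name a library, and scikit-learn's MLPRegressor minimizes squared error rather than RMSE directly (the two share the same minimizer):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Hypothetical features: precipitation at 5 neighbouring stations
X = rng.random((200, 5))
# Hypothetical target: weighted combination of the neighbours
y = X @ np.array([0.4, 0.2, 0.1, 0.2, 0.1])

model = MLPRegressor(
    hidden_layer_sizes=(6,),    # single hidden layer with 6 neurons
    activation="logistic",      # Sigmoid activation
    solver="adam",              # Adaptive Moment Estimation
    learning_rate_init=1e-4,    # learning rate 0.0001
    max_iter=1000,              # maximum number of iterations
)
model.fit(X, y)
pred = model.predict(X)
```

With such a small learning rate the model may emit a convergence warning at 1000 iterations; in practice one would monitor the validation NSE, as the study does when selecting the best of 1000 training runs.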

3.2.2. Random Forest (RF)

The RF regression model is an ensemble learning algorithm with multiple decision trees (Figure 7). A decision tree is a tree-like structure where each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category [41,42]. The RF regression model builds multiple uncorrelated decision trees by randomly sampling data and features. Each decision tree can produce a prediction result based on the sampled data and features. By averaging the results from all trees, the regression prediction result of the entire forest is obtained, which effectively enhances the overall accuracy and robustness of the model [43].
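The averaging-over-trees step described above can be sketched with scikit-learn's RandomForestRegressor; the ensemble size and the synthetic data below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Hypothetical features from 4 neighbouring stations and a noisy target
X = rng.random((300, 4))
y = X.sum(axis=1) + rng.normal(0.0, 0.05, 300)

# Each of the 100 trees is fit on a bootstrap sample with random feature
# subsets; the forest prediction is the average over all trees.
rf = RandomForestRegressor(n_estimators=100, random_state=1)
rf.fit(X, y)
pred = rf.predict(X)
```

Averaging decorrelated trees reduces the variance of any single tree, which is the robustness gain the text refers to.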

3.2.3. Support Vector Regression (SVR)

SVR is a regression analysis method based on statistical learning theory; its core idea originates from extending SVM, which was developed for classification problems, to regression [44,45,46]. SVR maps input vectors to a high-dimensional feature space and seeks the optimal decision function that achieves the best fit, offering high accuracy and generalization ability [47,48,49]. SVR creates a "margin band" (Figure 8) on both sides of the linear function, with a width of ε (also known as the tolerance deviation, an empirically chosen value). No loss is calculated for samples within this margin band, meaning only the support vectors impact the function model. Finally, an optimized model is derived by minimizing the total loss and maximizing the margin [50,51,52].
The objective function of SVR is as follows:
$$\min\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m}\left(\xi_i + \hat{\xi}_i\right)$$
where $\|\omega\|$ represents the model complexity, $C$ is the regularization parameter, and $\xi_i$ and $\hat{\xi}_i$ are slack variables [53].
Given training samples {(x_i, y_i)}, where x_i denotes the i-th input feature vector and y_i the corresponding target value, introducing Lagrange multipliers transforms Equation (8) into the following:
$$L(\omega,b,\alpha,\hat{\alpha},\xi,\hat{\xi},\mu,\hat{\mu}) = \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{m}\left(\xi_i+\hat{\xi}_i\right) + \sum_{i=1}^{m}\alpha_i\left(f(x_i)-y_i-\varepsilon-\xi_i\right) + \sum_{i=1}^{m}\hat{\alpha}_i\left(y_i-f(x_i)-\varepsilon-\hat{\xi}_i\right) - \sum_{i=1}^{m}\mu_i\xi_i - \sum_{i=1}^{m}\hat{\mu}_i\hat{\xi}_i$$
where $\alpha_i$, $\hat{\alpha}_i$, $\mu_i$, and $\hat{\mu}_i$ are Lagrange coefficients; $\alpha_i$ represents the i-th Lagrange coefficient above the upper edge of the margin band, and $\hat{\alpha}_i$ represents the i-th Lagrange coefficient below the lower edge of the margin band [54].
Setting the partial derivatives of $L(\omega,b,\alpha,\hat{\alpha},\xi,\hat{\xi},\mu,\hat{\mu})$ with respect to $\omega$, $b$, $\xi_i$, and $\hat{\xi}_i$ to zero, the solution of SVR can be expressed as follows:
$$f(x) = \sum_{i=1}^{m}\left(\hat{\alpha}_i-\alpha_i\right)x_i^{T}x + b$$
where $x_i^T$ represents the transpose of the feature vector $x_i$, and $b$ represents the model parameter to be determined.
The kernel function is an extension of the vector inner product space, which transforms nonlinear regression problems into approximately linear regression problems after being converted by the kernel function [55]. By incorporating SVR into the kernel function, the final functional model is obtained, as illustrated below:
$$f(x) = \sum_{i=1}^{m}\left(\hat{\alpha}_i-\alpha_i\right)K(x_i,x) + b$$
where $K(x_i,x_j)$ represents the kernel function.
Common kernel functions include the linear kernel, polynomial kernel, radial basis function kernel, Laplace kernel, and Sigmoid kernel [56]. Their forms are presented sequentially in Equations (12)–(16).
$$K(x_i,x_j) = \langle x_i, x_j\rangle$$
$$K(x_i,x_j) = \left(\gamma\langle x_i, x_j\rangle + c\right)^{n}$$
where $\gamma$ represents the coefficient of the kernel function, with a value greater than 0.
$$K(x_i,x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
where $\sigma$ represents the bandwidth of the radial basis function kernel.
$$K(x_i,x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$$
$$K(x_i,x_j) = \tanh\left(\gamma\langle x_i, x_j\rangle + c\right)$$
SVR models were constructed using these five kernel functions in sequence. Data with a missing rate of 1% were input into each model, and the average values of PCC, NSE, and Pbias were taken after 1000 runs (Figure 9). A comparative analysis of Figure 9 reveals that the linear kernel is simple and effective but performs poorly when dealing with MNR data. The polynomial kernel exhibits moderate performance across multiple tasks; however, it has a larger absolute percentage deviation when handling MNR data, indicating deficiencies in managing extreme missing values. The Laplace and sigmoid kernels demonstrate stable performance without significant advantages, making them suitable as balanced options. The Radial Basis Function (RBF) kernel shows notable advantages in imputing MCR, MR, and MNR data, consistently achieving higher PCC and NSE while maintaining lower absolute percentage deviations.
In summary, due to its powerful nonlinear mapping capability, the RBF kernel can effectively capture complex relationships between variables in data imputation tasks, enhancing prediction accuracy. Therefore, this study selects the radial basis function kernel as the kernel function for SVR. This study employs a trial-and-error method to determine parameter values. The regularization parameter c ranges from 0 to 1, and the kernel function parameter g ranges from 1 to 10.
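An RBF-kernel SVR within the stated parameter ranges (regularization parameter c in (0, 1], kernel parameter g in [1, 10]) might look like the following scikit-learn sketch; the synthetic data and the exact parameter values chosen here are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# Hypothetical features from 3 neighbouring stations and a nonlinear target
X = rng.random((200, 3))
y = np.sin(X[:, 0] * 3.0) + X[:, 1]

svr = SVR(
    kernel="rbf",   # radial basis function kernel, Equation (14)
    C=1.0,          # regularization parameter c, upper end of the (0, 1] range
    gamma=2.0,      # kernel parameter g, within the [1, 10] range
    epsilon=0.1,    # half-width of the margin band
)
svr.fit(X, y)
pred = svr.predict(X)
```

Note that scikit-learn's `gamma` corresponds to 1/(2σ²) in the RBF formula, so the trial-and-error search over g is equivalently a search over the kernel bandwidth σ.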

3.3. Multiple Linear Regression (MLR)

This study constructs multiple artificial intelligence models, including BPNN, RF, and SVR, and combines them through MLR to maximize model accuracy [57,58,59,60]. MLR is a regression analysis model that considers the joint impact of multiple input variables on the output variable [61,62,63,64]. Compared with Simple Linear Regression, MLR can more comprehensively and systematically elucidate the relationships between variables [65].
The expression for the MLR model is as follows:
$$f(x) = k^{T}x + b$$
where $f(x)$ represents the output or response of the model; $x$ is the input vector containing three features, namely the output results of BPNN, RF, and SVR; $k^T$ denotes the feature weights; and $b$ is the intercept of the model [66].
To evaluate the performance of the model, the mean squared error is introduced, and the model optimization objective is formulated as follows:
$$\hat{k} = \arg\min_{\hat{k}}\left(y - X\hat{k}\right)^{T}\left(y - X\hat{k}\right)$$
where $\hat{k}$ represents the linear influence of the independent variables on the dependent variable, and $X$ denotes the augmented design matrix that includes all independent variables and a constant column of ones [67].
Using the least squares method, taking the derivative of the mean squared error with respect to $\hat{k}$, setting it to zero, and assuming that $X^{T}X$ is a full-rank matrix ultimately yields the multiple linear regression model:
$$\hat{k} = \left(X^{T}X\right)^{-1}X^{T}y$$
where $\hat{k}$ collects the least squares estimates $\hat{k}_1$, $\hat{k}_2$, and $\hat{k}_3$ of the parameters $k_1$, $k_2$, and $k_3$ [68]; the imputed values are then obtained as $\hat{y} = X\hat{k}$.
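The MLR combination step can be sketched with the closed-form least squares solution. In the sketch below the three base-model outputs (BPNN, RF, SVR) are simulated as noisy versions of a synthetic truth; this is an illustration of the technique, not the authors' data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical observed precipitation on a validation set
y_true = rng.gamma(0.5, 10.0, 100)
# Hypothetical base-model outputs (columns: BPNN, RF, SVR), each with
# a different error level
base_preds = np.column_stack(
    [y_true + rng.normal(0.0, s, 100) for s in (1.0, 1.5, 2.0)]
)

# Augmented design matrix X: the three base outputs plus a constant
# column of ones for the intercept b
X = np.column_stack([base_preds, np.ones(len(y_true))])

# Closed-form least squares: k_hat = (X^T X)^{-1} X^T y,
# computed stably via lstsq rather than an explicit inverse
k_hat, *_ = np.linalg.lstsq(X, y_true, rcond=None)
combined = X @ k_hat  # MLR-combined imputation
```

Because each individual base model is itself one point in the least squares search space (weight 1 on that column, 0 elsewhere), the combined estimate can never have a larger in-sample squared error than the best single model, which is the theoretical basis for the ensemble's advantage.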

4. Results

4.1. Missing Completely Random (MCR) Precipitation Data

Figure 10 presents a comparative analysis and verification of the data obtained by the various imputation methods against the real data under different missing rates, after randomly deleting N% of the data from the Shaduan meteorological station. Across different missing data rates, all four methods maintained a certain level of high accuracy. MLR demonstrated the most outstanding results for PCC and NSE, with PCC ranging from a minimum of 0.9630 to a maximum of 0.9975, and NSE from a minimum of 0.8240 to a maximum of 0.9265. The Pbias of the BPNN approached zero at higher missing rates, notably reaching 0.3460% at a 30% missing rate. Conversely, the Pbias of the SVR generally performed poorly, peaking at 13.8438% under a 10% missing rate. As the data missing rate increased, the performance of all models was somewhat compromised.
Analyzing Figure 10, it was evident that BPNN exhibited a relatively stable performance with a slight decrease in PCC and NSE as the missing rate increased while maintaining a low Pbias throughout. RF demonstrated stable PCC and NSE values, with Pbias being small at low missing rates and slightly larger at high missing rates. On the other hand, SVR showed significant fluctuations in PCC and NSE as the missing rate increased, and it had the highest Pbias, indicating its sensitivity to data missingness. Under single missing rate conditions, BPNN performed the most stably, with the smallest relative error between its maximum and minimum values. RF exhibited the worst stability in PCC and NSE, characterized by high upper limits and low lower limits, highlighting its extreme nature. The stability of SVR’s Pbias was significantly lower than that of the other methods. The stability of all three methods decreased notably with increasing missing rates. MLR performed excellently at low missing rates, and although its performance declined at higher missing rates, it remained relatively good.

4.2. Missing Completely Random (MCR) Precipitation Data Under the Absence of Concentrated Years

Table 1 shows the average PCC, NSE, and Pbias values obtained by the four methods under different missing rates after randomly deleting N% of consecutive whole years of data at the Shaduan meteorological station. Under varying levels of data missingness, all four methods demonstrated a certain level of accuracy. MLR demonstrated the most outstanding results for PCC and NSE, with PCC ranging from a minimum of 0.9326 to a maximum of 0.9865 and NSE from a minimum of 0.8459 to a maximum of 0.9343. The absolute value of Pbias reached its maximum, 18.9039%, in the BPNN model at a 30% missing rate. The minimum values of PCC and NSE both appeared in the SVR model under a 30% deletion rate, at 0.9148 and 0.8193, respectively. As the missing rate increased, the performance of all models was affected to varying degrees.
The comparative analysis presented in Table 1 indicated that BPNN’s performance remained relatively stable across all missing rates, with PCC and NSE sustained at high levels. Its Pbias performed well under high missing rates, suggesting a certain level of robustness. The PCC and NSE of RF gradually decreased as the missing rate increased, and the performance of Pbias significantly worsened with rising missing rates, indicating higher sensitivity to missing data and poorer robustness. SVR exhibited relatively lower PCC and NSE at lower missing rates, and these metrics significantly declined as the missing rate increased. Its Pbias demonstrated severe fluctuations, indicating moderate imputation performance. MLR maintained the highest PCC and NSE across all missing rates, demonstrating strong stability and robustness. Although its Pbias also deteriorated with increasing missing rates, its performance surpassed other methods.

4.3. Missing Random (MR) Precipitation Data

Table 2 shows the average PCC, NSE, and Pbias values obtained by the four methods under different missing rate conditions after deleting the largest N% of the data from the Shaduan meteorological station. Under varying levels of data missingness, all four methods demonstrated a certain level of accuracy. MLR demonstrated the most outstanding results for PCC and NSE, with PCC ranging from a minimum of 0.8819 to a maximum of 0.9681 and NSE from a minimum of 0.7532 to a maximum of 0.9052. The absolute value of Pbias reached its maximum, 20.9039%, in the SVR model at a 30% missing rate. The minimum values of PCC and NSE appeared in the SVR model under a 30% deletion rate, at 0.8529 and 0.7333, respectively. As the missing rate increased, the performance of all models was affected to varying degrees.
A comparative analysis of Table 2 reveals that the PCC and NSE of BPNN exhibited slight fluctuations, indicating BPNN’s relative insensitivity to variations in missing data rates. However, Pbias performed poorly at higher missing rates, suggesting biases in handling data with high missingness. For RF, PCC, Pbias, and NSE significantly decreased as the missing rate increased; it showed moderate performance at low missing rates but struggled with high missing data. SVR’s PCC and NSE gradually declined with rising missing rates, demonstrating its high sensitivity to missing data. Notably, Pbias deviated significantly from 0 at a 30% missing rate, indicating poor imputation performance by SVR at high missing rates. Conversely, MLR maintained high levels of PCC and NSE across different missing rates, highlighting its robustness to missing data. Additionally, MLR’s Pbias remained low and relatively stable across various missing rates, suggesting that MLR could effectively maintain unbiasedness during the imputation process.

4.4. Missing Not Random (MNR) Precipitation Data

Table 3 shows the average PCC, NSE, and Pbias values for MNR data from Shaduan meteorological station using four methods under different missing rate conditions. Under varying levels of data missingness, all four methods displayed high accuracy. MLR demonstrated the most outstanding results for PCC and NSE, with PCC ranging from a minimum of 0.8475 to a maximum of 0.9372 and NSE from a minimum of 0.7284 to a maximum of 0.8404. SVR consistently showed poorer Pbias performance, reaching −26.1195% at a 30% missing rate. As the missing rate increased, the performance of all models was somewhat impacted.
A comparative analysis of Table 3 reveals that the PCC and NSE of BPNN exhibited greater fluctuations yet consistently outperformed the other two machine learning methods. The Pbias of BPNN, however, rose at higher missing rates, indicating an increased bias in imputing highly incomplete data. RF demonstrated moderate PCC, NSE, and Pbias performance across different missing rates. SVR showed significant variability in PCC, NSE, and Pbias at varying missing rates, with notably deviated Pbias values from 0 at high missing rates, suggesting its substantial sensitivity to missing data and poor robustness. Conversely, MLR maintained the highest PCC and NSE across all missing rates, demonstrating excellent imputation performance. Its Pbias values were also closest to 0 in most cases, indicating good adaptability and robustness of MLR to data with different missing rates.

5. Discussion

Theoretically, MLR can combine the advantages of the three machine learning methods to form a more powerful prediction tool. In practice, as shown in Figure 11, MLR achieves the highest PCC values across all scales, data types, and missing rates, which supports constructing an MLR model on top of BPNN, RF, and SVR in this study. Overall, the points plotted by the various methods on the scatter plots are distributed around the 45° line, with the fitted line close to it, indicating satisfactory imputation performance. Among the three machine learning methods, SVR exhibits the poorest stability under the various missing rates, RF demonstrates moderate stability, and BPNN performs best.
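The ensemble just described is a form of stacked generalization [17]: the three base learners' predictions become the inputs of a multiple linear regression meta-model. A minimal sketch with scikit-learn, on synthetic data and with illustrative hyperparameters (not those tuned in the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Base learners: a feed-forward net trained by backpropagation (BPNN), RF, and SVR.
base_learners = [
    ("bpnn", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("svr", SVR(kernel="rbf", C=10.0)),
]

# The MLR meta-model is fitted on out-of-fold base predictions (cv=5),
# which limits the overfitting that naive in-sample stacking would cause.
mlr_ensemble = StackingRegressor(estimators=base_learners,
                                 final_estimator=LinearRegression(), cv=5)

# Illustrative task: estimate a target station's precipitation from three neighbours.
rng = np.random.default_rng(0)
X = rng.gamma(0.5, 10.0, size=(400, 3))                        # neighbouring stations
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 0.5, 400)  # target station

mlr_ensemble.fit(X[:300], y[:300])
pred = mlr_ensemble.predict(X[300:])
```

In the study's setting, the rows of `X` would be simultaneous observations at donor gauges and `y` the gauge being infilled; the station layout, weights, and noise here are entirely synthetic.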
The PCC and NSE of BPNN decrease as the missing rate increases, yet its overall performance remains relatively stable. Its Pbias is low at lower missing rates, indicating small prediction biases and a certain robustness to missing data. Loh et al. [69] employed BPNN for sediment data imputation and concluded that excessive missing information can exceed the network's compensation capability. BPNN therefore suits scenarios with moderate missing rates where both prediction accuracy and stability are required; its adaptability to missing data allows it to perform well in a range of complex prediction tasks. The PCC and NSE of RF remain relatively stable across different missing rates, but its Pbias varies more, particularly at higher missing rates, where performance becomes less stable and prediction biases grow, indicating sensitivity to data missingness. Memon et al. [70] used RF to impute clinical records and demonstrated high imputation accuracy at lower proportions of missing data. RF is therefore suitable for scenarios with moderate missing rates and less stringent accuracy requirements. The PCC and NSE of SVR vary considerably across missing rates, and its Pbias consistently deviates notably from 0, indicating poor performance in terms of prediction bias. Jafary et al. [71] utilized SVR to impute suburban property prices and observed SVR's sensitivity to noise and outliers in data. Consequently, SVR is better suited to scenarios with low missing rates and a higher tolerance for prediction bias.
In MCR data imputation, MLR not only outperforms other methods in terms of PCC and NSE across all missing rates but also excels particularly at low missing rates, demonstrating its high prediction accuracy and low bias. Its robust imputation performance attests to its broad application potential in MCR data. In MR data imputation, although the accuracy of MLR decreases as missing conditions become more severe, its advantage over the other three machine learning methods becomes more pronounced, especially at high missing rates. It performs exceptionally well on all three evaluation metrics, effectively handling long time-series data. In MNR imputation, due to even more severe missing conditions, the performance of MLR declines, but its PCC and NSE are still much higher than those of other methods, and its Pbias is closest to zero. The imputation results are highly consistent with actual observations, showing strong imputation capability and prediction stability. Since each algorithm possesses unique modeling capabilities and adaptability, MLR, by integrating the strengths of different algorithms, can more comprehensively capture data features, reduce biases inherent in single algorithms, and mitigate overfitting and underfitting issues, thereby achieving significant improvements in prediction performance.
The MLR exhibits excellent and stable imputation performance across the different missing types and rates. Its high prediction accuracy, low bias, and wide applicability make it a preferred method for meteorological data imputation. Mohammadinia et al. [72], using ANN, SVR, and RF algorithms for shale volume estimation, found that the neural network and RF models exhibited comparable prediction accuracy, both outperforming SVR, which aligns with the relative performance of the three machine learning methods observed in our study. Additionally, Li et al.'s [44] simulation of non-Gaussian fluctuating wind pressure using BPNN, RF, and SVR revealed that BPNN and RF achieved better training and testing results, followed by SVR and LSTM neural network models, further supporting the reliability of our findings.
Based on the imputation results, we assessed whether the annual average precipitation of the watershed after imputation was approximately equal to the observed value (Figure 12). For MCR data, the differences in annual average precipitation among the four methods were very small at low missing rates; as the missing rate increased, errors emerged and the differences grew, reaching a maximum of 102.55 mm while remaining at a relatively low level overall. For MR data, the discrepancies became more evident as the missing conditions worsened, peaking at 143.75 mm. For MNR data, the difference reached a maximum of 168.04 mm at a 30% missing rate. Across the three data types and five missing rates, the errors of MLR were the lowest or nearly the lowest and grew relatively slowly, indicating that MLR is more reliable for data imputation over long time scales. According to the Thiessen polygon calculations, the annual average precipitation in the study area was 1684 mm, and the error rates of the annual average precipitation for the four imputation methods ranged from 1.6030% to 9.9767%, which is still relatively low.
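The basin-average check can be reproduced once Thiessen weights are known: each station's annual total is weighted by its polygon's share of the basin area. A sketch with hypothetical station names, weights, and totals (none of these are the study's actual values):

```python
# Hypothetical Thiessen-polygon weights (each station's share of the basin area).
weights = {"station_A": 0.35, "station_B": 0.40, "station_C": 0.25}

# Hypothetical annual precipitation totals (mm), observed vs. after imputation.
observed = {"station_A": 1650.0, "station_B": 1720.0, "station_C": 1660.0}
imputed  = {"station_A": 1648.0, "station_B": 1735.0, "station_C": 1652.0}

def basin_average(totals, weights):
    """Area-weighted (Thiessen) basin-average precipitation in mm."""
    return sum(weights[s] * totals[s] for s in weights)

obs_avg = basin_average(observed, weights)
imp_avg = basin_average(imputed, weights)
error_rate = abs(imp_avg - obs_avg) / obs_avg * 100  # percent error, as in Figure 12
```

The same weighted sum underlies Figure 12: a small station-level imputation error translates into a small basin-average error in proportion to that station's Thiessen weight.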
Although this study achieved remarkable results in filling gaps in meteorological data for the Jiaojiang River basin, several limitations remain: (1) With the rapid advancement of machine learning, more advanced algorithms may perform better at filling meteorological data gaps. (2) The optimization of algorithm parameters in this study relied on empirical formulas and trial-and-error, lacking a systematic tuning process; more advanced optimization algorithms could be adopted to enhance model performance further. (3) The four methods performed well in a subtropical monsoon climate region, but comprehensive research verifying their effectiveness under different climatic conditions, such as arid regions, is still lacking. (4) The current study covered only precipitation, whereas the hydrometeorological system is complex and multidimensional, with other meteorological factors such as temperature, wind speed, and humidity significantly influencing hydrological processes. Future research should integrate more dimensions of meteorological data for a comprehensive analysis.

6. Conclusions

The present study utilized the MLR model integrated with three machine learning algorithms (BPNN, RF, and SVR) to impute precipitation data for the Jiaojiang River Basin in southeastern China, evaluating the imputation effectiveness across various data missing rates and temporal scales. The key conclusions drawn from this research are as follows:
(1)
Across all missing types and rates, BPNN achieved an average PCC of 0.9316 and an average NSE of 0.8334; RF attained an average PCC of 0.9286 and an average NSE of 0.8320; and SVR recorded an average PCC of 0.9196 and an average NSE of 0.8183. While the Pbias values of the three methods were relatively similar, BPNN and RF demonstrated superior imputation performance compared to SVR. Among the three, SVR exhibited the lowest stability, whereas BPNN showed the highest;
(2)
BPNN is suitable for scenarios where the data missing rate is moderate and there are specific requirements for prediction accuracy and stability. On the other hand, RF is more appropriate for situations with moderate data missing rates and less stringent requirements for prediction accuracy. Conversely, SVR is best suited to contexts with low data missing rates and a high tolerance for prediction biases;
(3)
Compared to the individual machine learning methods, the developed MLR imputation method achieved an average PCC of 0.9762, an average NSE of 0.8483, and an average Pbias of 0.9236% across all missing types and rates. The methodology offers higher PCC and NSE values and a lower Pbias, thereby improving the accuracy and reliability of hydrometeorological missing data imputation. Consequently, it fosters the intelligent development of precipitation data processing and provides more reliable, high-quality data support for research and decision-making in hydrometeorology.

Author Contributions

Conceptualization, H.Q. and H.C.; methodology, B.X.; software, G.L.; validation, S.H.; formal analysis, H.N.; investigation, H.C.; resources, H.X.; data curation, H.Q.; writing—original draft preparation, H.Q.; writing—review and editing, H.C.; visualization, S.H.; supervision, H.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Natural Science Foundation, grant numbers ZCLQ24E0901 and LZJWY22E090007; the Scientific Research Fund of Zhejiang Provincial Education Department, grant number Y202352492; the Huzhou Science and Technology Plan Project, grant number 2023GZ64; and the Nanxun Scholars Program for Young Scholars of ZJWEU, grant number RC2022021137.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We are grateful to the Zhejiang Hydrological Management Center for providing hydrological and meteorological data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Singh, S.K.; McMillan, H.; Bárdossy, A. Use of the Data Depth Function to Differentiate between Case of Interpolation and Extrapolation in Hydrological Model Prediction. J. Hydrol. 2013, 477, 213–228. [Google Scholar] [CrossRef]
  2. Mendez, M.; Calvo-Valverde, L. Assessing the Performance of Several Rainfall Interpolation Methods as Evaluated by a Conceptual Hydrological Model. Procedia Eng. 2016, 154, 1050–1057. [Google Scholar] [CrossRef]
  3. McLaughlin, D. An Integrated Approach to Hydrologic Data Assimilation: Interpolation, Smoothing, and Filtering. Adv. Water Res. 2002, 25, 1275–1286. [Google Scholar] [CrossRef]
  4. de la Calle Ysern, B.; Galán del Sastre, P. A Lagrange Interpolation with Preprocessing to Nearly Eliminate Oscillations. Numerical Algorithms; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar] [CrossRef]
  5. Meng, Z.; Wang, Y.; Zheng, S.; Wang, X.; Liu, D.; Zhang, J.; Shao, Y. Abnormal Monitoring Data Detection Based on Matrix Manipulation and the Cuckoo Search Algorithm. Mathematics 2024, 12, 1345. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Vaze, J.; Chiew, F.H.S.; Teng, J.; Li, M. Predicting Hydrological Signatures in Ungauged Catchments Using Spatial Interpolation, Index Model, and Rainfall–Runoff Modelling. J. Hydrol. 2014, 517, 936–948. [Google Scholar] [CrossRef]
  7. Wangwongchai, A.; Waqas, M.; Dechpichai, P.; Hlaing, P.T.; Ahmad, S.; Humphries, U.W. Imputation of Missing Daily Rainfall Data; A Comparison between Artificial Intelligence and Statistical Techniques. MethodsX 2023, 11, 102459. [Google Scholar] [CrossRef]
  8. Yagci Sokat, K.; Dolinskaya, I.S.; Smilowitz, K.; Bank, R. Incomplete Information Imputation in Limited Data Environments with Application to Disaster Response. Eur. J. Oper. Res. 2018, 269, 466–485. [Google Scholar] [CrossRef]
  9. Yaseen, Z.M.; Jaafar, O.; Deo, R.C.; Kisi, O.; Adamowski, J.; Quilty, J.; El-Shafie, A. Streamflow Forecasting Using Extreme Learning Machines: A Case Study in a Semi-Arid Region in Iraq. J. Hydrol. 2016, 542, 603–614. [Google Scholar] [CrossRef]
  10. Naganna, S.R.; Marulasiddappa, S.B.; Balreddy, M.S.; Yaseen, Z.M. Daily Scale Streamflow Forecasting in Multiple Stream Orders of Cauvery River, India: Application of Advanced Ensemble and Deep Learning Models. J. Hydrol. 2023, 626, 130320. [Google Scholar] [CrossRef]
  11. Zhu, C.; Li, G.; Luis, N.V.J.; Dong, W.; Wang, L. Optimization of RF to Alloy Elastic Modulus Prediction Based on Cuckoo Algorithm. Comp. Mater. Sci. 2024, 231, 112515. [Google Scholar] [CrossRef]
  12. Shortridge, J.E.; Guikema, S.D.; Zaitchik, B.F. Machine Learning Methods for Empirical Streamflow Simulation: A Comparison of Model Accuracy, Interpretability, and Uncertainty in Seasonal Watersheds. Hydrol. Earth Syst. Sci. 2016, 20, 2611–2628. [Google Scholar] [CrossRef]
  13. Lee, T.; Shin, J.-Y.; Kim, J.-S.; Singh, V.P. Stochastic Simulation on Reproducing Long-Term Memory of Hydroclimatological Variables Using Deep Learning Model. J. Hydrol. 2020, 582, 124540. [Google Scholar] [CrossRef]
  14. Arroyo, Á.; Herrero, Á.; Tricio, V.; Corchado, E.; Woźniak, M. Neural Models for Imputation of Missing Ozone Data in Air-Quality Datasets. Complexity 2018, 1, 7238015. [Google Scholar] [CrossRef]
  15. Rumaling, M.I.; Chee, F.P.; Dayou, J.; Chang, J.H.W.; Kong, S.S.K.; Sentian, J. Missing Value Imputation for PM10 Concentration in Sabah Using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm. Asian J. Atmos. Env. 2020, 14, 62–72. [Google Scholar] [CrossRef]
  16. Džeroski, S.; Ženko, B. Is Combining Classifiers with Stacking Better than Selecting the Best One? Mach. Learn. 2004, 54, 255–273. [Google Scholar] [CrossRef]
  17. Wolpert, D.H. Stacked Generalization. Neural Networks 1992, 5, 241–259. [Google Scholar] [CrossRef]
  18. Breiman, L. Stacked Regressions. Mach. Learn. 1996, 24, 49–64. [Google Scholar] [CrossRef]
  19. Dai, H.; Wang, Y.; Fu, R.; Ye, S.; He, X.; Luo, S.; Jin, W. Radiomics and Stacking Regression Model for Measuring Bone Mineral Density Using Abdominal Computed Tomography. Acta. Radiol. 2023, 64, 228–236. [Google Scholar] [CrossRef]
  20. Lin, R.; Naselaris, T.; Kay, K.; Wehbe, L. Stacked Regressions and Structured Variance Partitioning for Interpretable Brain Maps. NeuroImage 2024, 298, 120772. [Google Scholar] [CrossRef]
  21. Chiu, P.C.; Selamat, A.; Krejcar, O. Infilling Missing Rainfall and Runoff Data for Sarawak, Malaysia Using Gaussian Mixture Model Based K-Nearest Neighbor Imputation. In Proceedings of the Advances and Trends in Artificial Intelligence. From Theory to Practice: 32nd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2019, Graz, Austria, 9–11 July 2019; Wotawa, F., Friedrich, G., Pill, I., Koitz-Hristov, R., Ali, M., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 27–38. [Google Scholar]
  22. Hamzah, F.B.; Mohd Hamzah, F.; Mohd Razali, S.F.; Samad, H. A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies. Civ. Eng. J. 2021, 7, 1608–1619. [Google Scholar] [CrossRef]
  23. Nor, S.M.C.M.; Shaharudin, S.M.; Ismail, S.; Zainuddin, N.H.; Tan, M.L. A Comparative Study of Different Imputation Methods for Daily Rainfall Data in East-Coast Peninsular Malaysia. Bull. Electr. Eng. Inform. 2020, 9, 635–643. [Google Scholar] [CrossRef]
  24. Chen, H.; Huang, S.; Xu, Y.-P.; Teegavarapu, R.S.V.; Guo, Y.; Nie, H.; Xie, H.; Zhang, L. River Ecological Flow Early Warning Forecasting Using Baseflow Separation and Machine Learning in the Jiaojiang River Basin, Southeast China. Sci. Total Environ. 2023, 882, 163571. [Google Scholar] [CrossRef] [PubMed]
  25. Melgar-García, L.; Gutiérrez-Avilés, D.; Rubio-Escudero, C.; Troncoso, A. A Novel Distributed Forecasting Method Based on Information Fusion and Incremental Learning for Streaming Time Series. Inform. Fusion 2023, 95, 163–173. [Google Scholar] [CrossRef]
  26. Maheswaran, R.; Khosa, R. Wavelet–Volterra Coupled Model for Monthly Stream Flow Forecasting. J. Hydrol. 2012, 450–451, 320–335. [Google Scholar] [CrossRef]
  27. Lv, N.; Liang, X.; Chen, C.; Zhou, Y.; Li, J.; Wei, H.; Wang, H. A Long Short-Term Memory Cyclic Model with Mutual Information for Hydrology Forecasting: A Case Study in the Xixian Basin. Adv. Water Res. 2020, 141, 103622. [Google Scholar] [CrossRef]
  28. Liu, C.; Xie, T.; Li, W.; Hu, C.; Jiang, Y.; Li, R.; Song, Q. Research on Machine Learning Hybrid Framework by Coupling Grid-Based Runoff Generation Model and Runoff Process Vectorization for Flood Forecasting. J. Environ. Manag. 2024, 364, 121466. [Google Scholar] [CrossRef]
  29. Zhong, W.L.; Ding, H.; Zhao, X.; Fan, L.F. Mechanical Properties Prediction of Geopolymer Concrete Subjected to High Temperature by BP Neural Network. Constr. Build. Mater. 2023, 409, 133780. [Google Scholar] [CrossRef]
  30. Yang, Z.; Mao, L.; Yan, B.; Wang, J.; Gao, W. Performance Analysis and Prediction of Asymmetric Two-Level Priority Polling System Based on BP Neural Network. Appl. Soft Comput. 2021, 99, 106880. [Google Scholar] [CrossRef]
  31. Yang, J.; Meng, C.; Ling, L. Prediction and Simulation of Wearable Sensor Devices for Sports Injury Prevention Based on BP Neural Network. Meas. Sens. 2024, 33, 101104. [Google Scholar] [CrossRef]
  32. Wu, Y.; Li, A.; Lei, S.; Zhang, T.; Deng, Q.; Tang, H.; Yao, H. Prediction of Pyrolysis Product Yield of Medical Waste Based on BP Neural Network. Process Saf. Environ. Prot. 2023, 176, 653–661. [Google Scholar] [CrossRef]
  33. Wen, J.; Chen, X.; Li, X.; Li, Y. SOH Prediction of Lithium Battery Based on IC Curve Feature and BP Neural Network. Energy 2022, 261, 125234. [Google Scholar] [CrossRef]
  34. Sahoo, G.B.; Ray, C. Flow Forecasting for a Hawaii Stream Using Rating Curves and Neural Networks. J. Hydrol. 2006, 317, 63–80. [Google Scholar] [CrossRef]
  35. Pulido-Calvo, I.; Portela, M.M. Application of Neural Approaches to One-Step Daily Flow Forecasting in Portuguese Watersheds. J. Hydrol. 2007, 332, 1–15. [Google Scholar] [CrossRef]
  36. Huang, X.; You, Y.; Zeng, X.; Liu, Q.; Dong, H.; Qian, M.; Xiao, S.; Yu, L.; Hu, X. Back Propagation Artificial Neural Network (BP-ANN) for Prediction of the Quality of Gamma-Irradiated Smoked Bacon. Food Chem. 2024, 437, 137806. [Google Scholar] [CrossRef]
  37. Feng, L.; Hong, W. On Hydrologic Calculation Using Artificial Neural Networks. Appl. Math. Lett. 2008, 21, 453–458. [Google Scholar] [CrossRef]
  38. Ding, C.; Feng, S.; Qiao, Z.; Zhu, H.; Zhou, Z.; Piao, Z. Experimental Prediction Model for the Running-in State of a Friction System Based on Chaotic Characteristics and BP Neural Network. Tribol. Int. 2023, 188, 108846. [Google Scholar] [CrossRef]
  39. Ravindra, B.V.; Sriraam, N.; Geetha, M. Chronic Kidney Disease Detection Using Back Propagation Neural Network Classifier. In Proceedings of the 2018 International Conference on Communication, Computing and Internet of Things (IC3IoT), Chennai, India, 15–17 February 2018; pp. 65–68. [Google Scholar]
  40. Xue, Z.; Yi, X.; Feng, W.; Kong, L.; Wu, M. Prediction and Mapping of Soil Thickness in Alpine Canyon Regions Based on Whale Optimization Algorithm Optimized Random Forest: A Case Study of Baihetan Reservoir Area in China. Comput. Geosci. 2024, 191, 105667. [Google Scholar] [CrossRef]
  41. Wang, M.; Zhao, G.; Wang, S. Hybrid Random Forest Models Optimized by Sparrow Search Algorithm (SSA) and Harris Hawk Optimization Algorithm (HHO) for Slope Stability Prediction. Transp. Geotech. 2024, 48, 101305. [Google Scholar] [CrossRef]
  42. Wang, F.; Liu, R.; Hao, Y.; Liu, D.; Han, L.; Yuan, S. Ground Visibility Prediction Using Tree-Based and Random-Forest Machine Learning Algorithm: Comparative Study Based on Atmospheric Pollution and Atmospheric Boundary Layer Data. Atmos. Pollut. Res. 2024, 15, 102270. [Google Scholar] [CrossRef]
  43. Shen, Y.; Ruijsch, J.; Lu, M.; Sutanudjaja, E.H.; Karssenberg, D. Random Forests-Based Error-Correction of Streamflow from a Large-Scale Hydrological Model: Using Model State Variables to Estimate Error Terms. Comput. Geosci. 2022, 159, 105019. [Google Scholar] [CrossRef]
  44. Li, J.; Zhu, D.; Li, C. Comparative Analysis of BPNN, SVR, LSTM, Random Forest, and LSTM-SVR for Conditional Simulation of Non-Gaussian Measured Fluctuating Wind Pressures. Mech. Syst. Sig. Process. 2022, 178, 109285. [Google Scholar] [CrossRef]
  45. Desai, S.; Ouarda, T.B.M.J. Regional Hydrological Frequency Analysis at Ungauged Sites with Random Forest Regression. J. Hydrol. 2021, 594, 125861. [Google Scholar] [CrossRef]
  46. Li, X.; Song, J.; Yang, L.; Li, H.; Fang, S. Source Term Inversion Coupling Kernel Principal Component Analysis, Whale Optimization Algorithm, and Backpropagation Neural Networks (KPCA-WOA-BPNN) for Complex Dispersion Scenarios. Prog. Nucl. Energy 2024, 171, 105171. [Google Scholar] [CrossRef]
  47. Zhou, J.; Lu, Y.; Tian, Q.; Liu, H.; Hasanipanah, M.; Huang, J. Advanced Machine Learning Methods for Prediction of Blast-Induced Flyrock Using Hybrid SVR Methods. Comput. Model. Eng. Sci. 2024, 140, 1595–1617. [Google Scholar] [CrossRef]
  48. Redekar, A.; Dhiman, H.S.; Deb, D.; Muyeen, S.M. On Reliability Enhancement of Solar PV Arrays Using Hybrid SVR for Soiling Forecasting Based on WT and EMD Decomposition Methods. Ain Shams Eng. J. 2024, 15, 102716. [Google Scholar] [CrossRef]
  49. Iqbal, M.; Salami, B.A.; Khan, M.A.; Jalal, F.E.; Jamal, A.; Lekhraj; Bardhan, A. Computational Approach towards Shear Strength Prediction of Squat RC Walls Implementing Ensemble and Hybrid SVR Paradigms. Mater. Today Commun. 2024, 40, 109921. [Google Scholar] [CrossRef]
  50. Gandhi, A.B.; Joshi, J.B.; Jayaraman, V.K.; Kulkarni, B.D. Development of Support Vector Regression (SVR)-Based Correlation for Prediction of Overall Gas Hold-up in Bubble Column Reactors for Various Gas–Liquid Systems. Chem. Eng. Sci. 2007, 62, 7078–7089. [Google Scholar] [CrossRef]
  51. Fan, G.-F.; Peng, L.-L.; Hong, W.-C.; Sun, F. Electric Load Forecasting by the SVR Model with Differential Empirical Mode Decomposition and Auto Regression. Neurocomputing 2016, 173, 958–970. [Google Scholar] [CrossRef]
  52. Chen, Y.; Xu, P.; Chu, Y.; Li, W.; Wu, Y.; Ni, L.; Bao, Y.; Wang, K. Short-Term Electrical Load Forecasting Using the Support Vector Regression (SVR) Model to Calculate the Demand Response Baseline for Office Buildings. Appl. Energy 2017, 195, 659–670. [Google Scholar] [CrossRef]
  53. Castro-Neto, M.; Jeong, Y.-S.; Jeong, M.-K.; Han, L.D. Online-SVR for Short-Term Traffic Flow Prediction under Typical and Atypical Traffic Conditions. Expert Syst. Appl. 2009, 36, 6164–6173. [Google Scholar] [CrossRef]
  54. Balogun, A.-L.; Rezaie, F.; Pham, Q.B.; Gigović, L.; Drobnjak, S.; Aina, Y.A.; Panahi, M.; Yekeen, S.T.; Lee, S. Spatial Prediction of Landslide Susceptibility in Western Serbia Using Hybrid Support Vector Regression (SVR) with GWO, BAT and COA Algorithms. Geosci. Front. 2021, 12, 101104. [Google Scholar] [CrossRef]
  55. Ahmad, M.S.; Adnan, S.M.; Zaidi, S.; Bhargava, P. A Novel Support Vector Regression (SVR) Model for the Prediction of Splice Strength of the Unconfined Beam Specimens. Constr. Build. Mater. 2020, 248, 118475. [Google Scholar] [CrossRef]
  56. Beniwal, M.; Singh, A.; Kumar, N. Forecasting Long-Term Stock Prices of Global Indices: A Forward-Validating Genetic Algorithm Optimization Approach for Support Vector Regression. Appl. Soft Comput. 2023, 145, 110566. [Google Scholar] [CrossRef]
  57. Zhu, W.; Yu, W.; Dong, X.; Jin, Z.; Hu, S. Multiple Linear Regression Analysis of Vertical Distribution of Nearshore Suspended Sediment. Desalin. Water Treat. 2023, 314, 352–358. [Google Scholar] [CrossRef]
  58. Zhao, C.; Li, N.; Jiang, Z.; Zhou, X.; Wu, Y. Parametric Optimization of Ambient and Cryogenic Loop Heat Pipes Using Multiple Linear Regression Method. Int. J. Refrig 2024, 161, 145–163. [Google Scholar] [CrossRef]
  59. Zhang, T.; Wang, G.A.; He, Z.; Mukherjee, A. Service Failure Monitoring via Multivariate Multiple Linear Regression Profile Schemes with Dimensionality Reduction. Decis. Support Syst. 2024, 178, 114122. [Google Scholar] [CrossRef]
  60. Shortridge, J. Prediction of Multi-Sectoral Longitudinal Water Withdrawals Using Hierarchical Machine Learning Models. J. Hydroinf. 2023, 25, 2389–2405. [Google Scholar] [CrossRef]
  61. Ravichandran, C.; Gopalakrishnan, P. Estimating Cooling Loads of Indian Residences Using Building Geometry Data and Multiple Linear Regression. Energy Built Environ. 2024, 5, 741–771. [Google Scholar] [CrossRef]
  62. Osmane, A.; Zidan, K.; Benaddi, R.; Sbahi, S.; Ouazzani, N.; Belmouden, M.; Mandi, L. Assessment of the Effectiveness of a Full-Scale Trickling Filter for the Treatment of Municipal Sewage in an Arid Environment: Multiple Linear Regression Model Prediction of Fecal Coliform Removal. J. Water Process Eng. 2024, 64, 105684. [Google Scholar] [CrossRef]
  63. Li, B.; Lu, Y.; Sun, X.; Chen, X.; Gong, W.; Miao, F. Radial Artery Pulse Wave Age-Related Assessment for Diabetic Patients Based on Multiple Linear Regression Time Domain Analysis Method. Extrem. Mech. Lett. 2024, 70, 102185. [Google Scholar] [CrossRef]
  64. Chen, H.; Huang, S.; Xu, Y.-P.; Teegavarapu, R.S.V.; Guo, Y.; Nie, H.; Xie, H. Using Baseflow Ensembles for Hydrologic Hysteresis Characterization in Humid Basins of Southeastern China. Water Resour. Res. 2024, 60, e2023WR036195. [Google Scholar] [CrossRef]
  65. Flores-Sosa, M.; León-Castro, E.; Merigó, J.M.; Yager, R.R. Forecasting the Exchange Rate with Multiple Linear Regression and Heavy Ordered Weighted Average Operators. Knowl.-Based Syst. 2022, 248, 108863. [Google Scholar] [CrossRef]
  66. Ferreira Schon, A.; Apoena Castro, N.; dos Santos Barros, A.; Eduardo Spinelli, J.; Garcia, A.; Cheung, N.; Luiz Silva, B. Multiple Linear Regression Approach to Predict Tensile Properties of Sn-Ag-Cu (SAC) Alloys. Mater. Lett. 2021, 304, 130587. [Google Scholar] [CrossRef]
  67. Bhagawati, P.B.; Kumar, K.H.S.; Lokeshappa, B.; Malekdar, F.; Sapate, S.; Adeogun, A.I.; Chapi, S.; Goswami, L.; Mirkhalafi, S.; Sillanpää, M. Prediction of Electrocoagulation Treatment of Tannery Wastewater Using Multiple Linear Regression Based ANN: Comparative Study on Plane and Punched Electrodes. Desalin. Water Treat. 2024, 319, 100530. [Google Scholar] [CrossRef]
  68. AlKheder, S. Experimental Road Safety Study of the Actual Driver Reaction to the Street Ads Using Eye Tracking, Multiple Linear Regression and Decision Trees Methods. Expert Syst. Appl. 2024, 252, 124222. [Google Scholar] [CrossRef]
  69. Loh, W.S.; Ling, L.; Chin, R.J.; Lai, S.H.; Loo, K.K.; Seah, C.S. A Comparative Analysis of Missing Data Imputation Techniques on Sedimentation Data. Ain Shams Eng. J. 2024, 15, 102717. [Google Scholar] [CrossRef]
  70. Memon, S.M.Z.; Wamala, R.; Kabano, I.H. A Comparison of Imputation Methods for Categorical Data. Inform. Med. Unlocked 2023, 42, 101382. [Google Scholar] [CrossRef]
  71. Jafary, P.; Shojaei, D.; Rajabifard, A.; Ngo, T. Automating Property Valuation at the Macro Scale of Suburban Level: A Multi-Step Method Based on Spatial Imputation Techniques, Machine Learning and Deep Learning. Habitat Int. 2024, 148, 103075. [Google Scholar] [CrossRef]
  72. Mohammadinia, F.; Ranjbar, A.; Kafi, M.; Shams, M.; Haghighat, F.; Maleki, M. Shale Volume Estimation Using ANN, SVR, and RF Algorithms Compared with Conventional Methods. J. Afr. Earth Sci. 2023, 205, 104991. [Google Scholar] [CrossRef]
  73. Li, J.-Q.; Xia, X.-L.; Sun, C.; Chen, X. Estimation of Time-Dependent Laser Heat Flux Distribution Based on BPNN Improved by Multiple Population Genetic Algorithm. Int. J. Heat Mass Tran. 2024, 233, 125997. [Google Scholar] [CrossRef]
Figure 1. Geographical location map of the study area.
Figure 2. Weight allocation results of basin surface precipitation.
Figure 3. The flowchart of this study.
Figure 4. Methods for setting up training and validation sets.
Figure 5. Single hidden layer BPNN model structure.
Figure 6. The application process of the BPNN model.
Figure 7. The application process of the RF model.
Figure 8. The principle of the SVR model.
Figure 9. Comparison of the effectiveness of SVR models using various kernel functions for filling different missing data types. Panels (a–c) present the PCC, NSE, and Pbias evaluation results, respectively.
Figure 10. Panels (a–t) show the simulation performance of the four models at the Shaduan meteorological station with precipitation missing rates of 1%, 5%, 10%, 20%, and 30%, respectively.
Figure 11. Comparison of PCC test results for four imputation methods. RD, RDC, MR, and MNR represent interpolation of precipitation data under completely random missing conditions, interpolation of precipitation data under the absence of concentrated years condition, interpolation of precipitation data under random missing conditions, and interpolation of precipitation data under not random missing condition, respectively. The numbers indicate the missing rates. For example, RD10 represents randomly deleting 10% of data.
Figure 12. The absolute value of the difference between the annual average precipitation of the interpolated data in the research area and the actual observed annual average precipitation. RD, RDC, MR, and MNR represent interpolation of precipitation data under completely random missing conditions, interpolation of precipitation data under the absence of concentrated years condition, interpolation of precipitation data under random missing conditions, and interpolation of precipitation data under not random missing condition, respectively.
Table 1. Verification of interpolation results using four interpolation methods under different missing rate conditions after randomly deleting N% of continuous annual data at Shaduan meteorological station.
| Deletion Rate | Evaluating Indicator | BPNN | RF | SVR | MLR |
|---|---|---|---|---|---|
| 1% | PCC | 0.9752 | 0.9793 | 0.9688 | 0.9865 |
| | NSE | 0.9280 | 0.9299 | 0.9107 | 0.9343 |
| | Pbias (%) | 4.5114 | 6.5158 | 7.0625 | 6.2700 |
| 5% | PCC | 0.9627 | 0.9673 | 0.9525 | 0.9710 |
| | NSE | 0.8890 | 0.8989 | 0.8788 | 0.9104 |
| | Pbias (%) | 8.4705 | 8.8997 | 9.8305 | 9.4499 |
| 10% | PCC | 0.9394 | 0.9604 | 0.9257 | 0.9627 |
| | NSE | 0.8553 | 0.8720 | 0.8193 | 0.8850 |
| | Pbias (%) | 8.4912 | 7.6533 | 8.4689 | 7.1187 |
| 20% | PCC | 0.9378 | 0.9329 | 0.9214 | 0.9443 |
| | NSE | 0.8346 | 0.8261 | 0.8253 | 0.8615 |
| | Pbias (%) | -7.3208 | -12.5162 | -13.3759 | -12.4827 |
| 30% | PCC | 0.9261 | 0.9230 | 0.9148 | 0.9326 |
| | NSE | 0.8270 | 0.8352 | 0.8193 | 0.8459 |
| | Pbias (%) | -8.6826 | -14.0482 | -18.9039 | -14.9972 |
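Deletion scenarios of the kind evaluated in the tables (completely random deletion, deletion of the largest values, and deletion of a continuous block of records) can be sketched as below. This is an illustrative NumPy sketch under assumed conditions: the synthetic gamma-distributed daily series and all function names are hypothetical, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in daily precipitation series (gamma-distributed, many near-zero days)
daily = rng.gamma(shape=0.5, scale=8.0, size=3650)

def delete_completely_random(x, rate, rng):
    """MCR-style scenario: delete a fraction `rate` of values at random positions."""
    y = x.astype(float).copy()
    idx = rng.choice(len(x), size=round(rate * len(x)), replace=False)
    y[idx] = np.nan
    return y

def delete_largest(x, rate):
    """Delete the largest N% of values (missingness tied to magnitude)."""
    y = x.astype(float).copy()
    k = round(rate * len(x))
    if k:
        y[np.argsort(x)[-k:]] = np.nan
    return y

def delete_continuous_block(x, rate, rng):
    """Delete one continuous run of records (gap concentrated in consecutive data)."""
    y = x.astype(float).copy()
    k = round(rate * len(x))
    start = rng.integers(0, len(x) - k + 1)
    y[start:start + k] = np.nan
    return y

mcr10 = delete_completely_random(daily, 0.10, rng)
print(int(np.isnan(mcr10).sum()))  # 365, i.e., 10% of 3650 records deleted
```

The imputed series produced by each model can then be compared against the values held out by these deletions.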
Table 2. Verification of the results of the four interpolation methods under different missing-rate conditions after deleting the largest N% of the data at the Shaduan meteorological station.
| Deletion Rate | Evaluating Indicator | BPNN | RF | SVR | MLR |
|---|---|---|---|---|---|
| 1% | PCC | 0.9503 | 0.9576 | 0.9482 | 0.9681 |
| | NSE | 0.8966 | 0.8975 | 0.8840 | 0.9052 |
| | Pbias (%) | 7.1615 | 7.1913 | 8.1636 | 6.1896 |
| 5% | PCC | 0.9407 | 0.9431 | 0.9375 | 0.9542 |
| | NSE | 0.8597 | 0.8636 | 0.8333 | 0.8674 |
| | Pbias (%) | 9.1566 | 9.9846 | 10.4631 | 9.4968 |
| 10% | PCC | 0.9179 | 0.9103 | 0.9006 | 0.9254 |
| | NSE | 0.8058 | 0.8041 | 0.7858 | 0.8175 |
| | Pbias (%) | 11.1956 | 10.1653 | 9.4689 | 9.1187 |
| 20% | PCC | 0.9035 | 0.8901 | 0.8832 | 0.9117 |
| | NSE | 0.7672 | 0.7605 | 0.7569 | 0.7735 |
| | Pbias (%) | -12.3208 | -15.5162 | -17.3759 | -16.4827 |
| 30% | PCC | 0.8761 | 0.8675 | 0.8529 | 0.8819 |
| | NSE | 0.7496 | 0.7341 | 0.7333 | 0.7532 |
| | Pbias (%) | 12.6826 | 18.0482 | 20.9039 | 17.9972 |
Table 3. Verification of the results of the four interpolation methods under different missing-rate conditions after not-random (MNR) deletion of the data at the Shaduan meteorological station.
| Deletion Rate | Evaluating Indicator | BPNN | RF | SVR | MLR |
|---|---|---|---|---|---|
| 1% | PCC | 0.9297 | 0.9228 | 0.9195 | 0.9372 |
| | NSE | 0.8361 | 0.8232 | 0.8187 | 0.8404 |
| | Pbias (%) | -8.0389 | -8.2487 | -12.4943 | -9.8311 |
| 5% | PCC | 0.8991 | 0.8916 | 0.8738 | 0.9068 |
| | NSE | 0.8168 | 0.8101 | 0.7927 | 0.8211 |
| | Pbias (%) | 12.2070 | 12.5836 | 14.3294 | 13.5068 |
| 10% | PCC | 0.8811 | 0.8843 | 0.8639 | 0.8993 |
| | NSE | 0.7872 | 0.7951 | 0.7664 | 0.8098 |
| | Pbias (%) | 14.1420 | 15.4048 | 18.2364 | 15.1722 |
| 20% | PCC | 0.8663 | 0.8685 | 0.8516 | 0.8720 |
| | NSE | 0.7482 | 0.7475 | 0.7310 | 0.7527 |
| | Pbias (%) | -17.7356 | -15.3534 | -21.7856 | -16.4038 |
| 30% | PCC | 0.8269 | 0.8324 | 0.8158 | 0.8475 |
| | NSE | 0.7125 | 0.7163 | 0.7007 | 0.7284 |
| | Pbias (%) | -19.2989 | -23.0642 | -26.1195 | -21.0511 |
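The three verification indicators reported in the tables (PCC, NSE, and Pbias) follow their standard definitions and can be computed as in the minimal NumPy sketch below; the function names and the short example arrays are illustrative, not taken from the paper.

```python
import numpy as np

def pcc(obs, sim):
    """Pearson correlation coefficient between observed and imputed series."""
    return float(np.corrcoef(obs, sim)[0, 1])

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit; 0 matches the mean of obs."""
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def pbias(obs, sim):
    """Percent bias (%): overall tendency of imputed values to over/underestimate."""
    return float(100.0 * np.sum(sim - obs) / np.sum(obs))

obs = np.array([10.0, 0.0, 5.0, 20.0, 2.0])  # hypothetical observed precipitation
sim = np.array([9.0, 0.5, 6.0, 18.0, 2.5])   # hypothetical imputed values
print(round(pcc(obs, sim), 4), round(nse(obs, sim), 4), round(pbias(obs, sim), 4))
# 0.9972 0.9745 -2.7027
```

Higher PCC and NSE and a Pbias closer to zero indicate better imputation, which is how the MLR method's advantage over the three single models is read off the tables.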
Qiu, H.; Chen, H.; Xu, B.; Liu, G.; Huang, S.; Nie, H.; Xie, H. Multiple Types of Missing Precipitation Data Filling Based on Ensemble Artificial Intelligence Models. Water 2024, 16, 3192. https://doi.org/10.3390/w16223192
