1. Introduction
With the continuous progress of mesoscale regional numerical models, numerical weather prediction (NWP) models have gained significant prominence in weather forecasting. However, the use of NWP models to forecast at finer temporal and spatial scales is currently constrained by several factors, including the initial conditions, boundary conditions, physical parameterization schemes, and the integration of multi-source data fusion technology. To enhance the performance of NWP models, research on correction methods cannot be overlooked. Bias correction serves as a bridge between NWP models and the realization of higher-resolution predictions. By correcting biases, the models become a more reliable tool for generating accurate predictions and supporting decision-making processes.
Significant advancements have been achieved in the field of correction method research. Hamill et al. [1] employed quantile mapping to align the precipitation frequency, resulting in enhanced forecast reliability and forecasting skill and a reduction in the deterministic forecast bias, while preserving the resolution and spatial detail of the precipitation distribution. Wu et al. [2] observed that the application of classical statistical methods led to a notable enhancement of the forecast results. The frequency-matching method and scoring optimization correction method proposed by Wu et al. [3] have gained significant popularity for the correction of cumulative precipitation forecasts. In recent years, the emergence of artificial intelligence has led to the successful application of machine-learning (ML) algorithms in various domains, such as data mining, image recognition, and medical care, bringing significant transformations to several industries. These advances also serve as a reference and a source of inspiration for the development of weather-forecasting technology. For instance, Zaytar et al. [4] employed a multi-stacked Long Short-Term Memory (LSTM) approach to model time series data of equal length, which facilitated improved predictions of meteorological variables such as the wind speed in nine cities in Morocco. Herman et al. [5] utilized three distinct statistical algorithms to forecast local extreme precipitation in the contiguous United States (CONUS), employing a Random Forest (RF) training model for precipitation prediction. Ahmed et al. [6] employed a range of ML algorithms, such as artificial neural networks (ANNs), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), to conduct a comparative analysis with the simulated precipitation and temperature outcomes generated by general circulation models (GCMs). The study revealed that the KNN and relevance vector machine multi-model ensembles exhibited superior skill, whereas the ANNs demonstrated greater performance fluctuations across spatial domains. Xu [7] highlighted the increasing utilization of deep-learning (DL) algorithms in weather forecasting and research in recent years, emphasizing their significant potential value and promising application prospects. Sun et al. [8] found that DL algorithms can improve the accuracy of 10 m wind speed forecasts generated by numerical models, and that the performance of the corrected forecasts improved consistently over time. Shi et al. [9] utilized the convolutional Long Short-Term Memory (ConvLSTM) model to forecast precipitation and observed that it outperformed conventional optical flow extrapolation techniques. Guo et al. [10] found that DL algorithms can learn the spatiotemporal structure and intrinsic correlation of radar data, leading to a significant enhancement in the prediction accuracy of strong convective weather echo intensity. Teng [11] introduced a novel model known as RET-RNN, developed using LSTM, which demonstrated promising results in long-duration extrapolation.
However, unlike the consistent, continuous, and smooth evolution of temperature, precipitation generally exhibits a highly non-linear and random distribution in both space and time, which makes bias correction of precipitation data challenging. Various methods have been developed to address this issue, including traditional quantile mapping (QM)-based bias correction and downscaling techniques, as well as recent machine-learning-based approaches such as Random Forest [12,13,14,15,16] and artificial neural networks [17]. In recent years, DL has made significant advancements across various fields and outperformed traditional ML methods due to its powerful ability to learn spatiotemporal feature representations in an end-to-end manner [18,19,20]. Specifically, DL approaches utilizing convolutional neural networks (CNNs) have been applied to correct and downscale low-spatial-resolution data [21,22,23], reanalysis products [24,25], and weather forecast model outputs [26,27]. While these studies have shown many promising strengths and advantages compared to traditional downscaling and correction methods, most of them struggle to capture local small-scale features, such as extreme events, in unseen datasets. For instance, Baño-Medina et al. [24] designed DL configurations with varying numbers of plain CNN layers to correct and downscale daily ERA-Interim reanalysis data from a spatial resolution of 2° to 0.5°. However, the overall performance still fell short of simple generalized linear regression models, resulting in significant underestimation of precipitation extremes. Harris et al. [26] developed a generative adversarial network (GAN) architecture to correct and downscale weather forecast outputs and found that accounting for forecast errors (or biases) in a spatially coherent manner is more challenging than addressing pure downscaling problems. Additionally, previous studies on bias correction and downscaling have primarily focused on the daily time scale [24,25,26,27,28,29,30]. It is worth noting that understanding the distribution of hourly precipitation within a day is more crucial than daily or monthly aggregations when assessing the impacts and risks associated with precipitation changes induced by global warming [31].
In this study, a combined model, PBT-GRU, based on the Population-Based Training (PBT) optimization algorithm and the Gate Recurrent Unit (GRU) model, is constructed and trained and is used to study a correction method for the precipitation forecasts of the mesoscale WRF model. The objective of the proposed model is to offer significant guidance and technical assistance in enhancing the precision of refined precipitation forecasting and expanding its applicability in various business domains.
2. Data and Methodology
2.1. Scheme of Precipitation Correction
The correction process consists of five distinct steps. Firstly, the study area is determined and the necessary data are prepared. Secondly, the data are processed. Thirdly, a sample database is constructed and the data are standardized. Fourthly, the training, validation, and test datasets are divided. Finally, a DL model based on PBT and GRU is developed to correct the deviations in the precipitation product. Various ML algorithms are introduced and the resulting corrections are compared and evaluated. The specific implementation plan is depicted in Figure 1.
2.2. Study Area
The target area for this study is the city of Zhengzhou (112.70°–114.23° E, 34.27°–34.98° N), which is located in northern Henan Province (Figure 2). Zhengzhou, being situated in the middle latitudes, is prone to frequent incursions of cold air. Warm and humid air masses can also reach the region during the summer, which often leads to the convergence of warm and cold air masses and subsequently results in intense rainfall events. Moreover, there are multiple indications that China’s climate is undergoing a transitional phase, which may shift the northern regions from low summer rainfall toward increased precipitation. Therefore, Zhengzhou is a highly appropriate study area for this research, and this work can serve as a stepping stone toward enhancing severe weather warnings and disaster prevention and mitigation capabilities in the region.
2.3. Construction of the Sample Database
The data used in this research consisted of NWP gridded forecast data and observation data. The observational data were obtained from the Henan Meteorological Bureau; specifically, we utilized the hourly ground observation data from 2014 to 2022 of Erqi Station (Station No. 57083), located in Zhengzhou, as the sample dataset. The forecast data were generated with the mesoscale weather model WRF4.0, a non-hydrostatic model jointly developed by numerous universities and research institutions in the United States [32]. This model features a data assimilation system capable of incorporating meteorological data and executing parallel operations. Moreover, it integrates the latest research findings and advancements from experts and scholars across various fields, providing a solid foundation for both scientific research and practical applications [33]. We utilized WRF4.0 to conduct numerical simulations of the weather process and then compared and analyzed the corrected results of the other ML algorithms using the high-resolution forecast results generated by the model. The model configuration settings are listed in Table 1. The temporal resolution of the input data was 6 h, and the horizontal resolution was 0.25° × 0.25°. Three nested domains were utilized, with the center of the simulation area located at (34.47° N, 114.21° E), as shown in Figure 3. The grid sizes for the domains were 151 × 151, 202 × 151, and 220 × 151, with corresponding grid spacings of 27 km, 9 km, and 3 km, respectively. The model employed 50 vertical layers, with a top pressure level of 50 hPa. The physical parameterization schemes used in all the model domains and experiments were the WRF Single-Moment 6-class microphysics scheme [34], the Mellor–Yamada–Janjic planetary boundary layer scheme [35], the Rapid Radiative Transfer Model (RRTM) radiation scheme [36], and the unified Noah land-surface model [37,38]. All the settings mentioned above are the optimal configurations for this simulation region, as summarized by Liu et al. [39]. All the experiments used NCEP FNL data as input conditions to help observe the changes in the model’s required spin-up times. The time step used for the lateral boundary condition file was one hour, and the history output files of each domain were logged hourly.
Given the large number of parameters and the high computational power required for training the DL model in this study, we selected the radar reflectivity factor as a predictive factor. This feature serves as an indicator of the generation and development of convection, as it reflects the reflection of radar waves at various height layers. After the selection of NWP gridded forecast data and observation data, a sample database was generated for the purpose of model training. The database consisted of 54 meteorological variables that were updated on an hourly basis. These data can be classified into eight distinct categories: air pressure (P), visibility (VIS), wind direction (WD), wind speed (WS), air temperature (T), relative humidity (RH), precipitation (P), and NWP (see Table 2 for detailed information).
Due to the specific focus of this paper on the correction of hourly precipitation output from the WRF model, we categorized the hourly rainfall into four levels based on operational practices. The sample distribution presented in Table 3 shows a significant disparity in the distribution of the hourly rainfall data: precipitation events (≥0.1 mm/h) are infrequent, constituting a mere 4.74% of the overall samples. This issue necessitates attention in the subsequent model training procedures.
2.4. Data Standardization
Due to the wide range of meteorological characteristics encompassed by the input features, each feature possesses distinct dimensions and units. Feeding these features directly into the model introduces complexity to the data processing and may potentially result in model crashes. To mitigate such issues, this study utilized the normalization calculation equation suggested by Song et al. [40], which uniformly rescales different data values to fit within the standard interval of 0–1. The specific formula can be expressed as follows:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x^{*}$ represents the standard data, $x$ represents the original data, and $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values in the original meteorological dataset. By ensuring that the normalized meteorological sample data fall within the 0–1 standard interval, the training efficiency of the model can be effectively improved, and an efficient calculation process can be ensured when the data are input into the model. Further details can be found in Table 4.
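For illustration, a minimal Python sketch of this normalization step (the function name and the zero-division guard are our own additions, not from the paper):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale each feature column to the 0-1 interval, as in the formula above."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Guard against constant columns to avoid division by zero.
    span = np.where(x_max - x_min == 0, 1.0, x_max - x_min)
    return (x - x_min) / span
```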
Considering that the sample database includes meteorological features such as the temperature, relative humidity, pressure, wind speed, visibility, and precipitation, and that these features are observed simultaneously, they cannot be used directly for prediction. To ensure precise predictions, a correlation must be established between the meteorological characteristics observed in the past and the predictive targets in the future. Based on these considerations, the normalized sample data are structured into time series data, which are then shifted backwards, where t represents the current time: the input value is the sample observation data at time t − 1, while the output value is the precipitation data at time t.
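This shift from a raw series to supervised input-output pairs can be sketched as follows (a hypothetical `pandas` illustration; the column names are placeholders, not those of the actual sample database):

```python
import pandas as pd

def series_to_supervised(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pair observations at time t-1 with the precipitation target at time t."""
    lagged = df.shift(1).add_suffix("(t-1)")   # all variables at t-1 as inputs
    out = lagged.copy()
    out["target(t)"] = df[target]              # precipitation at t as the label
    return out.dropna()                        # drop the first row, which has no t-1 values
```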
Based on the data presented in Table 4, all the sample data are normalized. Specifically, Var1(t − 1), Var2(t − 1), …, and Var54(t − 1) represent the observed values of the 54 meteorological variables at time t − 1, while Var1(t), Var2(t), …, and Var54(t) represent the observed values of the same 54 meteorological variables at time t. This study aims to predict future precipitation based on observations at previous times. To achieve this, we retain Var1(t) as the predicted value for the precipitation at time t and remove the remaining data of Var2(t), Var3(t), …, and Var54(t). The performance of the predictions is evaluated using seven statistical metrics: Probability of Detection (POD), Threat Score (TS), Equitable Threat Score (ETS), Bias Score (BIAS), accuracy, False Alarm Rate (FAR), and Missing Alarm Rate (MAR), which are defined as follows:

$$\mathrm{POD} = \frac{h}{h+m}, \quad \mathrm{TS} = \frac{h}{h+m+f}, \quad \mathrm{ETS} = \frac{h-h_{r}}{h+m+f-h_{r}}, \quad h_{r} = \frac{(h+m)(h+f)}{N}$$

$$\mathrm{BIAS} = \frac{h+f}{h+m}, \quad \mathrm{accuracy} = \frac{h+c}{N}, \quad \mathrm{FAR} = \frac{f}{h+f}, \quad \mathrm{MAR} = \frac{m}{h+m}$$

where $N = h+m+f+c$ is the total number of cases.
In the context of the statistical metrics used to evaluate the performance of the outcomes, the definitions of the contingency table statistics are as follows:
h: the number of forecasted events that match the actual events.
m: the number of actual events that were not forecasted.
f: the number of forecasted events that did not occur in reality.
c: the number of events that were neither forecasted nor occurred in reality.
These contingency table statistics are used to calculate the statistical metrics, which provide insights into the performance of the correction model.
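As a minimal sketch, the seven metrics can be computed directly from the four contingency-table counts (the function and argument names are ours):

```python
def verification_scores(h: int, m: int, f: int, c: int) -> dict:
    """Compute the seven verification metrics from contingency-table counts."""
    n = h + m + f + c                 # total number of cases
    h_r = (h + m) * (h + f) / n      # hits expected by random chance (for ETS)
    return {
        "POD": h / (h + m),
        "TS": h / (h + m + f),
        "ETS": (h - h_r) / (h + m + f - h_r),
        "BIAS": (h + f) / (h + m),
        "accuracy": (h + c) / n,
        "FAR": f / (h + f),
        "MAR": m / (h + m),
    }
```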
2.5. Training and Test Dataset
The sample dataset exhibits a significant class imbalance due to the infrequent occurrence of convective weather. Specifically, the number of positive samples representing convective weather with at least weak precipitation (greater than or equal to 0.1 mm/h) is considerably lower than the number of negative samples representing convective weather without precipitation (less than 0.1 mm/h). This is a typical instance of the sample imbalance problem discussed by Krawczyk et al. [41]. To mitigate this concern, a down-sampling technique is utilized to randomly eliminate surplus samples, taking into account the ratio of positive and negative samples in the dataset. This approach ensures a balanced distribution of positive and negative samples [42].
During the experiment, six distinct ratios of positive and negative samples are chosen, namely 3620:72,645 (the original dataset), 1:1, 1:2, 1:3, 2:1, and 3:1. The number 3620 represents the actual count of positive samples, whereas 72,645 represents the actual count of negative samples. When the PBT-GRU model is trained without any adjustment to the sample quantity, the POD, accuracy, and TS scores are 0.6237, 0.6114, and 0.5982, respectively. Based on the distribution of positive samples in the original dataset, we randomly select negative samples to obtain positive-to-negative sample ratios of 1:2 and 1:3, and we use the down-sampling technique to adjust the number of negative samples to obtain ratios of 2:1 and 3:1. The PBT-GRU model is subsequently trained, and the findings from the experiments are presented in Table 5. When the ratio of positive and negative samples is balanced at 1:1, the accuracy and TS of the hourly precipitation reach their highest values, although the POD score is slightly lower. As the ratio of positive to negative samples increases to 2:1 and 3:1, the number of positive samples increases, leading to higher POD scores, while the accuracy and TS scores decrease slightly. Conversely, when the ratio is 1:2 or 1:3, the number of negative samples increases, resulting in a notable decrease in the POD, accuracy, and TS scores. In conclusion, to enhance the prediction performance, we determined that a 1:1 ratio of positive and negative samples is the optimal choice.
To obtain objective and fair experimental results, we employ random deletion to balance the number of positive and negative samples in the original dataset, ensuring that the sample size is controlled and balanced. We then divide all the positive and negative samples into three subsets: the training dataset (80% of the total samples), the validation dataset (10%), and the test dataset (the remaining 10%).
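A minimal sketch of the 1:1 down-sampling and the 80/10/10 split, using scikit-learn's `train_test_split` (the placeholder arrays stand in for the real samples):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
pos = rng.normal(size=(3620, 9))    # placeholder for the 3620 positive samples (9 features)
neg = rng.normal(size=(72645, 9))   # placeholder for the 72,645 negative samples

keep = rng.choice(len(neg), size=len(pos), replace=False)   # down-sample negatives to 1:1
balanced = np.concatenate([pos, neg[keep]])
labels = np.concatenate([np.ones(len(pos)), np.zeros(len(pos))])

# 80% training, then split the remaining 20% evenly into validation and test.
x_train, x_rest, y_train, y_rest = train_test_split(
    balanced, labels, test_size=0.2, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42)
```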
3. Correction Model Construction Based on PBT and GRU
3.1. Dataset Dimensionality Reduction by RF
The high dimensionality and complexity of features in ML frequently result in reduced computational efficiency and heightened operating costs, which are detrimental to business-oriented applications. In the context of nonlinear complex feature spaces and vast high-dimensional data, eliminating redundant and irrelevant feature values from the input features has emerged as a critical concern in ML. Feature filtering and dimensionality-reduction techniques are employed to identify and retain input features that possess high importance and contain rich information, which ultimately improves the model’s ability to extract and refine relevant information. Random Forest (RF) is an ML algorithm that utilizes bootstrap resampling to randomly select data for constructing resampled samples, employs random splitting to construct multiple decision trees for each sample, and aggregates the decision trees to derive the final prediction via a voting mechanism. RF is also a commonly employed technique for feature selection: it assesses the importance of each feature, ranks the features accordingly, and filters out the most significant ones. This is particularly valuable when a substantial number of features are involved in classification or regression tasks, since many features exhibit high correlation and dimensionality issues, and incorporating them all into the model can significantly affect the accuracy of model training and prediction. By utilizing the RF algorithm, an importance analysis can be conducted to determine the significance of each predictor and establish a prioritized ranking. The fundamental principle is to quantify the contribution made by each feature in every tree within the Random Forest; these values are then averaged and compared to determine the relative contributions among the features. Typically, the Gini index or Out-of-Bag (OOB) error rate can be employed as an evaluation metric. In this study, we primarily use the Gini index, as discussed by Breiman [43], Robin et al. [44], and McGovern et al. [45]. Here, we denote the Variable Importance Measure (VIM) as the score reflecting the importance of the variables, while GI represents the Gini index. Assuming there are J features, I decision trees, and C categories, the Gini index of node q in the i-th tree is calculated as follows:
$$\mathrm{GI}_{q}^{(i)} = \sum_{c=1}^{C} p_{qc}\left(1-p_{qc}\right) = 1-\sum_{c=1}^{C} p_{qc}^{2}$$

Among them, C represents the categories, and $p_{qc}$ denotes the proportion of category c at node q. The change in the Gini index for feature j at node q, i.e., its importance at that node, is given by:

$$\mathrm{VIM}_{jq}^{(i)} = \mathrm{GI}_{q}^{(i)} - \mathrm{GI}_{l}^{(i)} - \mathrm{GI}_{r}^{(i)}$$

where $\mathrm{GI}_{l}^{(i)}$ and $\mathrm{GI}_{r}^{(i)}$ denote the Gini indices of the two child nodes produced by the split. Suppose there are I trees in the Random Forest (RF); then, summing over the set $Q_{j}^{(i)}$ of nodes at which feature j is used for splitting in the i-th tree:

$$\mathrm{VIM}_{j} = \sum_{i=1}^{I} \sum_{q \in Q_{j}^{(i)}} \mathrm{VIM}_{jq}^{(i)}$$

Finally, normalization is performed:

$$\mathrm{VIM}_{j}^{*} = \frac{\mathrm{VIM}_{j}}{\sum_{j'=1}^{J} \mathrm{VIM}_{j'}}$$
The specific steps involved in this process are as follows. First, the feature importance is calculated for all the features, and the features are ranked in descending order of importance. Second, given a predetermined threshold for the proportion of features to be rejected, this threshold is used as a criterion to eliminate the least important features. These two steps are repeated on the remaining feature dataset until the desired number of features has been selected. The feature dataset with the lowest Out-of-Bag error rate, which corresponds to the selected feature set, is chosen as the input for the model [43,44].
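A minimal sketch of this Gini-based ranking with scikit-learn, whose `feature_importances_` attribute implements the normalized Gini importance described above (the data arrays are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(7240, 54))       # placeholder: 54 candidate features
y = rng.integers(0, 2, size=7240)     # placeholder: precipitation / no-precipitation labels

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

# Rank features by normalized Gini importance and keep the top nine.
ranking = np.argsort(forest.feature_importances_)[::-1]
top_nine = ranking[:9]
print(forest.feature_importances_[top_nine].sum())  # cumulative importance of the selection
```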
Through the implementation of RF dimensionality reduction, we identified the nine most significant features, which collectively account for an importance score of 0.853. The specific findings are displayed in Figure 4 and Table 6. The results show that, for convective weather, the importance ranking of the features produced by the machine-learning method is largely consistent with the subjective understanding of forecasters; for example, the radar reflectivity factor is the most important predictive factor for judging short-term heavy precipitation. The objective ranking of these features also yields useful insights. For example, the automatically observed minimum visibility is an important factor in predicting precipitation. As the intensity and duration of precipitation can significantly modulate visibility, consistent and stable rainfall can easily produce a prolonged low-visibility scenario, and sudden heavy precipitation is an important factor inducing a sharp decrease in visibility. As rainfall increases, the visibility changes from a rapid decline to a slow decline, with an inflection point in between [46]. Therefore, the automatically observed minimum visibility is an important predictor that could be used more widely in operational applications.
3.2. The PBT Optimization Algorithm
The training process for ML models involves a multitude of parameters and hyperparameters that exert a substantial influence on the ultimate efficacy of these models. Traditionally, these parameters and hyperparameters have been adjusted manually; however, this approach is time-consuming and labor-intensive and does not guarantee an optimal solution. Consequently, automatic adjustment methods have become the predominant approach. Parallel search and sequence optimization are two distinct approaches to automatic tuning, each comprising a variety of individual methods. Parallel search refers to the simultaneous training of multiple sets of parameters, using techniques such as random search and grid search to explore the parameter space efficiently; one limitation of this approach is that optimization information is not shared efficiently across parameter sets. In contrast, sequence optimization seeks optimal parameters through a series of iterative attempts, without parallel operations; this methodology encompasses strategies such as Bayesian optimization and manual parameter tuning. Nevertheless, certain parameters, such as the degree of exploration and the learning rate, fluctuate continuously throughout the model training process. The conventional approach is to establish predetermined values initially and modify them in response to various scenarios, which frequently fails to yield optimal parameter values. In summary, the careful selection and optimization of parameters and hyperparameters play a crucial role in determining the overall performance of ML and DL models; while manual adjustment is laborious and time-consuming, automatic tuning methods provide more efficient solutions, albeit with inherent limitations.
The Population-Based Training (PBT) method has been shown to be effective in automating and optimizing hyperparameters [47].
Figure 5 presents a visual representation of the main differences among the PBT, sequence optimization, and parallel search methods. (A) Sequential optimization necessitates the completion of multiple training runs, which may include early stopping; afterward, fresh hyperparameters are chosen, and the model is retrained from scratch with the newly selected hyperparameters. This process is inherently sequential, leading to prolonged durations for hyperparameter optimization, although it utilizes minimal computational resources. (B) In contrast, the parallel random/grid search of hyperparameters entails the simultaneous training of multiple models with varying weight initializations and hyperparameters, with the objective of identifying the most optimized model among the available options. This approach requires only a single training session but demands additional computational resources to train multiple models simultaneously. The PBT algorithm integrates the advantages of sequence optimization and parallel search. Initially, the PBT algorithm employs a random initialization process to generate multiple models. During the training process, checkpoints are automatically generated at regular intervals, and each model adapts its behavior in response to the performance of the other models. If a model exhibits encouraging outcomes, its training persists; conversely, if a model’s performance is deemed unsatisfactory, its parameters are replaced with those of a better-performing model. Additionally, to further explore the parameter space, random perturbations are introduced during the training process. Checkpoints are established through manual configuration, whereas perturbations are induced by introducing noise. In summary, the PBT method combines the advantages of the sequence optimization and parallel search methods, facilitating the efficient and effective adjustment and optimization of hyperparameters. The PBT algorithm demonstrates dynamic adaptation to enhance the overall training outcomes by employing checkpoint generation, model evaluation, and parameter replacement techniques [47].
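A highly simplified sketch of the PBT exploit/explore cycle (not the authors' implementation; the population fraction, perturbation factors, and the user-supplied `train_and_eval` routine are illustrative assumptions):

```python
import copy
import random

def pbt_step(population: list) -> list:
    """One PBT round: train briefly, then exploit (copy winners) and explore (perturb)."""
    for member in population:
        member["score"] = train_and_eval(member)   # assumed user-supplied train/checkpoint routine
    population.sort(key=lambda mbr: mbr["score"], reverse=True)
    cut = max(1, len(population) // 4)
    for loser in population[-cut:]:
        winner = random.choice(population[:cut])
        loser["weights"] = copy.deepcopy(winner["weights"])      # exploit: inherit checkpoint
        loser["hyperparams"] = {
            key: value * random.choice([0.8, 1.2])               # explore: perturb by +/-20%
            for key, value in winner["hyperparams"].items()
        }
    return population
```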
3.3. Construction of the Model
This paper introduces the development of a DL model called PBT-GRU, which uses sequence data from ground observations and numerical model grid points as input. Firstly, to optimize efficiency, preprocessing techniques such as data normalization and cleaning are applied to the initial meteorological data, after which the dimensionality of the initial dataset is reduced using the Random Forest algorithm. Secondly, two GRU layers are used to extract the time-varying features of the sequence data: the first GRU layer, containing 128 neurons, is set to return the complete sequence, while the second GRU layer, containing 64 neurons, is set not to return the complete sequence; the activation function of both GRU layers is ReLU. Finally, the predicted precipitation amount is obtained from the output of two dense layers. The PBT-GRU model thus contains two GRU layers and two fully connected layers; through the stacking of these layers and the processing of the activation functions, the model learns the features of the input data and outputs the prediction results (as illustrated in Figure 6). Because the forecast station and the model grid points are not located at the same point, the grid point element values near the forecast station must be interpolated to obtain the meteorological element values at the station. Given the potential error introduced by interpolation, we employ bilinear interpolation to mitigate this error, which allows us to generate a sample database for training the model. To ensure the comparability of different models, both the PBT-GRU model and the other ML models undergo a reconstruction of the sample dataset. By taking into account the evolutionary patterns and characteristics of weather systems, the model can capture the fundamental causal connections between precipitation and the other forecast attributes in the long-term series. The proposed PBT-GRU model effectively combines the benefits of Population-Based Training and the GRU architecture. Through efficient preprocessing and the incorporation of NWP gridded forecast data and observation data, the model significantly improves its predictive capability by accurately capturing the complex interplay between precipitation and other features.
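A minimal Keras sketch of the architecture described above (two GRU layers with ReLU activations followed by two dense layers; the input shape and the hidden dense-layer width are illustrative assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

def build_pbt_gru(timesteps: int, n_features: int) -> Sequential:
    """Two stacked GRU layers (128 and 64 units) followed by two dense layers."""
    model = Sequential([
        GRU(128, activation="relu", return_sequences=True,
            input_shape=(timesteps, n_features)),   # first GRU returns the full sequence
        GRU(64, activation="relu"),                 # second GRU returns only the last state
        Dense(32, activation="relu"),               # hidden dense layer (width assumed)
        Dense(1),                                   # predicted hourly precipitation amount
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```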
The model training process employs the early stopping strategy, with the iteration period (Epoch) set to 300; if the loss does not decrease for more than 10 epochs, training is automatically terminated. A batch size of 16 is used. The loss function selected for minimization during training is the mean squared error (MSE). The formula is as follows:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-\hat{y}_{i}\right)^{2}$$

where m represents the training sample size, $y_{i}$ represents the actual value, and $\hat{y}_{i}$ represents the predicted value. Statistical measures, including the correlation coefficient (r), standard deviation ($\sigma$), and root mean square error (RMSE), are used to assess the model performance.
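Continuing the sketch above, the stated training setup (300 epochs, patience of 10, batch size of 16, MSE loss) maps directly onto Keras callbacks; the data arrays here are placeholders shaped like the real samples:

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Placeholders: 5792 training and 724 validation samples, 9 features, one lagged step.
x_train = np.zeros((5792, 1, 9))
y_train = np.zeros(5792)
x_val = np.zeros((724, 1, 9))
y_val = np.zeros(724)

model = build_pbt_gru(timesteps=1, n_features=9)   # from the architecture sketch above
early_stop = EarlyStopping(monitor="val_loss", patience=10,   # stop after 10 stagnant epochs
                           restore_best_weights=True)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=300, batch_size=16,
          callbacks=[early_stop])
```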
3.4. Experimental Setup
To evaluate the efficacy of the different models, we utilized a range of methods, including Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient-Boosting Decision Tree (GBDT), and PBT-GRU, with the objective of enhancing the accuracy of the precipitation forecasts generated by the WRF model. The refined outputs were then compared with the original predictions from the WRF model. The input dataset encompassed historical data from the years 2014 to 2022. We adopted the optimal 1:1 ratio of positive to negative samples: given 3620 positive samples, we set an equal number of negative samples, yielding a total of 7240 samples. We segmented this dataset as follows: 80% (that is, 5792 samples) served as the training dataset, 10% (or 724 samples) formed the validation dataset, and the remaining 10% were used as the test dataset. We selected the nine features identified through the RF screening. It is important to note that the distribution of the training, validation, and test datasets remained consistent across the different models.
5. Summary
In this study, we constructed a DL model based on PBT and GRU (PBT-GRU) for correcting the precipitation deviation predicted by the WRF model. We then compared the corrected results of the PBT-GRU model with those of ML algorithms such as RF, SVM, KNN, and GBDT. The main conclusions drawn from this research are as follows:
- (1)
The sample balancing experiment results revealed that when the ratio of positive and negative samples was 1:1, both the accuracy and TS scores reached their highest values, while the POD score was slightly lower. As the number of positive samples increased, the POD score improved, yet the accuracy and TS scores slightly decreased. Conversely, as the number of negative samples increased, the POD, accuracy, and TS scores all declined significantly.
- (2)
To optimize the model’s performance, we utilized RF to evaluate the significance of various forecast features. As a result, nine key features were identified and selected, including radar reflectivity factor, 3 h precipitation, automatic observation of minimum visibility, 6 h precipitation, artificial visibility, 12 h precipitation, automatic observation of 10 min average visibility, automatic observation of 1 min average visibility, and maximum wind speed. By incorporating these features, the model’s input size was significantly reduced, leading to improved computational efficiency.
- (3)
Combining the advantages of PBT and GRU, a DL model named PBT-GRU was constructed, which took the forecast features in the first 72 h as input features, fully considering the evolution law and characteristics of the weather system. The experimental results showed that the RMSE of the PBT-GRU was only 1.12 mm, which was reduced by 51.72%, 58.36%, 37.43% and 26.32% compared with SVM, KNN, GBDT and RF, respectively. The $\sigma$ and r of the PBT-GRU, RF, SVM, GBDT and KNN were 1.02 and 0.99, 1.12 and 0.98, 1.24 and 0.95, 1.15 and 0.97, and 1.26 and 0.93, respectively. According to the comprehensive analysis of the accuracy, TS, RMSE, and r, the PBT-GRU model performed the best, and its correction effect was significantly better than that of the ML methods. This model can be applied to forecast applications in private industry, providing a platform and technical support for future weather forecasting and early warning services.
This study provides preliminary evidence that the proposed PBT-GRU model can outperform model precipitation correction using only a small sample of single-station data. However, the memory overhead required during training would increase significantly as the model resolution improves and as sample data from other regions are added, posing a new challenge to the validity of the algorithm and the generalization ability of the model. Therefore, there is an urgent need to develop new methods to address these issues. Much work also remains to be done to interpret deep-learning models and their forecast results, especially interpretive studies using visualization techniques. Only then can we further improve the credibility of deep-learning methods, increase forecasters’ trust in the products, and expand the scope of their applications.