#### *3.1. Overall Test Results*

Model evaluation is based on the datasets collected during real-time G2N applications along with the PRUFS forecasts during November and December 2021. The root mean squared error (RMSE) of the 2 m temperature, 2 m relative humidity, and 10 m wind speed at 311 stations was calculated for the 0–24 h PRUFS forecasts and the G2N corrections, and the results are presented in Table 1. The improvement percentages (IP) of the corrected forecasts are also given. The convolution-based G2N model corrected the model forecast errors effectively, and the RMSE of all three variables is reduced significantly. The RMSE of the temperature forecast is reduced by 19.4% and that of the relative humidity by 24.5%, while the wind speed forecast shows the greatest improvement, 42.8%.

**Table 1.** The RMSE and improvement percentages (IP) (see Equation (6)) of the 2 m temperature, 2 m relative humidity, and 10 m wind speed 0–24 h forecasts corrected by G2N, averaged at all stations.


The hour-by-hour errors of the PRUFS forecasts and G2N corrections at each station were computed, and their distributions are presented in Figure 4. The error is defined as

$$Error = forecast - observation \tag{5}$$

**Figure 4.** Normalized frequency distributions of the PRUFS forecast errors and the G2N corrected forecast errors for the (**a**) 2 m temperature, (**b**) 2 m relative humidity, and (**c**) 10 m wind speed, with error bins of 1 °C, 1%, and 0.5 m/s, respectively.

Figure 4 shows that the biases of the PRUFS 2 m temperature, 2 m relative humidity, and 10 m wind speed forecasts are 0.41 °C, −3.43%, and 1.15 m/s, respectively. After the G2N correction, they are reduced to −0.15 °C, 0.37%, and 0.03 m/s, respectively. The error distributions of the G2N-corrected 2 m temperature, 2 m relative humidity, and 10 m wind speed are approximately symmetric about the zero-error point and more narrowly concentrated around it, indicating that G2N is effective in eliminating both negative and positive systematic errors. Notably, the distribution of the PRUFS 10 m wind speed forecast errors shows a clear overall positive bias. Several previous studies reported similar results [21–23]. G2N effectively corrected this wind speed bias. The overall systematic errors of the PRUFS 2 m temperature and 2 m relative humidity forecasts were not as substantial as those of the wind. Following the G2N correction, the number of samples with large temperature and humidity errors also decreased significantly.
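The error definition in Equation (5), the bias, and the normalized frequency distribution described above can be sketched in a few lines. This is a minimal illustration rather than the authors' processing code: the function names are hypothetical, the sample wind speeds are made up, and the binning convention (bin `k` covers `[k·w, (k+1)·w)`) is an assumption.

```python
import math
from collections import Counter


def errors(forecast, observation):
    """Equation (5): error = forecast - observation, pair by pair."""
    return [f - o for f, o in zip(forecast, observation)]


def bias(err):
    """Mean error; a positive value indicates over-prediction."""
    return sum(err) / len(err)


def normalized_frequency(err, bin_width):
    """Fraction of samples per bin, keyed by the lower bin edge.

    Assumed convention: bin k covers [k * bin_width, (k + 1) * bin_width).
    """
    counts = Counter(math.floor(e / bin_width) for e in err)
    n = len(err)
    return {k * bin_width: c / n for k, c in sorted(counts.items())}


# Hypothetical 10 m wind speeds (m/s), for illustration only.
forecast = [3.0, 4.5, 2.0, 5.5]
observed = [2.0, 3.0, 2.5, 4.0]

err = errors(forecast, observed)
wind_bias = bias(err)
hist = normalized_frequency(err, bin_width=0.5)  # 0.5 m/s bins as in Figure 4c
```

With these made-up samples the bias is positive, which is the kind of over-prediction the text reports for the PRUFS wind forecasts.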

To further examine the error properties, density scatter plots of the forecast–observation pairs of the PRUFS forecasts and the G2N corrections are shown in Figure 5. The samples include all stations and forecast times during the test period.

**Figure 5.** Density scatter plots of the forecast–observation pairs of the PRUFS forecasts (**1st column**) and the G2N correction (**2nd column**). The (**a**,**b**) 2 m temperature, (**c**,**d**) 2 m relative humidity, and (**e**,**f**) 10 m wind speed.

The forecast–observation pairs of all three variables converge more compactly around the black diagonal lines after the correction, i.e., the corrected forecasts are closer to the observed values. For example, the variance of the wind speed (Figure 5e,f) is reduced from 2.21 to 0.7. PRUFS underestimated the relative humidity (Figure 5c), with the samples skewed to the right of the diagonal. This skew is removed in the G2N-corrected data (Figure 5d), resulting in a more concentrated and symmetric distribution around the diagonal. Similarly, the 10 m wind speed (Figure 5e,f) was largely overpredicted by PRUFS, and G2N removed most of this bias and greatly reduced the overall errors.

#### *3.2. Forecast Lead Time and Daily Variation*

To analyze the performance of the G2N model at different forecast lead times and different times of day, the samples in the test dataset were grouped by forecast lead time and by time of day, respectively, and the error statistics were computed for each group. Figure 6a,c,e show the forecast scores of PRUFS and G2N for the 0–24 h lead times. As the forecast lead time increased from 1 to 24 h, the PRUFS 2 m temperature RMSE increased from approximately 2 to 2.5 °C, the 2 m relative humidity RMSE from approximately 12 to 15%, and the 10 m wind speed RMSE from 1.8 to 2.0 m/s. After the G2N correction, the RMSE of these three variables is reduced by approximately 1.3 °C, 8%, and 0.8 m/s, respectively. Furthermore, G2N corrects the larger errors at longer lead times more effectively for the 24 h forecasts examined here: the RMSE of the corrected 2 m temperature forecasts increases by only 0.3 °C, and the corrected 2 m relative humidity and 10 m wind speed errors are nearly unchanged with lead time. This result shows that the G2N model can automatically adjust the magnitude of the error correction according to the error growth over the forecast lead times examined herein.
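The grouping step described above can be sketched as follows. The record layout (`lead_h`, `local_h`, `forecast`, `observation`) and the sample values are assumptions made for illustration; the source does not describe how the samples are stored.

```python
import math
from collections import defaultdict


def rmse_by_group(samples, key):
    """RMSE of (forecast - observation) for each distinct value of `key`."""
    sq_sum = defaultdict(float)
    count = defaultdict(int)
    for s in samples:
        group = s[key]
        sq_sum[group] += (s["forecast"] - s["observation"]) ** 2
        count[group] += 1
    return {g: math.sqrt(sq_sum[g] / count[g]) for g in sorted(count)}


# Hypothetical 2 m temperature samples (deg C); field names are assumptions.
samples = [
    {"lead_h": 1, "local_h": 8, "forecast": 11.0, "observation": 10.0},
    {"lead_h": 1, "local_h": 15, "forecast": 19.0, "observation": 20.0},
    {"lead_h": 24, "local_h": 8, "forecast": 13.0, "observation": 10.0},
    {"lead_h": 24, "local_h": 15, "forecast": 24.0, "observation": 20.0},
]

by_lead = rmse_by_group(samples, "lead_h")    # lead-time curves (Figure 6, left panels)
by_local = rmse_by_group(samples, "local_h")  # diurnal curves (Figure 6, right panels)
```

The same function serves both analyses; only the grouping key changes.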

**Figure 6.** The variation of the RMSE of the PRUFS 0–24 h forecasts and the G2N correction for 2 m temperature (**a**,**b**), 2 m relative humidity (**c**,**d**), and 10 m wind speed (**e**,**f**) with forecast lead time (**left panels**) and time of day (**right panels**). The horizontal coordinate of the left panels is the forecast lead time and that of the right panels is the local time.

The RMSE of the three meteorological variables at different times of day for the PRUFS forecasts and the G2N corrections is shown in Figure 6b,d,f. The errors of the PRUFS forecasts display significant diurnal variations. The RMSE of the 2 m temperature forecasts peaks at 15:00 LT and reaches a minimum at 7:00 LT. The 2 m relative humidity errors evolve approximately opposite to the temperature errors, with a peak at 8:00 LT and a minimum at around noon LT. The 10 m wind speed error fluctuates less. After the G2N corrections, the RMSE of all three variables is significantly reduced at all times of day, with a diurnal variation generally consistent with that of the PRUFS forecasts. This suggests that the diurnally varying physical processes that cause the PRUFS model errors may also pose some difficulties for the G2N model.

#### *3.3. Spatial Distribution of the G2N Performances*

To analyze the horizontal distribution of the G2N performance, the RMSE of the 2 m temperature, 2 m relative humidity, and 10 m wind speed was calculated at each station over all samples of the test dataset. G2N significantly reduced the PRUFS forecast RMSE of all three variables at almost all stations (Figure 7). The PRUFS temperature prediction errors at several stations were higher than 7.55 °C. After the G2N correction, they were reduced to less than 2.0 °C. In general, the central regions of the domain achieved the best correction results, where the overall error of the PRUFS relative humidity forecast is reduced from ~16% to less than 13% by the G2N model, and the wind speed error from ~1.3–1.5 m/s to below approximately 0.5 m/s. Furthermore, the G2N model is more effective at the stations where the PRUFS forecast errors are larger. At most stations, the G2N model achieves IP values (see Equation (6)) over 60% for wind speed. For relative humidity, approximately half of the stations reach an IP of 60%. G2N performed slightly worse in the southern part of the domain and in the northwest corner because the spatial information surrounding the sites at and across the boundary is not included. Overall, the correction is most effective for wind speed throughout the domain.

**Figure 7.** The RMSE of the PRUFS forecasts (**left panels**) and the G2N correction (**middle panels**) and the corresponding G2N improvement percentages (**right panels**, %) of 2 m temperature (**a**–**c**), 2 m relative humidity (**d**–**f**), and 10 m wind speed (**g**–**i**).

To quantitatively compare the G2N effect among the stations, the improvement percentage (IP) of the RMSE of the G2N correction over the PRUFS forecasts was calculated for each site as follows

$$IP = \frac{\text{PRUFS\_forecast}_{RMSE} - \text{G2N\_correction}_{RMSE}}{\text{PRUFS\_forecast}_{RMSE}} \times 100 \tag{6}$$

Figure 7 shows that more than half of the stations gain an IP over 30% for all three meteorological elements, although some stations in the northwest marginal area and along the southern boundary show a negative effect. A lack of spatial feature information at the edges may adversely affect these sites. Again, the G2N model is most effective for correcting the wind forecast errors, with IPs larger than 15% at most stations and over 50% at more than half of them.
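As a small worked illustration of Equation (6), the sketch below computes the RMSE of a raw and a corrected forecast at one station and the resulting IP. The values are hypothetical and the function names are not taken from the source. A negative IP corresponds to the "negative effect" mentioned above, i.e., the correction increases the RMSE.

```python
import math


def rmse(forecast, observation):
    """Root mean squared error of paired forecast and observation values."""
    n = len(forecast)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observation)) / n)


def improvement_percentage(rmse_raw, rmse_corrected):
    """Equation (6): relative RMSE reduction in percent.

    Negative when the corrected forecast has a larger RMSE than the raw one.
    """
    return (rmse_raw - rmse_corrected) / rmse_raw * 100.0


# Hypothetical wind speeds (m/s) at a single station.
obs = [2.0, 3.0, 4.0, 5.0]
raw = [4.0, 5.0, 6.0, 7.0]        # every raw error is +2 m/s
corrected = [3.0, 2.0, 5.0, 4.0]  # corrected errors are +1 or -1 m/s

ip = improvement_percentage(rmse(raw, obs), rmse(corrected, obs))
degraded = improvement_percentage(1.0, 1.2)  # a case where correction hurts
```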

Figures 8 and 9 show the bias and the Pearson correlation coefficients for the forecasts of the three meteorological variables, respectively. The bias of the PRUFS model forecast is significantly reduced by the G2N correction. The PRUFS temperature forecasts have a bias of over 1.3 °C at several clustered surface weather stations, which the G2N correction reduces to less than 0.5 °C. The PRUFS 2 m relative humidity forecast has an overall negative bias (approximately −6.92%) and its wind forecast has a positive bias (approximately 0.95 m/s); these are reduced to −2.91% and −0.18 m/s, respectively, after the G2N correction.

**Figure 8.** Horizontal distribution of the bias of the PRUFS forecasts (**left panels**) and the G2N correction (**right panels**) of 2 m temperature (**a**,**b**), 2 m relative humidity (**c**,**d**), and 10 m wind speed (**e**,**f**).

**Figure 9.** Horizontal distribution of the Pearson correlation coefficients between the observations and the PRUFS forecasts (**left panels**) and the G2N correction (**right panels**) of 2 m temperature (**a**,**b**), 2 m relative humidity (**c**,**d**), and 10 m wind speed (**e**,**f**).

In comparison with the PRUFS forecast, the correlation between the G2N-corrected forecasts and the observations is also significantly improved for all three meteorological variables (Figure 9). The correlation coefficient (*r*) can be assessed by the general guidelines proposed by Cohen et al. [24,25]: |*r*| < 0.3 is defined as weakly correlated; 0.3 < |*r*| < 0.6 as moderately correlated; 0.6 < |*r*| < 0.8 as strongly correlated; and 0.8 < |*r*| < 1 as extremely strongly correlated.

The all-station average correlation coefficient for the PRUFS temperature forecast was approximately 0.946, and it reached 0.952 after the G2N correction. For relative humidity, the all-station average correlation coefficient was 0.793 for the PRUFS forecast, and ~95% of the stations were strongly correlated. After the G2N correction, the all-station average correlation coefficient improved to 0.852, and the proportion of strongly correlated stations increased to approximately 100%. For the wind, the all-station average correlation coefficient for the PRUFS forecast was 0.626, the proportion of strongly correlated sites was 69%, and the proportion of extremely strongly correlated sites was 6.8%. After the G2N correction, the all-station average correlation coefficient increased to 0.739, the percentage of strongly correlated sites rose to 92.6%, and the percentage of extremely strongly correlated sites rose to 33%.
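The correlation classes used above can be applied mechanically, as in the following sketch. The thresholds are those quoted in the text; the handling of the exact boundary values (0.3, 0.6, 0.8), which the text leaves undefined, is an assumption here (each boundary is assigned to the higher class), and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_class(r):
    """Map |r| to the classes quoted in the text.

    Assumption: boundary values fall into the higher class.
    """
    a = abs(r)
    if a < 0.3:
        return "weak"
    if a < 0.6:
        return "moderate"
    if a < 0.8:
        return "strong"
    return "extremely strong"


# All-station averages quoted in the text: (PRUFS, G2N-corrected).
averages = {
    "2 m temperature": (0.946, 0.952),
    "2 m relative humidity": (0.793, 0.852),
    "10 m wind speed": (0.626, 0.739),
}
classes = {
    name: (correlation_class(raw), correlation_class(corrected))
    for name, (raw, corrected) in averages.items()
}
```

Applied to the quoted averages, only the relative humidity changes class after correction; the wind speed average stays in the strong class even though the share of extremely strongly correlated sites grows.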

#### **4. Sensitivity Analysis of G2N to the Inputs and Learning Areas**

G2N realized the forecast error correction by projecting the PRUFS model grid forecasts to the observation sites. Two natural questions are: what is the optimal size (area) of the gridded input data (G) and what is the proper number of stations for the objective function (loss function)? The size of the input data (G) determines the features of the multiscale characteristics of the PRUFS model forecast that are extracted to infer the information related to the target site. On the other hand, the number of sites (N) of the objective function is a multitask learning problem [26–29], that is, how many adjacent station sites are optimal for simultaneous learning. This section analyzes these two issues by conducting two groups of sensitivity experiments, briefly, G-exp and N-exp.

#### *4.1. Impact of the PRUFS Forecast Input (G-Exp)*

A group of G-exps was conducted to investigate the impact of the PRUFS forecast patch sizes, i.e., the areas of G, as the input of G2N, on the G2N correction. For simplicity, the central station "Jintan" (see Figure 1) was selected as a single site for the correction tests, i.e., G2N with N = 1, briefly, G2-One. The structure of G2-One is shown in Figure 2b.

To keep this paper concise, only the 10 m wind speed correction is presented because the results for the other two variables are similar. The experiments were designed by cropping the domain of the input fields with varying area ratios relative to the large default area chosen and discussed in the previous sections (with a dimension of 301 × 401 grid points). "Jintan" was kept approximately at the center of all the cropped domains. The area ratio is defined as follows.

$$\text{AreaRatio} = \frac{\text{Cropped\_dimension}}{\text{Default\_dimension}} \tag{7}$$

The G2-One model was trained to correct the 10 m wind speed at the "Jintan" station with AreaRatio = 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, respectively (Figure 2b shows the cases of AreaRatio = 0.3, 0.5, and 0.7). The RMSE and improvement percentage of the 10 m wind speed at the station were computed on the test dataset for these eight G2-One experiments and are given in Table 2. The table also includes the results of the G2N331 model.
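A minimal sketch of the cropping step follows, under two stated assumptions: Equation (7) is applied to each side length separately, and the 301 × 401 grid is stored as 301 rows by 401 columns. The station's grid indices are hypothetical, since the source does not give Jintan's grid location, and `crop_bounds` and `crop` are illustrative names.

```python
def crop_bounds(n_rows, n_cols, center_row, center_col, area_ratio):
    """Return (r0, r1, c0, c1) of a window centred near the station.

    Assumption: Equation (7) is applied to each side length separately.
    """
    if not 0.0 < area_ratio <= 1.0:
        raise ValueError("area_ratio must be in (0, 1]")
    crop_rows = max(1, round(n_rows * area_ratio))
    crop_cols = max(1, round(n_cols * area_ratio))
    # Shift the window if needed so that it stays inside the default domain.
    r0 = min(max(center_row - crop_rows // 2, 0), n_rows - crop_rows)
    c0 = min(max(center_col - crop_cols // 2, 0), n_cols - crop_cols)
    return r0, r0 + crop_rows, c0, c0 + crop_cols


def crop(grid, bounds):
    """Cut the window out of a 2-D grid stored as a list of rows."""
    r0, r1, c0, c1 = bounds
    return [row[c0:c1] for row in grid[r0:r1]]


N_ROWS, N_COLS = 301, 401            # default domain from the text (orientation assumed)
STATION_ROW, STATION_COL = 150, 200  # hypothetical station indices

grid = [[0.0] * N_COLS for _ in range(N_ROWS)]  # placeholder forecast field
bounds = crop_bounds(N_ROWS, N_COLS, STATION_ROW, STATION_COL, 0.7)
patch = crop(grid, bounds)
```

Clamping the window to the domain keeps the patch size fixed for a given AreaRatio even when the station is not exactly at the domain center, at the cost of the station being slightly off-center in the patch.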

**Table 2.** RMSE of the 10 m wind speed of the PRUFS forecast and the correction by G2-One at Jintan and Nanjing for different input areas and the corresponding IP. The evaluation was done on all test datasets.


Table 2 shows again that incorporating information from other surrounding sites in the loss function improves the error correction at Jintan (i.e., G2N outperforms G2-One). Nevertheless, for clarity and simplicity, only the G-exps for a single site are discussed here. Table 2 shows that the G2-One performance improves as the AreaRatio (the size of G) increases from 0.3 to 0.7 and degrades as the AreaRatio increases further. **This indicates that selecting the proper sizes of spatial structures/features is important for G2N.** If the input domain is too small, the model cannot take in sufficient information on the spatial features of the PRUFS forecasts. On the other hand, if the input domain is too large, it may introduce unnecessary noise and/or an information burden that hinders the G2N training.

In addition to the Jintan station, we also ran the training at other stations located in the central regions of the domain. The RMSE of the wind speed at Nanjing (Table 2) is smaller than that at Jintan, but the trend of the sensitivity test results with different AreaRatios is consistent with that at Jintan. The results for other stations are similar and are not shown for brevity.


**Table 3.** Improvement percentages of RMSE for the N-Exps with the G2N model.

For a possible physical explanation of the optimum size for the G2N model training, we think the mesoscale circulation features are critical. For a given station, the model errors are affected by the mesoscale system over the station and the most important structural features of this mesoscale system should be included for the G2N model input. Thus, the "optimum size" should depend on the size of these most important structural features, which is a few hundred kilometers.

#### *4.2. Impact of the Sites for Multitask Learning (N-Exp)*

A set of N-exps is carried out to study the impact of assigning different numbers of surface stations for simultaneous learning, i.e., multitask learning. The stations are selected in regions with "Jintan" approximately at the center (see Figure 1), and the experiments take 51, 101, and 199 sites (Figure 10), respectively.

**Figure 10.** Sub-domains containing 51 (**a**), 101 (**b**), and 199 (**c**) station sites for N-exps. The red star is the station "Jintan".

For the N-exps, the G2N model was trained using the same labeled dataset and model forecast input as those discussed in the previous sections, but the loss functions were defined for a varying number of sites (i.e., target domain sizes): N = 51, 101, 199, and 331, denoted G2N51, G2N101, G2N199, and G2N311, respectively. The performance of these four configurations was assessed based on the outputs of these model runs over the test dataset. The improvement percentages (IP) of the RMSE of the G2N outputs, with respect to the PRUFS forecast, were computed for the 51, 101, 199, and 331 site groups, respectively, and are presented in Table 3.

The second column of Table 3 labeled as "51" compares the RMSE IP concerning the same 51 sites (shown in red in Figure 10a) corrected by the G2N over the PRUFS forecasts for the four N-exps. The third, fourth, and fifth columns are the same but for the IPs concerning 101, 199, and 311 sites, respectively. It can be seen from Table 3 that for the 51 evaluation sites, as N increases from 51 to 331, the IPs for 2 m T, 2 m RH, and 10 m wind speed gradually grow in general. Similar results can be found for verification statistics computed for 101 and 199 evaluation stations. The learning for all 331 stations achieved the best result. **These results indicate that multistation learning for G2N with more stations is beneficial, not only reducing computing costs dramatically but also increasing the learning skills of the G2N model.**
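The way a Table 3 column is built, with several differently trained models all scored on the same evaluation sites, can be illustrated with a short sketch. The station identifiers and values are hypothetical, and pooling the samples of all listed stations into one RMSE is an assumption; the source does not state whether the RMSE is pooled or averaged per station.

```python
import math


def rmse_over(stations, forecast, observation):
    """RMSE pooled over all samples of the listed stations (an assumption)."""
    sq_sum, n = 0.0, 0
    for sid in stations:
        for f, o in zip(forecast[sid], observation[sid]):
            sq_sum += (f - o) ** 2
            n += 1
    return math.sqrt(sq_sum / n)


def ip_on_subset(stations, raw, corrected, observation):
    """Equation (6) evaluated on a fixed subset of stations."""
    r = rmse_over(stations, raw, observation)
    c = rmse_over(stations, corrected, observation)
    return (r - c) / r * 100.0


# Hypothetical data: three stations with two samples each.
observation = {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]}
raw = {"A": [3.0, 4.0], "B": [5.0, 6.0], "C": [7.0, 8.0]}
small_model = {"A": [2.0, 3.0], "B": [4.0, 5.0]}  # trained on A and B only
large_model = {"A": [1.5, 2.5], "B": [3.5, 4.5], "C": [5.5, 6.5]}  # trained on all

eval_sites = ["A", "B"]  # same evaluation sites for both models
ip_small = ip_on_subset(eval_sites, raw, small_model, observation)
ip_large = ip_on_subset(eval_sites, raw, large_model, observation)
```

Holding the evaluation sites fixed is what makes the rows of a column comparable: any difference in IP then comes from the training configuration rather than from the choice of verification stations.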

#### **5. Conclusions**

Model output post-processing is a crucial step for correcting the errors of numerical weather prediction. In this study, we established the "grid-to-multipoint" (G2N) convolutional neural network (CNN)-based deep-learning model for correcting the forecast errors of an operational high-resolution numerical weather prediction system (PRUFS) that runs 24 cycles of 0–24 h forecasts each day over eastern China. G2N corrects model forecast errors by projecting high-resolution gridded model forecasts onto the surface weather observations. G2N was tested for correcting the 2 m temperature, 2 m relative humidity, and 10 m wind speed forecasts of the PRUFS model. The forecast area contains 311 standard surface weather stations. G2N was trained with one year of data (August 2020 to August 2021) and evaluated on an independent test dataset of the real-time operational PRUFS runs during November and December 2021. The training and testing datasets contain all 24 cycles of 0–24 h forecasts per day. The results show that G2N performs well for all corrected surface variables and is computationally efficient. Furthermore, two groups of sensitivity experiments were conducted to evaluate the impact of changing the size of the input gridded model data and varying the number of stations for multitask training on the performance of G2N. The main results are as follows.

(1) The G2N model could effectively extract and use the meso- and micro-scale meteorological circulation features, simulated by the high-resolution NWP forecasts, to infer the weather forecast errors at the target stations. The verification of G2N on the test dataset of the 2-month operational runs shows very good improvement percentages of RMSE, 19.0%, 24.5%, and 42.4% for 2 m temperature, 2 m relative humidity, and 10 m wind speed, respectively, in comparison to the PRUFS forecasts.

(2) Sensitivity experiments with selecting mesoscale model forecast (feature) domains show that the size of the input domain has an important impact on the performance of the G2N model. Inputting an excessively small domain will not feed G2N with sufficient spatial features in the PRUFS forecasts that are relevant to the forecast error at the target stations. On the other hand, an excessively large input domain may introduce unnecessary information that hinders the G2N performance.

(3) Sensitivity experiments with multitask learning strategies (N-exps) show that, for a given input model grid domain, increasing the number of target correction stations within the domain for multitask learning improves the performance of G2N in correcting the errors of all three surface variables. When the three variables (*T*2, *RH*2, *W*10) are corrected with only the 51 sites included in the learning, the RMSE improvement percentages at these 51 sites are 16.4%, 23.7%, and 44.5%. As more sites are included, the RMSE improvement percentages of the three variables at the same 51 sites increase to 20.8%, 27.8%, and 47.4%, an average increase of approximately 3.8 percentage points. G2N gained the largest error correction when all 311 sites were included in simultaneous learning. This finding indicates that a loss function composed of more target stations can incorporate more relevant spatial loss information and thus increase the learning ability of the G2N model.

(4) With its simplicity and high effectiveness, G2N can be readily generalized for post-processing a high-resolution numerical weather prediction system running over other regions. Based on our data and tests, we recommend specifying a patch size of the input model forecast domain with a side dimension of ~600–900 km (200–300 grids) for G2N and including all stations within the domain in the loss function for simultaneous forecast error correction.
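Two of the figures quoted in items (3) and (4) can be checked with simple arithmetic: the average gain of about 3.8 percentage points, and the grid spacing implied by equating ~600–900 km with 200–300 grid points. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# RMSE improvement percentages at the 51 sites for (T2, RH2, W10), from item (3).
ip_fewest_sites = [16.4, 23.7, 44.5]
ip_most_sites = [20.8, 27.8, 47.4]

gains = [after - before for before, after in zip(ip_fewest_sites, ip_most_sites)]
mean_gain = sum(gains) / len(gains)  # in percentage points

# Grid spacing implied by item (4): side length in km divided by grid points.
spacing_low = 600 / 200
spacing_high = 900 / 300
```

The three gains are roughly 4.4, 4.1, and 2.9 percentage points, so temperature and relative humidity benefit more from the additional stations than wind speed does, and both ends of the recommended range imply the same spacing of 3 km per grid point.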

The G2N model developed in this paper has been running operationally along with the PRUFS regional numerical weather system to support valuable applications by several customers. For the domain size and stations corrected in this paper, the training time for G2N with one-year samples takes approximately 6 h wall-clock time on a GPU server with Quadro RTX 8000. The G2N real-time run takes only 297 s. Therefore, G2N is a highly efficient and effective tool for post-processing high-resolution NWP forecasts.

Nevertheless, we note that it would be more informative to assess the G2N model performance over a complete year. Unfortunately, we were not able to access the model data after December 2021. We plan to apply the G2N model to another NWP system in the future and to focus on evaluating its general applicability and the seasonal variation of its performance.

We would also like to note that the input for G2N described in this paper uses only a single-element forecast field, e.g., the PRUFS 2 m temperature forecasts for correcting the 2 m temperature at the surface stations. We tested inputting multiple variables, including surface pressure, humidity, and wind, but obtained a degraded performance. Additionally, the results of the present G2N training were obtained without separating the tags of the different forecast lengths, forecast sequences, or forecast cycles of the day. Activating these tags also degraded the G2N performance. Our future work will aim at understanding these limitations and exploring more complicated deep-learning models, including refined self-attention algorithms that may amplify the contributions of the key features in the input and thus further improve the forecast error correction.

**Author Contributions:** Conceptualization, Y.L.; methodology, Y.Q., X.J., Y.L. and L.Y.; validation, Y.Q., Y.S. and H.X.; formal analysis, Y.Q., X.J., Y.L. and L.Y.; investigation, Y.Q. and Y.L.; resources, Y.L., Y.S. and H.X.; writing—original draft preparation, Y.Q.; writing—review and editing, Y.L., H.X., Z.H. and L.Y.; visualization, Y.Q., X.J. and Z.H.; funding acquisition, Y.L., H.X. and Y.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the Science and Technology Grant No. 520120210003, Jibei Electric Power Company of the State Grid Corporation of China and partially by the National Key R&D Program of China (Grant No. 2018YFC1507901).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The modeling simulations were carried out on the supercomputer provided by the Nanjing University of Information Science and Technology. The authors thank Peng Zhou for providing and processing the model data analyzed in this study and Xing Wang and Daili Qian for valuable discussions.

**Conflicts of Interest:** The authors declare no conflict of interest.
