1. Introduction
As the cornerstone of meteorological and hydrological research, precipitation data have become particularly critical in terms of their accuracy and reliability in the context of climate change and increasingly extreme weather events. Acquisition of precipitation data relies principally on technical means such as surface observation stations, radar, and satellite remote sensing, each of which has its own specific advantages and limitations [1,2]. Although radar and satellite remote sensing technologies have a wide coverage area, the accuracy and resolution of their retrieved data are affected by factors such as cloud cover and terrain [3,4,5]. Consequently, surface precipitation observation remains the primary source of precipitation data used by local meteorological departments for their operations and precipitation research. In recent years, China has fully automated its acquisition of surface meteorological observations, and a large and dense network of approximately 65,000 automatic weather stations (AWSs) has been established nationwide. Previous studies have proven that the assimilation of observation data from AWSs can effectively improve the accuracy of high-resolution regional numerical weather forecasting [6]. However, owing to the wide and dense distribution of AWSs, the substantial differences in terrain, and the characteristics of high-resolution real-time data, the unstable quality of AWS precipitation data has long posed a challenge to practical forecasting operations and assimilation research [7,8].
The spatial distribution of precipitation is strongly nonuniform, which results in considerable uncertainty in estimating precipitation, even for reasonably small areas. The processing of surface precipitation observations involves multiple steps, such as data quality control (QC), interpolation, and scale conversion. Small errors in any of these steps can be amplified in the final estimation of precipitation, leading to notable biases [9], and the currently increasing occurrence of extreme precipitation events further increases the uncertainty of the quality of precipitation data [10].
High-quality precipitation observations are the foundation of precipitation research, and strict QC is the main approach adopted to improve the quality of such observational data. The methods adopted for QC of precipitation data have also been the focus of previous studies [11]. For example, Boulanger et al. [12] used a decision tree algorithm to perform QC on the daily observational data of the Argentine National Meteorological Agency for the period 1959–2005. They detected a large number of erroneous precipitation and temperature data and verified the applicability of their method in other countries. Hamada et al. [13] developed an automatic QC system for detecting erroneous information in daily rainfall observations, which can automatically and objectively identify erroneous data. Dandrifosse et al. [14] developed a rapid QC method for meteorological parameters, such as temperature, pressure, humidity, and wind, observed via meteorological stations deployed in farmland; it can perform real-time QC of the data and has a low misjudgment rate. In addition to multiple-station collaborative QC based on the spatial continuity of atmospheric variables, some studies have also introduced statistical analysis methods such as spatial regression [15], inverse distance weighting [16], and interpolation [17] into collaborative QC research. In recognition of the needs of data assimilation for research and operational purposes, QC approaches based on model background fields have also been developed. For example, when the temperature difference between the observed data and the model background field exceeds a given threshold, the data are deemed unusable. This can effectively prevent the overall assimilation effect from being compromised by observational data that deviate too far from the background field [18,19].
Although research on the QC of precipitation data has developed to a certain extent, the prominent spatiotemporal characteristics of localized precipitation make it difficult to use traditional threshold determination methods such as boundary value checks and climate extreme value ratios. Methods such as internal consistency verification, temporal continuity assessment, and spatial consistency analysis all rely on the spatiotemporal continuity of variables [20], and the randomness of precipitation limits the effectiveness of these methods in performing QC. This limitation becomes even more prominent in summer, when severe convective weather occurs frequently. In addition to statistics-based QC methods, radar-based quantitative precipitation estimation has been introduced for QC discrimination of cumulative precipitation at AWSs [21,22]. However, such an approach cannot meet the real-time requirements of operational applications; therefore, further exploration is needed regarding methods for performing QC of AWS precipitation data.
Although much progress has been achieved in research on QC methods for meteorological data, current studies mainly focus on observational data of continuous atmospheric variables, and the variation patterns of the meteorological variables are generally applied based on the physical laws of large-scale weather systems. With the continued deployment of surface AWSs in China, the rapid increase in the density of their distribution has provided a valuable opportunity to capture small-scale surface weather information with high spatiotemporal resolution. Therefore, determining how best to utilize the high spatiotemporal resolution of surface AWS data, overcome the limitations imposed by the prominent spatiotemporal characteristics of localized precipitation [23], and avoid the influence of small-scale weather systems on the QC process are recognized as important issues in developing QC methods suitable for high spatiotemporal resolution surface AWS precipitation data.
Because observational data contain a great deal of meso- and micro-scale weather information, observed values often exceed the threshold range of conventional QC methods, and misjudgments therefore occur. The EOF method can decompose three-dimensional variables into a linear combination of different spatial modes and their corresponding time coefficients. Furthermore, based on the scale differences of the characteristics represented by different modes, the three-dimensional variables can be decomposed into a primary term composed of the first n modes, which exhibit relatively large-scale structural features, and a residual term predominantly characterized by small-scale random variations. Based on this principle, a quality control method based on EOF decomposition was proposed [24]. This method aims to reduce the adverse impact of the weather system on the quality of observed data, thus ensuring the accuracy and reliability of the quality control results [25,26].
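In the notation adopted here for illustration (the symbols are ours, not taken from [24]), a field $X(s,t)$ observed at $M$ stations can be written as

$$X(s,t)=\underbrace{\sum_{k=1}^{n}e_{k}(s)\,p_{k}(t)}_{\text{primary term: large-scale structure}}+\underbrace{\sum_{k=n+1}^{M}e_{k}(s)\,p_{k}(t)}_{\text{residual term: small-scale variations}},$$

where $e_{k}$ is the $k$-th spatial mode and $p_{k}$ its time coefficient; the QC described below operates on the residual term.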
Since the introduction of the EOF-based QC method, numerous meteorologists have applied and improved it. Shao et al. [27] used ERA5 reanalysis data as the background field and established a surface temperature QC method suitable for high spatiotemporal density data based on the EOF-based QC method. The method was tested in the central and eastern regions of China and yielded satisfactory results. Shen et al. [28], leveraging the EOF method for identifying anomalous observational data, proposed a data repair technique based on iterative EOF analysis that effectively restores erroneous surface AWS temperature observations. Building upon an analysis of the spatial scale and error distribution characteristics of surface temperature, Shang et al. [29] established an EOF-based QC methodology for surface temperature observations that depends entirely on observational data. This approach can accurately identify erroneous observational data while effectively circumventing the impacts of background field errors, topographical influences, and local weather changes. This body of research has proven that the EOF-based QC method can effectively avoid the influence of weather systems on precipitation data, providing new ideas for the QC of precipitation data.
Despite recent progress, further research is required to ascertain how best to apply EOF-based QC methods to precipitation observation data. The prominent spatiotemporal characteristics of localized precipitation have long represented a problem in controlling the quality of precipitation data. As a statistical method, EOF analysis is suited to variables with reasonable spatiotemporal continuity, but the suddenness of precipitation can seriously undermine its temporal continuity as a variable. Gebremichael et al. [30] found that precipitation data can be made to follow a normal distribution through temporal averaging. Therefore, precipitation accumulation can be used to construct QC data with reasonable temporal continuity. High spatial resolution surface AWS precipitation data allow for the accurate identification of observed precipitation extremes caused by extreme weather conditions and complex terrain. However, EOF analysis is typically used to identify the large-scale spatial structure of meteorological variables. Therefore, the problem of how best to use EOF analysis to extract small-scale precipitation information also urgently requires further research.
This study focused on the high spatiotemporal resolution characteristics of AWS data. Based on the cumulative conversion of hourly precipitation observation data, a method based on EOF analysis was developed to perform QC using only observational data. Through in-depth analysis of the spatial correlation scale and probability distribution characteristics of AWS precipitation data, the regional scope and relevant thresholds of the EOF-based QC were determined objectively. Finally, a QC method suitable for high spatiotemporal resolution surface AWS precipitation observation data was constructed. The method can be integrated into early warning systems, to avoid false alarms caused by erroneous data, and into assimilation systems, for real-time QC of precipitation data, offering strong prospects for operational application.
The remainder of the paper is structured as follows: Section 2 introduces the data and the preprocessing method used in the study. Section 3 describes the spatiotemporal and probability distribution characteristics of cumulative precipitation. Section 4 presents the partition-EOF quality control method. Section 5 describes how the QC thresholds and criteria were determined. Section 6 describes the method used to determine the occurrence time of incorrect hourly precipitation. Section 7 demonstrates the superiority of the proposed method through comparison with operational data. Finally, the conclusions are presented in Section 8.
2. Data Sources and Preprocessing
This study used high-density AWS precipitation data provided by the Jiangxi Provincial Meteorological Bureau. Overall, the study considered data from 2530 AWSs distributed over the region 24.5–30°N, 113.5–118.5°E. The minimum distance between AWSs is approximately 1 km. The study period extended from 09:00 on 20 February 2023 to 08:00 on 31 May 2023 (unless specified otherwise, all times are UTC), i.e., a total of 2400 h. To verify the QC results, this study incorporated precipitation data for the same period from both the Tianqing operational system QC [31] and the real-time hourly dataset of the China Meteorological Administration Multisource-merged Precipitation Analysis System (CMPAS), which is developed from surface precipitation observations, radar quantitative precipitation estimates, and satellite-retrieved precipitation using key technologies such as bias correction and fusion analysis [32]. Additionally, this study used radar reflectivity data.
The Tianqing system is a meteorological big data cloud platform developed by the China Meteorological Administration and is the basic platform supporting meteorological operations. It has massive storage capacity and powerful data output capability, covering the full data lifecycle of transmission and collection, processing, storage and service, and analysis and monitoring. The precipitation QC methods in the Tianqing system mainly include boundary value checks, range checks, spatial consistency checks, internal consistency checks, time-variation checks, persistence checks, and comprehensive checks.
CMPAS is a real-time precipitation fusion analysis product produced by the National Meteorological Information Center and is available as two-source and three-source fusion products. This study uses the three-source fusion product, which applies a probability density function matching method to correct the systematic biases of the radar-estimated and satellite-retrieved precipitation products. A Bayesian model averaging method is then used to combine the radar and satellite precipitation products into a background field covering China, and finally an optimal interpolation method is used to merge the surface observation data. The surface observation data used in CMPAS are provided by Tianqing.
In this study, we first preprocessed the precipitation data by calculating the cumulative precipitation. Specifically, for the hourly precipitation data, we carried out a moving accumulation within a window extending 120 h both before and after each hour, thus creating a 10-day cumulative precipitation sequence. The cumulative precipitation data for the entire Jiangxi region form a 2530 × 2160 matrix. We performed EOF decomposition on the cumulative precipitation data. For a 0.5° × 0.5° region, assuming there are 240 h of observational data from M stations, the precipitation observation data matrix can be expressed as $X=(x_{m,t})_{M\times 240}$, and the covariance matrix can be expressed as $C=\frac{1}{240}\,XX^{\mathrm{T}}$, where each row of $X$ has had its time mean removed.
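As a rough illustration of this preprocessing, the following sketch (with synthetic stand-in data; the gamma draws, the 50-station subregion slice, and all array shapes are assumptions for demonstration only) computes the rolling 240-h accumulation and the subregion covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the AWS record: 2530 stations x 2400 h of hourly precipitation.
rain_hourly = rng.gamma(0.2, 2.0, size=(2530, 2400))

WINDOW = 240  # 10-day accumulation: 120 h before and after each hour

# Rolling 240-h accumulation via a prefix sum along the time axis.
csum = np.cumsum(np.pad(rain_hourly, ((0, 0), (1, 0))), axis=1)
rain_acc = csum[:, WINDOW:] - csum[:, :-WINDOW]  # one column per valid window

# One 0.5 x 0.5 deg subregion: X is (M, 240), i.e., M stations by 240 h.
X = rain_acc[:50, :240]                  # stand-in subregion selection
X = X - X.mean(axis=1, keepdims=True)    # remove each station's time mean
C = X @ X.T / X.shape[1]                 # (M, M) covariance matrix
```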
3. Spatiotemporal and Probability Distribution Characteristics of Cumulative Precipitation
While hourly precipitation can exhibit prominent abruptness, cumulative precipitation data have better continuity [33], especially as the accumulation time increases, with the distribution of the precipitation data becoming closer to a normal distribution.

Figure 1 shows the cumulative precipitation change curve for different averaging periods. It is evident that as the averaging period increases, data continuity also increases, verifying the findings of previous research [34].
To better clarify the characteristics of the precipitation observation data, the spatial distributions of the maximum, median, and variance of the hourly precipitation data and of the cumulative precipitation data are presented here (Figure 2). The northern part of Jiangxi is predominantly flat, while the central and southern regions are characterized by hilly terrain. Influenced by the topography, stations with hourly precipitation exceeding 50 mm are primarily concentrated in the central area. The regions with high standard deviation values are situated in the central-eastern and southern parts, where extreme precipitation is more likely to occur. The median values across all stations in the province are predominantly clustered between 0.5 and 0.6 mm.
The high-value regions for ten-day accumulated precipitation extremes are predominantly found in the central-eastern and southern areas. The median values in the northern and central parts are generally above 60 mm, whereas those in the southern region are clustered between 30 and 60 mm. The areas of high standard deviation largely overlap with the regions of high extremes. A comparison between hourly and accumulated precipitation shows that the ratio of the maximum to median values for hourly precipitation is considerably greater than that for accumulated precipitation. Additionally, the high values of accumulated precipitation are concentrated, while precipitation in other areas is more evenly distributed. This suggests that the spatiotemporal continuity of accumulated precipitation is superior to that of hourly precipitation.
Probability distribution characteristics can be used to quantitatively evaluate the continuity of data. Figure S1 shows the precipitation probability distribution corresponding to each curve illustrated in Figure 1. It is evident that the hourly precipitation observational data do not follow a normal distribution. However, as the cumulative period increases, the precipitation probability distribution gradually approaches a normal distribution. When the averaging period reaches 10 days, the precipitation probability distribution broadly follows a normal distribution (dashed line in the figure). The skewness and kurtosis coefficients of the cumulative precipitation probability distribution for different averaging periods were also calculated (Figure 3). As the cumulative period increases, both coefficients gradually diminish. At 10 days, they tend to stabilize, with a skewness coefficient of approximately 1 and a kurtosis coefficient of approximately 2. Because the timeliness of the data gradually decreases as the cumulative duration increases, the cumulative period considered in this study was set to 10 days.
To determine the optimal accumulation length more precisely, the skewness and kurtosis coefficients of the probability density function of the data were calculated as functions of the accumulation time. The precipitation data were processed with rolling accumulation lengths ranging from 1 to 11 days, and curves of the kurtosis and skewness coefficients of the probability distribution function were generated. After the accumulation time reaches 9 days, both coefficients tend to stabilize. For the convenience of subsequent data processing, a 10-day period was chosen for the statistical analysis.
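A minimal sketch of this scan, reusing the synthetic rain_hourly array from the preprocessing sketch (note that scipy reports excess kurtosis by default):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Scan accumulation lengths of 1-11 days and track the distribution shape,
# as in Figure 3. rain_hourly is the synthetic array defined above.
csum = np.cumsum(np.pad(rain_hourly, ((0, 0), (1, 0))), axis=1)
for days in range(1, 12):
    w = 24 * days
    acc = (csum[:, w:] - csum[:, :-w]).ravel()  # pool all stations and windows
    print(f"{days:2d} d  skew = {skew(acc):6.3f}  "
          f"excess kurtosis = {kurtosis(acc):6.3f}")
# Both coefficients flatten near 9-10 days, motivating the 10-day window.
```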
4. Quality Control Method Based on Partition EOF
Precipitation has prominent spatiotemporal variations attributable to interactions between mesoscale disturbances and local conditions such as topography, which produce strong localized characteristics [35]. In conventional QC methods, spatial consistency checks often determine the correctness of data based on the difference between precipitation observed at a specific station and that recorded at surrounding stations. However, high-density AWS data can reflect the notable horizontal changes in precipitation associated with small-scale weather disturbances. This is especially the case at the edges of weather systems, where adjacent stations are prone to large differences. Therefore, it is not possible to directly identify erroneous data based on the difference between observations at adjacent stations.

Accurate extraction of the small- and medium-scale variational characteristics of high-resolution precipitation data is fundamental to ensuring the accuracy of QC. Previous studies clearly indicated that the spatial scale distinguished via EOF analysis is positively correlated with the spatial range it covers [36], which means that narrowing the regional scope of the EOF analysis can effectively improve its ability to capture small- and medium-scale precipitation information. To define the optimal spatial range for EOF analysis scientifically and reasonably, the primary task is to clarify the spatial scale characteristics of the weather systems that can be distinguished by AWS precipitation data. Given that the correlation of precipitation between stations is inevitably regulated by weather conditions, the curve of the correlation coefficient with distance can accurately reflect the spatial scale characteristics of the observational data. This feature was quantified using the following steps.
- 1.
Calculate the temporal correlation coefficient between each station and the surrounding stations as follows:
$$r_{ij}=\frac{\sum_{t=1}^{T}\left(x_{i,t}-\bar{x}_{i}\right)\left(x_{j,t}-\bar{x}_{j}\right)}{\sqrt{\sum_{t=1}^{T}\left(x_{i,t}-\bar{x}_{i}\right)^{2}}\sqrt{\sum_{t=1}^{T}\left(x_{j,t}-\bar{x}_{j}\right)^{2}}}$$
where $x_{i,t}$ and $x_{j,t}$ denote the cumulative precipitation at stations $i$ and $j$ at time $t$, and $\bar{x}_{i}$ and $\bar{x}_{j}$ are the corresponding time means.
- 2.
Calculate the distance between each station and the surrounding stations as follows:
$$dis=r\cdot\arccos\left(\sin\varphi_{1}\sin\varphi_{2}+\cos\varphi_{1}\cos\varphi_{2}\cos\left(\lambda_{1}-\lambda_{2}\right)\right)$$
where dis is the distance between two stations; r is the radius of the Earth; $\varphi_{1}$ and $\varphi_{2}$ represent the latitudes (in radians) of the first and second station, respectively; and $\lambda_{1}$ and $\lambda_{2}$ represent the longitudes (in radians) of the first and second station, respectively.
- 3.
Calculate the maximum correlation coefficient between stations within different distances.
The correlation between precipitation amounts at two stations is often influenced by various factors such as terrain, vegetation, and water bodies. In most cases, these interfering factors tend to reduce the correlation coefficients of precipitation amounts. Therefore, in order to avoid the influence of the above factors on the statistical results, the maximum value of the correlation coefficient of all stations at each distance is selected to draw the curve of the correlation coefficient with distance.
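A minimal sketch of steps 1–3, reusing the synthetic rain_acc array and rng from the preprocessing sketch (the station coordinates here are random stand-ins for the AWS metadata, and the 5-km bin width is an assumption):

```python
import numpy as np

R_EARTH_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km); latitudes/longitudes in radians."""
    cos_d = (np.sin(lat1) * np.sin(lat2)
             + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return R_EARTH_KM * np.arccos(np.clip(cos_d, -1.0, 1.0))

# Random stand-in coordinates over the study region (radians).
lats = np.deg2rad(rng.uniform(24.5, 30.0, size=rain_acc.shape[0]))
lons = np.deg2rad(rng.uniform(113.5, 118.5, size=rain_acc.shape[0]))

corr = np.corrcoef(rain_acc)  # step 1: temporal correlations between stations
dist = great_circle_km(lats[:, None], lons[:, None],
                       lats[None, :], lons[None, :])  # step 2: distances

# Step 3: maximum correlation coefficient within each 5-km distance bin.
iu = np.triu_indices_from(corr, k=1)
bin_idx = np.digitize(dist[iu], np.arange(0.0, 705.0, 5.0))
max_corr = {b: corr[iu][bin_idx == b].max() for b in np.unique(bin_idx)}
```

Taking the per-bin maximum rather than the mean reflects the reasoning above: the interfering factors mostly lower the correlation, so the maximum better isolates the weather-driven spatial scale.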
Figure 4 shows the maximum correlation coefficient between the precipitation sequences at two stations as a function of station separation. The maximum correlation coefficient between stations evidently declines with increasing distance, which is attributable to the spatial continuity of precipitation. The correlation coefficient is reasonably stable at distances of 30–50 km, with a magnitude of approximately 0.997, indicating that relatively stable spatial features of the corresponding scale exist in the precipitation data. Therefore, the single analysis area adopted in this study was set at 0.5° × 0.5°.
Higher-order EOF modes represent the small-scale characteristics of the data. This is because the observed precipitation at a station generally consists of two components. First, there is precipitation similar to that at surrounding stations, which is often caused by a large-scale weather system; in this case, the precipitation in a region exhibits spatial structural characteristics. Second, there is precipitation caused by local small-scale weather, which is independent of the surrounding stations. The precipitation component with regular spatial structure does not conform to a random distribution, and the first few EOF modes can effectively extract this spatially structured precipitation information. Therefore, the remaining precipitation information composed of the higher-order EOF modes is often uncorrelated among stations, which means that the precipitation sequence composed of the residuals of all stations exhibits the statistical characteristics of a random distribution. Consequently, in determining whether the EOF QC method can be applied to cumulative precipitation data, it must first be clarified whether the reconstruction from its high-order modes satisfies a normal distribution; before performing QC, it is therefore necessary to identify the high-order EOF modes that are suitable for EOF-based QC of precipitation data.
The skewness coefficient and the kurtosis coefficient are important metrics in statistics because they describe the symmetry and steepness or smoothness of a data distribution. As a special form of distribution, the normal distribution has specific values of skewness and kurtosis. Therefore, this study used these two statistical metrics to quantitatively analyze the frequency distribution of the high-order EOF modal reconstruction field. In recognition of the need to conduct multiple experiments to obtain the general pattern of the frequency distribution, in addition to the research period (10:00 on 9 March 2023 to 10:00 on 19 March 2023), two further periods were randomly selected for analysis: 10:00 on 4 March 2023 to 10:00 on 14 March 2023 and 10:00 on 14 March 2023 to 10:00 on 24 March 2023.
To determine the high-order EOF modes suitable for QC of precipitation data, Figure S2 shows the frequency distribution of the reconstructed field of high-order EOF modes after gradual extraction of the first four modes, together with the skewness and kurtosis coefficients of the reconstructed field. The black dashed line in the figure represents the closest standard normal distribution function curve, defined as the normal distribution whose standard deviation is consistent with the observed data. It is evident from Figure S2 that when the number of extracted modes increases to three, the residual field broadly follows a normal distribution. The skewness coefficient of the residual field is 0.22 after the first mode is extracted and rises to 0.66 after the second; when the third and fourth modes are extracted, it stabilizes at −0.199 and −0.15, respectively, so the residual field has reasonable symmetry from the third mode onward. The kurtosis coefficient is 14.95 after the first mode and rises to 45.87 after the second; after the third and fourth modes, it stabilizes at 16.62 and 18.06, respectively, and the distribution of the residual field becomes more uniform from the third mode onward. To further clarify the rationality of the modal threshold, the variation curves of the kurtosis and skewness of the residual fields after extracting the first one to six modes are presented here (Figure 5). The absolute value of the skewness reaches its maximum after the first two modes and then gradually decreases as the number of modes increases. The kurtosis reaches its minimum after the first three modes and then increases as the number of modes continues to rise, mainly because extracting more modes leaves a large number of precipitation values close to 0, which inflates the kurtosis. Therefore, choosing an appropriate modal threshold has a crucial impact on the QC effect. Combining the skewness and kurtosis coefficients, it can be established that when the first three modes are extracted, the residual field is closest to a normal distribution while retaining reasonable symmetry. The results in Figure 6 further demonstrate that choosing the first three modes as the threshold is highly reasonable.
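A minimal sketch of this mode-threshold check, reusing the centered subregion matrix X from the preprocessing sketch (skewness/kurtosis values from synthetic data will of course differ from those quoted above):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Remove the first k EOF modes in turn and track the shape of the residual
# distribution (cf. Figures 5 and S2). X is the centered subregion matrix
# from the preprocessing sketch.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
for k in range(1, 7):
    resid = X - (U[:, :k] * s[:k]) @ Vt[:k]
    print(f"first {k} modes removed: skew = {skew(resid.ravel()):6.3f}, "
          f"excess kurtosis = {kurtosis(resid.ravel()):6.3f}")
```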
To clarify whether using the first three modes can meet the QC requirements, the spatial distributions of the reconstructed field and residual field at 01:00 on 14 March 2023 are provided, and the explained variance and cumulative explained variance were calculated (Figure 6). Comparison of the reconstructed field, residual field, and observational data shows that EOF analysis can effectively extract meso- and small-scale weather information from the observational data, with anomalies manifesting as large values in the residual field, thus allowing erroneous stations to be identified effectively. In terms of explained variance, when the third mode is included, the cumulative explained variance reaches 99.24%. Therefore, extracting the first three modes can effectively capture the weather system information contained in the data.
As described above, the analysis area was defined by calculating the variation of the maximum correlation coefficient between stations with distance. During QC, the study area is divided into 0.5° × 0.5° subregions starting from the bottom-left corner, with each new subregion shifted incrementally by 0.25° to the right or upward so that adjacent subregions overlap. Rolling QC experiments can then be conducted for each subregion separately.
The specific steps of the experimental plan were as follows.
Use the EOF analysis method to decompose the 3D data Rain into two parts in each subregion, i.e., the reconstruction from the first n modes and the reconstruction from the remaining modes; the observation can then be expressed as follows:
$$Rain = Rain_{\mathrm{rec}} + Rain_{\mathrm{res}}$$
where $Rain$ represents the observed precipitation at the observation station, $Rain_{\mathrm{rec}}$ represents the field reconstructed from the first n modes extracted from the observational data following EOF analysis, and $Rain_{\mathrm{res}}$ represents the residual field after the first n modes are extracted, where n is taken as 3.
After obtaining the reconstructed field and the residual field, calculate the standard deviation $\sigma$ of $Rain_{\mathrm{res}}$ over all stations at each time, compare $Rain_{\mathrm{res}}$ at each time with the corresponding $\sigma$, and define outliers within the subregion as stations for which $\left|Rain_{\mathrm{res}}\right| > 3\sigma$.
After obtaining the distribution of outliers in each subregion at each time, for cases where subregions overlap, a station in the overlapping region is considered an outlier at a given time only if it is adjudged an outlier in both subregions at that time. Finally, the distribution of outliers at each time in the studied area is obtained.
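The steps above can be sketched as follows (a non-authoritative illustration: the SVD route to the EOF modes, the returned explained-variance diagnostic, and the 3σ rejection coefficient reflect our reading of the criterion reconstructed above):

```python
import numpy as np

def eof_qc_subregion(X, n_modes=3, k_sigma=3.0):
    """Flag outliers in one (M, 240) block of accumulated precipitation.

    n_modes = 3 follows Section 4; the 3-sigma rejection coefficient is our
    reading of the criterion above, not a value quoted from the paper.
    """
    X_anom = X - X.mean(axis=1, keepdims=True)
    # EOF via SVD: columns of U are spatial modes, rows of Vt time coefficients.
    U, s, Vt = np.linalg.svd(X_anom, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # per-mode explained variance
    rain_rec = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]  # first-n modes
    rain_res = X_anom - rain_rec                              # residual field
    sigma = rain_res.std(axis=0)             # std over stations at each time
    return np.abs(rain_res) > k_sigma * sigma, explained

# Overlapping 0.5-deg tiles stepped by 0.25 deg: a station in an overlap counts
# as an outlier at a time only if both covering subregions flag it there.
mask, explained = eof_qc_subregion(X)        # X from the preprocessing sketch
```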
5. Determination of QC Methods
Owing to the lack of true values, an ideal experiment was first used to evaluate the accuracy of the newly proposed QC method. This so-called ideal experiment refers to the addition of different levels of artificial error to the observational data to test the capability of the new QC method in recognizing different levels of error. Three stations were selected at random: station A (27.89°N, 116.26°E), station B (27.64°N, 114.01°E), and station C (27.35°N, 116.34°E). At 00:00 on 14 March 2023, artificial erroneous data were added to the hourly precipitation records of these three stations. Because the 10-day cumulative precipitation at that time was 25–45 mm, artificial precipitation of 5–60 mm was added at equal intervals to the hourly precipitation data of the three stations, and the results were compared with the QC results for the original data. The precipitation trends and exclusion results of the three stations were similar; consequently, the results for station B are displayed here.

Comparison of Figure 7a,b reveals that the artificial erroneous disturbances affect the 10-day cumulative precipitation data for 120 h before and after the perturbed time. It is evident from Figure 7a that at 00:00 on 14 March 2023 (coordinate 230 in the figure), when the disturbance precipitation increased to 15 mm, continuous exclusion occurred over the period of coordinates 230–290; at 25 mm, the 10-day cumulative precipitation data showed continuous exclusion over the period of coordinates 200–300, indicating that the exclusion results tended to stabilize from 25 mm onward. In the original observational data (Figure 7b), by contrast, no exclusion occurred around that time, indicating that exclusion begins when the disturbance precipitation approaches the original 10-day cumulative precipitation and stabilizes once the disturbance reaches or exceeds it.
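A sketch of this ideal experiment under the same synthetic setup (the perturbed station index and hour are illustrative stand-ins, not the real coordinates of station B):

```python
import numpy as np

# Inject an artificial error into one hour of one station's record and rerun
# the QC chain. rain_hourly and eof_qc_subregion come from the sketches above.
for err_mm in np.arange(5.0, 61.0, 5.0):
    rain_pert = rain_hourly.copy()
    rain_pert[10, 1300] += err_mm                  # perturb one station-hour
    csum = np.cumsum(np.pad(rain_pert, ((0, 0), (1, 0))), axis=1)
    acc = csum[:, 240:] - csum[:, :-240]           # perturbed 10-day accumulation
    mask, _ = eof_qc_subregion(acc[:50, :240])     # partition-EOF QC
    print(f"error {err_mm:4.1f} mm -> hours flagged at station 10: "
          f"{int(mask[10].sum())}")
```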
It should be clarified that the above analyses were based on cumulative precipitation data for QC. In practical application, it is necessary to clarify those observational data that are incorrect at specific times. From the actual QC results of accumulated precipitation, it is evident that there will be continuous exclusions in the QC results. Therefore, the stability of the QC results can be used as a basis for judging the correctness of precipitation data at a single moment.
To establish a suitable threshold, this study statistically analyzed the distribution of excluded precipitation data within the study area at different thresholds (Figure 8a). When the threshold is small, the excluded precipitation data are more extreme, and the extreme values of the excluded precipitation gradually decrease as the threshold increases. The median of the excluded precipitation data changes little with the threshold, but the distribution shows that the excluded data become more homogeneous as the threshold increases. Thus, an ideal exclusion result can be obtained by setting a threshold while also ensuring a stable exclusion rate. Therefore, the total frequency of data exclusion and the standard deviation over time of the rate of data exclusion for all stations within the study area were calculated for different thresholds within 240 h (Figure 8b). The overall data exclusion frequency and the standard deviation both decrease as the threshold increases, indicating that the rate of data exclusion gradually decreases and becomes more stable over time. The standard deviation of the frequency and rejection rate over time decreases rapidly in the threshold range of 24–72 and slows markedly in the range of 72–192. Because the standard deviations of the frequency and rejection rate over time are stable, and the rejection rate remains high, in the three ranges of 48–72, 72–96, and 96–120, the medians of the two outer ranges (60 and 108) were taken as bounds, and the threshold was sought within the range of 60–108.
To visualize the data exclusion at different thresholds more clearly, the time variation curves of the hourly rate of data exclusion in the study area under equally spaced thresholds are shown in Figure 9. As the threshold increases, the rate of exclusion gradually diminishes, and the fluctuation amplitude of the curve decreases. When the threshold increases to 84, the rate of exclusion stabilizes at approximately 1%. Heavy precipitation is usually generated by small- and medium-scale weather systems with life cycles of 1–3 days; therefore, as the threshold continues to increase, the proportion of exclusion periods within the statistical period declines, and the fluctuation of the exclusion rate tends to increase. Consequently, the threshold was set to 84.
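The threshold scan can be sketched as follows, under our reading that the threshold is the minimum number of flagged hours a station must accumulate within the 240-h window before its exclusions are accepted (an interpretation, not the paper's verbatim definition):

```python
import numpy as np

# Scan candidate thresholds on the number of flagged hours per station within
# the 240-h window; `mask` is the outlier mask from the partition-EOF sketch.
for thr in range(24, 193, 12):
    counts = mask.sum(axis=1)                    # flagged hours per station
    rejected = mask & (counts >= thr)[:, None]   # keep only persistent flags
    rate = rejected.mean(axis=0)                 # hourly rejection rate
    print(f"thr = {thr:3d}  total = {int(rejected.sum()):5d}  "
          f"rate std = {rate.std():.4f}")
# In the paper's analysis, the curves stabilize near thr = 84.
```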
6. Determination of the Occurrence Time of Incorrect Hourly Precipitation
Given that incorrect precipitation data at a single point in time can sometimes lead to notable deviations in the cumulative precipitation for up to 84 h or even 10 days, relying solely on threshold-based determination of erroneous data might inadvertently mistake correct data at adjacent time points as erroneous, thereby affecting the overall accuracy of the data. To overcome this problem, it is particularly important to introduce auxiliary criteria to accurately identify the specific moment at which erroneous precipitation occurs.
Using EOF analysis, it is possible to effectively extract the main features from complex weather information, while the residual field tends to follow a normal distribution pattern. In this context, any extreme values that deviate from the norm will appear as statistically significant extrema in the residual field. Based on this characteristic, extreme points in the residual field can be used as indicators to identify and locate the specific points in time of erroneous precipitation records.
To demonstrate the effectiveness of this method visually, the distribution of the residual field during the period of concentrated removal of erroneous data in the ideal experiment is shown in Figure 10. The residual field of the excluded data exhibits sudden changes in its temporal variation curve, such as at time 280 on the right-hand side of Figure 10a and at time 180 on the left-hand side of Figure 10c. Combined with the precipitation variation curve, it is apparent that these sudden changes occur when precipitation changes markedly within a short period. Moreover, as clearly shown in the figure, at the moment the erroneous precipitation data were introduced, the residual field reached its extreme value within the exclusion period. This underpins the proposed method: by combining the threshold-based judgment with analysis of the time at which the residual field extremum appears, the specific time at which erroneous precipitation occurs can be identified more accurately.
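The auxiliary criterion reduces to a short routine; a minimal sketch, assuming the residual series and exclusion mask for one station are available:

```python
import numpy as np

def locate_error_time(resid_series, excluded):
    """Return the index of the hour with the largest |residual| among the
    excluded times of one station, or None if nothing was excluded.

    resid_series: (T,) residual-field values at one station.
    excluded:     (T,) boolean exclusion mask for the same station.
    """
    flagged = np.where(excluded)[0]
    if flagged.size == 0:
        return None
    return int(flagged[np.argmax(np.abs(resid_series[flagged]))])
```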
After determining the entire set of QC procedures, the complete quality control workflow is summarized in a flow chart (Figure 11).
7. Comparative Analysis of EOF Quality Control Effect with Operational Data
To confirm the effectiveness of the threshold, the 10-day cumulative precipitation data of three consecutive time nodes (i.e., 23:00 on 13 March 2023, 00:00 on 14 March 2023, and 01:00 on 14 March 2023) in the same region were randomly selected for analysis.
Figure 12 visually displays the precipitation at each time and the exclusion status before and after application of the threshold. At 23:00 on 13 March, there were significant differences in precipitation between the removed stations and the surrounding stations (such as an extreme value of 85.9 mm or an abnormal extreme value of 0 mm), and these data continued to be removed at 00:00 and 01:00 on 14 March. Comparing the rejection results at 01:00 on 14 March with Figure 6b, it is found that the locations of the rejected stations coincide exactly with the regions of large absolute values in Figure 6b. This further confirms that the erroneous precipitation information remains in the residual field. It is also apparent that there would be a risk of misjudgment if the threshold were not applied. For example, at 23:00 on 13 March, a station with precipitation of 35.0 mm was identified as abnormal despite showing no significant anomalies relative to neighboring stations, and this misjudgment persisted in subsequent removals. After the threshold was added, this phenomenon did not occur. Examining the number of times each station was removed during the QC cycle, it was found that stations differing significantly from surrounding stations were removed considerably more often. Therefore, it can be concluded that setting a threshold on this basis is reasonable and effective.
After the threshold and auxiliary judgment criteria were determined, the new QC algorithm was added to the QC program to control the hourly precipitation observation data. To analyze whether the addition of the new algorithm improves the QC effect, the spatial correlation coefficient and the root mean square error between the AWS data and the CMPAS data, before and after QC, are presented in Figure 13. The spatial correlation coefficient generally increases after QC, with a maximum increase of approximately 0.01, and the root mean square error between the QC data and the CMPAS data generally decreases, with a maximum reduction of approximately 1.
Having established that the new algorithm effectively improves the QC of the data, we further explored its specific impact. The data before and after EOF-based QC were compared with the data after Tianqing QC and with the CMPAS precipitation data to evaluate the strengths and weaknesses of EOF-based QC relative to traditional QC methods (Figure 14). At time 338, a single point of heavy precipitation appeared in the left-hand part of the figure in the original data but not in the CMPAS precipitation data; the EOF-based QC method accurately identified the station with abnormal heavy precipitation, whereas the Tianqing QC method failed to identify it. At time 404, no precipitation occurred in the CMPAS precipitation data, but false weak precipitation appeared at two stations in the original data; the EOF-based QC method successfully identified and removed this precipitation, while the Tianqing QC method failed to recognize these two stations. At time 506, two stations with zero precipitation in the middle and on the right-hand side of the figure were surrounded by stations with light to moderate rain in the original data, whereas both stations had light rain in the CMPAS precipitation data; the EOF-based QC method accurately identified these two stations with abnormal zero values, while the Tianqing QC method did not. This comparison shows that the EOF-based QC method is better at recognizing stations with abnormal heavy precipitation, false weak precipitation, and erroneous zero precipitation than traditional QC methods.
To verify whether this method can effectively retain accurate local heavy precipitation data, the hourly precipitation data at 18:00 on 21 May 2023 and the radar echoes covering the short-term heavy precipitation generated during this period are shown in Figure 15. Strong echoes of >55 dBZ existed in the area 26.3–26.5°N, 116.3–116.5°E; these echoes were stable and nearly stationary, resulting in short-term heavy precipitation with localized rainfall intensity of over 80 mm/h. When the Tianqing QC system detected this short-term heavy rainfall, it was adjudged incorrect data, and only after manual verification was it reassessed as correct. In contrast, the EOF-based QC method effectively extracts information from small- and medium-scale weather systems and therefore retained these data.
To further verify the effectiveness of this QC method in different regions and seasons, this study also conducted QC experiments on hourly precipitation observation data from 1–25 August 2024 in Hunan Province. Under the existing thresholds, the data could still be effectively quality controlled, indicating that the method can also yield good results in other provinces and seasons. The QC result at 10:00 on 11 August 2024 (Beijing time) is shown in Figure 16. The radar reflectivity shows that from 09:00 to 10:00, the reflectivity above the station was between 45 and 50 dBZ. Through communication with the local meteorological department, we learned that the precipitation observation at the station was overestimated. Figure 16c shows that the operational Tianqing QC system failed to identify the erroneous heavy precipitation data at that moment, whereas the EOF-based QC method established in this study effectively identified the overestimation of precipitation at the station (Figure 16b). The incorrectly observed precipitation at the station, marked in red in the figure, reached 85.9 mm, which is far beyond the range of heavy rain and does not match the radar echoes. Therefore, there is reason to believe in the correctness of the results from the new QC method.
8. Conclusions and Discussion
China has achieved comprehensive automation of its surface weather observations, and the resulting AWS data with high spatiotemporal resolution can better display the multiscale variational characteristics of surface meteorological parameters. However, the application of surface AWS data has long been constrained by the instability of the quality of the automatic observational data. To address this issue, this paper has developed an independent EOF-based quality control method that specifically targets the strong local characteristics of precipitation. By converting hourly observations to cumulative precipitation, the new method effectively mitigates the impact of the strong local characteristics of precipitation data on QC. On the basis of the continuous precipitation data thus obtained, a QC method based on EOF analysis was introduced: after the relatively large-scale continuous spatial structure of the precipitation data is extracted, QC is carried out on the residual field, thus effectively avoiding the impact of weather processes on the QC results.
This method can effectively control hourly precipitation observation data. If integrated into a data assimilation system, it could improve the utilization of precipitation observations, thereby improving forecast accuracy and yielding more accurate precipitation reanalysis data. Meanwhile, precipitation observations quality controlled with this method can effectively support the generation of multi-source merged precipitation products. It should be pointed out, however, that when mapping the QC results for cumulative precipitation back to the hourly precipitation data, the new method requires observational data from 5 days before and after each record. Real-time QC of hourly precipitation observations therefore remains very difficult for the new method, and this will be the focus of follow-up research.