Enhancing Extreme Precipitation Forecasts through Machine Learning Quality Control of Precipitable Water Data from Satellite FengYun-2E: A Comparative Study of Minimum Covariance Determinant and Isolation Forest Methods

Shen, Wenqi; Chen, Siqi; Xu, Jianjun; Zhang, Yu; Liang, Xudong; Zhang, Yong

doi:10.3390/rs16163104

by Wenqi Shen ¹,², Siqi Chen ², Jianjun Xu ²,³, Yu Zhang ¹,*, Xudong Liang ⁴ and Yong Zhang ⁵

¹ College of Ocean and Meteorology, Guangdong Ocean University, Zhanjiang 524088, China

² CMA-GDOU Joint Laboratory, Guangdong Ocean University, Zhanjiang 524088, China

³ Shenzhen Institute, Guangdong Ocean University, Shenzhen 518120, China

⁴ State Key Laboratory of Severe Weather, Chinese Academy of Meteorological Sciences, Beijing 100081, China

⁵ Meteorological Observation Center, China Meteorological Administration, Beijing 100081, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(16), 3104; https://doi.org/10.3390/rs16163104

Submission received: 15 July 2024 / Revised: 17 August 2024 / Accepted: 19 August 2024 / Published: 22 August 2024

(This article belongs to the Special Issue Remote Sensing Data Application, Data Reanalysis and Advances for Mesoscale Numerical Weather Models)

Abstract

Variational data assimilation theoretically assumes Gaussian-distributed observational errors, yet actual data often deviate from this assumption. Traditional quality control methods have limitations when dealing with nonlinear and non-Gaussian-distributed data. To address this issue, our study innovatively applies two advanced machine learning (ML)-based quality control (QC) methods, Minimum Covariance Determinant (MCD) and Isolation Forest, to process precipitable water (PW) data derived from satellite FengYun-2E (FY2E). We assimilated the ML QC-processed TPW data using the Gridpoint Statistical Interpolation (GSI) system and evaluated its impact on heavy precipitation forecasts with the Weather Research and Forecasting (WRF) v4.2 model. Both methods notably enhanced data quality, leading to more Gaussian-like distributions and marked improvements in the model’s simulation of precipitation intensity, spatial distribution, and large-scale circulation structures. During key precipitation phases, the Fraction Skill Score (FSS) for moderate to heavy rainfall generally increased to above 0.4. Quantitative analysis showed that both methods substantially reduced Root Mean Square Error (RMSE) and bias in precipitation forecasting, with the MCD method achieving RMSE reductions of up to 58% in early forecast hours. Notably, the MCD method improved forecasts of heavy and extremely heavy rainfall, whereas the Isolation Forest method demonstrated a superior performance in predicting moderate to heavy rainfall intensities. This research not only provides a basis for method selection in forecasting various precipitation intensities but also offers an innovative solution for enhancing the accuracy of extreme weather event predictions.

Keywords:

machine learning quality control; satellite data assimilation; numerical weather prediction; heavy precipitation forecasting; precipitable water

1. Introduction

Precipitable water (PW) is a crucial indicator of atmospheric water vapor content and is defined as the total amount of water vapor per unit area from the Earth’s surface to the top of the atmosphere [1]. PW is a vital parameter in weather and climate research, and it directly reflects the amount of water vapor available for precipitation [2,3]. Consequently, this determines the probability and intensity of extreme weather events and plays a key role in predicting severe weather conditions such as heavy rainfall [4,5]. Xu et al. [6] reported a significant correlation between the PW and surface precipitation, which indicated that high PW values were closely associated with heavy rainfall events. Multi-source satellite and radiosonde observational data combined with ERA5 reanalysis data revealed a strong spatiotemporal relationship between PW and heavy rainfall, thereby providing crucial evidence for the use of PW to predict precipitation events.

The assimilation of PW data into Numerical Weather Prediction (NWP) models is important to accurately forecast intense rainfall events [7,8]. Wang et al. [9] used the three-dimensional variational (3DVAR) assimilation method to assimilate high-spatial-and-temporal-resolution three-layer precipitable water (LPW) from the Himawari-8/-9 satellites into their model. The results showed that assimilating this high-resolution humidity information significantly enhanced the accuracy of intense rainfall forecasts, with improved timing and intensity predictions when compared with assimilating conventional data alone. Risanto et al. [10] assimilated Global Positioning System-derived precipitable water vapor (GPS-PWV) data into the Weather Research and Forecasting–Advanced Research WRF (WRF-ARW) model, which led to improvements in the most unstable convective available potential energy and surface dew point temperature, creating more favorable conditions for convective organization and overall enhancing the accuracy of heavy precipitation forecasts in terms of both timing and intensity.

However, observational data, including PW, are often affected by instrumental errors, retrieval uncertainties, and environmental conditions, resulting in noise and outliers within the data [11,12]. An effective quality control (QC) process is crucial for ensuring the reliability of the assimilated data. QC methods aim to identify and mitigate the impacts of inaccurate or anomalous observations, thereby improving the quality of data assimilation (DA) and subsequent weather forecasts [13,14]. Traditional QC methods such as gross error detection [15] and buddy checks [16] rely on predefined thresholds and statistical assumptions, typically assuming a Gaussian distribution of observation errors. However, in practical applications, satellite observation errors often exhibit non-Gaussian distributions, and traditional QC methods may fail to effectively identify complex outliers [17,18]. Nakabayashi and Ueno [19] pointed out that traditional filtering methods assuming a Gaussian error distribution become unstable when observational data contain outliers because extreme values significantly impact the filtering accuracy and often lead to erroneous predictions. Similarly, Fowler and Van Leeuwen [20] indicated that Gaussian assumptions may overlook the complex error structures present in actual data, thereby leading to distorted influences of observational data on assimilation results and ultimately reducing forecast accuracy.

Recently, with the advancement of machine learning (ML) techniques, researchers have begun to explore the application of ML methods in the QC domain. ML-based QC methods have shown promising results in improving data quality across various fields, including meteorology. These methods typically involve data cleaning, feature extraction, model training, and anomaly detection, followed by continuous optimization through expert feedback [21,22]. Zhou et al. [23] utilized a Bidirectional Long Short-Term Memory (Bi-LSTM) neural network model to effectively enhance radar data quality and simplify the QC process. Polz et al. [24] investigated the application of supervised and unsupervised ML methods to the QC of environmental sensor data. By addressing missing values in the input data and extracting spatiotemporal features, this study employed supervised learning models such as Deep Neural Networks (DNNs) for training, while also incorporating unsupervised learning methods combining dimensionality reduction and clustering algorithms to identify outliers. The results showed that supervised methods provided optimal QC performance in stable experimental systems, whereas unsupervised methods achieved reasonable accuracy without the need for manual labeling and were more adaptable to system changes.

ML techniques have also demonstrated significant potential for enhancing the quality of precipitable water vapor (PWV) data. In 2019, Just et al. [25] investigated the use of the extreme Gradient Boosting (XGBoost) algorithm to improve column water vapor (CWV) estimates in MAIAC atmospheric products derived from the MODIS instrument. By utilizing multiple feature variables, the XGBoost model significantly reduced errors relative to AERONET (AErosol RObotic NETwork) measurements, with Root Mean Square Error (RMSE) reductions of 26.9% and 16.5% for Terra and Aqua datasets, respectively. This approach demonstrated the effectiveness of ML in enhancing the PWV data quality. In terms of data synthesis, Zhang and Yao [26] developed a method based on a General Regression Neural Network (GRNN) to synthesize PWV data from GNSS, MODIS, and ERA5. High-precision PWV maps were generated by calibrating and optimizing low-quality MODIS and ERA5 PWV data, while using high-quality GNSS PWV data. The results showed that this method significantly improved data consistency and accuracy and offered new insights into multi-source data synthesis. Xia et al. [27] developed a new algorithm based on automated machine learning (AutoML) to retrieve land surface PW from passive microwave (PMW) remote sensing data. The method uses data from the Advanced Microwave Scanning Radiometer 2 (AMSR-2) as the primary predictor variable and utilizes over 50 million GPS data points from more than 12,000 global stations for training. The validation results showed high consistency with ground observations, with an RMSE of 3.1 mm, thereby demonstrating the method's significant potential for improving PWV data accuracy.

Among the various ML methods, the Minimum Covariance Determinant (MCD) and Isolation Forest methods stand out due to their effectiveness in specific applications. Both methods utilize unsupervised learning techniques, making them particularly suitable for scenarios where labeled data are scarce or unavailable. MCD precisely identifies and eliminates outliers by seeking the subset with the Minimum Covariance Determinant within the dataset. Its advantages include efficient computational performance and robustness in outlier detection, which make it suitable for high-dimensional datasets [28]. The Isolation Forest detects anomalies based on the degree of isolation between outliers and normal values in the feature space by constructing multiple random trees. Owing to its independence from data distribution assumptions and capability to handle high-dimensional and non-Gaussian data, this method is particularly effective for detecting anomalies within complex datasets [29]. The MCD and Isolation Forest methods, while not extensively applied to PWV quality control, have shown promise in related fields. Li et al. [30] compared MCD with One-Class Support Vector Machine methods for quality control of Aircraft Meteorological Data Relay in typhoon forecasting, finding that MCD performed better in improving prediction accuracy. This suggests MCD’s potential in handling complex meteorological data, including PWV. Zhang et al. [31] successfully applied the Isolation Forest method for anomaly detection in hyperspectral images, demonstrating its effectiveness in identifying unusual patterns in remote sensing data. This capability could be particularly valuable for detecting anomalies in PWV data derived from satellite observations.

These advanced ML methods offer unique advantages in handling PWV data. As unsupervised techniques, they can identify patterns and anomalies without the need for pre-labeled training data, which is particularly advantageous in meteorological applications where such labels are often unavailable or difficult to obtain. The MCD method’s robustness to outliers makes it well-suited for preserving genuine extreme PWV values while filtering out erroneous data. The Isolation Forest’s efficiency in detecting anomalies in high-dimensional spaces aligns well with the complex nature of atmospheric data, potentially offering improved accuracy in identifying unusual PWV patterns.

In this study, we explore the application of MCD and Isolation Forest methods to enhance the quality control of FY2E TPW data, aiming to leverage their unique strengths in improving the accuracy of PWV data for weather forecasting.

In forecasting extreme precipitation events, the FengYun series satellites, as China’s important meteorological satellites, have irreplaceable value for research in East Asia and the Western Pacific region [32,33]. Among them, the high-spatiotemporal-resolution total precipitable water (TPW) data provided by the FengYun-2E (FY2E) meteorological satellite significantly contribute to the monitoring and forecasting of extreme weather events [34]. However, the quality of these data directly affects the prediction accuracy of NWP models for extreme events such as heavy rainfall. Although ML has shown great potential in the field of QC, research on ML-based QC for the FY2E satellite TPW products remains relatively scarce. Existing ML-based QC methods still face challenges when dealing with complex weather conditions (such as heavy rainfall), and their effectiveness in practical assimilation applications requires further verification [35].

To address these research gaps, this study applied two advanced ML-based QC methods, MCD and Isolation Forest, to enhance the quality of TPW data from FY2E. To evaluate the effectiveness of these two methods, we selected a typical heavy rainfall event that occurred in Sichuan Province in 2013 as the research case and conducted experiments using the Gridpoint Statistical Interpolation (GSI) v3.7 assimilation system [36] and the WRF-ARW v4.2 model [37]. This rainfall event, characterized by its long duration and high cumulative precipitation, caused severe flooding in the region, thereby making it an ideal case for testing the performance of QC methods. This study provides new solutions for the assimilation QC practice of FY2E TPW data and demonstrates the enormous potential of ML-based methods for improving the forecast accuracy of heavy precipitation events.

The remainder of this paper is organized as follows: Section 2 provides a detailed introduction to the MCD and Isolation Forest methods and their application process in the QC of FY2E TPW data. Section 3 presents the research results for the 2013 Sichuan heavy rainfall case, while focusing on the performance analysis of the two QC methods and their impacts on the forecast results of the WRF model. Section 4 discusses the significance of the research findings, analyzes the performance of MCD and Isolation Forest methods, and explores their impact on precipitation forecasts. It also addresses study limitations and suggests future research directions. Section 5 concludes this study by summarizing the main contributions, highlighting the effectiveness of both QC methods, and outlining implications for operational forecasting and future research.

2. Materials and Methods

2.1. Case Review

In July 2013, the Sichuan Basin experienced extreme rainfall that triggered severe flooding and mudslides, resulting in substantial property damage. This case study focuses on the period from 06:00 UTC on 8 July 2013 to 06:00 UTC on 10 July 2013, during which the most intense precipitation occurred.

We utilized ERA5 reanalysis data to generate 500 hPa geopotential height fields and 850 hPa water vapor flux and wind field distributions for the simulation period, to conduct an in-depth analysis of the circulation field distribution during this precipitation event (Figure 1). The primary weather systems influencing this heavy rainfall event included an East Asian trough, a Western Pacific subtropical high, a tropical cyclone (TC) in the Indian Ocean, and an 850 hPa low-level jet.

During the key stage of precipitation (06:00 UTC on 7 July 2013 to 06:00 UTC on 8 July 2013), the western boundary of the Western Pacific subtropical high remained stable over China’s southeastern coastal regions before retreating eastward to the ocean at 06:00 UTC on 9 July. Concurrently, a TC in the Indian Ocean gradually moved northeastward, approaching East Asia, while intensifying. It reached its peak intensity at 06:00 UTC on 8 July 2013, before making landfall, and subsequently weakened.

Moisture transport plays a decisive role in heavy rainfall events. Research indicates that the primary moisture source affecting this precipitation event originated from the lower troposphere, south of the precipitation area. Moisture is transported from the Indian subcontinent, Bay of Bengal, and Indochina Peninsula to the Sichuan Basin, thereby providing abundant moisture conditions for this extreme precipitation event and triggering heavy rainfall [38].

From a moisture transport perspective, the Western Pacific subtropical high maintained its position over China’s southeastern coastal areas (e.g., Jiangsu and Fujian provinces), coupled with the eastward movement and intensification of the TC in the Indian Ocean, which led to continuous strengthening of the pressure gradient between these two systems. This triggered an anomalously strong low-level southwesterly jet stream. The large-scale topography of the Tibetan Plateau acted as a barrier to the west of the precipitation center in Sichuan, blocking moisture diffusion and inducing orographic lifting. The combined effect of these factors resulted in the transport and stagnation of substantial moisture from the South China Sea and Indian Ocean towards the precipitation center in Sichuan.

Simultaneously, in the 500 hPa upper-level region from Xinjiang to Sichuan, the stable East Asian trough structure continuously supplied positive vorticity advection to the area ahead of the trough (i.e., the precipitation center), thereby providing favorable dynamic conditions for precipitation.

Atmospheric circulation patterns and local topographic conditions significantly influenced the development of heavy rainfall events. The merger of the Tibetan Plateau Vortex (TPV) and Southwest China Vortex (SWV) in the Sichuan Basin formed a powerful precipitation system. This interaction enhanced the horizontal vorticity and secondary circulation to provide favorable conditions for the formation of deep convection and extreme rainfall [39]. The synergistic interactions of these multiscale systems ultimately led to this intense precipitation event.

Recent studies have emphasized the importance of moisture transport in precipitation processes, especially in regions with complex terrain. Yuan et al. [40] found that the Yarlung Zangbo Grand Canyon serves as a major channel for monsoon moisture transport to the Tibetan Plateau, significantly affecting precipitation in the eastern interior of the plateau. Li et al. [41] demonstrated that accurately simulating moisture transport in regions with complex terrain can improve precipitation simulation accuracy, which has significant implications for studying extreme precipitation events in areas like the Sichuan Basin.

2.2. NWP Model and Assimilation System

This study employed version 4.2 of the WRF-ARW model, developed by the National Center for Atmospheric Research (NCAR), to conduct numerical simulations of the extreme precipitation event that occurred in the Sichuan Basin in July 2013. The simulation period spanned from 06:00 UTC on 7 July 2013 to 06:00 UTC on 11 July 2013. However, the first 24 h was excluded from the analysis due to model spin-up, and the last 24 h, while generated to ensure complete event coverage, was not included because it fell outside the main period of interest for the heavy rainfall event under investigation. The model utilized a two-way nested grid configuration. The coarse grid (d01) had a horizontal resolution of 9 km, with sufficient coverage to encompass the weather systems influencing this event. The fine grid (d02), with a horizontal resolution of 3 km, better captured the local influence of the complex terrain of the Sichuan Basin on precipitation (Figure 2).

We adopted the following main physical parameterization schemes in the WRF-ARW model: the NSSL 2-moment scheme for microphysics [42], RRTMG for radiation processes [43], the Kain–Fritsch–Cumulus potential scheme for cumulus convection [44] (applied only to d01), and the UW (Bretherton and Park) scheme for planetary boundary layer (PBL) processes [45]. The initial and boundary conditions of the model were provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) fifth-generation reanalysis data (ERA5) [46]. This dataset has a temporal resolution of 1 h and a horizontal resolution of 0.25° × 0.25°. It was used on the full set of model levels, which consists of 137 levels from the surface to 0.01 hPa. The detailed model configurations are presented in Table 1.

This study utilized the GSI v3.7 data assimilation system developed by the National Oceanic and Atmospheric Administration (NOAA) to conduct cyclic assimilation experiments. Following a 6 h spin-up period, cyclic assimilation was performed every 6 h. The assimilation method employed was three-dimensional variational (3DVar), with the objective function formulated as follows:

J = \frac{1}{2} (x_{a} - x_{b})^{T} B^{-1} (x_{a} - x_{b}) + \frac{1}{2} (H(x_{a}) - O)^{T} R^{-1} (H(x_{a}) - O) + J_{c}

(1)

where $x_{a}$ is the analysis field, $x_{b}$ is the background field, $B$ is the background error covariance matrix, $H$ is the observation operator, $O$ represents the observations, $R$ is the observation error covariance matrix, and $J_{c}$ denotes the constraint terms (such as dynamic and moisture constraints). In this study, the background error covariance matrix utilized the global background error provided by the GSI, whereas the horizontal and vertical influence scales employed the default configuration of the GSI.
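To make the balance between the two quadratic terms in Equation (1) concrete, the following is a minimal scalar sketch rather than the GSI implementation. It assumes a single state variable, an identity observation operator, and no constraint term ($J_{c} = 0$); the function names and all numbers are hypothetical and chosen only for illustration.

```python
def cost_3dvar(x, x_b, b_var, obs, r_var):
    """Scalar form of Eq. (1), assuming H(x) = x and J_c = 0."""
    background_term = 0.5 * (x - x_b) ** 2 / b_var
    observation_term = 0.5 * (x - obs) ** 2 / r_var
    return background_term + observation_term


def analysis_3dvar(x_b, b_var, obs, r_var):
    """Closed-form minimizer of the scalar cost above."""
    gain = b_var / (b_var + r_var)   # weight given to the observation
    return x_b + gain * (obs - x_b)  # background plus weighted innovation


# Illustrative numbers only (not taken from the study).
x_b, b_var = 40.0, 4.0   # background PW in mm and its error variance
obs, r_var = 46.0, 2.0   # observed PW in mm and its error variance
x_a = analysis_3dvar(x_b, b_var, obs, r_var)
j_min = cost_3dvar(x_a, x_b, b_var, obs, r_var)
print(f"analysis = {x_a:.2f} mm, J(x_a) = {j_min:.3f}")
```

In this scalar setting the analysis moves from the background toward the observation in proportion to the ratio of the background error variance to the total error variance, which is why an observation with an unrealistically small assumed error (or an undetected outlier) can pull the analysis strongly.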

The observation operator for satellite-derived Integrated Precipitable Water (IPW) is formulated as follows:

P_{w} = \frac{p_{s}}{g} \sum_{\sigma} q_{\sigma} \Delta \sigma

(2)

where $q_{\sigma}$ is the specific humidity at the $\sigma$-th vertical level, $\Delta \sigma$ is the thickness of that level in the $\sigma$ coordinate, and $g$ is the gravitational acceleration. The surface pressure $p_{s}$ was obtained from the forecast first guess [47]. Through the observation operator, the model’s forecast first guess is mapped to the observation space, and the difference from the observed value yields the observation-minus-background (OMB) value, which is then returned to the model space via the background error covariance.
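The operator in Equation (2) is a weighted vertical sum, which the short sketch below illustrates together with the OMB calculation. It is not the GSI code: the function names, the four-layer profile, and the observed value are hypothetical, and the standard value of gravitational acceleration is assumed.

```python
G = 9.80665  # standard gravitational acceleration, m s^-2 (assumed value)


def precipitable_water(p_s, q_levels, d_sigma):
    """Eq. (2): P_w = (p_s / g) * sum(q_sigma * delta_sigma).

    p_s is in Pa, q in kg/kg, and delta_sigma is dimensionless, so the
    result is in kg m^-2, numerically equal to millimetres of liquid water.
    """
    if len(q_levels) != len(d_sigma):
        raise ValueError("q_levels and d_sigma must have the same length")
    return p_s / G * sum(q * ds for q, ds in zip(q_levels, d_sigma))


def omb(observed_pw, p_s, q_levels, d_sigma):
    """Observation minus background, evaluated in observation space."""
    return observed_pw - precipitable_water(p_s, q_levels, d_sigma)


# Hypothetical four-layer first-guess profile (illustrative values only).
p_s = 95000.0                             # surface pressure, Pa
q_levels = [0.014, 0.008, 0.003, 0.0005]  # specific humidity per layer, kg/kg
d_sigma = [0.25, 0.25, 0.25, 0.25]        # layer thickness in sigma
print(f"first-guess PW = {precipitable_water(p_s, q_levels, d_sigma):.1f} mm")
print(f"OMB = {omb(65.0, p_s, q_levels, d_sigma):.1f} mm")
```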

2.3. Data Description

The observational data used for assimilation in this study consisted of three components, each representing a different source and serving a distinct purpose. Previous studies have clearly indicated that the use of non-QC data in the assimilation process can lead to significant degradation in system performance [48]. Therefore, in the experimental design, we focused on evaluating the effectiveness of the two ML-based QC methods for quality control of FY2E TPW data, without establishing an experimental group that directly assimilated unprocessed TPW data. The experimental setup is listed in Table 2. The observational data are described as follows:

Conventional surface and upper-air observational data PREPBUFR provided by the U.S. National Centers for Environmental Prediction (NCEP) include multiple subsets of data, such as upper-air observation reports (ADPUPA), satellite-derived wind reports (SATWND), sea surface observation reports (SFCSHP), land surface observation reports (ADPSFC), vertical azimuth display wind observations (VADWND), and ASCAT scatterometer data (ASCATW). These data were subjected to NCEP preprocessing, including QC, format unification, and bias correction, to ensure data quality and consistency. These preprocessed data are widely applied in the data assimilation processes for global and regional NWP models to enhance their forecasting capabilities. In this study, these data served as baseline assimilation data to provide stable and reliable observational inputs for the model.
Quality-controlled IPW data (hereafter referred to as CMA IPW) from conventional observation stations in China were provided by the Atmospheric Sounding Center of the China Meteorological Administration (CMA). These data had undergone rigorous quality control procedures to ensure their accuracy and reliability. In this study, these data served as a reference dataset to help us understand normal meteorological patterns and to assist in fine-tuning our unsupervised ML models for quality control of FY2E TPW data. This approach allowed us to leverage both ML techniques and domain knowledge in the two QC processes, aiming to improve the accuracy of precipitation event predictions.
The TPW data observed by FY2E covering China and its surrounding areas were provided by the National Satellite Meteorological Center of China. Note that IPW and TPW both refer to the total amount of water vapor in a vertical column of the atmosphere. Although these terms are often used to describe the same physical quantity, the choice of terminology may vary depending on the specific research context, measurement technique, and instrumentation. In this study, we use IPW for ground-based measurements and TPW for satellite observations, which is consistent with the conventions in our data sources. The FY2E TPW offers more comprehensive water vapor distribution information than ground observations. In this study, the FY2E TPW data were used to train the unsupervised ML models. Figure 3 provides an overview of the spatial distributions of the FY2E TPW and CMA IPW data at 12:00 UTC on 8 July 2013, illustrating the typical patterns observed during the study period.

Table 2. Experimental design.

ID	Assimilation Configuration
CTRL	No DA
EXPR1	Assimilating conventional data only
EXPR2	Assimilating conventional data + PW with MCD-QC
EXPR3	Assimilating conventional data + PW with Isolation Forest-QC

Note: An initial No-QC experiment was conducted but excluded from the main analysis due to its substantially poorer performance across all metrics compared to the QC experiments, to maintain a focus on comparing effective ML-based QC techniques.

To verify the precipitation simulation performance of the model after assimilation, two types of precipitation verification data were used: gridded and station data. The gridded data were provided by the Global Precipitation Measurement (GPM) mission [49], a collaboration between NASA and JAXA, which included half-hourly rain and snow retrieval data products based on microwave and infrared technologies. These data have a temporal resolution of 0.5 h and a spatial resolution of 0.1° × 0.1°, providing high-spatiotemporal-resolution precipitation information suitable for detailed verification of model precipitation. In addition to the IPW data, we also utilized precipitation data from CMA ground stations. The station data came from dense automatic weather observation stations in the CMA, including 3 h, 6 h, 12 h, and 24 h cumulative precipitation amounts. These station data have high reliability and accuracy and were used in this study as ground truth checks for the precipitation simulation of the model.

2.4. QC Process

2.4.1. Introduction to ML-Based QC Methods

MCD

In outlier detection, a common approach is to assume that regular data originate from a known distribution (such as a Gaussian distribution). Based on this assumption, traditional methods (such as Mahalanobis distance [50]) identify outliers by defining the “shape” of the data and recognizing observations that significantly deviate from this shape.

Consider a dataset consisting of $n$ observations ($X_{n} = \{x_{1}, \dots, x_{n}\}$), each with $p$ dimensions. The conventional Mahalanobis distance (MD) for an observation $x$ relative to the center is defined as follows:

\mathrm{MD}(x) = \sqrt{(x - \bar{x})^{T} S^{-1} (x - \bar{x})}

(3)

where $\bar{x}$ is the sample mean vector, and $S$ is the sample covariance matrix. The traditional MD has limitations in identifying outliers, particularly when a dataset contains many anomalies. The outliers themselves can shift $\bar{x}$ and $S$, thereby causing these anomalous points to appear less “abnormal” relative to the distorted center and dispersion. In other words, when the number and influence of outliers are sufficient to dominate the statistical characteristics of the entire dataset, the MD may fail to effectively identify these anomalies, thereby reducing its reliability as an outlier detection method. To address this issue, we must introduce more robust estimation methods to ensure that the detection results truly reflect the inherent structure of the data without being misled by anomalies.
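This masking effect can be reproduced with a few numbers. The sketch below uses the one-dimensional form of Equation (3), in which the distance reduces to the absolute deviation from the mean divided by the standard deviation. The data values are invented for illustration and are not taken from the FY2E dataset.

```python
import statistics


def classical_distance(x, sample):
    """One-dimensional form of Eq. (3): |x - mean| / standard deviation."""
    return abs(x - statistics.mean(sample)) / statistics.stdev(sample)


# Illustrative values only: ten plausible readings and three gross errors.
clean = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0]
contaminated = clean + [100.0, 100.0, 100.0]

# Mean and standard deviation estimated with the gross errors included.
md_masked = classical_distance(100.0, contaminated)
# Mean and standard deviation estimated from the clean readings only.
md_reference = classical_distance(100.0, clean)

print(f"distance using contaminated statistics: {md_masked:.2f}")
print(f"distance using clean statistics:        {md_reference:.2f}")
```

With the contaminated statistics, the gross errors fall well inside a typical rejection threshold of about three standard deviations, even though they are obvious when the center and spread are estimated from the clean readings alone.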

The MCD method is a robust estimator that minimizes the influence of outliers on the mean and covariance estimates of a dataset and is particularly effective in high-dimensional spaces.

The MCD method defines a robust distance that can be represented in $p$-dimensional space as given below:

\mathrm{RD}(x_{i}) = \sqrt{(x_{i} - \hat{\mu}_{MCD})^{T} \hat{\Sigma}_{MCD}^{-1} (x_{i} - \hat{\mu}_{MCD})}

(4)

where $\hat{\mu}_{MCD}$ is the location estimate by MCD, and $\hat{\Sigma}_{MCD}$ is the covariance matrix estimate by MCD.

The computation steps for the MCD estimator are as follows: First, a subset of $h$ observations, with $\lfloor (n + p + 1) / 2 \rfloor \leq h \leq n$, was selected from the dataset so as to minimize the determinant of the covariance matrix of this subset. In this step, the initial location estimate $\hat{\mu}_{0}$ was calculated as the mean of the $h$ observations that minimized the determinant of their covariance matrix. Simultaneously, the initial scatter estimate $\hat{\Sigma}_{0}$ was defined as the covariance matrix of these points multiplied by an adjustment coefficient $c$. Next, for each point ($x_{i}$) in the dataset, its distance to $\hat{\mu}_{0}$ was calculated as follows:

d_{i} = \sqrt{(x_{i} - \hat{\mu}_{0})^{T} \hat{\Sigma}_{0}^{-1} (x_{i} - \hat{\mu}_{0})}

(5)

The location and scatter estimates were then updated using the weight function $W(d_{i}^{2})$. The estimated location $\hat{\mu}_{MCD}$ was updated as follows:

\hat{\mu}_{MCD} = \frac{\sum_{i = 1}^{n} W(d_{i}^{2}) x_{i}}{\sum_{i = 1}^{n} W(d_{i}^{2})}

(6)

The scatter estimate was updated to the following:

\hat{\Sigma}_{MCD} = c \frac{1}{n} \sum_{i = 1}^{n} W(d_{i}^{2}) (x_{i} - \hat{\mu}_{MCD}) (x_{i} - \hat{\mu}_{MCD})^{T}

(7)

This process was iteratively optimized until convergence, thereby providing robust estimates of the central location and scatter of the data, greatly reducing the influence of outliers, and ensuring the robustness of the estimation.
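The steps above can be illustrated with a univariate sketch, which is a simplified special case and not the implementation used in this study. In one dimension the $h$-subset with the smallest variance is always a run of consecutive sorted values, so it can be found exactly. Several choices here are assumptions of the sketch: the adjustment coefficient $c$ is taken as the reciprocal of the variance of a standard normal truncated to its central $h/n$ mass, the weight function is a hard rejection ($W = 1$ if $d_{i}^{2}$ is below the 97.5% chi-square quantile, otherwise 0), only one reweighting pass is made, and the reweighted variance is averaged over the retained points rather than scaled as in Equation (7). All function names and data values are hypothetical.

```python
import math
import statistics

CUTOFF = 5.024  # 97.5% quantile of chi-square with 1 degree of freedom


def consistency_factor(alpha):
    """Assumed c: 1 / variance of a standard normal truncated to its central alpha mass."""
    nd = statistics.NormalDist()
    a = nd.inv_cdf((1.0 + alpha) / 2.0)
    return 1.0 / (1.0 - 2.0 * a * nd.pdf(a) / alpha)


def raw_mcd_1d(values, h):
    """Initial estimates: mean and scaled variance of the tightest h-subset."""
    xs = sorted(values)
    n = len(xs)
    if not 2 <= h < n:
        raise ValueError("this sketch requires 2 <= h < n")
    windows = [xs[i:i + h] for i in range(n - h + 1)]
    best = min(windows, key=statistics.pvariance)
    return statistics.mean(best), consistency_factor(h / n) * statistics.pvariance(best)


def reweighted_mcd_1d(values, h):
    """One reweighting pass in the spirit of Eqs. (5) to (7), with 0/1 weights."""
    mu0, var0 = raw_mcd_1d(values, h)
    kept = [x for x in values if (x - mu0) ** 2 / var0 <= CUTOFF]
    mu = statistics.mean(kept)
    return mu, statistics.pvariance(kept, mu)


def robust_distance(x, mu, var):
    """One-dimensional form of Eq. (4)."""
    return abs(x - mu) / math.sqrt(var)


# Illustrative values only: ten plausible readings and three gross errors.
contaminated = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0,
                100.0, 100.0, 100.0]
h = (len(contaminated) + 1 + 1) // 2  # lower bound floor((n + p + 1) / 2) with p = 1
mu, var = reweighted_mcd_1d(contaminated, h)
flagged = [x for x in contaminated if robust_distance(x, mu, var) ** 2 > CUTOFF]
print("robust location:", mu, "robust variance:", var)
print("flagged as outliers:", flagged)
```

Because the location and scatter are estimated from the tightest subset rather than from all points, the gross errors no longer inflate the spread, and they receive large robust distances while the legitimate low and high readings are retained.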

Isolation Forest

Isolation Forest is an unsupervised anomaly detection algorithm based on the principle that anomalous points are easier to isolate because of their sparse numerical values and distances from high-density regions. This method assumes that anomalous data have two significant characteristics: they constitute a small proportion of the data, and their feature values differ significantly from those of normal data. In a random tree structure, recursively partitioning the dataset until all sample points are isolated typically results in shorter paths for the anomalous points [29].

The Isolation Forest algorithm isolates anomalous points by constructing Isolation Trees. An Isolation Tree is a binary tree structure, in which each node can be either an external node (with no child nodes) or an internal node (with two child nodes and a split test). At internal nodes, data are partitioned by selecting attribute $q$ and split point $p$, with points satisfying $q < p$ belonging to $T_{l}$ and the rest to $T_{r}$. The path length $h(x)$ of sample point $x$ in an Isolation Tree is the number of edges from the root node to the leaf node containing that point.

The construction of an Isolation Forest consists of two phases. The first phase builds Isolation Trees to form the Isolation Forest, and the second phase calculates the anomaly score for each sample point. The steps of the first phase are as follows:

(a) ψ sample points were randomly selected from the given dataset $X = \{x_{1}, \dots, x_{n}\}$ to form a subset $X'$ , which was placed in the root node.
(b) A dimension $q$ was randomly designated from the $d$ dimensions, and a split point $p$ was randomly generated within the range of the current data, satisfying $\min (x_{i j}, j = q, x_{i j} \in X') < p < \max (x_{i j}, j = q, x_{i j} \in X')$ .
(c) The split point $p$ generated a hyperplane that divided the current data space into two subspaces: sample points whose value in dimension $q$ was less than $p$ were placed in the left child node $T_{l}$ , whereas those greater than or equal to $p$ were placed in the right child node $T_{r}$ .
(d) Steps b and c were recursively executed until all leaf nodes contained only one sample point or the Isolation Tree reached the specified height.
(e) Steps a to d were repeated until $t$ Isolation Trees were generated.
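The first phase can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: the dictionary-based node layout, the function names `build_tree`, `build_forest`, and `path_length`, the height limit of ceil(log2 ψ), and the synthetic data are assumptions made here. The path length is counted exactly as defined in the text (edges from the root to the external node), without any further adjustment for nodes that still hold several points.

```python
import numpy as np


def build_tree(X, height, max_height, rng):
    """Steps b to d: split recursively until isolation or the height limit."""
    n = X.shape[0]
    if n <= 1 or height >= max_height:
        return {"size": n}                      # external node
    q = int(rng.integers(X.shape[1]))           # step b: random dimension
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                # no split possible on q
        return {"size": n}
    p = rng.uniform(lo, hi)                     # step b: random split point
    left = X[:, q] < p                          # step c: partition
    return {
        "q": q,
        "p": p,
        "left": build_tree(X[left], height + 1, max_height, rng),
        "right": build_tree(X[~left], height + 1, max_height, rng),
    }


def build_forest(X, t, psi, rng):
    """Steps a and e: t trees, each grown on psi randomly drawn points."""
    max_height = int(np.ceil(np.log2(psi)))     # assumed height limit
    forest = []
    for _ in range(t):
        idx = rng.choice(len(X), size=psi, replace=False)
        forest.append(build_tree(X[idx], 0, max_height, rng))
    return forest


def path_length(x, node, depth=0):
    """Number of edges from the root to the external node reached by x."""
    if "size" in node:
        return depth
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)


rng = np.random.default_rng(42)
X = rng.normal(size=(256, 2))
forest = build_forest(X, t=100, psi=128, rng=rng)

outlier = np.array([8.0, 8.0])   # far from the synthetic cluster
inlier = np.array([0.0, 0.0])    # at the centre of the cluster
mean_outlier = np.mean([path_length(outlier, tree) for tree in forest])
mean_inlier = np.mean([path_length(inlier, tree) for tree in forest])
```

On this synthetic cluster the distant point is separated after fewer splits on average than the central point, which is the behaviour the anomaly score in the second phase relies on.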

In the second phase, for each data point

x_{i}

, the path length

h (x)

in a particular Isolation Tree is defined as the number of edges traversed from the root node to the leaf node containing the sample point. The average height

E (h (x_{i}))

of sample point

x_{i}

across the entire Isolation Forest was calculated, which was the average of its path lengths in all Isolation Trees. The anomaly score for the sample point

x_{i}

was calculated as follows:

s (x_{i}, ψ) = 2^{- \frac{E (h (x_{i}))}{c (ψ)}}

(8)

where

c (ψ) = \begin{cases} 2 H (ψ - 1) - \frac{2 (ψ - 1)}{ψ}, & ψ > 2 \\ 1, & ψ = 2 \\ 0, & \text{otherwise} \end{cases}

(9)

Here,

H (k) = \ln (k) + ζ

, where

ζ

is the Euler–Mascheroni constant and

ζ = 0.5772156649

. The anomaly score

s (x_{i}, ψ)

lies in the interval

[0,1]

. The closer it is to 1, the higher the possibility that sample point

x_{i}

is anomalous; the closer it is to 0, the higher the possibility that sample point

x_{i}

is normal. If the anomaly scores of most sample points were close to 0.5, it might indicate that the model had difficulty distinguishing between normal and anomalous points in the dataset.
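The score calculation can be illustrated with a short Python sketch. It follows the standard Isolation Forest definition, in which the exponent is negative so that short average path lengths give scores near 1. The function names `c` and `anomaly_score` are chosen here for illustration, and the sketch assumes ψ ≥ 2 so that the normalising term is non-zero.

```python
import math

EULER_MASCHERONI = 0.5772156649


def c(psi):
    """Normalising term of Eq. (9) for a subsample of size psi."""
    if psi > 2:
        harmonic = math.log(psi - 1) + EULER_MASCHERONI   # H(psi - 1)
        return 2.0 * harmonic - 2.0 * (psi - 1) / psi
    if psi == 2:
        return 1.0
    return 0.0


def anomaly_score(mean_path_length, psi):
    """Anomaly score of Eq. (8); requires psi >= 2 so that c(psi) > 0."""
    return 2.0 ** (-mean_path_length / c(psi))


psi = 256
score_typical = anomaly_score(c(psi), psi)    # average path equals c(psi)
score_isolated = anomaly_score(0.0, psi)      # isolated at the root
```

A point whose average path length equals c(ψ) scores exactly 0.5, while a point isolated immediately scores 1, which matches the interpretation given above.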

2.4.2. Data Preprocessing and QC Experiments’ Design

In this study, we employed the MCD and Isolation Forest methods from the Scikit-learn [51] toolkit to process FY2E TPW satellite observation data, with the aim of enhancing their accuracy and practicality in precipitation simulation. Given that the ground station IPW observational data provided by the CMA had already undergone rigorous QC, we used these data as a reference dataset to understand normal meteorological patterns. The FY2E TPW data were then processed using our unsupervised ML models. The preprocessing workflow included the following steps.

First, we converted the CMA IPW observational data into a BUFR format readable by the GSI assimilation system, denoted CMA-BUFR. Similarly, the TPW observational data from satellite FY2E were converted into the GSI-readable BUFR format, referred to as FY2E-BUFR.

During the simulation period from 06:00 on 8 July 2013 to 06:00 on 10 July 2013 (UTC), we assimilated the CMA-BUFR data every 6 h to obtain the corresponding feature vectors, including innovation (Observation Minus Background, OMB), latitude, longitude, usage flag (iuse), and observed value (obs). In data assimilation terminology, the term “innovation” refers to the difference between an observation and the corresponding model background (also known as the first guess) value. This difference is a crucial metric in assessing the quality of both the observations and the model forecast. In this study, the model refers to the WRF model, which provides the background field for the assimilation process. A large innovation may indicate either an erroneous observation or a significant model error, while small innovations suggest good agreement between observations and the model state. Over the same period, we assimilated the FY2E-BUFR data every 6 h to generate the corresponding feature vectors.
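As a small illustration of how such feature vectors might be assembled, the sketch below computes the innovation as observation minus background and stacks it with the other listed quantities. The function name `build_features`, the column order, and the sample values are hypothetical; in practice these quantities would be read from the assimilation system's diagnostic output rather than typed in.

```python
import numpy as np


def build_features(obs, background, lat, lon, iuse):
    """Stack innovation (obs - background), lat, lon, iuse and obs by column."""
    obs = np.asarray(obs, dtype=float)
    background = np.asarray(background, dtype=float)
    innovation = obs - background
    return np.column_stack([innovation, lat, lon, iuse, obs])


# Hypothetical precipitable water values in mm at three locations.
features = build_features(
    obs=[52.0, 48.5, 60.0],
    background=[50.0, 51.0, 45.0],
    lat=[30.6, 31.5, 29.9],
    lon=[104.1, 104.7, 103.0],
    iuse=[1, 1, 1],
)
```

Here the third row has an innovation of 15 mm, the kind of large departure that could reflect either an erroneous observation or a significant model error.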

Next, we utilized the MCD and Isolation Forest methods to train models using the CMA-BUFR data and then applied these trained models to perform QC on the FY2E-BUFR data. Finally, we integrated the quality-controlled FY2E TPW and CMA IPW data into a conventional observation dataset for subsequent experiments and analysis. Through this systematic processing, we were able to comprehensively compare and analyze the effectiveness of the two ML-based QC methods and their impact on rainfall simulation.

It is worth noting that the two methods do not involve a specific threshold-based removal of outliers. This approach is chosen to avoid potentially eliminating valuable information about extreme values, which are crucial in precipitation forecasting. Instead, the ML models are trained to learn complex spatial and physical relationships in the data, using features such as latitude, longitude, and observed precipitable water values. This multi-dimensional approach enables the models to effectively differentiate between genuine extreme events and outliers that are likely errors.

In our experiments, we applied the unsupervised ML models (MCD and Isolation Forest) to the entire FY2E TPW dataset without splitting it into training and test sets, which is consistent with the nature of unsupervised anomaly detection. Both methods were implemented with a contamination factor of 0.15 and a random state of 42.
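A minimal Scikit-learn sketch of this configuration is given below. It is an assumption that `EllipticEnvelope` is the MCD-based detector in question (it is the Scikit-learn outlier detector built on the MCD estimator and accepts both settings), and the arrays are synthetic stand-ins for the real feature vectors. The models are fitted on one array and applied to another purely for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-ins for the feature vectors (assumption): the columns could
# be innovation, latitude, longitude and observed value, standardised here.
reference = rng.normal(size=(500, 4))
target = rng.normal(size=(100, 4))
target[:5] += 20.0            # five gross errors planted in the target set

models = {
    "MCD": EllipticEnvelope(contamination=0.15, random_state=42),
    "IsolationForest": IsolationForest(contamination=0.15, random_state=42),
}

flags = {}
for name, model in models.items():
    model.fit(reference)
    flags[name] = model.predict(target)   # +1 = pass, -1 = reject
```

With `contamination=0.15`, each model sets its decision threshold so that about 15% of the fitting data are labelled as outliers, which is why the choice of this value directly controls how many observations are rejected.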

For model evaluation, we also used method-specific metrics, owing to the unsupervised nature of our task. The MCD method was evaluated using the scatter estimate, where a lower value indicates better performance in isolating the main cluster of data points from potential outliers. For the Isolation Forest method, we used the anomaly score, ranging from 0 to 1, with values closer to 1 indicating a higher likelihood of the data point being an anomaly.

3. Results

3.1. QC Results

The Gaussian distribution, also known as the normal distribution, is of significant importance in fields such as statistics and atmospheric sciences owing to its ability to approximate many natural phenomena and experimental data. By analyzing the Gaussian distribution plots, we can intuitively understand the central tendency, symmetry, and dispersion of the data.

Figure 4 presents the Gaussian distribution of FY2E TPW data innovation at three key time points (06:00 UTC on 8 July, 06:00 UTC on 9 July, and 06:00 UTC on 10 July 2013), representing the beginning, middle, and end of our study period, respectively. The figure shows the original data distribution alongside the results after applying the MCD and Isolation Forest QC methods.

Across all three time points, both QC methods demonstrate substantial improvements in data quality compared to the original distributions. The histograms for both methods show increased concentration around the center and reduced spread, indicating a more Gaussian-like distribution. This improvement is particularly evident in the reduction in extreme values and the smoothing of irregular peaks present in the original data.

However, the two methods exhibit distinct characteristics in their approach to data quality improvement. The MCD method (middle column) consistently produces smoother, more symmetrical distributions that closely approximate Gaussian curves. This is evident in the more uniform shape of the histograms and the closer fit of the red dashed Gaussian curve to the data. The MCD method’s effectiveness in optimizing the data distribution towards a Gaussian form is particularly noticeable in panels (e) and (h). This tendency may be especially beneficial for variational assimilation systems based on Gaussian assumptions.

In contrast, the Isolation Forest method (right column) tends to preserve more of the original distribution’s features while still improving overall normality. This is visible in the slightly sharper peaks and marginally longer tails of the distributions, particularly in panels (c) and (f). This characteristic suggests that the Isolation Forest method may be more sensitive in identifying and processing anomalies, potentially preserving some physically significant extreme values. This unique advantage of preserving potentially important extreme information may be more applicable in certain extreme weather scenarios.

To quantitatively assess the effectiveness of the two QC methods, we calculated the skewness and kurtosis of the FY2E TPW data innovation after applying the MCD (EXPR2) and Isolation Forest (EXPR3) methods.

Table 3 presents these statistics for nine time points during the study period. The MCD method generally produced more consistent skewness values, with an average absolute skewness of 0.28 compared to 0.15 for the Isolation Forest method. This suggests that the MCD method was more effective in reducing the asymmetry of the data distribution. However, the Isolation Forest method maintained skewness values closer to zero in most cases, indicating a slightly better symmetry in the resulting distributions.

Regarding kurtosis, both methods showed varied results across different time points. The MCD method had an average kurtosis of −0.43, while the Isolation Forest method had an average of −0.20. The negative kurtosis values for both methods suggest that the resulting distributions are generally more platykurtic (flatter) than a normal distribution, with the MCD method producing slightly flatter distributions on average.
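For reference, the two statistics can be computed as in the following Python sketch. It uses the population (biased) moment estimators and reports excess kurtosis, so a normal distribution gives 0. Whether the values in Table 3 were obtained with these or with bias-corrected estimators is not stated, so this choice is an assumption.

```python
import numpy as np


def skewness(x):
    """Third standardised moment; 0 for a symmetric sample."""
    dev = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(dev ** 3) / np.mean(dev ** 2) ** 1.5


def excess_kurtosis(x):
    """Fourth standardised moment minus 3; negative means platykurtic."""
    dev = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(dev ** 4) / np.mean(dev ** 2) ** 2 - 3.0


sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # evenly spread, hence flat-topped
sample_skew = skewness(sample)
sample_kurt = excess_kurtosis(sample)
```

The evenly spaced sample is symmetric (skewness 0) and flatter than a normal distribution (excess kurtosis of −1.3), illustrating the sign convention used for the platykurtic results above.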

Note that at certain time points (e.g., 00:00 UTC on 9 July), the two methods produced quite different results, with the Isolation Forest method showing a higher positive kurtosis. This aligns with our previous observation that the Isolation Forest method may preserve more extreme values, resulting in heavier tails in some cases.

Overall, these metrics provide statistical evidence that both QC methods improved the normality of the data distribution, with each method showing distinct characteristics. The MCD method appears to be more consistent in reducing skewness, while the Isolation Forest method may be more effective in preserving the original distribution’s characteristics while still improving overall normality.

Figure 5 compares the boxplots of FY2E TPW data innovation before and after QC. In these boxplots, the central line represents the median, the lower and upper edges of the box indicate the first (Q1) and third (Q3) quartiles, respectively, and the whiskers extend to 1.5 times the interquartile range (IQR) beyond the box edges. Any points beyond the whiskers are considered potential outliers. Uncontrolled data (Figure 5a) show evident distribution anomalies, with large upper and lower boundary ranges up to ±20 mm. Notably, at 12, 24, 36, 42, and 48 h lead times, numerous anomalies (filtered points) below the lower boundary appeared, which indicated a clear negative bias in innovation distribution at these times, thereby implying a model overestimation of PW values. Ideally, in a large sample with a standard normal distribution, the median should be centered between the upper and lower quartiles, with the boxplot symmetrical about the median line. However, in Figure 5a, the median at almost every lead time is biased towards the upper quartile, thus revealing strong right-skewed distribution characteristics, which are consistent with the observations in Figure 4.
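The whisker rule described above can be written as a short Python sketch. The function name `whisker_bounds` and the sample innovations are hypothetical, and the quartiles use NumPy's default linear interpolation, which may differ slightly from the convention of the plotting software used for Figure 5.

```python
import numpy as np


def whisker_bounds(values):
    """Return (lower, upper) limits at 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical innovations in mm, with one clearly anomalous value.
innovations = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
lower, upper = whisker_bounds(innovations)
outliers = innovations[(innovations < lower) | (innovations > upper)]
```

For this sample the limits are −3.5 mm and 14.5 mm, so only the 30 mm value is marked as a potential outlier.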

After performing QC using the MCD method (Figure 5b), the data distribution notably improved. The upper and lower limits of boxplots narrowed to ±10 mm. Except for 30 h and 42 h, the medians at each lead time were generally centered between the upper and lower quartiles, with boxplots showing near-symmetric shapes and exhibiting characteristics close to the standard normal distribution. This strongly demonstrates the excellence of the MCD method in eliminating noise and anomalies and effectively enhancing data concentration and symmetry.

The results after the Isolation Forest QC (Figure 5c) showed similar improvements. Although the upper and lower limits of the boxplots were slightly larger than those of the MCD QC group, they were still markedly smaller than those of the uncontrolled group, thus successfully filtering out the anomalies. Positive innovation values after QC indicate that the observed PW exceeds the model background value, and vice versa. The medians of the MCD QC group average about 2.5 mm, with a maximum of 4 mm, while negative value distributions are sparse, reaching a minimum of −7.5 mm. In comparison, the medians of the Isolation Forest QC group were closer to zero, with similarly sparse negative value distributions.

To further quantify the improvement in data quality, we calculated the standard deviation of the innovation values at each lead time for the original data and after applying each QC method, as shown in Table 4. The average standard deviation decreased from 4.38 mm in the original data to 2.86 mm after MCD QC and 2.55 mm after Isolation Forest QC. This reduction in standard deviation indicates a significant decrease in data spread and confirms the effectiveness of both QC methods in reducing data variability.

Notably, the Isolation Forest method showed a slightly lower average standard deviation compared to the MCD method, suggesting it may be more aggressive in reducing extreme values. This aligns with our previous observations of the Isolation Forest method’s characteristics in preserving certain data features while improving overall normality.

The standard deviation values also reveal that both QC methods are particularly effective at certain lead times. For instance, at the 42 h lead time, the standard deviation decreased from 4.40 mm in the original data to 3.35 mm with MCD and 1.90 mm with Isolation Forest, reductions of about 24% and 57%, respectively.

These quantitative improvements in data distribution and variability, as evidenced by both the boxplots in Figure 5 and the standard deviation values in Table 4, underscore the effectiveness of both QC methods in enhancing data quality. Each method shows distinct characteristics that may be advantageous in different forecasting scenarios, with the MCD method generally producing more consistent distributions and the Isolation Forest method often achieving greater reductions in data variability.

Figure 6 and Figure 7 illustrate the spatial distributions of the pass (Figure 6A and Figure 7A) and reject points (Figure 6B and Figure 7B) before and after applying the two QC methods, respectively. When analyzing the range of innovations, it is evident that the absolute values of innovations at rejection points generally exceed 15 mm. This indicates significant discrepancies between observations and model forecasts at these locations. In contrast, the innovations of pass points after QC are within the ±10 mm range, thereby suggesting that the data quality at these observation points is relatively high and meets the expected standards.

From the perspective of spatial distribution, most observation points in the western plateau region of Sichuan Province failed to pass the QC. This phenomenon may be attributed to the lack of observational data from the plateau region in the training model and inadequate utilization of observational data from these regions by the GSI system. This finding highlights the need to improve data QC methods in areas with complex terrain.

When comparing the performances of the two QC methods, we found that the Isolation Forest method eliminated more observation points located in the southeastern part of Sichuan Province than the MCD method. This may reflect the fact that the Isolation Forest method potentially adopts more stringent criteria when dealing with outliers, imposing higher quality requirements on the observational data in that region. Overall, the Isolation Forest method exhibited a higher sensitivity in identifying and filtering anomalous observation points, whereas the MCD method demonstrated advantages in maintaining overall data consistency and reliability.

3.2. Analysis of Simulated Circulation Fields

Before examining the precipitation forecasts, we first analyze the simulated circulation fields, as the large-scale circulation patterns play a crucial role in driving precipitation. The following analysis compares the circulation fields simulated by the four experimental groups, focusing on the 500 hPa geopotential height, vertically integrated water vapor flux, and 850 hPa wind vectors.

Circulation fields are key factors that determine moisture transport pathways, velocities, and total amounts, and they directly influence the spatiotemporal distribution and intensity of precipitation. Therefore, an in-depth analysis of circulation field configurations and characteristics is crucial for understanding and predicting precipitation activities. To further investigate the impact of assimilating data processed by different QC methods on rainfall forecasts, this study generated circulation field distributions for the four experimental groups (Figure 8A–D).

The CTRL simulation results (Figure 8A) show that the high-value area of the water vapor flux coincides precisely with the core position of the low-level southwest jet stream, with substantial numerical values capable of transporting sufficient moisture to the heavy rainfall center. As the simulation progresses, the subtropical high retreats eastward, causing the jet stream to weaken and shift southeastward. This change in the large-scale circulation pattern is likely to influence the distribution and intensity of precipitation. Compared to the background circulation field based on ERA5 data, the high-value area of water vapor flux simulated by the CTRL experiment was biased towards the southeast, which can be attributed to the southward bias in the simulated position of the subtropical high.

In EXPR1 (Figure 8B), compared to the CTRL, both the intensity and range of the moisture transport belt were weakened. In the 20°N–40°N and 105°E–115°E regions, the water vapor flux characterization between the 750 and 925 hPa layers showed a decrease in intensity, transitioning from high values of 4.0–4.7

g {cm}^{- 1} {hPa}^{- 1} s^{- 1}

to moderate values of 3.3–4.0

g {cm}^{- 1} {hPa}^{- 1} s^{- 1}

, and even to lower values of 2.7–3.4

g {cm}^{- 1} {hPa}^{- 1} s^{- 1},

thereby indicating reduced moisture transport intensity in this area. This change suggests that after assimilating the conventional data, the model simulation of moisture transport in this region underwent significant adjustments. The 850 hPa wind field showed changes in the low-level convergence areas, especially in the pattern at the edges of the original strong moisture transport belt. The southwestern airflow pattern across the entire region also exhibited subtle adjustments, resulting in a more complex structure. These changes indicate that the model’s description of atmospheric circulation was substantially adjusted after assimilating conventional data, and the weakening of moisture transport may have led to a reduced forecast precipitation intensity in certain areas.

The EXPR2 experiment (Figure 8C) demonstrates significant changes brought about by assimilating the TPW data after QC processing using the MCD method. The intensity of the moisture transport belt was further enhanced, showing a denser and more widespread area of high water vapor flux values (4.7–5.4

g {cm}^{- 1} {hPa}^{- 1} s^{- 1}

) in the 20°N–40°N, 105°E–115°E region, which indicated high levels of water vapor flux in this area. Simultaneously, the range of the moisture transport belt expanded notably, extending further northeast. The 500 hPa geopotential height field showed further deepening of the upper-level trough in the northwest, thereby potentially leading to stronger cold air advection southward and increasing atmospheric instability. The 850 hPa wind field revealed significant low-level convergence at the leading edge and central areas of the strong moisture transport belt. Moreover, the southwest airflow within the region had become more vigorous and persistent. These features suggest that after assimilating the QC data processed using the MCD method, the WRF model is likely to forecast more extensive and intense precipitation processes.

The circulation field from the EXPR3 experiment (Figure 8D) illustrates that assimilating the TPW data processed with the Isolation Forest QC method also brings about notable improvements. Compared to EXPR2, the intensity and range of the moisture transport belt were adjusted, showing a more concentrated high-intensity water vapor flux area in the 20°N–40°N and 105°E–115°E regions. This concentration may imply a more precise spatial distribution of precipitation. The 500 hPa geopotential height field shows that the depth of the upper-level trough in the northwest was intermediate between those in CTRL and EXPR2, which may have brought about a more balanced atmospheric instability. The 850 hPa wind field shows that the low-level convergence area at the leading edge of the strong moisture transport belt was more distinct. The southwest airflow pattern was clearer and displayed a more organized moisture transport pathway. These features indicate that an assimilation scheme using the Isolation Forest method for QC may produce more balanced and refined precipitation forecasts.

3.3. Analysis of Precipitation Forecasts

3.3.1. Simulated Precipitation Distribution

Precipitation forecasting serves as a comprehensive indicator of the effectiveness of data assimilation. Figure 9 presents the observed precipitation provided by the CMA from 06:00 UTC on 8 July 2013 to 06:00 UTC on 10 July 2013. The observed precipitation distribution (Figure 9) shows that throughout the heavy rainfall event, the rain band was oriented northeast–southwest, primarily positioned above Mianyang–Chengdu–Ya’an, with persistent precipitation peaks particularly over the Chengdu area. At 06:00 UTC on 8 July 2013 (Figure 9a), precipitation was mainly concentrated in the northeastern Sichuan Province, with the maximum rainfall reaching approximately 110 mm. As time progressed (from 12:00 UTC on 8 July to 06:00 UTC on 9 July), the precipitation area gradually expanded, with the center shifting towards central Sichuan Province, maintaining maximum rainfall amounts above 110 mm, which is classified as heavy rainfall. Subsequently (from 12:00 UTC on 9 July to 06:00 UTC on 10 July), the precipitation intensity gradually weakened to moderate-to-heavy rain levels of 40–68 mm, thereby highlighting the significant persistence of this precipitation event.

Figure 10 shows the 6-hourly accumulated precipitation distributions simulated by the four experimental groups from 06:00 UTC on 8 July 2013 to 06:00 UTC on 10 July 2013. The CTRL experiment (Figure 10A) successfully reproduced the observed precipitation characteristics during the early stages of the heavy rainfall event. The simulation results showed high spatial consistency with the CMA observations and accurately captured the northeast–southwest orientation of the heavy rainband and its gradual shift from northeastern to central Sichuan. This indicates that the CTRL experiment was able to simulate the formation mechanism and evolution of the heavy rainfall event accurately during its initial stages.

However, in the later stages of the heavy rainfall event, particularly after 06:00 UTC on 9 July, the CTRL experiment substantially underestimated precipitation intensity. The simulated rainband was primarily concentrated in the central–eastern regions of Sichuan Province and the border areas of the Tibetan Plateau, which was slightly north of the observed rainband. Moreover, the simulated rainband appeared shorter and more dispersed, failing to adequately reproduce the observed characteristics of a continuous and extensive heavy rainband.

The EXPR1 group, which assimilated conventional observations (Figure 10B), showed a notable weakening of the simulated precipitation intensity, especially during the peak of the precipitation event (00:00 UTC, 9 July) and its preceding period. The EXPR1 group only exhibited higher precipitation intensities in the stage leading up to the peak, which rapidly weakened afterward, thereby indicating a poor simulation performance during the sustained precipitation process. Compared to the CTRL experiment, the assimilation of conventional observational data in EXPR1 only improved the simulation of precipitation areas in the early stages of heavy rainfall, thus making some adjustments to the spatial distribution and local characteristics of precipitation during this phase. However, EXPR1 performed poorly in simulating the later stages after the precipitation peak, which were primarily characterized by an excessively rapid decrease in precipitation intensity.

The EXPR2 group, which assimilated data processed with MCD QC (Figure 10C), showed notable improvements over EXPR1 in simulating precipitation intensity. During the key phase from 06:00 UTC on 8 July to 06:00 UTC on 9 July, the maximum precipitation intensity in EXPR2 reached heavy rainfall levels, with the rainband position and distribution closely matching the observations. This demonstrated that assimilating data processed using the MCD QC method can notably enhance the ability of the model to capture the spatial structure and intensity distribution of precipitation systems. This further validated the importance of introducing advanced QC and error-correction strategies prior to data assimilation.

The EXPR3 experiment, which assimilated data processed using the Isolation Forest method (Figure 10D), showed an overall similarity to EXPR2 in the precipitation simulation, thereby demonstrating remarkable skill in reproducing the precipitation intensity and spatial distribution. However, after 06:00 UTC on 9 July, the precipitation intensity simulated by EXPR3 weakened slightly, thus showing some degree of underestimation when compared with the observations. Additionally, the simulation of the spatial distribution details for the rainband in central–eastern Sichuan Province was inadequate, with a reduced continuity and extent of the rainband, thereby failing to capture the widespread and continuous heavy rainfall characteristics observed. This may be attributed to the imprecise capture of local data features by the Isolation Forest method when performing QC on the FY2E TPW data, which leads to biases in the model’s simulation of moisture conditions and convective activities in these areas after assimilation.

3.3.2. Quantitative Precipitation Verification

To comprehensively quantify the precipitation simulation performance of the four experimental groups, this study employed the Fraction Skill Score (FSS) to evaluate the 6 h cumulative precipitation (Figure 11). Evaluation thresholds were set according to the 6 h cumulative precipitation standards provided by the National Meteorological Center of China (0.1, 4, 13, 25, 60, and 100 mm), corresponding to light rain, moderate rain, heavy rain, rainstorms, and severe rainstorms, respectively.
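For readers unfamiliar with the FSS, the sketch below shows one common way to compute it in Python: both fields are converted to binary exceedance maps at a threshold, neighbourhood fractions are computed in a square window, and the score is one minus the ratio of the fraction mean squared error to its worst-case reference. The window size of 3 grid points, the zero padding at the domain edges, the function names, and the synthetic fields are assumptions made here; the neighbourhood configuration used for Figure 11 may differ.

```python
import numpy as np


def neighborhood_fraction(binary, n):
    """Fraction of exceedance points in an n x n window (n odd, zero padded)."""
    pad = n // 2
    padded = np.pad(binary.astype(float), pad, mode="constant")
    # Integral image: every window sum becomes four array lookups.
    s = np.pad(padded, ((1, 0), (1, 0)), mode="constant")
    s = s.cumsum(axis=0).cumsum(axis=1)
    rows, cols = binary.shape
    win = (s[n:n + rows, n:n + cols] - s[:rows, n:n + cols]
           - s[n:n + rows, :cols] + s[:rows, :cols])
    return win / (n * n)


def fss(forecast, observed, threshold, n=3):
    """Fraction Skill Score for one threshold and one window size."""
    pf = neighborhood_fraction(forecast >= threshold, n)
    po = neighborhood_fraction(observed >= threshold, n)
    mse = np.mean((pf - po) ** 2)
    reference = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / reference if reference > 0 else float("nan")


# Synthetic 6 h accumulations in mm on a 20 x 20 grid.
observed = np.zeros((20, 20))
observed[2:5, 2:5] = 30.0
displaced = np.zeros((20, 20))
displaced[15:18, 15:18] = 30.0

fss_perfect = fss(observed, observed, threshold=25.0)
fss_displaced = fss(displaced, observed, threshold=25.0)
```

A forecast identical to the observations scores 1, whereas a rain area displaced well beyond the window size scores 0, so values above 0.4 indicate substantial spatial overlap at the chosen scale.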

The four experimental groups exhibited distinct characteristics at various lead times during precipitation. The CTRL group generally performed well in simulating light to moderate rain but underperformed in simulating rainstorms and above, with FSSs consistently below 0.4. The EXPR1 group exhibited optimal performance in simulating all precipitation levels during the initial stage; however, its scores displayed a marked declining trend as the precipitation events progressed. Although EXPR1 outperformed the CTRL group in moderate rain and above, it still fell short of the EXPR2 and EXPR3 groups. After QC processing using the MCD and Isolation Forest methods, the EXPR2 and EXPR3 groups showed significant improvements in the precipitation forecast scores, particularly for moderate to heavy rain and severe rainstorm levels. During the critical precipitation stage, the FSSs exceeded 0.4, thereby demonstrating a relatively superior performance. This result indicates that appropriate QC of the FY2E TPW data before assimilation can effectively enhance the capability of the WRF model to simulate intense precipitation.

To evaluate the simulation performance of the entire precipitation event process comprehensively, we calculated the average FSSs across nine assimilation times to produce a mean FSS chart (Figure 12). The results showed that the EXPR2 and EXPR3 groups, which underwent QC processing, considerably outperformed the CTRL and EXPR1 groups in simulating heavy rain and higher precipitation levels. Further comparison of these two QC groups revealed that EXPR3, which employed the Isolation Forest method, surpassed EXPR2, which used the MCD method, at moderate to heavy rain levels. However, EXPR2 outperformed EXPR3 during both rainstorms and severe rainstorms.

These results may reflect the complex interactions between data quality control, assimilation system characteristics, and extreme weather forecasting. Although the MCD method tends to optimize the data distribution towards a Gaussian distribution, in practical applications, it may indirectly enhance the model’s grasp of large-scale circulation features by smoothing extreme values, thus leading it to excel in extreme precipitation forecasts. In contrast, although the Isolation Forest method demonstrates advantages in identifying and retaining key extreme data points, the retained extreme values may introduce local instabilities during the assimilation process, particularly when the forecast model is highly sensitive to these extreme conditions. This could lead to a slightly lower forecast accuracy in certain extreme precipitation scenarios when compared with the MCD method.

To further quantify the performance of different experiments, we analyzed additional metrics including the Root Mean Square Error (RMSE), correlation coefficient (CC), and bias for each experiment at different assimilation times (Table 5, Table 6 and Table 7). These metrics provided complementary insights to our FSS analysis. Additionally, we include results from experiments without quality control (No-QC) in these tables to highlight the importance of the quality control process in data assimilation. The No-QC results consistently showed higher RMSE, lower correlation coefficients, and larger biases compared to all other QC experiments, underscoring the critical role of proper quality control in improving forecast accuracy.
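These three metrics can be computed as in the following Python sketch. The bias is taken here as the mean of forecast minus observation, which is an assumption about the sign convention, and the function names and sample values are hypothetical.

```python
import numpy as np


def rmse(forecast, observed):
    """Root Mean Square Error between forecast and observed values."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def bias(forecast, observed):
    """Mean forecast minus observation; negative means underestimation."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(diff))


def correlation(forecast, observed):
    """Pearson correlation coefficient between the two series."""
    return float(np.corrcoef(forecast, observed)[0, 1])


# Hypothetical 6 h precipitation totals in mm at four stations.
forecast = [1.0, 2.0, 3.0, 4.0]
observed = [2.0, 2.0, 4.0, 4.0]

sample_rmse = rmse(forecast, observed)
sample_bias = bias(forecast, observed)
sample_cc = correlation(forecast, observed)
```

In this example the forecast is too low at two of the four stations, giving a negative bias of −0.5 mm alongside an RMSE of about 0.71 mm.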

The analysis of these metrics across all experimental groups revealed several key findings:

Throughout the forecast period, all experimental groups demonstrated improvements over CTRL, which consistently showed the highest RMSE and bias values. However, the degree and nature of these improvements varied among EXPR1, EXPR2, and EXPR3.

EXPR1 showed a marked reduction in RMSE compared to CTRL, with values ranging from 9.39 to 24.62 mm. However, its performance was surpassed by both EXPR2 and EXPR3 at most time points. Interestingly, EXPR1 exhibited the highest initial CC (0.54), suggesting its potential strength in capturing immediate post-assimilation patterns. Yet, this advantage diminished over time, with CC values dropping as low as 0.06 in later stages.

EXPR2, employing the MCD method, demonstrated notable RMSE reductions, particularly excelling in the early (8.43 mm at 06:00 UTC on 8 July) and late stages of the forecast. Its CC values, while variable, showed improvements over CTRL at several critical points. The MCD method’s performance in bias reduction was particularly impressive, maintaining generally low values throughout the forecast period.

EXPR3, utilizing the Isolation Forest method, exhibited a more stable performance across all metrics. Its RMSE values, ranging from 7.57 to 15.90 mm, exhibited relatively less fluctuation compared to the other groups. While EXPR3’s CC values were not always the highest, they maintained a consistent level throughout the forecast, suggesting a steady reliability in pattern prediction. In terms of bias, EXPR3 achieved a balance between reduction and stability, with values consistently lower than CTRL and EXPR1.

The contrast between EXPR2 and EXPR3 is particularly intriguing. EXPR2 showed more pronounced improvements at specific time points, especially in RMSE reduction, aligning with its strong performance in extreme event prediction noted in our FSS analysis. Conversely, EXPR3’s strength lay in its consistency across all metrics and time points, resonating with its robust performance across various rainfall intensities observed earlier.

These findings not only corroborate our FSS analysis but also offer deeper insights into the characteristics of each QC method. The MCD method (EXPR2) appears to enhance the model’s capability in capturing critical features at specific stages, particularly beneficial for extreme event forecasting. In contrast, the Isolation Forest method (EXPR3) provides a more balanced improvement across the entire forecast range, potentially offering a more reliable overall performance.
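The relative RMSE reductions discussed above can be recomputed directly from Table 5. The short sketch below assumes the reduction is measured against CTRL as (CTRL − EXPR2)/CTRL; the variable and function names are illustrative.

```python
# RMSE values (mm) copied from Table 5 for the first three assimilation times.
times = ["2013070806", "2013070812", "2013070818"]
ctrl_rmse = [20.22, 27.02, 35.53]
mcd_rmse = [8.43, 10.99, 15.51]  # EXPR2 (MCD)

def percent_reduction(reference, experiment):
    """Relative reduction of `experiment` with respect to `reference`, in percent."""
    return 100.0 * (reference - experiment) / reference

reductions = {
    t: percent_reduction(c, m) for t, c, m in zip(times, ctrl_rmse, mcd_rmse)
}
for t, r in reductions.items():
    print(f"{t}: EXPR2 RMSE is {r:.1f}% lower than CTRL")
```

Under this assumption, the 06:00 UTC 8 July entry evaluates to roughly 58%.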

4. Discussion

This study explored the effectiveness of two advanced ML-based QC methods applied to FY2E TPW data to enhance heavy precipitation forecasts using the GSI assimilation system and the WRF model. Our findings demonstrate that ML-based QC methods notably enhance the quality of the FY2E TPW data, consequently improving the forecasting capabilities of the WRF model for intense precipitation events. Both the MCD and Isolation Forest QC methods successfully identified and eliminated outliers, which resulted in data distributions that more closely approximated Gaussian distributions, thus providing more reliable observational information for subsequent data assimilation. This outcome aligns with the findings of Liang et al. [52], who observed that ML-based methods effectively identified and processed anomalies in complex satellite observation data when handling MODIS aerosol optical depth data.
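How closely a set of innovations approximates a Gaussian distribution can be summarized by its skewness and kurtosis, as reported in Table 3. The sketch below assumes population (biased) moment estimators and excess kurtosis, so that a Gaussian sample gives values near zero for both; whether Table 3 uses exactly these estimators is an assumption.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("sample has zero variance")
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a Gaussian distribution
    return skew, excess_kurt

# Hypothetical symmetric innovations (mm), for illustration only.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A symmetric sample yields zero skewness, and a flat, short-tailed sample such as this one yields negative excess kurtosis.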

As observed in the QC results, both the MCD and Isolation Forest methods markedly improved the data quality, but with distinct characteristics. The MCD method showed excellence in optimizing the data distribution towards a Gaussian distribution, which is particularly beneficial for variational assimilation systems based on Gaussian assumptions. In contrast, the Isolation Forest method demonstrated a unique advantage in preserving potentially important extreme information, which may be more applicable in certain weather scenarios. These characteristics have important implications for their application in different forecasting contexts.
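To illustrate the idea behind MCD-based screening, the following standard-library sketch applies an exact univariate MCD to a list of innovations: it finds the tightest subset containing a fraction of the data, rescales its spread with a Gaussian consistency factor, and flags points whose robust distance exceeds a chi-square-based cutoff. This is a simplified illustration, not the configuration used in the study; the function name, `support_fraction`, `quantile`, and the synthetic data are all assumptions.

```python
import math
import random
from statistics import NormalDist

def univariate_mcd_qc(values, support_fraction=0.75, quantile=0.975):
    """Flag outliers with a robust location/scale from the tightest h-subset."""
    n = len(values)
    h = max(2, math.ceil(support_fraction * n))
    if h >= n:
        raise ValueError("need more values than the support size")
    ordered = sorted(values)
    best_mean, best_var = None, None
    # In one dimension the minimum-variance h-subset is contiguous in sorted order.
    for start in range(n - h + 1):
        window = ordered[start:start + h]
        mean = sum(window) / h
        var = sum((v - mean) ** 2 for v in window) / h
        if best_var is None or var < best_var:
            best_mean, best_var = mean, var
    if not best_var:
        raise ValueError("the selected subset has zero spread")
    # Consistency factor: variance of a standard normal truncated to its
    # central h/n mass, so the scale matches sigma for Gaussian data.
    std_normal = NormalDist()
    alpha = h / n
    q = std_normal.inv_cdf((1.0 + alpha) / 2.0)
    truncated_var = 1.0 - 2.0 * q * std_normal.pdf(q) / alpha
    scale = math.sqrt(best_var / truncated_var)
    # Cutoff: square root of the chi-square (1 degree of freedom) quantile.
    cutoff = std_normal.inv_cdf((1.0 + quantile) / 2.0)
    flags = [abs(v - best_mean) / scale > cutoff for v in values]
    return flags, best_mean, scale

# Synthetic innovations (mm): Gaussian bulk plus three gross errors.
rng = random.Random(0)
innovations = [rng.gauss(0.0, 2.0) for _ in range(200)] + [15.0, -18.0, 20.0]
flags, center, scale = univariate_mcd_qc(innovations)
retained = [v for v, bad in zip(innovations, flags) if not bad]
print(f"center={center:.2f} mm, scale={scale:.2f} mm, rejected={sum(flags)}")
```

Because the location and scale are estimated from the tightest subset only, the three gross errors do not inflate the spread estimate, and the retained sample is pulled towards a Gaussian shape. The same cutoff also trims part of the genuine tails, which is consistent with the trade-off described above.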

The assimilation of the FY2E TPW data processed by these QC methods markedly enhanced the forecasting capabilities of the model for heavy precipitation events. The forecast skill improved substantially across almost all precipitation intensity levels. This result is consistent with the findings of Risanto et al. [10] who found that assimilating high-quality water vapor observation data substantially improved convection-scale precipitation forecasts. Moreover, the assimilation experiments revealed improvements in the model’s simulation of large-scale circulation fields, particularly in the intensity and extent of moisture transport belts, the position and depth of upper-level troughs, and the structure of low-level wind fields. These enhancements directly influence the WRF model’s representation of moisture transport and atmospheric instability, thereby affecting precipitation forecasts. In line with Do et al. [53], these results emphasize the critical importance of accurate water vapor field analysis for improving mesoscale convective system forecasts.

However, the MCD and Isolation Forest methods differed notably in performance across precipitation intensity levels. For moderate-to-heavy rainfall simulations, the Isolation Forest method demonstrated a clear advantage, possibly because of its superior ability to preserve local features and extreme values in the data. This enabled the WRF model to capture the spatial distribution characteristics of moderate-intensity precipitation more accurately. In contrast, the MCD method performed better in simulating heavy and extremely heavy rainfall. This may be attributed to the strength of the MCD method in optimizing the data distribution towards a Gaussian distribution, thereby facilitating better handling of large-scale circulation features by variational assimilation systems. These differences reveal the distinct strengths of the two methods at different precipitation intensity levels and provide a basis for selecting appropriate QC methods for various forecasting scenarios.
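The isolation principle behind the second method can be sketched in a few lines. The following standard-library example follows the general scheme of Liu et al. [29] for one-dimensional data: random splits build shallow trees on subsamples, and points that are separated after few splits receive scores closer to 1. It is a simplified illustration under assumed settings (`n_trees`, `sample_size`, `seed`, and the synthetic data are hypothetical), not the configuration used in the study.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(sample, depth, max_depth, rng):
    lo, hi = min(sample), max(sample)
    if depth >= max_depth or len(sample) <= 1 or lo == hi:
        return {"size": len(sample)}
    split = rng.uniform(lo, hi)  # random split value within the node's range
    left = [v for v in sample if v < split]
    right = [v for v in sample if v >= split]
    if not left or not right:
        return {"size": len(sample)}
    return {
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(value, node, depth=0):
    if "size" in node:
        # Unresolved leaves are credited with the expected remaining depth.
        return depth + average_path_length(node["size"])
    child = node["left"] if value < node["split"] else node["right"]
    return path_length(value, child, depth + 1)

def isolation_scores(values, n_trees=100, sample_size=64, seed=0):
    """Return anomaly scores in (0, 1); higher means easier to isolate."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(values))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(values, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = average_path_length(sample_size)
    return [
        2.0 ** (-sum(path_length(v, t) for t in trees) / (n_trees * norm))
        for v in values
    ]

# Synthetic innovations (mm): Gaussian bulk plus one gross error at the end.
data_rng = random.Random(1)
innovations = [data_rng.gauss(0.0, 2.0) for _ in range(200)] + [20.0]
scores = isolation_scores(innovations)
print(f"score of the gross error: {scores[-1]:.2f}")
```

Because the score depends on how easily a point is separated rather than on its distance from a fitted Gaussian, rejection is governed by a score threshold or contamination rate chosen by the user. A permissive threshold leaves moderately extreme but plausible values in place, which is one way to read the method's tendency to preserve extreme information.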

Although this study yielded valuable insights, certain limitations warrant further investigation in future research. Our analysis focused on a single intense precipitation event, which, while representative, does not comprehensively reflect the applicability of ML-based QC methods across diverse weather systems and geographical regions. This limitation also affected our ability to fully assess the robustness of the observed improvements across various meteorological contexts.

It is crucial to acknowledge the potential influence of error propagation within numerical weather prediction models on precipitation forecast accuracy. Errors can accumulate and propagate through various model components and time steps, potentially affecting the final forecast. While the two ML-based QC methods primarily focus on improving initial conditions and demonstrated significant enhancements in short-term forecasts, their impact on longer lead times may be influenced by the model’s internal dynamics and error growth. The propagation of these improvements through model integration is complex and nonlinear, adding another layer of complexity to the evaluation of QC method effectiveness.

To address these limitations, future studies should expand the range of cases to include diverse precipitation systems and geographical areas. This expansion would allow for a more thorough assessment of the effectiveness and generalizability of ML-based QC methods. Additionally, exploring error propagation in more detail, perhaps through ensemble approaches or sensitivity studies across multiple cases and longer forecast periods, would provide valuable insights into the long-term impacts of initial condition improvements and potential limitations of QC methods in different meteorological scenarios.

Furthermore, as computational capabilities advance and assimilation theories evolve, future research could explore more sophisticated assimilation techniques such as four-dimensional variational (4D-Var) assimilation or ensemble Kalman filtering. These advanced methods could not only enhance assimilation effects but also make better use of the uncertainty information obtained during the QC process to optimize assimilation strategies. Such advancements would contribute to a more comprehensive understanding of how initial conditions improved through QC translate into enhanced forecast accuracy across temporal and spatial scales.

5. Conclusions

This study investigated the effectiveness of two advanced machine learning-based quality control methods, MCD and Isolation Forest, in processing FY2E TPW data to improve heavy precipitation forecasts. By applying these methods to a case study of an extreme rainfall event in the Sichuan Basin in July 2013, we demonstrated their significant impact on data quality and subsequent precipitation forecasts.

Both the MCD and Isolation Forest methods substantially enhanced the quality of FY2E TPW data, bringing the data distribution closer to Gaussian. This improvement in data quality translated into marked enhancements in the Weather Research and Forecasting (WRF) model’s forecasting capabilities for heavy precipitation events. During key precipitation phases, the Fraction Skill Score (FSS) for moderate to heavy rainfall generally increased to above 0.4, indicating a significant improvement in forecast accuracy.

The two methods showed distinct characteristics in their performance. The MCD method excelled at optimizing the data distribution towards a Gaussian distribution, which proved particularly beneficial for variational assimilation systems based on Gaussian assumptions. This characteristic led to a superior performance in predicting extreme precipitation events. On the other hand, the Isolation Forest method demonstrated a unique advantage in preserving potentially important extreme information, resulting in a better performance in forecasting moderate to heavy rainfall intensities.

Further quantitative analysis using the Root Mean Square Error (RMSE), correlation coefficient (CC), and bias reinforced these findings. Both MCD and Isolation Forest methods markedly reduced RMSE and bias compared to the CTRL experiment. The MCD method showed notable improvements in specific forecast stages, with RMSE reductions of up to 58% in early forecast hours. The Isolation Forest method demonstrated a more consistent performance, maintaining stable RMSE values between 7.57 and 15.90 mm across various forecast lead times. Meanwhile, the MCD method excelled in capturing critical features for extreme event forecasting, while the Isolation Forest method provided more balanced improvements across different rainfall intensities. These results highlight the complementary strengths of the two QC methods, suggesting potential benefits in combining their approaches for comprehensive quality control in diverse forecasting scenarios.

These findings have important implications for operational forecasting. The MCD method appears more suitable for operational forecasting of extreme precipitation events due to its ability to better capture large-scale circulation features. However, for areas where moderate rainfall is of primary concern and preserving local extreme values is crucial, the Isolation Forest method could offer a valuable alternative. The choice between these methods should be based on the specific forecasting needs and the characteristics of the region of interest.

Our research not only provides a basis for method selection in forecasting various precipitation intensities but also offers an innovative solution for enhancing the accuracy of extreme weather event predictions. Future research should focus on expanding the range of case studies to include diverse precipitation systems and geographical areas, as well as exploring more advanced data assimilation techniques to further optimize the use of quality-controlled satellite data in numerical weather prediction models.

Author Contributions

Conceptualization, J.X. and Y.Z. (Yu Zhang); methodology, W.S., S.C., J.X. and X.L.; software, W.S. and S.C.; validation, W.S., S.C. and Y.Z. (Yu Zhang); formal analysis, W.S.; investigation, W.S. and S.C.; resources, J.X. and Y.Z. (Yu Zhang); data curation, S.C., X.L. and Y.Z. (Yong Zhang); writing—original draft preparation, W.S.; writing—review and editing, W.S., S.C. and J.X.; visualization, W.S. and S.C.; supervision, J.X., Y.Z. (Yu Zhang), X.L. and Y.Z. (Yong Zhang); project administration, J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the National Natural Science Foundation of China, Grant No. 42130605; the Major Program of the National Natural Science Foundation of China, Grant No. 72293604; and the National Natural Science Foundation of China, Grant No. 42375159.

Data Availability Statement

The data that support the findings of this study are available from multiple sources. ERA5 reanalysis data are available from the Copernicus Climate Change Service (C3S) Climate Data Store (https://cds.climate.copernicus.eu/, accessed on 25 March 2022). The Global Precipitation Measurement (GPM) data are accessible through NASA’s Goddard Earth Sciences Data and Information Services Center (GES DISC) (https://disc.gsfc.nasa.gov/, accessed on 2 April 2022). FY-2E Total Precipitable Water (TPW) data can be obtained from the National Satellite Meteorological Center of China (http://satellite.nsmc.org.cn/PortalSite/Default.aspx, accessed on 3 March 2022). The Integrated Precipitable Water (IPW) data were provided by the China Meteorological Administration and are available upon reasonable request from the corresponding author, subject to approval from the data provider.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, J.; Zhang, L.; Dai, A.; Van Hove, T.; Van Baelen, J. A near-global, 2-hourly data set of atmospheric precipitable water from ground-based GPS measurements. J. Geophys. Res. 2007, 112, D11107.
2. Kursinski, E.R.; Hajj, G.A.; Schofield, J.T.; Linfield, R.P.; Hardy, K.R. Observing Earth’s atmosphere with radio occultation measurements using the Global Positioning System. J. Geophys. Res. 1997, 102, 23429–23465.
3. Trenberth, K.E.; Dai, A.; Rasmussen, R.M.; Parsons, D.B. The changing character of precipitation. Bull. Am. Meteorol. Soc. 2003, 84, 1205–1218.
4. Zhu, Y.; Newell, R.E. A proposed algorithm for moisture fluxes from atmospheric rivers. Mon. Weather Rev. 1998, 126, 725–735.
5. Ralph, F.M.; Neiman, P.J.; Wick, G.A. Satellite and CALJET aircraft observations of atmospheric rivers over the eastern North Pacific Ocean during the winter of 1997/98. Mon. Weather Rev. 2004, 132, 1721–1745.
6. Xu, Y.; Chen, X.; Liu, M.; Wang, J.; Zhang, F.; Cui, J.; Zhou, H. Spatial–temporal relationship study between NWP PWV and precipitation: A case study of “July 20” heavy rainstorm in Zhengzhou. Remote Sens. 2022, 14, 3636.
7. Kwon, E.H.; Sohn, B.J.; Chang, D.E.; Ahn, M.H.; Yang, S. Use of numerical forecasts for improving TMI rain retrievals over the mountainous area in Korea. J. Appl. Meteorol. Climatol. 2008, 47, 1995–2007.
8. Rakesh, V.; Singh, R.; Pal, P.K.; Joshi, P.C. Impacts of satellite-observed winds and total precipitable water on WRF short-range forecasts over the Indian region during the 2006 summer monsoon. Weather Forecast. 2009, 24, 1706–1731.
9. Wang, P.; Li, J.; Lu, B.; Schmit, T.J.; Lu, J.; Lee, Y.-K.; Li, J.; Liu, Z. Impact of moisture information from advanced Himawari imager measurements on heavy precipitation forecasts in a regional NWP model. J. Geophys. Res. Atmos. 2018, 123, 6022–6038.
10. Risanto, C.B.; Castro, C.L.; Arellano, A.F., Jr.; Moker, J.M., Jr.; Adams, D.K. The impact of assimilating GPS precipitable water vapor in convective-permitting WRF-ARW on North American monsoon precipitation forecasts over Northwest Mexico. Mon. Weather Rev. 2021, 149, 3013–3035.
11. Bennitt, G.V.; Jupp, A. Operational assimilation of GPS zenith total delay observations into the Met Office numerical weather prediction models. Mon. Weather Rev. 2012, 140, 2706–2719.
12. Zhang, S.Q.; Zupanski, M.; Hou, A.Y.; Lin, X.; Cheung, S.H. Assimilation of precipitation-affected radiances in a cloud-resolving WRF ensemble data assimilation system. Mon. Weather Rev. 2013, 141, 754–772.
13. Cucurull, L.; Derber, J.C. Operational implementation of COSMIC observations into NCEP’s global data assimilation system. Weather Forecast. 2008, 23, 702–711.
14. Poli, P.; Healy, S.B.; Dee, D.P. Assimilation of Global Positioning System radio occultation data in the ECMWF ERA-Interim reanalysis. Q. J. R. Meteorol. Soc. 2010, 136, 1972–1990.
15. Gandin, L.S. Complex quality control of meteorological observations. Mon. Weather Rev. 1988, 116, 1137–1156.
16. Lorenc, A.C.; Hammon, O. Objective quality control of observations using Bayesian methods. Theory, and a practical implementation. Q. J. R. Meteorol. Soc. 1988, 114, 515–543.
17. Lussana, C.; Uboldi, F.; Salvati, M.R. A spatial consistency test for surface observations from mesoscale meteorological networks. Q. J. R. Meteorol. Soc. 2010, 136, 1075–1088.
18. Hastuti, M.I.; Min, K.-H. Impact of assimilating GK-2A all-sky radiance with a new observation error for summer precipitation forecasting. Remote Sens. 2023, 15, 3113.
19. Nakabayashi, A.; Ueno, G. Nonlinear filtering method using a switching error model for outlier contaminated observations. IEEE Trans. Autom. Control 2019, 65, 3150–3156.
20. Fowler, A.; Van Leeuwen, P.J. Observation impact in data assimilation: The effect of non-Gaussian observation error. Tellus A 2013, 65, 20035.
21. Ye, X.; Zhou, J.; Xiong, X. A GEP-based method for quality control of surface temperature observations. J. Trop. Meteorol. 2014, 06, 1196–1200. (In Chinese)
22. Han, W.; Jochum, M. A Machine Learning Approach for Data Quality Control of Earth Observation Data Management System. In Proceedings of the IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 3101–3103.
23. Zhou, C.; Wei, C.; Yang, F.; Wei, J. A quality control method for high frequency radar data based on machine learning neural networks. Appl. Sci. 2023, 13, 11826.
24. Polz, J.; Schmidt, L.; Glawion, L.; Graf, M.; Werner, C.; Chwala, C.; Mollenhauer, H.; Rebmann, C.; Kunstmann, H.; Bumberger, J. Supervised and unsupervised machine-learning for automated quality control of environmental sensor data. In Proceedings of the EGU General Assembly 2021, Online, 19–30 April 2021. EGU21-14485.
25. Just, A.; Schlüter, S.; Graf, M.; Schmidt, L.; Polz, J.; Glawion, L.; Werner, C.; Chwala, C.; Mollenhauer, H.; Kunstmann, H.; et al. Gradient boosting machine learning to improve satellite-derived column water vapor measurement error. Atmos. Meas. Tech. 2019, 13, 4669–4681.
26. Zhang, B.; Yao, Y. Precipitable water vapor fusion based on a generalized regression neural network. J. Geod. 2021, 95, 47.
27. Xia, X.; Fu, D.; Shao, W.; Jiang, R.; Wu, S.; Zhang, P.; Yang, D.; Xia, X. Retrieving precipitable water vapor over land from satellite passive microwave radiometer measurements using automated machine learning. Geophys. Res. Lett. 2023, 50, e2023GL105197.
28. Rousseeuw, P.J.; Driessen, K.V. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223.
29. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
30. Li, J.; Zhang, Y.; Chen, S.; Shao, D.; Hu, J.; Feng, J.; Tan, Q.; Wu, D.; Kang, J. Comparing Quality Control Procedures Based on Minimum Covariance Determinant and One-Class Support Vector Machine Methods of Aircraft Meteorological Data Relay Data Assimilation in a Binary Typhoon Forecasting Case. Atmosphere 2023, 14, 1341.
31. Zhang, K.; Kang, X.; Li, S. Isolation Forest for Anomaly Detection in Hyperspectral Images. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 437–440.
32. Niu, Z.; Zhang, L.; Dong, P.; Weng, F.; Huang, W.; Zhu, J. Effects of direct assimilation of FY-4A AGRI water vapor channels on the Meiyu heavy-rainfall quantitative precipitation forecasts. Remote Sens. 2022, 14, 3484.
33. Lu, H.; Ding, L.; Ma, Z.; Li, H.; Lu, T.; Su, M.; Xu, J. Spatiotemporal assessments on the satellite-based precipitation products from Fengyun and GPM over the Yunnan-Kweichow Plateau, China. Earth Space Sci. 2020, 7, e2019EA000857.
34. Min, W.B.; Li, B.; Peng, J. Evaluation of total precipitable water derived from FY-2E satellite data over the southeast of Tibetan Plateau and its adjacent areas. Resour. Environ. Yangtze Basin 2015, 24, 625–631. (In Chinese)
35. Sha, Y.; Gagne, D.J.; West, G.; Stull, R. Deep-learning-based precipitation observation quality control. J. Atmos. Ocean. Technol. 2021, 38, 1075–1091.
36. Kleist, D.T.; Parrish, D.F.; Derber, J.C.; Treadon, R.; Wu, W.S.; Lord, S. Introduction of the GSI into the NCEP global data assimilation system. Weather Forecast. 2009, 24, 1691–1705.
37. Skamarock, W.C.; Klemp, J.B.; Dudhia, J.; Gill, D.O.; Barker, D.M.; Duda, M.G.; Huang, X.-Y.; Wang, W.; Powers, J.G. A Description of the Advanced Research WRF Model Version 4; NCAR Technical Note; National Center for Atmospheric Research: Boulder, CO, USA, 2019.
38. Huang, Y.; Cui, X. Moisture sources of an extreme precipitation event in Sichuan, China, based on the Lagrangian method. Atmos. Sci. Lett. 2015, 16, 177–183.
39. Cheng, X.; Li, Y.; Xu, L. An analysis of an extreme rainstorm caused by the interaction of the Tibetan Plateau vortex and the Southwest China vortex from an intensive observation. Meteorol. Atmos. Phys. 2016, 128, 373–399.
40. Yuan, X.; Yang, K.; Lu, H.; Wang, Y.; Ma, X. Impacts of moisture transport through and over the Yarlung Tsangpo Grand Canyon on precipitation in the eastern Tibetan Plateau. Atmos. Res. 2023, 282, 106533.
41. Li, J.; Lu, C.; Chen, J.; Zhou, X.; Yang, K.; Li, J.; Wu, X.; Xu, X.; Wu, S.; Hu, R.; et al. The influence of complex terrain on cloud and precipitation on the foot and slope of the southeastern Tibetan Plateau. Clim. Dyn. 2024, 62, 3143–3163.
42. Ziegler, C.L. Retrieval of thermal and microphysical variables in observed convective storms. Part 1: Model development and preliminary testing. J. Atmos. Sci. 1985, 42, 1487–1509.
43. Iacono, M.J.; Delamere, J.S.; Mlawer, E.J.; Shephard, M.W.; Clough, S.A.; Collins, W.D. Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res. 2008, 113, D13103.
44. Berg, L.K.; Gustafson, W.I.; Kassianov, E.I.; Deng, L. Evaluation of a modified scheme for shallow convection: Implementation of CuP and case studies. Mon. Weather Rev. 2013, 141, 134–147.
45. Park, S.; Bretherton, C.S. The University of Washington shallow convection and moist turbulence schemes and their impact on climate simulations with the community atmosphere model. J. Clim. 2009, 22, 3449–3469.
46. Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049.
47. Dutta, S.; Prasad, V.S.; Rajan, D. Impact study of integrated precipitable water estimated from Indian GPS measurements. Mausam 2014, 65, 461–480.
48. Gao, J.; Liu, Y. Determination of land degradation causes in Tongyu County, Northeast China via land cover change detection. Int. J. Appl. Earth Obs. Geoinf. 2010, 12, 9–16.
49. Huffman, G.J.; Stocker, E.F.; Bolvin, D.T.; Nelkin, E.J.; Tan, J. GPM IMERG Final Precipitation L3 Half Hourly 0.1 Degree × 0.1 Degree V06; Goddard Earth Sciences Data and Information Services Center (GES DISC): Greenbelt, MD, USA, 2019.
50. De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D.L. The Mahalanobis distance. Chemom. Intell. Lab. Syst. 2000, 50, 1–18.
51. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
52. Liang, T.; Sun, L.; Li, H. MODIS aerosol optical depth retrieval based on random forest approach. Remote Sens. Lett. 2021, 12, 179–189.
53. Do, P.N.; Chung, K.S.; Lin, P.L.; Ke, C.Y.; Ellis, S.M. Assimilating retrieved water vapor and radar data from NCAR S-PolKa: Performance and validation using real cases. Mon. Weather Rev. 2022, 150, 1177–1199.

Figure 1. The 500 hPa geopotential height (blue contours; gpm), 850 hPa water vapor flux (shading; kg m⁻¹ s⁻¹), and 850 hPa wind vectors (arrows; m s⁻¹) from 06:00 UTC 7 July 2013 to 06:00 UTC 9 July 2013, based on ERA5 reanalysis data (outer domain, d01); the purple line indicates the boundary of the Tibetan Plateau, and the black bold line indicates the boundary of Sichuan Province. The study domain and WRF model domains are shown in Figure 2.

Figure 2. Simulation domains and model nesting. The purple line indicates the boundary of the Tibetan Plateau, and the black bold line indicates the boundary of Sichuan Province. Red dots represent major cities (Chengdu, Mianyang, Ya’an, and Chongqing). The blue and red rectangles denote the outer (d01) and inner (d02) model domains, respectively. Background shading shows terrain elevation.

Figure 3. Distribution of PW data from FY2E satellite (left) and CMA ground stations (right) at 12:00 UTC on 8 July 2013. The different colors represent different PW value ranges.

Figure 4. Distribution of FY2E TPW data innovation at three representative time points: 06:00 UTC 8 July (top row), 06:00 UTC 9 July (middle row), and 06:00 UTC 10 July (bottom row). Each row shows the data distribution before QC (left column), after applying the MCD method (middle column), and after applying the Isolation Forest method (right column). Green histograms represent the data distribution, red dashed lines indicate fitted Gaussian distribution curves, and blue dashed lines mark zero innovation.

Figure 5. Box plots of FY-2E TPW data innovation (in mm) at 9 assimilation times before and after QC. The different colors represent different time points.

Figure 6. (A) Spatial distribution of rejected FY-2E TPW data points at 9 assimilation times after QC using the MCD method. (B) Spatial distribution of retained (passed) FY-2E TPW data points at 9 assimilation times after QC using the MCD method.

Figure 7. (A) Spatial distribution of rejected FY-2E TPW data points at 9 assimilation times after QC using the Isolation Forest method. (B) Spatial distribution of retained (passed) FY-2E TPW data points at 9 assimilation times after QC using the Isolation Forest method.

Figure 8. (A) The 500 hPa geopotential height (blue contours; gpm), vertically integrated water vapor flux (shading; 10⁻² g cm⁻¹ hPa⁻¹ s⁻¹) from 925 to 750 hPa, and 850 hPa wind vectors (arrows; m s⁻¹) from 06:00 UTC 8 July 2013 to 06:00 UTC 10 July 2013, based on the CTRL experiment simulation (outer domain, d01); the purple line indicates the boundary of the Tibetan Plateau, and the black bold line indicates the boundary of Sichuan Province. (B) Similar to (A) but for the EXPR1 experimental simulation. (C) Similar to (A) but for the EXPR2 (MCD) experimental simulation. (D) Similar to (A) but for the EXPR3 (Isolation Forest) experimental simulation.

Figure 9. Distribution of observed 6 h accumulated precipitation (inner domain, d02) from 06:00 UTC 8 July 2013 to 06:00 UTC 10 July 2013, provided by the CMA; unit: mm. Dot size is proportional to precipitation intensity, with larger dots indicating higher precipitation amounts.

Figure 10. Distribution of 6 h accumulated precipitation (inner domain, d02; unit: mm) from 06:00 UTC 8 July 2013 to 06:00 UTC 10 July 2013 in (A) the CTRL experiment, (B) the EXPR1 experiment, (C) the EXPR2 experiment (QC by the MCD method), and (D) the EXPR3 experiment (QC by the Isolation Forest method).

Figure 11. Bar graph of FSSs for the four experimental groups at nine assimilation times.

Figure 12. Bar graph of the mean FSSs for the four groups of experiments.

Table 1. WRF V4.2 configuration and parameterization schemes.

Item	Description
Dynamics	Primitive equation, non-hydrostatic
Vertical layers	72 levels
Grid spacing	9 km (d01); 3 km (d02)
Pressure at top level	10 hPa
Model domain (grid points)	d01: 381 × 369; d02: 421 × 412
Radiation	RRTMG scheme for both shortwave and longwave
Cumulus convection	Kain–Fritsch–Cumulus Potential scheme
Microphysics	NSSL 2-moment scheme
PBL	UW (Bretherton and Park) scheme

Table 3. Skewness and kurtosis values of FY2E TPW data innovation at 9 assimilation times after QC processing.

Assimilation Time (UTC, YYYYMMDDHH)	EXPR2-Skewness	EXPR3-Skewness	EXPR2-Kurtosis	EXPR3-Kurtosis
2013070806	0.13	−0.08	−0.83	0.19
2013070812	0.04	−0.18	−0.57	0.01
2013070818	−0.08	−0.01	−0.72	−0.39
2013070900	−0.48	−0.09	0.11	0.72
2013070906	−0.10	−0.26	−0.50	0.23
2013070912	−0.48	−0.28	−0.32	−0.63
2013070918	−0.59	−0.30	0.20	−0.56
2013071000	−0.53	−0.06	−0.53	−0.84
2013071006	−0.07	−0.10	−0.68	−0.54

Table 4. Standard deviation of FY2E TPW data innovation (in mm) at 9 assimilation times before and after QC.

Lead Time (h)	Before QC	EXPR2-MCD	EXPR3-Isolation Forest
0	4.13	2.58	2.28
6	4.29	2.74	2.69
12	4.45	2.60	2.73
18	4.98	3.33	3.00
24	4.40	2.38	2.44
30	4.63	3.58	3.30
36	4.29	2.69	2.45
42	4.40	3.35	1.90
48	3.83	2.50	2.13

Table 5. Mean Root Mean Square Error (RMSE, mm) of 6 h accumulated precipitation for different experiments at 9 assimilation times, verified against GPM data.

Assimilation Time (UTC, YYYYMMDDHH)	CTRL	EXPR1	EXPR2	EXPR3	No-QC
2013070806	20.22	15.21	8.43	8.43	33.53
2013070812	27.02	20.01	10.99	11.53	33.58
2013070818	35.53	24.62	15.51	15.90	43.16
2013070900	41.57	20.12	14.09	12.26	21.47
2013070906	49.77	10.58	10.00	8.82	33.95
2013070912	55.24	14.16	10.38	8.63	23.89
2013070918	59.91	13.02	13.26	14.56	31.63
2013071000	64.03	22.98	10.97	12.72	16.58
2013071006	69.31	9.39	6.45	7.57	26.70

Table 6. Mean Correlation Coefficient (CC) of 6 h accumulated precipitation for different experiments at 9 assimilation times, verified against GPM data.

Assimilation Time (UTC, YYYYMMDDHH)	CTRL	EXPR1	EXPR2	EXPR3	No-QC
2013070806	0.38	0.54	0.44	0.44	0.23
2013070812	0.31	0.31	0.23	0.19	0.13
2013070818	0.28	0.20	0.33	0.28	0.11
2013070900	0.52	0.48	0.48	0.47	0.10
2013070906	0.32	0.29	0.46	0.36	−0.01
2013070912	0.06	0.06	0.00	0.04	0.13
2013070918	0.10	0.12	0.13	0.13	0.10
2013071000	0.23	0.09	0.11	0.10	0.08
2013071006	0.29	0.17	0.06	0.13	0.01

Table 7. Mean Bias (mm) of 6 h accumulated precipitation for different experiments at 9 assimilation times, verified against GPM data.

Assimilation Time (YYYYMMDDHH)	CTRL	EXPR1	EXPR2	EXPR3	No-QC
2013070806	8.94	2.80	1.17	1.17	8.92
2013070812	12.83	5.23	3.25	3.85	6.98
2013070818	17.07	4.81	3.81	3.51	8.10
2013070900	20.69	2.92	4.20	3.25	1.46
2013070906	25.16	2.18	3.08	3.22	9.78
2013070912	28.71	3.96	2.93	2.50	4.62
2013070918	32.25	−0.75	1.91	2.28	4.49
2013071000	35.08	−1.34	0.72	1.05	0.12
2013071006	38.67	0.05	1.09	1.64	6.55
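
The three verification scores reported in Tables 5 to 7 can be sketched as follows for a single 6 h accumulation field. This is a minimal example under stated assumptions: the forecast and the GPM reference are taken to be already on the same grid, bias is taken as the mean of forecast minus observation, and CC is the Pearson correlation. The function name and the toy arrays are hypothetical and do not reproduce the authors' verification code.

```python
import numpy as np


def verify_precip(forecast, observed):
    """Return RMSE, Pearson CC and mean bias between two co-located fields.

    Assumptions: both inputs share one grid and use the same units (mm);
    grid points with a non-finite value in either field are ignored.
    """
    f = np.asarray(forecast, dtype=float).ravel()
    o = np.asarray(observed, dtype=float).ravel()
    valid = np.isfinite(f) & np.isfinite(o)
    f, o = f[valid], o[valid]
    diff = f - o
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    bias = float(np.mean(diff))  # positive means the forecast is too wet
    cc = float(np.corrcoef(f, o)[0, 1])
    return {"rmse": rmse, "cc": cc, "bias": bias}


# Toy 2 x 2 fields (mm): the forecast is 1 mm too wet at every grid point.
obs = np.array([[0.0, 2.0], [10.0, 30.0]])
fcst = obs + 1.0
scores = verify_precip(fcst, obs)
print(scores)
```

Averaging these per-field scores over the forecast periods following each assimilation time would give one value per table cell, although the exact averaging used in the tables is not specified in this section.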

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
