1. Introduction
The China–France Oceanography Satellite (CFOSAT), a collaborative effort between the China National Space Administration (CNSA) and the National Center for Space Studies of France (CNES), was successfully launched on 29 October 2018 from the Jiuquan Satellite Launch Center in China. The satellite is equipped with two microwave sensors: the CFOSAT SCATterometer (CSCAT) of China and the Surface Wave Investigation and Monitoring (SWIM) instrument of France. CSCAT employs a rotating fan beam, which differs from the fixed fan beam and the rotating pencil beam of traditional scatterometers. The rotating fan beam facilitates the simultaneous acquisition of multiple azimuthal measurements of the sea surface, yielding a large number of independent backscatter coefficient samples obtained concurrently. The low-speed scanning of the antennas enhances the redundancy and reliability of the azimuthal backscatter measurements. CSCAT is currently the scatterometer with the highest original spatial resolution in the world, which makes it possible to develop high-quality sea surface wind products [
1]. However, studies have found that rain has a serious impact on the echo signals received by scatterometers [
2,
3,
4,
5,
6,
7,
8]. The CSCAT, as a Ku-band scatterometer, will be significantly influenced by rain, thereby limiting the accuracy of its wind products [
9]. Therefore, it is necessary to identify the rain-contaminated data in order to improve the quality of CSCAT L2B wind products.
In the past few decades, researchers have developed several quality control (QC) methods for scatterometers. The brightness temperature (TB) provided by the radiometer exhibits high sensitivity to rain, thus making an important contribution to rain identification. The Multidimensional Histogram (MUDH) is the earliest technique for identifying rain using brightness temperature (TB) and other rain characteristics, which was employed to generate a rain flag for SeaWinds on QuikSCAT [
10]. The combination of TB and wind speed standard deviation has been applied to OSCAT to develop a rain flag [
11]. However, the use of the radiometer TB to identify rain is significantly limited due to the lack of radiometers synchronized with scatterometers on many satellites. Subsequently, maximum likelihood estimation (MLE), which identifies low-quality wind by quantifying the deviation between the measured normalized radar cross-section (NRCS) and the NRCS calculated using the geophysical model function (GMF), has been demonstrated to significantly flag rain-contaminated data [
12,
13]. The empirically normalized objective function (ENOF) follows a similar principle to MLE, but a weighted approach is used instead of the NRCS measurement error variance of MLE for error quantification [
14]. Notably, the ENOF is primarily applicable for rain detection at low wind speeds, while exhibiting an underreporting rate ranging from 65% to 75% during high wind speeds and tropical cyclones.
In the early years, MLE was widely used in wind QC and could reject most of the rain-contaminated data. However, MLE also rejects some good wind data to ensure the effectiveness of its QC. Subsequently, several rain-flag techniques complementary to MLE were developed to improve QC. The Singularity Exponent (SE) is based on the spatial derivatives between wind vector cells (WVCs) [
15,
16], identifying poor quality wind data from spatial heterogeneity caused by rain [
17], and can therefore be complementary to MLE to flag more low-quality data. JOSS reduces the false alarm rate (FAR) of rain-contaminated data by using the background wind speed provided in the 2-D variational ambiguity removal (2-DVAR) process [
18]. The Bayesian algorithm (P rain flag) provides a posterior rain probability for each measurement in WVC with a low false alarm rate (FAR) and a low missing report rate (MRR) [
19].
In recent years, with the rise of artificial intelligence technology, machine learning models have shown great potential in rain identification. Two neural network (NN) models were used for rain detection and rain rate inversion on five sets of data samples from different regions observed by OSCAT [
20]. However, these models cannot identify rain from scatterometer parameters alone and rely on external collocated data sources, including parameters derived from numerical weather prediction (NWP) models: total precipitable water (TPW), ground relative humidity (RH), wind speed (WS), and wind direction (WD). The HY2RRM model for rain identification of HY-2A data was developed based on the K-Nearest Neighbor (KNN) algorithm [
21], employing the same set of rain-sensitive parameters as the MUDH rain flag. MUDH has also been ported to HY-2A data for comparison. The experimental results demonstrate that the KNN model outperforms the MUDH technique.
At present, there are few studies on the rain flag of CSCAT [
22,
23]. The JOSS rain flag has been proven to reduce the FAR of rain identification on CSCAT. However, the MLE/JOSS-based rain flagging technique has a large MRR [
12,
18], and the quality control of wind products needs to be improved. The current CSCAT rain flag does not specify how the missed rain data should be processed further. CFOSAT is not equipped with a radiometer and cannot provide TB synchronized with CSCAT. Consequently, MUDH-like methods cannot be used for the CSCAT rain flag. Direct collocation with rain data is considered the optimal quality control method, but reliable rain data in the same spatiotemporal domain as the scatterometer are often difficult to obtain.
CSCAT requires more effective quality control methods in order to improve the quality of the CSCAT L2B wind products. The rich observation information in CSCAT provides sufficient data for machine learning to identify rain. eXtreme Gradient Boosting (XGBoost) has high-dimensional data processing and feature selection capabilities that allow it to make full use of this information. In this paper, a rain identification model is constructed in which XGBoost is optimized by the Dung Beetle Optimizer (DBO) algorithm [
24]; the resulting model is referred to as DBO-XGBoost. The model independently realizes rain identification and rain intensity classification using only CSCAT's own information, without relying on external wind data or radiometer TB data. This approach enables more timely and efficient rejection of rain data, thereby enhancing the product quality of CSCAT. This paper is organized as follows:
Section 2 describes the collocated dataset and the methods of constructing the model.
Section 3 analyzes the rain identification and rain intensity classification performance of the model.
Section 4 discusses the effect of the model under different sea conditions and the influence of different input information on rain identification. The conclusions are presented in
Section 5.
3. Results
3.1. Evaluation of the DBO-XGBoost Model in Rain Identification
In this study, we assessed the performance of DBO-XGBoost and compared it with K-Nearest Neighbor (KNN), XGBoost, and the CSCAT L2B product in terms of rain flagging effectiveness. The XGBoost baseline uses the default internal parameters, with the number of estimators set to 100, max_depth set to 6, and learning_rate set to 0.3. KNN is a classical classification algorithm [
29]. Its underlying principle is that if a majority of the K samples nearest to the target point in the feature space belong to a specific category, then the target sample is inferred to belong to this category as well. The value of K is a hyperparameter of the KNN algorithm that determines the number of nearest neighbors considered in the classification. A smaller value of K increases model complexity, resulting in reduced training error but weakened generalization ability. Conversely, a larger value of K reduces model complexity, leading to increased training error but improved generalization ability. Therefore, selecting an appropriate value for K plays a pivotal role in the model. The KNN model has been demonstrated to perform well in the rain identification of HY-2A [
21]. The performance of KNN with K = 3 (KNN3) and KNN with K = 5 (KNN5) is compared to that of DBO-XGBoost models. The input features and targets of all comparative models are consistent with DBO-XGBoost.
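The majority-vote principle of KNN described above can be illustrated with a minimal sketch. It assumes Euclidean distance and uses made-up two-dimensional feature vectors; the function name `knn_predict` and the toy data are hypothetical and are not the features or implementation used in this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample, sorted ascending.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Labels of the k closest samples.
    nearest = [label for _, label in dists[:k]]
    # The most frequent label among the neighbors is the prediction.
    return Counter(nearest).most_common(1)[0][0]


# Toy, made-up samples (not CSCAT data): label 1 = rain, 0 = rain-free.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

pred_k3 = knn_predict(train_X, train_y, (0.8, 0.8), k=3)  # KNN3-style vote
pred_k5 = knn_predict(train_X, train_y, (0.8, 0.8), k=5)  # KNN5-style vote
print(pred_k3, pred_k5)
```

With K = 3 only the three closest samples vote, whereas with K = 5 two more distant samples also take part, which is how a larger K smooths the decision.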
In order to evaluate the performance of the model more comprehensively, classification evaluation metrics are used to systematically score the model. The selected evaluation indicators are as follows: (1) Accuracy: the proportion of correctly classified data in the total dataset; (2) Precision: the proportion of accurately predicted rain-contaminated data among all predicted rain-contaminated data; (3) False alarm rate (FAR): the proportion of data predicted as rain-contaminated but actually rain-free among all rain-free data; (4) Missing report rate (MRR): the proportion of rain data predicted as rain-free among all rain data; (5) Rejection rate: the proportion of data identified as rain-contaminated by the model in the total data; (6) Actual rain: the proportion of actual rain-contaminated data in the total data. The formulas for the evaluation metrics are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{FAR} = \frac{FP}{FP + TN}$$
$$\mathrm{MRR} = \frac{FN}{TP + FN}$$
$$\mathrm{Rejection\ rate} = \frac{TP + FP}{TP + TN + FP + FN}$$
$$\mathrm{Actual\ rain} = \frac{TP + FN}{TP + TN + FP + FN}$$
where TP is the number of positive (rain-contaminated) samples predicted by the model as positive; TN is the number of negative (rain-free) samples predicted as negative; FP is the number of negative samples predicted as positive; and FN is the number of positive samples predicted as negative.
Accuracy is the most commonly used evaluation metric. However, because rain events are infrequent, a model that correctly classifies the large proportion of rain-free data will still achieve high accuracy even if a substantial amount of rain-contaminated data is misclassified as rain-free. Our primary objective, however, is to identify as much rain-contaminated data as possible. Therefore, a superior rain identification model should classify rain and no rain with high precision, low FAR, and low MRR. In practical applications, these requirements are often not fully met at the same time. Increasing the precision of each reported rain event may raise the MRR, whereas aiming for comprehensive reporting may result in increased FAR and reduced precision.
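As a minimal sketch, the six indicators defined above can be computed directly from the four confusion-matrix counts. The function name `rain_flag_metrics` and all counts below are hypothetical and chosen only for illustration; they are not results from this study. The second call shows why accuracy alone is misleading when rain is rare.

```python
import math


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan


def rain_flag_metrics(tp, tn, fp, fn):
    """Compute the six indicators from confusion counts (rain = positive class)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": _ratio(tp + tn, total),
        "precision": _ratio(tp, tp + fp),
        "far": _ratio(fp, fp + tn),          # false alarms among rain-free data
        "mrr": _ratio(fn, tp + fn),          # missed reports among rain data
        "rejection_rate": _ratio(tp + fp, total),
        "actual_rain": _ratio(tp + fn, total),
    }


# Hypothetical counts: 80 rain samples and 920 rain-free samples.
model = rain_flag_metrics(tp=60, tn=900, fp=20, fn=20)

# A trivial classifier that never flags rain: accuracy stays high, MRR is 1.
never_rain = rain_flag_metrics(tp=0, tn=920, fp=0, fn=80)

print(model)
print(never_rain)
```

In this toy case the trivial classifier still reaches an accuracy of 0.92 while missing every rain sample, which is why precision, FAR, and MRR are reported alongside accuracy.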
Table 2 shows the evaluation of the DBO-XGBoost, XGBoost, KNN5, and KNN3 models and the CSCAT rain flag. Among all the models, DBO-XGBoost exhibits the highest accuracy and precision. Compared with KNN5 and KNN3, XGBoost reduces the FAR while markedly increasing the MRR. Although its precision of rain identification is improved, its accuracy is reduced, and its overall performance deteriorates. DBO-XGBoost strikes a balance between the FAR and the MRR of rain identification. Compared with XGBoost, the DBO-XGBoost model exhibits a slight increase in the FAR while significantly reducing the MRR, thereby enhancing its overall performance. Compared with KNN3 and KNN5, the DBO-XGBoost model has a lower FAR. The ROC curves and AUC values of DBO-XGBoost, XGBoost, KNN5, KNN3, and the CSCAT rain flag are shown in
Figure 5. All machine learning models perform better than the CSCAT rain flag. Among all the curves, the ROC curve of the DBO-XGBoost classifier is the closest to the coordinate point (0.0, 1.0), and its area under the curve is the largest. The AUC of the DBO-XGBoost model is 3.93% higher than that of XGBoost, 3.08% higher than that of KNN5, 4.84% higher than that of KNN3, and 35.63% higher than that of the CSCAT rain flag. This shows that DBO-XGBoost performs better than XGBoost, KNN5, KNN3, and the CSCAT rain flag. Overall, DBO-XGBoost has the best comprehensive performance and excellent rain identification ability.
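For readers unfamiliar with the ROC and AUC comparison, the following minimal sketch builds a ROC curve by sweeping a decision threshold over classifier scores and integrates it with the trapezoidal rule. The function name `roc_auc`, the labels, and the scores are hypothetical; this is an illustration of the general technique, not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """Return (fpr, tpr, auc) for binary labels (1 = rain) and real-valued scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Samples with identical scores are added together as one ROC point.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
        i = j
    # Trapezoidal integration of TPR over FPR.
    auc = sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )
    return fpr, tpr, auc


# Made-up labels and scores for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr, auc = roc_auc(labels, scores)
print(round(auc, 4))
```

A curve that passes closer to the point (0.0, 1.0) encloses a larger area, so a larger AUC corresponds to a better trade-off between detections and false alarms.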
Figure 6 shows the retrieved wind speed scatter point density of rain-free data and rain-contaminated data flagged by DBO-XGBoost, XGBoost, KNN5, KNN3, and the CSCAT rain flag, which indirectly reflects the accuracy of the model in identifying rain. The root mean square error (RMSE) between CSCAT L2B wind speed and ERA5 wind speed without rain-contaminated data is calculated, as is the RMSE between CSCAT L2B wind speed and ERA5 wind speed flagged as rain-contaminated data by the models. The rain-free data (
Figure 6a–f) and rain-contaminated data (
Figure 6g–l) were flagged by the four machine learning models, the CSCAT rain flag, and the GPM KuPR collocation. For all data, including rain-free and rain-contaminated data, the correlation coefficient (corr) is 0.90752, the bias is 0.13155 m/s, and the RMSE is 1.5793 m/s. Owing to limited panel space, these values are not shown in the figure. It can be seen that the RMSE of the wind field is reduced by all methods and that the machine learning models perform better than the CSCAT rain flag. The result of DBO-XGBoost is significantly improved compared to XGBoost and is comparable to the results of KNN5 and KNN3. The RMSE of the retrieved wind speed for data flagged as rain-contaminated is considerably high, surpassing 2 m/s in all cases. This indicates that the filtered rain data contribute significantly to deviations in the retrieved wind speed. In terms of the overall trend, the retrieved wind speed of rain-affected data is overestimated in the low wind speed region and underestimated in the high wind speed region.
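The three statistics quoted above (corr, bias, and RMSE) can be sketched as follows. The sketch assumes that bias is defined as retrieved minus reference wind speed and that corr is the Pearson correlation coefficient; neither convention is stated explicitly in the text. The function name `wind_speed_stats` and the sample values are hypothetical.

```python
import math


def wind_speed_stats(retrieved, reference):
    """Return (corr, bias, rmse) between retrieved and reference wind speeds."""
    n = len(retrieved)
    diffs = [r - e for r, e in zip(retrieved, reference)]
    # Assumed sign convention: bias = mean(retrieved - reference).
    bias = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    # Pearson correlation coefficient.
    mean_r = sum(retrieved) / n
    mean_e = sum(reference) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(retrieved, reference))
    var_r = sum((r - mean_r) ** 2 for r in retrieved)
    var_e = sum((e - mean_e) ** 2 for e in reference)
    corr = cov / math.sqrt(var_r * var_e)
    return corr, bias, rmse


# Made-up wind speeds in m/s (stand-ins for scatterometer and reanalysis values).
retrieved = [5.0, 7.0, 9.0, 11.0]
reference = [4.0, 6.0, 8.0, 10.0]
corr, bias, rmse = wind_speed_stats(retrieved, reference)
print(corr, bias, rmse)
```

Applying such a function separately to the subset kept by a rain flag and to the subset it rejects gives the kind of per-panel comparison described for Figure 6.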
3.2. Evaluation of the DBO-XGBoost Model in Rain Intensity Classification
In the rain identification experiment, DBO-XGBoost, KNN5, and KNN3 all performed well. In contrast, the XGBoost model performed worse but was still better than the CSCAT rain flag. The multi-classification ability of the machine learning models makes it possible to further classify the rain intensity of the rain-contaminated data, which the CSCAT rain flag cannot do. Such classification can improve our ability to evaluate the degree of rain contamination in CSCAT data and facilitate subsequent product processing. According to the standard of the China Meteorological Administration, we classify rain intensity into four levels: light rain (0.004 to 0.41 mm/h), heavy rain (0.41 to 2.08 mm/h), torrential rain (2.08 to 4.16 mm/h), and heavy downpour (above 4.16 mm/h). The experimental procedure for this subsection is as follows. First, the training set used for constructing the rain identification model is also used to build the rain intensity classification model with the same hyperparameters. Then, the truly labeled rain-contaminated data derived from the GPM collocation are used to train the model to ensure that it is realistic. However, because the rain intensity classification model follows the rain identification experiment, its objective is to further categorize the identified rain into distinct intensities. Therefore, during the model testing phase, we specifically selected the data that were previously identified as rain-contaminated to evaluate the performance of the rain intensity classification model.
Figure 7 shows the process of rain intensity classification.
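The four intensity levels defined above can be expressed as a simple threshold lookup. In this minimal sketch, the handling of values exactly on a boundary (assigned to the higher level) and the treatment of rates below 0.004 mm/h as rain-free are assumptions, since the text does not specify them; the names `LEVEL_EDGES`, `LEVEL_NAMES`, and `rain_level` are hypothetical.

```python
import bisect

# Lower edges of the four rain levels in mm/h, taken from the text.
LEVEL_EDGES = [0.004, 0.41, 2.08, 4.16]
# Index 0 ("no rain") for rates below the first edge is an assumption.
LEVEL_NAMES = ["no rain", "light rain", "heavy rain", "torrential rain", "heavy downpour"]


def rain_level(rate_mm_per_h):
    """Map a rain rate in mm/h to its intensity level name."""
    # bisect_right counts how many edges are <= the rate, so a value
    # exactly on an edge falls into the higher level (assumed convention).
    return LEVEL_NAMES[bisect.bisect_right(LEVEL_EDGES, rate_mm_per_h)]


for rate in (0.0, 0.1, 1.0, 3.0, 10.0):
    print(rate, rain_level(rate))
```

Labels produced in this way from collocated rain rates could serve as the multi-class targets for an intensity classifier.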
The evaluation of the DBO-XGBoost, XGBoost, KNN5, and KNN3 in the rain intensity classification is shown in
Table 3. Rain intensity classification is performed only on data in the presence of rain, so FAR and MRR do not need to be considered. Furthermore, for each rain level, accuracy is defined as the proportion of correctly predicted data among all data that actually belong to that level. The evaluation metrics include accuracy, precision, and the comparison between the rejection rate and the proportion of actual rain.
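The per-level accuracy and precision described above can be sketched as follows. The sketch assumes that precision for a level is the proportion of correct predictions among all data predicted at that level, by analogy with the binary definition; the function name `per_level_scores` and the labels are hypothetical toy values.

```python
import math


def per_level_scores(true_levels, predicted_levels, levels):
    """Return per-level accuracy and precision for a multi-class result."""
    scores = {}
    for level in levels:
        correct = sum(
            1 for t, p in zip(true_levels, predicted_levels)
            if t == level and p == level
        )
        actual = sum(1 for t in true_levels if t == level)
        predicted = sum(1 for p in predicted_levels if p == level)
        scores[level] = {
            # Correct predictions among all data actually at this level.
            "accuracy": correct / actual if actual else math.nan,
            # Correct predictions among all data predicted at this level.
            "precision": correct / predicted if predicted else math.nan,
        }
    return scores


# Made-up labels for illustration only.
true_levels = ["light", "light", "light", "heavy", "heavy", "torrential"]
predicted_levels = ["light", "light", "heavy", "heavy", "light", "torrential"]
result = per_level_scores(true_levels, predicted_levels, ["light", "heavy", "torrential"])
print(result)
```

In the toy data, one light rain sample predicted as heavy lowers both the accuracy of the light level and the precision of the heavy level, which illustrates how errors at one level propagate to the others.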
For XGBoost, KNN5, and KNN3, only a few of the accuracy and precision values in rain intensity classification exceed 80%. The performance of these models in categorizing rain intensity is suboptimal, with more than half of their accuracy and precision values falling below 70%. Rain intensity classification demands strong classification ability because the accuracies of the individual levels are correlated. For instance, the classification accuracy of KNN5 for rain is merely 32.01%, indicating that approximately 70% of those rain instances are misclassified into the other three levels instead of being correctly classified, thereby resulting in a low precision for those levels. The KNN5 model performs commendably in rain identification but falls short of the requirements for rain intensity classification. The DBO-XGBoost model clearly performs best among the four machine learning models: its accuracy and precision for the light rain, torrential rain, and heavy downpour levels all exceed 80%, and some exceed 90%. For all four machine learning models, the classification precision of the heavy rain level is lower than that of the other levels. The precision of KNN3 at this level is only 19.72%, whereas that of DBO-XGBoost is still close to 80%, which shows that DBO-XGBoost has an excellent classification ability across all rain intensities.
During rain identification, some rain-free data are mistakenly classified as rain-contaminated. In the rain intensity classification experiment, such data should ideally be classified as light rain rather than as higher rain levels, which minimizes the impact of the misclassification. In the results of the DBO-XGBoost model, the proportion of rain-free data classified into the light rain level is 78%, and the proportion classified into the heavy rain level is 20.39%. Additionally, a mere 0.36% and 0.27% of rain-free data are misclassified as torrential rain and heavy downpour, respectively. This shows that DBO-XGBoost not only exhibits high accuracy in classifying rain intensity but also effectively reduces the misclassification observed in previous classification methods.
Figure 8,
Figure 9,
Figure 10 and
Figure 11 show the retrieved wind speed scatter point density of different rain intensities classified by DBO-XGBoost, XGBoost, KNN5, and KNN3, respectively, which indirectly reflects the performance of the models in classifying rain intensity. It can be seen that higher rain intensity is associated with worse wind quality. Comparing the wind speed RMSE of DBO-XGBoost and XGBoost in light rain shows that XGBoost exhibits a larger RMSE than DBO-XGBoost. This discrepancy arises from the enhanced capability of DBO-XGBoost to identify a greater volume of light rain data. Specifically, owing to the low accuracy and precision of the light rain detected by XGBoost, a portion of higher-intensity rain is classified as light rain, leading to an increased RMSE. Therefore, the wind speed statistics for the same rain intensity are not suitable for direct comparison between models and must be analyzed in conjunction with the evaluation metrics of the models.
5. Conclusions
XGBoost optimized by the DBO algorithm is used to construct DBO-XGBoost, which performs rain identification and rain intensity classification for the quality control of CSCAT wind products. A dataset generated by the collocation of CSCAT and GPM is used for constructing DBO-XGBoost. We evaluated the performance of DBO-XGBoost and compared it with KNN, XGBoost, and the CSCAT rain flag. In terms of rain identification, the QC results of all machine learning models are better than those of the CSCAT rain flag. DBO-XGBoost performs better than XGBoost and is comparable to KNN5 and KNN3. In the experiment of classifying light rain, heavy rain, torrential rain, and heavy downpour, DBO-XGBoost outperforms XGBoost, KNN3, and KNN5.
Furthermore, we evaluated the performance of DBO-XGBoost in rain identification and rain intensity classification under two different sea conditions. The accuracy and precision of DBO-XGBoost in rain identification at low wind speeds are higher than those at high wind speeds, and the FAR and MRR are lower than those at high wind speeds. This is probably due to the diminishing impact of rain on NRCS with increasing wind speed. DBO-XGBoost misclassifies only a part of light rain at low wind speeds while misclassifying both light rain and heavy rain at high wind speeds, indicating that the classification of rain intensity at low wind speeds is more accurate. An orbit of CSCAT data is also used to evaluate the performance of DBO-XGBoost. The results show that data continuous in time and space are more conducive to rain identification but do not significantly improve the classification of rain intensity. Therefore, the classification of heavy rain and torrential rain still needs further research.
The DBO-XGBoost model developed for rain identification and rain intensity classification has several advantages: (1) The rain-contaminated data can be flagged directly without collocation with external data, which improves the timeliness and utilization rate of CSCAT data. (2) Compared with the CSCAT rain flag, the DBO-XGBoost model exhibits superior rain identification ability and can also classify rain intensity to evaluate the severity of rain events, which the CSCAT rain flag cannot. (3) The machine learning model can simplify the data processing flow, which is more efficient than the traditional rain flag method. In the future, we will consider the correction of rain-contaminated data and make fuller use of the information measured by CSCAT.