**4. Discussion**

Comparing the performance of all seven models shows that LGBM performed best in the low wind speed interval, while ET performed best in the high wind speed interval. However, these experimental results do not prove that every variable in Table 1 improves model performance; some variables may in fact reduce the accuracy of the model. It is therefore important to analyze the effect of each variable. It should be noted that in the high wind speed interval, spaceborne GNSS-R data exhibit different distributions and characteristics from those in the low wind speed interval, and the roles of the variables were not always consistent between the two intervals. Here, we use XGBoost as the basis for evaluating the effect of each variable. XGBoost uses the average gain (AG) of data splits across all trees to measure the effects of variables [51]. After model training, the AG of each variable is obtained by analyzing the XGBoost model structure and is defined as:

$$AG_{v_i} = \frac{\sum Gain_{v_i}}{S_{v_i}} \tag{23}$$

where $v_i$ is a variable used in the XGBoost model, $S_{v_i}$ is the number of times that $v_i$ is used to split the data across all trees, and $Gain_{v_i}$ is the gain obtained from each split on $v_i$, summed over all of those splits. Table 9 shows the AG of each variable in the low and high wind speed intervals, respectively.
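As an illustration of Equation (23), the following minimal Python sketch computes the AG from a list of split records and ranks the variables accordingly. The helper name `average_gain`, the `(variable, gain)` record layout and the gain values are assumptions made for this example only; they are not the values reported in Table 9. To our understanding, the XGBoost Python package reports the same per-variable average through `Booster.get_score(importance_type="gain")`.

```python
from collections import defaultdict


def average_gain(splits):
    """Return {variable: mean gain over all splits that use it}.

    `splits` is an iterable of (variable_name, gain) pairs, one pair per
    split node collected across all trees of a trained model
    (hypothetical layout, chosen for this sketch).
    """
    total_gain = defaultdict(float)  # numerator of Eq. (23): sum of gains
    n_splits = defaultdict(int)      # denominator of Eq. (23): split count
    for variable, gain in splits:
        total_gain[variable] += gain
        n_splits[variable] += 1
    return {v: total_gain[v] / n_splits[v] for v in total_gain}


# Made-up split records for illustration only (not data from the paper).
splits = [
    ("NBRCS", 8.0),
    ("NBRCS", 4.0),
    ("LES", 3.0),
    ("SNR", 1.0),
    ("LES", 1.0),
]

ag = average_gain(splits)
# Rank variables from largest to smallest average gain.
ranking = sorted(ag, key=ag.get, reverse=True)
```

Because the AG is an average per split rather than a total, a variable that is used in only a few splits can still rank highly if each of those splits yields a large gain.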


**Table 9.** Rankings of the effects of variables.

Although AG helps to verify the effectiveness of feature selection, it cannot be used as a direct basis for it. As such, the ranking in Table 9 needs to be validated through experimental results. To analyze the influences of different variables more intuitively, this study constructed 60 models based on ET, XGB and LGBM with different variables. Line charts were used to help in analyzing the influence of these variables. The *x*-axis in Figure 12 indicates the number of variables, which are added in the order given by the ranking of the effects of variables in Table 9. For example, in the low wind speed interval, if the number of variables was set at 4, NBRCS, LES, SNR and SWH\_swell were used in the modeling; in the high wind speed interval, if the number of variables was set at 3, SWH\_swell, NoiseFloor and NBRCS were used in the modeling.
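The procedure described above, in which variables are added one at a time in ranked order and each resulting model is scored, can be sketched as follows. This is a minimal outline under stated assumptions rather than the implementation used in this study: `nested_subsets`, `rmse`, `pearson_r`, `evaluate_subsets` and the `fit_predict` callback are hypothetical names, and R is assumed here to be the Pearson correlation coefficient.

```python
import math


def nested_subsets(ranked_variables):
    """Return the top-1, top-2, ..., top-n variable lists in ranked order."""
    return [ranked_variables[:k] for k in range(1, len(ranked_variables) + 1)]


def rmse(y_true, y_pred):
    """Root mean square error between reference and estimated values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumed definition of R)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def evaluate_subsets(ranked_variables, fit_predict, y_true):
    """Score one model per nested subset.

    `fit_predict(subset)` is a hypothetical callback that trains a model
    (e.g. ET, XGB or LGBM) on the given variables and returns its wind
    speed estimates for the test samples.
    """
    results = []
    for subset in nested_subsets(ranked_variables):
        y_pred = fit_predict(subset)
        results.append((len(subset), rmse(y_true, y_pred), pearson_r(y_true, y_pred)))
    return results


# First four variables of the low wind speed ranking quoted in the text.
low_wind_ranking = ["NBRCS", "LES", "SNR", "SWH_swell"]
subsets = nested_subsets(low_wind_ranking)
```

Each tuple returned by `evaluate_subsets` holds the number of variables, the RMSE and R, which correspond to one point on each of the two kinds of curves plotted in Figure 12.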

Figure 12 shows the relationship between the variables and the models, and it is highly consistent with Table 9. In the low wind speed interval, the AG of NBRCS is much larger than that of the other variables, which means that NBRCS is the most important variable in the low wind speed models. The two subgraphs in the first column of Figure 12 show that LES, SNR and SWH\_swell improved the performance of the model greatly. This is confirmed by Table 9, in which the AGs of LES, SNR and SWH\_swell are significantly greater than those of the remaining variables. These variables effectively reduced the RMSE of the model and increased the correlation coefficient between the wind speed estimates and the true values of wind speed. In the high wind speed interval, the models were mostly affected by SWH\_swell; this may have been due to the degradation of the performance of spaceborne GNSS-R technology at high wind speeds. This result also indicates that, especially in the high wind speed interval, spaceborne GNSS-R technology needs to fuse more reliable auxiliary information to achieve better retrieval results. The contributions of the other variables to the model are broadly similar. Unlike in the low wind speed interval, the effects of NoiseFloor and ScatterArea were significantly greater, while the effects of SNR and LES were lower. In the high wind speed interval, the quality of the DDM is lower, which decreases the correlation coefficients between sea surface MSS and the variables SNR and LES.

**Figure 12.** RMSEs and Rs of the wind speed retrieval models using different numbers of variables.

In general, the above analysis shows that the models using all variables achieved the best results in both the high and low wind speed intervals. In most cases, the accuracy of the model improved as the number of variables increased. Additionally, the influence of the number of variables differed between modeling methods, and the rankings of the effects of variables differed between wind speed intervals. These conclusions may be helpful for future research on spaceborne GNSS-R sea surface wind speed retrieval.
