*3.1. Data Processing Flow*

This study uses the CYGNSS Level 1B (L1B) product, which contains Delay Doppler Maps (DDMs) together with other engineering and science measurement parameters. CYGNSS data cover latitudes from 40°S to 40°N at a spatial resolution of ~25 km. The sampling rate of the data used in this study is 2 Hz. Unlike most previous studies on wind speed estimation, this study adopts the latest CYGNSS v3.1 data instead of CYGNSS v2.1 data. Several data fields were empirically corrected in the v2.1 L1 calibration algorithm and therefore need to be carefully examined before modeling. Additionally, time-dependent variations have been observed in v2.1 data due to the variability of the transmitter and receiver. All these problems have been addressed in v3.1 data. The data are distributed by NASA in the netCDF file format and can be downloaded from https://podaac.jpl.nasa.gov/dataset/CYGNSS\_L1\_V3.1 (accessed on 26 March 2022) [29,30].

ECMWF reanalysis data (i.e., ERA-5) were used as the ground-truth data. ECMWF obtains hourly ERA-5 reanalysis datasets by assimilating meteorological data from different sources. The current sea surface wind speed product of ECMWF can be used as the ground-truth data in CYGNSS sea surface wind speed retrieval [25]. In this study, we use two ERA-5 parameters: the 10 m (above sea surface) u-component of neutral wind speed *WSu*<sup>10</sup> and the 10 m v-component of wind speed *WSv*<sup>10</sup>, i.e., the eastward component and the northward component of the 10 m wind speed. The horizontal wind speed at 10 m above the sea surface, *WS*<sup>10</sup>, is obtained as the square root of the sum of the squares of these two parameters. Because CYGNSS data are sampled at an interval of half a second whereas ERA-5 data are hourly, the two datasets need to be matched temporally. The spatial resolution of ERA-5 is 0.5° × 0.5°, which differs considerably from that of CYGNSS, so spatial matching is also required.
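The wind speed computation and the matching step described above can be illustrated with a minimal sketch. The matching strategy is not specified in the text, so the nearest-neighbour lookup in time, latitude and longitude used here is an assumption, and the function and argument names (`wind_speed_10m`, `nearest_index`, `match_era5_to_cygnss`) are hypothetical. The sketch also assumes that both datasets use the same time reference and longitude convention.

```python
import numpy as np


def wind_speed_10m(u10, v10):
    """Horizontal 10 m wind speed from its eastward (u) and northward (v) components."""
    return np.hypot(u10, v10)  # sqrt(u10**2 + v10**2), element-wise


def nearest_index(axis, values):
    """For each entry of `values`, return the index of the closest entry of `axis`."""
    axis = np.asarray(axis, dtype=float)
    values = np.asarray(values, dtype=float)
    return np.abs(values[:, None] - axis[None, :]).argmin(axis=1)


def match_era5_to_cygnss(sample_time, sample_lat, sample_lon,
                         grid_time, grid_lat, grid_lon, grid_ws):
    """Assumed nearest-neighbour matching of gridded wind speed to point samples.

    grid_ws has shape (len(grid_time), len(grid_lat), len(grid_lon)).
    Times are plain numbers (e.g., seconds since a common epoch).
    """
    t_idx = nearest_index(grid_time, sample_time)
    lat_idx = nearest_index(grid_lat, sample_lat)
    lon_idx = nearest_index(grid_lon, sample_lon)
    return np.asarray(grid_ws)[t_idx, lat_idx, lon_idx]


# Toy example: two hourly time steps on a 2 x 2 grid with 0.5 degree spacing.
grid_time = [0.0, 3600.0]
grid_lat = [0.0, 0.5]
grid_lon = [10.0, 10.5]
grid_ws = np.arange(8, dtype=float).reshape(2, 2, 2)

matched = match_era5_to_cygnss(
    sample_time=[100.0, 3000.0],
    sample_lat=[0.4, 0.1],
    sample_lon=[10.1, 10.4],
    grid_time=grid_time, grid_lat=grid_lat, grid_lon=grid_lon, grid_ws=grid_ws,
)
```

Interpolation in time or space would be an equally plausible choice; nearest-neighbour lookup is used here only because it is the simplest to state.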

To analyze the performance of the machine learning methods in different wind speed intervals, two datasets are constructed according to the wind speed distribution: a low wind speed dataset with wind speeds within 0–15 m/s and a high wind speed dataset with wind speeds within 15–30 m/s. To ensure that the data are representative and to improve the generalization ability of the models, this study mainly uses randomly selected data from 2019 to 2021. Figure 3 shows the spatial distribution of all data used in this paper. Red points represent low wind speed data and green points represent high wind speed data. Most high wind speed data appear at higher latitudes, while low wind speed data appear at all latitudes. It should be noted that the sea surface roughness near the coast may be affected by land [6], which degrades the performance of GNSS-R technology in retrieving sea surface wind speeds and other parameters [26].
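As a minimal sketch of the dataset split described above, the samples can be partitioned with boolean masks on the reference wind speed. The text does not say which dataset a sample at exactly 15 m/s belongs to; assigning it to the high wind speed dataset is an assumption, as are the function and parameter names.

```python
import numpy as np


def split_by_wind_speed(ws, threshold=15.0, upper=30.0):
    """Return boolean masks (low, high) for the two wind speed datasets.

    Assumption: low is [0, threshold) and high is [threshold, upper];
    samples outside 0 to `upper` m/s belong to neither dataset.
    """
    ws = np.asarray(ws, dtype=float)
    low = (ws >= 0.0) & (ws < threshold)
    high = (ws >= threshold) & (ws <= upper)
    return low, high


ws = np.array([3.0, 15.0, 22.0, 31.0])
low_mask, high_mask = split_by_wind_speed(ws)
low_ws, high_ws = ws[low_mask], ws[high_mask]
```

The same masks can be applied to the CYGNSS feature arrays so that features and reference wind speeds stay aligned.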

**Figure 3.** The spatial distribution of all data used in this paper.

The process of wind speed retrieval can be briefly summarized as containing four steps:


Figure 4 shows a flow chart of the proposed model construction and evaluation methods. Figure 5 shows the histogram of the wind speed distribution. High wind speed data are more difficult to obtain than low wind speed data, and a large share of the former is concentrated in the range of 15–20 m/s. To evaluate the performance of the models and the effect of the variables, three metrics are chosen, i.e., the root mean square error (RMSE), the correlation coefficient (R) and the mean difference (MD), defined as:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - Y_i)^2} \tag{17}$$

$$R = \frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right) \left(Y_i - \overline{Y}\right)}{\sqrt{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2 \sum_{i=1}^{n} \left(Y_i - \overline{Y}\right)^2}} \tag{18}$$

$$MD = \frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i) \tag{19}$$

where *n* is the total number of data samples, {*X<sub>i</sub>*} are the wind speed estimates, {*Y<sub>i</sub>*} are the ERA-5 wind speed data, $\overline{X}$ is the mean of {*X<sub>i</sub>*} and $\overline{Y}$ is the mean of {*Y<sub>i</sub>*}.
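The three metrics in Equations (17)–(19) can be written directly in NumPy. The sketch below is only an illustration of the definitions, with hypothetical function names. Note that MD is computed as reference minus estimate, following Equation (19).

```python
import numpy as np


def rmse(x, y):
    """Root mean square error between estimates x and reference y, Eq. (17)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))


def correlation(x, y):
    """Correlation coefficient R, Eq. (18)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


def mean_difference(x, y):
    """Mean difference MD = mean(y - x), Eq. (19)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(y - x))


# Toy values only, not results from the study.
estimates = [1.0, 2.0, 3.0]   # X_i
reference = [2.0, 4.0, 6.0]   # Y_i
scores = (rmse(estimates, reference),
          correlation(estimates, reference),
          mean_difference(estimates, reference))
```

With this sign convention, a positive MD means the estimates are on average lower than the reference values.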

**Figure 4.** Model construction process and evaluation methods.

**Figure 5.** Wind speed distribution histogram. The red dotted line divides the dataset into the low wind speed dataset and the high wind speed dataset.
