**Statistical and Machine Learning Models for Remote Sensing Data Mining - Recent Advancements**

Editors

**Monidipa Das Soumya K. Ghosh V. M. Chowdary Pabitra Mitra Santosh Rijal**

MDPI Basel Beijing Wuhan Barcelona Belgrade Manchester Tokyo Cluj Tianjin

*Editors*

Monidipa Das, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India

Soumya K. Ghosh, Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

V. M. Chowdary, Mahalanobis National Crop Forecast Centre, Department of Agriculture, Cooperation & Farmers Welfare, New Delhi, India

Pabitra Mitra, Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

Santosh Rijal, Department of Geography, Virginia Tech, Blacksburg, United States

*Editorial Office*

MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Remote Sensing* (ISSN 2072-4292) (available at: www.mdpi.com/journal/remotesensing/special_issues/rs_data_mining).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-4592-9 (Hbk) ISBN 978-3-0365-4591-2 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editors**

#### **Monidipa Das**

Monidipa Das is currently a DST-INSPIRE Faculty at the Machine Intelligence Unit (MIU), Indian Statistical Institute (ISI) Kolkata, India. Previously, she was a postdoctoral research fellow in the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore. She received her Ph.D. degree in computer science and engineering from the Indian Institute of Technology (IIT) Kharagpur, in 2018, and her M.E. degree in computer science and engineering from the Indian Institute of Engineering Science and Technology (IIEST), Shibpur, in 2013. Her research interests include spatial informatics, spatio-temporal data mining, soft computing, and machine learning. Dr. Das has research publications in a number of reputed international journals and international conferences. She is a member of the ACM, the IEEE Computational Intelligence Society, and the IEEE Geoscience and Remote Sensing Society.

#### **Soumya K. Ghosh**

Soumya K. Ghosh received his Ph.D. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kharagpur, India. Presently, he is a Professor with the Department of Computer Science and Engineering, IIT Kharagpur. Before joining IIT Kharagpur, he worked for the Indian Space Research Organization in the area of satellite remote sensing and geographic information systems. He has been awarded the National Geospatial Chair Professorship by the Department of Science and Technology, Government of India, in 2017. He has more than 300 research papers in reputed journals and conference proceedings. His research interests include spatial data science, spatial web services, and cloud computing. Dr. Ghosh is a senior member of IEEE.

#### **V. M. Chowdary**

V. M. Chowdary is currently the Director of the Mahalanobis National Crop Forecast Centre (MNCFC), Department of Agriculture, Cooperation & Farmers Welfare. Previously, he was the Deputy General Manager of the Regional Remote Sensing Centre-North (Delhi), NRSC, ISRO. Earlier, he worked as a scientist and head (applications) at the Regional Remote Sensing Centre-East, Kolkata, National Remote Sensing Centre (NRSC), ISRO, India. His areas of interest include the application of geospatial technologies, multi-criteria analysis, and soft computing tools for agricultural water management, integrated watershed management, hydrological modeling, and land use/cover change. He has published widely in peer-reviewed journals. He is the recipient of the Eminent Scientist Award from NESA and the Team Excellence Award from ISRO. He was also selected as a 'Fellow of the A.P. Akademi of Sciences (FAPAS)' for the year 2019.

#### **Pabitra Mitra**

Pabitra Mitra received the B.Tech. degree in electrical engineering from IIT Kharagpur, Kharagpur, India, in 1996, and the Ph.D. degree from the Department of Computer Science and Engineering, Indian Statistical Institute, Kolkata, India, in 2005. He is currently a Professor with the Department of Computer Science and Engineering, IIT Kharagpur. His research focuses on machine learning, pattern recognition, data mining, information retrieval, and computational biology. He is a recipient of the prestigious Indian National Academy of Engineering Young Engineer Award and the Royal Society UK Indian Science Network Award. He is a member of the Indian Unit for Pattern Recognition and Artificial Intelligence and a senior member of IEEE.

#### **Santosh Rijal**

Dr. Santosh Rijal is an assistant professor in the Department of Geography at Virginia Polytechnic Institute and State University (Virginia Tech). He completed his master's degree at the University of North Dakota and his doctoral degree at Southern Illinois University Carbondale. Dr. Rijal's research focuses on the application of geospatial technology in land condition disturbance, land use/land cover change, and natural resource management. He has been involved in research on assessing and monitoring the land condition of military installations in the United States. He currently teaches geospatial courses, such as Principles of GIS, Modeling with GIS, and Introduction to Remote Sensing, in the Department of Geography at Virginia Tech.

## *Editorial* **Statistical and Machine Learning Models for Remote Sensing Data Mining—Recent Advancements**

**Monidipa Das 1,\* , Soumya K. Ghosh <sup>2</sup> , Vemuri M. Chowdary <sup>3</sup> , Pabitra Mitra <sup>2</sup> and Santosh Rijal <sup>4</sup>**

<sup>1</sup> Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India


During the last few decades, the remarkable progress in satellite remote sensing (RS) technology has enabled us to capture coarse- to high-resolution earth imagery at weekly, daily, and even hourly intervals. This wealth of earth surface data, if analyzed effectively, can provide significant insights into various geo-spatial processes and, eventually, can help us make crucial decisions in a timely manner. Nevertheless, these RS data, continuously captured at varying spatial, spectral, and temporal resolutions, are not only voluminous but also heterogeneous, being acquired by diverse categories of sensors (e.g., optical and microwave). Consequently, mining useful patterns/information from these enormous volumes of heterogeneous, unstructured data requires enhanced data mining techniques that exploit the power of advanced computational intelligence and high-performance computing paradigms. Moreover, in the context of resolving urgent issues, such as environmental nowcasting, a timely analysis of RS data requires resource-efficient computation models with real-time processing and online analysis capabilities [1,2].

With this background in mind, in this Special Issue, we called for high-quality papers focusing on recent advancements in conventional statistical as well as machine learning techniques and modern AI (artificial intelligence)-driven technologies for efficient mining of remote sensing data. This Special Issue also aimed to provide a common platform for professionals, researchers, and practitioners from heterogeneous communities, including artificial intelligence, machine learning, geographical information systems, and spatial data science, to share their views, innovations, research achievements, and solutions to foster the advancement of intelligent analytics and efficient management of remote sensing data. Papers were invited to cover the following broad topics:


After a rigorous review process, a total of five papers were accepted for publication in this issue. The selected papers either deal with core challenges in remote sensing data analysis, such as missing data handling, noisy label distillation, and feature-level fusion [3–5], or highlight critical real-world problems, including oil spill detection [6] and high wind speed inversion [7].

**Citation:** Das, M.; Ghosh, S.K.; Chowdary, V.M.; Mitra, P.; Rijal, S. Statistical and Machine Learning Models for Remote Sensing Data Mining—Recent Advancements. *Remote Sens.* **2022**, *14*, 1906. https://doi.org/10.3390/rs14081906

Received: 24 March 2022; Accepted: 7 April 2022; Published: 15 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

As mentioned above, missing data is a common problem in the field of remote sensing data analytics. It primarily occurs due to the internal malfunctioning of satellite remote sensing devices/sensors or due to poor atmospheric conditions, such as the presence of thick cloud cover. Remote sensing images with missing information not only reduce the usability of the data but may also negatively affect the performance of the analytical models. The problem is particularly prominent when analyzing aerosol optical depth (AOD) from remotely sensed data. AOD is a key parameter reflecting the characteristics of aerosols and plays a significant role in predicting the concentration of pollutants in the atmosphere. However, as highlighted in the work of Chi et al. [3], the AOD data obtained by satellites are often found to be missing, which poses serious research challenges. Existing methods of AOD recovery primarily focus on the accuracy of AOD restoration while neglecting the AOD recovery ratio. To address this issue, Chi et al. [3] proposed a light gradient boosting-based two-step model, termed TWS, that fills in missing AOD data by combining data from multiple sources while simultaneously learning the spatio-temporal relationships of AOD. Experimental evaluation of TWS on recovering AOD from Terra Satellite's 2018 AOD product demonstrated the reliability of the TWS method in producing competitive and consistent AOD restoration. Overall, the work of Chi et al. [3], as included in our Special Issue, is of great significance for studying the spatial distribution of atmospheric pollutants and handling missing data in this context.

In spite of the huge availability of remotely sensed data in recent years, the data are often found to be annotated with noisy labels. Label noise occurs whenever there is a mismatch between the ground truth label and the observed label. This happens for several reasons, including manual labeling errors, misinterpretation of the data, and so on. Noisy labels can lead to serious network over-fitting and may negatively impact model performance. Therefore, noisy label distillation plays an important role in remote sensing image scene classification and segmentation tasks. Traditional models are typically based on direct fine-tuning and pseudo-labeling approaches, which are not only inefficient but may also adversely influence the model in other ways. To address this problem, in this Special Issue, Zhang et al. [4] have proposed a novel noisy label distillation approach grounded on an end-to-end teacher-student framework, which does not require pre-training on clean or noisy data. Evaluation on benchmark remote sensing image datasets with injected noise has demonstrated the superiority of the proposed approach [4] over state-of-the-art techniques.

Apart from dealing with the core challenges in remote sensing data analytics, another way of improving model performance is to fully exploit the increasingly sophisticated data from multiple sources. For example, optical remote sensing data provide a significantly larger amount of spectral information than images captured using synthetic aperture radar (SAR), whereas SAR technology has greater penetration capability and the advantage of generating images in almost all weather conditions. Remote sensing image fusion is, thus, important for enhancing the applicability of remote-sensing images, and accordingly, it has gained immense research attention in recent years. Remote-sensing image fusion can be performed both at the pixel level and at the feature level. However, in contrast with pixel-level fusion, feature-level fusion considers more diverse factors and thereby helps to obtain more macro-level information. This Special Issue includes an interesting article by Kong et al. [5] on feature-level fusion-based classification of remote sensing images using features extracted from polarized SAR and optical images. The approach is based on a combination of Random Forest (RF) and Conditional Random Fields (CRFs). Specifically, the model exploits the power of CRFs in spatial context feature modeling to improve the RF-based classification. Experimental evaluation shows the efficacy of the proposed fusion-based classification approach.

In addition to discussing the technical challenges faced during remote sensing image data processing, this Special Issue also includes papers on some critical applications of remote sensing data analytics [6,7]. For example, in the fourth article of this Special Issue, Almulihi et al. [6] have presented the application of SAR image analysis to oil spill detection. The proposed approach is based on online extended variational learning of Dirichlet process mixtures of Gamma distributions. The technical novelty here lies in extending the finite Gamma mixture model to handle an infinite number of mixture components. The online learning property of the proposed model makes it more advantageous than batch learning-based models when dealing with massive and streaming data. An empirical study on the real-world application of oil spill detection from SAR images demonstrates the effectiveness of the approach proposed by Almulihi et al. [6].

High wind speed inversion is another critical as well as challenging application of remote sensing data analytics, which has gained significant research interest in recent years. Wind speed is one of the key sea surface parameters that prominently influence diverse oceanic applications. Traditional ways of estimating wind speed using remote sensing imaging technology often fail when the wind speed is high. The study by Zhang et al. [7], as included in this Special Issue, reveals that machine learning techniques can be effectively employed as complements to these conventional RS technology-based models. Experiments on multi-sourced RS data show that the machine learning schemes of Support Vector Regression (SVR), Principal Component Analysis combined with SVR (PCA-SVR), and Convolutional Neural Networks (CNNs) can certainly improve the accuracy of high wind speed inversion over the sea surface, with CNNs being the most promising models in this area.

We hope that readers will benefit greatly from the insightful discussions and presentations of our Special Issue papers, as concisely summarized above, and will also be encouraged to contribute to these rapidly progressing areas.

**Funding:** This research received no external funding.

**Acknowledgments:** We would like to thank all authors who have contributed to this volume by sharing their domain knowledge, research experiences and experimental results.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Handling Missing Data in Large-Scale MODIS AOD Products Using a Two-Step Model**

**Yufeng Chi 1,2 , Zhifeng Wu 1,2, Kuo Liao <sup>3</sup> and Yin Ren 1,2,\***


Received: 23 October 2020; Accepted: 16 November 2020; Published: 18 November 2020

**Abstract:** Aerosol optical depth (AOD) is a key parameter that reflects the characteristics of aerosols, and is of great help in predicting the concentration of pollutants in the atmosphere. At present, remote sensing inversion has become an important method for obtaining the AOD on a large scale. However, AOD data acquired by satellites are often missing, and this has gradually become a popular research topic. In recent years, a large number of AOD recovery algorithms have been proposed, but many of them are not application-oriented. These methods focus mainly on the accuracy of AOD recovery and neglect the AOD recovery ratio. As a result, the AOD recovery accuracy and recovery ratio cannot be balanced. To solve these problems, a two-step model (TWS) that combines multisource AOD data and AOD spatiotemporal relationships is proposed. We used the light gradient boosting (LightGBM) model under the framework of the gradient boosting machine (GBM) to fit the multisource AOD data and fill in the missing AOD between data sources. Spatial interpolation and spatiotemporal interpolation methods, constrained by buffer factors, were then used to recover the missing AOD in a moving window. We used TWS to recover AOD from Terra Satellite's 2018 AOD product (MOD AOD). The results show that the MOD AOD, after a 3 × 3 moving window TWS recovery, was closely related to the AOD of the Aerosol Robotic Network (AERONET) (R = 0.87, RMSE = 0.23). In addition, the MOD AOD missing rate after a 3 × 3 window TWS recovery was greatly reduced (from 0.88 to 0.1). Moreover, the spatial distribution characteristics of the monthly and annual averages of the recovered MOD AOD were consistent with those of the original MOD AOD. The results show that TWS is reliable. This study provides a new method for the restoration of MOD AOD, and is of great significance for studying the spatial distribution of atmospheric pollutants.

**Keywords:** LightGBM; spatiotemporal weight interpolation; AOD recovery; East Asia

#### **1. Introduction**

Atmospheric aerosols are a dispersion system of colloids formed by suspended small solid or liquid particles [1]. With the increase in the number of aerosols emitted by human activities, the scattering and absorption of solar radiation forms a brighter cloud layer and directly reduces the efficiency of precipitation [2]. Moreover, the increased number of aerosols changes the structure of the atmosphere, reduces solar radiation at the surface, increases the heating effect on the atmosphere, reduces precipitation, and inhibits the removal of pollutants [3]. Additionally, the weak water cycle brought about by aerosols directly affects the quality and quantity of fresh water [4]. Therefore, it is crucial to quantitatively measure the aerosol optical depth (AOD). AOD is defined as the vertical integral of the aerosol extinction coefficient in the atmospheric column, and is used to describe aerosol optical properties [5,6].

There are two main methods for obtaining AOD data: ground acquisition and space acquisition. The Aerosol Robotic Network (AERONET) represents the ground observation network, which relies mainly on sun spectrophotometers to conduct fully automatic measurements of the AOD at the instrument deployment sites [7,8]. Compared with space acquisition, the AOD obtained by the ground observation network has higher accuracy. Nevertheless, it is difficult to provide a wide range of viewing angles for ground-measured AOD due to limitations in equipment deployment and observation ranges [9,10]. Therefore, it is more efficient to use remote sensing for AOD measurement and inversion on a large scale. The following are some of the remote sensing inversion products that are commonly used in AOD observations: (1) MODIS sensors on the Terra and Aqua satellites in polar orbit provide daily global AOD products (MOD AOD and MYD AOD) with resolutions of 10 km and 3 km through the Dark Target (DT) [11] and Deep Blue (DB) [12] algorithms [13,14]. (2) MODIS sensors combined with the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm [15] provide AOD products on a fixed 1-km grid. MAIAC AOD uses time series to detect multiangle surface features and recover the Bidirectional Reflectance Distribution Function (BRDF). Compared with the DT and DB algorithms, it can better identify AOD information in cloud and snow areas [16,17]. (3) The Advanced Himawari Imager (AHI) sensor on the Japanese Himawari-8 geostationary satellite provides AOD products with a spatial resolution of 5 km at a spectral wavelength of 500 nm and continuously monitors East Asia at a maximum interval of 10 min [18,19].

At present, many studies use AOD as an important indicator or parameter for the mapping of air pollutants (e.g., PM2.5, PM10) [20–22]. Complete and high-precision AOD distribution data will greatly improve the quality of the mapping of air pollutants. However, uncertainties in cloud detection, limitations of the AOD inversion algorithm, and sensor degradation are the three main factors that cause a partial loss of the AOD local data retrieved by satellites [23–25]. For example, the shortcomings of the DT algorithm and DB algorithm for AOD detection in bright areas, the errors of cloud detection in some heavily polluted areas and the degradation of other sensors directly affect the detection of dark pixels in low angle areas, which leads to the loss of AOD data in some areas [26,27]. A study of the Yangtze River Delta in China found that the missing rate of MOD AOD reached 89.6% between 2014 and 2017 [28]. Because the results of AOD are affected by meteorological conditions, human activities and vegetation coverage, it is difficult to ensure the accuracy of the AOD restoration [29].

A large body of research has focused on how to recover missing information from AOD data. One approach is to innovate the inversion algorithm to reduce the amount of missing AOD. For example, some researchers use low cloud detection standards or the Dense Dark Vegetation (DDV) algorithm to improve the AOD inversion accuracy over bright surfaces [30,31]. However, such methods still cannot overcome the missing AOD data caused by cloud shading [32]. Statistical regression models such as linear regression [33,34], spatial interpolation, and spatiotemporal interpolation [35,36] are also used to fill in missing AOD; however, it is difficult for statistical regression models to analyze the internal relationships underlying the global heterogeneity of the AOD data, which results in poor recovery results. Alternatively, AOD information can be filled in by using machine learning methods such as random forest (RF) [20] or the gradient boosting machine (GBM) [24] to process multisource data. The strong data mining ability of machine learning methods is well suited to fitting multisource data, and higher precision can be achieved at the same time [9,37].

In this paper, a two-step model (TWS) is proposed to recover the missing MOD AOD caused by the presence of clouds while ensuring recovery accuracy. Specifically, the first step of TWS uses a machine learning method to integrate multisource AOD data. The second step uses the spatiotemporal interpolation and spatial interpolation methods of moving windows to further fill in the missing MOD AOD. In addition, the second step of TWS uses a local buffer to reduce the heterogeneity of the AOD caused by time differences. Section 2 of this paper describes the research area and dataset, Section 3 presents the methodology of the TWS, Section 4 shows the results of the model, Section 5 discusses the model and its application, and Section 6 presents the conclusions.

#### **2. Materials**

#### *2.1. Study Areas*

Part of the East Asia region (18–50° N, 96–150° E) was selected as the study area (Figure 1). The research area mainly includes regions of China, Mongolia, Japan, the Korean Peninsula, and the Northeast Pacific. The study area covers regions that together contain more than 75% of the population of East Asia (central and eastern China, the Korean Peninsula, Japan, Mongolia, northern Vietnam) and major urban agglomerations (Yangtze River Delta, Pearl River Delta, Seoul city cluster, Tokyo city cluster) [38]. The spatial and temporal distribution characteristics of AOD data are complicated by the increasing number of human activities [39]. Additionally, a large-scale research area reduces the probability of all AOD data being missing on a given day and provides enough data for research. Moreover, a larger study area has more complex land types and other factors, which can better test the universality of the model.

**Figure 1.** Distribution of the AERONET sites considered in this paper.

#### *2.2. Datasets*

We collected data from 86 ground AERONET stations in the study area from 31 December 2017 to 1 January 2019 (Figure 1), together with the satellite AOD datasets. The satellite data included the Terra and Aqua satellite AOD products (MOD AOD/MYD AOD), MAIAC AOD, and AHI AOD products. In addition, we included several types of auxiliary data.

#### 2.2.1. AOD Products

We selected the following three AOD products: 1. The "MOD AOD" data were selected from MODIS Terra, and the "MYD AOD" data were from the Aqua Aerosol Collection 6.1; both were downloaded through Earthdata (https://earthdata.nasa.gov). A total of 16,233 MOD AOD and MYD AOD images were selected, with a time resolution of one day and a spatial resolution of 3 km [40]. 2. More than 19,508 MAIAC AOD images were downloaded from Earthdata. We selected the MAIAC AOD data at the spectral wavelength of 550 nm and then removed invalid AOD pixels based on the quality assurance guidance in the user manual (AOD is retained when QA.CloudMask = Clear and QA.AdjacencyMask = Clear). 3. We selected the Advanced Himawari-8 AOD (AHI AOD), which is provided by the Japan Meteorological Agency (JMA). AHI AOD data are divided into two levels: L2 and L3. The L3 product selected in this research underwent strict cloud screening; therefore, it has higher accuracy and reliability than L2 [41]. The L3 daily products (averaged from the L3 hourly products) have a spatial resolution of 5 km and comprise a total of 367 images. AHI AOD data were obtained from the FTP server provided by JMA (ftp.ptree.jaxa.jp).

#### 2.2.2. AERONET Data

AERONET (aeronet.gsfc.nasa.gov) has a time resolution of 15 min. AERONET AOD contains three quality levels: Level 1.0 (unscreened), Level 1.5 (cloud-screened and quality controlled), and Level 2.0 (quality-assured). Compared with Level 1.0, the uncertainty of Level 1.5 and Level 2.0 in version 3 is low [8]. In this paper, the Level 1.5 and Level 2.0 data of version 3 from the 86 AERONET sites in the research area are used as ground truth values for comparison.

#### 2.2.3. Auxiliary Data

The auxiliary data were mainly divided into meteorological, terrain, land, and other types. The meteorological data were extracted from the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2) dataset (https://earthdata.nasa.gov) [42]. The meteorological data included the temperature (TLML), wind speed (WS), surface roughness (ZM), surface specific humidity (QSH), and planetary boundary layer height (PBLH). The spatial resolution of the meteorological data was 0.625° × 0.5°, and the average value of the 9:00–12:00 local time (satellite transit time) data was calculated as the meteorological data of the day. The terrain data were extracted from Shuttle Radar Topography Mission (SRTM) data (https://earthdata.nasa.gov) with a spatial resolution of 90 m. The terrain data included the digital elevation model (DEM), slope, and aspect. The land data included population data, road density, and Normalized Difference Vegetation Index (NDVI) composites. The population data were obtained from LandScan (landscan.ornl.gov), which integrates multisource data and is released once per year; their spatial resolution is approximately 1 km [43]. The road data provided by OpenStreetMap (www.openstreetmap.org) are mainly composed of data shared by users and are openly licensed. The road data were in ESRI vector format (RL). The NDVI data were the MOD13A2 16-day, 1-km spatial resolution (Collection 6) product (https://earthdata.nasa.gov) [44]. Other types included the day of the year (DOY).

#### **3. Methods**

Differences in multisource AOD data, arising from aerosol diffusion, differences among AOD inversion algorithms, and differences in remote sensing image detection times, are mainly reflected in the different data sources, data detection times, and data detection positions [10,45,46]. The life cycle of aerosols in the troposphere varies from a few days to a few weeks [4,47]. Over a short time, there is a correlation between different AOD data sources; in addition, there is a correlation between different AOD data detection times. According to the 2018 statistics of the AOD data in the study area, the MOD AOD at the same location on the same day is directly related to the MYD AOD, MAIAC AOD, and AHI AOD data. The MOD AOD at the same position also correlates with that of the adjacent time, and the specific data are shown in Table 1. The spatial correlation refers to the correlation coefficient (R) of the effective AOD values of two adjacent pixels. The time correlation refers to the R of the effective value of the target AOD pixel and the AOD pixel of the adjacent day.


**Table 1.** MOD AOD correlation (spatial correlation, temporal correlation, and correlation of different AOD data sources).

This paper proposes a two-step model (TWS) that combines the rich data volume of multisource data and the inherent spatiotemporal distribution relationships of aerosols to recover missing MOD AOD. First, we preprocess the multisource data and then use the TWS method to recover the MOD AOD. 1. For the multisource AOD data obtained at the same spatial location on the same day, some sources have pixel values and some are missing. The existing data help to recover some of the missing MOD AOD values from the other data sources, which is possible due to the complementarity of the multisource AOD data. The multisource AOD data are fitted using a machine learning method, and then the overlapping parts of the multisource AOD predictions are combined by a weighted average to fill in some missing MOD AOD pixels. 2. In the moving window, the missing MOD AOD can be recovered through spatial or spatiotemporal relationships. First, we create a moving window. The corresponding calculation scenario is determined by the number and distribution of the AOD values in the moving window and is then combined with the buffer factor to perform the calculation. Finally, the recovered MOD AOD pixels are obtained through the priority settings of the overlapped pixels (priority stack). The steps of the specific method are shown in Figure 2.

**Figure 2.** Flowchart of the proposed TWS model. The recovered AOD represents the final result.

Note: *n* represents the number of observations.

#### *3.1. Data Preprocessing*

First, we create a 3-km spatial resolution grid in the UTM coordinate system. We rebuild the multisource data according to the grid position (including the AOD datasets and auxiliary data). MAIAC AOD, AHI AOD, meteorological data, terrain data, and land data must be reconstructed because their spatial resolution is not 3 km. Specifically, the averages of MAIAC AOD, terrain data, and NDVI are calculated within the 3-km grid. The population data are summed within the 3-km grid (POP), and the total length of the roads within each grid cell is calculated from the RL data and assigned to the road length grid (RLG). All of the regridded data must be resampled to correct for pixel position deviations.
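To illustrate the aggregation step, the following is a minimal Python sketch (not the authors' code) that block-aggregates a finer raster into 3-km cells by averaging or summing; reprojection to the UTM grid and the final resampling for position deviations are omitted, and the `aggregate_to_grid` helper name is our own.

```python
import numpy as np

def aggregate_to_grid(values, factor=3, how="mean"):
    """Aggregate a fine-resolution raster into coarser cells, e.g. ~1-km
    MAIAC AOD / NDVI / terrain into the 3-km grid (mean), or population
    counts into 3-km cells (sum). NaN marks missing pixels."""
    rows = (values.shape[0] // factor) * factor
    cols = (values.shape[1] // factor) * factor
    blocks = values[:rows, :cols].reshape(rows // factor, factor,
                                          cols // factor, factor)
    if how == "mean":
        return np.nanmean(blocks, axis=(1, 3))
    if how == "sum":
        return np.nansum(blocks, axis=(1, 3))
    raise ValueError("how must be 'mean' or 'sum'")

# Example: a 9 x 9 fine grid becomes a 3 x 3 coarse grid.
fine = np.random.rand(9, 9)
coarse_mean = aggregate_to_grid(fine, factor=3, how="mean")  # e.g. MAIAC AOD
coarse_sum = aggregate_to_grid(fine, factor=3, how="sum")    # e.g. population (POP)
```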

#### *3.2. First Step of TWS*

GBM uses a gradient descent algorithm to adjust the regression trees of the weak learners' additive model, thereby reducing the loss of the objective function. LightGBM was developed by Microsoft and uses the GBM framework. LightGBM adds Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). Compared with GBM, LightGBM accelerates the calculation while maintaining accuracy, and has a higher calculation speed for large sample data [48,49]. In this study, MOD AOD was used as the dependent variable; MYD AOD, MAIAC AOD, AHI AOD, and the other auxiliary data were used as independent variables. Three LightGBM models, i.e., MOD-MYD, MOD-MAIAC, and MOD-AHI, were established. Then, the accuracy of the prediction models was verified by a 10-fold cross-validation method. The data for constructing each LightGBM model were randomly divided into ten groups. Cyclic verification was performed ten times, with one group used for prediction verification and the remaining nine used as training samples. The coefficient of determination (R<sup>2</sup>) was used as an index for model verification. Next, we used the trained models to predict the missing MOD AOD values where MYD AOD, MAIAC AOD, or AHI AOD were not missing. After calculating the three LightGBM models, weighted average processing was performed on the overlapping pixels according to the LightGBM training result R<sup>2</sup>.
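A minimal sketch of this step is given below, assuming LightGBM and scikit-learn are available; the synthetic arrays stand in for the collocated MOD/MYD/MAIAC/AHI samples, and the hyperparameters are illustrative rather than those used in the paper.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for collocated samples: each feature matrix contains one
# auxiliary AOD source (MYD, MAIAC or AHI) plus auxiliary variables (meteorology,
# DEM, NDVI, ...); y is the MOD AOD (dependent variable).
rng = np.random.default_rng(0)
X_myd, X_maiac, X_ahi = (rng.random((1000, 12)) for _ in range(3))
y = rng.random(1000)

models, r2_scores = {}, {}
for name, X in {"MOD-MYD": X_myd, "MOD-MAIAC": X_maiac, "MOD-AHI": X_ahi}.items():
    model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    r2_scores[name] = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    models[name] = model.fit(X, y)  # refit on all samples for prediction

# For illustration we predict on the same feature matrices; in practice the
# prediction targets are pixels where MOD AOD is missing but the source exists.
preds = np.vstack([models[n].predict(X) for n, X in
                   zip(models, (X_myd, X_maiac, X_ahi))])
weights = np.array([r2_scores[n] for n in models])[:, None]
fused = (weights * preds).sum(axis=0) / weights.sum()  # R^2-weighted average
```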

#### *3.3. Second Step of TWS*

AOD data have a strong spatial correlation (the R of adjacent MOD AOD pixels is 0.9) and also a certain correlation in time (the R of MOD AOD at adjacent times is 0.5). Therefore, when restoring MOD AOD information, we consider both the spatial and the spatiotemporal relationships of AOD. Moreover, a small moving window can reduce the uncertainty caused by AOD spatial heterogeneity.

#### 3.3.1. Design of Moving Window Size and Selection of Interpolation Mode

Moving windows of different sizes will affect the number of valid MOD AOD pixels. However, a large moving window will cause serious spatial heterogeneity of MOD AOD and will also affect the computing performance of the MOD AOD recovery. In this study, we set the size of the moving window to 3 × 3 pixels, 7 × 7 pixels, and a self-adaption window (side length from 3 to 7 pixels) [34]. The self-adaption window is determined by the ratio of the number of valid AOD pixels to the total number of pixels. The formula is as follows:

$$S_w = \mathrm{Max}\left(\frac{PV_x}{PA}\right), \quad x \in (3, 4, 5, 6, 7) \tag{1}$$

where *Sw* represents the size of the self-adaption window; *PV<sup>x</sup>* is the number of valid AOD pixels in the window; and *PA* is the total number of pixels in the window.
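The self-adaption window of Equation (1) can be sketched as follows (an illustrative assumption, not the authors' implementation); for simplicity, only the odd side lengths are scanned here, whereas the paper considers side lengths from 3 to 7.

```python
import numpy as np

def self_adaption_window(aod, row, col, sizes=(3, 5, 7)):
    """Pick the window side length that maximises the ratio of valid
    (non-NaN) AOD pixels to total pixels, following Equation (1)."""
    best_size, best_ratio = sizes[0], -1.0
    for s in sizes:
        half = s // 2
        win = aod[max(row - half, 0): row + half + 1,
                  max(col - half, 0): col + half + 1]
        ratio = np.count_nonzero(~np.isnan(win)) / win.size  # PV_x / PA
        if ratio > best_ratio:
            best_ratio, best_size = ratio, s
    return best_size

# Example: a toy AOD tile with missing pixels marked as NaN.
aod = np.random.rand(20, 20)
aod[np.random.rand(20, 20) < 0.6] = np.nan
print(self_adaption_window(aod, row=10, col=10))
```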

Spatial interpolation and spatiotemporal interpolation methods are well suited to recovering AOD data, which exhibit strong correlations in local space and time. Regardless of whether spatial interpolation or spatiotemporal interpolation is used, the recovery results are greatly affected by the distribution and the number of valid AOD data points and the spatiotemporal

distribution of the AOD data. Therefore, this study designed the following scenarios (taking a 3 × 3 window as an example), as shown in Figure 3: (1) Inverse Distance Weight interpolation (IDW) [50], a spatial interpolation method, was applied when the central MOD AOD pixel was missing in the moving window. (2) Region constraints kriging (RC kriging), which involves adding a constraint factor to the ordinary kriging method, was applied when five or fewer MOD AOD pixels were missing in the moving window. (3) Spatiotemporal weight interpolation was used when the number of missing cells of Day 2 MOD AOD was greater than or equal to 5 and the number of valid AOD cells of Day 1 or Day 3 MOD AOD was greater than or equal to 5. (4) When there were too few MOD AOD pixels in the moving window for three consecutive days (Day 2 had no MOD AOD pixels and the numbers of valid MOD AOD cells for Days 1 and 3 were fewer than 5), this part of the calculation was ignored. A change in the window size changed the above thresholds (the ratio of the number of AOD pixels to the total number of moving window pixels); for example, when the window was 7 × 7, the five pixels in condition two increased to 27.

**Figure 3.** Three scenarios of the second step TWS. Here, n represents the number of missing AOD pixels in the moving window, and Days 1, 2, and 3 represent three consecutive days (where Days 1 and 3 are disordered). 1—Spatial represents spatial interpolation, including IDW and RC kriging. 2—Spatiotemporal-weight represents spatiotemporal weighted interpolation and lists two examples. 3—Pass indicates that this scenario ignores and does not calculate the AOD in the moving window.
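As an illustration, the scenario choice for a single 3 × 3 window might be coded as below. This reflects our reading of the rules above (in particular, the precedence between scenarios (1) and (2) is an assumption), with NaN marking missing MOD AOD pixels.

```python
import numpy as np

def choose_scenario(day1, day2, day3):
    """Select the interpolation scenario for a 3 x 3 moving window,
    following the rules illustrated in Figure 3 (a sketch of our reading
    of the paper, not the authors' code)."""
    missing2 = int(np.isnan(day2).sum())
    valid1 = int(np.count_nonzero(~np.isnan(day1)))
    valid3 = int(np.count_nonzero(~np.isnan(day3)))

    if np.isnan(day2[1, 1]) and missing2 == 1:
        return "IDW"            # only the central pixel is missing (our reading)
    if missing2 <= 5:
        return "RC kriging"     # five or fewer Day-2 pixels are missing
    if missing2 >= 5 and (valid1 >= 5 or valid3 >= 5):
        return "STW"            # lean on the adjacent days
    return "pass"               # too little information, skip this window

day2 = np.full((3, 3), np.nan); day2[0, 0] = 0.3
print(choose_scenario(np.random.rand(3, 3), day2, np.full((3, 3), np.nan)))
```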

#### 3.3.2. Buffer Factor

Because the moving window introduces only a small quantity of MOD AOD data, the predicted values of the spatial interpolation and spatiotemporal interpolation of MOD AOD can deviate greatly. Therefore, a buffer factor was introduced to correct the deviation. Global Moran's I (MoranI) [51] is a statistic for spatial autocorrelation; the larger the MoranI of the AOD, the higher the similarity of the AOD data, which provides more information for the recovery of AOD gaps. This statistic is applied to calculate the spatial autocorrelation of MOD AOD in a region; the larger the value of MoranI, the higher the correlation of the MOD AOD data in the region. This study calculated MoranI in different areas and determined the maximum MoranI in a local area; the corresponding local area is called the scope window (Figure 4). The mathematical expectation of the MOD AOD of the scope window serves as the buffer factor for the spatial interpolation of the MOD AOD. Because the numeric values of the MOD AOD pixels in the scope window are uncertain and do not follow a Gaussian distribution, the Spearman correlation coefficient was introduced as the time buffer factor of the MOD AOD. The mathematical expectation of the Spearman correlation coefficient for three consecutive days and the MOD AOD of

the scope window were used as buffer factors for the spatiotemporal interpolation of the MOD AOD. The formula is as follows:

$$\text{MoranI} = \frac{n\sum_{i=1}^{n}\sum_{j=1}^{n}G_{ij}\left(p_i-\overline{p}\right)\left(p_j-\overline{p}\right)}{\sum_{i=1}^{n}\sum_{j=1}^{n}G_{ij}\sum_{i=1}^{n}\left(p_i-\overline{p}\right)^2}$$

$$G_{ij} = \frac{1}{dis(i,j)} = \frac{1}{\sqrt{\left(i_x-j_x\right)^2+\left(i_y-j_y\right)^2}}$$

$$w \leftarrow \textit{Scope Window} \leftrightarrow \mathrm{Max}\left(\text{MoranI}_{w-1},\ \text{MoranI}_{w},\ \text{MoranI}_{w+1}\right)$$

$$E_w = \left(\sum_{i=1}^{w\times w} S_i\right)\bigg/\, w^2$$

$$P_{(S_{tk},E_{t2})} = \frac{\sum_{j=1}^{n}\left(S_{tk}-\overline{\tau_{tk}}\right)\left(S_{t2}-\overline{\tau_{t2}}\right)}{\sqrt{\sum_{j=1}^{n}\left(S_{tk}-\overline{\tau_{tk}}\right)^2\sum_{j=1}^{n}\left(S_{t2}-\overline{\tau_{t2}}\right)^2}}, \quad k\in(1,3) \tag{2}$$

where MoranI represents the Global Moran's I; *n* represents the number of valid AOD pixels; *p<sup>i</sup>* and *p<sup>j</sup>* represent the AOD values of the two pixels *i* and *j*; *p* represents the mean value of the AOD pixels; *dis*(*i*, *j*) represents the spatial distance between the two pixels *i* and *j*; *Gi*,*<sup>j</sup>* represents the inverse distance weight; *Scope Window* represents the square window that corresponds to the maximum local MoranI; *w* represents the number of pixels on one side of the square *Scope Window*; ↔ represents the iterative search for the *Scope Window*; ← represents obtaining *w*; *S<sup>i</sup>* represents the AOD value in the *Scope Window*; *Stk* represents the AOD value in the *Scope Window* on day *tk*; *E<sup>w</sup>* represents the mathematical expectation of the AOD in the *Scope Window* (buffer factor); and *P*(*Stk*,*Et*2) represents the Spearman correlation coefficient between day *tk* and day *t*2.

**Figure 4.** Buffer factor calculation flowchart.
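The following sketch computes the Global Moran's I of Equation (2) with inverse-distance weights and derives the buffer factor as the mean of the candidate window with the largest MoranI; the way candidate scope windows are enumerated here (top-left anchored slices) is a simplification of the iterative search described above.

```python
import numpy as np

def global_morans_i(aod):
    """Global Moran's I of an AOD tile with inverse-distance weights G_ij
    (Equation (2)); NaN pixels are ignored."""
    rows, cols = np.where(~np.isnan(aod))
    p = aod[rows, cols]
    n = p.size
    if n < 2:
        return np.nan
    dev = p - p.mean()
    dx = rows[:, None] - rows[None, :]
    dy = cols[:, None] - cols[None, :]
    dist = np.sqrt(dx ** 2 + dy ** 2)
    with np.errstate(divide="ignore"):
        g = np.where(dist > 0, 1.0 / dist, 0.0)   # G_ii = 0
    num = n * (g * np.outer(dev, dev)).sum()
    den = g.sum() * (dev ** 2).sum()
    return num / den

# Buffer factor E_w: mean AOD of the candidate scope window with maximum MoranI.
tile = np.random.rand(7, 7)
windows = {w: tile[:w, :w] for w in (3, 5, 7)}    # illustrative candidates only
best = max(windows, key=lambda w: global_morans_i(windows[w]))
buffer_factor = np.nanmean(windows[best])          # E_w
```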

#### 3.3.3. Spatial Interpolation Method (IDW and RC Kriging)

Compared with other complicated physical models of AOD recovery, the spatial interpolation of AOD can quantify the spatial information of the AOD at known spatial positions, which can easily and effectively predict the missing AOD data over a small range. Moreover, AOD spatial interpolation methods do not require an excessive number of parameters. Among them, IDW and kriging are spatial interpolation methods commonly used to predict the missing AOD. Additionally, based on the best linear unbiased prediction of ordinary kriging interpolation [52], we introduced the buffer factor into the spatial interpolation when predicting the MOD AOD in a moving window, and established RC kriging. The buffer factor helps the RC kriging method to better adapt to the stability of the MOD AOD in the moving window [53]. The formula is as follows:

$$Z_1 = \left[\sum_{i=1}^{N}\sum_{j=1}^{N} G_{i,j}\left(S_{i,j}-E_w\right)\right]+E_w$$

$$Z_2 = \left[\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_{i,j}\left(S_{i,j}-E_w\right)\right]+E_w, \quad \text{subject to}\;\begin{cases}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_{i,j}\,\mathrm{Cov}\{s_{i,j}\}-\mu=\mathrm{Cov}\{s_{j,i}\}\\ \sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_{i,j}=1\end{cases} \tag{3}$$

where *Z*<sup>1</sup> and *Z*<sup>2</sup> represent the AOD estimates produced by IDW interpolation and RC kriging interpolation, respectively; *Gi*,*<sup>j</sup>* represents the inverse distance weight; *si*,*<sup>j</sup>* represents the MOD AOD value at points *i* and *j*; µ represents the Lagrange multiplier; λ*i*,*<sup>j</sup>* represents the kriging weight; Cov{*si*,*j*} and Cov{*sj*,*i*} represent the covariances of *si*,*<sup>j</sup>* and *sj*,*<sup>i</sup>*; and *E<sup>w</sup>* represents the mathematical expectation in the *Scope Window* (buffer factor).
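A hedged sketch of the buffered IDW estimate (the Z1 term of Equation (3)) is given below; the normalization of the weights is our assumption, and the RC kriging estimate Z2, which requires solving the constrained kriging system, is omitted.

```python
import numpy as np

def idw_with_buffer(window, target=(1, 1), buffer_factor=None, power=1.0):
    """Estimate the (assumed missing) target pixel of a moving window with
    inverse-distance weighting applied to (S_ij - E_w), then add the
    buffer factor E_w back, mirroring the Z1 term of Equation (3)."""
    rows, cols = np.where(~np.isnan(window))
    values = window[rows, cols]
    if values.size == 0:
        return np.nan
    e_w = np.nanmean(window) if buffer_factor is None else buffer_factor
    dist = np.sqrt((rows - target[0]) ** 2 + (cols - target[1]) ** 2)
    dist[dist == 0] = np.finfo(float).eps   # guard if a valid pixel sits on target
    weights = 1.0 / dist ** power
    weights /= weights.sum()
    return float(np.sum(weights * (values - e_w)) + e_w)

win = np.array([[0.21, 0.25, np.nan],
                [0.30, np.nan, 0.28],
                [np.nan, 0.26, 0.24]])
print(idw_with_buffer(win, buffer_factor=0.26))
```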

#### 3.3.4. Spatiotemporal Weight Interpolation (STW)

Spatiotemporal interpolation can effectively consider both the spatial and temporal relationships of MOD AOD and overcome the shortcomings of MOD AOD spatial interpolation [54]. We quantify the time distance of one day of MOD AOD as 1 and combine it with the spatial distance between the MOD AOD pixels to determine the spatiotemporal distance. The spatiotemporal distance and the buffer factor are used to determine the spatiotemporal weights of MOD AOD spatiotemporal interpolation. We combine the spatiotemporal interpolation and the spatiotemporal weights to form spatiotemporal weight interpolation (STW). In this study, the time span of STW used for MOD AOD was set to three days (the predicted day, as well as the days before and after it), to avoid the excessive AOD data noise caused by a time span that is too long. The specific formula is as follows:

$$Z_{ST_0} = \sum_{tk=1}^{3}\sum_{j=1}^{N_t}\left[\sum_{i=1}^{N_t}\lambda_{tk_{i,j}}\left(S_{t_{i,j}}-E_{tw}\right)\right]+E_{tw}$$

$$\lambda_{tk}=\lambda_{(tk,tk)}=\sum_{j=1}^{N}\sqrt{1-\frac{P_{(S_{t1},E_{tk})}+P_{(S_{tk},E_{t3})}}{2}}\;\frac{1/dis(tk_i,tk_j)}{\sum_{i=1}^{N}\left(1/dis(tk_i,tk_j)\right)}, \quad k=2$$

$$\lambda_{tk}=\lambda_{(tk,t2)}=\sum_{j=1}^{N}\sqrt{\left(\frac{P_{(S_{tk},E_{t2})}}{2}\right)^{2}+\left(\frac{1}{dis(tk_i,tk_j)}\right)^{2}}\Bigg/\sum_{i=1}^{N}\left(1/dis(tk_i,tk_j)\right)^{2}, \quad k\in(1,3)$$

$$dis(i,j)=\sqrt{\left(i_x-j_x\right)^{2}+\left(i_y-j_y\right)^{2}} \tag{4}$$

where *ZST*<sup>0</sup> represents the estimation of STW; *t* represents the day, where *t*1 is the previous day, *t*2 is the day to be calculated, and *t*3 is the next day; *S<sup>t</sup>* represents the value of the valid AOD; *Etw* is the mathematical expectation in the *Scope Window* within *t* days (buffer factor); *P*(τ*tk*,τ*t*2) represents the R between *t*2 and *tk*; λ*tk* represents the time weight of day *k* (*k* ∈ (1, 2, 3)); *N* is the number of pixels in the moving window; and *dis*(*tk<sup>i</sup>*, *tkj*) represents the spatial distance between *tk<sup>i</sup>* and *tk<sup>j</sup>*.
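The sketch below illustrates the idea behind STW under simplifying assumptions: valid pixels from three consecutive days contribute with weights built from the spatiotemporal distance (1 day = 1 distance unit) and the Day-k/Day-2 Spearman correlation, while the scope-window mean acts as the buffer factor. The exact weight expression differs from Equation (4) and is only meant to convey the mechanism.

```python
import numpy as np
from scipy.stats import spearmanr

def stw_estimate(day1, day2, day3, target=(1, 1)):
    """Simplified spatiotemporal weight interpolation for the centre pixel
    of Day 2 (our reading of STW, not the authors' code)."""
    stack = np.stack([day1, day2, day3])            # shape (3, rows, cols)
    e_w = np.nanmean(stack)                          # buffer factor
    corr = {1: 1.0}                                  # Day 2 with itself
    for k, day in ((0, day1), (2, day3)):
        both = ~np.isnan(day) & ~np.isnan(day2)
        rho = spearmanr(day[both], day2[both]).correlation if both.sum() > 2 else 0.0
        corr[k] = 0.0 if np.isnan(rho) else rho

    num = den = 0.0
    for k in range(3):
        rows, cols = np.where(~np.isnan(stack[k]))
        for r, c, v in zip(rows, cols, stack[k][rows, cols]):
            d_space = np.hypot(r - target[0], c - target[1])
            d_st = np.hypot(d_space, abs(k - 1))     # spatiotemporal distance
            if d_st == 0:
                continue
            w = max(corr[k], 0.0) / d_st             # correlation-scaled inverse distance
            num += w * (v - e_w)
            den += w
    return e_w if den == 0 else num / den + e_w

rng = np.random.default_rng(1)
days = [rng.random((3, 3)) for _ in range(3)]
days[1][1, 1] = np.nan                               # missing centre pixel on Day 2
print(stw_estimate(*days))
```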

#### 3.3.5. Priority Setting of Overlapping Pixels

Because the spatial interpolation of MOD AOD and STW belong to the second step in TWS, TWS will have overlapping results of MOD AOD recovery with the movement of the window. Therefore, TWS should set the priority of the MOD AOD recovery results. The priority of the MOD AOD recovery result was set to IDW > RC Kriging > STW. If the MOD AOD recovery resulted in

overlap, then the missing values of the MOD AOD were filled according to this priority. Furthermore, if the recovery results of the same method overlapped, the average of the overlapping MOD AOD recovery results was calculated as the final MOD AOD result. For example, if RC kriging and STW were used in two calculations before and after the predicted time during the movement of the TWS window, the overlapping area of the MOD AOD recovery results used the RC kriging result. If RC kriging was used for both calculations during the window movement of the TWS, the average value in the overlapping area of the MOD AOD recovery results was taken as the final MOD AOD recovery result.
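A small sketch of the priority stacking could look as follows; the per-method recovery layers, their shapes, and the example values are hypothetical.

```python
import numpy as np

PRIORITY = ["IDW", "RC_kriging", "STW"]   # high to low priority

def merge_by_priority(layers):
    """Combine overlapping recovery results: within one method, overlapping
    estimates are averaged; across methods, the higher-priority method wins."""
    merged = np.full(next(iter(layers.values()))[0].shape, np.nan)
    for method in PRIORITY:
        stack = np.stack(layers[method])
        counts = np.count_nonzero(~np.isnan(stack), axis=0)
        sums = np.nansum(stack, axis=0)
        same_method = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
        # Fill only pixels not already set by a higher-priority method.
        take = np.isnan(merged) & ~np.isnan(same_method)
        merged[take] = same_method[take]
    return merged

shape = (4, 4)
layers = {m: [np.full(shape, np.nan), np.full(shape, np.nan)] for m in PRIORITY}
layers["STW"][0][0, 0], layers["STW"][1][0, 0] = 0.30, 0.34   # same-method overlap
layers["IDW"][0][0, 0] = 0.20                                 # higher priority wins
print(merge_by_priority(layers)[0, 0])   # -> 0.2
```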

#### 3.3.6. Validation Methodology

A comparison between the MOD AOD recovery results and AERONET data can be used as the basis for assessing the MOD AOD recovery accuracy [55]. The time resolutions of MOD AOD and AERONET are different. This research used the transit time of the Terra satellite (30 min before and after) and compared the average value of the AERONET data within this window with the MOD AOD value of the pixel containing the AERONET site [37]. In addition, AERONET collects AOD data at multiple wavelengths, many of which are slightly different from the MOD AOD wavelength (550 nm). Therefore, the AERONET AOD at 550 nm was interpolated using the Ångström exponent [7]. In addition, both the 551 nm and 560 nm AOD data in the AERONET dataset were used to evaluate the MOD AOD. The specific calculation formula is as follows:

$$\tau_{\omega} = \beta\,\omega^{-\delta}$$

$$\delta = -\frac{\ln(\tau_1/\tau_2)}{\ln(\omega_1/\omega_2)} \tag{5}$$

$$\beta = \tau_1\left(\omega_1\right)^{\delta} = \tau_2\left(\omega_2\right)^{\delta}$$

where τω, τ1, and τ<sup>2</sup> represent the AOD at wavelengths ω, ω1, and ω2, respectively; δ represents the Ångström exponent; and β represents the turbidity coefficient.

The accuracy evaluation indexes include R and RMSE, where RMSE is as shown in Equation (6).

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\tau(\text{MOD AOD})_i-\tau(\text{AERONET})_i\right)^{2}} \tag{6}$$

where τ(*MOD AOD*) and τ(*AERONET*) represent the *AOD* from *MOD AOD* and *AERONET*, respectively.
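The wavelength interpolation of Equation (5) and the accuracy metrics of Equation (6) can be sketched as follows; the numeric values in the example are arbitrary and used only for illustration.

```python
import numpy as np

def angstrom_to_550(tau1, w1, tau2, w2, target=550.0):
    """Interpolate AERONET AOD to 550 nm with the Angstrom power law
    (Equation (5)): tau(w) = beta * w**(-delta)."""
    delta = -np.log(tau1 / tau2) / np.log(w1 / w2)
    beta = tau1 * w1 ** delta
    return beta * target ** (-delta)

def r_and_rmse(recovered, reference):
    """R and RMSE between matched recovered MOD AOD and AERONET AOD (Equation (6))."""
    r = np.corrcoef(recovered, reference)[0, 1]
    rmse = np.sqrt(np.mean((recovered - reference) ** 2))
    return r, rmse

# Example: AERONET AOD at 500 nm and 675 nm interpolated to 550 nm.
aeronet_550 = angstrom_to_550(tau1=0.42, w1=500.0, tau2=0.30, w2=675.0)
r, rmse = r_and_rmse(np.array([0.40, 0.22, 0.35]), np.array([0.41, 0.20, 0.33]))
```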

#### **4. Results**

#### *4.1. LightGBM Training and Processing Results*

We constructed and trained the three LightGBM models separately with 10-fold cross-validation; the sample sizes, R<sup>2</sup> values, and independent input variables are listed in Table 2. Then, each of the three LightGBM models was used to predict the missing MOD AOD, and we superimposed the prediction results (where overlapping pixels are weighted according to R<sup>2</sup>); the results for 1 January 2018 are shown in Figure 5.



**Figure 5.** MOD AOD recovered from multisource AOD data and auxiliary data after fitting by LightGBM (1 January 2018). Here, (**a**) shows the original MOD AOD data (90% missing AOD); (**b**) shows the MOD AOD (56% missing AOD) after AHI AOD and the auxiliary data were used for LightGBM recovery; (**c**) shows the MOD AOD after combining MYD AOD and the auxiliary data for LightGBM recovery (84% missing AOD); (**d**) shows the MOD AOD (66% missing AOD) after combining MAIAC AOD and the auxiliary data for LightGBM recovery; (**e**) shows the result of calculating the weighted average of the overlapping parts of (**b**), (**c**), and (**d**) (47% missing AOD). The legend shows the value range of the MOD AOD.

Table 2 shows that all of the auxiliary variables were involved in the training of the three groups of LightGBM models, and the R<sup>2</sup> of the 10-fold cross-validation fit exceeded 0.95. Additionally, on 1 January 2018, the MOD AOD gap was filled by MYD AOD, MAIAC AOD, and AHI AOD. Among them, AHI AOD contributed the largest quantity of AOD data: the AOD missing rate after prediction with AHI AOD decreased from 90% to 56%. After calculating the weighted average of the overlapping parts, the AOD missing rate dropped to 47%.

#### *4.2. Comparison between MOD AOD Recovered by Different Methods and AERONET*

We compared the AOD data recovered by different methods with AERONET: 1. The original MOD AOD data and AERONET. 2. The first step of the TWS (LightGBM) was used to calculate the recovered AOD and AERONET comparison. 3. Using spatiotemporal kriging interpolation to interpolate the MOD AOD, we then compared the AOD results with AERONET data. 4. The TWS calculation results were compared with AERONET. To evaluate the effect of the TWS model more carefully, the accuracy of the comparison was divided into all of the AOD data parts (including the recovered part of the AOD and the original MOD AOD part) and a separate AOD recovery part (excluding the original MOD AOD data), as shown in Figures 6 and 7.

**Figure 6.** Comparison of MOD AOD recovered by different methods (including the recovered MOD AOD and original MOD AOD) and AERONET. (**a**) Comparison of the original MOD AOD and AERONET. (**b**) Comparison of the MOD AOD recovered by LightGBM and AERONET. (**c**) Comparison of the MOD AOD recovered by spatiotemporal kriging interpolation and AERONET. (**d**) Comparison of the MOD AOD recovered by TWS with AERONET. The solid red line represents the regression line; the solid black line is the 1:1 line. The color bars represent the density of the points.

**Figure 7.** Comparison of MOD AOD and AERONET recovered by different methods. (**a**). Comparison of the recovered MOD AOD of LightGBM (excluding the original MOD AOD part) and AERONET. (**b**). Comparison of the MOD AOD recovered by TWS (excluding the original MOD AOD part) and AERONET. (**c**). Comparison of the MOD AOD recovered by TWS (excluding the original MOD AOD and LightGBM recovered MOD AOD) and AERONET. The solid red line represents the regression line; the solid black line is the 1:1 line. The colored bars represent the density of the points.

As shown in Figures 6 and 7, the number of matching points between the original MOD AOD and AERONET, used as the reference, is 263, the R is 0.83, and the Root Mean Square Error (RMSE) is 0.13. In the comparison of all of the AOD pixel values, LightGBM has the smallest number of matching points (n = 876). Although the number of matching points of the spatiotemporal kriging interpolation is the largest (1587), its quality in terms of R and RMSE (0.59, 0.71) is not as good as that of LightGBM (0.85, 0.24), while TWS (R = 0.87, RMSE = 0.23) maintains the R and RMSE quality of LightGBM and the reference while also obtaining a larger number of matching points (1433). In the comparison of the AOD recovery part only, the results of the TWS recovery excluding the original MOD AOD (R = 0.87, RMSE = 0.26), the LightGBM recovery excluding the original MOD AOD (R = 0.88, RMSE = 0.25), and the TWS recovery excluding both the original MOD AOD and the LightGBM-recovered part (R = 0.86, RMSE = 0.26) are consistent with each other, and the R is consistent with the reference (the difference in the RMSE index is related to the number and distribution of the reference samples). It can be seen from these results that TWS not only utilizes the information volume of the multisource AOD data but also absorbs the advantages of AOD spatiotemporal information. Even with an increased number of matching points, the R remains high, which indicates that the TWS is reliable.

To further verify the effectiveness of TWS, we regridded the original MOD AOD into 5 × 5 pixel blocks and set the existing value at the center of each block as forcibly missing. Then, we used the 3 × 3 grid TWS to regenerate the forced-missing MOD AOD. A validation between the regenerated MOD AOD and the original effective MOD AOD is shown in Figure 8.

**Figure 8.** Comparison of the regenerated MOD AOD by 3 × 3 TWS and the original MOD AOD. The solid red line represents the regression line; the solid black line is the 1:1 line. The colored bars represent the density of the points.

As shown in Figure 8, the number of regenerated MOD AOD pixels is 2,352,752. After restoring the missing AOD by the 3 × 3 grid TWS, the validation yields R = 0.98 and RMSE = 0.05 between the regenerated MOD AOD and the original effective MOD AOD. These results show that the 3 × 3 grid TWS maintains good stability and accuracy even when recovering a large number of missing MOD AOD pixels, which further verifies the reliability of the TWS.

#### *4.3. TWS Recovery Performance with Different Moving Windows*

The missing rate of MOD AOD was calculated as the ratio of missing MOD AOD pixels to the total number of pixels in the study area, as shown in Figure 9. The MOD AOD missing rate was set to between 0 and 1. The recovery of MOD AOD requires both high accuracy and a low MOD AOD missing rate to achieve its goal. Although the MOD AOD after spatiotemporal kriging interpolation had no missing AOD data, its accuracy could not reach the application level. Therefore, the MOD AOD missing rate was compared for the different windows of the TWS (3 × 3 window, self-adaption window, and 7 × 7 window). According to the statistics of the original MOD AOD data and the MOD AOD results recovered by TWS, the annual average missing rate of the original MOD AOD exceeded 0.8. After the first step of TWS (LightGBM), the average annual missing rate of MOD AOD decreased from 0.8 to 0.4, and after recovery with the 3 × 3 window, the annual average missing rate of MOD AOD decreased from 0.4 to 0.1; the result calculated with the 7 × 7 window (0.06) showed the smallest annual average missing rate of MOD AOD.

**Figure 9.** Time series plot of daily AOD coverage over study areas in 2018 for MOD, LightGBM, 3 × 3, self-adaption and 7 × 7. MOD represents the original MOD AOD, LightGBM represents LightGBM recovered MOD AOD, 3 × 3 represents the 3 × 3 grid moving window TWS recovered MOD AOD, self-adaption represents the self-adaption moving window TWS recovered MOD AOD, and 7 × 7 represents the 7 × 7 moving window TWS recovered MOD AOD. The numbers in parentheses represent the average and standard deviation of the empty AOD coverage.

Furthermore, in 2018, the standard deviation of the missing rate of MOD AOD after LightGBM alone was 0.131, whereas the standard deviation of the MOD AOD missing rate after the full TWS treatment was smaller than 0.08. This shows that LightGBM alone relies only on multisource AOD data, and after processing by LightGBM alone, a large quantity of AOD data is still missing. In contrast, the complete TWS, which combines spatial and spatiotemporal information, can further reduce the missing rate of MOD AOD.

According to Table 3 and Figure 10, the missing rate of MOD AOD, the R, and the calculation efficiency all change with the size of the moving window. Among them, the 7 × 7 grid has the lowest R and the largest RMSE, 0.78 and 0.32, respectively; the self-adaption window's R and RMSE are 0.79 and 0.3, respectively. Compared to the 3 × 3 window, the R of the 7 × 7 and self-adaption windows decreases, while the RMSE increases. The calculation time of the self-adaption window is the largest, i.e., 4.2 times that of the 3 × 3 grid, while that of the 7 × 7 grid is 2.7 times that of the 3 × 3 grid. These data show that as the window size increases, the R of the recovered MOD AOD decreases while the RMSE increases. A possible reason for this is that the spatial and temporal variability of the MOD AOD increases with the size of the moving window. Moreover, the change in the size of the moving window also significantly affects the amount of calculation.


**Table 3.** Performance comparison of 3 different moving windows.

**Figure 10.** Comparison of TWS recovered MOD AOD (including the recovered MOD AOD and original MOD AOD) and AERONET in different moving windows. (**a**) Comparison of the 7 × 7 moving window TWS recovered MOD AOD and AERONET. (**b**) Comparison of the self-adaption moving window TWS recovered MOD AOD and AERONET. The solid red line represents the regression line; the solid black line is the 1:1 line. The colored bars represent the density of points.

#### *4.4. Analysis of the Spatiotemporal Characteristics of MOD AOD Recovered by TWS*

Combining the MOD AOD recovery results from the 3 × 3 window of the TWS with the spatiotemporal kriging interpolation results, the annual average of the recovered MOD AOD was calculated and compared with the annual average of the original MOD AOD (Figure 11). The following can be observed in Figure 11: (1) There were still some gaps in the annual average map of the original MOD AOD (red circle 1). Comparison with Figure 1 (land use) shows that the area in red circle 1 is mainly brighter, bare land, which confirms that the DT and DB algorithms retrieve AOD poorly over relatively bright surfaces. Both the annual average of the MOD AOD recovered in the 3 × 3 window of the TWS and the annual average of the spatiotemporal kriging interpolation filled the AOD gaps in red circle 1. (2) The maximum value of the original annual average MOD AOD is too large (the maximum AOD value was 3). (3) The maximum value of the annual average MOD AOD recovered in the 3 × 3 window of the TWS decreased to 0.64, and that of the spatiotemporal kriging interpolation decreased to 0.82. (4) The mean values of the annual average results of the original MOD AOD, the spatiotemporal kriging interpolation and the TWS were 0.23, 0.34 and 0.27, respectively. The original MOD AOD data had a large number of missing AOD pixels (the missing rate in Figure 11a was 2%), so there were not enough valid AOD pixels to average out the minimum and maximum values; this ultimately caused the maximum of the original annual average to be too large (3) and its mean to be low (0.23). (5) In red circle 2, the annual averages of the original MOD AOD and of the spatiotemporal kriging interpolation are higher. The annual average of the MOD AOD restored in the 3 × 3 window of the TWS retained the spatial characteristics of the original annual average while reducing the annual mean of MOD AOD. Moreover, over the Pacific region, the annual average of the TWS 3 × 3 window restoration was higher than that of the original MOD AOD.

The AOD data gap in red circle 1 was filled because both the 3 × 3 window of the TWS and the spatiotemporal kriging interpolation filled the gap to a large extent; with the gap filled, the annual average was computed from a more complete set of pixels, and the maximum of the annual average was therefore reduced. In addition, because the spatiotemporal kriging interpolation does not accurately predict local features, its annual average result was higher than that of the TWS 3 × 3 window restoration.

**Figure 11.** The average annual MOD AOD distribution in 2018. (**a**) Annual average of the original MOD AOD (2% missing). (**b**) The MOD AOD average of the spatiotemporal kriging interpolation recovery (missing 0). (**c**) The 3 × 3 moving window TWS recovered MOD AOD annual average (missing 0). The red fonts Ave and Max represent the average and maximum values of the AOD annual average graph, respectively. The white part represents no data. The color bar represents the MOD AOD value.

We compared the monthly averages of the TWS 3 × 3 window MOD AOD recovery with those of the original MOD AOD data (Figure 12). In Figure 12, we marked the missing rate, mean and maximum of the monthly average of the original MOD AOD and of the TWS AOD for each month. The monthly average maximum of the TWS AOD was smaller than that of the original MOD AOD. The range of the monthly mean TWS AOD (0.17–0.24) was also narrower than that of the original MOD AOD (0.18–0.36). In addition, the TWS AOD monthly average accurately retained the high-value area of the original MOD AOD monthly average (in the yellow box).

On this basis, in the yellow box area in Figure 12 (112.7°E–125.2°E, 32.5°N–42.1°N), we calculated the monthly mean and maximum AOD values of the original MOD AOD and TWS AOD, as well as the monthly average AERONET AOD at the same location (Figure 13). In the yellow box area, the maximum of the monthly average original MOD AOD was greater than 2, whereas the maximum of the monthly average TWS AOD was lower. Moreover, the largest monthly mean of the TWS AOD occurred in June; specifically, there was an upward trend from January to June and a downward trend from June to December. In addition, the yellow box contains seven AERONET ground stations, for which we calculated the monthly averages. The monthly average trend of the MOD AOD after TWS recovery was consistent with the monthly average AERONET AOD trend. A similar trend was reported by Song et al. [56] for the North China Plain in 2018.

**Figure 12.** Monthly average of the original MOD AOD and 3 × 3 moving window TWS recovered MOD AOD (each month includes the missing rate of MOD AOD). The red fonts Ave and Max represent the average and maximum values of the AOD monthly average graph, respectively. The white part represents no data. The yellow box area represents the sampling area in Figure 13. The color bar represents the MOD AOD value.

**Figure 13.** The broken lines represent the monthly average original MOD AOD and the monthly average MOD AOD restored by TWS, respectively, and the dotted line represents the monthly average AERONET AOD, with the ordinate on the left. The histogram represents the monthly maximum values of the original MOD AOD and of the MOD AOD recovered by TWS, with the ordinate on the right. The data are from the yellow box area of Figure 12.

#### **5. Discussion**

#### *5.1. Comparison of TWS and Other MOD AOD Recovery Models*

The recovery of missing satellite AOD product data is of great significance to atmospheric pollution research, and many methods have recently been proposed for recovering missing data from satellite AOD products. This study selected methods addressing the same task of recovering missing MOD AOD data and compared them in Table 4; the results of all methods were validated against AERONET. Based on this comparison, the improvement in R between the recovered MOD AOD and AERONET over the R between the original MOD AOD and AERONET was not obvious for any method, although the R of the TWS-recovered MOD AOD against AERONET was the highest. In terms of the missing rate, the MOD AOD recovered by TWS had the lowest missing rate (0.1) and, among the methods in Table 4, the largest decrease in missing rate relative to the original MOD AOD (0.78). The improvement in R of the 3 × 3 window TWS was not significantly different from that of other methods, but its decrease in missing rate (%) was. The main reasons are as follows: (1) The 3 × 3 window TWS introduces multisource datasets (MYD AOD, MAIAC AOD, AHI AOD). In the first step of TWS, the spatial complementarity of AOD datasets with different retrieval algorithms and data collection times is exploited, and the AOD missing rate drops from 88% to 40% (in Figure 9, the decrease in missing rate is 48%). (2) The second step of the 3 × 3 window TWS makes reasonable use of the spatiotemporal relationship of AOD, under the optimization of the moving window and buffer factor, and the AOD missing rate decreases from 40% to 10% (a decrease of 30%). Although the direct comparison of the decrease in missing rate has certain limitations, it also demonstrates the stability and excellent performance of the TWS.


**Table 4.** Comparison of the MOD AOD data recovery methods.

Note: ~ indicates a lack of accurate data in the cited article. \* indicates a lack of method name in the cited article.

#### *5.2. Performance of the TWS MOD AOD Recovery*

The MOD AOD after TWS processing achieves a higher R and a lower missing rate because TWS takes full advantage of the rich data volume of the multisource data and the high local spatiotemporal autocorrelation of the AOD itself. A large body of research has confirmed that multisource data can easily introduce data noise. However, based on the data statistics, we chose LightGBM to build the MOD AOD prediction model, which makes full use of the characteristics of the different AOD datasets and reduces the data noise. The comparison between LightGBM and AERONET shows that the LightGBM model does not introduce much data noise (R = 0.85 overall, and R = 0.86 after removing MOD AOD).

Moreover, we developed MOD AOD recovery measures based on small moving windows by combining MOD AOD spatial data and spatiotemporal data when generating the statistics. The small window is used to ensure a high correlation of AOD within the window. The MOD AOD recovery measures define three recovery modes and use adaptive spatial and spatiotemporal buffering. Different calculation modes were set according to the temporal and spatial distribution of valid AOD information so that the calculation is more reliable when recovering an AOD value; in this way, the introduction of excessive data noise can be avoided. The index was used to determine the local area of autocorrelation, and the mathematical expectation and R were introduced to moderate the spatiotemporal difference; we then determined the spatial and spatiotemporal buffers. Spatial and spatiotemporal buffering can more accurately improve the R of the small moving windows used to recover the missing MOD AOD data. These settings ensure the accuracy of the MOD AOD recovery and reduce the data loss rate (for the 3 × 3 window, R = 0.87 against AERONET with an average daily missing rate of 10%; for the adaptive window, R = 0.79 with an average daily missing rate of 8%; for the 7 × 7 window, R = 0.78 with an average daily missing rate of 6%). Because the moving window size is adjustable, different window sizes can be chosen to meet different needs; for example, to obtain a lower MOD AOD data loss rate, a larger moving window can be selected, and the 7 × 7 window in the 2018 experiment limited the average daily loss rate of MOD AOD to 6%. The adjustable window size therefore makes the TWS method more flexible. Moreover, if the missing MOD AOD data rate is not 0, the TWS method can be applied iteratively to gradually reduce the missing rate to 0; alternatively, spatial interpolation can be applied to the TWS-processed MOD AOD to reduce the missing rate to 0. Because TWS is based on sufficient statistics of the AOD data and uses the spatial autocorrelation of AOD, the TWS method can, in general, be applied to missing data in AHI AOD, MAIAC AOD, MYD AOD and other remote sensing products with spatial and temporal correlation. Finally, the MOD AOD recovered by TWS cannot be studied and used on a global scale because the AHI sensor is carried on a geosynchronous orbit satellite.
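The following is a deliberately simplified sketch of the spatial part of such a moving-window recovery, written for this discussion rather than taken from the TWS implementation: it fills a missing pixel with the mean of the valid values inside a 3 × 3 box only when enough valid neighbours exist, standing in for the paper's mode selection, buffer factors and spatiotemporal neighbours.

```python
import numpy as np

def fill_with_moving_window(aod, window=3, min_valid=3):
    """Single-pass spatial gap filling with a small moving window.

    A missing pixel (NaN) is replaced by the mean of the valid pixels in a
    `window` x `window` box around it, but only if at least `min_valid`
    neighbours are valid; otherwise it is left missing for a later pass.
    """
    half = window // 2
    filled = aod.copy()
    for i, j in zip(*np.where(np.isnan(aod))):
        block = aod[max(0, i - half):i + half + 1, max(0, j - half):j + half + 1]
        valid = block[~np.isnan(block)]
        if valid.size >= min_valid:
            filled[i, j] = valid.mean()
    return filled
```

Larger windows fill more pixels per pass (a lower residual missing rate) at the price of averaging over a more variable neighbourhood, which mirrors the R and RMSE behaviour reported in Table 3.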

#### *5.3. TWS-Recovered MOD AOD*

In the MOD AOD recovery results for the study area in 2018 (Figure 11), the areas with higher AOD were mainly concentrated in North China, the Central China Plain, the Yangtze River Delta and the Sichuan Basin, whereas the AOD over northern Vietnam, the Japanese Islands and the Korean Peninsula was lower (see Figure 12). Because of the dry weather conditions of the Red River Delta, in addition to local traffic and industrial pollution, there was a relatively obvious pollutant transmission process, and higher AOD existed in southern China and northern Vietnam in March and April [57]. Moreover, there was also a higher monthly average AOD near the North China Plain and Shandong Peninsula in China, which spread toward the sea. Furthermore, in November, December and January, pollutant diffusion from North China, the Central China Plain and the Yangtze River Delta was more pronounced under the influence of the winter monsoon [58,59]. Eventually, the monthly mean AOD of North China, the Central China Plain, the Yangtze River Delta and the East China Sea increased. Overall, the high AOD area did not cover the Korean Peninsula or the Japanese Islands; although some pollutants may have reached the Korean Peninsula and the Japanese Archipelago through atmospheric transport, most of the pollutant transmission stopped in the offshore area of China.

#### **6. Conclusions**

A MOD AOD recovery result with high precision and a low missing rate is of great help in mapping the spatial distribution of air pollutants, continuous monitoring, climate change studies and other related research. In this paper, the TWS model was constructed from multisource AOD data, LightGBM, spatial interpolation and the STW, and was used for the large-scale recovery of missing MOD AOD data. The results show that the TWS model guarantees the accuracy of the recovered MOD AOD (R = 0.87). Moreover, compared with other methods, TWS greatly reduces the missing rate of the MOD AOD data (the missing rate of MOD AOD in the 3 × 3 window dropped from the original 88% to 10%). After the missing information is filled in, the local AOD changes show more obvious high- and low-value details, for example, the AOD mean, maximum and minimum of the originally missing area in the AOD annual average map. TWS demonstrates that the spatial complementarity of multisource AOD data and the spatiotemporal relationship of the AOD data are very important when recovering AOD data. In follow-up research, we will use other datasets to expand the applicability of the TWS method, for example, using GOES-16 ABI AOD data to restore AOD over the American continent. We will also use deep learning to recover areas in which the loss of AOD spatiotemporal information is severe, for example, scenario 3 (Pass) in Figure 2, in which the moving window has missing AOD information for three consecutive days.

**Author Contributions:** Methodology, Y.C.; validation, Y.C.; formal analysis, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, Z.W., Y.R. and K.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Science Foundation of China (31670645, 31972951, 41801182, and 41807502), the National Social Science Fund (17ZDA058), the National Key Research Program of China (2016YFC0502704), Fujian Provincial Department of S&T Project (2018T3018), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA23020502), the Ningbo Municipal Department of Science and Technology (2019C10056), the Xiamen Municipal Department of Science and Technology (3502Z20130037 and 3502Z20142016), the Key Laboratory of Urban Environment and Health of CAS (KLUEH-C-201701), the Key Program of the Chinese Academy of Sciences (KFZDSW-324), and the Youth Innovation Promotion Association CAS (2014267). Open-end Fund (2020KX03).

**Acknowledgments:** We are grateful to the anonymous reviewers for their constructive suggestions. The authors would like to express our gratitude to the National Aeronautics and Space Administration (NASA) for providing the MODIS AOD, MAIAC, MERRA2, SRTM and NDVI products, gratitude to the Japan Meteorological Agency (JMA) and Center of Environmental Remote Sensing, Chiba University (CEReS) for providing the AHI AOD product, and the Principle Investigators for establishing and maintaining the AERONET sites.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Hallquist, M.; Wenger, J.C.; Baltensperger, U.; Rudich, Y.; Simpson, D.; Claeys, M.; Dommen, J.; Donahue, N.M.; George, C.; Goldstein, A.H.; et al. The formation, properties and impact of secondary organic aerosol: Current and emerging issues. *Atmos. Chem. Phys.* **2009**, *9*, 5155–5236. [CrossRef]


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Remote Sensing Image Scene Classification with Noisy Label Distillation**

**Rui Zhang <sup>1,2</sup>, Zhenghao Chen <sup>1</sup>, Sanxing Zhang <sup>1,2</sup>, Fei Song <sup>1,3</sup>, Gang Zhang <sup>1</sup>, Quancheng Zhou <sup>1,2</sup> and Tao Lei <sup>1,\*</sup>**


Received: 12 June 2020; Accepted: 19 July 2020; Published: 24 July 2020

**Abstract:** The widespread application of Convolutional Neural Network (CNN)-based remote sensing image scene classification is severely hampered by the lack of large-scale datasets with clean annotations. Data crawled from the Internet or other sources allow existing datasets to be expanded rapidly and at low cost. However, directly training on such an expanded dataset can lead the network to overfit to noisy labels. Traditional methods typically divide this noisy dataset into multiple parts and fine-tune the network on each part separately to further improve performance. These approaches are inefficient and sometimes even hurt performance. To address these problems, this study proposes a novel noisy label distillation method (NLD) based on an end-to-end teacher-student framework. First, unlike general knowledge distillation methods, NLD does not require pre-training on clean or noisy data. Second, NLD effectively distills knowledge from labels across a full range of noise levels for better performance. In addition, NLD can benefit from a fully clean dataset as a model distillation method to improve the student classifier's performance. NLD is evaluated on three remote sensing image datasets, namely UC Merced Land-use, NWPU-RESISC45 and AID, into which a variety of noise patterns and noise amounts are injected. Experimental results show that NLD outperforms widely used direct fine-tuning methods and remote sensing pseudo-labeling methods.

**Keywords:** scene classification; teacher-student; noisy labels; knowledge distillation; remote sensing images

#### **1. Introduction**

Optical remote sensing images are a powerful source of geographical information since they contain complex geometrical structures and spatial patterns. In recent decades, the remote sensing community has worked to establish accurate remote sensing image scene classifiers. Recent advances in Convolutional Neural Networks (CNNs) make it possible to identify remote sensing scenes with better performance [1,2]. However, many real-world applications for Earth observation require large-scale datasets with clean annotations such as ImageNet [3]. It is costly and time-consuming to collect a large-scale remote sensing dataset with high-quality manual annotations. The lack of annotated data has become a bottleneck for the development of deep learning methods in remote sensing and Earth observation, and the same bottleneck also exists in many other visual tasks.

To tackle this bottleneck, many studies [4] leverage crowd-sourcing platforms, image search engines, or other automatic labeling methods to collect labeled data for natural image scene classification. For example, the Open Images Dataset V4 [5] contains over 30.1 million image-level labels automatically produced by a classifier, and a small percentage of the labels are verified by crowd-sourcing platforms. These methods significantly reduce the cost of data labeling, which is valuable for applying deep learning to remote sensing image scene classification. The volume of unlabeled images collected by satellites or drones is growing by a few terabytes each day, so low-cost annotation could facilitate the use of these abundant image resources. Hence, some methods [6] generate pseudo-labels for unlabeled remote sensing images through semi-supervised learning. However, these labels struggle to provide the same asymptotic properties that supervised learning offers in high-data regimes. The labels produced by these approaches contain varying degrees of error, i.e., noise, and the performance of classifiers is highly sensitive to massive label noise. Since most of the automatically generated labels are mismatched, it is challenging for traditional learning methods to work on such datasets.

Training on noisy labeled datasets has become essential and has attracted much attention in recent years [7–9]. Furthermore, several approaches for learning with noisy labels [10–12] have been explored for remote sensing image analysis tasks. Existing methods based on RGB images with noisy labels usually make the strong assumption that all labels are noisy. These studies mostly work on algorithms that are robust against noisy labels [13], label cleansing methods that find label errors [14], or combinations of the two [15]. These classifiers have achieved good accuracy on noisy CIFAR10/100 datasets. However, it is difficult and impractical to apply such complex methods to other areas; for remote sensing image scene classification, some of these methods sometimes do not even perform as well as direct training. In real-world applications, datasets usually contain a small fraction of images with clean annotations and a large number of images with noisy or missing labels. In this case, some approaches [16–18] have shown better performance and practicality on large-scale real-world noisy datasets, such as the Clothing1M dataset [8] and the Open Images V4 dataset [5]. To the best of our knowledge, there is no existing work on remote sensing image scene classification with minimal extra human supervision.

This work focuses on augmenting existing human-verified labeled datasets with additional noisy labeled data to improve the performance of remote sensing scene classifiers. A more efficient way to learn knowledge from massive noise is explored, instead of simply mixing all data or fine-tuning with labeled images. Inspired by Deep Mutual Learning (DML) [19], this paper proposes a novel noisy label distillation framework called NLD, based on the teacher-student methodology with a decision network, as shown in Figure 1. First, the student and teacher jointly learn from each other, so pre-training is no longer required. Second, the teacher distills the knowledge learned from noisy data to help the student learn from the full dataset; NLD can even be applied to completely noise-free datasets, which means that our method can be used in a wide range of remote sensing applications. Third, a decision network derived from [20] is introduced, which is easier to optimize in practice and replaces the calculation of the mimicry loss. Considering the lack of public datasets with noisy annotations for remote sensing image scene classification, experiments are conducted to evaluate NLD by injecting a series of noises into well-annotated datasets (e.g., UC Merced Land-use [21], NWPU-RESISC45 [22] and AID [23]).

Our contributions are as follows:


**Figure 1.** A high-level illustration of NLD. The student and teacher mutually learn knowledge of clean and noisy labels.

This paper is organized as follows: Section 2 introduces the research status and the challenges. Section 3 describes the overall framework of NLD. Section 4 presents the implementation details of experiments and analyzes the result. Finally, Section 5 concludes our paper and gives an outlook.

#### **2. Related Works**

In this section, we will briefly review existing related works on remote sensing image scene classification and learning from noisy labels.

#### *2.1. Remote Sensing Image Scene Classification*

Remote sensing image scene classification aims to distinguish the semantic category of an image, which is a fundamental problem for understanding high-level geospatial information. With the development of deep learning methods, many CNN architectures (e.g., ResNet [24], VGG [25]) have achieved remarkable performance on many remote sensing public datasets. However, there are large intra-class variations and small inter-class dissimilarities between different remote sensing scenes. These problems will decrease the recognition abilities of models for some categories. To address these challenges, many studies focus on how to learn discriminative feature representations. Nogueira et al. [2] analyzed the use of different networks in the field of remote sensing. Chaib et al. [1] proposed an adequate method for feature fusion and introduced discriminant correlation analysis to represent the fused features. Zhang et al. [26] proposed a newly designed CapsNet to deal with the remote sensing image scene classification problem. Li et al. [27] proposed a unified feature fusion framework based on attention mechanism to improve the classification performance.

These algorithms are all data-driven, which means that large-scale datasets are required in practice. NLD can be combined with various models, including those in the studies above, to facilitate their application in fields where little cleanly annotated data is available.

#### *2.2. Learning from Noisy Labels*

Most methods for learning from noisy datasets aim to learn directly without any clean labels available. These approaches usually focus on noise-robust algorithms and label cleansing methods. Wang et al. [13] proposed the symmetric cross entropy (SCE) loss to boost cross-entropy (CE) symmetrically with a noise-robust counterpart, reverse CE. Northcutt et al. [14] proposed confident learning for characterizing, identifying, and learning with noisy labels. Kim et al. [15] proposed Selective Negative Learning and Positive Learning (SelNLPL) to filter and learn with noisy data. These methods face the problem of distinguishing difficult samples from mismatched labels.

Our approach belongs to a practical stream that assumes both the clean and noisy labels of the dataset are known [8,28]. This is a more practical scenario, allowing researchers to focus on leveraging noisy labeled data to enhance existing fully supervised algorithms. Veit et al. [16] proposed a learning approach for multi-label image classification that combines clean labeling with massive noisy labeling. Hu et al. [18] proposed a method to automatically identify credible annotations among massive noisy labels under weakly supervised learning. Many semi-supervised learning algorithms, especially pseudo-labeling algorithms, can also be categorized into this scenario [29]. Han et al. [6] proposed a framework based on deep learning features, self-labeling techniques and decision evaluation methods under semi-supervision for remote sensing image scene classification and dataset annotation. The works closest to ours come from Li et al. [17] and Li et al. [30]. To achieve noisy label learning, they proposed a teacher-student framework derived from knowledge distillation [31]. To make full use of the whole data space, traditional knowledge distillation and many other similar noise-robust methods use the student model to mimic the large pre-trained teacher model by providing training experiences, often called "dark knowledge".

In practice, a smaller network with comparable precision, i.e., a student network, is needed because of the cost. However, due to the existence of noisy labels, even under the guidance or regularization of a powerful network pre-trained with clean data, small networks are still prone to overfit to noisy labels and may even lose the knowledge of the original clean data.

#### **3. Method**

#### *3.1. Problem Formulation*

Our goal is to train a remote sensing scene classifier using a dataset with automatically collected noisy labels and a portion of human-verified clean labels. Noisy labels may come from web collection, from the predictions of models trained on clean data, or from other sources. Furthermore, the framework can also be used on large-scale datasets with fully clean annotations to improve the performance of networks under traditional supervised learning.

Formally, we define the notations for our study. Let $\mathcal{D} = \mathcal{D}_c \cup \mathcal{D}_n$ denote the entire large training dataset, where $\mathcal{D}_c$ is the clean subset and $\mathcal{D}_n$ is the remaining noisy subset. In a single-label classification problem, $\mathcal{D}_c = \{(\vec{x}_i, y_i) \mid i = 1, 2, \cdots, N_c\}$ and $\mathcal{D}_n = \{(\vec{x}_j, y_j) \mid j = 1, 2, \cdots, N_n\}$, which contain $N_c$ and $N_n$ samples from $M$ classes, respectively; $y_i \in \{1, 2, \ldots, M\}$ and $y_j \in \{1, 2, \ldots, M\}$ denote the labels corresponding to images $\vec{x}_i$ and $\vec{x}_j$. In this work, the ratio of $\mathcal{D}_n$ to $\mathcal{D}_c$ is not limited, because NLD can improve the performance of classifiers in different practical applications.

As shown in Figure 2, NLD is formulated with a cohort of two classifiers *g* and *h*. The classifier *g* is the large teacher model used to distill and transfer the knowledge of the noise; its backbone is a powerful network such as ResNet-50 [24]. The student model *h* is designed to learn from the clean labels, with its learning process guided by the noise knowledge distilled from the teacher network *T*. The network *S* is the same as, or shallower than, the network *T* (e.g., ResNet-34 [24] or VGG-16 [25]). The logits $\vec{r}_1$ for $\vec{x}_j$ given by the teacher network *T* can be represented as

$$
\vec{r}_1 = \mathcal{F}_n \left( \vec{x}_j \right),
\tag{1}
$$

where $\mathcal{F}_n$ is the nonlinear transformation in the teacher network *T*. Similarly, the logits $\vec{r}_2$ and $\vec{c}_1$ can be represented as

$$
\vec{r}_2 = \mathcal{F}_c \left( \vec{x}_j \right),
\tag{2}
$$

$$
\vec{c}_1 = \mathcal{F}_c \left( \vec{x}_i \right),
\tag{3}
$$

where $\mathcal{F}_c$ is the nonlinear transformation in the student network *S*.
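A minimal PyTorch sketch of Equations (1)–(3) is given below; it assumes, for illustration only, that each backbone together with its fully connected layer is wrapped into a single module that maps an image batch to class logits (21 classes as in UC Merced Land-use).

```python
import torch
from torchvision import models

teacher = models.resnet50(num_classes=21)   # F_n, supervised with noisy labels
student = models.resnet34(num_classes=21)   # F_c, supervised with clean labels

x_j = torch.randn(8, 3, 224, 224)  # batch drawn from the noisy subset D_n
x_i = torch.randn(8, 3, 224, 224)  # batch drawn from the clean subset D_c

r1 = teacher(x_j)  # Eq. (1): teacher logits for the noisy batch
r2 = student(x_j)  # Eq. (2): student logits for the same noisy batch
c1 = student(x_i)  # Eq. (3): student logits for the clean batch
```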

**Figure 2.** The overview of the proposed framework to train a remote sensing scene classifier from a large dataset $\mathcal{D}_n$ with noisy labels and a small dataset $\mathcal{D}_c$ with manually verified labels. The framework consists of the teacher network *T*, the student network *S*, a decision network, fully connected layers, and a softmax predictor. In the training phase, two loss terms $L_g$ and $L_h$ (a CE loss with noisy labels and a CE loss with clean labels) are minimized jointly. The teacher model *T* transfers the "dark knowledge" distilled from the noisy subset to the student model *S* through the decision network. In the inference phase, a classifier containing the student network *S*, the fully connected layers and softmax gives the final predictions.

For classifiers *g* and *h*, the supervision depends on the source of the training sample. For an image $\vec{x}_j$ from the noisy dataset $\mathcal{D}_n$, the classifier *g* is supervised by the noisy label $y_j$. For a sample $\vec{x}_i$ from the clean dataset $\mathcal{D}_c$, supervision comes directly from the verified label $y_i$.

#### *3.2. Noisy Distillation*

In contrast to previous work on teacher-student models, including [17,30], which needs to pre-train a teacher model with a small part of, or the entire, dataset, in NLD the teacher model and student model are trained together to learn latent noisy label distributions and improve the performance of the student network supervised with the clean subset. NLD is motivated by DML, which leverages a teacher-student framework to improve the representation of the network. The details are analyzed in the later part of this section.

The student network learns the knowledge of the clean data and acquires the distilled knowledge of the noisy dataset. The teacher takes advantage of powerful deep network architectures to learn features of noisy labels at various levels of abstraction rather than simply memorizing them. Besides, the noise knowledge is distilled by comparing the outputs of the student and teacher simultaneously. To that end, the student and teacher models are trained by a mutual learning approach that originates from knowledge distillation. Note that NLD is different from DML and other similar approaches. To match noisy label distributions, a metric between the two branches' representation vectors $\vec{r}_1$ and $\vec{r}_2$ needs to be defined. As a loss function, the Kullback-Leibler (KL) divergence is the most widely used. The KL distance between $\vec{r}_2$ and $\vec{r}_1$ is computed as

$$D_{KL}\left(\vec{r}_2 \parallel \vec{r}_1\right) = \sum_{j=1}^{N_n} \sum_{m=1}^{M} r_2^{m}\left(\vec{x}_j\right) \log \frac{r_2^{m}\left(\vec{x}_j\right)}{r_1^{m}\left(\vec{x}_j\right)},\tag{4}$$

where $r_1^m$ is the score of class *m* in the logits $\vec{r}_1$ and $r_2^m$ is the score of class *m* in the logits $\vec{r}_2$.

According to the formula, the KL divergence is asymmetric; hence, the KL distances computed in the two directions between the networks are different. One can instead use a symmetric KL divergence such as

$$D_{\rm SKL} = D_{\rm KL} \left( \vec{r}_2 \parallel \vec{r}_1 \right) + D_{\rm KL} \left( \vec{r}_1 \parallel \vec{r}_2 \right). \tag{5}$$
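For reference, a short PyTorch sketch of Equations (4) and (5) follows; it assumes the logits are first turned into class probabilities with a softmax before the divergence is evaluated.

```python
import torch.nn.functional as F

def symmetric_kl(r1, r2):
    """Symmetric KL divergence of Eq. (5) between two batches of logits."""
    p1 = F.softmax(r1, dim=1)
    p2 = F.softmax(r2, dim=1)
    kl_r2_r1 = F.kl_div(p1.log(), p2, reduction="batchmean")  # D_KL(r2 || r1), Eq. (4)
    kl_r1_r2 = F.kl_div(p2.log(), p1, reduction="batchmean")  # D_KL(r1 || r2)
    return kl_r2_r1 + kl_r1_r2
```

NLD does not ultimately minimize this mimicry loss; the sketch only makes explicit the quantity that the decision network described below replaces.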

Compared to the teacher network *T*, the student network *S* has similar representation capacity, but it is harder for it to learn appropriate parameters. In DML and other similar knowledge distillation algorithms, both the teacher network and the student network are trained on clean datasets; these studies expect the student network to mimic the classification probabilities and feature representations of the teacher network, and the objective functions of the two networks are the same. Therefore, a simple combination of CE loss and KL divergence can yield a better student network from an entirely clean dataset. However, in our task, combining and optimizing these two different kinds of losses is difficult: our teacher network *T* is supervised by noisy labels and our student network *S* is supervised by clean labels, so the student network *S* should not completely mimic the outputs of the teacher network *T*. Through imitation and comparison, the purpose is to distill the knowledge from the noisy dataset, namely the intersection of the clean student's features and the noisy teacher's features. Meanwhile, as mentioned above, a simple combination of CE loss and KL divergence would push the two networks to become identical to each other. Although this can be changed by adding weights before the combination, there are too many options for the hyper-parameters.

To address these problems, NLD feeds the outputs of the two networks simultaneously into a decision network derived from [20]. The decision network simply consists of fully connected layers; in [20], this network, with a single output, is used together with a siamese network to measure the similarity between two different images. As discussed above, NLD has different settings from image similarity measurement methods: the two logit vectors of the same image patch are mapped from different networks, and the similarity of the two networks is measured through the decision network. In addition, the decision network has learnable parameters; instead of relying on a combination of different loss functions with hyper-parameters, it can automatically learn weights that fit the noisy label knowledge distillation. Because the original logits are mapped from the same image, the output *r* of the decision network is still a feature mapping of the original image. The probability of class *m* for the sample $\vec{x}_j$ given by the decision network is computed as

$$p_1^m\left(\vec{x}_j\right) = \frac{\exp\left(r^m\right)}{\sum_{m=1}^M \exp\left(r^m\right)}.\tag{6}$$

Subsequently, the classifier *g* is supervised by the noisy labels and the classifier *h* is supervised by the clean labels. In this way, the student network can learn both the clean knowledge and the knowledge shared between clean and noisy labels, i.e., noise distillation. At the same time, NLD does not need a mimicry loss, so training is faster and more flexible than traditional distillation methods. The decision network does add computation, as it requires combining two logit vectors; however, since our goal is to train a student network guided by the teacher network, only the student network is used for testing and the decision network is not used at inference time.
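A sketch of such a decision network is shown below. The concatenation of the two logit vectors and the hidden width are assumptions made for illustration; the original work only specifies that the module is built from fully connected layers with learnable parameters and that Eq. (6) is applied to its output.

```python
import torch
import torch.nn as nn

class DecisionNetwork(nn.Module):
    """Maps the paired teacher/student logits of one image back to an
    M-dimensional vector r, to which the softmax of Eq. (6) is applied."""

    def __init__(self, num_classes, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, r1, r2):
        r = self.fc(torch.cat([r1, r2], dim=1))   # combine the two branches
        return torch.softmax(r, dim=1)            # p_1 in Eq. (6)
```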

#### *3.3. Model Training*

In the original knowledge distillation and in DML, the whole objective function consists of a supervised loss (e.g., CE loss) and a mimicry loss (e.g., KL divergence). In contrast, NLD uses only CE losses as the supervised losses for classifiers *g* and *h*, respectively, which can be written as:

$$L_g = -\sum_{j=1}^{N_n} y_j \log \left( p_1 \right), \tag{7}$$

$$L_h = -\sum_{i=1}^{N_c} y_i \log \left( p_2 \right), \tag{8}$$

where $L_g$ and $L_h$ are the losses for the corresponding classifiers *g* and *h*, respectively. Given the above definitions, the overall loss for the proposed model is constructed from the two losses as follows:

$$L_{\text{total}} = \alpha L_h + \beta L_g, \tag{9}$$

where *α* and *β* denote weight factors that need to be set according to the student network, the teacher network and the noisy dataset.
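A compact sketch of the training objective in Equations (7)–(9) is given below. It assumes that $p_1$ is the decision-network probability for the noisy batch and that $p_2$ is the softmax of the student logits for the clean batch; the weight values follow the settings reported in Section 4.1.3.

```python
import torch
import torch.nn.functional as F

def nld_loss(p1, y_noisy, p2, y_clean, alpha=10.0, beta=2.0):
    """Overall loss of Eq. (9) built from the two CE terms of Eqs. (7)-(8)."""
    eps = 1e-8
    l_g = F.nll_loss(torch.log(p1 + eps), y_noisy)  # Eq. (7): CE with noisy labels
    l_h = F.nll_loss(torch.log(p2 + eps), y_clean)  # Eq. (8): CE with clean labels
    return alpha * l_h + beta * l_g                 # Eq. (9)
```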

Training a network on a noisy dataset can lead the network to memorize the noise. To prevent the teacher network from overfitting on noisy data, which would deteriorate the performance of noise distillation and might even mislead the student into exploding gradients, batch normalization (BN) [32] and a dropout layer [33] with a constant probability of 0.6 are applied between the teacher network and the decision network.
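In code, this regularization amounts to a small module placed on the teacher branch before the decision network; the feature width of 21 below is only an example matching UC Merced Land-use.

```python
import torch.nn as nn

# Batch normalization followed by dropout (p = 0.6) applied to the teacher
# logits before they are passed to the decision network.
teacher_head = nn.Sequential(
    nn.BatchNorm1d(21),
    nn.Dropout(p=0.6),
)
```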

#### *3.4. Extension to Pseudo-Labeling*

Semi-supervised learning requires a small amount of manually labeled clean data, which is consistent with NLD. However, semi-supervised learning datasets usually contain a small amount of labeled data and a large amount of unlabeled data. Because NLD does not use an additional mimicry loss, unlabeled data cannot be used directly. Pseudo-labeling belongs to the self-learning scenario, which is often used in semi-supervised learning. Under the self-training setting, pseudo-labels are obtained by predicting unlabeled data with the models trained on the labeled data. Some of the pseudo-labels will be mislabeled; the data with these pseudo-labels can be treated as a large noisy dataset, and NLD can thus be extended to semi-supervised learning.

Following [6], the pseudo-labeling method used is illustrated in Figure 3 and is close to traditional co-training. Denote the labeled and unlabeled subsets as $\mathcal{D}_l$ and $\mathcal{D}_u$, where the entire training dataset is $\mathcal{D}_s = \mathcal{D}_l \cup \mathcal{D}_u$. First, two different classifiers $f_1$ and $f_2$ are trained on the small labeled dataset $\mathcal{D}_l$. Given a batch of unlabeled images $\vec{x}' \in \mathcal{D}_u$, two predictions $\tilde{y}_1$ and $\tilde{y}_2$ are provided by the classifiers $f_1$ and $f_2$. Then, $\tilde{y}_1$ and $\tilde{y}_2$ can be represented as

$$
\tilde{y}_1 = f_1 \left( \vec{x}' \right),
\tag{10}
$$

$$
\tilde{y}_2 = f_2 \left( \vec{x}' \right). \tag{11}
$$

Only when $\tilde{y}_1 = \tilde{y}_2$ are the predictions of the classifiers $f_1$ and $f_2$ regarded as the pseudo-label $y'$ corresponding to $\vec{x}'$; all other results are discarded. Apparently, this process reduces the size of the dataset derived from $\mathcal{D}_u$, which typically affects the final performance. In fact, it removes low-confidence predictions from the pseudo-labels and reduces the noise level of the labels. High-quality pseudo-labels improve the performance and robustness of the model. Furthermore, the process does not need a confidence threshold or manual selection, making it a more efficient pseudo-labeling method.
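A minimal sketch of this agreement-based filtering is shown below; it assumes both classifiers return class logits for a batch of unlabeled images.

```python
import torch

@torch.no_grad()
def agreement_pseudo_labels(f1, f2, x_unlabeled):
    """Keep only the images on which the two classifiers agree (Eqs. (10)-(11));
    the agreed prediction becomes the pseudo-label, the rest are discarded."""
    y1 = f1(x_unlabeled).argmax(dim=1)
    y2 = f2(x_unlabeled).argmax(dim=1)
    keep = y1 == y2
    return x_unlabeled[keep], y1[keep]
```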

**Figure 3.** Illustration of the pseudo-labeling method, which includes two phases: training two classifiers and pseudo-labeling. (**a**) Two different classifiers $f_1$ and $f_2$ are trained on the small manually labeled subset $\mathcal{D}_l$; they provide two views of the data. (**b**) The trained models predict labels on a batch of unlabeled data. When the inferences are the same, the predicted labels are kept as pseudo-labels for the corresponding images, and the rest are discarded. $L_{CE}$ denotes the CE loss. $\tilde{y}_1$ and $\tilde{y}_2$ represent the predictions of the two classifiers, respectively. $\tilde{y}'$ indicates the pseudo-labels of the batch of images $\vec{x}'$.

#### **4. Experiments**

In this section, we explain how the simulated noisy datasets are constructed, describe the experimental details, compare NLD with other methods on these datasets, and evaluate NLD.

#### *4.1. Datasets and Settings*

#### 4.1.1. Datasets

The UC Merced Land-use dataset is a classical land-use dataset, which contains 21 scene classes and 2100 images. Each image has 256 × 256 pixels in RGB color space with a high spatial resolution of 0.3 m. The images were all manually extracted from the USGS National Map Urban Area Imagery Collection.

The NWPU-RESISC45 dataset has a total of 45 scene classes, with 700 images of size 256 × 256 for each class. Most of the images have a middle to high spatial resolution, varying from 30 m to 0.2 m, and are all cropped from Google Earth. The dataset takes eight popular classes from the UC Merced Land-use dataset and some widely used scene categories from other datasets and research.

AID is a large-scale aerial image dataset with 30 aerial scene types. The dataset is composed of 10,000 images which are multi-resolution and multi-source. The size of each image is fixed to be 600 × 600. The number of images in each class is imbalanced. This dataset is challenging because of the large intra-class diversities.

These datasets have many overlapping classes (e.g., sparse residential, medium residential and dense residential) that can easily confuse non-experts. It is particularly challenging for computer vision researchers with little geographical knowledge to label such a dataset manually, and crowd-sourcing or automatic labeling is even more error-prone. In practice, when the existing public datasets are used in real-world applications, additional data is usually needed, and only experts can avoid label noise, which is expensive.

Experiments are conducted on these three datasets. In addition, as shown in Table 1, each dataset is randomly split into 60% training subset, 20% validation subset and 20% test subset. Because the existing datasets lack noisy labels, simulated approaches are taken to evaluate NLD. Three different types of noise are injected into the split training set of all the three datasets separately.


**Table 1.** Sample sizes for different datasets.

**Symmetric noise**: Symmetric noise is a type of uniform noise, generated by replacing the ground-truth label with a label drawn uniformly at random from all classes. This type of noisy subset represents an almost zero-cost annotation method: there are many unlabeled images, and labels are assigned completely at random. Experiments on this noise show that, through NLD, such a labeling method is also feasible in some extremely low-cost scenarios.
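A minimal sketch of one common way to inject such noise is given below; with `noise_rate=1.0` it labels the whole selected subset uniformly at random, which matches the zero-cost setting described above.

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, noise_rate=1.0, seed=0):
    """Replace a fraction `noise_rate` of labels with uniformly random classes
    (the random draw may occasionally coincide with the true label)."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(noisy.shape[0]) < noise_rate
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy
```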

**Asymmetric noise**: This type of noise is class-dependent and mimics some of the real-world noise among visually and semantically similar categories.

For UC Merced Land-use, to the best of our knowledge, there is no previously published noise label mapping. After observing the image characteristics and the division of scene classes, asymmetric noise was generated by mapping *chaparral* → *agricultural*, *runway* ↔ *airplane*, *tennis court* → *baseball diamond*, *river* → *beach*, *mobile home park* → *parking lot*, *freeway* ↔ *overpass*, *sparse residential* → *buildings*, *harbor* → *mobile home park*, *medium residential* ↔ *dense residential*, as shown in Figure 4.

**Figure 4.** Examples of asymmetric noise mapping scenes in the UC Merced Land-use dataset.

For NWPU-RESISC45, *baseball diamond* → *medium residential*, *beach* → *river*, *dense residential* ↔ *medium residential*, *intersection* → *freeway*, *mobile home park* ↔ *dense residential*, *overpass* ↔ *intersection*, *tennis court* → *medium residential*, *runway* → *freeway*, *thermal power station* → *cloud*, *wetland* → *lake*, *rectangular farmland* → *meadow*, *church* → *palace*, *commercial area* → *dense residential* are mapped, following [12]. Figure 5 shows representative images in this dataset.

For AID, the classes are flipped by mapping *bareland* ↔ *desert*; *center* → *storage tank*; *church* → *center*, *storage tank*; *dense residential* ↔ *medium residential*; *industrial* → *medium residential*; *meadow* → *farmland*; *playground* → *meadow*, *school*; *resort* → *medium residential*; *school* → *medium residential*, *playground*; *stadium* → *playground*, following [12]. Figure 6 shows examples from this dataset.

**Figure 5.** Examples of asymmetric noise mapping scenes in the NWPU-RESISC45 dataset.

**Figure 6.** Examples of asymmetric noise mapping scenes in the AID dataset.
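The class-dependent flips listed above can be injected with a simple mapping table, as in the sketch below; the numeric indices are placeholders, since the real indices depend on each dataset's class ordering.

```python
import numpy as np

# Placeholder index mapping for a few of the UC Merced flips listed above,
# e.g. chaparral -> agricultural and the runway <-> airplane pair.
ASYM_MAP = {4: 0, 15: 1, 1: 15}

def inject_asymmetric_noise(labels, mapping, noise_rate=1.0, seed=0):
    """Flip a fraction `noise_rate` of the labels of each mapped class.
    Flips are decided against the original labels so that bidirectional
    pairs (a <-> b) are not flipped twice."""
    rng = np.random.default_rng(seed)
    original = np.asarray(labels)
    noisy = original.copy()
    for src, dst in mapping.items():
        idx = np.where(original == src)[0]
        flip = idx[rng.random(idx.size) < noise_rate]
        noisy[flip] = dst
    return noisy
```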

**Pseudo-Labeling noise**: Pseudo-labeling methods can assign labels to unlabeled images automatically, which reduces costs; however, the pseudo-labels are not completely correct. To ensure a fair comparison, following the idea of SSGA-E [6], the full training set is randomly divided into six parts and one of them is randomly selected as a small clean subset. Two different classifiers are then trained on the small clean subset and generate pseudo-labels for the rest of the training set. In SSGA-E [6], the two networks are ResNet-50 and VGG-S [34], respectively. However, VGG-S is rarely used in practice, which can cause problems in deployment; as a result, VGG-S is replaced with VGG-19 [25], which has lower accuracy but is more widely used. The unlabeled subsets with automatically generated labels can be viewed as the noisy subset. In addition, since this method does not label all images, some uncertain images are removed and the noisy subset is smaller than the original subset. The number of annotations obtained for the unlabeled images of the different datasets is listed in Table 2.

**Table 2.** Number of samples contained in the different subsets. The unlabeled subset is 5/6 of the entire training set. The pseudo-labeled subset is generated from the unlabeled subset by the automatic labeling method trained on the clean labeled subset (i.e., 1/6 of the entire training set).


#### 4.1.2. Baselines and Model Variants

To evaluate the performance improvement of NLD, our approach is compared with pseudo-labeling methods [6]. Several related baselines are also provided for symmetric and asymmetric noise. In addition, NLD is used as the base model for several variants to verify its effectiveness. The details of the baselines and variants are as follows.

**Baseline-Clean**: A backbone network of the student model is trained for remote sensing scenes classification using the clean subset. This can be regarded as the lower bound of NLD. Our method uses the noisy subset to improve performance on this baseline.

**Baseline-Noise**: A backbone network of the student model is trained solely on noisy labels from the training set. This baseline can be viewed as a measurement of the quality of noisy labels.

**Baseline-Mix**: A backbone network of the student model is trained using a mix of clean and noisy labels with standard CE loss. This baseline shows the damage caused by noisy subsets.

**SCE Loss**: Under the supervision of SCE loss, a model is trained on the entire dataset with both clean and noisy labels. The parameters for SCE are configured as *α* = 0.1 and *β* = 1.0. This is a baseline for a noise-robust method.

**Noise model fine-tune with clean labels (Clean-FT)**: It is a common approach, which uses the clean subset directly to fine-tune the whole network of Baseline-Noise. This method is prone to overfit if there are few clean samples.

**Noise model fine-tune with mix of clean and noisy labels (Mix-FT)**: To address the problem caused by limited clean labels, fine-tuning the Baseline-Noise with mixed data is also a common approach.

**NLD with CE loss (NLD)**: NLD is trained on both the original clean datasets and on datasets with different noise ratios. For a completely clean dataset, the same image is used as input for both the teacher and the student, which is close to DML.

#### 4.1.3. Experimental Settings

All experiments are implemented with the PyTorch framework [35] and conducted on an NVIDIA GeForce Titan X GPU. The networks used in our experiments are listed in Table 3; they are all pre-trained on ImageNet. Although the VGG architecture has a larger number of parameters and needs more floating point operations (FLOPs), the ResNet architecture has stronger feature representation capabilities based on residual modules. Therefore, the teacher networks in all experiments use the ResNet architecture. For the UC Merced Land-use dataset, it is worth mentioning that SSGA-E [6] uses VGG-S and VGG-16; however, in our experiments, networks with the VGG architecture overfit because this dataset is small, so the networks actually used are modified VGG architectures with BN. As preprocessing, random flipping, random Gaussian blur and resizing of images to 224 × 224 are used. For optimization, we use Adam with a weight decay of 10<sup>−2</sup>, a batch size of 32 and an initial learning rate of 10<sup>−4</sup>. The learning rate decreases according to an exponential decay with a multiplicative factor of 0.98 in each epoch. All networks mentioned in Section 4.1.2 are trained for 200 epochs. Besides, for NLD, each batch of images is half clean and half noisy. In general, the weight factors are set to *α* = 10 and *β* = 2. Additional experiments with different factors, losses and networks are detailed in Section 4.6.
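The optimizer and schedule described above can be set up as in the following sketch; the small linear layer stands in for the actual teacher, student and decision-network parameters.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(512, 21)  # stand-in for the combined NLD parameters
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = ExponentialLR(optimizer, gamma=0.98)  # lr x 0.98 after each epoch

for epoch in range(200):
    # ... iterate over batches of 32 images, half clean and half noisy ...
    scheduler.step()
```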


**Table 3.** Comparison of the network architectures used.

#### *4.2. Results on UC Merced Land-Use*

The results on the original UC Merced Land-use dataset without any label noise and on UC Merced Land-use with symmetric label noise are reported in Table 4. Two confusion matrices for the noise-free UC Merced Land-use are shown in Figures 7 and 8, respectively. It is noticeable that the student network (ResNet-34) benefits significantly from NLD when learning from the original noise-free dataset; NLD can therefore also be regarded as a model distillation-like process that requires no additional data or pre-trained models. For symmetric noise, the noisy labels are completely random, and there is little correct information for NLD to distill from the noisy subset. Our method still improves the performance and robustness of the student network in most cases. For the $D_c : D_n$ = 8 : 2 and $D_c : D_n$ = 2 : 8 cases, the results reveal that when the clean subset $D_c$ or the noisy subset $D_n$ is too small (e.g., 252 samples), the clean labels or the randomly generated labels are too weak to bootstrap the performance. Instead of improving performance, the other common approaches even hurt it. When the label quality of the noisy subset is extremely low, a lot of erroneous guidance is provided. Specifically, the different fine-tuning methods require a model pre-trained on the noisy subset, which may provide worse initialization than the ImageNet [3] pre-trained model. If the two subsets are mixed, the noisy labels act as adversarial examples that confuse the network. SCE and other noise-robust methods can alleviate this problem, but their performance is still far from that of methods that exploit the small number of clean labels available.

Table 5 shows the results for asymmetric label noise. This noise is closer to real scenarios, similar to crowd-sourced labeling or data crawled from the Internet. According to the results of Baseline-Noise, such labels provide a more valuable pre-trained model than labels with symmetric noise. Clean-FT and Mix-FT provide clear improvements over Baseline-Clean and Baseline-Mix, respectively. However, for the mix-based methods, the learning process on the clean subset is continuously misguided by the noisy labels during training. As the noise ratio increases and the clean ratio decreases, the smaller clean subset struggles to counteract the larger amount of noisy data, and the performance of Mix-FT and SCE Loss is severely impaired. NLD, in contrast, maintains better performance with fewer clean labels and more noisy labels. When $D_c : D_n$ goes from 2 : 8 to 8 : 2, the performance of the model changes by only 1.62%. It is particularly noteworthy that when $D_c : D_n$ = 2 : 8, NLD exceeds Baseline-Clean by 6.67%.

**Figure 7.** The confusion matrix of Baseline-Clean with full UC Merced Land-use dataset.

**Figure 8.** The confusion matrix of NLD with full UC Merced Land-use dataset.


**Table 4.** Classification accuracy (%) on the UC Merced Land-use test set for different methods trained with the original noise-free dataset and symmetric label noise. We report the mean and standard error across 5 runs.

**Table 5.** Classification accuracy (%) on the UC Merced Land-use test set for different methods trained with asymmetric label noise. We report the mean and standard error across 5 runs.


#### *4.3. Results on NWPU-RESISC45*

In this experiment, NLD is tested on NWPU-RESISC45 with different noise types. Table 6 summarizes the classification accuracy (%) of ResNet-34 trained with/without NLD. According to Baseline-Noise, asymmetric noise provides more correct information here because NWPU-RESISC45 is larger than UC Merced Land-use; thus, Clean-FT can benefit from asymmetric noisy labels. However, the performance of the other methods is still compromised by the noise. On the contrary, NLD shows strong robustness and can benefit from different ratios and types of noisy labels. In terms of test set accuracy, NLD clearly improves upon the baselines and direct fine-tuning. Figures 9 and 10 show the confusion matrices of NLD and Baseline-Clean on the original NWPU-RESISC45. It can be observed that NLD improves the performance of the student network as a method of model distillation.

**Table 6.** Classification accuracy (%) on the NWPU-RESISC45 test set for different methods.


**Figure 10.** The confusion matrix of NLD for the NWPU-RESISC45.

#### *4.4. Results on AID*

Next, the performance of NLD is evaluated on the AID dataset; Table 7 shows the results. As the classes of AID are imbalanced, using noisy labels is more challenging. It can be observed that all methods are significantly affected by symmetric noise, especially when the noise rate increases. In contrast, asymmetric noise can change the imbalance of the data distribution; as a result, NLD can benefit from asymmetric noisy labels and improve performance. The gap between NLD and Baseline-Clean becomes especially apparent as the noise rate increases, so our method can be applied to scenarios with more noisy labels. For example, when the asymmetric noise ratio $D_c : D_n$ is 2 : 8, NLD obtains 2.3% higher accuracy than Baseline-Clean and 3.35% higher than Clean-FT. The confusion matrices for the AID dataset with asymmetric noise of $D_c : D_n$ = 2 : 8 are shown in Figures 11 and 12; the results of NLD are significantly better than those of Baseline-Clean.

**Table 7.** Classification accuracy (%) on the AID test set for different methods.

| Methods | Network Types | None (*D<sup>c</sup>* : *D<sup>n</sup>* = 10 : 0) | Symm (6 : 4) | Symm (4 : 6) | Asym (4 : 6) | Asym (2 : 8) |
|---|---|---|---|---|---|---|
| Baseline-Clean | ResNet-34 | 96.30 | 95.70 | **94.95** | 95.10 | 92.95 |
| Baseline-Noise | ResNet-34 | - | 6.85 | 4.70 | 59.62 | 59.57 |
| Mix-FT | ResNet-34 | - | 20.99 | 11.49 | 77.71 | 68.77 |
| Clean-FT | ResNet-34 | - | 83.96 | 12.39 | 94.15 | 91.90 |
| NLD | ResNet-50+ResNet-34 | **96.35** | **95.70** | 93.60 | **95.90** | **95.25** |

**Figure 11.** The confusion matrix of Baseline-Clean for the AID dataset with asymmetric noise of *D<sup>c</sup>* : *D<sup>n</sup>* = 2 : 8.

**Figure 12.** The confusion matrix of NLD for the AID dataset with asymmetric noise of *D<sup>c</sup>* : *D<sup>n</sup>* = 2 : 8.

#### *4.5. Comparison with Pseudo-Labeling*

We explore pseudo-labeling on UC Merced Land-use, NWPU-RESISC45, and AID. For all datasets, one-sixth of the training images per class are randomly selected as labeled, and the rest of the images are treated as unlabeled; a minimal sketch of this split follows. The experiments are compared with three pseudo-labeling strategies: (1) traditional self-training with a single network; (2) traditional co-training with two networks; (3) SSGA-E [6] with three networks.
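
The snippet below is a minimal sketch of this labeled/unlabeled split; the function and variable names are ours, and the exact sampling used in the experiments may differ.

```python
import numpy as np

def split_labeled_unlabeled(labels, labeled_fraction=1 / 6, seed=0):
    """Mark roughly `labeled_fraction` of the samples of each class as labeled.

    Returns boolean masks (labeled, unlabeled) over the training set.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    labeled = np.zeros(labels.shape[0], dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_lab = max(1, int(round(idx.size * labeled_fraction)))
        labeled[rng.choice(idx, size=n_lab, replace=False)] = True
    return labeled, ~labeled

# e.g., with 90 training images per class this marks 15 of them as labeled
```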

Tables 8 and 9 show the results from Han et al. [6], supplemented with our results. NLD achieves the best overall accuracy in all cases. For UC Merced Land-use, ResNet-34 is more effective as a student network when there is less unlabeled data, whereas VGG-16 performs better as the student when the entire unlabeled subset is leveraged. With a larger amount of labeled data (e.g., NWPU-RESISC45), the improvement of our framework is greater. This confirms that NLD benefits pseudo-labeling scenarios.

**Table 8.** The effect of the unlabeled sample ratio on accuracy for the UC Merced Land-use test set reported by Han et al. [6], supplemented with our results.



**Table 9.** Comparison with results on the NWPU-RESISC45 and AID test set reported by Han et al. [6].

#### *4.6. Additional Experiments*

In this section, we study the importance of hyper-parameters and investigate the effect of changing components to provide additional insight into NLD.

Table 10 presents the following four experiments on UC Merced Land-use: (a) NLD with the weight factors *α* = 10 and *β* = 2. (b) NLD with the weight factors *α* = 2 and *β* = 10. (c) Using two identical networks as student and teacher. (d) For the noisy teacher network, the CE loss is replaced by the SCE loss.

**Table 10.** Classification accuracy (%) on the UC Merced Land-use test set after changing each module from our model.


**Hyper-parameters**: From Table 10, the hyper-parameter settings have a significant effect on the performance of NLD. As *α* decreases and *β* increases, the student network learns more information from noise distillation. Since the information in symmetric noise labels is limited, a larger *β* cannot make the teacher network distill more knowledge; in such cases, the network performance can be degraded by incorrect guidance. Asymmetric noise labels, in contrast, carry more correct information, so a larger *β* enhances the teacher's ability to distill correct guidance to the student. In the absence of noisy labels, the effect of the factors is not significant. This suggests that, in practice, the factors should be selected according to the quality of the noisy labels; a rough sketch of this weighting is given below.
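
As a rough illustration only (not the exact NLD objective), the two factors can be thought of as weighting a clean-supervision term against a distillation term. Everything in the snippet below, including the form of the distillation term and the function name, is a simplifying assumption.

```python
import torch.nn.functional as F

def weighted_student_loss(clean_logits, clean_targets, distill_term,
                          alpha=2.0, beta=10.0):
    """Hypothetical weighting of a clean-label term and a distillation term.

    `distill_term` stands in for whatever signal the noisy teacher passes to
    the student (in NLD this goes through the decision network, not shown).
    alpha scales the clean cross-entropy, beta scales the distillation signal,
    mirroring the roles of the factors ablated in Table 10.
    """
    return alpha * F.cross_entropy(clean_logits, clean_targets) + beta * distill_term
```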

**Distillation with the same network**: As shown in Table 10, we perform experiments with ResNet-34 as both teacher and student. In general, a teacher network with a smaller capacity can still benefit the student network. However, in noise-free scenarios it has no effect, because the teacher and student share the same input and architecture and it is difficult to obtain extra knowledge. Moreover, the larger standard deviation of most results implies worse robustness. Therefore, a large teacher network is still the better option, although a small teacher network remains a possibility in low-cost scenarios.

**Training the teacher with a different loss**: SCE can supervise the network to learn more of the information in the noisy labels (i.e., more errors in symmetric noise or more correct information in asymmetric noise). For fully clean data, there is little additional benefit from SCE. This property explains the results in Table 10. Therefore, for most real applications, SCE should be used instead of CE for the teacher to achieve better performance with NLD.
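
For reference, a common formulation of the SCE loss combines the standard cross entropy with a reverse cross entropy whose log(0) entries are clamped to a constant. The sketch below uses this common formulation with illustrative weights, which may differ from the settings used in the experiments above.

```python
import torch
import torch.nn.functional as F

def sce_loss(logits, targets, num_classes, alpha=0.1, beta=1.0, A=-4.0):
    """Symmetric cross entropy: alpha * CE + beta * reverse CE (RCE).

    RCE swaps the roles of prediction and label; log(0) for the zero entries
    of the one-hot label is clamped to the constant A. alpha, beta, A are
    common defaults, not necessarily the values used in this paper.
    """
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=1)
    log_label = torch.where(F.one_hot(targets, num_classes).bool(),
                            torch.zeros_like(pred), torch.full_like(pred, A))
    rce = -(pred * log_label).sum(dim=1).mean()
    return alpha * ce + beta * rce
```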

#### **5. Conclusions**

This work proposes an efficient framework named NLD to address the noisy label problem in remote sensing image scene classification. NLD can distill knowledge from different types of noise to improve the performance of networks. The teacher network avoids overfitting to the noise through consistent decisions with the student network. A decision network is introduced to replace the KL divergence used in previous distillation methods. The proposed NLD framework is end-to-end and does not require any pre-training beyond ImageNet. Thus, NLD is more practical and easier to deploy.

NLD can fully leverage the information contained in noisy labels to improve the performance of a network trained on clean labels. Experiments are conducted on UC Merced Land-use, NWPU-RESISC45, and AID with different noise types. NLD improves over the baseline and direct fine-tuning. It can also be easily extended to pseudo-labeling, where NLD performs significantly better than SSGA-E and other methods. For completely clean datasets, NLD can also improve accuracy as a model-distillation-like process.

Future work will explore real-world noisy datasets. More data with noisy labels can be collected from search engines, Google Earth, etc. Furthermore, mixing multiple existing public datasets into a clean dataset is also a worthwhile experiment. Our goal is to apply NLD to real scenarios.

**Author Contributions:** Conceptualization, R.Z., Z.C. and T.L. ; methodology, R.Z., Z.C. and T.L.; software, R.Z. and Z.C.; validation, R.Z., Z.C. and T.L.; formal analysis, R.Z. and F.S.; investigation, R.Z. and S.Z.; resources, T.L.; data curation, R.Z.; writing—original draft preparation, R.Z.; writing—review and editing, Z.C. and G.Z.; visualization, R.Z. and Q.Z.; supervision, T.L.; project administration, T.L.; funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Youth Innovation Promotion Association, Chinese Academy of Sciences (Grant No. 2016336).

**Acknowledgments:** The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Technical Note* **Feature-Level Fusion of Polarized SAR and Optical Images Based on Random Forest and Conditional Random Fields**

**Yingying Kong 1,\* , Biyuan Yan <sup>1</sup> , Yanjuan Liu <sup>1</sup> , Henry Leung <sup>2</sup> and Xiangyang Peng <sup>3</sup>**


**Abstract:** In terms of land cover classification, optical images have been proven to have good classification performance. Synthetic Aperture Radar (SAR) has the advantage of working at all times and in all weather conditions, and it has more significant advantages over optical images for the recognition of some scenes, such as water bodies. One of the current challenges is how to fuse the benefits of both to obtain more powerful classification capabilities. This study proposes a classification model based on random forest with conditional random fields (CRF) for feature-level fusion classification using features extracted from polarized SAR and optical images. In this paper, feature importance is introduced as a weight in the pairwise potential function of the CRF to improve the correction rate of misclassified points. The results show that the dataset combining the two provides significant improvements in feature identification compared with datasets using optical or polarized SAR image features alone. Among the four classification models used, the random forest-importance\_conditional random fields (RF-Im\_CRF) model developed in this paper obtained the best overall accuracy (OA) and Kappa coefficient, validating the effectiveness of the method.

**Keywords:** polarized SAR; optical image; random forest; conditional random fields; feature-level fusion

#### **1. Introduction**

The impact of urban development on the Earth's environment is enormous, leaving an ever-changing imprint on its surface. This situation creates a compelling need to map land cover and review the land-use patterns of our dynamic ecosystem over time [1]. Polarized Synthetic Aperture Radar (SAR) and optical images have gained many applications in land cover classification [2–5]. Since the two have entirely different physical properties, each has distinct advantages in classification. For example, optical images are sensitive to differences in the vegetation spectrum and are, therefore, often used to detect pest and disease problems [6]. SAR images offer high accuracy and purity in detecting water areas, but extracting sharp edges is still a challenge [7]. Therefore, how to fully utilize the advantages of both is one of the major topics currently faced.

Data fusion is a way to take full advantage of multiple sources of data. The data fusion stage (pixel-level, feature-level, or decision-level) determines the data fusion techniques [8]. Feature-level fusion consists of two critical processes: image feature extraction and feature merging. In this regard, Aswatha et al. [1] used multimodal information from multispectral images and polarized SAR data to classify land cover into seven classes in an unsupervised manner. Su [9] extracted the backward scattering features and grey-level co-occurrence matrix (GLCM) features obtained from the Pauli decomposition and H/A/alpha decomposition of polarized SAR images, together with the spectral features and GLCM features of multispectral images, and used a support vector machine (SVM) for classification.

**Citation:** Kong, Y.; Yan, B.; Liu, Y.; Leung, H.; Peng, X. Feature-Level Fusion of Polarized SAR and Optical Images Based on Random Forest and Conditional Random Fields. *Remote Sens.* **2021**, *13*, 1323. https://doi.org/10.3390/rs13071323

Academic Editor: Monidipa Das

Received: 1 March 2021 Accepted: 24 March 2021 Published: 30 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

This fusion method effectively improves the classification accuracy and reduces the salt-and-pepper noise.

Land cover classification is one of the critical applications of remote sensing images. The traditional land cover classification method is divided into two steps: feature extraction and classifier training [10].

The feature extraction for optical images is based on spectral and textural features. A textural feature is a comprehensive reflection of the image greyscale statistical information, spatial distribution information, and structural information. Commonly used textural feature classification algorithms include a local binary pattern (LBP) [11], GLCM [12], etc. Polarized SAR feature extraction is based on polarized target decomposition, which aims to decode the scattering mechanism of the feature under a reasonable physical constraint model [13], such as Freeman-Durden decomposition [14], Yamaguchi decomposition [15], etc.

Machine learning has achieved considerable progress in classification and regression tasks. Commonly used machine learning methods include SVM, decision trees, and random forests. In current research, SVM has been used extensively. For example, Attarchi [16] used SVM to classify polarized SAR data and its GLCM features for the detection of impervious surfaces. While SVM classifies samples by finding hyperplanes, decision trees classify samples by selecting the optimal attributes and dividing the subsets into the corresponding leaf nodes based on the features. Phartiyal et al. [17] used an evolutionary genetic algorithm to optimize an empirical model to maximize the classification performance; they constructed a decision tree based on the best class boundaries and obtained satisfactory classification results. Random forest is an ensemble learning model based on decision trees, which obtains the final result by combining and analysing multiple decision trees [18]. Du et al. [19] extracted the polarization and texture features of fully polarized SAR images for random forest and rotation forest classifiers. Their experiments verified that the random forest is better than the Wishart and SVM classifiers and is less accurate than the rotation forest but faster.

In image processing, conditional random fields (CRF) have unique advantages in expressing the spatial context and in posterior probability modelling [20]. Zhong et al. [21] proposed the spatial-spectral-emissivity land-cover classification based on conditional random fields (SSECRF) algorithm, which integrates the spatial-spectral feature set and emissivity by constructing the SSECRF energy function to obtain better classification results. CRF allows target classes to be processed in conjunction with neighbourhood information, effectively improving the purity of the classification results, which is missing from conventional machine-learning classifiers.

This article proposes an RF-Im\_CRF classification model to improve the accuracy of the random forest classifier in feature-level fusion classification. The model first extracts the spectral and GLCM features of the optical images and the Freeman decomposition and Polarization Signature Correlation Feature (PSCF) features of the polarized SAR images. These features are then assembled into a training dataset for the random forest. Afterward, the random forest classifier results are input into the Im\_CRF model, which uses the feature importance from the random forest as the weight information in the pairwise potential function to improve feature classification accuracy.

#### **2. Materials**

#### *2.1. Study Site*

The location selected for this study is in Nanjing and its surrounding area, which is located in Jiangsu Province in Eastern China. Figure 1 shows the optical and polarized SAR false-colour images of the study area. The false-colour image is generated based on the Pauli decomposition. The images are 1500 × 1500 pixels in size and include river, buildings, vegetation, and roads. The image resolution is 8 metres, so the total size of the study area is about 169 km<sup>2</sup>. The built-up area occupies the majority of the image, the vegetation area is relatively concentrated, and there is a small amount of vegetation within the building space. The cultivated area is concentrated in the northern part of the river. A clear colour difference can be observed in the optical image between the dense vegetation area and the cultivated area. The colour of the river part is not sufficiently uniform, which is similar to the farmland in some areas. In contrast, the river area of the SAR false-colour image is clearly different from other regions. Therefore, polarized SAR has apparent advantages in identifying the river category.

**Figure 1.** Study area. (**a**) The optical image. (**b**) The polarized SAR false-colour image.

The dataset used for research is the polarized SAR data collected by the RADARSAT-2 satellite, which has four polarization states: HH, VV, HV, and VH. These data were acquired on 19 April 2011 at a resolution of 8 m. The optical image resolution is 5 m, and the acquisition time is April 2017. Due to the relatively low resolution and the fact that the two acquisition times fall within the same month of the year, the variation in ground objects is within manageable limits. In the ENVI software, the optical image was down-sampled to a resolution of 8 m, and the polarized SAR image underwent preprocessing such as multi-looking and noise reduction. The two images were calibrated in the same geographic coordinate system.

#### *2.2. Sampling Point Selection*


The sampling point coordinates in the experiment were taken with the optical image as a reference. Overall, five land cover categories were considered, namely Water, Building, High vegetation, Low vegetation, and Road. The high vegetation is dominated by tall forests and the low vegetation is dominated by agricultural land. Since the image resolution is 8 m, some narrow roads cannot be clearly represented, especially in the SAR image. This paper, therefore, chose to sample roads with larger widths, such as motorways and arterial roads. Because of the massive amount of source image data, it is not easy to classify the entire image finely. Therefore, 100 training samples per class and 150 test samples per class were chosen for this experiment, as shown in Table 1. The totals of training and test samples are 500 and 750, respectively, with no duplicate points.


**Table 1.** Sample label type and quantity.

#### **3. Characteristic Data Acquisition**

#### *3.1. Polarization Feature Extraction*

For the extraction of polarized SAR image features, this experiment selected two polarization feature extraction methods known as the Freeman-Durden decomposition and the PSCF.

#### 3.1.1. Freeman-Durden Decomposition

The Freeman-Durden polarization decomposition method is based on the fundamental principle of radar scattering, which decomposes the SAR cross-covariance matrix into canopy scattering (or volume scattering), odd bounce scattering (or surface scattering), and double-bounce scattering (or dihedral scattering). The detailed description of the modelling process for the composite scattering model can be found in Reference [22]. This model can acquire the characteristic parameters related to the three scattering mechanisms and the corresponding weight coefficients.

The powers corresponding to the three scattering mechanisms are Ps, Pd, and Pv, where Ps corresponds to the power of surface scattering, Pd represents the power of dihedral scattering, and Pv represents the power of volume scattering. Then, the Freeman feature vector of the target points can be established.

$$X\_{Freeman} = \begin{bmatrix} x\_i^{Pd}, \ x\_i^{Ps}, \ x\_i^{Pv} \end{bmatrix}^T \tag{1}$$

#### 3.1.2. Polarization Signature Correlation Feature (PSCF)

Radar polarization signatures (PSs) can effectively characterize the scattering behaviour of the research object, so it has the potential to distinguish the types of ground objects. This feature is usually a three-dimensional representation of the backscattering behaviour of a target or land cover. In the expression of PSs, the *x*-axis and *y*-axis represent the ellipse angle and azimuth angle, respectively, and the *z*-axis represents the received backscattering power coefficient. The value range of the azimuth angle (*ψ*) is −90 to 90 degrees, and the value range of the ellipse angle (*χ*) is −45 to 45 degrees. The following formula gives the PSs.

$$\sigma(\chi\_i, \psi\_i, \chi\_j, \psi\_j) = \frac{4\pi}{k^2} \begin{pmatrix} 1 \\ \cos 2\chi\_i \cos 2\psi\_i \\ \cos 2\chi\_i \sin 2\psi\_i \\ \sin 2\chi\_i \end{pmatrix}^T K \begin{pmatrix} 1 \\ \cos 2\chi\_j \cos 2\psi\_j \\ \cos 2\chi\_j \sin 2\psi\_j \\ \sin 2\chi\_j \end{pmatrix} \tag{2}$$

Among them, *σ* represents the backscattering coefficient or received power, the subscripts *i* and *j* mean the transmitting and receiving units, respectively, and *K* is the Kennaugh matrix [23]. *k* is the wave number of the illuminating wave.

The co-polarized signatures are obtained by transmitting and receiving combination *ψ<sup>i</sup>* = *ψ<sup>j</sup>* , *χ<sup>i</sup>* = *χ<sup>j</sup>* , and the cross-polarized signatures are obtained by *ψ<sup>i</sup>* = 90 + *ψ<sup>j</sup>* , *χ<sup>i</sup>* = −*χ<sup>j</sup>* . The ellipse angle defines the polarization behaviour (linear polarization, circular polarization, or elliptical polarization), and the azimuth angle defines the polarization states, that is, horizontal or vertical polarization [24]. In the current research, the characteristics of co-polarized and cross-polarized signatures have been fully considered and utilized.
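
The sketch below samples Equation (2) on a (*χ*, *ψ*) grid for a given 4 × 4 Kennaugh matrix; the grid spacing and the function names are illustrative choices rather than details from the paper.

```python
import numpy as np

def stokes_vector(chi_deg, psi_deg):
    """Normalized Stokes vector for ellipse angle chi and azimuth psi (degrees)."""
    chi, psi = np.deg2rad(chi_deg), np.deg2rad(psi_deg)
    return np.array([1.0,
                     np.cos(2 * chi) * np.cos(2 * psi),
                     np.cos(2 * chi) * np.sin(2 * psi),
                     np.sin(2 * chi)])

def polarization_signature(K, k_wave, co_pol=True,
                           psi_grid=np.arange(-90, 91, 5),
                           chi_grid=np.arange(-45, 46, 5)):
    """Evaluate Equation (2) on a (chi, psi) grid for a 4x4 Kennaugh matrix K.

    For the co-polarized signature the transmit and receive states coincide;
    for the cross-polarized signature psi_i = 90 + psi_j and chi_i = -chi_j.
    """
    sig = np.empty((chi_grid.size, psi_grid.size))
    for a, chi in enumerate(chi_grid):
        for b, psi in enumerate(psi_grid):
            s_rx = stokes_vector(chi, psi)
            s_tx = s_rx if co_pol else stokes_vector(-chi, 90 + psi)
            sig[a, b] = 4 * np.pi / k_wave ** 2 * s_tx @ K @ s_rx
    return sig
```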

Since surface objects generally exhibit a complex scattering response, the polarization signatures of standard targets must be used as a reference for classification. Therefore, PSs have been calculated for flat plate (FP), horizontal dipole (HD), vertical dipole (VD), and a dihedral angle (Di) in the standard targets. The formulae for the generation of the standard target PSs are given in Reference [25].

Therefore, the PSCF uses the radar polarization signatures of the four standard scatterers (FP, HD, VD, and Di) as a reference to calculate the relevance between the polarization characteristics of the target points and the above four standard targets. This can serve as a reference to distinguish between different categories. The correlation coefficient formula is as follows.

$$\text{CC} = \frac{\text{S}\_{xy}}{\text{S}\_{x}\text{S}\_{y}} \tag{3}$$

where *x* and *y* are the polarized characteristics of the target points and the standard targets, respectively. *S<sup>x</sup>* is the standard deviation of x, *S<sup>y</sup>* is the standard deviation of y, and *Sxy* is the covariance between *x* and *y*. *CC* is the correlation coefficient between x and y.
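
A minimal sketch of Equation (3), assuming the two polarization signatures are available as arrays:

```python
import numpy as np

def pscf_correlation(target_ps, standard_ps):
    """Correlation coefficient of Equation (3) between a target's polarization
    signature and that of a standard scatterer (both flattened to 1-D)."""
    # equivalent to np.corrcoef(x, y)[0, 1]
    x, y = np.ravel(target_ps), np.ravel(standard_ps)
    return np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
```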

This paper refers to Reference [17] to obtain the PSCF solution and establish the feature correlation coefficients between a single target and the four standard targets, which are Corr\_co\_Di, Corr\_co\_FP, Corr\_co\_HD, Corr\_co\_VD, Corr\_cross\_Di, Corr\_cross\_FP, Corr\_cross\_HD, and Corr\_cross\_VD. Among them, co denotes co-polarization and cross denotes cross-polarization. Thus, the PSCF feature vector of the target point is established as:

$$X\_{PSCF} = \begin{bmatrix} x\_i^{\text{corr\\_co\\_Di}}, x\_i^{\text{corr\\_co\\_FP}}, x\_i^{\text{corr\\_co\\_HD}}, x\_i^{\text{corr\\_co\\_VD}}, x\_i^{\text{corr\\_cross\\_Di}}, x\_i^{\text{corr\\_cross\\_FP}}, x\_i^{\text{corr\\_cross\\_HD}}, x\_i^{\text{corr\\_cross\\_VD}} \end{bmatrix}^T \tag{4}$$

#### *3.2. Optical Image Feature Extraction*

3.2.1. Spectral Information Extraction

Compared with multispectral images, the optical image does not have rich spectral information, but it is also sufficient to identify information with significant spectral differences. This optical image can be divided into three bands: red, green, and blue, so the spectral feature information is shown as follows.

$$X\_{Spectral} = \begin{bmatrix} x\_i^{r}, \ x\_i^{g}, \ x\_i^{b} \end{bmatrix}^T \tag{5}$$

#### 3.2.2. Grey-Level Co-Occurrence Matrix (GLCM)

The textural feature is a visual feature that does not depend on brightness and colour, reflecting similar information of adjacent pixels in the image. It reflects the internal characteristics shared by the surface of the object. It contains essential information about the surface structure of the object and the relationship to its neighbours.

GLCM is a commonly used method for extracting texture information with good discrimination ability. Its principle is to convert the specified spatial relationship in the image into texture information based on the greyscale value. The texture features obtained by GLCM are helpful to distinguish objects with similar spectral characteristics.

In this paper, three features are chosen to describe the spatial relationships of images: contrast, dissimilarity, and energy. Contrast and dissimilarity can measure the local variation and reflect the sharpness of the image and the depth of the texture. The energy is the sum of the squares of element values of the GLCM, demonstrating the uniformity of the image greyscale distribution and the texture thickness. The GLCM feature information is expressed as follows.

$$X\_{GLCM} = \begin{bmatrix} x\_i^{contrast}, \ x\_i^{dissimilarity}, \ x\_i^{energy} \end{bmatrix}^T \tag{6}$$
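
A minimal sketch of extracting the three GLCM features of Equation (6) with scikit-image; the window size, distances, angles, and grey-level count are illustrative choices not specified here.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # spelled greycomatrix in older scikit-image

def glcm_features(window, distances=(1,), angles=(0.0,), levels=256):
    """Contrast, dissimilarity, and energy for one uint8 image window."""
    glcm = graycomatrix(window, distances=distances, angles=angles,
                        levels=levels, symmetric=True, normed=True)
    return np.array([graycoprops(glcm, "contrast").mean(),
                     graycoprops(glcm, "dissimilarity").mean(),
                     graycoprops(glcm, "energy").mean()])
```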

#### **4. Random Forest-Importance\_Conditional Random Fields (RF-Im\_CRF) Model**

Figure 2 is the flowchart of applying the RF-Im\_CRF model to the feature-level fusion of polarized SAR and optical images. After extracting the features of the two images, the random forest is first used for classification. Then, the classification results and feature importance of the random forest are combined with the CRF. The classification results are taken as the unary potential function and the feature importance is taken as the weight of the pairwise potential function to improve the classification accuracy.

**Figure 2.** RF-Im\_CRF model flowchart.


#### *4.1. Random Forest*

Random forests construct mutually independent decision trees, each of which generates its own training set by bootstrap resampling. Sampling is repeated M times from the original training set of N samples to obtain M training sets. Under bootstrap resampling, some samples may be chosen multiple times while others may not be drawn at all. M decision trees are then developed from these training sets. In the decision-making stage, the classification result is obtained by taking the mode of the tree outputs (or the average value for regression). The random forest can process large data sets with high efficiency and precision, filter explanatory variables by itself, and provide the mutual influence and importance ranking of variables.

The Gini index, or Gini impurity, indicates the probability that a randomly selected sample in the sample set will be misclassified. At each node of a binary tree T in the random forest, the optimal segmentation is sought according to the Gini index *i*(*τ*), which divides the sub-node data set. The random forest follows the principle of Gini gain maximization when selecting features for nodes [26]. Let *p<sup>k</sup>* be the probability of node *τ* being divided into child node *τ<sup>k</sup>*, *k* = 1, 2. Then the Gini index is:

$$i(\tau) = \sum\_{k=1}^{2} p\_k(1 - p\_k) = 1 - \sum\_{k=1}^{2} p\_k^2 \tag{7}$$

The Gini gain ∆*i* generated by splitting the samples through a certain threshold and sending them to two child nodes *τ*<sup>1</sup> and *τ*<sup>2</sup> is defined as:

$$
\Delta i(\tau) = i(\tau) - p\_1 i(\tau\_1) - p\_2 i(\tau\_2) \tag{8}
$$

Since the decision tree selects features that maximize the Gini gain of the node when generating nodes, the feature importance can be reflected by the sample division at the nodes. However, the random forest introduces double randomness in the data samples and input features during training, which may cause important features with high discrimination to be used for node division less frequently than features with low discrimination. Therefore, the importance of features cannot be measured simply by the number of times they are used as segmentation attributes [27,28].
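
A small numeric illustration of Equations (7) and (8), computing the Gini index from class counts at a node and the Gini gain of a candidate split; the helper names are ours.

```python
import numpy as np

def gini_index(class_counts):
    """Gini impurity i(tau) of Equation (7) from per-class sample counts at a node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - float(np.sum(p ** 2))

def gini_gain(parent_counts, left_counts, right_counts):
    """Gini gain of Equation (8) for a split into two child nodes."""
    n = float(np.sum(parent_counts))
    p1 = np.sum(left_counts) / n
    p2 = np.sum(right_counts) / n
    return gini_index(parent_counts) - p1 * gini_index(left_counts) - p2 * gini_index(right_counts)

# Example: 50/50 parent split into pure children gives the maximum gain of 0.5
print(gini_gain([50, 50], [50, 0], [0, 50]))
```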

#### *4.2. Conditional Random Fields*

The CRF model simulates the local neighbourhood interaction between random variables in the unified probability framework. Given the observed image data, the model directly models the posterior probability of the label as a Gibbs distribution.

The general form of the CRF model is:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left\{-\left[\sum\_{i \in V} \Phi\_i(y\_i, x\_i, w) + \beta \sum\_{(i,j) \in E} \Phi\_{ij}(y\_i, y\_j, x\_i, x\_j, v)\right]\right\} \tag{9}$$

Among them, *V* is for the set of data points and *E* is for the set of point neighbours.

*x<sup>i</sup>* and *y<sup>i</sup>* represent the observation variable of the *i*-th point in the data and its class label variable, respectively. *X* is the sequence of observations, *X* = [*x*<sup>1</sup>, . . . , *x<sup>i</sup>*, . . . , *x<sup>N</sup>*]. *Y* is the sequence of labels corresponding to *X*, *Y* = (*y*<sup>1</sup>, . . . , *y<sup>i</sup>*, . . . , *y<sup>C</sup>*), where *C* is the number of categories. *P*(*Y*|*X*) is the probability of the label sequence *Y* given the observation sequence *X*. *Z*(*X*) is the normalization constant, *Z*(*X*) = ∑<sub>*Y*</sub> exp{−∑<sub>*c*∈*C*</sub> *Φ<sup>c</sup>*(*y<sup>c</sup>*, *x*)}. *Φ<sup>i</sup>*(·) is the unary potential function, which represents the probability of the observed variable *x<sup>i</sup>* taking the label *y<sup>i</sup>*. *Φ<sup>ij</sup>*(·) is the pairwise potential function, which expresses the correlation between the variable *x<sup>i</sup>* and its neighbouring variables *x<sup>j</sup>* and the correlation between their labels. *w* and *v* represent the parameters of the correlation potential function and the interaction potential function, respectively. *β* adjusts the weight of the two potential function terms and determines the degree of influence of the pairwise function on the unary potential function. In this article, to simplify the implementation of the CRF, *β* is set to a constant 1.

Then the corresponding Gibbs energy is defined as:

$$E(Y|X) = -\log P(Y|X) - \log Z(X) = \sum\_{c \in C} \Phi\_c(y\_c, x) = \sum\_{i \in V} \Phi\_i(y\_i, x\_i, w) + \beta \sum\_{(i,j) \in E} \Phi\_{ij}(y\_i, y\_j, x\_i, x\_j, v) \tag{10}$$

According to the Bayesian maximum a posteriori (MAP) rule, image classification aims to find the label sequence *Y* that maximizes the posterior probability *P*(*Y*|*X*). Therefore, the CRF's MAP label *Y*<sub>MAP</sub> can be obtained by the following formula.

$$Y\_{MAP} = \arg\max\_{Y} P(Y|X) = \arg\min\_{Y} E(Y|X) \tag{11}$$

It can be seen that finding the maximum value of the posterior probability *P*(*Y*|*X*) is equivalent to finding the minimum value of the energy function *E*(*Y*|*X*). Therefore, the optimization algorithm finds the most probable label by finding the minimum energy solution.

#### *4.3. RF-Im\_CRF Model*

4.3.1. Establishment of Potential Functions

In this paper, the unary potential function *Φ<sup>i</sup>* is defined based on the classification results of the random forest classifier. For a variable *x<sup>i</sup>* and its label *y<sup>i</sup>*, when *y<sup>i</sup>* = *k*, ∀*k* ∈ *K* (*K* is the label set), the probability in Equation (12) is:

$$P(y\_i = k | \mathbf{x}\_i) = \frac{1}{M} \sum\_{m=1}^{M} \delta[T\_m(\mathbf{x}\_i, \theta\_m) = k] \tag{12}$$

M is the total number of decision trees. *θ<sup>m</sup>* is the independent and identically distributed parameter vector describing the *m*-th decision tree. Then, *P*( *y<sup>i</sup>* = *k*|*xi*) represents the probability that the target is of class *k*.

The CRF unary potential function is defined as:

$$\Phi\_i(y\_i, x\_i) = -\log P(y\_i | x\_i) \tag{13}$$
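
A minimal sketch of Equations (12) and (13) using scikit-learn, where `predict_proba` averages the per-tree class probabilities (an approximation of the tree-vote fraction of Equation (12) when the trees are grown to pure leaves); the function name is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def unary_potentials(rf: RandomForestClassifier, features: np.ndarray) -> np.ndarray:
    """-log P(y_i | x_i) per pixel, as in Equation (13)."""
    proba = rf.predict_proba(features)            # shape (n_pixels, n_classes)
    return -np.log(np.clip(proba, 1e-10, None))   # clip avoids log(0) for unvoted classes
```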

The pairwise potential function *Φ<sup>ij</sup>*(*y<sup>i</sup>*, *y<sup>j</sup>*, *x<sup>i</sup>*, *x<sup>j</sup>*, *v*), also called the smoothness term, encourages adjacent pixels of the image to take the same label. This article uses an improved contrast-sensitive Potts model that introduces the feature importance *η<sup>k</sup>* to define the pairwise potential function.

$$\Phi\_{ij}(y\_i, y\_j, x\_i, x\_j) = \begin{cases} 0 & \text{if } y\_i = y\_j \\ g\_{ij}(S) & \text{otherwise} \end{cases} \tag{14}$$

$$g\_{i\bar{j}}(\mathbf{S}) = \text{dist}(i, j)^{-1} \exp\left(-\sum\_{k=1}^{N} \eta\_k \gamma\_k \parallel \mathbf{X}\_i^k - \mathbf{X}\_{\bar{j}}^k \parallel^2 \right) \tag{15}$$

Among them, *g<sup>ij</sup>* simulates the spatial interaction of adjacent pixels *x<sup>i</sup>* and *x<sup>j</sup>* and is used to measure the feature difference between neighbours. *dist*(*i*, *j*) is the Euclidean distance between adjacent pixels, and *X<sup>k</sup><sub>i</sub>* and *X<sup>k</sup><sub>j</sub>* represent the *k*-th feature vectors of points *i* and *j*. *k* indexes the feature-vector component, namely *k* = 1, 2, 3, 4, corresponding to *XFreeman*, *XPSCF*, *XSpectral*, and *XGLCM*. *γ<sup>k</sup>* is set according to the mean squared difference of feature vectors between adjacent pixels in the image, denoted as *γ<sup>k</sup>* = ⟨‖*X<sup>k</sup><sub>i</sub>* − *X<sup>k</sup><sub>j</sub>*‖<sup>2</sup>⟩<sup>−1</sup>, where ⟨·⟩ denotes the mean value over the neighbourhood. The parameter *η<sup>k</sup>* is the feature importance in the classification process, obtained from the random forest.
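
A minimal sketch of the pairwise term of Equations (14) and (15), assuming the four feature blocks, their importances *η*, the normalisers *γ*, and the neighbour distance have already been computed; the argument names are ours.

```python
import numpy as np

def pairwise_potential(y_i, y_j, blocks_i, blocks_j, eta, gamma, dist_ij):
    """Contrast-sensitive Potts term: 0 for equal labels, g_ij(S) otherwise.

    blocks_i / blocks_j are the four feature blocks (Freeman, PSCF, spectral,
    GLCM) of two neighbouring pixels; eta holds the block importances from the
    random forest and gamma the per-block normalisers.
    """
    if y_i == y_j:
        return 0.0
    s = sum(e * g * float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
            for e, g, a, b in zip(eta, gamma, blocks_i, blocks_j))
    return np.exp(-s) / dist_ij
```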

#### 4.3.2. Feature Importance

In this paper, the statistic Im*<sup>i</sup>* is used as a feature importance measure based on the Gini index, representing the average change in the Gini index produced by the *i*-th feature in the node divisions of all decision trees. The importance of feature *x<sup>i</sup>* at node *n* is the change in the Gini index when the samples at node *τ* are divided into child nodes *τ*<sup>1</sup> and *τ*<sup>2</sup>:

$$Im\_{i,m,n} = i(\tau) - i(\tau\_1) - i(\tau\_2) \tag{16}$$

where *n* = 1, . . . , *N* represents the node index in one decision tree, and *m* = 1, . . . , *M* represents the decision tree index in the random forest. Suppose the feature *x<sup>i</sup>* is used as the node division attribute at N nodes of the *m*-th decision tree. Then the feature importance of *x<sup>i</sup>* in this decision tree can be expressed as:

$$Im\_{i,m} = \sum\_{n=1}^{N} Im\_{i,m,n} \tag{17}$$

The feature importance *x<sup>i</sup>* in the entire random forest is:

$$Im\_i = \frac{1}{M} \sum\_{m=1}^{M} Im\_{i,m} \tag{18}$$

The sum of the feature importance of each feature is 1.

For the parameter *η<sup>k</sup>*, the Freeman decomposition, PSCF features, spectral features, and GLCM features are regarded as four different feature components. Taking the spectral features as an example, the feature importance of this characteristic component is:

$$
\eta\_{Spectral} = Im\_r + Im\_g + Im\_b \tag{19}
$$

The four feature components extracted in this paper have different value ranges and numbers of elements. Since the normalization of features does not affect the random forest results, they are not normalized in feature extraction. However, in the CRF, this difference in value range affects the pairwise potential function. Therefore, the features are divided into four components so that features with a small value range are not overwhelmed by those with a large one. Since the importance of each feature is different, the higher the importance of a feature, the greater its influence on classification. Therefore, the parameters *η<sup>k</sup>* can further strengthen the feature differences between neighbours and improve classification accuracy.
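
A minimal sketch of Equation (19) with scikit-learn, summing the Gini-based importances over each feature block to obtain the *η<sup>k</sup>* weights; the block layout shown is illustrative, not the column order used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def component_importances(rf: RandomForestClassifier, block_slices):
    """Sum feature_importances_ over each feature block (Equation (19)).

    `block_slices` maps block names to column slices of the training matrix,
    e.g. {"Freeman": slice(0, 3), "Spectral": slice(3, 6),
          "GLCM": slice(6, 9), "PSCF": slice(9, 17)} -- illustrative layout.
    """
    imp = rf.feature_importances_   # already sums to 1 over all features
    return {name: float(np.sum(imp[s])) for name, s in block_slices.items()}
```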

#### **5. Experiment and Analysis**

#### *5.1. Multi-Source Data Comparative Classification Experiment*

First, to verify the advantages of image fusion in image classification, this paper used the random forest to perform classification experiments on the optical image data and the polarized SAR data. The optical image data contain a feature vector consisting of spectral and GLCM information, and the polarized SAR data include a feature vector consisting of Freeman and PSCF information. The number of decision trees in the random forest was set to 100; this value ensures that the random forest results are close to optimal and fluctuate only within a small range. The experimental results are shown as follows.

For classification tasks, the classification results can intuitively and clearly reflect the disparity between different features or different classification methods, especially when the distinction is significant. Figure 3 shows the classification results obtained by adopting different feature vectors. It can be seen that the characteristics of the optical image can better distinguish between high and low vegetation due to the apparent differences in their spectra. However, the reliance on spectral features also leads to many errors in the identification of water areas. Since the water surface tends to be specularly reflective, the backscatter from the water surface is almost zero, resulting in high accuracy of SAR image classification in water areas. At the same time, the working frequency band of RADARSAT-2 is C-band, which has a certain penetrability, making it difficult to distinguish the characteristic difference between high and low vegetation and thus presenting a mixture of dark green and light green. This penetrability is also reflected in the ability of the polarized SAR data to detect folds in the hills, which present features similar to buildings, leading to misinterpretations. Optical image features have certain advantages in terms of buildings, and it is difficult for both sources to obtain ideal results on roads.

The visual effect of the classification that combines polarized SAR and optical image features is significantly improved. The water area as well as the high and low vegetation are well inherited. Simultaneously, compared with the former two, the salt-and-pepper noise in the construction area is significantly reduced. Large areas of misjudgment are also hard to see, and the display of the road is improved. This indicates that the characteristics of polarized SAR and optical images both play a specific role in classification. Due to the similarity of the narrow river sections to the backscattering of the road, the SAR data misinterpret the river in the southwest region of the image. This situation is also shown in Figure 3c, which indicates that the features of the optical images still cannot fully correct the misclassification of the SAR images in this particular scene.

**Figure 3.** Multi-source data classification results. (**a**) Optical image classification result. (**b**) Polarized SAR classification result. (**c**) Optical + polarized SAR image classification result.

From the experimental results, it can be seen that the fused polarized SAR and optical image classification performance is significantly improved compared with the single-source image classification performance. However, there are still many noise points, which affect the smoothness of the classification result. The RF-Im\_CRF model proposed in this paper improves the classification results with respect to this phenomenon.

#### *5.2. Comparison of RF-Im\_CRF Model Experiment Results*

#### 5.2.1. Analysis of Classified Image Results

To verify the effectiveness of the algorithm in this paper, the experimental data were classified using an SVM based on a Poly kernel function, RF, RF-CRF without feature importance as weights [21], and the RF-Im\_CRF model, respectively. The experimental data are the feature vectors composed of the four features described in Section 3. The results are shown in Figure 4.

It can be seen that the SVM has the worst classification effect. The SVM is an independent classifier, so it follows a single rule when classifying. Random forests, on the other hand, rely on multiple mutually independent decision trees acting together, each with a different classification threshold. This means that the misclassification results of a single decision tree are corrected by the action of the other decision trees. As a result, random forests give better results.

Compared with the random forest classifier, the RF-CRF model significantly improves image smoothness, since the CRF eliminates most of the salt-and-pepper noise. The differences between the RF-CRF and RF-Im\_CRF models are difficult to see. Therefore, this paper extracted three scenes from the image for comparison to show the performance gap between the two models. The reference data are the optical image and the true classification results based on the optical image.

As shown in Figure 5, when compared with the RF-CRF model, the RF-Im\_CRF model can further reduce the salt-and-pepper noise in the image, and the smoothness is further improved. Since parking lots are set up around some large buildings, it is difficult for the classifier to balance between roads and buildings. Open places such as sports fields, squares, and roads appear as more white blocks in area 1, which represent the road class. Area 2 has lower category complexity and better homogeneity of vegetation, so there is less variation in the classification results. There are narrow roads in area 3 that were not chosen as samples during the sampling process, since they can hardly be distinguished, with low contrast between neighbours, in the SAR image. Therefore, they are misclassified as low vegetation in the classification result. The small white areas in the river are ships sailing on the river in the SAR image. The RF-Im\_CRF model is better than the RF-CRF model in identifying the riverbank portion on the left side, showing a relatively complete low vegetation zone.

**Figure 4.** Classification results with different classifiers. (**a**) SVM classifier result, (**b**) RF classifier result, (**c**) RF-CRF model result, and (**d**) RF-Im\_CRF model result.

The display of the classification results shows that, when compared with the RF-CRF model, the RF-Im\_CRF model further improves the classification accuracy, resulting in less noisy images and a further increase in purity. This is because the value ranges of the various features are diverse. For example, the value range of the spectral features is between 0 and 255, while the value range of the PSCF is between −1 and 1. The feature difference is calculated per feature component in the CRF, which helps reduce the overall influence of features with a wide value range. Simultaneously, after adding feature importance as weights, the impact of features with high importance on the feature differences between neighbours is enhanced. Therefore, the RF-Im\_CRF model can classify ground objects more accurately.


**Figure 5.** Sub-area classification results. The regions numbered 1, 2, and 3 are the corresponding subregions in Figure 1a.

#### 5.2.2. Classification Data Analysis


This paper quantified the classification effectiveness of each classification model through the Overall Accuracy (OA) and the Kappa coefficient, and analysed the various classification cases using precision and recall.


When the training set is the same, the SVM produces the same results in multiple experiments. In contrast, the random forest has a certain degree of randomness: even though the training set is the same, the results obtained in each training run are different. Therefore, we used the same dataset for ten consecutive tests on the random forest model and averaged the results. In each experiment, the RF, RF-CRF, and RF-Im\_CRF models use the same RF model results and differ only in the subsequent processing. The RF model was built with the Scikit-learn package in Python [29]. In each experiment, this paper extracted the feature importance and the per-class probability of all points. In the end, the evaluation indices, such as OA and the Kappa coefficient, were obtained for each model based on the classification results; a sketch of computing these indices is given below.
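
A minimal sketch of computing OA and the Kappa coefficient for one run with scikit-learn; the run predictions in the commented usage are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_run(y_true, y_pred):
    """Overall accuracy and Kappa coefficient for one test run."""
    return accuracy_score(y_true, y_pred), cohen_kappa_score(y_true, y_pred)

# Averaging over the ten repeated runs described above (run_predictions is hypothetical):
# oas, kappas = zip(*(evaluate_run(y_test, preds) for preds in run_predictions))
# print(np.mean(oas), np.mean(kappas))
```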

The OA and Kappa values and their 95% confidence intervals are shown in Table 2.



With the same test data and constant parameters, the results of the SVM are always consistent and, therefore, there are no confidence intervals. In terms of the quantitative data comparison, the RF-Im\_CRF model proposed in this paper has the best classification accuracy, with an average OA of 94.0% and a 95% confidence interval of [93.52%, 94.54%]. The Kappa coefficient is 0.91 with a 95% confidence interval of [0.902, 0.918]. Compared with SVM, RF, and RF-CRF, the OA increased by 15%, 6%, and 2.4%, respectively, and the classification reliability increased by 17%, 6%, and 2%, respectively. The reason is that SVM and RF classify single pixels, which are inevitably misclassified even with the inclusion of textural information. The CRF can use neighbourhood information to correct misclassified pixels, thereby improving the classification accuracy. The comparison of the above results shows that the RF-Im\_CRF model can further significantly reduce the noise generated in the random forest classification and improve the smoothness of images due to the correction capability of Im\_CRF.

In order to analyse the classification accuracy relationship between the categories, we give the experimental result data obtained in a single experiment, as shown in Table 3. In the absence of the CRF, the 95% confidence interval of each class of the random forest is basically between [*A* − 2%, *A* + 2%], where *A* represents the classification accuracy of each category. The bootstrap resampling of the random forest causes each decision tree to use a different training subset, which leads to differences in classification performance across the trees. With a large number of decision trees, the random forest itself is more accurate than the SVM method, but it inevitably introduces randomness, which results in slightly different classification results for each category. The number of test samples for each category is 150, which means that three samples classified differently between two experiments correspond to a 2% difference. The classification effect is further improved by the CRF, resulting in a 95% confidence interval between [*A* − 1%, *A* + 1%].


**Table 3.** Comparison of results of different classifiers.

It can be seen that the four models are more accurate in classifying water, high vegetation, and low vegetation than buildings and roads. The reason is that buildings have high complexity in both spectral and structural characteristics, while roads are more challenging to identify due to the low image resolution, their narrow extent, and their susceptibility to factors such as street trees. Of the two, roads are the most difficult to identify and the most error-prone category. This is because roads mostly lie between buildings, and at low image resolution the boundary between a road and a building becomes blurred. Moreover, the backscattering characteristics of buildings in SAR images can obscure the road to a certain extent, which has a negative impact on classification and makes roads more likely to be misclassified as buildings. At the same time, in mixed areas of multi-category features, low-resolution images significantly increase the complexity of categories, which makes the boundaries between categories difficult to distinguish. Therefore, how to effectively select features or improve image resolution to enhance the classification of buildings and roads, and to make more precise distinctions in mixed regions, will be the focus of subsequent research.

In terms of the model's operational efficiency, since the model proposed in this work needs to use neighbourhood information, neighbourhood pixels must be classified as well. In contrast, the original random forest classifier does not need to classify neighbourhood pixels. Therefore, the computational cost of this model is significantly higher than that of simpler classifiers such as SVM or random forest. The evaluation of computing efficiency and possible improvements of the algorithm from a computational point of view are in progress and will be the subject of follow-up work.

#### 5.2.3. Analysis of Feature Importance

This article also extracted the importance of each feature vector in the above ten experiments and averaged the values to obtain the results shown below.
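As a rough illustration of this procedure, the sketch below (assuming scikit-learn; `X`, `y`, and `feature_names` are placeholders for the feature matrix and the seventeen feature labels) averages the feature importances reported by a random forest over repeated runs.

```python
# Hedged sketch: averaging random-forest feature importances over repeated experiments;
# X, y, and feature_names are placeholders, not the data or features of this paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mean_feature_importance(X, y, feature_names, n_runs=10):
    importances = np.zeros((n_runs, X.shape[1]))
    for seed in range(n_runs):
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        rf.fit(X, y)
        importances[seed] = rf.feature_importances_
    mean_imp = importances.mean(axis=0)
    # Report importances as percentages, sorted from most to least informative.
    order = np.argsort(mean_imp)[::-1]
    return [(feature_names[i], 100.0 * mean_imp[i]) for i in order]
```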

As shown in Tables 4 and 5 and Figure 6, the feature importance of the Freeman decomposition components and the spectral features is higher than that of the others in the random forest classification. Among the individual feature vectors, the volume scattering component of the Freeman decomposition has the highest importance, followed by the blue component of the spectral features, although the differences between the spectral components are not significant. This is because, for all targets except water, the volume scattering component of the Freeman decomposition is generally higher than the surface scattering and dihedral scattering components; for water targets all three components are small, and under ideal conditions the scattering properties of road targets are similar to those of water. The volume scattering component therefore provides a good basis for distinguishing water areas and roads, which is why it has the highest feature importance. The recognition rate for roads is nevertheless not as high as for water areas because of the complex and narrow environment in which roads are located.

**Table 5.** Feature importance of each feature.

| Feature | Pd | Ps | Pv | R | G | B | G1 | G2 | G3 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Importance (%) | 7.35 | 5.91 | 20.53 | 8.94 | 9.15 | 11.96 | 3.11 | 2.84 | 7.49 | 3.93 | 2.93 | 2.14 | 2.01 | 3.95 | 2.30 | 2.74 | 2.74 |

G1 = contrast, G2 = dissimilarity, G3 = energy, P1 = Corr\_co\_Di, P2 = Corr\_co\_FP, P3 = Corr\_co\_HD, P4 = Corr\_co\_VD, P5 = Corr\_cross\_Di, P6 = Corr\_cross\_FP, P7 = Corr\_cross\_HD, P8 = Corr\_cross\_VD.

**Figure 6.** Feature importance ring graph.

**Table 4.** Feature importance of four characteristic components.



Except for energy, the GLCM and PSCF features have similar proportions, but the PSCF components are slightly higher, so the *η* value is relatively high. The feature importance reflects the contribution of each feature to the classification. The randomness of the random forest also affects the feature importance; therefore, the 95% confidence interval of the four characteristic components lies within [*A* − 1%, *A* + 1%]. Using this contribution degree as the weight in the CRF pairwise potential function clarifies the spatial relationship between the target and its neighbourhood and improves the classification accuracy.

#### **6. Conclusions**

Relying on the unique advantages of CRF in spatial context feature modelling and classification, this paper established a pixel-based RF-Im\_CRF model for classification based on multiple kinds of feature information, such as spectrum, texture, and polarization. The experiments and analyses were carried out using polarized SAR and optical images of the Nanjing area. The results show that the fusion of multi-source image data improves the classification accuracy. The RF-Im\_CRF model with multiple features proposed in this paper further improves the classification accuracy to more than 94%, an increase of 6% compared with the random forest classifier. Therefore, the RF-Im\_CRF model performs well in the fusion classification of polarized SAR and optical images and can be used as a fusion classification method for heterogeneous images.

**Author Contributions:** All the authors made a significant contribution to the work. Conceptualization, Y.K. and B.Y. Methodology, Y.K. and B.Y. Software, B.Y. Validation, Y.K., B.Y., and Y.L. Formal analysis, B.Y. Writing—original draft preparation, B.Y. Writing—review and editing, Y.K. and Y.L. Supervision, H.L. Project administration, X.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 61501228), the Natural Science Foundation of Jiangsu (No. BK20140825), the Aeronautical Science Foundation of China (No.20152052029, No.20182052012), Basic Research (No. NS2015040), and the National Science and Technology Major Project (2017-II-0001-0017).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Oil Spill Detection in SAR Images Using Online Extended Variational Learning of Dirichlet Process Mixtures of Gamma Distributions**

**Ahmed Almulihi <sup>1</sup> , Fahd Alharithi <sup>1</sup> , Sami Bourouis 1,\*, Roobaea Alroobaea <sup>1</sup> , Yogesh Pawar <sup>2</sup> and Nizar Bouguila <sup>2</sup>**


**Abstract:** In this paper, we propose a Dirichlet process (DP) mixture model of Gamma distributions, which is an extension of the finite Gamma mixture model to the infinite case. In particular, we propose a novel online nonparametric Bayesian analysis method based on the infinite Gamma mixture model where the determination of the number of clusters is bypassed via an infinite number of mixture components. The proposed model is learned via an online extended variational Bayesian inference approach in a flexible way where the priors of model's parameters are selected appropriately and the posteriors are approximated effectively in a closed form. The online setting has the advantage to allow data instances to be treated in a sequential manner, which is more attractive than batch learning especially when dealing with massive and streaming data. We demonstrated the performance and merits of the proposed statistical framework with a challenging real-world application namely oil spill detection in synthetic aperture radar (SAR) images.

**Keywords:** Dirichlet process; infinite mixture models; Gamma distribution; variational inference; online setting; oil spill detection; synthetic aperture radar images

#### **1. Introduction**

The use of statistical machine learning has proliferated in many fields, addressing a broad range of problems from signal processing and speech recognition to geosciences and remote sensing, where strong models are needed to apply statistical methodology. In geosciences and remote sensing, for instance, statistical machine learning techniques have been deployed successfully in a variety of problems and applications across many parts of the Earth system and beyond [1]. In particular, image modeling (e.g., of SAR images) has received much attention due to its importance in real-world tasks related to land, climate, disturbance attribution, vegetation dynamics, urbanization, etc.

Among probabilistic generative models, the so-called finite mixtures have been applied successfully thanks to their capability to represent large-scale complex probability densities and to offer a principled way of analyzing missing data [2,3]. Mixture models provide, in general, a formal approach to unsupervised learning and allow, in particular, the selection of the optimal number of clusters for a given dataset. This fact has been detailed at length in the literature (see, for example, [4,5]). This growing interest has led to several flexible mixture models such as Gaussian mixture models (GMM), which have become popular even though they are not the most appropriate for fitting complex non-Gaussian shapes [6,7]. To deal with the limitations of conventional GMM, alternatives such as Gamma mixtures (GaMM) [8–11] have been shown to perform significantly better than GMM [12] thanks to their compact analytical form, which is able to cover long-tailed distributions and to approximate data with outliers.

**Citation:** Almulihi, A.; Alharithi, F.; Bourouis, S.; Alroobaea, R.; Pawar, Y.; Bouguila, N. Oil Spill Detection in SAR Images Using Online Extended Variational Learning of Dirichlet Process Mixtures of Gamma Distributions. *Remote Sens.* **2021**, *13*, 2991. https://doi.org/10.3390/ rs13152991

Academic Editors: Monidipa Das, Soumya K. Ghosh, Vemuri M. Chowdary, Pabitra Mitra and Santosh Rijal

Received: 8 June 2021 Accepted: 26 July 2021 Published: 29 July 2021


Thus, motivated by the flexibility and good performance of the Gamma distribution, we focus here on investigating Gamma-based mixture models for SAR image classification. We are mainly motivated by the excellent results that the Gamma mixture has provided, thanks to its flexibility, for SAR image analysis in many applications such as target detection and discrimination, target recognition and surface classification, oil spill detection, noise reduction, etc. [10]. In this paper, we focus mainly on oil spill detection.

The most challenging problem with finite mixture models is the estimation of the number of clusters that best describes the data without over- or under-fitting [13,14]. In the statistical learning context, this problem is usually solved using a frequentist approach (i.e., maximum likelihood (ML)) together with some model-selection criterion (e.g., Akaike's Information Criterion, Minimum Description Length, Minimum Message Length) [15,16]. It is noteworthy that evaluating these criteria for many candidate numbers of clusters with the ML method is computationally very costly. In addition, all parameters are assumed fixed and the inference process relies mainly on the likelihood of the data, which can lead to convergence issues. An alternative way to tackle the issue of accurately selecting the number of clusters is nonparametric Bayesian inference using, for instance, the Dirichlet process (DP) [17]. In this case, the number of clusters may increase as more data are observed. This property makes the DP extremely useful in exploratory data analysis, and the assumption of an infinite number of components avoids the problems of over- and under-fitting. Dirichlet process (DP) mixtures have become a popular choice for various machine learning applications thanks to effective sampling techniques such as Markov chain Monte Carlo (MCMC) [18,19]. Despite the fact that MCMC yields good performance, it is computationally intensive and frequently limited to small-scale problems [20].

An interesting alternative to both frequentist and Bayesian sampling methods, which has provided promising performance, is variational Bayes learning [15,21]. Variational inference finds optimal approximate posterior distributions by minimizing the Kullback–Leibler (KL) divergence or, equivalently, by maximizing an evidence lower bound. Recently, an extended variational inference (EVI) framework was proposed [8] and has been shown to be efficient for minimizing the KL divergence and tackling the estimation problem. In this work, we go a step further by developing an infinite mixture model based on the Gamma distribution via a Dirichlet process prior, and we exploit the merits of the extended variational framework [8] to learn the developed mixture model (InGaMM-eV) in an online manner, with all parameters estimated in closed form. Moreover, compared to batch algorithms, online learning is more effective and helpful when processing big and streaming data [22], which can be crucial in SAR image analysis for continuous monitoring of the Earth's surface. It is also noteworthy that many SAR satellite missions have accumulated repeated observations over the last decades, and processing these data in an online manner could offer practical solutions to some challenging problems (e.g., change detection [23]). Thus, an effective online extended variational framework for Dirichlet process mixtures of Gamma distributions is developed using the stick-breaking representation. As a result, the number of clusters is selected appropriately, the model's parameters are learned in closed form, and the issue of under-fitting is addressed by deriving a model of unbounded complexity.

The rest of this manuscript is presented as follows. We review some relevant works related to oil spill detection in Section 2. The details of extending the finite Gamma mixture to infinite case are given in Section 3. The principles of our implemented nonparametric variational learning algorithm of infinite Gamma mixture are provided in Sections 4 and 5. Section 6 is devoted to discuss the results obtained from experiments. Finally, the paper is concluded with some future works.

#### **2. Related Research Work**

Oil pollution is a major ocean disaster and an environmental threat to coastal ecosystems, as highlighted recently by several tanker accidents around the world. Accidents on offshore oil platforms, refineries, and pipelines can cause serious oil spills. However, these accidents represent only 5% of the total oil pollution worldwide; 95% is caused by illegal discharges from ships that prefer to dispose cheaply of the oil residues in their tanks (according to several studies, including those of the European Space Agency) [24–26]. Oil pollution may result from several sources such as industrial discharges, oil production, natural oil seepage, and urban runoff, while natural slicks are of bacterial, biological, or geological origin. Oil spills can devastate marine life as well as harm humans and animals by dramatically reducing air–sea exchange processes such as surface evaporation. Oil spills are therefore of great public, political, and scientific concern, and there is an urgent need to monitor and detect them on the ocean so as to support government decision making. The detection of these oil spills is an important and challenging problem for conducting countermeasures effectively. An effective approach is the use of satellites, which provide radar images of the sea surface (500 × 500 km² in a single image). Satellite radar images offer an opportunity to monitor coastal waters day and night, regardless of weather conditions, allowing early warning of oil spills. Moreover, satellite detection is well suited to this kind of problem by producing images of areas that are difficult to access [24]. Among the different satellite imagery technologies, active microwave sensors such as synthetic aperture radar (SAR) have frequently been investigated for remote sensing of oil pollution [27]. A synthetic aperture radar emits and receives radio waves in order to acquire a representation of the target scene. Detecting oil spills in SAR images (as shown in Figure 1) is a complex procedure that involves many steps [26].

**Figure 1.** SAR image obtained by the European Remote Sensing satellite ERS-2 in April 1997 over the South China Sea (left) and SAR image obtained by the ERS-1 satellite in May 1994 over the Pacific Ocean east of Taiwan (right). These images (area: 100 km × 100 km) show an oil spill [28].

For several decades, extensive work has been carried out [27,29,30] to distinguish oil slicks from natural biogenic slicks by analyzing satellite radar images. Most conventional oil slick (or dark object) detection procedures consist of three steps: (1) pre-segmentation of dark spots, (2) extraction of dark spot features, and (3) classification of these dark spots. Some early and recent review articles summarize the different oil slick detection methods [26,28,31]. These reviews state that most methods rely on statistical patterns to discriminate between oil slicks and look-alikes under varying conditions. They also conclude that automatic and accurate discrimination between oil spills and look-alikes remains a challenging problem that needs further investigation. On the other side, a lot of effort has been devoted to applying classic classifiers and descriptive statistical approaches learned from training data [25,30,32–34]. These works rely on highly trained human operators to assess and verify each region in a given SAR image. In [33], the authors proposed a one-class approach for image classification to detect oil spills. First, a preprocessing step is used to identify areas related to oil spills. A feature selection step is then performed to select relevant features, given that the contrast between a spill region and the surrounding regions depends on the type and amount of oil and on environmental factors (i.e., wave height, wind speed, and sea state). Finally,

a one-class classifier is used to detect oil spills. A geometric level-set segmentation method for oil spills and illegal oil discharges was developed in [35]. According to this work, the regions in SAR images can be classified into pure oil spills or look-alikes on the basis of the following measurements: orientation, area, shape complexity, perimeter, eccentricity, and mean border gradient. In [36], a region-based method was also proposed; it combines conventional detection theory and image segmentation techniques (such as N-nearest-neighbor) to obtain more accurate results. In [37], the authors developed an adaptive thresholding-based algorithm to classify each slick as oil or look-alike. Here, the involved features are derived from shape (slick complexity, width, area, moment), slick surroundings, contrast (slick local contrast, border gradient, smoothness contrast), and slick homogeneity. Their algorithms were trained on two datasets, namely Radarsat and Envisat Advanced Synthetic Aperture Radar (ASAR) images. Fuzzy classifiers have also been used in [38] to identify all possible oil spills (dark patterns) in SAR images; a set of operations based on fuzzy theory is used to establish the likelihood of each candidate being an oil spill or not. In the last few years, artificial neural network algorithms have been broadly applied to remote sensing image segmentation and classification. Indeed, the authors in [39–43] proposed different neural network-based methods (such as CNN and deep NN) to improve oil spill detection and classification. Some other notable CNN-based oil spill detection and classification frameworks include the works in [44,45].

While considerable progress has been made in this field over the past few years, designing more robust tools still requires large amounts of specialized knowledge and manual work. The goal here is to propose a method based on a nonparametric Bayesian (infinite) model and to learn it using variational inference. Our main contributions are summarized as follows. First, we extend the finite Gamma mixture to the infinite case via a nonparametric Dirichlet process prior, so that the problem of selecting a suitable number of clusters is solved in a principled way. Then, we investigate the developed approach for remote sensing image classification. Indeed, after extracting effective features as in [46], we focus on modelling and classifying oil spills and other similar sea surface features using the infinite mixture model. The merits of our approach are demonstrated using real datasets.

#### **3. Statistical Model Specification**

In this section, we present our developed variational learning approach based on the infinite Gamma mixture model.

#### *3.1. Finite Gamma Mixture Model*

Let us denote our observed data by $\mathcal{Y} = \{\vec{Y}\_1, \ldots, \vec{Y}\_N\}$, where each $\vec{Y}\_i = (Y\_{i1}, Y\_{i2}, \ldots, Y\_{iD})$ is a *D*-dimensional positive vector. These feature vectors are supposed to be drawn from a mixture of Gamma distributions with parameters Θ. Let *M* denote the number of mixture components. The $\vec{Y}\_i$ ($i = 1, \ldots, N$) are independent and identically distributed (iid). The density function of the multi-dimensional Gamma distribution is defined as follows:

$$p(\vec{Y}\_i \mid \theta) = \prod\_{d=1}^{D} \frac{(\beta\_d)^{\alpha\_d}\, Y\_{id}^{\alpha\_d - 1}\, e^{-\beta\_d Y\_{id}}}{\Gamma(\alpha\_d)} \tag{1}$$

where *θ* = {*α<sup>d</sup>* , *βd*} is the set of parameters of the distribution such that *α<sup>d</sup>* denotes the shape and *β<sup>d</sup>* the location parameter. Here, Γ(.) is the Gamma function which is given as: Γ(*x*) = R <sup>∞</sup> 0 *s x*−1 *e* <sup>−</sup>*sds*.

Suppose that the *D*-dimensional random vector $\vec{Y}\_i$ (observed data) is drawn from a finite mixture of Gamma distributions (GaMM) consisting of *M* components, which makes it possible to model data with different shapes. The probability density function (pdf) of a GaMM is then given as:

$$p(\vec{Y} \mid \Theta) = p(\vec{Y} \mid \vec{a}, \vec{\beta}, \vec{\pi}) = \prod\_{i=1}^{N} \sum\_{j=1}^{M} \pi\_j p(\vec{Y}\_i \mid \theta\_j) \tag{2}$$

where $\Theta = \{\theta\_1, \theta\_2, \ldots, \theta\_M, \pi\_1, \ldots, \pi\_M\}$. The parameters of the $j$-th mixture component are represented by $\theta\_j = \{\alpha\_j, \beta\_j\}$, and the $\pi\_j$ are the mixing weights, subject to $0 \leq \pi\_j \leq 1$ and $\sum\_{j=1}^{M} \pi\_j = 1$.

#### *3.2. Infinite Gamma Mixture Model*

The Dirichlet process (DP) is a stochastic process, defined by a positive scaling factor and a base distribution, that is used in Bayesian nonparametric models of data, notably in infinite mixture models. The DP is an effective concept for various applications (for more details, please refer to [47]). In this section we address the setting of an infinite number of components. In order to solve this problem properly, which is important for describing the observed data well without over- or under-fitting, we propose a Dirichlet process mixture of Gamma distributions. In other words, we construct our infinite model by following the Dirichlet process through its stick-breaking representation [48,49], so that the number of components is allowed to be infinite. Let $\mathcal{G}$ denote a Dirichlet process with base distribution *H* and concentration parameter *ψ*. The construction of $\mathcal{G} \sim DP(\psi, H)$ is defined as

$$\begin{aligned} \lambda\_j &\sim \operatorname{Beta}(1, \psi) \\ \Omega\_j &\sim H \\ \pi\_j &= \lambda\_j \prod\_{s=1}^{j-1} (1 - \lambda\_s) \\ \mathcal{G} &= \sum\_{j=1}^{\infty} \pi\_j \delta\_{\Omega\_j} \end{aligned} \tag{3}$$

where $\delta\_{\Omega\_j}$ represents the Dirac delta measure centred at $\Omega\_j$. The proportions $\pi\_j$ are obtained by repeatedly breaking a unit-length stick into an infinite number of pieces, such that $\sum\_{j=1}^{\infty} \pi\_j = 1$, and *ψ* is a positive real number. Consequently, the infinite mixture model of Gamma distributions for $\mathcal{Y}$ is expressed as

$$p(\mathcal{Y} \mid \Theta) = p(\mathcal{Y} \mid \vec{a}, \vec{\beta}, \vec{\pi}) = \prod\_{i=1}^{N} \sum\_{j=1}^{\infty} \pi\_j p(\vec{Y}\_i \mid \theta\_j) \tag{4}$$
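To make the stick-breaking construction of Equation (3) concrete, the following sketch draws a truncated set of mixing weights; the truncation level and the concentration value are illustrative choices only, not values used in the paper.

```python
# Sketch: truncated stick-breaking weights as in Equation (3).
import numpy as np

def stick_breaking_weights(psi, M, seed=0):
    rng = np.random.default_rng(seed)
    lam = rng.beta(1.0, psi, size=M)       # lambda_j ~ Beta(1, psi)
    lam[-1] = 1.0                          # truncate so that the weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - lam[:-1])))
    return lam * remaining                 # pi_j = lambda_j * prod_{s<j}(1 - lambda_s)

weights = stick_breaking_weights(psi=2.0, M=20)
assert np.isclose(weights.sum(), 1.0)
```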

Subsequently, a latent variable $\vec{Z}\_i = (Z\_{i1}, Z\_{i2}, \ldots)$ is introduced for the observed data $\mathcal{Y}$. These latent membership vectors indicate whether the vector $\vec{Y}\_i$ belongs to component *j* ($Z\_{ij} = 1$) or not ($Z\_{ij} = 0$). The complete-data likelihood is then expressed as

$$p(\mathcal{Y}, Z \mid \vec{a}, \vec{\beta}, \vec{\pi}) = \prod\_{i=1}^{N} \prod\_{j=1}^{\infty} \pi\_j^{z\_{ij}} \left( p(\vec{Y}\_i \mid a\_j, \beta\_j) \right)^{z\_{ij}} \tag{5}$$

According to the stick-breaking construction of DP (see Equation (3)), *π<sup>j</sup>* can be expressed as a function of *λ<sup>j</sup>* and after replacement, we have the following:

$$p(\mathcal{Z} \mid \vec{\lambda}) = \prod\_{i=1}^{N} \prod\_{j=1}^{\infty} \left[ \lambda\_j \prod\_{s=1}^{j-1} (1 - \lambda\_s) \right]^{z\_{ij}} \tag{6}$$

The resulting complete-likelihood of the infinite Gamma mixture is finally expressed as (including latent variables):

$$p(\mathcal{Y}, Z \mid \vec{\alpha}, \vec{\beta}, \vec{\pi}) = \prod\_{i=1}^{N} \prod\_{j=1}^{\infty} \left[ \lambda\_j \prod\_{s=1}^{j-1} (1 - \lambda\_s) \right]^{z\_{ij}} \left( p(\vec{Y}\_i \mid \alpha\_j, \beta\_j) \right)^{z\_{ij}} \tag{7}$$

#### **4. Batch Variational Bayesian Learning**

It is noteworthy that, when dealing with intractable models, variational inference is a powerful deterministic alternative for approximating posteriors and likelihoods. In this section, we develop a variational learning method to approximate inference for the DP, where the truncated stick-breaking construction [50] is applied to derive an approximate posterior and to estimate the model parameters. We proceed by determining an approximation *Q*(Θ) of the true posterior *p*(Θ | Y), with Θ = {*Z*, *λ*, *α*, *β*}. We then use the well-known KL divergence to measure the difference between *Q*(Θ) and *p*(Θ | Y):

$$KL(Q \parallel P) = \int Q(\Theta) \ln \left( \frac{Q(\Theta)}{p(\Theta \mid \mathcal{Y})} \right) d\Theta \tag{8}$$

$$KL(Q \parallel P) = \ln p(\mathcal{Y}) - \mathcal{L}(Q) \tag{9}$$

$$\mathcal{L}(Q) = \int Q(\Theta) \ln \left( \frac{p(\mathcal{Y}, \Theta)}{Q(\Theta)} \right) d\Theta \tag{10}$$

The KL divergence attains its minimum value of zero when $Q(\Theta) = p(\Theta \mid \mathcal{Y})$ (since $KL(Q \parallel P) \geq 0$). From Equation (9) it follows that $\mathcal{L}(Q) \leq \ln p(\mathcal{Y})$, so $\mathcal{L}(Q)$ is a lower bound on $\ln p(\mathcal{Y})$. However, the true posterior cannot be estimated directly because of the complexity of the calculation. We get around this by restricting $Q(\Theta)$ to a family of distributions that can be evaluated [21]. In particular, the mean field theory [51] is adopted to factorize $Q(\Theta)$ into tractable distributions such that $Q(\Theta) = \prod\_i Q\_i(\Theta\_i)$. To maximize $\mathcal{L}(Q)$, we apply the variational methodology with respect to each $Q\_i(\Theta\_i)$. The optimal form of $Q\_i(\Theta\_i)$, denoted $Q\_s(\Theta\_s)$, is given as

$$\ln Q\_s(\Theta\_s) = \langle \ln p(\mathcal{Y}, \Theta) \rangle\_{j \neq s} + \text{const} \tag{11}$$

where $\langle \cdot \rangle\_{j \neq s}$ denotes the expectation with respect to all $Q\_i(\Theta\_i)$ except the one with $j = s$. Note that the truncation of the stick-breaking representation [49] must be taken into account when using the bound. Therefore, we set $\lambda\_M = 1$ and $\pi\_j = 0$ for $j > M$, which leads to $\sum\_{j=1}^{M} \pi\_j = 1$.

#### *4.1. Prior Distributions for Parameters*

To complete the probabilistic formulation, we place proper conjugate priors over the parameters *λ*, *α*, and *β*. In particular, the Beta distribution is selected for the parameter *λ* (referring to Equation (3)) as follows:

$$p(\lambda \mid \psi) = \prod\_{j=1}^{\infty} Beta(1, \psi\_j) = \prod\_{j=1}^{\infty} \psi\_j (1 - \lambda\_j)^{\psi\_j - 1} \tag{12}$$

Here, the hyperparameters of the Beta distribution are denoted by $\psi = (\psi\_1, \psi\_2, \ldots)$ [52]. Moreover, it is possible to assign a conjugate Gamma prior to *ψ*:

$$p(\boldsymbol{\psi}) = \mathcal{G}(\boldsymbol{\psi} \mid a, b) = \prod\_{j=1}^{\infty} \frac{b\_j^{a\_j}}{\Gamma(a\_j)} \psi\_j^{a\_j - 1} e^{-b\_j \psi\_j} \tag{13}$$

For *α* and *β*, Gamma priors are imposed, as suggested in [8]. This is reasonable given that *α* and *β* are positive, and the Gamma density is a sufficiently flexible yet simple distribution to serve as a prior.

$$p(\vec{\alpha}) = \mathcal{G}(\vec{\alpha} \mid \vec{u}, \vec{v}) = \prod\_{j=1}^{\infty} \prod\_{d=1}^{D} \mathcal{G}(\alpha\_{jd} \mid u\_{jd}, v\_{jd}) \tag{14}$$

$$p(\vec{\beta}) = \mathcal{G}(\vec{\beta} \mid \vec{s}, \vec{t}) = \prod\_{j=1}^{\infty} \prod\_{d=1}^{D} \mathcal{G}(\beta\_{jd} \mid s\_{jd}, t\_{jd}) \tag{15}$$

Following the graphical model in Figure 2, the resulting joint distribution is expressed as

$$\begin{split} p(\mathcal{Y}, \Theta) &= p(\mathcal{Y} \mid \mathcal{Z}, \vec{\alpha}, \vec{\beta})\, p(\mathcal{Z} \mid \vec{\lambda})\, p(\vec{\lambda} \mid \vec{\psi})\, p(\vec{\psi})\, p(\vec{\alpha})\, p(\vec{\beta}) \\ &= \prod\_{i=1}^{N} \prod\_{j=1}^{\infty} \Big[ \lambda\_{j} \prod\_{s=1}^{j-1} (1 - \lambda\_{s}) \Big]^{z\_{ij}} \Big( p(\vec{Y}\_{i} \mid \alpha\_{j}, \beta\_{j}) \Big)^{z\_{ij}} \\ &\quad\times \prod\_{j=1}^{\infty} \psi\_{j} (1 - \lambda\_{j})^{\psi\_{j} - 1} \times \prod\_{j=1}^{\infty} \frac{b\_{j}^{a\_{j}}}{\Gamma(a\_{j})} \psi\_{j}^{a\_{j} - 1} e^{-b\_{j} \psi\_{j}} \\ &\quad\times \prod\_{j=1}^{\infty} \prod\_{d=1}^{D} \mathcal{G}(\alpha\_{jd} \mid u\_{jd}, v\_{jd}) \times \prod\_{j=1}^{\infty} \prod\_{d=1}^{D} \mathcal{G}(\beta\_{jd} \mid s\_{jd}, t\_{jd}) \end{split} \tag{16}$$

#### *4.2. Learning Algorithm*

As explained at the beginning, the objective of this work is to approximate the true posterior *p*(Θ | Y) with a tractable approximation denoted by *Q*(Θ). The optimal solution of variational learning is reached by maximizing the lower bound w.r.t. Θ = {*Z*, *λ*, *α*, *β*}. The factorization of *Q*(Θ) (taking the truncation level *M* into account) leads to the following parametric form, whose optimal solution is presented in Appendix A:

$$Q(\Theta) = \left[ \prod\_{i=1}^{N} \prod\_{j=1}^{M} Q(Z\_{ij}) \right] \left[ \prod\_{j=1}^{M} Q(\lambda\_j) Q(\psi\_j) \right] \left[ \prod\_{j=1}^{M} \prod\_{d=1}^{D} Q(\alpha\_{jd}) Q(\beta\_{jd}) \right] \tag{17}$$

Once the optimal variational factors are in hand, the calculation of the lower bound L(*Q*) is then straightforward. Figure 2 presents a graphical model of the proposed infinite Gamma mixture model (inGaMM). Random variables are denoted by circles and hyperparameters are represented by rounded boxes. Then, the different steps of the implemented method are summarized in Algorithm 1.

**Figure 2.** Graphical model of the developed variational infinite Gamma mixture (inGaMM). Random variables are denoted by circles and hyperparameters by rounded boxes. *Y* is the observed variable, *Z* is the latent variable, large boxes indicate repeated processes, and the arrows show the conditional dependence between variables.

**Algorithm 1:** Batch variational learning approach for the inGaMM
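Since the body of Algorithm 1 is not reproduced here, the following structural skeleton indicates only the order of the variational updates; the helper functions are hypothetical placeholders standing in for the closed-form expressions of Appendix A, not the authors' implementation.

```python
# Structural skeleton of the truncated batch variational loop for the infinite Gamma
# mixture. The helpers below are placeholders to be filled with the Appendix A formulas.
import numpy as np

def update_responsibilities(Y, params):
    raise NotImplementedError("Q(Z): fill in with Equations (A1)-(A5)")

def update_stick_factors(R, params):
    raise NotImplementedError("Q(lambda), Q(psi): fill in with Equations (A6)-(A9)")

def update_gamma_factors(Y, R, params):
    raise NotImplementedError("Q(alpha), Q(beta): fill in with Equations (A10)-(A15)")

def lower_bound(Y, R, params):
    raise NotImplementedError("L(Q): assemble from the optimal factors, Equation (10)")

def batch_variational_inGaMM(Y, params, max_iter=200, tol=1e-5):
    """Y: (N, D) array of positive feature vectors; params: dict of variational
    hyperparameters (u, v, s, t, a, b, c, d) at truncation level M."""
    elbo_old = -np.inf
    for _ in range(max_iter):
        R = update_responsibilities(Y, params)       # update Q(Z)
        params = update_stick_factors(R, params)     # update Q(lambda), Q(psi)
        params = update_gamma_factors(Y, R, params)  # update Q(alpha), Q(beta)
        elbo = lower_bound(Y, R, params)
        if abs(elbo - elbo_old) < tol:               # stop when L(Q) stops improving
            break
        elbo_old = elbo
    return R, params
```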


#### **5. Online Variational Bayesian Learning**

Early warning and immediate detection of oil spills have many advantages, such as a faster response and reduced damage to the environment. The development of a real-time monitoring and detection system is therefore of great importance in order to minimize the volume of oil spilled. To address this problem, we develop an online learning approach, which is commonly used in many other areas, especially when data points arrive continuously over time [53]. The online setting is particularly useful for incrementally training the system by feeding it data instances sequentially. It also has the benefit of making the learning process easier and faster than batch mode.

In what follows, we extend the batch variational method presented in the previous section for unsupervised SAR image classification to an online setting. This requires updating the model's parameters incrementally without degrading its efficiency and flexibility. To determine the lower bound, we suppose that at time *t* we have a fixed set of observed data. At time *t* + 1, a new SAR image $\vec{Y}\_{N+1}$ arrives and is added to the dataset; hence, the mixture's parameters have to be updated accordingly. Thus, in the online setting, the lower bound at time *t* is expressed as in [54]:

$$\mathcal{L}^{t}(Q) = \frac{N}{t} \sum\_{i=1}^{t} \int \mathcal{Q}(\Omega) d\Omega \sum\_{\mathcal{Z}\_{i}} \ln \left[ \frac{p(\vec{Y}\_{i}, \vec{Z}\_{i} \mid \Omega)}{Q(\vec{Z}\_{i})} \right] + \int \mathcal{Q}(\Omega) \ln \left[ \frac{p(\Omega)}{Q(\Omega)} \right] d\Omega \tag{18}$$

where Ω = {*α*, *β*}.

Let us suppose that we have already observed $\{\vec{Y}\_1, \ldots, \vec{Y}\_{t-1}\}$ and a new data point $\vec{Y}\_t$ arrives. Then, $\mathcal{L}^t(Q)$ is maximized w.r.t. $Q(\vec{Z}\_t)$, with $Q(\vec{\alpha})$, $Q(\vec{\lambda})$ and $Q(\vec{\beta})$ set to $Q^{t-1}(\vec{\alpha})$, $Q^{t-1}(\vec{\lambda})$ and $Q^{t-1}(\vec{\beta})$, respectively. We adopt a truncation technique with truncation level *M*, which gives [49]:

$$Q(\vec{Z}\_t) = \prod\_{j=1}^{M} r\_{tj}^{Z\_{tj}} \tag{19}$$

$$r\_{tj} = \frac{\rho\_{tj}}{\sum\_{j=1}^{M} \rho\_{tj}} \tag{20}$$

Then, $\mathcal{L}^t(Q)$ is maximized w.r.t. $Q(\vec{\alpha})$, $Q(\vec{\lambda})$ and $Q(\vec{\beta})$ while keeping $Q(\vec{Z}\_t)$ fixed.

$$Q^{(t)}(\vec{a}) = \prod\_{j=1}^{M} \prod\_{d=1}^{D} \mathcal{G}(a\_{jd}^{(t)} \mid u\_{jd}^{\*(t)}, v\_{jd}^{\*(t)}) \tag{21}$$

$$\mathcal{Q}^{(t)}(\vec{\beta}) = \prod\_{j=1}^{M} \prod\_{d=1}^{D} \mathcal{G}(\beta\_{jd}^{(t)} \mid s\_{jd}^{\*(t)}, t\_{jd}^{\*(t)}) \tag{22}$$

$$\mathcal{Q}^{(t)}(\lambda) = \prod\_{j=1}^{M} Beta(\lambda\_j^{(t)} \mid c\_j^{(t)}, d\_j^{(t)}) \tag{23}$$

where

$$\begin{aligned} u\_{jd}^{\*(t)} &= u\_{jd}^{\*(t-1)} + \rho\_t \Delta u\_{jd}^{\*(t)} \\ v\_{jd}^{\*(t)} &= v\_{jd}^{\*(t-1)} + \rho\_t \Delta v\_{jd}^{\*(t)} \\ s\_{jd}^{\*(t)} &= s\_{jd}^{\*(t-1)} + \rho\_t \Delta s\_{jd}^{\*(t)} \\ t\_{jd}^{\*(t)} &= t\_{jd}^{\*(t-1)} + \rho\_t \Delta t\_{jd}^{\*(t)} \\ c\_{j}^{\*(t)} &= c\_{j}^{\*(t-1)} + \rho\_t \Delta c\_{j}^{\*(t)} \\ d\_{j}^{\*(t)} &= d\_{j}^{\*(t-1)} + \rho\_t \Delta d\_{j}^{\*(t)} \end{aligned} \tag{24}$$

∆ denotes the natural gradient of each hyperparameter in the previous equation, and $\rho\_t$ denotes the learning rate [55], expressed by the following equation:

$$\rho\_t = (\eta\_0 + t)^{-\varepsilon} \tag{25}$$

where $\varepsilon \in (0.5, 1]$ and $\eta\_0 \geq 0$, which helps to guarantee convergence [55]. Please note that the expectations in the above equations are obtained in the same manner as in the batch setting of the previous section and as in [56]. Since the online learning framework can be considered a stochastic approximation algorithm, convergence is ensured, as proven in [53]. The proposed online variational algorithm is presented in Algorithm 2.

#### **Algorithm 2:** Proposed online algorithm for inGaMM
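Since the body of Algorithm 2 is likewise not reproduced, the sketch below illustrates only the online update rule of Equations (24) and (25): each hyperparameter is moved along its natural-gradient increment with a decaying learning rate. The increments and the values of $\eta\_0$ and $\varepsilon$ are illustrative inputs, not quantities derived here.

```python
# Sketch of the online hyperparameter update of Equations (24)-(25); the natural-gradient
# increments (delta) are supplied as inputs and are not computed in this snippet.
import numpy as np

def learning_rate(t, eta0=64.0, eps=0.75):
    """rho_t = (eta0 + t)^(-eps), with eps in (0.5, 1] to help guarantee convergence."""
    return (eta0 + t) ** (-eps)

def online_step(hyper, delta, t, eta0=64.0, eps=0.75):
    """hyper and delta: dicts of NumPy arrays keyed by hyperparameter name
    (e.g. 'u', 'v', 's', 't', 'c', 'd'); returns the updated hyperparameters."""
    rho = learning_rate(t, eta0, eps)
    return {name: hyper[name] + rho * delta[name] for name in hyper}

# Toy usage: one online step for a single component with two dimensions.
hyper = {"u": np.array([2.0, 2.0]), "v": np.array([1.0, 1.0])}
delta = {"u": np.array([0.3, -0.1]), "v": np.array([0.05, 0.02])}
hyper = online_step(hyper, delta, t=1)
```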


#### **6. Experimental Results**

#### *6.1. Data Sets*

The main objective of this section is to investigate our developed online extended variational learning framework of Dirichlet process mixtures of Gamma distributions for detecting oil spills in several SAR images. The second objective is to compare the performance of the proposed statistical framework with other methods from the state of the art. First, it should be noted that one of the challenges is the lack of common benchmark data sets for oil spill detection, a problem that has been pointed out by several relevant research communities [57,58]. Very limited data sets have been proposed in the literature, and it is therefore difficult to compare published results since each method uses different data sets with different settings. In this work, we are essentially concerned with two challenging SAR databases. The first data set contains SAR images with oil spills collected via the European Space Agency (ESA) database [40]; it is composed of 1112 images with 5 different classes: land, look-alike, oil spill, ship, and sea surface. The second one is a labelled SAR dataset taken from Sentinel-1 wave mode (TenGeoP-SARwv) [59], which includes 40,553 images with 10 different geophysical phenomena: Pure Ocean Waves (F), Wind Streaks (G), Micro Convective Cells (H), Rain Cells (I), Biological Slicks (J), Sea Ice (K), Iceberg (L), Low Wind Area (M), Atmospheric Front (N), and Oceanic Front (O). Figures 3 and 4 show examples of images from these two datasets, respectively. For the experiments, we randomly select half of each dataset as the training set and the rest for testing. In order to quantify how well SAR images are classified, we report the results in terms of average accuracy and false positive rate (FPR).

Modeling and classifying SAR images requires powerful statistical models to represent their content (e.g., color, texture). In this work we focus on the problem of SAR image modeling and classification via the extraction of local features that accurately describe the input images. Feature extraction is part of the dimensionality reduction process that has been broadly studied in the past. It plays an important role in many computer vision applications, since it helps identify the most discriminating characteristics, reducing ambiguity and enhancing performance. However, the presence of speckle noise in synthetic aperture radar (SAR) images, as well as the low resolution between regions (surfaces) and poor contrast, makes extracting relevant features difficult. If representative features are well extracted, the images can be interpreted and classified correctly. Extracting local features from grey-scale images is a well-studied step in image processing and computer vision, and various comparative measures have been studied for many years. A review of prior techniques is not within the scope of this paper; instead, we apply two successful feature extraction methods. The first is based on an ImageNet-pretrained deep learning model (ResNet-50) [60]; the flowchart for extracting features with ResNet-50 is given in Figure 5. For each SAR image in the flowchart, we first apply image processing operations such as contrast adjustment, thresholding, and object edge detection with blurring of noise and small objects. Based on the number of detected dark spots, we then extract different features, including geometrical characteristics and the texture of the object, and store the extracted features for model evaluation. In the second approach, we extract a number of features based on geometrical characteristics, physical behaviour, and oil-spill-specific properties of the dark formations, as described in [61]. After extracting the features, we apply principal component analysis (PCA) to reduce the dimensionality of the extracted feature sets.
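As a rough sketch of the first feature-extraction route (assuming torchvision and scikit-learn; the input size, the use of the 2048-dimensional pooled layer, and the number of retained PCA components are illustrative choices, not the configuration used in this paper), one could proceed as follows.

```python
# Sketch: ImageNet-pretrained ResNet-50 features followed by PCA dimensionality reduction.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(img.convert("RGB")) for img in pil_images])
    return backbone(batch).numpy()          # shape (n_images, 2048)

# Reduce dimensionality before fitting the mixture model (illustrative component count):
# features = extract_features(images)
# reduced = PCA(n_components=50).fit_transform(features)
```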


**Figure 3.** Dataset-1: Samples of SAR images from the European Space Agency (ESA) dataset [40]. (**a**) OilSpill, (**b**) Look-alike, (**c**) Land, (**d**) Ship.

#### *6.2. Results and Discussion*

Next, we apply our online extended variational algorithm (Section 5) to the extracted features; each image is thus represented by an infinite Gamma mixture model. We average the results over 30 runs to compute the final performance. Tables 1 and 2 show the average classification accuracy and false positive rate (FPR) of our InGaMM-eV model, obtained for different classes in both datasets and with the two feature extraction methods. We considered a first experiment whose goal was to distinguish oil spills from everything else, and a second one whose goal was to categorize several classes from each data set (4 categories are taken from the first data set and 9 from the second). The testing data are assumed to arrive sequentially in an online mode.

**Figure 4.** Dataset-2: Samples of SAR images from the Sentinel-1 wave mode (TenGeoP-SARwv) dataset [59]. (**a**) Pure Ocean Waves, (**b**) Wind Streaks, (**c**) Micro-Convective Cells, (**d**) Rain Cells, (**e**) Biological Slicks, (**f**) Sea Ice, (**g**) Iceberg, (**h**) Low Wind Area, (**i**) Atmospheric Front, (**j**) Oceanic Front.

**Figure 5.** Flowchart diagram for extracting features using first feature extraction approach (ImageNet pretrained (resnet50) features).

Figures 6 and 7 present the confusion matrices for SAR image classification computed by the proposed InGaMM-eV using the two feature extraction methods, respectively. These matrices describe the performance of the proposed model since they record true positives, false positives, and false negatives. Each matrix summarizes the prediction results of the classification problem and offers a clear picture of where the proposed model works correctly and which kinds of errors it commits. Each entry with index (*u*, *v*) represents the number of images of class *u* assigned to class *v*. According to these results, the average classification accuracy is very promising: 90.57% (error rate of 9%) for the first dataset and 95.16% (error rate of 4%) for the second dataset.
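A minimal sketch of how such a summary can be computed (assuming scikit-learn; `y_true`, `y_pred`, and `class_names` are placeholders) is given below.

```python
# Sketch: confusion matrix, overall accuracy, and per-class accuracy for a classifier.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

def summarize(y_true, y_pred, class_names):
    cm = confusion_matrix(y_true, y_pred, labels=range(len(class_names)))
    # Entry (u, v) counts images of class u assigned to class v, as described above.
    per_class_acc = cm.diagonal() / cm.sum(axis=1)
    return {
        "confusion_matrix": cm,
        "overall_accuracy": accuracy_score(y_true, y_pred),
        "per_class_accuracy": dict(zip(class_names, per_class_acc)),
    }
```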

**Table 1.** Results for both dataset with different number of classes using first feature extraction approach (ImageNet pretrained (resnet50) features).


**Table 2.** Results for both dataset with different number of classes using second feature extraction approach (Dark spots, geometrical, physical, and characteristics features).


Figures 6 and 7 present additional results obtained by changing the way visual features are extracted as well as the number of classes. For the ESA-SAR dataset, InGaMM-eV provides a high average accuracy of 97.96% using the ImageNet-pretrained deep learning model (ResNet-50), and 89.94% using the dark-spot, geometrical, and physical characteristics features. In both cases, the false positive rate is very low. For the Sentinel-1 wave mode SAR dataset, the average accuracy for classifying SAR images is 95.16% using ResNet-50, which is better than the second feature extraction method (only 88.68%). According to these results, the overall average classification accuracy is very encouraging, taking into account the complexity of the treated images. It is noteworthy that, due to the low resolution of the images in the second dataset (Sentinel-1 wave mode), it was very difficult to extract features using the second feature extraction method (i.e., detecting dark objects); hence, the accuracy for this dataset is lower than expected.


**Figure 6.** Average rounded confusion matrix (in terms of percentage) for SAR classification using InGaMM-eV for ESA-SAR dataset.


**Figure 7.** Average rounded confusion matrix (in terms of percentage) for SAR classification using InGaMM-eV for Sentinel-1 wave mode SAR dataset.

In this experiment, our second goal is to demonstrate the advantages of the extended variational framework over maximum likelihood (via the EM algorithm), as well as the merits of the infinite mixture model over its finite counterpart. We therefore compared the classification results of the following mixture models: InGaMM-eV (our infinite Gamma model using extended variational inference), GaMM-eV (finite Gamma model using extended variational learning), GaMM-EM (finite Gamma mixture model using expectation-maximization learning), InGMM-eV (infinite Gaussian model using extended variational learning), and GMM-EM (finite Gaussian mixture model using expectation-maximization learning). The average performance of all tested learning approaches, using the two feature extraction methods, is reported in Tables 3 and 4. We can see clearly that the extended variational approach provides better results than EM. Furthermore, the merit of using a Dirichlet process mixture of Gamma distributions (i.e., an infinite mixture model) over a finite mixture model is clear, since better results were found with the infinite mixtures. In particular, in Table 3, InGaMM-eV (90.05%) outperforms GaMM-eV (88.33%) in terms of classification accuracy for both datasets. It is also worth mentioning that our approach provides better results than the implemented frameworks based on Gaussian mixtures. We can therefore deduce that the infinite Gamma model has better modeling and classification capability than the Gaussian one when dealing with SAR image analysis.

**Table 3.** Overall oil spill detection rate of different models for 2 datasets using the first feature extraction approach (ImageNet pretrained (resnet50) features) .


**Table 4.** Overall oil spill detection rate of different models for 2 datasets using the second feature extraction approach (Dark spots, geometrical, physical, and characteristics features).


Next, the proposed learning approach (InGaMM-eV) is compared with some methods from the literature; the comparative study is presented in Table 5. As can be seen, the proposed online algorithm performs better than the other algorithms. Accordingly, it is important to emphasize the advantage of our extended variational formalism for the infinite Gamma mixture, which provides interesting results. It is also important to underline the merit of the online learning process, which is able to maintain high oil spill prediction performance while handling data as they arrive. Moreover, it has the capacity to update the model incrementally without retraining. All these results confirm that the proposed infinite Gamma mixture using the extended variational learning mode is a better choice thanks to the flexibility of the infinite Gamma mixture over the finite models. All these benefits make it appropriate for SAR image classification, especially in the case of large-scale data sets.

**Table 5.** Comparative study between different methods from the literature on two datasets.


#### **7. Conclusions**

In this paper, an effective online nonparametric Bayesian analysis method based on a Dirichlet process mixture of Gamma distributions (i.e., an infinite Gamma mixture model) is developed to deal with the challenging problem of oil spill detection in SAR images. The Gamma distribution is considered because of its flexibility for semi-bounded data modelling. This framework is learned using an extended version of conventional variational inference in a flexible way, which has advantages such as approximating the posteriors effectively in closed form, easy assessment of convergence, and easy optimization, offering a trade-off between frequentist techniques and MCMC-based ones. An important property of our approach is that it does not need the number of mixture components to be specified in advance. The proposed online algorithm also allows data instances to be treated sequentially, which is more attractive than batch learning especially when dealing with massive and streaming data. Through the challenging application of oil spill detection in SAR images, we have demonstrated the performance of our statistical framework, which provides very encouraging results in terms of SAR image modeling and classification. As future work, we plan to integrate a feature selection mechanism into the proposed framework in order to further improve the classification accuracy. It is our hope that many other real-world applications related to image processing and machine learning can be addressed with the developed framework.

**Author Contributions:** Conceptualization, A.A. and S.B.; methodology, N.B. and F.A.; software, Y.P.; validation, R.A., S.B. and A.A.; formal analysis, F.A.; investigation, A.A.; resources, Y.P.; data curation, S.B.; writing—original draft preparation, A.A. and S.B.; writing—review and editing, N.B. and F.A.; visualization, R.A.; supervision, N.B.; project administration, A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Deanship of Scientific Research, Taif University, Kingdom of Saudi Arabia, grant number 1-441-137.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data available in a publicly accessible repository: The first data set: European Space Agency (ESA) database [40]. The second data set: Sentinel-1 wave mode (TenGeoP-SARwv) [59].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**(1) Optimal solution to** *Q*(*Z*).

$$Q(Z) = \prod\_{i=1}^{N} \prod\_{j=1}^{M} r\_{ij}^{Z\_{ij}} \tag{A1}$$

where the responsibility *rij* can be calculated as:

$$r\_{\rm ij} = \frac{\rho\_{\rm ij}}{\sum\_{j=1}^{M} \rho\_{\rm ij}} \tag{A2}$$

such that:

$$\ln(\rho\_{ij}) = \ln(\pi\_j) + \sum\_{d=1}^{D} \left[ \mathcal{P}\_{jd} + (\langle \alpha\_{jd} \rangle - 1) \ln(y\_{id}) - \langle \alpha\_{jd} \rangle \langle \beta\_{jd} \rangle y\_{id} \right] \tag{A3}$$

and

$$\mathcal{P}\_{jd} = \alpha\_{jd}^{\*} \ln(\alpha\_{jd}^{\*}) - \alpha\_{jd}^{\*} - \ln(\alpha\_{jd}^{\*}) - \ln(\Gamma(\alpha\_{jd}^{\*})) + \langle \ln(\alpha\_{jd}) \rangle + \langle \alpha\_{jd} \rangle + \langle \alpha\_{jd} \rangle \langle \ln(\beta\_{jd}) \rangle \tag{A4}$$

where $\langle \cdot \rangle$ denotes an expectation w.r.t. the corresponding factor, and $\alpha^{\*}\_{jd}$ is any feasible point.

The expectation of *Zij* is determined as:

$$
\langle Z\_{\vec{\text{ij}}} \rangle = r\_{\vec{\text{ij}}} \tag{A5}
$$
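As a small practical note, the normalization in Equations (A2) and (A3) is typically carried out in the log domain; a minimal sketch (assuming NumPy and SciPy) is given below.

```python
# Sketch: turning the unnormalized log-responsibilities ln(rho_ij) of Equation (A3)
# into the normalized responsibilities r_ij of Equation (A2), numerically stably.
import numpy as np
from scipy.special import logsumexp

def responsibilities(log_rho):
    """log_rho: array of shape (N, M) holding ln(rho_ij); rows of the result sum to 1."""
    return np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))
```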

**(2) Optimal solution to** *Q*(*ψ*) and *Q*(*λ*).

$$\mathcal{Q}(\boldsymbol{\psi}) = \prod\_{j=1}^{M} \mathcal{G}(\psi\_j \mid a\_j, b\_j) \tag{A6}$$

$$Q(\lambda) = \prod\_{j=1}^{M} Beta(\lambda\_j \mid c\_j, d\_j) \tag{A7}$$

$$\begin{aligned} a\_j^\* &= a\_j + 1\\ b\_j^\* &= b\_j - \langle \ln(1 - \lambda\_j) \rangle\\ c\_j^\* &= 1 + \sum\_{i=1}^N \langle Z\_{ij} \rangle\\ d\_j^\* &= \langle \psi\_j \rangle + \sum\_{i=1}^N \sum\_{s=j+1}^M \langle Z\_{is} \rangle \end{aligned} \tag{A8}$$

From the previous equations, we obtain the following expectations:

$$\begin{aligned} \langle \ln(\lambda\_j) \rangle &= \Psi(c\_j^\*) - \Psi(c\_j^\* + d\_j^\*)\\ \langle \ln(1 - \lambda\_j) \rangle &= \Psi(d\_j^\*) - \Psi(c\_j^\* + d\_j^\*)\\ \langle \psi\_j \rangle &= \frac{a\_j^\*}{b\_j^\*}\\ \langle \lambda\_j \rangle &= \frac{c\_j^\*}{c\_j^\* + d\_j^\*} \end{aligned} \tag{A9}$$

where Ψ is Digamma function.

**(3) Optimal solution to***Q*(~*α*).

$$Q(\vec{\alpha}) = \prod\_{j=1}^{M} \prod\_{d=1}^{D} \mathcal{G}(\alpha\_{jd} \mid u\_{jd}^{\*}, v\_{jd}^{\*}) \tag{A10}$$

where

$$\begin{aligned} u\_{jd}^{\*} &= u\_{jd} + \sum\_{i=1}^{N} \langle z\_{ij} \rangle \\ v\_{jd}^{\*} &= v\_{jd} - \sum\_{i=1}^{N} [S\_{jd} + \ln(y\_{id}) - \langle \beta\_{jd} \rangle y\_{id}] \langle z\_{ij} \rangle \\ S\_{jd} &= 1 + \ln(\alpha\_{jd}^{\*}) - \frac{1}{\alpha\_{jd}^{\*}} - \Psi(\alpha\_{jd}^{\*}) + \langle \ln(\beta\_{jd}) \rangle \end{aligned} \tag{A11}$$

From the previous equations, we obtain the following expectations:

$$\begin{aligned} \langle \alpha\_{jd} \rangle &= \frac{u\_{jd}^{\*}}{v\_{jd}^{\*}}\\ \langle \ln(\alpha\_{jd}) \rangle &= \Psi(u\_{jd}^{\*}) - \ln(v\_{jd}^{\*}) \end{aligned} \tag{A12}$$

#### **(4) Optimal solution to** *Q*(~*β*).

$$Q(\vec{\beta}) = \prod\_{j=1}^{M} \prod\_{d=1}^{D} \mathcal{G}(\beta\_{jd} \mid s\_{jd}^{\*}, t\_{jd}^{\*}) \tag{A13}$$

where

$$\begin{aligned} s\_{jd}^{\*} &= s\_{jd} + \langle \alpha\_{jd} \rangle \sum\_{i=1}^{N} \langle z\_{ij} \rangle \\ t\_{jd}^{\*} &= t\_{jd} + \langle \alpha\_{jd} \rangle \sum\_{i=1}^{N} \langle z\_{ij} \rangle y\_{id} \end{aligned} \tag{A14}$$

From the previous equations, we obtain the following expectations:

$$\begin{aligned} \langle \beta\_{jd} \rangle &= \frac{s\_{jd}^{\*}}{t\_{jd}^{\*}}\\ \langle \ln(\beta\_{jd}) \rangle &= \Psi(s\_{jd}^{\*}) - \ln(t\_{jd}^{\*}) \end{aligned} \tag{A15}$$

#### **References**


## *Article* **High Wind Speed Inversion Model of CYGNSS Sea Surface Data Based on Machine Learning**

**Yun Zhang <sup>1</sup> , Jiwei Yin <sup>1</sup> , Shuhu Yang 1,\* , Wanting Meng <sup>2</sup> , Yanling Han <sup>1</sup> and Ziyu Yan <sup>1</sup>**


**Abstract:** In response to the limited capability of traditional remote sensing instruments (scatterometers, microwave radiometers, etc.) to detect high wind speeds above 25 m/s, this paper proposes a GNSS-R technique combined with machine learning methods to invert high sea surface wind speeds. The L1-level satellite-based data from the Cyclone Global Navigation Satellite System (CYGNSS), together with European Centre for Medium-Range Weather Forecasts (ECMWF) and National Centers for Environmental Prediction (NCEP) data, constitute the original sample set, which is processed and trained with Support Vector Regression (SVR), the combination of Principal Component Analysis and SVR (PCA-SVR), and Convolutional Neural Network (CNN) methods, respectively, to construct sea surface high wind speed inversion models. The three models are validated with test data collected during Typhoon Bavi in 2020. The results show that all three machine learning models can be used for high wind speed inversion over the sea surface, among which the CNN method has the highest inversion accuracy, with a mean absolute error of 2.71 m/s and a root mean square error of 3.80 m/s. The experimental results largely meet the operational requirements for high wind speed inversion accuracy.

**Keywords:** GNSS-R; CYGNSS; high wind speed inversion; SVR; PCA-SVR; CNN

#### **1. Introduction**

As one of the most serious natural disasters in the world, typhoons are a top priority for scientific research because of their suddenness and destructive power, which bring huge economic losses to human society. Remote sensing technology provides large scope for typhoon monitoring and prediction, yet all microwave remote sensing instruments struggle to provide reliable measurements of high wind speeds above 25 m/s, and few studies have addressed this problem so far [1–5]. Global Navigation Satellite System reflectometry (GNSS-R) uses satellite signals reflected from the Earth's surface to obtain information on surface characteristics such as sea surface wind speed, so it offers all-weather detection capability [6–11]. The main purpose of the Cyclone Global Navigation Satellite System (CYGNSS), launched by the United States in 2016, is to monitor tropical cyclones. It measures sea surface winds in and near the eyewalls of tropical cyclones, typhoons, and hurricanes frequently throughout their life cycle, and the collected data can be used to invert wind speeds [12].

Many methods can be used to invert wind speed. For example, one GNSS-R wind speed inversion method extracts DDM observables reflecting the wind speed from the delay-Doppler map (DDM) and then builds a Geophysical Model Function (GMF) for wind speed inversion. Other studies use a matched filter between simulated and measured DDMs to retrieve wind speed. In addition, machine learning methods are also suitable for wind speed inversion; a sketch of this route is given below.
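As a rough sketch of the machine-learning route (assuming scikit-learn; the observable matrix `X`, the reference wind speeds `y`, and the SVR hyperparameters are placeholders, not the configuration used in this paper), wind speed can be regressed on DDM-derived observables as follows.

```python
# Sketch: regressing a reference wind speed (m/s) on DDM-derived observables
# (e.g. DDMA, LES) with Support Vector Regression; X and y are placeholder arrays.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error

def train_svr_inversion(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, y_hat))   # root mean square error
    mae = mean_absolute_error(y_te, y_hat)             # mean absolute error
    return model, rmse, mae
```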

**Citation:** Zhang, Y.; Yin, J.; Yang, S.; Meng, W.; Han, Y.; Yan, Z. High Wind Speed Inversion Model of CYGNSS Sea Surface Data Based on Machine Learning. *Remote Sens.* **2021**, *13*, 3324. https://doi.org/10.3390/rs13163324

Academic Editors: Monidipa Das, Soumya K. Ghosh, V. M. Chowdary, Pabitra Mitra and Santosh Rijal

Received: 29 June 2021 Accepted: 19 August 2021 Published: 23 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

In 2014, Clarizia et al. [6] extracted five DDM observables from the United Kingdom Disaster Monitoring Constellation (UK-DMC) satellite, namely the Delay-Doppler Map Average (DDMA), Delay-Doppler Map Variance (DDMV), Allan DDM Variance (ADDMV), Leading Edge Slope (LES), and Trailing Edge Slope (TES). Different GMF models were established by comparing the five observables with the buoy wind speed provided by the National Data Buoy Center (NDBC). Then, the wind speeds retrieved with each of these GMFs were combined into a minimum variance estimator. The root mean square error (RMSE) obtained for wind speeds less than 10 m/s was 1.65 m/s. In 2016, Rodriguez et al. [13] used a generalized observable to determine the coefficients of a linear combination by the maximum signal-to-noise ratio (MSNR), the minimum variance of the wind speed (MVU), and principal component analysis (PCA). The three wind speed estimates were then compared with the CYGNSS baseline L2 observables. The results show that PCA performed best, but the overall RMS was greater than 4 m/s when the wind speed was greater than 20 m/s. In 2018, Christopher S. Ruf et al. [9] used the observables DDMA and LES from CYGNSS L1 level data to establish the Fully Developed Seas (FDS) GMF and Young Seas/Limited Fetch (YSLF) GMF for different incidence angles over different seas. The wind speed inversion algorithm used to establish the FDS GMF model comes from [7]. The reference wind speeds used to train the FDS GMF were the 10 m-referenced ocean surface wind speeds provided by the ECMWF and the GDAS. The YSLF GMF model was established using the wind speed collected by the stepped frequency microwave radiometer (SFMR) on NOAA P-3 hurricane hunter aircraft. The FDS GMF model was suitable for low-to-moderate wind speeds, whereas the YSLF GMF model was more sensitive to hurricanes. When the FDS GMF was used to invert wind speeds below 20 m/s and the results were compared with the European Centre for Medium-Range Weather Forecasts (ECMWF), the overall RMSE was about 2 m/s. Compared with SFMR aircraft data, when the wind speed was greater than 20 m/s, the RMSE of the wind speed inverted with the YSLF GMF model was about 6.5 m/s. Moreover, only 674 samples with wind speeds greater than 20 m/s were tested.

In 2017, F. Said et al. [14] proposed a method to retrieve the maximum hurricane wind speed using the simulated power-versus-delay waveform data of CYGNSS. The CYGNSS end-to-end simulator (E2ES) [7] was used to produce the reference simulated waveforms. The specific process was to compare the simulated waveform with reference waveforms generated over a set of synthetic Willoughby storms with known maximum wind speed (Vmax) through a matched filter and to output the Vmax corresponding to the matching reference waveform; this Vmax was the retrieved wind speed. Comparing the retrieved Vmax values of 552 hurricane events with the hurricane weather research and forecasting model (HWRF) Vmax and the Best Track Vmax, the overall bias for wind speeds less than 40 m/s was greater than 11 m/s, and the overall bias for wind speeds greater than 40 m/s was less than 3 m/s. However, the number of hurricane wind speed samples studied was not sufficient. In 2019, Al-Khaldi et al. [15] extended the simulation study of [14] to the use of the full CYGNSS DDM. A matched filter approach between normalized simulated DDMs and measured DDMs was applied to retrieve storm parameters. Vmax estimates were retrieved using the data collected during Hurricane Irma. Compared with the reported National Hurricane Center Best Track forecasts, the RMSE was 6.89 m/s. In 2021, the same team [16] carried out a progress update and error analysis of the research performed in [15]. They continued to use the full CYGNSS DDM and proposed to retrieve wind speed with a synthetic storm model on the basis of [15]. The synthetic storm model included the Willoughby model and the Generalized Asymmetric Holland Model (GAHM). The success of the inversion was due to the combination of the GAHM model, suitable for storms with low levels of development, and the Willoughby model, suitable for storms with higher levels of development. The retrieved Vmax was obtained by combining the results of the two models. Compared with the Best Track forecasts, the RMSE was 6.05 m/s, a partial improvement over [15]. The effects of the measurement delay extent on the retrieval error were also analyzed.

In 2019, Chong Wu et al. [17] used a back propagation (BP) neural network to invert wind speeds from 0 to 30 m/s, based mainly on DDM data from CYGNSS. The DDM observables included DDMA, LES, and the Bistatic Radar Cross Section (BRCS). The paper used the CYGNSS L2 wind speed data as the reference wind speed. The Pearson correlation coefficient between the inverted wind speed and the CYGNSS wind speed product was 0.958, the RMSE was 1.86 m/s, and the mean relative error was 2.66%. The feasibility and effectiveness of wind speed inversion using a neural network based on the DDM was demonstrated. However, the amount of data for wind speeds greater than 20 m/s in that paper was small, so the applicability of the neural network to high wind speed data could not be confirmed. In the same year, Han Gao et al. [18] used eight observables in CYGNSS L1 data (DDMA, LES, TES, specular reflection point position, satellite altitude angle, scatter area, delay-Doppler correlation power mean, and effective area) to train a model with a BP neural network, and then compared the inverted wind speed with the wind speed data provided by ECMWF. When the wind speed was less than 20 m/s, the RMSE was 1.21 m/s, and the RMSE in the wind speed range of 20~45 m/s was 2.54 m/s. However, that paper only had 4761 wind speed data points above 20 m/s, which was not enough for high wind speed training.

In 2020, Jennifer et al. [10] proposed an Artificial Neural Network (ANN) algorithm for wind speed inversion based on CYGNSS satellite data. Six characteristic parameters (DDMA, LES, incidence angle, Range Corrected Gain (RCG) [7], and the latitude and longitude of the specular point acquisition) were used to train the ANN model, and CYGNSS L2 wind speed data were used as the reference wind speed. The RMSD of the wind speed inversion error for the range 0~32 m/s was 1.51 m/s. However, the wind speeds in that paper were mainly concentrated in 0–20 m/s, and wind speeds above 20 m/s were not studied sufficiently, so good inversion results could not be obtained for tropical storms. In the same year, Wang [19] performed a comparison between neural network and machine learning methods using Tech Demo Sat-1 (TDS-1) satellite DDM data and ECMWF data for wind speeds in the 3–18 m/s interval. It was verified that the neural network model had a significant advantage, with a 20% performance improvement.

In 2020, Cardellach et al. [20] combined CYGNSS uncalibrated Level-1 raw observation counts with the ECMWF/C3S ERA5 reanalysis dataset to obtain specular reflection point wind speeds. The study covered hurricane season data for 2018 and 2019. The inversion was carried out by a variational technique based on a physical forward model. The inverted wind speed was compared with the background model, with other spaceborne sensors, such as NASA Soil Moisture Active Passive (SMAP), ESA Soil Moisture and Ocean Salinity (SMOS), and the EUMETSAT Advanced Scatterometer on board METOP (ASCAT) A/B, and with other organizations' CYGNSS wind speed retrievals. The research showed that this method is able to infer wind speed, including hurricane winds. The inverted wind speed was most consistent with the NOAA retrieval [21], whereas the lowest correlation was found with the official products obtained with the YSLF GMF, for which the dispersion reached 9.9 m/s. The authors expected this method to work at moderate wind speeds, but noted that it may underestimate wind speed.

According to the above research results, it can be found that machine learning has been widely used in the inversion of sea surface wind speed in the field of remote sensing at present; however, relevant studies for high wind speed greater than 20 m/s are relatively lacking [22].

In this paper, we put forward a high wind speed inversion model for CYGNSS data based on machine learning methods for the inversion of typhoons. The datasets consist of the CYGNSS measured L1 data published by the National Aeronautics and Space Administration (NASA) and the reanalyzed wind speed datasets of the ECMWF and the National Centers for Environmental Prediction (NCEP). Three methods, Support Vector Regression (SVR), the combination of PCA and SVR (PCA-SVR), and Convolutional Neural Networks (CNN), are used to train on the wind speed data above 20 m/s. Due to the uneven distribution of samples, the under-sampling method is used to extract data for training. The three models obtained after training are used to invert the high wind speeds during the life cycle of Typhoon Bavi in 2020. By comparison with the wind speed from ECMWF/NCEP, the inversion results are used to study the performance of the three models.

#### **2. Materials and Methods**

*2.1. Data Source*

#### 2.1.1. CYGNSS

The CYGNSS satellites are a constellation of eight low Earth orbit (LEO) microsatellites launched in 2016. Each satellite is equipped with a right-hand circular polarization (RHCP) antenna to receive direct signals from the transmitting satellites and two left-hand circular polarization (LHCP) antennas to receive signals reflected from surfaces such as the sea surface. The specular reflection points collected by the CYGNSS satellites cover approximately the ±40° latitude zone globally, with complete coverage in longitude. CYGNSS seeks to improve weather prediction capabilities by studying the interaction between ocean surface properties, moist atmospheric thermodynamics, radiation, and convective dynamics associated with tropical cyclones [7,9,12,14–16,20]. The CYGNSS data are encapsulated by NASA in netCDF file format, and this paper used version 2.1 of the CYGNSS Level 1 data (available online at https://podaac.jpl.nasa.gov/dataset/CYGNSS\_L1\_V2.1, accessed on 8 April 2021), which is the result of the power expression transformed from the L0 level DDM [10].

#### 2.1.2. Mean Sea Level Pressure

Mean sea level (MSL) pressure is an important factor affecting a typhoon's status and path [23]. This paper uses the MSL pressure reanalysis data product provided by ECMWF's official website. The MSL pressure reanalysis dataset calculates the atmospheric pressure on the Earth's surface, including all land, ocean, and inland water, and then adjusts the surface atmospheric pressure to the height of mean sea level. The spatial resolution of the MSL pressure dataset is 0.5°, and the temporal resolution is 1 h.

#### 2.1.3. Global Wind Speed Data

This paper used two different global reanalysis wind speed datasets, the ECMWF reanalysis dataset and the NCEP reanalysis dataset, mainly to study the 10 m-referenced ocean surface wind speed (u10), using UTC time. ECMWF regularly uses its forecasting models and data assimilation system to reanalyze archived observations and thereby create global reanalysis datasets describing the recent history of the atmosphere, land, and ocean. These datasets provide sea surface wind speed at a temporal and spatial resolution of 1 h and 0.5°. NCEP adopts a state-of-the-art global data assimilation system and a comprehensive database to quality control and assimilate observations from various sources (ground stations, ships, radiosondes, wind balloons, aircraft, satellites, etc.) to obtain reanalysis datasets, which provide sea surface wind speed at a temporal and spatial resolution of 1 h and 0.2°. Using the time, latitude, and longitude of the observed data provided by CYGNSS, the reanalysis datasets were passed through spatial linear interpolation combined with temporal linear interpolation to obtain the corresponding wind speed in time and space. This paper combined the wind speed reanalysis datasets of ECMWF and NCEP: the data from ECMWF were used when the wind speed was less than 20 m/s, and the data from NCEP were used when the wind speed was greater than 20 m/s [9,24,25]. Finally, the wind speeds were composed into new datasets according to this criterion, and the new datasets were used as the true wind speed for training and testing.
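As a rough illustration of this interpolation and merging step, the sketch below assumes the reanalysis winds have already been loaded as regular time-latitude-longitude grids; the array names and the simple handling of out-of-range points are placeholders rather than the actual ECMWF/NCEP file layout.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def interp_wind(times, lats, lons, wind_grid, sp_time, sp_lat, sp_lon):
    """Linearly interpolate a reanalysis wind grid (time x lat x lon) to the
    CYGNSS specular-point observation time and location."""
    f = RegularGridInterpolator((times, lats, lons), wind_grid,
                                bounds_error=False, fill_value=np.nan)
    return f((sp_time, sp_lat, sp_lon))

def merge_reference_wind(u10_ecmwf, u10_ncep, threshold=20.0):
    """Combine the two reanalysis products as described above:
    ECMWF below 20 m/s, NCEP above 20 m/s."""
    return np.where(u10_ecmwf < threshold, u10_ecmwf, u10_ncep)
```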

#### *2.2. Machine Learning Methods*

Three methods, SVR, PCA-SVR, and CNN, were used to train the data to obtain three models; the following sections briefly outline the principles of each method.

#### 2.2.1. SVR

SVR improves the generalization ability of a model by seeking structural risk minimization, so as to achieve the minimum empirical risk and confidence interval; good statistical rules can be obtained even from relatively few samples. The input data are normalized before SVR training to prevent the training imbalance caused by feature anomalies; normalization also improves the computational speed. The SVR algorithm first maps the input data X into a multidimensional space in a nonlinear way and then performs linear regression in that space. The selection of SVR parameters generally involves three elements: the first is the choice of kernel function, and here the radial basis function (RBF), with its better smoothing performance, is chosen; the second is the penalty factor C; the third is the kernel coefficient gamma. In order to avoid overfitting and underfitting, this paper uses the grid search method to search for C and gamma when training the model [26,27]. To improve the speed of the parameter search, the grid search method is adjusted as follows: first, the optimal parameters are roughly located over a wide range, and then a smaller step size is used to search again around this optimal parameter range.

The goal of SVR can be formalized as:

$$\begin{aligned} \min \ & \frac{1}{2}\left\|\omega\right\|^2 + C\sum\_{i=1}^{n}(\xi\_i + \xi\_i^{\*}) \\ \text{s.t. } & y\_i - \omega\phi(x\_i) - b \le \varepsilon + \xi\_i, \quad \xi\_i \ge 0 \\ & \omega\phi(x\_i) + b - y\_i \le \varepsilon + \xi\_i^{\*}, \quad \xi\_i^{\*} \ge 0 \\ & i = 1, 2, \dots, n \end{aligned} \tag{1}$$

where ω is the normal vector, which determines the direction of the hyperplane, n is the number of samples, C > 0 is the penalty parameter, ε is the error sensitivity index, and ξ_i and ξ_i* are slack variables. By using the dual principle and introducing Lagrange multipliers, the above formula is solved:

$$f(x) = \sum\_{i=1}^{n}\left(\beta\_i^{\*} - \beta\_i\right)K(x\_i, x\_j) + b \tag{2}$$

where β_i* and β_i are the Lagrange multipliers, K(x_i, x_j) is the radial basis function, and b is the threshold. Equation (2) introduces the kernel function used by the nonlinear SVR to deal with the curse of dimensionality [28].

The preprocessed training data were trained with the SVR method, and grid search determined the gamma value of the model to be 72.50 and C to be 0.09.
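A minimal sketch of this coarse-to-fine grid search with scikit-learn is given below; the parameter ranges, the cross-validation setting, and the refinement factors are illustrative assumptions, not the exact values used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

def fit_svr(X, y):
    """Normalize the features, then run a coarse grid search over C and gamma
    followed by a finer search around the coarse optimum."""
    X = MinMaxScaler().fit_transform(X)

    # coarse search over wide, logarithmically spaced ranges
    coarse = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
    best = GridSearchCV(SVR(kernel="rbf"), coarse, cv=3).fit(X, y).best_params_

    # finer search with a smaller step around the coarse optimum
    fine = {"C": np.linspace(0.5 * best["C"], 2.0 * best["C"], 7),
            "gamma": np.linspace(0.5 * best["gamma"], 2.0 * best["gamma"], 7)}
    return GridSearchCV(SVR(kernel="rbf"), fine, cv=3).fit(X, y).best_estimator_
```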

#### 2.2.2. PCA-SVR

Since a larger number of features tends to increase the model training time, PCA was used here to reduce the dimensionality of the SVR input through a secondary integration of the multidimensional feature covariates, in order to reduce the model training time and improve the independence of the feature covariates. PCA, as a data dimensionality reduction technique, projects the original features onto the dimensions that retain the maximum amount of projected information, ensuring minimal loss of information after dimensionality reduction without affecting the final model prediction results. The processed data are then fed into the SVR for prediction [29,30].

In the PCA-SVR prediction model, a total of 27 influencing factors were used as input data in this paper. The input training set was processed by PCA to obtain the principal components PC1, PC2, ..., and PCk (k ≤ 27) for model prediction, and it was found that the cumulative contribution of the first 13 principal components reached more than 85%, which could replace all feature covariates for model training, so k was 13. Then, the dimensionality reduction data were input into SVR, and the gamma value of the model was determined to be 32.50 and C was 0.37 using the grid search method. Figure 1 shows the structure of the PCA-SVR model.


**Figure 1.** PCA-SVR model structure.

When the wind speed modeling is completed and enters the wind speed inversion stage, the feature parameters of the CYGNSS test set are normalized and directly multiplied with the corresponding feature vectors to obtain the principal component parameters. Then, the trained model is used for high wind speed inversion and the inversion accuracy of the inverse wind speed is calculated.
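The PCA-SVR structure in Figure 1 can be sketched as a single scikit-learn pipeline; the 13 components and the quoted C and gamma values follow the text above, while the min-max scaler is an assumption about the normalization step.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# 27 feature covariates -> 13 principal components (>85% cumulative variance) -> SVR
pca_svr = make_pipeline(
    MinMaxScaler(),                          # normalize the input features
    PCA(n_components=13),                    # keep the first 13 principal components
    SVR(kernel="rbf", C=0.37, gamma=32.50),  # parameters found by grid search
)

# pca_svr.fit(X_train, y_train)
# wind_pred = pca_svr.predict(X_test)   # test features are scaled and projected with
#                                       # the components learned from the training set
```

During inversion the pipeline applies the training-set normalization and principal components to the test features automatically, which matches the procedure described above.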

#### 2.2.3. CNN

A CNN is a feed-forward neural network that performs well on image, audio, and text data. It is easy to update the data model by a back propagation algorithm. The CNN architecture (i.e., the number of layers and their structure) can be applied to a wide range of problems, while the hidden layers also reduce the algorithm's reliance on feature engineering. A CNN is suitable for training with large amounts of data and is capable of solving complex nonlinear problems. The complete neural network structure includes input layer, convolution layer, Relu activation function, pooling layer, fully connected layer, and output layer [19,31]. The optimizer uses the adaptive moment estimation (Adam) gradient descent algorithm instead of stochastic gradient descent (SGD) because Adam is able to adjust the learning rate of each parameter, making the parameters smooth for extracting data features. A total of Xn samples are trained and the inversed wind speed values W are output.

After a large amount of data validation, this paper finally determined the number of convolutional layers to be 3, no pooling layer was set, the convolutional kernel size was 3 × 1, dropout was 0.3, the number of convolutional kernels in each layer was 32, batch-size was 1000, and epochs were 2000. Figure 2 shows the structure of the CNN model used in this paper.

**Figure 2.** CNN model structure (Conv: Convolution layer, FC: fully connected layer).
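A minimal Keras sketch of this CNN configuration (three convolution layers with 32 kernels of size 3×1, no pooling, dropout 0.3, and the Adam optimizer) is given below; the fully connected layer width and the mean squared error loss are assumptions, since the paper does not state them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_features=27):
    """3 x Conv1D (32 kernels, size 3) -> dropout -> flatten -> FC -> wind speed."""
    model = keras.Sequential([
        keras.Input(shape=(n_features, 1)),
        layers.Conv1D(32, 3, activation="relu", padding="same"),
        layers.Conv1D(32, 3, activation="relu", padding="same"),
        layers.Conv1D(32, 3, activation="relu", padding="same"),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),  # layer width is an assumption
        layers.Dense(1),                      # inverted wind speed (m/s)
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# model = build_cnn()
# model.fit(X_train[..., None], y_train, batch_size=1000, epochs=2000,
#           validation_split=0.2)
```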

#### *2.3. High Wind Speed Inversion Process*

#### 2.3.1. Data Processing Flow

The process of high wind speed inversion in this paper can be briefly summarized into four parts: (i) determining the satellite data as well as the wind speed data to be used; (ii) preprocessing and normalizing the data; (iii) training the processed data with the three machine learning methods described above; and (iv) using the test data to invert wind speed and analyzing the performance of the inverted wind speed. The specific wind speed inversion process is shown in Figure 3.


**Figure 3.** High wind speed inversion process based on machine learning.

#### 2.3.2. Data Pre-Processing

In order to obtain good results, any data derived from a remote sensing satellite for Earth observation needs to undergo rigorous data pre-processing. The datasets were processed according to the following criteria:


Because the occurrence time of each typhoon was not continuous, the CYGNSS data used in this paper were intermittent in time. CYGNSS data from 30 June 2018 to 3 July, 27 September 2018 to 30 September, 1 January 2019 to 3 January, 3 August 2019 to 8 August, 6 October 2019 to 12 October, 24 October 2019 to 25 October, 30 October 2019, 4 November 2019 to 7 November, 2 August 2020 to 4 August, 10 August 2020 to 11 August, and 3 September 2020 were used as the training data (Figure 3), and the reanalysis wind speed datasets of ECMWF and NCEP at the corresponding times were used as the true wind speed data (Figure 3). Figure 4a,b respectively show the number of original samples and the corresponding number of final training samples for each wind speed range. The number of original samples means the number of data points left after filtering the data according to the preprocessing criteria, and the number of final training samples means the number of training set data points (including training data and validation data) used for machine learning after under-sampling the original samples.

**Figure 4.** (**a**) Original training samples histogram; (**b**) Final training samples histogram.

From Figure 4a, we can see that the wind speed samples were concentrated between 20~30 m/s, and the numbers of samples above 30 m/s and in the 20~30 m/s range were seriously unbalanced. Such an imbalance easily leads to a biased model that does not generalize. Therefore, the under-sampling method was used to randomly remove some majority samples from the training set, while ensuring that enough samples remained for training and that the amount of data in each wind speed interval was similar. Finally, with the ratio of samples between the 20~30 m/s interval and the above-30 m/s interval set to 1:1, a total of 20,648 final training samples were used for training. The specific samples are shown in Figure 4b. Subsequent model training and data research were carried out on this basis.
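The random under-sampling step can be sketched as follows, assuming the reference wind speeds (already filtered to above 20 m/s) are held in a NumPy array and the 20~30 m/s majority interval is thinned until it matches the size of the above-30 m/s interval.

```python
import numpy as np

def undersample_1to1(X, y, split=30.0, seed=0):
    """Randomly drop 20-30 m/s samples until both wind speed intervals
    contain the same number of training samples (1:1 ratio)."""
    rng = np.random.default_rng(seed)
    low = np.flatnonzero(y < split)    # 20-30 m/s majority interval
    high = np.flatnonzero(y >= split)  # above-30 m/s minority interval
    keep_low = rng.choice(low, size=len(high), replace=False)
    idx = rng.permutation(np.concatenate([keep_low, high]))
    return X[idx], y[idx]
```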

#### 2.3.3. Feature Parameter Selection

After data pre-processing, it could be found that the L1 level data products of CYGNSS include many satellite observables, such as DDMA, LES, etc., which are characteristic values that depend on wind speed as well as sea surface roughness. In a high wind speed measurement environment, especially typhoons, the sensitivity of the characteristic parameters of the two-dimensional delay-Doppler power waveform of the GNSS reflection signal to wind speed decreases, causing an increase in the wind speed measurement error. To reduce the performance error of CYGNSS in detecting typhoons, more characteristic parameters of the CYGNSS L1 datasets were extracted to optimize the accuracy of the wind measurement model.

In this paper, 27 eigenvalues related to sea surface wind speed were used, specifically: Pseudo Random Noise (PRN) satellite number, DDMA, LES, antenna gain, distance from transmitter to specular reflection point, distance from receiver to specular reflection point, specular reflection point (longitude, latitude, time, and elevation angle), QC Flag, Signal-to-Noise Ratio (SNR), GNSS-R satellite position in ECEF, GNSS satellite position in ECEF, BRCS's DDM (specular delay line and Doppler column), BRCS's DDM (peak delay line and peak Doppler column), vehicle's specular delay, corrected DDM instrument specular delay, the direct signal code phase, and MSL pressure.
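Assembling such covariates from the CYGNSS L1 netCDF files could look roughly like the xarray sketch below; the variable names (`ddm_nbrcs`, `ddm_les`, etc.) are guesses at the product's naming and cover only a handful of the 27 features, so they should be checked against the actual file before use.

```python
import numpy as np
import xarray as xr

def load_features(path):
    """Stack a few CYGNSS L1 observables into a feature matrix.
    The variable names below are assumed and must be verified against the file."""
    ds = xr.open_dataset(path)
    names = ["ddm_nbrcs", "ddm_les", "ddm_snr", "sp_lat", "sp_lon", "sp_inc_angle"]
    cols = [ds[name].values.reshape(-1) for name in names]
    X = np.column_stack(cols)
    return X[~np.isnan(X).any(axis=1)]  # drop samples with missing values
```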

#### **3. Results and Discussion**

#### *3.1. Typhoon Validation Data*

To analyze the feasibility of the three methods for wind speed inversion, the data of the Typhoon Bavi event in August 2020 were studied here. The reflected signal data collected by CYGNSS during Typhoon Bavi in the western Pacific Ocean during 2020.8.22~2020.8.26 were processed as test data (Figure 3). The reanalysis typhoon data released by ECMWF and NCEP were used as the true wind speed for the evaluation of wind measurement accuracy. Only wind speed data above 20 m/s during Typhoon Bavi were inverted here, because the training set for the machine learning methods only includes data samples with wind speed greater than 20 m/s, as shown in Figure 4. A total of 7389 samples were available for the experiment over the four days. This subsection provides a detailed analysis of the CYGNSS satellite flight tracks and the corresponding true wind speeds during Typhoon Bavi.


Figure 5a shows the location of the region for performance evaluation, and Figure 5b shows the Typhoon Bavi (2020.8.22~2020.8.26) moving track map and the daily area of interest. The CYGNSS data during 2020.8.22~2020.8.26 were first preprocessed and then analyzed specifically according to time after obtaining analyzable data. Typhoon Bavi occurred in the western Pacific Ocean. The typhoon hourly track data used in this study were collected by the Department of Water Resources of Zhejiang Province (http://typhoon.zjwater.gov.cn/, accessed on 20 June 2021). In addition, they were combined with the data distribution of CYGNSS to determine the specific typhoon area. Since there was no data in the region after preprocessing on 22 August 2020, this paper mainly studied the data from 23 August 2020 to 26 August 2020. In Figure 5b, the five-pointed star represents the starting position of the typhoon, and the dotted boxes represent the divided typhoon areas. Table 1 shows the specific selection range of each regional division.

**Figure 5.** (**a**) Location of the region for performance evaluation (world map: preview number: GS (2016) 1563); (**b**) Typhoon Bavi (2020.8.22~2020.8.26) moving track map and daily interested area.



This paper mainly focused on the data with wind speed above 20 m/s. The proportion of samples was determined in Section 2.3.2. Three measurement standards were used to compare the performance of the three models: 1. Mean Absolute Error (MAE); 2. Root Mean Square Error (RMSE); and 3. Correlation Coefficient.
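For reference, the three measures can be computed with a few lines of NumPy; this is a straightforward sketch rather than code from the paper.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, RMSE, and Pearson correlation between true and inverted wind speed."""
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, rmse, corr
```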

#### *3.2. Analysis of Overall Inversion Results*

The overall performance of the three trained models was investigated for all data during the typhoon, and Figure 6 shows the scatter plots of the true and inverse wind speed for all data during the typhoon for the three models. Table 2 shows the specific performance analysis of the three models.


**Figure 6.** (**a**) SVR model wind speed inversion results; (**b**) PCA-SVR model wind speed inversion results; and (**c**) CNN model wind speed inversion results. The color bar on the right represents data density.

**Table 2.** Model performance analysis (Correl. Coef. represents the correlation coefficient).



Firstly, it can be seen from the scatter plots in Figure 6 that all three methods could be used to invert the wind speed. In Figure 6, the SVR model inversion results clearly show the greatest dispersion, with inversion results reaching a minimum of 10 m/s. The PCA-SVR model, after adding data dimensionality reduction, partially improved the data divergence problem, but there was still a bias: true wind speeds of around 20 m/s were inverted to results around 35 m/s. The CNN model had the most concentrated scatter, with a minimum inverted wind speed of about 15 m/s. Its inversion results for the wind speed data around 20 m/s converged significantly, and the outcomes were better than those of the other two methods. In general, the CNN method showed good inversion performance.


The performance of each of the three models was analyzed in three data intervals: (i) overall; (ii) 20~30 m/s; and (iii) above 30 m/s. From Table 2, except for the MAE value of CNN above 30 m/s, which was slightly inferior to SVR, all the error results indicated that CNN had the best performance, PCA-SVR was second, and SVR was the worst. In particular, in all three data intervals, the correlation coefficients of the CNN model were the highest. Further analysis showed that, over the overall interval, the MAE of CNN was improved by 33.90%, the RMSE by 30.66%, and the correlation coefficient by 37.50% relative to SVR.
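The relative improvements quoted above are presumably computed as the percentage change of each measure relative to its SVR value, for example

$$\text{improvement}(\mathrm{MAE}) = \frac{\mathrm{MAE}\_{\mathrm{SVR}} - \mathrm{MAE}\_{\mathrm{CNN}}}{\mathrm{MAE}\_{\mathrm{SVR}}} \times 100\%$$

and analogously for the RMSE; for the correlation coefficient, the increase relative to the SVR value would be used instead.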

However, when the typhoon wind speed was greater than 30 m/s, the deviations of the wind speed values obtained from all three model inversions were large, possibly because of the lack of higher wind speed training data (>40 m/s), as shown in Figure 4b, which leads to a large bias in the inversion of typhoon data higher than 30 m/s.

#### *3.3. Analysis of Daily Inversion Results by CNN Models*

It was known from the analysis in Section 3.2 that the CNN model produced better wind speed inversion results for the overall data during the typhoon. Considering the large variation of the daily climatic environment and other factors during typhoons, which may affect the results of daily data collection from satellites for the same sea area, the CNN model was used for a specific analysis of the daily data. Figure 7 shows the daily CYGNSS satellite flight track and the corresponding CNN wind speed, while Figure 8 shows the corresponding absolute value of the wind speed inversion error (daily true wind speed minus the CNN model inverse wind speed). Table 3 shows the daily data performance results of the CNN model.


**Figure 7.** (**a**) 2020.8.23 CYGNSS satellite flight track and corresponding CNN wind speed; (**b**) 2020.8.24 CYGNSS satellite flight track and corresponding CNN wind speed; (**c**) 2020.8.25 CYGNSS satellite flight track and corresponding CNN wind speed; (**d**) 2020.8.26 CYGNSS satellite flight track and corresponding CNN wind speed. The color bar on the right represents the wind speed value, Unit: m/s.

**Figure 8.** (**a**) 2020.8.23 CYGNSS satellite flight track and corresponding inversion error; (**b**) 2020.8.24 CYGNSS satellite flight track and corresponding inversion error; (**c**) 2020.8.25 CYGNSS satellite flight track and corresponding inversion error; (**d**) 2020.8.26 CYGNSS satellite flight track and corresponding inversion error. The color bar on the right represents the wind speed value, Unit: m/s.


**Table 3.** CNN model daily data performance analysis results.

It can be seen from Figure 7 that the CNN model could be used to invert the typhoon wind speed, and the inverted wind speed can reach up to 55 m/s. Figure 8 and Table 3 show that the inversion results for 23 and 24 August 2020 had smaller errors than those for the last two days. The reason was that the true wind speeds of the first two days were mostly less than 30 m/s. The true wind speed of the data on 25 and 26 August 2020 was up to 45 m/s, and there were more data in the interval of 30 m/s to 45 m/s, so the CNN inversion results showed relatively large errors. This conclusion coincides with the results in Table 2.

The above contents have verified the accuracy of the model. Next, the comparison between the inverted wind speed and the typhoon track data is discussed. Table 4 shows the comparison between the CNN inverted wind speed, the true wind speed (ECMWF and NCEP reanalysis wind speed data), and the Beaufort scale of the typhoon track data (from the Department of Water Resources of Zhejiang Province). The approximate wind speed is similar to Best Track data. The CYGNSS samples used here were required to be within ±0.5° spatially and ±0.5 h temporally of the typhoon track data. The five datasets shown satisfied these conditions.
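A simple sketch of this space-time matching criterion is given below, assuming the typhoon track records and CYGNSS sample times are expressed in the same time units (hours) and positions in degrees; the array layout is an illustrative assumption.

```python
import numpy as np

def match_track(sp_time, sp_lat, sp_lon, trk_time, trk_lat, trk_lon,
                dt_max=0.5, dd_max=0.5):
    """Indices of CYGNSS samples within +/-0.5 h and +/-0.5 deg of any track record."""
    keep = []
    for i in range(len(sp_time)):
        near = ((np.abs(trk_time - sp_time[i]) <= dt_max) &
                (np.abs(trk_lat - sp_lat[i]) <= dd_max) &
                (np.abs(trk_lon - sp_lon[i]) <= dd_max))
        if near.any():
            keep.append(i)
    return np.array(keep, dtype=int)
```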

**Table 4.** Comparison results of wind speed data from the Department of Water Resources of Zhejiang Province.

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Beaufort scale (approximate wind speed) | 11 (30 m/s) | 12 (33 m/s) | 12 (38 m/s) | 14 (42 m/s) | 14 (42 m/s) |
| Date | Aug. 24 | Aug. 24 | Aug. 25 | Aug. 25 | Aug. 26 |
| True wind speed (m/s) | 20.07 | 20.01 | 24.99 | 33.66 | 34.00 |
| CNN wind speed (m/s) | 19.24 | 22.88 | 27.47 | 28.94 | 24.78 |
| Distance from the center of the typhoon track (km) | 56.54 | 26.03 | 57.60 | 50.91 | 66.91 |

In Table 4, comparing the CNN wind speed with the typhoon track data, the first column has the smallest deviation and the fifth column the largest. This shows that the higher the wind speed level, the larger the error, which is consistent with the results in Tables 2 and 3; the reason has been analyzed above. However, in this paper, the true wind speed (ECMWF and NCEP reanalysis wind speed data) was used as the training benchmark of the CNN model. As can be seen from Table 4, compared with the approximate wind speed (from the Department of Water Resources of Zhejiang Province) during the typhoon, the true wind speed was actually underestimated, and the inversion performance of the CNN model was therefore limited by the true wind speed.

#### **4. Conclusions**

During typhoons, environmental conditions and the high cost of collecting typhoon wind speed data make it difficult to obtain training samples for high wind speeds. DDM observables such as DDMA and LES change with wind speed. Some traditional sea surface high wind speed inversion methods use a single DDM-derived observable (DDMA or LES), the incidence angle of specular reflection, and the significant wave height as parameters to establish GMF models relating them to wind speed for wind speed inversion, which cannot fully explore the hidden features in the data. This limits the accuracy of high wind speed inversion. In order to use multidimensional data features to fully exploit the data during typhoons and to improve both the accuracy of the inversion of typhoon wind speeds over the sea area and the performance of real-time monitoring, this paper proposed a CYGNSS inversion method for high wind speed on the sea surface based on machine learning. Firstly, CYGNSS satellite data and true wind speed data from ECMWF and NCEP were used to construct the original datasets, and then three machine learning methods, SVR, PCA-SVR, and CNN, were used to train on the data greater than 20 m/s during typhoons. To avoid bias in the models, the under-sampling method was adopted to control the number of samples. Lastly, the trained models were used for the inversion of Typhoon Bavi from 23 to 26 August 2020. The following conclusions could be drawn from the experimental results.


The main difficulty of high wind speed inversion is the lack of higher wind speed samples, especially data above 40 m/s, which leads to insufficient model training. Besides this, the selection of the true wind speed used for training during typhoons is also key to the inversion performance. In the future, with the increasing amount of higher wind speed data and the use of more accurate model winds such as HWRF, GPS dropsondes, and SFMR during typhoons, the accuracy of the obtained model will improve and the error of typhoon inversion will be reduced. Eventually, real-time prediction capability for typhoons will be realized.

**Author Contributions:** Y.Z. and Y.H. conceived and designed the framework of the study. J.Y. completed the data collection and processing. Y.Z. and J.Y. completed the algorithm design and the data analysis and were the lead authors of the manuscript, with contributions from J.Y., W.M., Z.Y. and S.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (Grant No. 41871325) and the National Key R&D Program of China (Project No. 2019YFD0900805).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The raw/processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.

**Acknowledgments:** We thank NASA for the CYGNSS public data; the European Centre for Medium-Range Weather Forecasts (ECMWF) and the National Centers for Environmental Prediction (NCEP) for the reanalysis datasets; and the Department of Water Resources of Zhejiang Province for the typhoon track data. We would also like to thank Professor Yang Dongkai of Beijing University of Aeronautics and Astronautics and Li Weiqiang of CSIC-IEEC for their suggestions on GNSS-R satellite data analysis. We would like to thank Zhou Bo and Qin Jin from the Shanghai Institute of Aerospace Electronics for their suggestions on the receiver of reflected signals.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Remote Sensing* Editorial Office E-mail: remotesensing@mdpi.com www.mdpi.com/journal/remotesensing

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34

www.mdpi.com

ISBN 978-3-0365-4591-2