**Mapping Paddy Rice Using Weakly Supervised Long Short-Term Memory Network with Time Series Sentinel Optical and SAR Images**

**Mo Wang 1,2, Jing Wang 3,\* and Li Chen 1,2**


Received: 4 October 2020; Accepted: 16 October 2020; Published: 19 October 2020

**Abstract:** Rice is one of the most important staple food sources worldwide. Effective and inexpensive monitoring of rice planting areas is demanded by many developing countries. This study proposed a weakly supervised paddy rice mapping approach based on a long short-term memory (LSTM) network and dynamic time warping (DTW) distance. First, standard temporal synthetic aperture radar (SAR) backscatter profiles for each land cover type were constructed on the basis of a small number of field samples. Weak samples were then labeled on the basis of their DTW distances to the standard temporal profiles. A time series feature set was then created that combined multi-spectral Sentinel-2 bands and the Sentinel-1 SAR vertical transmitted, vertical received (VV) band. With different combinations of training and testing datasets, we trained a specifically designed LSTM classifier and validated the performance of weakly supervised learning. Experiments showed that weakly supervised learning outperformed supervised learning in paddy rice identification when field samples were insufficient. With only 10% of field samples, weakly supervised learning achieved better results in producer's accuracy (0.981 vs. 0.904) and user's accuracy (0.961 vs. 0.917) for paddy rice. Training with 50% of field samples also showed improvement with weakly supervised learning, although less prominent. Finally, a paddy rice map was generated with the weakly supervised approach trained on field samples and DTW-labeled samples. The proposed data labeling approach based on DTW distance can reduce field sampling cost since it requires fewer field samples. Meanwhile, validation results indicated that the proposed LSTM classifier is suitable for paddy rice mapping where variance exists in planting and harvesting schedules.

**Keywords:** paddy rice mapping; dynamic time warping; LSTM; weakly supervised learning; cropland mapping

### **1. Introduction**

Rice is an important staple food source for more than half of the global population [1]. In Asia in particular, the rice planting area was 146 million ha in 2018, accounting for 88% of the world's total (FAOSTAT, 2020). The demand for rice over the next 30 years is anticipated to increase by 90% in Asia [2]. Monitoring rice planting is of great importance to global food security and informed policymaking.

Remote sensing has provided timely and efficient means for agricultural monitoring for decades. Paddy rice is the only crop that grows under wetland conditions [3]. Its growing cycle comprises three phases: the vegetative, reproductive, and ripening phases, which are further divided into 10 growth stages. The growing cycle of paddy rice features unique optical reflectance and radar scattering characteristics compared to other crops due to the abundant water conditions in the early stages.

Many studies have employed optical (e.g., Landsat, Moderate Resolution Imaging Spectroradiometer (MODIS), SPOT, and Sentinel-2), microwave (e.g., RADARSAT, Phased Array Type L-band Synthetic Aperture Radar (PALSAR), and Sentinel-1), or a combination of the two remote sensing data sources to map rice-growing areas at regional or continental scales. Dong and Xiao [4] sorted rice mapping approaches into four categories, i.e., (1) reflectance data and image statistic-based approaches, (2) vegetation index (VI) data and enhanced image statistic-based approaches, (3) VI or radar backscatter-based temporal analysis approaches, and (4) phenology-based approaches through remote sensing recognition of key growth phases. The essence of rice mapping approaches is classification models using single (categories 1 and 2) or multiple temporal (categories 3 and 4) remote sensing images. A single temporal image has limitations in discriminating paddy rice from other land cover types due to similar remote sensing signals at certain growing stages and mixed pixels [5]. Approaches based on multi-temporal data and phenology-based classifiers are attracting more research interest for paddy rice mapping and other cropland mapping tasks [5–9].

Flooding and transplanting signals are unique features of paddy rice and are often used as key signatures to identify rice fields. Xiao et al. [7,10] generated rice maps for China and Southeast Asia with time series MODIS data by identifying flooding and transplanting phases using the relationship between the normalized difference vegetation index (NDVI) (or enhanced vegetation index (EVI)) and the land surface water index (LSWI). Yin et al. [11] applied a time series vegetation index to detect flooding and transplanting phases using fused data from MODIS and Landsat images. Besides vegetation and water indices, synthetic aperture radar (SAR) datasets have also been used to detect flooding and transplanting signals. Torbick et al. [12] and Zhang et al. [13] used PALSAR data to identify rice growing stages. Other phenology-based approaches with SAR data sources include European Remote Sensing Satellite-1 (ERS-1) [14] and RADARSAT-2 [15]. More recently, the European Space Agency (ESA) Sentinel-1 mission has been frequently used as a SAR data source for paddy rice mapping [8,9,16,17], as it provides higher spatial resolution (10 m) and more frequent revisits (6 days). It should be noted that optical remote sensing data are often insufficient for constructing an ideal time series dataset due to cloud cover in rice-growing regions. Therefore, radar remote sensing (e.g., SAR) is better suited for time series analysis since it is independent of weather conditions.

Many multi-temporal or phenology-based paddy rice mapping studies treated all rice fields as synchronized in phenology and input multi-temporal data into classifiers simultaneously. This practice ignores differences in farmers' planting schedules and variation in the growing periods of different rice cultivars. For example, Le Toan et al. [18] observed a shift of up to 6–8 weeks between fields within a region in Indonesia. This could lead to classification errors in regional rice mapping with medium and high spatial resolution images. Some studies [19–21] attempted to address this problem with dynamic programming algorithms, such as dynamic time warping (DTW). The authors classified cropland on the basis of the DTW distance of the NDVI of each pixel to samples, using a threshold or a simple classifier. To obtain a complete NDVI time series dataset, they either had to reconstruct the data under cloud cover or use existing data products, which often are not available. The reconstruction of data for cloudy days could introduce uncertainties into the final time series dataset. Radar remote sensing data, on the other hand, have greater potential for finding phenological alignment using dynamic programming algorithms. Yet, few cropland mapping studies have used such a strategy, except for Li and Bijker [22], who focused on vegetable classification.

Consistent with general land cover classification development, various supervised and unsupervised classification algorithms have been applied for paddy rice mapping, such as maximum likelihood [23,24], support vector machines (SVM) [25–27], random forest (RF) [27,28], and decision trees (DT) [29]. In recent years, deep learning methods have shown great capability in fields such as computer vision and remote sensing image classification [30,31]. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are the most widely adopted algorithms for remote sensing classification tasks, including rice mapping. Compared to shallow learning methods (such as SVM, RF, and DT), deep learning models rely more on abundant training samples to make reliable predictions and to avoid over-fitting.

Kussul et al. [31] presented one of the early attempts at using deep learning models for cropland classification. The authors proposed a 1-D CNN model trained on multiple Sentinel-1 and Landsat images sensed in the growing season. A total of 547 polygons of training samples were selected through extensive field surveys. Zhang et al. [32] devised a CNN-based approach that classifies small patches for each pixel with fused multi-temporal Landsat and MODIS NDVI data, along with a synthetic phenological variable. They acquired training samples on the basis of existing land use/land cover (LULC) maps and Google Earth. Sun et al. [33] proposed a long short-term memory (LSTM) recurrent neural network (RNN) model for yearly cropland classification of North Dakota, USA, using Landsat imagery. More than 10 million samples were selected to train the LSTM model on the basis of the Cropland Data Layer (CDL). All these deep learning-based cropland classification approaches rely on either a trustworthy labeled training data source or a laborious field survey, which might not be available or may be too expensive for some projects.

Given the concerns of variance in rice cultivation schedules and the abundant training samples needed by deep learning models, we propose a paddy rice mapping approach that labels weak samples using DTW distance on time series SAR data and then feeds them into an LSTM network classifier. First, standard SAR backscatter profiles for each land cover type were constructed on the basis of a relatively small number of paddy rice pixel samples. The polarization of the SAR data was carefully selected as the one that best reflected flooding and transplanting signals of the rice growing stages. DTW distances from each pixel of the study area to the standard profiles were then computed, and the *k*-nearest pixels were selected as weak samples. Using field samples and weak samples as a training set, we trained an LSTM model with time series multispectral and SAR bands as input. This weakly supervised deep learning approach can benefit from the strong learning ability of RNN models on time series data at a lower field sampling cost.

### **2. Materials and Methods**

### *2.1. Study Area and Materials*

### 2.1.1. Study Area

The study area is Tongxiang County of China's Zhejiang Province, chosen because it is one of China's major rice-growing regions. The county is located in the Hangjiahu Plain of East China, with an area of 727 km<sup>2</sup> (Figure 1). The county has a subtropical monsoon climate: the annual number of days with mean temperature ≥10 °C is 217–222, and the accumulated temperature is 4772–4960 °C in degree-days. Rainfall and the river network of the region provide adequate irrigation for single/double-cropping systems. Motivated by the agricultural economy, only single-cropping of rice is practiced in Tongxiang. The rice growing season of the study area is from May to November; however, rice harvest can happen as early as October and as late as December.

The study area meets the requirements for testing our method for several reasons. First, it represents the typical geography of rice cultivation regions in East China, with small to medium field sizes. Second, the weather of the region exemplifies a scenario wherein optical remote sensing data are often not applicable due to cloud cover during the rice-growing season. Third, the size of the region is suitable for rice mapping with medium- to high-resolution images and deep learning models that require substantial training data.

*Agriculture* **2020**, *10*, x FOR PEER REVIEW 4 of 19

**Figure 1.** Location of the study area.

### 2.1.2. Datasets and Preprocessing

### Time Series SAR Data

Time series SAR data were used to compute similarities between pixels and land cover samples on the basis of sequence alignment. Time coverage of the SAR data was from April to December, extending the normal cultivation period (May to November) by one month at each end. This was because abnormal cultivation schedules exist for some farmers, for example, early rice planting (e.g., before May) and late harvest (e.g., after November). A total of 23 images of the Sentinel-1A Level-1 Ground Range Detected (GRD) product, at an interval of 12 days from April to December 2018, were accessed from the ESA Copernicus Open Access Hub (https://scihub.copernicus.eu/). The Sentinel-1B data product was omitted since it was not provided by ESA for the study area. The images were sensed in C-band interferometric wide (IW) mode with dual polarization: vertical transmitted, vertical received (VV) and vertical transmitted, horizontal received (VH). Attributes of the SAR images are listed in Table 1. The radar backscatter data were used to detect key phenology stages of paddy rice, especially flooding signals of the early growing stages. According to Clement et al. [34], VV polarization slightly outperforms VH polarization in delineating flooding. Hence, VV bands were chosen to construct the SAR time series.

Several preprocessing steps were carried out using the ESA Sentinel Toolboxes software SNAP. First, subsets of the original images were generated by cropping the extent to the study area. The images were then calibrated to sigma naught (σ0) backscatter coefficients in linear units. Terrain correction was applied subsequently using Range Doppler Terrain Correction with the Shuttle Radar Topography Mission (SRTM) 3 arc-second grid. We generated 10 × 10 m pixel spacing images, which were reprojected to Universal Transverse Mercator (UTM) zone 51 N. Then, multi-temporal speckle filtering was applied to the terrain-corrected images using a 7 × 7 gamma map filter with 3 looks to reduce noise. Finally, the linear intensity values of the VV bands were converted to dB units.
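The final conversion from linear intensity to dB is a simple logarithmic transform; a minimal NumPy sketch (the function name and the `floor` guard for no-data pixels are our own assumptions, not part of the SNAP workflow):

```python
import numpy as np

def linear_to_db(sigma0_linear, floor=1e-10):
    """Convert linear sigma-naught backscatter to decibels.

    `floor` guards against taking the log of zero in masked or
    no-data pixels (an assumed safeguard, not from the paper).
    """
    return 10.0 * np.log10(np.maximum(sigma0_linear, floor))

# Example: three typical linear backscatter values
vv = np.array([0.01, 0.1, 1.0])
print(linear_to_db(vv))  # → [-20. -10.   0.]
```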

### Sentinel-2 Time Series Images

Sentinel-2 multispectral images covering the rice growing season were obtained from the ESA Copernicus Open Access Hub. After screening for cloud cover, only 11 cloud-free dates were usable. Top-of-atmosphere Level-1C data products of those dates were downloaded. Some preprocessing steps were also required for the Level-1C product. First, atmospheric, terrain, and cirrus corrections were performed using the Sen2Cor v2.8 tool provided by ESA to obtain the bottom-of-atmosphere Level-2A product. Among the 13 spectral bands of Sentinel-2 images, we selected 10 bands (B2-B8, B8a, B11, and B12) as input features, including visible, near-infrared (NIR), and short-wave infrared (SWIR) bands. The reason for including the SWIR bands is that they are sensitive to leaf water content [35], which can help differentiate rice fields from other crops. B2-B4 and B8 have a resolution of 10 m. The remaining six bands (20 m resolution) were resampled to 10 m by nearest neighbor interpolation using the processing module in SNAP. Table 2 lists the attributes of the Sentinel-2 images used.
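Nearest-neighbor resampling of a 20 m band onto the 10 m grid amounts to duplicating each pixel into a 2 × 2 block; a simplified NumPy sketch of the idea (the function name is ours, and SNAP's resampling operator additionally handles georeferencing that this toy version ignores):

```python
import numpy as np

def resample_nn_2x(band20m):
    """Nearest-neighbor upsampling of a 20 m band to 10 m pixel
    spacing by repeating each pixel into a 2x2 block."""
    return np.repeat(np.repeat(band20m, 2, axis=0), 2, axis=1)

# A tiny 2x2 "band" becomes a 4x4 array with duplicated pixels
b11 = np.array([[1, 2],
                [3, 4]])
print(resample_nn_2x(b11).shape)  # → (4, 4)
```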




**Table 2.** Attributes of optical images.

### Field Sampling Data

Field sampling was conducted to collect samples belonging to broad land cover categories, i.e., paddy rice fields, water bodies (rivers, lakes, and reservoirs), construction (buildings, roads, bare surfaces, and paved surfaces), vegetation (forest, shrubs, orchards, and other planted areas), and other croplands. A total of 506 sample sites (points of interest, POIs) scattered across the study area were selected for the five land cover types. The sample sites were chosen to follow a homogeneous spatial distribution so as to cover cultivation variation between sub-regions. A geo-referenced field photo was taken for each POI in November 2018. Locations of the geo-referenced photos were converted into kmz format and imported into Google Earth. Polygons of the sample sites (areas of interest, AOIs) were then delineated with the assistance of Google Earth and the reference photos. Boundaries of the polygons could be distinguished on high-resolution satellite images of 2018 from Google Earth. The land cover type of each AOI was assigned from the field photo taken at that site (as shown in Figure 2). A total of 506 AOIs (one for each sample site) were created and then rasterized into a GeoTIFF image at 10 m resolution. The final AOIs comprise rice field (4569 pixels), water body (4805 pixels), vegetation (4372 pixels), construction (6094 pixels), and other croplands (4592 pixels).

**Figure 2.** Illustration of the field sampling process. Images (**a**) (paddy rice field) and (**b**) (other croplands) are example photos of points of interest (POIs). Their locations are shown in the Google Earth image on the left. Boundaries of fields (**a**,**b**) were delineated with the Google Earth image. Likewise, areas of interest (AOIs) for the other three land cover types were also created. Note that the dates of the Google Earth image and the field photos are not the same but were taken in the same year.

### *2.2. Methodology*

We integrated time series optical and radar remote sensing data to best capture the phenological characteristics of paddy rice. The time series data were fed into an RNN to cope with non-linear relationships and temporal correlation. The framework of this study is shown in Figure 3. First, a labeling strategy was developed. Following the preprocessing steps, standard SAR backscatter profiles for each land cover type were created with a relatively small number of field samples. Pairwise DTW distances to the standard SAR backscatter profile of each land cover type were computed for the whole study area. The top-*k* nearest pixels were then screened out as weakly labeled samples with the highest confidence. A specifically designed LSTM network was trained with integrated input features of optical and SAR data, using field samples (supervised) or optionally combined with DTW-labeled samples (weakly supervised). Finally, a supervised or weakly supervised approach was to be adopted on the basis of experiment results to generate a paddy rice map.


**Figure 3.** The framework of mapping paddy rice with the dynamic time warping (DTW)-labeling strategy and weakly supervised classification with the long short-term memory (LSTM) deep learning classifier.

### 2.2.1. Standard Time Series SAR Backscatter Profiles

To simulate the scenario wherein only a small number of field samples are available, we took 10% of the field sample pixels of each land cover type as a base to compute standard SAR backscatter profiles. SAR backscatter values for each date were extracted from the time series of the 23 Sentinel-1 SAR backscatter coefficient images from April to December. For each date and land cover type, the standard SAR backscatter value was computed by averaging over the selected base samples (Figure 4). The five land cover types had different levels of SAR backscatter in dB. Water body had a distinctly low radar backscatter profile with little temporal variance. In contrast, construction had the highest radar backscatter values. Vegetation and cropland were similar in radar backscatter level. Compared to other cropland, rice fields showed greater variance due to temporal change in the structure of the canopy and that of the underlying soil surface.
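The construction of the standard profiles, a per-class and per-date average over the base samples, can be sketched as follows (the array shapes and the helper name are our assumptions):

```python
import numpy as np

def standard_profiles(series, labels):
    """Mean backscatter profile per land cover class.

    series: (n_pixels, n_dates) array of dB values extracted from the
    23-date Sentinel-1 stack; labels: (n_pixels,) class codes of the
    base samples. Returns {class: (n_dates,) standard profile}.
    """
    return {c: series[labels == c].mean(axis=0) for c in np.unique(labels)}

# Toy example: 4 pixels, 3 dates, two classes (0 = water, 1 = rice)
series = np.array([[-20., -21., -20.],
                   [-22., -21., -22.],
                   [-12., -17., -9.],
                   [-14., -17., -11.]])
labels = np.array([0, 0, 1, 1])
profiles = standard_profiles(series, labels)
print(profiles[1])  # → [-13. -17. -10.]
```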

**Figure 4.** Time series SAR backscatter profiles for the five land cover types.

### 2.2.2. DTW Distance-Based Sampling

The dynamic time warping (DTW) algorithm uses dynamic programming techniques to find the optimal alignment and the minimal distance between two temporal sequences. The algorithm was originally developed for speech recognition to cope with different speaking speeds [36] and has been introduced into time series remote sensing image analysis in recent years [19,20,37,38]. Two sequences are aligned by non-linearly warping, i.e., stretching or shrinking, the time dimension. In our case, DTW can deal with temporal distortion in the rice-growing phases of different rice fields or rice varieties. The mathematical principle and a detailed procedure for implementing DTW were elaborated by Berndt and Clifford [39].

The basic idea is illustrated in Figure 5, where A and B are two SAR backscatter time series of rice field samples. The DTW algorithm finds the alignment that yields the minimum accumulated distance over all time steps. In this example, the planting date of B was later than that of A. The DTW algorithm aligns the phases of these two SAR time series by shifting with the corresponding time gap. As a result, the similarity between the SAR time series curves is measured fairly, regardless of planting date differences and rice varieties.
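The accumulated-distance recursion described above can be sketched in a few lines of NumPy. This is the textbook dynamic-programming formulation with absolute difference as the local cost, not the authors' exact implementation (the study used Pierre Rouanet's `dtw` module):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D
    sequences; absolute difference is used as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# A backscatter curve shifted by two time steps (a later planting
# date) still scores closer under DTW than under a rigid
# element-wise comparison:
a = np.array([-15., -15., -18., -12., -8., -8.])
b = np.array([-15., -15., -15., -15., -18., -12.])
print(dtw_distance(a, b) <= np.abs(a - b).sum())  # → True
```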


**Figure 5.** Illustration of DTW alignment between two time series SAR backscatter curves.

For each land cover type, pairwise DTW distances to its standard SAR time series were computed. The shorter the distance, the more likely a pixel belongs to the corresponding land cover type. A top-*k* nearest strategy was applied to select weak samples, i.e., the pixels with the highest confidence of belonging to a certain land cover type. The decision on the number of screened samples was made on the basis of the rough size of the rice planting area in the study area. A larger number of samples may enhance the training of the deep learning model; however, it could bring more uncertainty in the sample labels, i.e., lower sample quality. The number of labeled samples should therefore be a small portion of the total rice planting area. In practice, the DTW Python module by Pierre Rouanet (https://github.com/pierre-rouanet/dtw) was adopted.
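Screening the top-*k* nearest pixels per class is a simple partial selection; a sketch of the step (the function name is our own):

```python
import numpy as np

def top_k_weak_samples(dtw_dist, k):
    """Indices of the k pixels nearest (smallest DTW distance) to a
    class's standard profile; these pixels become weak labels.
    np.argpartition avoids a full sort of the whole study area."""
    return np.argpartition(dtw_dist, k)[:k]

# Toy distances for 5 pixels: pixels 1 and 3 are the 2 nearest
d = np.array([5.0, 0.2, 3.1, 0.9, 7.4])
print(sorted(top_k_weak_samples(d, 2)))  # → [1, 3]
```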

### 2.2.3. LSTM Deep Learning Classifier

LSTM [40] is an RNN architecture designed to avoid the long-term dependency problem. Compared to feed-forward neural networks, an RNN has feedback connections that deal with temporal dynamic behavior: it uses an internal state (memory) from the previous time step as one of the inputs for the current time step. Traditional RNNs often encounter the vanishing gradient problem while processing long-sequence data. LSTM, on the other hand, avoids this problem by adding gates to control the cell state. A common LSTM unit has an input gate, an output gate, and a forget gate. A standard RNN unit and an LSTM unit are illustrated in Figure 6.

*Agriculture* **2020**, *10*, x FOR PEER REVIEW 9 of 19

**Figure 6.** The architecture of standard recurrent neural network (RNN) cell and LSTM cell [41].

Gates in LSTM cells use a sigmoid activation function to decide whether to keep a feature: "0" means the gate blocks its input, and "1" means the gate allows its input to pass through. The equations of the three gates are

$$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i) \tag{1}$$

$$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f) \tag{2}$$

$$o_t = \sigma(w_o[h_{t-1}, x_t] + b_o) \tag{3}$$

where $i_t$, $f_t$, and $o_t$ denote the input gate, forget gate, and output gate, respectively; $\sigma$ denotes the sigmoid function; and $w_x$, $h_{t-1}$, $x_t$, and $b_x$ denote the weights of the respective gate neurons, the output of the previous LSTM block, the input at the current timestamp, and the biases of the respective gates, respectively.

The cell state and final output are defined by the equations

$$\widetilde{c}_t = \tanh(w_c[h_{t-1}, x_t] + b_c) \tag{4}$$

$$c_t = f_t \times c_{t-1} + i_t \times \widetilde{c}_t \tag{5}$$

$$h_t = o_t \times \tanh(c_t) \tag{6}$$

where $c_t$ and $\widetilde{c}_t$ denote the cell state and the candidate cell state at time $t$, and $h_t$ denotes the final output (hidden state) of the LSTM cell.
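Equations (1)–(6) can be traced in a toy NumPy implementation of a single LSTM time step. The function and dictionary keys below are illustrative, not part of the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM time step following Equations (1)-(6).
    w and b are dicts of gate weights/biases acting on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    i_t = sigmoid(w["i"] @ z + b["i"])         # input gate,  Eq. (1)
    f_t = sigmoid(w["f"] @ z + b["f"])         # forget gate, Eq. (2)
    o_t = sigmoid(w["o"] @ z + b["o"])         # output gate, Eq. (3)
    c_tilde = np.tanh(w["c"] @ z + b["c"])     # candidate,   Eq. (4)
    c_t = f_t * c_prev + i_t * c_tilde         # cell state,  Eq. (5)
    h_t = o_t * np.tanh(c_t)                   # hidden state, Eq. (6)
    return h_t, c_t
```

The forget gate `f_t` scales how much of the previous cell state survives, which is what lets gradients flow across long sequences without vanishing.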

Considering the large span of the time series data, we developed an LSTM network to identify paddy rice among other land cover types. The core of the model is a stack of two LSTM layers, one fully connected layer, and two dropout layers, as shown in Figure 7. The output layer uses the softmax activation function to predict the likelihood that a pixel belongs to each land cover type. Categorical cross-entropy was chosen as the loss function for two reasons. First, cross-entropy losses have been shown to yield large performance improvements in models with softmax outputs [42]. Second, the training samples for each class were fairly balanced, which keeps the network unbiased towards any class; loss functions designed for imbalanced datasets, such as focal loss, were therefore not considered. The parameters used to train the network are shown in Table 3. The model was implemented in Python 3.7 with the TensorFlow library.
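As an illustrative sketch only, a network of this shape could be assembled in TensorFlow/Keras as below. The layer widths, dropout rates, and optimizer are placeholders, not the values from Table 3:

```python
import tensorflow as tf

N_TIMESTEPS = 11  # valid Sentinel-2 dates in the time series
N_FEATURES = 11   # 10 optical bands + 1 SAR band
N_CLASSES = 5     # paddy rice, water, construction, vegetation, other crops

# Two stacked LSTM layers, two dropout layers, one fully connected
# layer, and a softmax output, as in Figure 7. Sizes are placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

`return_sequences=True` on the first LSTM layer passes the full hidden-state sequence to the second layer, so only the last timestamp's representation reaches the dense layers, matching the "prediction at the last timestamp" design in Figure 7.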


**Figure 7.** LSTM network structure. The inputs of the classifier are pixel values of 11 bands (10 multispectral bands and 1 SAR band). At time *t*, the classifier is fed with the band values of image *t*, and the LSTM layers receive the cell state and hidden state from the previous timestamp *t* − *1*. The final prediction is made by the output layer of the last timestamp.

**Table 3.** Parameters of proposed LSTM classifier.


The input multispectral features (10 bands) of the classification model were carefully selected from optical imagery, as described in Section 2.2.2. Meanwhile, SAR backscatter signals provided valuable information to discriminate paddy rice fields, so the SAR backscatter coefficient was added as an input feature. It should be noted that the sensing dates of Sentinel-1 SAR images did not coincide with those of Sentinel-2 optical images; the SAR image with the closest sensing date to each optical image was selected. This strategy is feasible since the time difference between the two data sources was minor (less than 5 days). The SAR data were normalized to the same range (0, 1) as the optical bands. The final input features comprised 10 optical bands and 1 SAR band.
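A minimal sketch of this preprocessing step is shown below. The function names are illustrative, and per-scene min-max scaling is an assumption about how the (0, 1) normalization was carried out:

```python
import numpy as np

def minmax_scale(band, eps=1e-12):
    """Rescale a SAR backscatter band to the (0, 1) range of
    the optical reflectance bands."""
    lo, hi = band.min(), band.max()
    return (band - lo) / (hi - lo + eps)

def stack_features(optical, sar_db):
    """Stack 10 optical bands with 1 normalized SAR band per timestamp.
    optical: (T, H, W, 10) reflectance in [0, 1]; sar_db: (T, H, W) in dB."""
    sar = minmax_scale(sar_db)[..., np.newaxis]
    return np.concatenate([optical, sar], axis=-1)   # -> (T, H, W, 11)
```

Scaling the SAR channel into the same range as the optical bands keeps any one input feature from dominating the gate activations during training.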

### 2.2.4. Experiment Design

We designed three experiment schemes on the basis of the number of field samples; in each scheme, both supervised and weakly supervised learning were tested. Schemes 1 and 2 are comparative experiments validated against a fixed test set with a sufficient number of samples. The test set was randomly selected from the field samples remaining after those used for computing the standard SAR backscatter profiles, and accounted for 50% of all field samples. In scheme 1, 10% of field samples and 5000 DTW-labeled samples of each type were used for training; the 10% was the same set of samples used to compute the standard SAR backscatter profiles in Section 3.1. Scheme 1 was designed to test the case where only a small number of field samples are available. In scheme 2, the training samples were the remaining half of the field samples (apart from the test set) plus 2000 DTW-labeled samples of each type. Scheme 3 used 80% of field samples and 2000 DTW-labeled samples of each type for training; it was designed to compare weakly supervised with supervised learning under relatively adequate field samples. The details of the experiment schemes are listed:


### **3. Results**

### *3.1. DTW Distance-Based Sampling Results*

Using the SAR time series of each pixel in the study area and the five standard SAR profiles, we generated five pairwise DTW distance maps (Figure 8). Each map depicted the likelihood that image pixels belong to the corresponding land cover type. The lower the value, the more likely the pixel belonged to the corresponding land cover type. For water body and construction ((b) and (d), respectively, in Figure 8), their DTW distance maps show visibly different patterns than the other three. This can be explained by the SAR backscatter differences between vegetation and non-vegetation. In some parts of the map, DTW distances for paddy rice ((a) in Figure 8) and other crops ((e) in Figure 8) show a similar pattern of low value. Those pixels are likely to be agricultural land whose SAR profile lies between the standard paddy rice profile and other-crop profile. This ambiguity was avoided by the top-*k* nearest strategy to pick out the weak samples with the highest confidence.

For each land cover type, the top-*k* (5000 or 2000, depending on the experiment) pixels with the lowest DTW distance were selected as weak samples. No conflict of labels was found among the selected weak samples, since the number of weak samples picked was small relative to the pixel number of the whole study area.

### *3.2. LSTM Classification Results*

We evaluated the experiments with four metrics, namely, overall accuracy (OA), producer's accuracy (PA) for paddy rice, user's accuracy (UA) for paddy rice, and the Kappa coefficient. UA evaluates how often a class shown on the classification map is actually present on the ground, and PA evaluates how often real land cover types in the study area are correctly shown on the classification map; both can be derived from the confusion matrix. PA and UA for paddy rice were of most concern since our goal was solely paddy rice mapping. The evaluation of the three experiment schemes is shown in Table 4 (PA and UA for paddy rice are emphasized in bold here and in the tables thereafter).
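These four metrics can be derived from a confusion matrix as in the following sketch; treating rows as reference (ground truth) classes and columns as predictions is an assumed convention, and the function name is illustrative:

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, per-class PA/UA, and Kappa from a confusion matrix whose
    rows are reference classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                   # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)           # producer's accuracy
    ua = np.diag(cm) / cm.sum(axis=0)           # user's accuracy
    # Expected chance agreement for Cohen's Kappa
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, pa, ua, kappa
```

For example, a two-class matrix `[[90, 10], [20, 80]]` gives OA 0.85 and Kappa 0.7, while PA for the first class is 0.9 (90 of 100 reference pixels recovered) and UA is 90/110 (90 of 110 predicted pixels correct).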

Paddy rice PA and UA for supervised learning in scheme 1 were 0.904 and 0.917, respectively, whereas those for weakly supervised learning were 0.981 (improved by 0.077) and 0.961 (improved by 0.044), respectively. However, supervised learning surpassed weakly supervised learning in OA (0.937 to 0.854) and Kappa (0.921 to 0.817), which indicates that weak samples introduced errors in land cover types other than paddy rice. Further analysis of the classification results for scheme 1 is listed in Tables 5 and 6. Compared with supervised learning, PA and UA for paddy rice were notably improved, while PA and UA for vegetation and other crops were depressed. The expanded weak samples might have introduced label errors for vegetation and other crops, since these types can have similar time series SAR profiles (Figure 4). Scheme 1 shows that with a small number of field samples for training, the proposed weakly supervised approach brought an evident improvement in paddy rice identification, at the cost of possibly worse identification of other land cover types.


**Figure 8.** DTW distances of SAR time series to standard SAR backscatter profiles. (**a**–**e**) are five DTW distance maps for each land cover type respectively.


**Table 4.** Evaluation of three experiment schemes.

**Table 5.** Confusion matrix for supervised learning in scheme 1.


**Table 6.** Confusion matrix for weakly supervised learning in scheme 1.


Scheme 2 used more field samples for training than scheme 1. The results again showed improvement with weakly supervised learning in paddy rice PA and UA, although not as much as in scheme 1: PA improved by 0.017 from 0.968 to 0.985, and UA improved by 0.021 from 0.972 to 0.993. This can be explained by the classification model already being fairly well fitted when trained with 50% of field samples. No evident improvement was observed in scheme 3, where 80% of field samples were used for training; paddy rice PA was slightly improved, but considering the relatively small sample number of the test set, this improvement was negligible.

### *3.3. Paddy Rice Map*

On the basis of the model evaluation, we implemented the weakly supervised approach in scheme 3 to generate a paddy rice map (Figure 9). The resulting map shows that the majority of paddy rice cultivation was in the north and west of the county. The total area of mapped paddy rice fields was 13,998.8 ha, slightly less than the government-published figure of 14,666 ha. Closer examination of four selected areas (Figure 10) showed that the resulting map depicts paddy rice fields with fair accuracy. Meanwhile, the precise boundaries of contiguous rice fields could not be delineated due to the resolution of the data source: boundaries between rice fields are generally 1–3 m in width and hence are unfeasible to map with 10-m resolution images. These narrow boundaries were also counted as paddy rice fields in the produced map.

**Figure 9.** Classification map for paddy rice with weakly supervised LSTM deep learning classifier.


**Figure 10.** Classification result inspection on selected areas with high-resolution satellite images. **A1** is the high-resolution remote sensing image and **A2** is its classification result; likewise for **B**–**D**. Images (**A1**,**B1**) were sensed in May 2018, (**C1**) was sensed in July 2018, and (**D1**) was sensed in December 2018. By visual interpretation, paddy rice fields in (**A1**,**B1**,**C1**) are displayed in green, while in (**D1**), paddy rice fields are displayed in yellow since the crop was harvested.

As we were focusing on paddy rice mapping, the classification accuracy of other land cover types was of less concern. With visual interpretation, water and construction types were well identified, while vegetation and other crops were mixed up in some areas; in some cases, these two broad categories could have a similar time series pattern in terms of spectral and radar signals.

### **4. Discussion**

This study mapped paddy rice fields using a weakly supervised LSTM deep learning network with time series optical and SAR remote sensing data. The motivation of this study was to show that deep learning classifiers have a strong capability in crop mapping but often require a large number of training samples. To solve this problem, a DTW distance-based sampling method was developed to handle the time alignment in rice cultivation and to label weak samples. Experiments with different numbers of training samples showed that DTW distance-based sampling improved classification results for paddy rice when field samples were insufficient.

Previous studies, e.g., Zhong et al. [43] and Sun et al. [33], have applied RNNs in crop mapping, although not specifically for rice mapping. The effectiveness of using RNNs to classify croplands has been proven. However, the experimental results of those studies are not comparable to ours, since different datasets and experiment schemes were used, and those studies relied on high-quality field survey data to provide large numbers of training samples. Our experiments showed that with weakly supervised learning, only 10% of field samples as training data achieved comparable results (PA 0.981, UA 0.961) to supervised learning on 50% of field samples (PA 0.968, UA 0.972), and weakly supervised learning on 50% of field samples (PA 0.985, UA 0.993) had similar performance to supervised learning on 80% of field samples (PA 0.988, UA 0.987). Although in scheme 1 weakly supervised learning with weak samples impaired the differentiation of vegetation and other crops, the goal of paddy rice mapping was promoted. The results imply that DTW-labeled paddy rice, water, and construction samples had high confidence and contributed significantly to the training of the RNN classifier.

Many factors could have an impact on the identification and mapping of paddy rice with multi-temporal remote sensing data. First, the resolution of the satellite images creates a bottleneck for the precision of paddy rice mapping. The average size of paddy rice fields determines the appropriate image resolution: the typical length of a crop field side in the study area is tens of meters, and the boundaries between crop fields are narrow. Among the 11 input feature bands, 5 were at 10-m resolution and 6 were resampled to 10 m from 20 m. A mixed pixel problem therefore still exists for small land parcels such as boundaries between croplands; image segmentation on high-resolution images as supplementary data is a possible way to alleviate this problem. The second factor is the continuity of the time series data. Optical remote sensing data were critically confined by weather conditions in the study area: Sentinel-2 images of only 11 dates (less than a quarter of the sensed images) from April to November were qualified for generating cloud-free data. With more complete time series images, the RNN classifier would yield more reliable results. This is a common problem for many studies using optical remote sensing data. The third factor is feature selection. Multi-temporal profile analysis of paddy rice widely utilizes vegetation indexes (VIs) and water indexes (WIs) [19,44,45]. Some evident characteristics can be observed with VIs or WIs at certain phases of paddy rice growth. For instance, in the heading phase, the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) show an increase-then-decrease variation with a peak, while the land surface water index (LSWI) shows a jump in the transplanting phase due to water inundation [46]. For the deep learning classifier in this study, we avoided VIs and WIs as input features for the following reason: the range of VIs and WIs differs from that of the spectral bands, and if they are rescaled to the range of spectral reflectance, the principle of the indexes is no longer valid. On the other hand, neural networks learn an implicit representation of the raw input data, which is analogous to manually created features such as VIs and WIs.
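For reference, the indexes mentioned above are simple band combinations. A sketch using Sentinel-2 reflectance bands (B8 as NIR, B4 as red, B2 as blue, B11 as SWIR is the standard band mapping, though the paper does not state which bands it would use) is:

```python
def ndvi(nir, red):
    """Normalized difference vegetation index."""
    return (nir - red) / (nir + red)

def evi(nir, red, blue):
    """Enhanced vegetation index (standard coefficients)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def lswi(nir, swir):
    """Land surface water index; jumps when fields are flooded."""
    return (nir - swir) / (nir + swir)
```

NDVI and LSWI are bounded in [−1, 1] while EVI is not, which illustrates the range mismatch with reflectance bands described above.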

Flooding signals at the early growing phases of paddy rice are key features for differentiating it from other crop types. Before transplanting, radar backscatter is low due to the flooded fields; backscatter then increases during the vegetative phase, reaching a maximum at the heading stage [47]. We utilized the SAR backscatter coefficient as an input to the LSTM classifier. However, the power of SAR data as an input feature to identify rice fields was restrained by the number of valid optical images, so only a small portion of the SAR images was used.

Most studies on rice mapping using multi-temporal data ignored planting schedule differences. Inputting all multi-temporal features as a whole into shallow machine learning classifiers, such as SVM and random forest, could impair the classifier's predictive ability. RNNs have strong adaptability to avoid this problem through learned memory control. Accordingly, many land cover classification studies using deep learning classifiers achieved better results compared to shallow classifiers. In the meantime, a larger number of training samples were required for the training of deep learning models. The proposed labeling method was especially designed for paddy rice, whose radar signal pattern through the growing season is unique. The experiment was conducted for a single-cropping regime in East China. However, the DTW-labeling strategy is valid for the multiple-cropping regime and other regions of the world, since it only relies on all-weather SAR remote sensing data. If applied to a multiple-cropping regime, each cropping regime should be categorized as an independent land cover type. For instance, single/double/triple-cropping rice fields should be separate land cover types. With enough labeled samples, state-of-the-art deep learning classifiers (non-sequential or sequential as in this study) could be chosen to adapt to a specific rice-growing region in the world.
