1. Introduction
Paddy rice, as a staple food, feeds almost half the world’s population [1]. It is important to obtain the spatial distribution of paddy rice in a timely and accurate fashion [2,3,4,5,6]. Mapping paddy rice is significant for understanding and evaluating regional, national, and global issues such as food security, climate change, disease transmission, and water resource utilization [6].
Southwest China is a key paddy rice planting area. Owing to its climate, the area receives abundant precipitation and is frequently cloudy and foggy. With the continuous development of satellite remote sensing technology, our ability to monitor and map paddy rice fields has improved. Time series of remote sensing data, such as MODIS and Landsat, have been widely used in paddy rice monitoring [7,8]. Higher spatial resolution data can improve the mapping accuracy of paddy rice [9].
With the development of time series images, phenology-based algorithms have been developed [8,9,10,11]. Paddy rice fields are a mixture of water and rice plants, which have unique physical characteristics during the flooding and transplanting stages. The relationship between the land surface water index (LSWI) and the normalized difference vegetation index (NDVI) or enhanced vegetation index (EVI) can be used to identify these features. At present, transplanting-based algorithms have been widely used for mapping paddy rice fields in many areas [10,11]. For example, Xiao et al. used a transplanting-based algorithm with MODIS data to generate paddy rice distribution maps in South China, South Asia, and Southeast Asia [8,11]. Zhang et al. used time series vegetation index data to map the paddy rice distribution in Northeast China in 2010 [12]. Dong et al. derived the dynamic changes of the paddy rice planting area in Northeast China from 1986 to 2010 using time series Landsat images and phenology-based algorithms [13].
However, optical satellite remote sensing data are affected by clouds and rain, and there are few effective images of southwest China. In the course of this research, we searched for valid images (cloud coverage < 60%) of Southwest China in 2020. The results show that Landsat 8 OLI (temporal resolution, 16 days) has about 12 such images and Sentinel-2 (temporal resolution, 5 days) about 20. There is a lack of sufficient time series data and appropriate methods for relevant research in Southwest China [14]. To improve image availability, it is necessary to remove clouds from images. However, cloud coverage in cloudy and rainy areas is frequent and persistent, and a single data source, even after cloud removal, cannot provide time series data sufficient for paddy rice monitoring. Therefore, combining a cloud removal method with a spatiotemporal fusion model may provide an effective approach for paddy rice mapping in cloudy and rainy areas. Removing thick clouds from an image is a necessary condition for improving its quality and usability [15]. Generally speaking, methods for removing thick clouds require cloudless images as auxiliary images to restore the spectral information occluded by clouds [16,17,18,19]. Melgani proposed a contextual multiple linear prediction (CMLP) method to reconstruct the spectral values of cloud areas in Landsat images [18]. Zhu et al. developed a modified neighborhood similar pixel interpolator (MNSPI) method to remove thick cloud contamination from satellite images over land [15]. This method can maintain the spatial continuity of the filled images even when the time interval between the auxiliary image date and the predicted image date is long [20]. Compared with CMLP, MNSPI restores image information better [15]; in particular, when the cloudy and auxiliary images are acquired in different seasons, the reflectance estimated by the MNSPI method is more accurate than the CMLP result.
Fusion of high-spatial/low-temporal resolution data with high-temporal/low-spatial resolution data (such as Sentinel-2 and MODIS) can generate high-temporal/high-spatial resolution data. In previous studies, many spatiotemporal fusion models have been proposed and applied, such as the spatial and temporal adaptive reflectance fusion model (STARFM), the enhanced STARFM, and the robust adaptive spatial temporal fusion model (RASTFM) [21,22,23]; however, these fusion algorithms have certain limitations [24]. First, the complexity of land cover changes in homogeneous or heterogeneous landscapes is not fully considered. Second, most spatiotemporal fusion algorithms require two or more low–high-spatial resolution image pairs as prior information. Addressing the limitations of the aforementioned spatiotemporal fusion methods, Zhu et al. proposed a flexible spatiotemporal data fusion (FSDAF) model [24]. Its goal is to more accurately predict high-resolution images of heterogeneous areas by capturing gradual and abrupt land cover type changes.
At present, spatiotemporal fusion models have demonstrated their capability in crop type detection [25,26,27,28]. For example, Zhu et al. used the STARFM model to fuse Landsat and MODIS images, combined with support vector machines (SVM), to classify crop types in 2017 [25]. A recent study used fused images of several key dates and a supervised random tree (RT) classifier to map paddy rice fields in Hunan, China [26]. Cai et al. used the RASTFM model to fuse MODIS and Sentinel-2 data, combined with the random forest (RF) method, to map paddy rice [27]. Yin et al. used the STARFM model to fuse Landsat and MODIS data and, on this basis, applied a phenology-based algorithm to map paddy rice fields [28].
In this study, we aimed to use a cloud removal interpolation method and a data fusion method to reconstruct remote sensing time series data, to map paddy rice areas by combining these with a phenology-based algorithm, and thus to provide a new method for mapping paddy rice in cloudy and rainy areas. The specific steps were as follows: (1) the MNSPI approach was used to remove clouds from and interpolate Sentinel-2 and Landsat 8 OLI images with 30–60% cloud coverage, and the quality assessment (QA) band of the MODIS product was used to remove clouds from MODIS images; (2) the FSDAF model was used to fuse Sentinel-2 and MODIS images to obtain Sentinel-2-like fusion images; (3) a high-spatiotemporal-resolution dataset of the study area was constructed from the three kinds of remote sensing data, and paddy rice was mapped with a phenology-based algorithm; (4) field survey data and official statistics were used to evaluate the accuracy of the final paddy rice map.
3. Method
The spatiotemporal data fusion and phenology-based paddy rice mapping methodology mainly involved the following steps (Figure 4): (1) removing thick clouds, i.e., for the Sentinel-2 and Landsat 8 OLI images, thick clouds were removed with the MNSPI approach and, for the MODIS images, clouds were removed with the QA band; (2) FSDAF prediction using the preprocessed images (Sentinel-2, MOD09GA); (3) mapping paddy rice using the phenology-based algorithm; and (4) accuracy assessment.
3.1. Removal of Thick Clouds
MNSPI is an approach proposed by Zhu et al. [15] to remove thick clouds based on a neighborhood similar pixel interpolator. Initially, this method was developed to address the strip loss problem of Landsat ETM+ [20]; following the idea of the original NSPI, it uses the information of neighboring similar pixels to restore the spectral values of cloudy pixels. However, clouds usually occur in large clusters, and to account for this difference in spatial pattern, the NSPI method needed to be modified.
The modified steps were as follows: First, two images are required: a cloudy image acquired on date t2 and an auxiliary image acquired on date t1, where the auxiliary image is cloud-free over the clouded part of the cloudy image. Second, extract the cloud mask from the cloudy image; the cloud mask can be extracted with the help of the QA band of the image, or a cloud mask layer can be established by visual interpretation. Third, a large window covering the cloud is established to find the closest pixels around the cloud that meet the spectral similarity criteria as similar pixels. Fourth, the spatial distance (D) and spectral distance (S) are both normalized. Finally, determine the weights and predict the results. The weight is determined by the relative spatial distances from the target pixel to its similar pixels and to the cloud center; the cloud center is calculated by averaging the coordinates of all cloud pixels. The value of the target pixel can then be predicted as:

L̂(x, y, t2, b) = W × L̂1(x, y, t2, b) + (1 − W) × L̂2(x, y, t2, b),  W = CD / (D + CD)

where L̂1(x, y, t2, b) is the prediction of the target pixel based on spectro-spatial information, L̂2(x, y, t2, b) is the prediction based on spectro-temporal information, D represents the average spatial distance between the target pixel and its similar pixels, and CD represents the spatial distance between the target pixel and the cloud center.
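The weighted combination of the two predictions can be sketched as follows. This is a minimal illustration of the distance-based blending described above, not the exact implementation of [15]; the function name and the weight form W = CD / (D + CD) are assumptions consistent with the description (similar pixels far away reduce the spatial weight; a target near the cloud center leans on the temporal prediction).

```python
def mnspi_combine(pred_spatial: float, pred_temporal: float,
                  d_similar: float, d_center: float) -> float:
    """Blend the spectro-spatial and spectro-temporal predictions.

    d_similar: average spatial distance from the target pixel to its
               similar pixels (D in the text).
    d_center:  spatial distance from the target pixel to the cloud
               center (CD in the text).
    """
    w = d_center / (d_similar + d_center)  # weight of the spatial prediction
    return w * pred_spatial + (1.0 - w) * pred_temporal
```

When the two distances are equal the result is a plain average; as the similar pixels move farther away, the temporal prediction dominates.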
In 2018, Zhu et al. [30] established an automatic system for interpolating all types of contaminated pixels in a time series of images by integrating NSPI and MNSPI into an iterative process. The input data of the system are a time series of images and the related cloud and cloud shadow masks, and the output is a time series of images without missing pixels caused by clouds, cloud shadows, and SLC-off gaps.
We constructed a simulated cloudy time series to test the effectiveness of the method. We selected real clear-sky images (Landsat 8 OLI, 2020-05-02, and Sentinel-2, 2020-04-26) and, for each land cover type (cropland, grassland, forest, water body, and built area), simulated three cloud patches of random size at random locations on each image. The cloud mask was extracted from the QA band that comes with the image product, and the accuracy of the cloud boundary was manually checked and corrected. A time series was then constructed using the images with the cloud patches and the other images; the cloud mask was kept the same. We input the two types of data into the reconstruction system, which is based on IDL code, and output the images after cloud removal. Finally, the predicted cloud-free pixels were compared with the real images to judge the effectiveness of the MNSPI method.
For the processing of the MOD09GA images, we used the state_1km quality assessment band to remove clouds. The state_1km band indicates the state of the reflectance data and is stored in binary form. We used Python code to define a cloud removal function according to the binary bit positions and realized batch cloud removal for MOD09GA.
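A bit-decoding function of this kind can be sketched as follows. The bit layout follows the MOD09 documentation (bits 0–1 of state_1km encode the cloud state, where 01 = cloudy and 10 = mixed, and bit 2 encodes cloud shadow); the function name and the exact set of flagged states are our illustrative choices, not necessarily those used in the study.

```python
import numpy as np

def mod09ga_cloud_mask(state_1km: np.ndarray) -> np.ndarray:
    """Return a boolean mask (True = contaminated) from the MOD09GA
    state_1km QA band. Bits 0-1 give the cloud state (01 = cloudy,
    10 = mixed); bit 2 gives cloud shadow."""
    state = state_1km.astype(np.uint16)
    cloud_state = state & 0b11          # bits 0-1
    cloud_shadow = (state >> 2) & 0b1   # bit 2
    return (cloud_state == 1) | (cloud_state == 2) | (cloud_shadow == 1)

# Usage: mask out contaminated pixels before further processing, e.g.
# reflectance = np.where(mod09ga_cloud_mask(qa), np.nan, reflectance)
```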
3.2. FSDAF
After using the MNSPI method to remove clouds from and interpolate the Sentinel-2 and Landsat 8 OLI images, a complete time series still cannot be constructed from these two kinds of data alone. We needed daily MODIS data to fill the gaps in the time series. The spatiotemporal fusion model FSDAF was used to fuse Sentinel-2 and MODIS data to generate high-temporal/high-spatial resolution data.
The FSDAF model uses the temporal prediction information obtained by mixed-pixel decomposition and the spatial prediction information obtained by thin plate spline (TPS) interpolation, and then uses a filter-based method to combine the two types of prediction information to obtain the final prediction [24]. The goal of this method is to predict the fine-resolution data of heterogeneous regions by capturing abrupt changes of land cover types. FSDAF includes six main steps [24]: (1) classify the fine-resolution image at t1; ISODATA classification was used in this study; (2) use the change of each class from t1 to t2 in the coarse-resolution images to estimate the fine-resolution image change from t1 to t2; (3) on the basis of the change information between t1 and t2 estimated in the previous step and the known fine-resolution image at t1, calculate the preliminary fusion result at t2; on this basis, considering the influence of factors such as land cover change between the two dates, calculate the residual of the fusion image in the temporal dimension; (4) use the TPS interpolator to downscale the coarse image at t2 and calculate the residual of the fusion image in the spatial dimension; (5) predict and distribute the residuals based on TPS; (6) design a weight function so that similar pixels in a local window filter the change information of the fine-resolution images, and add it to the known fine-resolution image at t1 to obtain the final fused image at t2.
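The TPS spatial-prediction step (step 4) can be illustrated with a small sketch: interpolate a coarse-resolution band to the fine grid through a thin plate spline fitted at the coarse-pixel centers. This uses SciPy's RBFInterpolator as a stand-in for the TPS interpolator described in [24]; the function name and grid convention are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_downscale(coarse: np.ndarray, scale: int) -> np.ndarray:
    """Thin-plate-spline interpolation of a 2-D coarse-resolution band
    onto the fine grid; `scale` is the coarse/fine resolution ratio."""
    rows, cols = coarse.shape
    # Coarse-pixel centres expressed in fine-pixel coordinates
    cy, cx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    centres = np.column_stack([(cy.ravel() + 0.5) * scale,
                               (cx.ravel() + 0.5) * scale])
    tps = RBFInterpolator(centres, coarse.ravel(), kernel="thin_plate_spline")
    # Evaluate the spline at every fine-pixel centre
    fy, fx = np.meshgrid(np.arange(rows * scale) + 0.5,
                         np.arange(cols * scale) + 0.5, indexing="ij")
    fine_pts = np.column_stack([fy.ravel(), fx.ravel()])
    return tps(fine_pts).reshape(rows * scale, cols * scale)
```

Because the TPS includes a linear polynomial term, smoothly varying surfaces (e.g., reflectance gradients) are reproduced well, while sharp class boundaries are left to the temporal prediction and residual distribution steps.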
We selected two time periods, when both MODIS and Sentinel-2 images were clear, to test the effectiveness of the FSDAF model. By searching the time series images, t1 = 2020-01-11 and t2 = 2020-04-26 were determined. The coarse-resolution image (MODIS) and fine-resolution image (Sentinel-2) at t1 were used as the input reference images to predict the fine-resolution image corresponding to the coarse-resolution image at t2. Finally, the fine-resolution image predicted at t2 was compared with the original fine-resolution image at t2 to obtain the accuracy of the method.
In this study, the MODIS images input into the FSDAF model were selected based on the dates of the existing images to minimize the time interval between base pairs. In addition, in order to obtain more precise and finer images, we also considered the time interval between the base image date (t1) and the predicted image date (t2). Finally, the images predicted by FSDAF were combined with the original Sentinel-2 and Landsat 8 OLI images to form a time series dataset ready for phenology-based paddy rice mapping.
3.3. Phenology-Based Paddy Rice Mapping Algorithm
3.3.1. Identification of Flooding Signal
A unique physical feature of paddy rice fields is that paddy rice plants grow on flooded soil [8,31]. The temporal dynamics of paddy rice fields show three main periods: (1) the flooding and transplanting period; (2) the growth period (tillering, heading, and maturity); and (3) the fallow period after harvest. Before transplanting, most paddy rice fields are water bodies. After transplanting, paddy rice grows in a flooded field; therefore, during this period, paddy rice fields are a mixture of paddy rice plants, soil, and water. The paddy rice then continues to grow until the canopy completely covers the fields. After harvest, the fields are bare soil or rice stubble [9].
The characteristics can be detected by using the relationship between the time series NDVI and LSWI. For each image, we calculated the NDVI and LSWI using the surface reflectance values of the red band (ρred), NIR band (ρNIR), and SWIR band (ρSWIR). These were calculated as follows:

NDVI = (ρNIR − ρred) / (ρNIR + ρred)
LSWI = (ρNIR − ρSWIR) / (ρNIR + ρSWIR)
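The two indices above map directly onto a per-pixel array computation; a minimal sketch (the epsilon guard against zero denominators over masked pixels is our addition):

```python
import numpy as np

def ndvi_lswi(red: np.ndarray, nir: np.ndarray, swir: np.ndarray):
    """Compute NDVI and LSWI from surface-reflectance bands.

    NDVI = (NIR - red)  / (NIR + red)
    LSWI = (NIR - SWIR) / (NIR + SWIR)
    """
    eps = 1e-10  # avoids division by zero on fully masked pixels
    ndvi = (nir - red) / (nir + red + eps)
    lswi = (nir - swir) / (nir + swir + eps)
    return ndvi, lswi
```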
We smoothed the NDVI and LSWI time series with a Savitzky–Golay filter [32]. All the curves in Figure 5 are the results after smoothing. During the flooding period, the paddy rice fields were a mixture of water and green paddy rice plants, and the LSWI value was greater than the NDVI value (Figure 5a, ~DOY100). After transplanting the paddy rice seedlings, the NDVI of the paddy rice field gradually increased, and the LSWI gradually decreased. At about 80–100 days after transplanting, most of the paddy rice fields were completely covered by the paddy rice canopy, and the LSWI value was lower than the NDVI value. The unique spectral feature of the flooding period (LSWI > NDVI) has been used as a spectral signal to extract paddy rice when analyzing time series satellite images [8,9,11]. In this study, we continued to explore algorithms that combine NDVI and LSWI. As shown in Figure 5a, the LSWI value was slightly higher than the NDVI value in the highlighted area, around DOY100. According to the phenological data above, this period was the transplanting flooding period. For the fused images, we used the following threshold within a specific time window (DOY100 to DOY110) to identify flooded pixels: LSWI + 0.1 ≥ NDVI. After a pixel is identified as having a transplanting signal, to ensure accuracy, subsequent changes in NDVI can be checked: paddy rice grows rapidly after transplanting, and NDVI reaches its peak (NDVI > 0.5) between DOY200 and DOY210.
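The smoothing and the two-stage test above can be sketched per pixel as follows. The thresholds (LSWI + 0.1 ≥ NDVI in DOY100–110, NDVI > 0.5 in DOY200–210) are those stated in the text; the Savitzky–Golay window length and polynomial order are illustrative assumptions, not parameters reported by the study.

```python
import numpy as np
from scipy.signal import savgol_filter

def flooding_signal(ndvi_ts, lswi_ts, doy,
                    window=7, polyorder=2,
                    flood_doy=(100, 110), peak_doy=(200, 210)) -> bool:
    """Return True if a pixel's smoothed time series shows the
    transplanting flood signal and a later NDVI peak."""
    ndvi = savgol_filter(np.asarray(ndvi_ts, float), window, polyorder)
    lswi = savgol_filter(np.asarray(lswi_ts, float), window, polyorder)
    doy = np.asarray(doy)
    in_flood = (doy >= flood_doy[0]) & (doy <= flood_doy[1])
    in_peak = (doy >= peak_doy[0]) & (doy <= peak_doy[1])
    flooded = bool(np.any(lswi[in_flood] + 0.1 >= ndvi[in_flood]))
    peaked = bool(np.any(ndvi[in_peak] > 0.5))
    return flooded and peaked
```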
3.3.2. Generating Other Land Cover Masks to Reduce Potential Impacts
Affected by factors such as atmospheric conditions and other land cover types with similar characteristics, the preliminary paddy rice map was inevitably contaminated by noise [28]. Therefore, according to the phenological characteristics of the other land cover types in the study area and the temporal distribution of the VIs (Figure 5) [8,9,11,28,33], masks were established to remove this noise and minimize its potential impact, as described below.
Crops in the non-flooded period (such as corn) and forest vegetation had LSWI < NDVI during the entire observation period, and their curves had similar trends (Figure 5b,c). In this study, pixels whose LSWI values were always less than their NDVI values during the growing season were classified as non-flooded crops or forest. The built area had LSWI < 0 throughout the plant growing season (Figure 5d), so pixels whose average LSWI was below 0 were classified as built area. Permanent water pixels showed a unique relationship between LSWI and NDVI, that is, LSWI > NDVI, as shown in Figure 5e. In this study, pixels with an average NDVI < 0.1 during the entire observation period were classified as permanent water bodies.
Finally, we combined the thresholds above with the existing land cover type map to build a pixel-based paddy rice map.
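The masking rules above reduce to simple boolean tests along the time axis; a minimal sketch (the array layout, with time on axis 0 and pixels on axis 1, is our assumption for illustration):

```python
import numpy as np

def land_cover_masks(ndvi_ts: np.ndarray, lswi_ts: np.ndarray):
    """Boolean masks for the noise classes described in the text.

    ndvi_ts, lswi_ts: arrays of shape (time, pixels).
    Returns (non_flooded_veg, built, water) masks over pixels.
    """
    non_flooded_veg = np.all(lswi_ts < ndvi_ts, axis=0)    # crops/forest
    built = np.mean(lswi_ts, axis=0) < 0.0                 # built area
    water = np.mean(ndvi_ts, axis=0) < 0.1                 # permanent water
    return non_flooded_veg, built, water
```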
3.4. Accuracy Assessment
3.4.1. Evaluation of the MNSPI Approach and the FSDAF Model
For the accuracy evaluation of the MNSPI method, we compared the cloud-removed images, after randomly setting cloud patches, with the original images [15]. For the Sentinel-2 image of 26 April and the Landsat 8 OLI image of 2 May, we randomly designed three cloud patches for each land cover type in each image to test the accuracy of the method for different land cover types. After the cloud removal process, we compared the reflectance values of the restored pixels with those of the original images within the cloud patches.
For the accuracy evaluation of the FSDAF model, we compared the image predicted by FSDAF with the original image [24]. We used the Sentinel-2 image from 12 January and the MODIS images from 12 January and 26 April to predict a Sentinel-2 image for 26 April, and then compared the predicted Sentinel-2 image of 26 April with the original Sentinel-2 image for evaluation.
The correlation coefficient (r) and root mean square error (RMSE) between the predicted and true reflectance (blue, green, red, near-infrared, and SWIR bands, as well as NDVI and LSWI) were used to quantitatively evaluate the accuracy [24]; r shows the linear relationship between the predicted and actual reflectance: the closer r is to one, the higher the consistency between the fused image and the original values. RMSE measures the difference between the fused image and the original image: the smaller the value, the smaller the difference and the better the prediction. In addition to reflectance, the vegetation indices were compared accordingly. The RMSE of band b is calculated as follows:

RMSE_b = sqrt( (1 / (n × m)) × Σ_{i=1}^{n×m} (F_{i,b} − G_{i,b})² )

where F and G represent the fused image and the original image, respectively; b is the band number; B is the total number of bands (b = 1, …, B); and n × m is the image size.
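The two metrics above can be computed per band as follows; a minimal sketch in which the function name is our illustrative choice:

```python
import numpy as np

def evaluate_band(pred: np.ndarray, true: np.ndarray):
    """Correlation coefficient (r) and RMSE between a predicted band F
    and the original band G, evaluated over all n x m pixels."""
    p, t = pred.ravel(), true.ravel()
    r = np.corrcoef(p, t)[0, 1]              # Pearson correlation
    rmse = np.sqrt(np.mean((p - t) ** 2))    # root mean square error
    return r, rmse
```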
3.4.2. Accuracy Assessment of Paddy Rice Map
There are two ways to verify the paddy rice map. The first is to use field survey points for accuracy verification and calculate the overall accuracy and the Kappa coefficient. The second is to calculate the consistency of the paddy rice planting area with the official statistics.
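The first verification route, overall accuracy and the Kappa coefficient from the field survey points, can be sketched from a confusion matrix as follows (the row/column convention is our assumption for illustration):

```python
import numpy as np

def oa_kappa(confusion: np.ndarray):
    """Overall accuracy and Cohen's Kappa from a confusion matrix
    (rows = reference classes, columns = mapped classes)."""
    n = confusion.sum()
    po = np.trace(confusion) / n                         # observed agreement
    pe = (confusion.sum(0) @ confusion.sum(1)) / n ** 2  # chance agreement
    return po, (po - pe) / (1 - pe)
```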
5. Discussion
It is difficult to obtain sufficient clear optical images with high spatiotemporal resolution in cloudy and rainy areas. Therefore, the ultimate goal of this research was to provide an operational method for paddy rice mapping that alleviates or avoids these data limitations to a certain extent. In this study, the MNSPI approach and the FSDAF model were used to obtain high-precision time series data, forming a new method in combination with a phenology-based algorithm. The results show that this method performs well: the MNSPI approach and the FSDAF model both had high accuracy, and the reflectance of the predicted images was consistent with the original images. The accuracy verification of the final paddy rice map was also good, with good agreement with the official data. In addition, the two models have a high degree of automation; users only need to prepare the basic input data and modify a few parameters to run the programs [15]. On the whole, the method used in this study is well suited to cloudy and rainy areas.
The method proposed in this study still has certain limitations. First, in the process of applying the MNSPI approach, as the cloud cluster size increases, the accuracy of the restored image decreases slightly. This is because the method uses cloud neighborhood information: the larger the cloud, the farther the target pixel is from its similar pixels, so the correlation, and therefore the prediction accuracy, may decrease. Furthermore, in areas with frequent and persistent clouds, it is difficult to reconstruct image time series with acceptable accuracy because clouds completely cover many images [15]. Second, generally speaking, the prediction quality of Sentinel-2-like images is affected by the time interval between the base image date (t1) and the predicted image date (t2): the longer the interval, the worse the quality of the FSDAF-predicted image [24]. In our study, the time interval was kept within 90 days. In addition, the accuracy of the input images of the FSDAF model also affects the prediction results; after cloud removal, the number of effective pixels in the input image affects the prediction accuracy of the model to a certain extent. A problem common to the two methods is that both involve similar-pixel searches, and therefore a long computation time may be required when processing massive images over large areas. Hence, when rebuilding time series over a large area and a long period, it is recommended to use a high-performance computer or cloud platforms, for example, PIE Engine or Google Earth Engine, to increase the computation speed [36]. Third, we used a pixel-based method for classification. The land cover in the study area is relatively fragmented, and as a result, the “salt and pepper” effect inevitably appears. As can be seen from the example in Figure 10, within a small area, features of the same type were classified into different categories, and originally uniform plots were divided into many blocks. There was also the problem of mixed pixels: during field collection, it was found that some paddy rice fields were too small and formed mixed pixels with other land covers (such as corn) [26,37]. Fourth, owing to differing phenological periods, the method proposed in this study may not be directly applicable to other regions; when applying it elsewhere, the threshold values of the paddy rice detection indicators must be carefully reviewed and selected.
Some studies have shown that, because SAR data are not affected by weather conditions, they have good application prospects in cloudy and foggy areas [38]. Studies have also proven that the combination of optical and SAR data can improve the extraction accuracy of land cover information in areas with severe heterogeneity [39,40]. In addition, object-based image analysis (OBIA) technology continues to develop and has been successfully applied in many fields; it can reduce the occurrence of the salt and pepper phenomenon. For high-spatial resolution remote sensing images, especially for mapping fragmented paddy rice fields, object-based classification methods are more advantageous than pixel-based methods [26,37]. Therefore, subsequent paddy rice mapping in cloudy and rainy areas could consider fusing SAR and optical data and using object-based methods to improve the accuracy of the results.