1. Introduction
Maize is one of the most important crops grown throughout the world, with the United States, China, and Brazil being the top three maize-producing countries [1]. It is an important staple food for more than two billion people and a valuable source material for the production of ethanol, animal feed, biofuel, and other products, such as starch and syrup [2,3]. With population growth, the demand for maize is rapidly increasing, so monitoring maize growth status has attracted much attention. The leaf area index (LAI) and biomass are important indicators of maize growth, reflecting the effects of nutritional deficiencies, pests and diseases, droughts and floods, etc.; therefore, their accurate estimation can assist in the monitoring of maize growth status to guide field management and early yield estimation [4,5,6]. Traditional estimation methods rely mainly on field sampling and manual measurement, which are time-consuming, labor-intensive, and prone to errors due to subjective factors. Over the last 60 years, however, satellite remote sensing technology has developed rapidly, providing ever more data at high spatial and temporal resolutions, and it is now possible to use remote sensing data to obtain dynamic estimates of the LAI and biomass rapidly, accurately, and on a large scale [7]. As a consequence, increasing effort is being devoted to research on LAI and biomass estimation based on remote sensing inversion methods [8,9,10,11,12,13].
Remote sensing inversion methods can be classified into two categories: physical models and statistical models [14,15,16]. Physical models are commonly used for LAI and biomass estimation and provide an effective representation of the relationship between biophysical parameters and remote sensing data. However, physical models rely on prior knowledge for their input parameters and on idealized assumptions regarding the crop canopy; therefore, they are of limited accuracy. In addition, it is difficult to estimate the LAI and biomass by applying physical models to optical and synthetic aperture radar (SAR) data together, since the different imaging mechanisms involved in the acquisition of optical and SAR data mean that the corresponding physical models are very different. Therefore, statistical methods play a vital role in the estimation of the LAI and biomass. Traditional machine learning methods, such as multiple linear regression (MLR), support vector regression (SVR), and random forest regression (RFR), are widely used in LAI and biomass estimation and have achieved very good results [17,18]. Nevertheless, all of these methods rely on empirical formulas and hand-crafted features, such as the normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI), which limit the capabilities of the corresponding inversion models. Deep neural networks, also known as deep learning, have demonstrated great superiority in the automated extraction of deep features and in the approximation of complicated nonlinear relationships, and they have therefore attracted much attention in recent years. They have achieved great success in many remote sensing tasks [19,20,21,22,23,24], such as classification [25], image preprocessing [26], object detection [27,28], and scene understanding [29,30]. However, their application to LAI and biomass estimation has been hindered by the limited amounts of in situ data that are available [24]. Owing to the limited efficiency of in situ data collection, it is very expensive to obtain the massive amounts of data that deep neural networks, as data-driven models, require for the training of their multitudes of parameters. Consequently, finding a way for deep neural networks to estimate the maize LAI and biomass with limited in situ data is a necessary, albeit challenging, task.
Data augmentation is a machine learning technique used to increase the size and diversity of training datasets, thereby enhancing the generalization abilities of models. Widely used data augmentation methods in image processing include horizontal/vertical flipping, rotation, scaling, clipping, translation, contrast adjustment, color disturbance, and the addition of noise. However, these methods change the position or color of the original pixels and, therefore, can only be applied to scenes that are not sensitive to target correspondence or local color changes (such as image classification [31] and target recognition [32]); they are not suitable for quantitative remote sensing inversion. Changes in contrast and color vibrance and the addition of noise alter the one-to-one mappings between the different bands and the LAI and biomass, while rotation and translation eliminate the correspondence between the remote sensing data and the measured LAI and biomass, leading to what we call inconsistency between source and target (IST). The data mixing-up method (mixup) [33] is a novel method that extends datasets based on interpolation and has achieved excellent results in the field of image classification. By mixing training data, it greatly improves the generalization ability of a model and its robustness to adversarial attacks. This paper adapts the mixup method to the estimation of the maize LAI and biomass with a number of innovative improvements. Instead of using the mixed-up data directly, we predict the interpolation coefficient through a deep neural model, thereby mitigating the interpolation error that arises in the original mixup method. We call our method mixup.
With the help of mixup, we are able to leverage a deep neural network to estimate the maize LAI and biomass with limited in situ data. In this paper, we propose a novel deep neural network, the gated Siamese deep neural network (GSDNN), which integrates optical and SAR data through a gating mechanism in order to estimate the maize LAI and biomass. Considering the respective advantages and disadvantages of optical data and SAR data, our approach is to integrate these data for maize LAI and biomass estimation [17]. Specifically, optical data can provide rich spectral information, but it is difficult to use them to reflect the contributions of leaves within the maize canopy. When the leaf density of the canopy is high, electromagnetic waves in the visible and near-infrared bands mainly interact with the middle and upper layers of the canopy, which leads to saturation of the reflectivity and vegetation indexes derived from the optical data. For example, the most commonly used vegetation index, the NDVI, is sensitive to low LAI values (≤3), but it reaches saturation for medium or high LAI values (>3) [34]. Similarly, when the biomass is at a medium to high level (>2 kg/m²), the NDVI also exhibits saturation [35]. It has also been found that SAR data have quite good penetration and contain much more three-dimensional structural information about the maize canopy, but radar remote sensing data are easily affected by scattering and attenuation by the canopy, resulting in limited inversion accuracy for the growth parameters. The backscatter coefficient and polarization decomposition parameters extracted from SAR data can help to alleviate the saturation phenomenon in LAI and biomass inversion [18,36], but they are easily affected by soil background and topographic factors, resulting in errors in LAI and biomass inversion [37]. Therefore, the integration of optical and SAR data for the estimation of the maize LAI and biomass is essential, and the key aspect of this process is the establishment of an effective and deep integration mechanism. In this paper, the GSDNN is proposed as a way to achieve the above goals by leveraging a gating mechanism [38] and a Siamese architecture [39], thereby establishing multiple information interaction channels between the two branches of the neural network corresponding to the optical and radar remote sensing data, respectively, and enabling relatively exact and deep information interactions during optical and radar data fusion. At the same time, the use of a gating mechanism helps to increase the depth of the deep neural network and avoid the gradient vanishing and gradient explosion problems [40].
Overall, in this paper, we focus on using a deep neural network to integrate optical and SAR data to retrieve the maize LAI and biomass. Our objectives are (1) to adapt the mixup method to maize LAI and biomass estimation tasks, (2) to propose a novel deep neural network that integrates optical and SAR data to estimate the maize LAI and biomass, and (3) to study the effect of the number of augmented samples on the model accuracy.
3. Methods
In this paper, our aim is to enable a deep neural network to estimate the maize LAI and biomass with limited in situ data. Deep neural networks are data-driven and, therefore, are difficult to apply directly to scenarios such as LAI and biomass estimation, where there is a shortage of training data. First, we attempt to augment the training data through an improved version of the mixup method. We then propose a novel gated Siamese deep neural network (GSDNN) to leverage both SAR and optical data to improve the accuracy of LAI and biomass estimation. The proposed GSDNN can effectively extract deep features from SAR and optical data through its use of a Siamese architecture and learn to fuse them to yield better estimation accuracy of the LAI and biomass through a gating mechanism.
3.1. From Mixup to Mixup
The mixup method is a data augmentation method that was first proposed for image classification in machine learning [33]. It augments data by incorporating the prior knowledge that linear interpolations of feature representations should lead to the same interpolations of the associated targets. In mixup, virtual feature–target training examples are constructed based on convex combinations of pairs of examples and their labels sampled from the mixup vicinal distribution as follows:

x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j,  (1)

where λ ∼ Beta(α, α), for α ∈ (0, ∞). The hyper-parameter α controls the interpolation coefficient λ, which means that the strength of interpolation between feature–target pairs is controlled by α. The function f can then be learned by minimizing the following expression:

(1/m) Σ_{k=1}^{m} ℓ(f(x̃_k), ỹ_k).

This is known as the empirical vicinal risk (EVR) principle [42]. Previous studies have shown that mixup is simple but quite effective in image classification.
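To make the interpolation in Equation (1) concrete, it can be sketched as follows, assuming NumPy feature vectors and scalar regression targets (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Build one virtual feature-target example as a convex combination
    of two real examples, with lambda drawn from Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x_i + (1.0 - lam) * x_j
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix, lam
```

A small alpha (e.g., 0.2) pushes lambda toward 0 or 1, so most virtual samples stay close to one of the two real samples.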
Motivated by mixup, we tried to adapt it for use in maize LAI and biomass estimation instead of image classification; we term this adaptation mixup. Unlike the applications considered in previous studies, maize LAI and biomass estimation is a regression task and, therefore, much more sensitive to the problem of inconsistency between source and target (IST) caused by data augmentation. As a linear interpolation method, mixup still suffers from the IST problem when the training data are nonlinear. As has been shown in previous studies, the values of the LAI and biomass increase gradually with the nonlinear growth of maize. Although the use of a smaller α can alleviate the IST problem, this reduces the diversity of the virtual data and increases the risk of a large neural network memorizing the limited data, leading to serious overfitting of the model.
To solve this problem, we propose a training method in which the interpolation coefficient λ is predicted from the mixed sample, as given in Equation (2), where y_i and y_j are the in situ LAI or biomass values corresponding to the synthetic ỹ. This provides the model with decoupling ability. Through analysis of the components of the interpolated data, the prediction ability of the model is greatly improved, and the problem of inconsistency between source and target is greatly alleviated. It can be seen from Equation (3) that the loss function is weighted according to the distance between the two interpolated samples. When this distance increases, the weight of the virtual samples constructed from the interpolated samples gradually becomes weaker, and thus, the inconsistency between source and target is reduced. In this paper, we call the method represented by Equations (2) and (3) mixup.
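The paper's Equation (3) weights each virtual sample by the distance between the two interpolated targets; the exact weighting function is not reproduced here. As one illustrative stand-in (our assumption, not the paper's formula), an exponential decay produces the described behavior:

```python
import numpy as np

def weighted_virtual_loss(pred, y_mix, y_i, y_j, tau=1.0):
    """Squared error on a virtual sample, down-weighted as the two
    interpolated in situ targets y_i and y_j move apart. The exponential
    weight is an illustrative stand-in for the paper's Equation (3)."""
    w = np.exp(-abs(y_i - y_j) / tau)  # weight shrinks with target distance
    return w * (pred - y_mix) ** 2
```

When y_i equals y_j, the weight is 1 and the loss reduces to a plain squared error; virtual samples built from distant pairs contribute less, which limits the IST problem.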
3.2. Deep Neural Network for LAI and Biomass Estimation: GSDNN
Figure 2 presents the architecture of our proposed GSDNN. This model consists mainly of two types of modules: a fusion layer and a regression layer. The former is used for deep fusion of optical and SAR data, and the latter is used to perform quantitative inversion of the maize LAI and biomass. First, a gating mechanism and a Siamese architecture are used to realize the effective fusion of optical and SAR data. Then, multitask learning is used to obtain the maize LAI, Biomass_wet, and Biomass_dry simultaneously, which helps to overcome the overfitting problem and improve model accuracy.
3.2.1. Fusion Layer
The fusion layer consists of two main parts: (1) a gated control layer (GCL), which is used to extract the complementary effective information of each channel and reduce mutual interference, and (2) a full connection layer (FCL), which is used to realize nonlinear transformation of features and to map data from high to low dimensions or from low to high dimensions.
In the ith fusion layer, we denote the input data of the optical and SAR channels by x_i^o and x_i^s, respectively. In the GCL, the optical channel obtains complementary information from the SAR channel, and vice versa. Specifically, the gating mechanism is designed to select effective information and is defined in Equation (4), where the superscripts o and s indicate the optical and SAR channels, respectively, and σ denotes the activation function, for which ReLU is used in this paper. W^o, U^o, W^s, and U^s are the parameters of the gating mechanism and are learned with the stochastic gradient descent (SGD) algorithm. We then obtain the output of the GCL for the fusion layer as in Equation (5), where ⊙ denotes the Hadamard product. As shown in Equation (5), the information extracted from the optical data is selectively integrated into the SAR channel, thus enriching the information of the SAR channel, and vice versa. In this way, the GCL enables deep fusion of optical and SAR information.
The FCL in the fusion layer is a nonlinear transformation that increases the depth of the network and the fitting ability of the model. The output of the fusion layer is defined in Equation (6), where x_{i+1}^o and x_{i+1}^s denote the outputs of the optical and SAR channels, respectively, of the ith fusion layer; W_f^o and b_f^o are the parameters of the nonlinear transformation of the optical channel, and W_f^s and b_f^s are the parameters of the nonlinear transformation of the SAR channel.
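The gated information exchange described above can be sketched in NumPy as follows, assuming both channels have already been projected to a common dimension d; the exact equations and parameter shapes in the paper may differ, and all names here are ours:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fusion_layer(x_o, x_s, p):
    """One gated fusion step: each channel computes a gate from both
    channels, takes the other channel's features through that gate
    (Hadamard product), and then applies a fully connected layer."""
    g_o = relu(p["Wo"] @ x_o + p["Uo"] @ x_s)  # gate for the optical channel
    g_s = relu(p["Ws"] @ x_s + p["Us"] @ x_o)  # gate for the SAR channel
    c_o = x_o + g_o * x_s                      # SAR info gated into optical
    c_s = x_s + g_s * x_o                      # optical info gated into SAR
    y_o = relu(p["Wfo"] @ c_o + p["bfo"])      # FCL, optical channel
    y_s = relu(p["Wfs"] @ c_s + p["bfs"])      # FCL, SAR channel
    return y_o, y_s
```

Stacking three such layers, as in the reported configuration, keeps two parallel 300-dimensional streams that repeatedly exchange gated information before the regression layer concatenates them.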
3.2.2. Regression Layer
The regression layer is mainly used to explore the relationship between the deep features obtained from the fusion layer and the maize LAI and biomass. Considering the differences between the LAI and biomass, multitask learning is used to obtain the maize LAI, Biomass_wet, and Biomass_dry simultaneously, which helps to overcome the overfitting problem and improve model accuracy [43,44].
First, the regression layer concatenates the optical and SAR information from the fusion layers, as in Equation (7), where l denotes the index of the last fusion layer. Then, an FCL is used to perform dimension reduction and optimization, as in Equation (8), where FCL(·) denotes a fully connected layer. All of the FCLs considered in this paper have the same structure, namely,

y = σ(Wx + b),  (9)

where x and y denote the input and output, respectively, of the FCL, the parameters W and b are learned by training, and σ is the activation function, for which ReLU is used in this paper. All of the FCLs in Figure 2 have their own independent parameters. Finally, based on the fused optical and SAR deep features, an independent fully connected network is used to regress each maize parameter, as in Equation (10), whose outputs are the model's estimates of the LAI, Biomass_wet, and Biomass_dry.
3.2.3. Timestamp Embedding
Given that crop growth parameters exhibit a regular trend of change over the growth period, this study incorporates time information at half-month (15-day) granularity to enhance the inversion accuracy of the model. Because of the discreteness and large ranges of variation of the remote sensing data and imaging dates, simply taking the imaging date as a feature would reduce the generalization ability of the model. To deal with this problem, inspired by the word vector technique in artificial intelligence (AI), we propose the use of timestamp embedding to encode the time information. In detail, one year is divided into 25 time groups with 15 days as a time stage, and each group is represented by an n-dimensional vector.
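A minimal sketch of this grouping and lookup follows (our illustrative implementation; in the GSDNN, the embedding vectors are learned jointly with the rest of the network):

```python
import numpy as np

def time_group(day_of_year):
    """Map a day of year (1-366) to one of 25 half-month groups.
    25 groups of 15 days cover up to day 375, so the last group
    simply absorbs the final days of the year."""
    return min((day_of_year - 1) // 15, 24)

class TimestampEmbedding:
    """Lookup table of 25 n-dimensional vectors (n = 10 in this paper),
    randomly initialized here and, in practice, trained with the model."""
    def __init__(self, n=10, seed=0):
        self.table = np.random.default_rng(seed).standard_normal((25, n)) * 0.01

    def __call__(self, day_of_year):
        return self.table[time_group(day_of_year)]
```

Encoding the date as a learned vector, rather than as a raw scalar, lets the network place nearby growth stages close together in feature space.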
3.2.4. Objective Function
In this subsection, we discuss in detail the objective function of the GSDNN for maize LAI and biomass inversion. Here, we denote by y_LAI, y_wet, and y_dry the in situ values of the LAI, Biomass_wet, and Biomass_dry, and by x^o and x^s the optical and SAR data, so that a training sample can be expressed as (x^o, x^s, y_LAI, y_wet, y_dry). According to the minimum mean-square error (MMSE) criterion, the objective function based on multitask learning is given in Equation (11), where ŷ_LAI, ŷ_wet, and ŷ_dry are the estimated values of the GSDNN. MSE is the mean-square error, which is defined as

MSE(ŷ, y) = (1/N) Σ_{k=1}^{N} (ŷ_k − y_k)².  (12)
To obtain the final objective function based on the training examples from mixup, we combine Equations (3), (11), and (12) to obtain Equation (13), where f_LAI, f_wet, and f_dry denote the GSDNN models for the LAI, Biomass_wet, and Biomass_dry, respectively. In this paper, Equations (3) and (13) are used as the objective functions for the in situ data and the augmented data, respectively.
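A sketch of a multitask objective of the kind described by Equations (11) and (12), assuming an equally weighted sum of per-task MSEs (the paper may weight the three tasks differently):

```python
import numpy as np

def mse(pred, target):
    """Mean-square error, as in Equation (12)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def multitask_loss(preds, targets):
    """Sum of per-task MSEs for LAI, Biomass_wet, and Biomass_dry."""
    return sum(mse(preds[k], targets[k]) for k in ("lai", "wet", "dry"))
```

Sharing the fusion layers across the three regression heads while summing their losses is what lets the tasks regularize one another.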
3.2.5. Accuracy Assessment
To reduce the impact of data randomness, fivefold cross-validation was used to assess the accuracy of the LAI and biomass estimation models. First, the dataset was randomly divided into five parts, and each of these was used in turn as the test data. Then, 10% of the data from the remaining four parts were randomly selected as the validation set, with all of the remaining data used as the training set. The coefficient of determination (R²) and the root-mean-square error (RMSE) were used to assess the precision of the LAI and biomass estimation models. Each model was trained and tested five times, and the mean value over 100 runs was taken as the final result.
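The splitting scheme can be sketched as follows (our implementation of the described procedure, using only NumPy):

```python
import numpy as np

def five_fold_splits(n_samples, val_frac=0.10, seed=0):
    """Yield (train, val, test) index arrays for fivefold cross-validation:
    each fold serves once as the test set, and 10% of the remaining data
    is held out as the validation set used to decide when to stop training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 5)
    for k in range(5):
        test = folds[k]
        rest = rng.permutation(np.concatenate([folds[j] for j in range(5) if j != k]))
        n_val = max(1, round(val_frac * len(rest)))
        yield rest[n_val:], rest[:n_val], test
```

Each sample appears in exactly one of the three subsets per fold, so test data never leak into training or early stopping.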
3.3. GSDNN Workflow
Figure 3 shows the workflow of the GSDNN, which includes the following procedures:
3.4. Experimental Data Preparation
In this paper, 158 samples across four growth stages were used, with each sample consisting of in situ data (LAI and biomass) and the corresponding optical and SAR features. First, we selected six bands (B2–B4, B8, and B11–B12 for Sentinel-2 and B2–B8 for Landsat 8) and five vegetation indexes (NDVI, RVI, EVI, SAVI, and MSAVI) and extracted their GLCM texture features as the optical features (59 dimensions). Then, we selected two bands (VV and VH) and their ratio (VV/VH), the polarization decomposition parameters (H, A, and α), and GLCM texture features as the SAR features (22 dimensions). Finally, we combined the optical and SAR features as the input features, with the in situ LAI and biomass as their labels.
Considering the different ranges of the features and their labels, we normalized them as in Equation (14). When the GSDNN performs LAI and biomass inversion, the original values must be recovered from the normalized outputs; this is done by the inverse operation of Equation (14).
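Reading Equation (14) as standard min–max normalization (our assumption, since it is the usual choice for rescaling features with different ranges), the forward and inverse operations are:

```python
import numpy as np

def minmax_normalize(x, x_min, x_max):
    """Scale values to [0, 1] (assumed min-max form of Equation (14))."""
    return (x - x_min) / (x_max - x_min)

def minmax_inverse(x_norm, x_min, x_max):
    """Recover original-scale values from normalized model outputs."""
    return x_norm * (x_max - x_min) + x_min
```

The x_min and x_max statistics must come from the training data and be reused unchanged when inverting the model's predictions.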
In mixup, α was set to 0.2 [33]. Multiple random samplings were carried out to synthesize the new data during the mixup procedure, and the number of samplings determined the synthetic dataset size. Specifically, for each sampling, we randomly selected two samples from the above 158 samples and combined them as shown in Equation (1).
3.5. Settings and Training Details for the GSDNN
As shown in Figure 2, the GSDNN consists of two main parts: a fusion layer and a regression layer. In this paper, three fusion layers were used to fuse the optical and SAR features. The optical and SAR input features were 59-dimensional and 22-dimensional, respectively, and thus, the total input was 81-dimensional. The other parameters of each layer of the GSDNN were defined as follows:
Fusion layer: The hidden layer sizes of both the gate control layer and FCL were set to 300, the output dimension was 300, and the internal parameter sizes of the three fusion layers were consistent.
Regression layer: This layer took as input the concatenation of the outputs of the SAR and optical channels. In the present case, the size of the input layer for the regression layer was set to 600, the hidden layer sizes of the LAI, Biomass_wet, and Biomass_dry predictors were set to 300, and the final outputs were three scalars.
The timestamp embedding dimension n was set to 10.
For training, Adam [46] was used as the optimizer, with a learning rate of 0.0001. The batch size was set to 100. To reduce the impact of the randomness of training and data, 100 runs of fivefold cross-validation were conducted, and the mean and variance of the results of the 500 experiments in total were calculated and reported. Specifically, the data were randomly divided into five parts each time, and each part was used in turn as the test set. Because the stopping point of model training depended on convergence on the validation set, a random selection of 10% of the data from the remaining four parts was taken as the validation set, with all of the remaining data used as the training set. In this study, we used the PyTorch deep learning framework based on Python 3.6.8. We ran all experiments under the Ubuntu 18.04 operating system on a GeForce RTX 2080Ti GPU.
6. Conclusions
In this study, a novel method, the GSDNN + mixup, was proposed to integrate optical and SAR data for the estimation of the maize LAI and biomass from limited in situ data. We proposed a modified version of the mixup training method, called mixup, to deal with the problem of data shortage, and we found that the most appropriate amount of synthetic data from mixup was five times the amount of original data. The GSDNN proposed in this study realizes deep fusion of optical and SAR data, and its use of a gating mechanism and a Siamese architecture leads to more accurate estimation of the maize LAI and biomass. The GSDNN + mixup gives significantly more accurate estimates of the LAI and biomass than other machine learning methods, such as MLR, SVR, RFR, and MLP, with R² values of 0.71, 0.78, and 0.86 and RMSEs of 0.58, 871.83 g/m², and 150.76 g/m² for the LAI, Biomass_wet, and Biomass_dry, respectively (Table 3 and Table 4). This study of the GSDNN + mixup provides insights into how a deep neural network with integrated optical and SAR data can be used to estimate the maize LAI and biomass with limited data. Evaluation of the performance of the GSDNN + mixup for the estimation of other crop growth parameters and further exploration of novel methods to overcome in situ data shortages will be topics for our future work.