1. Introduction
Gross primary production (GPP) represents the amount of organic matter and energy fixed through photosynthesis per unit time and area in terrestrial ecosystems [1]. The accumulation of GPP in ecosystems is a process through which atmospheric carbon dioxide is fixed by plants to form organic carbon [2]. GPP is a direct basis for reflecting the productivity and carbon reserves of terrestrial ecosystems [3] and is also a key factor in achieving global carbon balance [4]. Data from the Food and Agriculture Organization (FAO) show that the global cropland area in 2020 was 1576 million hectares, accounting for about 12.09% of the world’s total land area [5]. This proportion is expected to increase in the future to meet the food demand of a growing population. Among terrestrial ecosystems, cropland accounts for 9.4% of the global total GPP [6], and maize contributes the largest share (14.9% of global cropland GPP) [7]. Therefore, the accurate estimation of daily maize GPP plays a significant role in evaluating the global carbon cycle.
Several GPP estimation models have been developed, and they can be divided into three categories: vegetation index (VI)-based models, process-based models, and light use efficiency (LUE) models. VI-based models employ a purely statistical approach to estimate GPP. For example, Sims et al. [8] utilized variables such as land surface temperature (LST) and the enhanced vegetation index (EVI). Nonetheless, models based on statistical relationships between variables and GPP may not transfer well to varying conditions, as a function developed for one site may not be applicable to another [9]. Process-based models integrate soil, vegetation, and the atmosphere to dynamically simulate the physiological processes of plants [10]. They are distinguished by their detailed representation of the mechanisms of vegetation growth and estimate GPP mechanistically. However, owing to the scarcity and limited quality of available vegetation parameters and the intricate nature of the model processes, it is difficult to generalize process-based models. LUE models use the maximum LUE (LUEmax) [11] for GPP calculation and consider the effects of environmental conditions such as water, temperature, and phenology on vegetation photosynthesis [12,13,14]. However, LUE models rely heavily on environmental factors. Water-related variables such as vapor pressure deficit (VPD) cannot adequately characterize the effects of water availability on vegetation production [15]. In addition, using VIs (e.g., the normalized difference vegetation index (NDVI) [16]) as a proxy for the fraction of absorbed photosynthetically active radiation (FPAR) also introduces errors into the estimation of GPP [15]. In summary, each of these methods has its own limitations and faces challenges in GPP estimation. Therefore, there is a crucial need to identify an efficient method for estimating GPP.
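To make the structure of LUE models concrete, the following is a minimal sketch of the generic multiplicative form, GPP = PAR × FPAR × LUEmax × f(T) × f(W). It is not the formulation of any specific model cited above: the function name, the two scalar arguments, the units, and the numbers in the usage example are illustrative assumptions.

```python
def lue_gpp(par, fpar, lue_max, temp_scalar, water_scalar):
    """Generic light use efficiency estimate of daily GPP (hypothetical helper).

    par          : incident PAR, assumed in MJ m^-2 day^-1
    fpar         : fraction of PAR absorbed by the canopy, in [0, 1]
    lue_max      : maximum light use efficiency, assumed in gC MJ^-1
    temp_scalar  : temperature down-regulation factor, in [0, 1]
    water_scalar : water down-regulation factor, in [0, 1]
    """
    for name, value in (("fpar", fpar),
                        ("temp_scalar", temp_scalar),
                        ("water_scalar", water_scalar)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Environmental scalars can only reduce the efficiency below lue_max.
    return par * fpar * lue_max * temp_scalar * water_scalar


# Illustrative numbers only; they are not taken from the study.
gpp = lue_gpp(par=10.0, fpar=0.8, lue_max=2.5, temp_scalar=0.9, water_scalar=1.0)
print(f"GPP estimate: {gpp:.2f} gC m^-2 day^-1")
```

The sketch shows why errors in FPAR (for example, from an NDVI proxy) or in the water scalar propagate directly into the GPP estimate: every term enters as a multiplicative factor.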
Crop growth is affected by the growing environment (e.g., air temperature, soil properties, and field management) and controlled by plant phenology [17,18]. Spatial and temporal heterogeneity means that these factors interact and collectively affect crop production. The changing factors that affect crop productivity often involve nonlinear processes [19,20]. Additionally, traditional methods fall short in supporting the development of modern agriculture, which requires abundant data and robust algorithms [21]. Consequently, machine learning (ML) has gained popularity. ML methods do not explicitly represent the intricate physical processes of crops; instead, they build effective relationships between simple inputs and outputs and reconstruct knowledge frameworks [22]. They can effectively model complex processes using extensive field data [23]. Several popular ML methods, such as decision tree (DT), random forest (RF), artificial neural network (ANN), and support vector machine (SVM), have demonstrated effectiveness in estimating ecosystem productivity [24,25,26,27]. Yet, existing studies mainly use topography, vegetation indices, and meteorological data as model inputs for GPP estimation. They ignore the process of GPP synthesis and do not account for the influence of plant physiological activity [28].
Ecosystems constantly adjust plant growth to cope with the changing environment, causing seasonal variations and the formation of transitional periods known as phenology [29]. Changes such as earlier leaf growth and delayed crop activity could affect the seasonal climate and CO₂ absorption [30]. Thus, phenology greatly affects ecosystem productivity [31] and is vital for carbon fixation and photosynthesis. At the leaf scale of a crop, chlorophyll (Chl) content per unit leaf area is closely related to the photosynthetic rate [32,33,34]. A previous study established a close link between GPP and Chl [35]. However, obtaining a substantial amount of observed chlorophyll data is challenging [36]. Owing to the close correlation between leaf Chl content and leaf nitrogen content per unit leaf area (i.e., specific leaf nitrogen (SLN)) [37,38], the leaf photosynthetic rate is also strongly associated with SLN [39,40,41]. SLN changes across phenological stages [42], and the maximum allowable SLN is a function of the phenological stage [43,44,45]. Therefore, it is worthwhile to consider the maximum allowable SLN and plant phenology as factors potentially affecting the leaf photosynthetic rate in ML methods. Integrating SLN and phenology is expected to enhance the ML-based estimation of maize GPP from a physiological perspective.
In this study, widely used meteorological data (i.e., solar shortwave radiation (SSR) and air temperature (Tair)) and a satellite vegetation index (i.e., NDVI) were selected to compose the control group of the input combination. The vegetation index is highly correlated with GPP [46], GPP is directly controlled by SSR [47], and Tair affects the carbon absorption of vegetation [48]. For comparison, input combinations that include SLN and maize phenology (represented by the normalized maize phenology (NMP)), either alone or together, were created. The purpose of this study is first to verify, via the RF method, the importance of SLN and NMP in improving maize GPP estimation with different input combinations. Subsequently, the optimal input combination including SLN or NMP is applied to validate and compare the performance of three other ML methods, namely the SVM, convolutional neural network (CNN), and extreme learning machine (ELM) methods. This study attempts to determine the importance of SLN and NMP in improving GPP estimation via ML methods and to provide a reference for similar research.
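The experimental design described above (a control group of conventional inputs, extended by SLN, by NMP, or by both) can be enumerated programmatically. The sketch below is only an illustration: the combination labels are hypothetical and do not reproduce the study's own naming, and the control group assumes that air temperature is supplied as the three variables Tmean, Tmin, and Tmax listed in the Conclusions.

```python
from itertools import combinations

# Control-group inputs; the split of Tair into three variables is an assumption
# based on the five traditional inputs named in the Conclusions.
CONTROL_INPUTS = ("NDVI", "SSR", "Tmean", "Tmin", "Tmax")
PHYSIOLOGICAL_INPUTS = ("SLN", "NMP")


def input_combinations(control=CONTROL_INPUTS, extras=PHYSIOLOGICAL_INPUTS):
    """Return a dict mapping a hypothetical label to a tuple of input names."""
    combos = {}
    for size in range(len(extras) + 1):
        for subset in combinations(extras, size):
            label = "+".join(("control",) + subset)
            combos[label] = control + subset
    return combos


for label, inputs in input_combinations().items():
    print(f"{label:18s} -> {', '.join(inputs)}")
```

Each resulting tuple would serve as the feature list for one training run, so that any change in accuracy can be attributed to the added physiological variable(s).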
4. Discussion
In this study, four ML methods were used to predict the maize GPP of three sites in Nebraska (i.e., the NE1, NE2, and NE3 sites) and one site in Minnesota (i.e., the RO1 site). Previous ML models simply took processed meteorological data and remote sensing data as model inputs, without considering the influences of phenology and leaf physiology on photosynthesis. The novelty of this study lies in integrating maize phenology (represented by NMP) and a leaf photosynthetic rate factor determined by phenology (represented by SLN) into the model inputs. The selection of appropriate input variables plays a key role in GPP prediction [89]. The contribution of the selected variables and the importance of SLN and NMP were further verified by ranking the importance of input factors in the RF method. SSR is the most important contributing factor (43.3%) to GPP; it is the main energy source of organisms, and there is a direct relationship between photosynthesis and SSR. The physiological processes and photosynthesis of maize are regulated by the light and thermal effects of radiation [90]. NDVI, with a contribution of 21.8%, is the most commonly used feature and reflects plant canopy dynamics. The greater the amount of green vegetation, the more near-infrared light it reflects and the more red light it absorbs, leading to a rise in NDVI [91]. Unexpectedly, the contributions of SLN and NMP exceeded that of air temperature, suggesting that they also had a significant impact on GPP estimates. The three temperature variables also account for a certain proportion, reflecting the climate characteristics of the sites. In addition, by regulating the physiological processes of vegetation, temperature largely shapes the phenological progression of maize. Therefore, there is a certain correlation between temperature and the phenological factor (NMP). However, using data from the three Nebraska sites, all four ML methods show that the positive effect of NMP on the model (A3) can compensate for the decrease in accuracy caused by information overlap. Therefore, it is feasible to consider the three temperature variables and NMP simultaneously.
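For reference, NDVI is conventionally computed from red and near-infrared surface reflectance as (NIR − Red) / (NIR + Red). The following minimal sketch uses made-up reflectance values, not data from the study sites, to show how a denser green canopy raises the index.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total


# Illustrative reflectances: a dense canopy reflects more NIR and absorbs more red.
sparse_canopy = ndvi(nir=0.25, red=0.15)
dense_canopy = ndvi(nir=0.50, red=0.05)
print(f"sparse canopy NDVI: {sparse_canopy:.2f}")
print(f"dense canopy NDVI:  {dense_canopy:.2f}")
```

Because the index is a bounded ratio, further increases in canopy greenness produce progressively smaller changes in NDVI, which is consistent with the saturation at high GPP noted in this section.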
The roles of SLN and NMP were further validated from site to site in Nebraska using the RF method. The division into training and test sets was determined by the respective data volumes and water stress differences of the three sites, and the optimal input variable combination of Section 3.2.2 was used. SLN and NMP both maintained their positive impact on the model, but to different degrees, depending on the specific soil and water conditions at each site. When the other three ML methods (SVM, CNN, and ELM) were applied to GPP prediction for all three sites in Nebraska, good results were also obtained (NSE > 0.95 and RMSE < 2 gC·m⁻²·day⁻¹), and the estimation accuracy of all methods was similar. Specifically, after considering SLN and NMP, the accuracy of all models improved. However, SVM and RF improved most when SLN and NMP were considered at the same time, whereas CNN and ELM improved most when SLN or NMP was considered separately. CNN uses a more complex model and weight sharing in its algorithm, which allows it to learn complex problems quickly. Back-propagation (BP) neural network algorithms use single hidden layer feedforward neural networks (SLFNs) as universal approximators, but their parameter optimization is complicated [92]. ELM addresses this problem because its hidden layer parameters do not need to be optimized, while the approximation capability of SLFNs is maintained. In CNN and ELM, the correlation between SLN and NMP created an overlap of known information and reduced accuracy, so it was better to use only one kind of physiological information. In the verification at the RO1 site, where the unbiased estimator (URMSE) was used for evaluation, the addition of SLN and NMP led to different results for different ML methods. All ML methods showed a high degree of fit in the scatter plots after considering physiological information, which demonstrates its effectiveness.
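The accuracy metrics quoted above can be computed as in the following sketch. It assumes the standard textbook definitions (Nash-Sutcliffe efficiency, RMSE, bias taken as predicted minus observed, and unbiased RMSE obtained after removing the mean error); the exact definitions and sign conventions used in the study are given in its methods section and may differ. The sample values are synthetic.

```python
from math import sqrt


def gpp_metrics(observed, predicted):
    """Return NSE, RMSE, Bias, and URMSE for paired daily GPP series."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]

    bias = sum(errors) / n                       # mean error (assumed sign: pred - obs)
    squared_error_sum = sum(e * e for e in errors)
    rmse = sqrt(squared_error_sum / n)
    # Unbiased RMSE: RMSE after subtracting the mean error from every residual.
    urmse = sqrt(sum((e - bias) ** 2 for e in errors) / n)

    obs_mean = sum(observed) / n
    obs_variance_sum = sum((o - obs_mean) ** 2 for o in observed)
    if obs_variance_sum == 0:
        raise ValueError("NSE is undefined when observations are constant")
    nse = 1.0 - squared_error_sum / obs_variance_sum

    return {"NSE": nse, "RMSE": rmse, "Bias": bias, "URMSE": urmse}


# Synthetic example values in gC m^-2 day^-1 (not measurements).
obs = [2.0, 8.5, 15.0, 22.0, 27.5]
pred = [2.5, 9.0, 14.0, 21.0, 28.5]
for name, value in gpp_metrics(obs, pred).items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, RMSE² = Bias² + URMSE², so URMSE isolates the random part of the error, which is useful when a transfer site such as RO1 shows a systematic offset.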
In evaluating the accuracy of the GPP prediction results, data uncertainty affects the verification of results. First, inaccurate GPP observations at flux towers produce errors [75], and there are uncertainties in the NDVI remote sensing data sources. The GPP at the three Nebraska sites ranges from 0 to 32 gC·m⁻²·day⁻¹, and NDVI saturates at such high GPP values [93]. The reconstruction of NDVI using an 8-day MOD09Q1 product and a daily MOD09Q1 product can generally obtain higher accuracy. However, in cloudy conditions, the 8-day composite product still contains continuous noise [53]. Second, the maximum allowable SLN we considered was obtained at the leaf scale, which creates a scale mismatch with the meteorological and remote sensing data. Moreover, the SLN of the RO1 site was derived from the polynomial relationship fitted between NMP and SLN at the three Nebraska sites, and the points in Figure 3 are still scattered to some extent, which also brings uncertainty. Third, when RO1 was used for verification, all models produced a high RMSE, which was probably due to the differences in farmland management between the RO1 site and the three Nebraska sites. Differences in moisture and soil brought about spatial inconsistency. Finally, the sensitivity of ML methods is highly dependent on the amount and accuracy of data [94]. While the data set used in this study has high accuracy, errors in the training data set and correlations between input variables (such as SLN and NMP) can affect GPP estimation.
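The transfer of SLN to the RO1 site through a fitted polynomial in NMP can be illustrated with the least-squares sketch below. The polynomial degree, the function names, and the synthetic data points are assumptions made for illustration; the study's actual fit and coefficients are those shown in its Figure 3 and are not reproduced here.

```python
def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: need more distinct x values")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate_polynomial(coeffs, x):
    """Evaluate c0 + c1*x + ... with Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Synthetic (NMP, SLN) pairs for illustration only; not measurements.
nmp = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
sln = [1.0, 1.5, 1.8, 1.7, 1.4, 0.9]
coeffs = fit_polynomial(nmp, sln, degree=2)   # degree 2 is an assumption
sln_at_new_site = evaluate_polynomial(coeffs, 0.5)
print("fitted coefficients:", [round(c, 3) for c in coeffs])
print("SLN predicted at NMP = 0.5:", round(sln_at_new_site, 3))
```

Once SLN is generated this way, it is a deterministic function of NMP, which is why the two inputs carry overlapping information and why scatter around the fitted curve translates into uncertainty at the transfer site.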
The results demonstrate the advantages of SLN and NMP in improving daily maize GPP estimation via four ML methods. In the future, it would be interesting to include other vegetation indices in the inputs to correct the saturation of NDVI at high GPP values. Moreover, important factors associated with site meteorological data, such as vapor pressure deficit (VPD) and soil moisture, are expected to be incorporated into the model. In addition, using data with longer time spans to increase the data volume and improve data alignment is another way to enhance accuracy. Note that, owing to the fixed polynomial function between SLN and NMP, data redundancy seemed to occur. On the one hand, in this case, introducing SLN or NMP alone, rather than both, may guarantee the robustness of ML methods such as ELM and CNN (Table 3). On the other hand, direct measurement or high-frequency remote sensing inversion of SLN is needed in the future to further study the value of SLN. However, the large gap between the areas of a satellite pixel and a leaf blade makes SLN inversion via satellite platforms a major challenge. Fortunately, low-altitude unmanned aerial vehicles (UAVs) provide a feasible option at the regional scale.
5. Conclusions
GPP plays a key role in maintaining the carbon balance of terrestrial ecosystems and in climate change. It is essential to accurately quantify daily GPP. Taking maize as an example and starting from five traditional inputs (NDVI, SSR, Tmean, Tmin, and Tmax), this study examined the importance of NMP and SLN in improving daily GPP estimation via four popular ML methods (RF, SVM, CNN, and ELM). The prediction results were assessed in detail and comprehensively compared using accuracy metrics (NSE, RMSE, Bias, CV, and URMSE).
The advantages of introducing NMP and SLN into the inputs were demonstrated by all applied ML methods with flux data from four sites, although different ML methods have different sensitivities to SLN and NMP. The significance of SLN and NMP was also confirmed by the importance ranking of the random forest. Notably, given the fixed relationship between the maximum allowable SLN and NMP, introducing NMP or SLN alone may yield better results for the CNN and ELM methods than introducing them simultaneously. This study indicates that plant phenology and leaf-level photosynthetic factors have great value in improving GPP estimation via ML methods, yet they have commonly been ignored in previous research. ML methods that consider SLN or NMP are expected to improve the accuracy of global maize GPP estimation.
All in all, because organic matter accumulates via maize photosynthesis, GPP is directly correlated with the photosynthetic rate. SLN and NMP, which jointly regulate photosynthesis, influence GPP synthesis. Integrating these dynamic physiological aspects of maize as input variables into machine learning models notably improved the models’ accuracy. This study provides new insights for improving GPP estimation via ML methods.