Estimation of Gross Primary Productivity Using Performance-Optimized Machine Learning Methods for the Forest Ecosystems in China

Na, Qin; Lai, Quan; Bao, Gang; Xue, Jingyuan; Liu, Xinyi; Gao, Rihe

doi:10.3390/f16030518


by Qin Na ¹, Quan Lai ¹,²,*, Gang Bao ¹,², Jingyuan Xue ³, Xinyi Liu ⁴ and Rihe Gao ¹

¹ College of Geographical Science, Inner Mongolia Normal University, Hohhot 010022, China

² Inner Mongolia Key Laboratory of Remote Sensing and Geographic Information Systems, Inner Mongolia Normal University, Hohhot 010022, China

³ Institute for Disaster Management and Reconstruction, Sichuan University, Chengdu 610041, China

⁴ School of Public Management, Inner Mongolia University of Finance and Economics, Hohhot 010070, China

* Author to whom correspondence should be addressed.

Forests 2025, 16(3), 518; https://doi.org/10.3390/f16030518

Submission received: 20 February 2025 / Revised: 11 March 2025 / Accepted: 13 March 2025 / Published: 15 March 2025

(This article belongs to the Special Issue Application of Machine-Learning Methods in Forestry)


Abstract

Gross primary productivity (GPP) quantifies the rate at which plants convert atmospheric carbon dioxide into organic matter through photosynthesis, playing a vital role in the terrestrial carbon cycle. Machine learning (ML) techniques excel in handling spatiotemporally complex data, facilitating accurate spatial-scale inversion of forest GPP by integrating limited ground flux measurements with Remote Sensing (RS) observations. Enhancing ML algorithm performance for precise GPP estimation is a key research focus. This study introduces the Random Grid Search Algorithm (RGSA) for hyperparameter tuning to improve Random Forest (RF) and eXtreme Gradient Boosting (XGB) models across four major forest regions in China. Model optimization progressed through three stages: the Unoptimized (UO) XGB model achieved R² = 0.77 and RMSE = 1.42 g C m⁻² d⁻¹; the Hyperparameter Optimized (HO) XGB model using RGSA improved R² by 5.19% (to 0.81) and reduced RMSE by 9.15% (to 1.29 g C m⁻² d⁻¹); the Hyperparameter and Variable Combination Optimized (HVCO) XGB model with selected variables (LAI, Temp, NR, VPD, and NDVI) further enhanced R² to 0.83 and decreased RMSE to 1.23 g C m⁻² d⁻¹. The optimized GPP estimates exhibited high spatial consistency with existing high-quality products like GOSIF GPP, GLASS GPP, and FLUXCOM GPP, validating the model’s reliability and effectiveness. This research provides crucial insights for improving GPP estimation accuracy and optimizing ML methodologies for forest ecosystems in China.

Keywords: gross primary productivity; forest ecosystem; machine learning; hyperparameter optimization algorithm

1. Introduction

GPP of vegetation refers to the total amount of organic carbon fixed through photosynthesis per unit time and area. GPP is a critical metric for assessing the capacity of ecosystems to provide material and energy support, forming an essential basis for studying and understanding ecosystem functions and processes [1]. Global forest ecosystems occupy approximately one-third (33%) of the Earth’s total terrestrial surface area, encompassing an estimated 43 million square kilometers [2]. Forest GPP contributes about 48.5% of the total GPP of global terrestrial ecosystems [3], playing a pivotal role in global biodiversity, climate regulation, and carbon flux. In the context of rising greenhouse gas concentrations [4], accurately estimating forest GPP is vital for understanding forest ecosystem functions and addressing global environmental changes.

GPP inversion involves using specific methods and auxiliary variables for extending limited measurements to a regional scale [5,6]. This process requires selecting appropriate tools and variables, where “tools” refer to inversion methods, and “variables” denote the reference inputs [7]. Inversion tools encompass the Light Use Efficiency (LUE) model [8], terrestrial ecosystem process models [9], data-driven models based on Eddy Covariance (EC) observations [10], and ML and artificial intelligence [11]. These methods have shown great potential in solving complex and nonlinear problems in the RS field and are thus widely used in vegetation productivity inversion. For instance, Guo et al. employed the Support Vector Machines (SVM) model to simulate and evaluate the GPP of cornfields in the northwest region, demonstrating SVM’s high predictive accuracy and stability in that area [12]. Wolanin et al. used Sentinel-2 and Landsat-8 data to estimate crop GPP through neural networks, achieving promising results [13], while Bai et al. established a GPP regression model using meteorological data and LAI based on the RF algorithm [14]. Previous studies have demonstrated the high performance and accuracy potential of ML algorithms such as RF, SVM, and ANN (Artificial Neural Network) in addressing complex spatiotemporal prediction tasks [15,16,17].

However, using RS data with appropriate ML algorithms to improve the accuracy of forest GPP estimation remains challenging, despite the notable successes of these techniques. Selecting the appropriate ML algorithm and feature set is difficult, as different algorithms have distinct strengths and weaknesses and suit different data types and problems [18]. It is especially important to choose RS variables that accurately reflect the ecological and physiological characteristics of forests for GPP estimation [19]. Additionally, ML often faces multicollinearity when processing RS image data. High correlations among variables can lead to instability in model estimation and reduce explanatory power, a problem that has not been extensively researched [20,21]. Hyperparameter tuning is crucial for enhancing the performance of ML models, as unoptimized hyperparameters can lead to underfitting or poor generalization. However, finding the optimal parameter combination is often time-consuming and computationally intensive. This study proposes a robust optimization framework that systematically addresses these challenges through integrated tool selection, variable optimization, and hyperparameter fine-tuning mechanisms [17,22,23].

This study aims to use advanced model optimization algorithms to fine-tune ML models and construct an accurate estimation model of the GPP of forest ecosystems in China. The specific objectives are: (1) to optimize ML models using hyperparameter optimization algorithms and evaluate the best models by site and season; (2) to explore multiple variable combinations across all models and select the optimal combination; and (3) to assess the consistency of the generated GPP data with existing GPP products. The results of this study will provide a novel methodological framework for accurately estimating GPP under various environmental scenarios, and they will offer a scientific basis for the effective management and conservation of forest ecosystems.

2. Materials and Methods

2.1. Study Area

China exhibits significant regional differences in forest coverage due to its geographical and climatic diversity, with forest types ranging from boreal coniferous forests in the northeast to tropical rainforests in the south [24]. Forests are primarily concentrated in the eastern coastal areas, the northeast, and the mountainous regions of the southwest [25]. To examine these diverse forest ecosystems, this study focuses on several representative regions across China (Figure 1). In the northeast, Jilin Province is characterized by temperate mixed coniferous and broadleaf forests. In contrast, Jiangxi and Guangdong Provinces in southern China are dominated by subtropical evergreen broadleaf forests, noted for their rich biodiversity. Yunnan Province, with its plateau topography, features unique subtropical and tropical montane forests, making it a biodiversity hotspot.

2.2. Data and Preprocessing

2.2.1. Flux Data from ChinaFLUX

ChinaFLUX (http://www.chinaflux.org/, accessed on 14 February 2025), the Chinese Terrestrial Ecosystem Flux Research Network, provides continuous long-term measurements of carbon, water, and energy exchanges between terrestrial ecosystems and the atmosphere. We used the daily flux data that had been quality-controlled and gap-filled by the ChinaFLUX data center [26,27]. From the ChinaFLUX database, we extracted multiple meteorological parameters: air temperature (Temp), vapor pressure deficit (VPD), solar radiation (SR), net radiation (NR), and photosynthetically active radiation (PAR). Additionally, two key carbon flux metrics were acquired: net ecosystem exchange (NEE) and ecosystem respiration (RE). GPP was subsequently derived from the acquired NEE and RE measurements using the established relationship $GPP = NEE - RE$ [28].

2.2.2. Remote Sensing Data

Vegetation greenness significantly influences GPP; therefore, this study incorporates NDVI (Normalized Difference Vegetation Index), EVI (Enhanced Vegetation Index), and LAI (Leaf Area Index). NDVI and EVI are derived from MODIS products (MOD13A2 Terra and MYD13A2 Aqua) with a 16-day revisit cycle and 1 km spatial resolution [29]. By combining Terra and Aqua data, we obtained composite data with an 8-day revisit cycle [30]. LAI data are sourced from the GLASS product, with a 0.05° spatial resolution and an 8-day temporal resolution [31]. The temporal resolution of these forest greenness indicators was enhanced through Cubic Spline interpolation, transforming the 8-day NDVI, EVI, and LAI measurements into daily time series [32]. This interpolation approach captures the natural cosine-like variations in forest greenness, thereby ensuring data accuracy and temporal continuity for subsequent analyses.
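The 8-day to daily interpolation can be illustrated with a minimal sketch. The source names Cubic Spline interpolation but does not state the boundary conditions or the software used, so the natural boundary conditions, the function name `natural_cubic_spline`, and the NDVI values below are assumptions made purely for illustration.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. Natural boundary conditions (zero second
    derivative at both ends) are an assumption of this sketch.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]

    # Right-hand side of the tridiagonal system for the quadratic coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])

    # Forward sweep of the tridiagonal solve.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]

    # Back substitution for the linear (b), quadratic (c), and cubic (d) terms.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamping to the first/last segment.
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


# Hypothetical 8-day NDVI composites (day of year, value).
composite_days = [1, 9, 17, 25, 33]
ndvi_8day = [0.31, 0.38, 0.47, 0.58, 0.66]

spline = natural_cubic_spline(composite_days, ndvi_8day)
daily_ndvi = [spline(day) for day in range(1, 34)]  # one value per day
```

The spline reproduces the composite values on the composite dates and fills the days in between with a smooth curve; other boundary conditions would give slightly different values near the ends of the series.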

2.2.3. Remote Sensing-Based GPP Products

Global Solar-Induced chlorophyll Fluorescence (GOSIF), FLUXCOM, and GLASS GPP products are widely utilized in terrestrial ecosystem productivity studies. The GOSIF GPP product (http://data.globalecology.unh.edu/data/GOSIF_v2/, accessed on 14 February 2025) is derived from Orbiting Carbon Observatory-2 (OCO-2) Solar-Induced chlorophyll Fluorescence (SIF) observations and MODIS reflectance data, providing global coverage at 0.05° spatial resolution and 8-day temporal resolution [33]. FLUXCOM (Flux Communication) GPP (https://www.fluxcom.org/CF-Download/, accessed on 14 February 2025), generated by integrating FLUXNET site observations with machine learning algorithms and remote sensing data, offers monthly estimates at 0.5° spatial resolution, representing a fusion of multiple modeling approaches [34]. The GLASS GPP product (https://glass.hku.hk/archive/GPP/, accessed on 14 February 2025), developed using a light use efficiency model and improved remote sensing inputs, delivers measurements at 0.05° spatial resolution with 8-day temporal intervals [35].

2.3. Methods

2.3.1. Machine Learning Methods

ML is a multidisciplinary field that encompasses a variety of specific algorithms. This study selects three ML algorithms: RF, ANN, and XGB.

(1): RF: RF is an ensemble ML method that generates multiple decision trees from random samples [36,37]. During the construction of decision trees, the sample distribution is considered, and the bootstrap process is used for resampling. A bagging (bootstrap aggregating) process creates the final solution by averaging the results of the bootstrap trees [38]. This process improves the model’s performance and robustness by mitigating overfitting and enhancing generalization. In this study, the RF algorithm is implemented using the Scikit-learn library in Python 3.11, with hyperparameters such as n_estimators, max_depth, min_samples_split, and min_samples_leaf adjusted accordingly.
(2): ANN: ANNs simulate biological neural systems and typically consist of an input layer for explanatory variables, multiple hidden layers for nonlinear computation, and an output layer for producing results [39]. The weights and biases in the neural network are optimized by minimizing the cost function between actual labels and predicted values, allowing the network to learn and adapt to various data patterns for improved prediction accuracy. In this study, the ANN structure was modified from an initial two-layer neural network to a two-layer neural network with Dropout layers (rate = 0.5) inserted between hidden layers [40] to prevent overfitting and enhance generalization.
(3): XGB: Introduced by Chen and Guestrin, XGB is a highly efficient algorithm within the gradient boosting framework [41]. It supports parallel tree boosting and reduces overfitting through L1 and L2 regularization [42]. XGB’s innovative features, such as randomized parameter selection, leaf node proportion adjustment, and a unique tree penalty mechanism, offer superior performance for various data science tasks [43,44]. In this study, XGB is implemented using the xgboost library in Python 3.12, with hyperparameters like n_estimators, learning_rate, max_depth, subsample, colsample_bytree, min_child_weight, lambda, and alpha optimized to enhance model accuracy and efficiency.

2.3.2. Random Grid Search Algorithm (RGSA)

Hyperparameter tuning is a critical step in developing ML models. Manual tuning can be time-consuming and labor-intensive due to the complex interrelationships between hyperparameters. An automated hyperparameter tuning algorithm, namely RGSA, was developed to address this optimization challenge. The RGSA methodology optimizes the hyperparameters of RF and XGB models, thereby enhancing their predictive performance. Unlike Bayesian optimization [45], which relies on probabilistic surrogate models, RGSA employs a two-stage approach combining random and grid search [23,46] strategies with lower computational overhead, while incorporating a stability evaluation mechanism through the population stability index (PSI) [47]. The detailed workflow and implementation framework of RGSA are illustrated in Figure 2.

(1): Initialize the upper limit ( $param_{max}$ ), lower limit ( $param_{min}$ ) and coarse step size ( $step_{rough}$ ) for the parameters to be adjusted.
(2): Use random search within the upper and lower limits for $step_{rough}$ , and calculate the average accuracy of each point using ten-fold cross-validation.
(3): Determine the coarse optimal point ( $param_{rough}$ ) and obtain the range $g_{range} \in [param_{max} - param_{min}, param_{max} + param_{min}]$ .
(4): Use grid search within the $g_{range}$ to determine $param_{best}$ .
(5): Evaluate the model’s stability using the $PSI$ index (Equation (1)). If $PSI < 0.1$ , the model is considered stable; otherwise, return to step 1 and reinitialize the parameters.
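Steps (1)–(4) above can be sketched for a single integer hyperparameter as follows. This is an illustrative reading rather than the authors' implementation: the fine range is assumed to span one coarse step on either side of the coarse optimum, the function `two_stage_search` is hypothetical, and `toy_cv_score` is a made-up stand-in for the mean ten-fold cross-validation accuracy of a real model.

```python
import random


def two_stage_search(score, param_min, param_max, step_rough, step_fine,
                     n_random, seed=0):
    """Coarse random search followed by a fine grid search (integer parameter)."""
    rng = random.Random(seed)

    # Steps (1)-(2): sample points from the coarse grid and score each one.
    coarse_grid = list(range(param_min, param_max + 1, step_rough))
    sampled = rng.sample(coarse_grid, min(n_random, len(coarse_grid)))
    param_rough = max(sampled, key=score)

    # Step (3): fine range around the coarse optimum (assumed: +/- one coarse step).
    low = max(param_min, param_rough - step_rough)
    high = min(param_max, param_rough + step_rough)

    # Step (4): exhaustive grid search inside the fine range.
    param_best = max(range(low, high + 1, step_fine), key=score)
    return param_rough, param_best


def toy_cv_score(n_estimators):
    """Hypothetical stand-in for mean ten-fold CV accuracy; peaks at 230."""
    return 1.0 - abs(n_estimators - 230) / 1000.0


# n_random=10 covers the whole coarse grid here, so the result is seed-independent.
rough, best = two_stage_search(toy_cv_score, param_min=50, param_max=500,
                               step_rough=50, step_fine=10, n_random=10)
print(rough, best)  # 250 230
```

With fewer random samples than coarse grid points, the coarse optimum depends on the seed, which is the usual trade-off between search cost and coverage in the random stage.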

$PSI = \sum \left[ (y^{i} - x^{i}) \times \ln \left( \frac{y^{i}}{x^{i}} \right) \right]$

(1)

where $x^{i}$ is the i-th model’s estimated GPP, and $y^{i}$ is the i-th flux tower’s measured GPP.
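A minimal sketch of the stability check in Equation (1) is given below. It assumes the conventional PSI reading, in which $x^{i}$ and $y^{i}$ are the shares of estimated and measured GPP values falling into bin $i$; the source does not describe any binning, so the bin edges, the helper names, and the GPP values are all hypothetical.

```python
import bisect
import math


def bin_shares(values, edges):
    """Fraction of values in each bin; out-of-range values go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        k = bisect.bisect_right(edges, v) - 1
        k = min(max(k, 0), len(counts) - 1)
        counts[k] += 1
    return [c / len(values) for c in counts]


def psi(estimated, measured, edges, eps=1e-6):
    """PSI = sum((y_i - x_i) * ln(y_i / x_i)) over bins, as in Equation (1)."""
    x = bin_shares(estimated, edges)
    y = bin_shares(measured, edges)
    total = 0.0
    for xi, yi in zip(x, y):
        xi, yi = max(xi, eps), max(yi, eps)  # avoid log(0) for empty bins
        total += (yi - xi) * math.log(yi / xi)
    return total


# Hypothetical daily GPP values in g C m^-2 d^-1.
measured_gpp = [0.8, 1.5, 2.9, 3.4, 4.1, 5.6, 6.2, 7.3]
estimated_gpp = [1.1, 1.7, 2.5, 3.8, 4.4, 5.1, 6.6, 6.9]
edges = [0.0, 2.0, 4.0, 6.0, 8.0]

value = psi(estimated_gpp, measured_gpp, edges)
print(value, "stable" if value < 0.1 else "unstable")
```

Each term of the sum is non-negative, so PSI is zero only when the two binned distributions coincide and grows as they diverge, which is why a small threshold such as 0.1 can serve as a stability criterion.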

2.3.3. Comparison Map Profile Method (CMP)

Spatial validation of the selected optimal scheme was conducted by comparing the ML-based GPP estimates with historical RS products. For comparative analysis, the GPP estimates spanning from 2009 to 2015 were spatially resampled to maintain consistency with the resolution of historical products. The Comparison Map Profile (CMP) method [48,49] was used to evaluate spatial consistency. This method quantifies spatial similarity by calculating the absolute difference (D) and the cross-correlation coefficient (CC) for each pixel at multiple spatial scales. The D value represents the absolute difference, while the CC value indicates spatial similarity. For multi-spatial scale analysis, a moving window process was adopted, with the window size increasing from scale 1 (3 × 3 pixels) to scale 20 (41 × 41 pixels). The results from these 20 scales were averaged to produce the CMP map. Lower D values and higher CC values indicate greater similarity between the generated GPP data and historical products, thereby validating the selected optimal scheme.

In each pixel of each moving window, D is calculated as follows [50]:

$D = |\bar{x} - \bar{y}|$

(2)

where $\bar{x}$ and $\bar{y}$ are the average GPP of the model estimates and the historical products within the moving window, respectively.

$CC = \frac{1}{N^{2}} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{(x_{ij} - \bar{x}) \times (y_{ij} - \bar{y})}{\sigma_{x} \times \sigma_{y}}$

(3)

where $x_{ij}$ and $y_{ij}$ are the pixel GPP values from the model results and historical products, respectively. $i$ and $j$ denote the coordinates within the moving window. $\sigma_{x}$ and $\sigma_{y}$ are the standard deviations of $x$ and $y$ within the moving window. $N$ is the number of pixels within each moving window, corresponding to the window size [51].
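Equations (2) and (3) can be sketched for a single moving window as shown below. The sketch assumes that the window is square with $N$ pixels per side (consistent with the double sum and the $1/N^{2}$ factor) and that $\sigma_{x}$ and $\sigma_{y}$ are population standard deviations; the function names and pixel values are hypothetical.

```python
import math


def window_side(scale):
    """Window side length in pixels: scale 1 -> 3, scale 20 -> 41."""
    return 2 * scale + 1


def window_stats(x_win, y_win):
    """Return (D, CC) for two equally sized square windows given as nested lists."""
    xs = [v for row in x_win for v in row]
    ys = [v for row in y_win for v in row]
    n = len(xs)  # N * N pixels in the window

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    d = abs(x_bar - y_bar)  # Equation (2)

    sigma_x = math.sqrt(sum((v - x_bar) ** 2 for v in xs) / n)
    sigma_y = math.sqrt(sum((v - y_bar) ** 2 for v in ys) / n)
    if sigma_x == 0.0 or sigma_y == 0.0:
        return d, float("nan")  # CC is undefined for a constant window

    # Equation (3): mean of the standardized cross-products.
    cc = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n * sigma_x * sigma_y)
    return d, cc


# Hypothetical 3 x 3 windows (scale 1): the product is the model shifted by +1.
model = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
product = [[v + 1.0 for v in row] for row in model]
d_value, cc_value = window_stats(model, product)
```

In this example the two windows differ by a constant offset, so D picks up the offset while CC stays close to 1, which illustrates why the two measures are reported together.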

2.3.4. Model Accuracy Evaluation Methods

The accuracy of the five methods in estimating forest ecosystem GPP was evaluated using three performance measures: the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). R² represents the correlation strength between predicted and actual values, reflecting the model’s fit. RMSE, sensitive to large errors due to squaring the prediction errors, captures significant deviations. MAE, representing the average absolute prediction error, intuitively shows the average deviation. Overall, these metrics provide a comprehensive assessment of model accuracy. The calculation formulas are shown in Equations (4)–(6) [52]:

$R^{2} = \frac{\left[ \sum_{i = 1}^{n} (y_{i} - \bar{y}) (x_{i} - \bar{x}) \right]^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}$

(4)

$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - x_{i})^{2}}$

(5)

$MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - x_{i}|$

(6)
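The three accuracy measures can be computed as in the short sketch below, which follows Equation (4) as the squared correlation between the two series and takes MAE as the mean of absolute errors, as described in the text. The function names and sample values are illustrative only.

```python
import math


def r_squared(y, x):
    """Squared Pearson correlation between y and x (Equation (4))."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    cov = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    var_y = sum((yi - y_bar) ** 2 for yi in y)
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    return cov ** 2 / (var_y * var_x)


def rmse(y, x):
    """Root mean square error (Equation (5))."""
    return math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(y))


def mae(y, x):
    """Mean absolute error (Equation (6))."""
    return sum(abs(yi - xi) for yi, xi in zip(y, x)) / len(y)


# Hypothetical values: estimates with a constant +0.5 offset from the measurements.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 3.5, 4.5]
print(r_squared(measured, estimated), rmse(measured, estimated), mae(measured, estimated))
```

Because R² in this form measures correlation only, a constant bias leaves it at 1 while RMSE and MAE both report the 0.5 offset, so the three metrics are best read together.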

2.3.5. Variance Inflation Factor (VIF) Analysis

VIF analysis was performed to assess multicollinearity among predictors [53]. VIF values were calculated for all independent variables in the model. Predictors with VIF > 10 were considered to indicate severe multicollinearity and were excluded from subsequent modeling to ensure the stability and reliability of regression coefficients. The calculation formula is shown in Equation (7):

$VIF = \frac{1}{1 - R^{2}}$

(7)

where $R^{2}$ is the coefficient of determination obtained when a given predictor is regressed on the remaining predictors.
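As a worked illustration of Equation (7), the sketch below converts between VIF and the underlying R², taking R² in the standard sense of a regression of one predictor on all the others. The helper names are hypothetical; the VIF values are those reported in this article for LAI, NDVI, and EVI.

```python
def vif_from_r2(r2):
    """Equation (7): variance inflation factor for a given R^2."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Inverse of Equation (7): share of a predictor explained by the others."""
    return 1.0 - 1.0 / vif


# The exclusion threshold VIF > 10 corresponds to R^2 > 0.9.
print(round(r2_from_vif(10.0), 3))

# VIF values reported in this article.
for name, vif in [("LAI", 2.84), ("NDVI", 5.17), ("EVI", 8.08)]:
    print(name, round(r2_from_vif(vif), 3))
```

Read this way, a VIF of 2.84 means that roughly 65% of the variance in LAI is explained by the other predictors, while a VIF above 100 implies more than 99%.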

2.3.6. Enhanced GPP Estimation Model

This study evaluated three machine learning models (RF, ANN, XGB) alongside two linear regression models (LASSO and RIDGE) for estimating forest ecosystem GPP. The comparative analysis aimed to assess the potential and advantages of machine learning approaches through the integration of multi-source data and advanced model optimization algorithms. The specific steps are outlined in Figure 3:

(1): Select eight variables, including vegetation indices (NDVI, EVI, LAI), radiation data (PAR, NR, SR), and meteorological data (Temp, VPD), as the baseline feature set for model development.
(2): Standardize all data to generate the required input for the model and split the dataset into training (2003–2008, ~75%) and testing sets (2009–2010, ~25%).
(3): Optimize the hyperparameters for three ML models (RF, ANN, XGB) and two linear regression models (LASSO, RIDGE) using the RGSA developed in this study.
(4): Address multicollinearity, noting significant correlations (e.g., EVI with NDVI and SR, NR, PAR, correlations > 0.88, see Figure 4). Use variable combinations in Figure 5 for further optimization. Temp and VPD, representing temperature and moisture, are not split.
(5): Evaluate post-optimization of ML models with site data to select the best model and optimal variable combination.
(6): Apply the selected model and feature combination for spatial inversion analysis of forest GPP in the relevant province (2009–2015) and compare results with other GPP products (GOSIF, GLASS, FLUXCOM).
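The year-based split and the standardization in step (2) can be sketched as follows. The source states only that the data were standardized and split by period; scaling with the training-period mean and standard deviation is an assumption of this sketch (a common way to keep test data out of the scaling statistics), and the record layout and LAI values are hypothetical.

```python
import statistics

TRAIN_YEARS = range(2003, 2009)  # 2003-2008
TEST_YEARS = range(2009, 2011)   # 2009-2010


def split_by_year(records):
    """Split records into training and testing sets by calendar year."""
    train = [r for r in records if r["year"] in TRAIN_YEARS]
    test = [r for r in records if r["year"] in TEST_YEARS]
    return train, test


def standardize(train, test, feature):
    """Z-score one feature using training-period statistics (assumed choice)."""
    values = [r[feature] for r in train]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return ([(r[feature] - mean) / std for r in train],
            [(r[feature] - mean) / std for r in test])


# Hypothetical records: one LAI value per year.
records = [{"year": y, "LAI": 2.0 + 0.25 * (y - 2003)} for y in range(2003, 2011)]

train, test = split_by_year(records)
lai_train, lai_test = standardize(train, test, "LAI")
print(len(train), len(test))  # 6 2
```

With one record per year, six of the eight records fall in the training period, matching the roughly 75%/25% proportions given in step (2).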

3. Results

3.1. Model Evaluation

3.1.1. Assessing the GPP Estimation Capability of Different Models

Figure 6 presents the performance comparison of seven models for GPP prediction against EC-based measurements. The performance metrics reveal a clear pattern of improvement from linear models to ensemble methods with RGSA hyperparameter optimization.

The linear regression models (LASSO and Ridge) showed identical performance with R² of 0.66 and RMSE of 1.74 g C m⁻² d⁻¹. Their PSI values of 0.175 and 0.172 indicated moderate prediction stability. The ANN model achieved a higher R² (0.70) and lower error metrics (RMSE = 1.63 g C m⁻² d⁻¹) but exhibited the highest PSI value (0.668) among all models, suggesting poor prediction stability.

Ensemble methods demonstrated substantial improvements over linear and ANN models. The base RF and XGBoost models both achieved an R² of 0.77 with similar error metrics (RMSE ≈ 1.43–1.44 g C m⁻² d⁻¹). Notably, these models showed remarkably improved stability with PSI values of 0.050 and 0.041, respectively.

The RGSA-based hyperparameter optimization further enhanced model performance. The RF HO (hyperparameter optimization) model reached an R² of 0.79 with reduced errors (RMSE = 1.37 g C m⁻² d⁻¹). The XGBoost HO model demonstrated the best overall performance with the highest R² (0.81), lowest RMSE (1.29 g C m⁻² d⁻¹), nearly zero bias (0.03 g C m⁻² d⁻¹), and excellent stability (PSI = 0.040). The regression slopes for ensemble models (0.923–0.967) were consistently closer to the 1:1 line compared to linear models and ANN, indicating better capture of the GPP relationship across all observed values.

In summary, the ensemble models, particularly XGBoost with RGSA-based hyperparameter optimization, outperformed linear regression and ANN approaches in GPP prediction, showing higher accuracy, lower errors, and substantially better prediction stability.

3.1.2. Seasonal GPP Simulation Results

The model performances across different seasons (spring [March–May], summer [June–August], autumn [September–November], and winter [December–February]) are shown in Figure 7. Overall, the ML models exhibited the highest accuracy in winter, with all models showing higher R² values ranging from 0.36 to 0.79, lower RMSE (0.83–1.45 g C m⁻² d⁻¹), and lower MAE (0.62–1.09 g C m⁻² d⁻¹) than other seasons. Spring follows, with autumn showing slightly lower performance than spring, while summer shows significantly poorer performance compared to other seasons (R² = 0.54~0.68, RMSE = 1.63~1.95 g C m⁻² d⁻¹, MAE = 1.23~1.56 g C m⁻² d⁻¹). Additionally, the ML models consistently outperformed linear models across all seasons, and the XGB model performed the best in all seasons.

3.2. Comparison of Estimation Capabilities with Different Variable Combinations

VIF analysis was conducted to evaluate multicollinearity among the environmental variables (Table 1). The analysis revealed varying degrees of multicollinearity among the environmental variables. PAR and SR exhibited severe multicollinearity with extremely high VIF values (>100), suggesting they contain highly redundant information. Several vegetation indices and environmental parameters (NR, EVI, VPD, Temp, and NDVI) showed moderate multicollinearity with VIF values ranging from 5.17 to 8.46, slightly exceeding the commonly used threshold of 5. Only LAI demonstrated low collinearity with a VIF value of 2.84. Based on these findings, we implemented strategic variable selection in subsequent modeling to minimize redundancy while preserving essential predictive information.

The five models exhibited varying prediction accuracies on the test set when using different combinations of explanatory variables (Figure 5), as illustrated in Figure 8. Variable combination 2, which includes NDVI, NR, Temp, VPD, and LAI, yielded the optimal predictive performance, explaining an average of 73.7% of GPP variance (R² = 0.737) with an RMSE of 1.52 g C m⁻² d⁻¹ and an MAE of 1.16 g C m⁻² d⁻¹. Compared to using all variables (combination 7), combination 2 improved the average R² by 3.96% and reduced RMSE by 0.77 g C m⁻² d⁻¹. Combination 5, which incorporated EVI, NR, Temp, VPD, and LAI, demonstrated the second-best performance, explaining an average of 73% of GPP variance with an RMSE of 1.54 g C m⁻² d⁻¹ and an MAE of 1.17 g C m⁻² d⁻¹. Variable combinations 1, 3, and 7 exhibited comparable performance, with R² values ranging from 0.71 to 0.72, RMSE from 1.56 to 1.60 g C m⁻² d⁻¹, and MAE from 1.19 to 1.22 g C m⁻² d⁻¹. Combinations 4 and 6 showed similar average performance to combinations 1, 3, and 7; however, most individual models demonstrated inferior stability with these two variable sets. Overall, variable combination 2 demonstrated superior predictive performance and model stability compared to all other variable combinations.

3.3. Comparison of Optimized Model Results with Other Products

3.3.1. Comparison of GPP Estimation Results Based on Site Data

Based on the GPP measurements from ChinaFLUX, this study compared the accuracy of forest GPP estimates from different products and the XGB model. From January 2009 to December 2010, a comparative analysis of monthly GPP observed at EC sites, alongside estimates from GOSIF, GLASS, and FLUXCOM, as well as from the XGB model (Figure 9a), demonstrated that all three RS-based GPP products and the XGB-based estimates effectively captured the seasonal variations in GPP within China’s forest ecosystems. Among them, the XGB model yielded notably more accurate forest GPP estimates than the other products. Specifically, the correlation coefficient (R) of the XGB model reached 0.92 (Figure 9b), which is notably higher than those of FLUXCOM, GLASS, and GOSIF. This indicated that XGB can capture the trend of GPP changes with higher accuracy (Figure 9a). In terms of estimation error, the RMSE of the XGB model was 1.55, the lowest among the four, further indicating its highest accuracy. For the other three data products, both GLASS and GOSIF had an R of 0.82, with an RMSE of 2.42 and 2.21, respectively, while FLUXCOM had an R of 0.72 and an RMSE of 3.28. Overall, the XGB model performed the best at the site-specific scale.

3.3.2. Comparison of Spatial Consistency in GPP Estimates

Using the XGB model, forest GPP for Jilin, Guangdong, Jiangxi, and Yunnan provinces was estimated and compared with RS-based GPP products from GLASS, GOSIF, and FLUXCOM. The results are shown in Figure 10. The ML-based GPP and the RS-based GPP showed good consistency in spatial distribution, with the strongest agreement found for the GLASS product (Figure 10).

The CC was calculated to analyze the spatial trend consistency between the ML-based GPP and the three RS-based GPP products. The results indicated that the spatial trends of the ML-based GPP are consistent with those of GLASS, GOSIF, and FLUXCOM. Additionally, the level of spatial difference between the ML-based GPP and the three products was measured using D. The results demonstrate that the differences between the ML-based GPP and the three products exhibit reasonable consistency, with D ranging from 0.31 to 58.01 g C m⁻² d⁻¹ (Figure 10). Overall, the ML-based GPP shows the best consistency with GLASS, which also implies that the ML-based GPP product generated in this study is reliable for estimating the GPP of forest ecosystems.

4. Discussion

4.1. Evaluation of Model Performance

Our comprehensive analysis demonstrated the superior capability of ML approaches in estimating forest GPP across four EC flux sites in China, consistently surpassing the performance of conventional linear regression methods. These findings align with and further reinforce the growing body of evidence supporting the advantages of ML-based approaches in ecosystem carbon flux estimation. For instance, Bzdok et al. reported similar improvements in carbon flux predictions using ML algorithms [54], while Lee et al. documented enhanced accuracy in GPP estimations through advanced ML techniques [55]. More recent studies by Irvin et al. and Zhu et al. have also corroborated the robust performance of ML methods in capturing complex ecosystem-atmosphere interactions [7,56]. Irrespective of forest type classification, the XGBoost model exhibited markedly enhanced GPP estimation accuracy and predictive performance across the entire study domain, demonstrating substantial improvements over alternative methodological approaches. The five methods showed varying capabilities in estimating GPP for different forest types, with XGB, a tree-based ML algorithm, showing obvious advantages over the others [57,58].

Our comparative analysis reveals that both hyperparameter optimization (HO) and combined hyperparameter-variable optimization (HVCO) significantly enhance XGBoost model performance for GPP estimation (Figure 11). The RGSA effectively identifies optimal parameter configurations, while the variable combination optimization addresses multicollinearity issues evident in the VIF analysis. The selected variable combination (NDVI, Temp, VPD, NR, LAI) exhibits substantially lower multicollinearity than excluded variables like PAR and SR (VIF > 100), contributing to the model’s improved stability and accuracy [59]. Notably, NDVI (VIF = 5.17) was selected over EVI (VIF = 8.08) despite EVI’s theoretical advantages in dense vegetation, likely because NDVI’s lower multicollinearity provides greater statistical benefits within the model framework. This optimization strategy resulted in substantial performance improvements, with HVCO achieving 7.79% higher R² and 13.38% lower RMSE compared to the unoptimized model, demonstrating that addressing both algorithm parameterization and multicollinearity through strategic variable selection provides a robust framework for forest productivity assessment.
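The relative improvements quoted for the HO and HVCO stages follow directly from the R² and RMSE values reported in the Abstract, as the short check below illustrates. The helper name `relative_change` is introduced here only for illustration.

```python
def relative_change(baseline, optimized):
    """Percentage change of `optimized` relative to `baseline`."""
    return (optimized - baseline) / baseline * 100.0


# (R^2, RMSE in g C m^-2 d^-1) for the three XGB optimization stages.
stages = {"UO": (0.77, 1.42), "HO": (0.81, 1.29), "HVCO": (0.83, 1.23)}

r2_uo, rmse_uo = stages["UO"]
for name in ("HO", "HVCO"):
    r2, rmse = stages[name]
    r2_gain = round(relative_change(r2_uo, r2), 2)          # increase in R^2
    rmse_drop = round(-relative_change(rmse_uo, rmse), 2)   # decrease in RMSE
    print(name, r2_gain, rmse_drop)
# HO 5.19 9.15
# HVCO 7.79 13.38
```

Both percentages are computed against the unoptimized model, so the HVCO figures include the gain already achieved by hyperparameter optimization alone.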

Our models exhibited notably poorer performance during summer compared to other seasons. This seasonal discrepancy can be primarily attributed to three factors in our study regions. First, increased cloud cover during the East Asian monsoon season significantly reduces optical remote sensing data quality in southern provinces [60]. Second, frequent summer precipitation creates complex soil moisture conditions where moisture-productivity relationships become nonlinear when approaching saturation [61]. Third, dense summer canopies in these forest ecosystems may lead to optical signal saturation, while high atmospheric water vapor content increases uncertainties in satellite data processing [62]. These combined factors limit the models’ ability to accurately capture the complex ecosystem dynamics during the peak growing season, suggesting that future improvements could incorporate multi-source remote sensing data to overcome these seasonal limitations.

4.2. Key Influencing Factors of Forest GPP

Among the key environmental and biophysical variables (LAI, Temp, NR, VPD, and NDVI) utilized in estimating GPP of forest ecosystems in China, LAI emerges as the most influential determinant (Figure 12). LAI represents the total leaf area within a specific region and is not only a crucial indicator of vegetation cover and growth status but also a key factor in determining the photosynthetic potential of plants [63,64,65]. LAI reflects the ability of vegetation to intercept solar radiation [66], which is directly related to the absorption of PAR. Higher LAI means more leaf area involved in photosynthesis, thereby increasing the photosynthetic rate, which plays a decisive role in the increase in GPP. Therefore, LAI is positively correlated with GPP (R = 0.79 as shown in Figure 4). Additionally, Zhang et al. reported that even using LAI alone as a variable can effectively estimate GPP in different forest types in North America, further highlighting the significance of LAI in GPP estimation [11]. Other variables, such as Temp, NR, and VPD, are also important for estimating vegetation photosynthesis [67]. For example, results in Bao et al. demonstrated that GPP responds significantly to Temp, VPD, soil moisture supply, light saturation, cloud cover, and CO₂, and most of these responses are nonlinear [68]. Although EVI has a higher correlation with GPP than NDVI, it was not selected as a parameter in the optimal variable combination. This may be due to the high correlation between EVI and LAI (R = 0.77), leading to multicollinearity issues that reduce the model’s estimation performance. Furthermore, NDVI had the lowest importance, which might be due to data uncertainties and issues related to Cubic Spline interpolation. However, explanatory variable combinations that exclude NDVI might result in worse model performance.
The comparison between the four-factor (LAI, Temp, NR, and VPD) and five-factor (LAI, Temp, NR, VPD, and NDVI) models revealed interesting patterns in GPP estimation accuracy. The slight reduction in performance when NDVI is excluded suggests that, while NDVI contributes to model accuracy, the four-factor model still maintains robust predictive capability (Figure 13). This may be attributed to several reasons. First, most research sites are located in southern regions, where the relationship between NDVI and vegetation type and growth conditions may differ from that in other areas [69]. Given the abundant vegetation cover and extended growth cycles in these regions, NDVI, as an indicator of vegetation health and activity, may not fully exhibit its sensitivity and discriminative ability under the vegetation conditions in this study. Second, although NDVI’s relative importance in the model is limited, it still provides essential information on vegetation status, which is crucial for predicting GPP [70]. NDVI reflects vegetation biomass and photosynthetic activity, both indispensable factors in GPP estimation. Third, no significant multicollinearity was observed between NDVI and the other environmental factors in the ML model (Figure 4). This indicates that NDVI can serve as a complementary factor, working in conjunction with other variables without causing information redundancy.
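
To make the multicollinearity argument concrete, the following minimal Python sketch converts a pairwise correlation into a variance inflation factor (VIF) using the two-predictor identity VIF = 1/(1 − r²). The function names `pearson_r` and `pairwise_vif` and the toy EVI and LAI series are illustrative assumptions, not part of the study’s code. The pairwise value is only a lower bound on the VIF obtained when a variable is regressed on all remaining predictors.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_vif(r):
    """VIF implied by a single pairwise correlation r (two-predictor case)."""
    if abs(r) >= 1.0:
        raise ValueError("VIF is undefined for perfectly correlated predictors")
    return 1.0 / (1.0 - r * r)


# Correlation between EVI and LAI as reported in the text (R = 0.77).
vif_evi_lai = pairwise_vif(0.77)

# Hypothetical toy series, only to show how r would be computed from data.
toy_evi = [0.21, 0.35, 0.48, 0.52, 0.40]
toy_lai = [1.1, 2.0, 3.4, 3.9, 2.6]
r_toy = pearson_r(toy_evi, toy_lai)
```

With the reported EVI and LAI correlation of 0.77, the pairwise VIF is about 2.46. The VIF from a regression on all other predictors can only be equal or larger, so this figure should be read as an indicative lower bound rather than a full diagnostic.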

4.3. Comparison with Existing Product Data

Although there are discrepancies between the ML-based GPP and the RS-based GPP products in certain regions, their temporal distribution characteristics remain highly consistent (Figure 14). This indicates that the ML model can effectively capture the seasonal variations in GPP in forest ecosystems, in agreement with previous research findings. For instance, Zhang et al. found that ML models, such as SVR and RF, capture the seasonal variations in GPP better than traditional biogeochemical process models [11]. Similarly, Lee et al. confirmed that a tuned DNN model accurately predicted GPP in forest ecosystems under extreme climatic conditions [64]. The ML-based GPP in this study ranges from 1.06 to 8.45 g Cm⁻² d⁻¹, which is comparable to RS-based GPP products such as GLASS (0–9.62 g Cm⁻² d⁻¹), GOSIF (0.37–9.50 g Cm⁻² d⁻¹), and FLUXCOM (0.03–8.45 g Cm⁻² d⁻¹), demonstrating the capability and effectiveness of the proposed approach in simulating forest GPP (Figure 14).

ML-based GPP in this study was highly consistent with GLASS-based GPP, particularly in the provinces of Guangdong, Yunnan, and Jiangxi (Figure 14a,b,d). This may be attributed mainly to three factors. First, the LAI data used in this study are sourced from GLASS. Second, the characteristic variables selected in this study are similar to those of GLASS GPP (NDVI, PAR, Temp, and the ratio of ET to NR) [71]. Third, the GPP obtained in this study and GLASS GPP both have a spatial resolution of 1 km, so the matching resolution may also contribute to the similarity in GPP results.

4.4. Limitations and Prospects

Although the XGB-based GPP estimation model demonstrates promising performance, several aspects could be further improved. First, the sampling sites are regionally concentrated. While the flux tower data for Chinese forest types used in the study are representative, the sites are mainly located in southern regions. This bias may limit accuracy improvements in northern regions with lower GPP values. Second, the sampling intervals differ. Remote sensing data such as EVI, NDVI, and LAI are obtained at a 16-day frequency, and the model uses daily data derived through Cubic Spline interpolation, potentially introducing additional uncertainty. Third, the spatial resolution differs among the data. The variables in the study come from sources with different spatial resolutions. Although downscaling or upscaling was applied using the resampling method in ArcGIS to standardize the spatial resolution, uncertainties may still exist in variable extraction.
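
The interpolation step mentioned above can be illustrated with a natural cubic spline, written here in plain Python so that the sketch is self-contained. The boundary condition (zero second derivative at both ends), the function name `natural_cubic_spline`, and the LAI values are assumptions for illustration only; this section does not specify which spline variant or software was used.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. 'Natural' means the second derivative
    is set to zero at both end points (an assumption of this sketch).
    """
    n = len(xs) - 1                      # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)                  # second derivatives at the knots

    if n >= 2:
        # Tridiagonal system for the interior second derivatives m[1..n-1].
        a = [h[i - 1] for i in range(1, n)]                  # sub-diagonal
        b = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]   # diagonal
        c = [h[i] for i in range(1, n)]                      # super-diagonal
        d = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
             for i in range(1, n)]
        # Thomas algorithm: forward elimination, then back substitution.
        for k in range(1, n - 1):
            w = a[k] / b[k - 1]
            b[k] -= w * c[k - 1]
            d[k] -= w * d[k - 1]
        sol = [0.0] * (n - 1)
        sol[-1] = d[-1] / b[-1]
        for k in range(n - 3, -1, -1):
            sol[k] = (d[k] - c[k] * sol[k + 1]) / b[k]
        m[1:n] = sol

    def evaluate(x):
        # Locate the interval [xs[i], xs[i+1]] containing x (clamped at the ends).
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx0 = x - xs[i]
        dx1 = xs[i + 1] - x
        return ((m[i] * dx1 ** 3 + m[i + 1] * dx0 ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * dx1
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * dx0)

    return evaluate


# Hypothetical 16-day LAI composites (day of year, LAI value).
composite_days = [1, 17, 33, 49, 65]
composite_lai = [1.0, 1.4, 2.1, 3.0, 3.6]

spline = natural_cubic_spline(composite_days, composite_lai)
daily_lai = [spline(day) for day in range(1, 66)]   # one value per day, days 1..65
```

The spline reproduces each 16-day composite exactly, but the 15 daily values between two composites are determined by the smoothness assumption alone rather than by observations, which is one way the additional uncertainty noted above can arise.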

Future research should focus on the following aspects: First, to improve the spatiotemporal resolution of GPP estimates for forest ecosystems, it is essential to develop a high-resolution and representative large-scale training dataset. This approach aims to enhance data quality [72,73], thereby increasing the accuracy of GPP estimation. Second, to further boost the performance of the RGSA hyperparameter optimization algorithm, integrating swarm intelligence algorithms such as the Whale Optimization Algorithm (WOA) [74] and Particle Swarm Optimization (PSO) [75] is recommended. This integration could improve hyperparameter tuning accuracy and reduce computation time. Third, optimizing the variable combinations used for estimating GPP should be considered. In addition to the variables used in this study, other variables highly correlated with GPP, such as precipitation, wind speed, and terrain data, should be included. Fourth, using additional characteristic variables related to GPP products, and combining this strategy with few-shot learning methods in transfer learning [76,77], could help address the challenge of accurately estimating GPP in remote areas where measured GPP data are scarce or costly to obtain [78].
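
As a point of reference for the proposed extensions, the two-stage idea behind RGSA described in Figure 2 (random sampling of the hyperparameter space followed by a fine-grained grid search around the best sample) can be sketched as follows. This is not the authors’ implementation: `toy_cv_rmse` is a hypothetical stand-in for a cross-validated RMSE, the search ranges and step sizes are arbitrary, and the stability verification step is omitted. Only `learning_rate` and `max_depth` correspond to actual XGBoost hyperparameter names.

```python
import itertools
import random


def toy_cv_rmse(params):
    """Hypothetical stand-in for a cross-validated RMSE.

    A smooth bowl with its minimum at learning_rate=0.1, max_depth=6;
    a real application would train and validate a model here instead.
    """
    return (1.2
            + 40.0 * (params["learning_rate"] - 0.1) ** 2
            + 0.01 * (params["max_depth"] - 6) ** 2)


def random_stage(score, n_iter, rng):
    """Stage 1: sample the space at random and keep the best candidate."""
    best = None
    for _ in range(n_iter):
        cand = {"learning_rate": rng.uniform(0.01, 0.3),
                "max_depth": rng.randint(2, 12)}
        s = score(cand)
        if best is None or s < best[0]:
            best = (s, cand)
    return best


def grid_stage(score, centre, lr_halfwidth=0.03, lr_steps=7, depth_radius=1):
    """Stage 2: exhaustive fine grid in a small window around the stage-1 best."""
    lo = max(0.01, centre["learning_rate"] - lr_halfwidth)
    hi = min(0.3, centre["learning_rate"] + lr_halfwidth)
    lrs = [lo + k * (hi - lo) / (lr_steps - 1) for k in range(lr_steps)]
    depths = range(max(2, centre["max_depth"] - depth_radius),
                   min(12, centre["max_depth"] + depth_radius) + 1)
    best = (score(centre), centre)       # never return worse than the centre
    for lr, depth in itertools.product(lrs, depths):
        cand = {"learning_rate": lr, "max_depth": depth}
        s = score(cand)
        if s < best[0]:
            best = (s, cand)
    return best


rng = random.Random(0)                   # fixed seed for repeatability
coarse_score, coarse_params = random_stage(toy_cv_rmse, 30, rng)
fine_score, fine_params = grid_stage(toy_cv_rmse, coarse_params)
```

Because the grid stage starts from the best random sample, its result is never worse than the coarse result. A swarm-based optimizer such as WOA or PSO would replace one or both stages with a population that moves through the same search space.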

5. Conclusions

Based on GPP measurements from four ChinaFLUX forest ecosystem sites, this study developed an efficient and effective GPP model by comparing various RGSA-optimized algorithms and different variable combinations. With the optimal estimation method selected, this method was applied for inversion analysis on the observed GPP data. In addition, the performance of the selected model was evaluated by comparing the ML-based GPP with RS-based GPP products. The XGB model optimized with RGSA hyperparameters (Section 3.2) showed the highest accuracy (R² = 0.81, RMSE = 1.29 g Cm⁻² d⁻¹, MAE = 0.97 g Cm⁻² d⁻¹) in GPP simulations, achieving a site-scale correlation coefficient of 0.92. Compared to the model without hyperparameter tuning, R² improved by 5.19% and RMSE decreased by 9.15% (Figure 10), significantly enhancing predictive accuracy. Using the variable combination of LAI, Temp, NR, VPD, and NDVI, the model achieves superior GPP estimation performance (R² = 0.83, RMSE = 1.23 g Cm⁻² d⁻¹, MAE = 0.92 g Cm⁻² d⁻¹). Compared to using all variables, R² increased by 2.47% and RMSE decreased by 4.65% (Figure 10). The GPP data obtained in this study show high spatial consistency with existing RS products (GLASS GPP, GOSIF GPP, and FLUXCOM GPP), accurately capturing seasonal GPP variations in the forest ecosystems. This demonstrated excellent quality and reliability of the developed ML-based GPP model in this study. Overall, this study provides essential technical support for GPP estimation at both methodological and application levels, guiding future research. The findings significantly advance the capacity to quantify climate change effects on forest GPP through the implementation of RGSA, enabling enhanced carbon flux estimations and providing insights into global carbon cycle mechanisms.
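
The accuracy metrics and relative changes quoted above can be checked with a few lines of Python. The sketch assumes that R² is the coefficient of determination (1 − SS_res/SS_tot) and that percentage changes are taken relative to the earlier model; both are assumptions, as are all function names and the toy observation and prediction values.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def relative_change(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


# Hypothetical toy data (g C per square metre per day), for illustration only.
toy_obs = [2.0, 4.0, 6.0, 8.0]
toy_pred = [2.5, 3.5, 6.5, 7.5]
toy_metrics = (r2(toy_obs, toy_pred), rmse(toy_obs, toy_pred), mae(toy_obs, toy_pred))

# Values reported above: all-variable model vs. optimal variable combination.
r2_gain = relative_change(0.81, 0.83)      # percent change in R²
rmse_drop = relative_change(1.29, 1.23)    # percent change in RMSE
```

With the values reported in this section, the change from R² = 0.81 to 0.83 is about +2.47% and the change from RMSE = 1.29 to 1.23 g Cm⁻² d⁻¹ is about −4.65%, matching the figures quoted for the variable-combination step.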

Author Contributions

Conceptualization, Q.N. and Q.L.; Data curation, R.G.; Investigation, G.B. and X.L.; Methodology, Q.N.; Supervision, Q.L.; Validation, J.X., X.L. and R.G.; Writing—original draft, Q.N.; Writing—review and editing, Q.L. and J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42461049).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their crucial comments, which improved the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

GPP	Gross primary productivity
ML	Machine learning
RS	Remote Sensing
RGSA	Random Grid Search Algorithm
RF	Random Forest
XGB	eXtreme Gradient Boosting
LAI	Leaf Area Index
Temp	Temperature
NR	Net Radiation
VPD	Vapor Pressure Deficit
NDVI	Normalized Difference Vegetation Index
LUE	Light Use Efficiency
EC	Eddy Covariance
SVM	Support Vector Machines
ANN	Artificial Neural Network
SR	Solar Radiation
PAR	Photosynthetically Active Radiation
NEE	Net Ecosystem Exchange
RE	Ecosystem Respiration
EVI	Enhanced Vegetation Index
GOSIF	Global Solar-Induced chlorophyll Fluorescence
CMP	Comparison Map Profile Method
VIF	Variance inflation factor
OCO-2	Orbiting Carbon Observatory-2

References

1. Bloomfield, K.J.; Stocker, B.D.; Keenan, T.F.; Prentice, I.C. Environmental controls on the light use efficiency of terrestrial gross primary production. Glob. Change Biol. 2023, 29, 1037–1053.
2. Stat, F. Statistical Database of the Food and Agriculture Organization of the United Nations; FAO: Egypt, Cairo, 2014.
3. Beer, C.; Reichstein, M.; Tomelleri, E.; Ciais, P.; Jung, M.; Carvalhais, N.; Rödenbeck, C.; Arain, M.A.; Baldocchi, D.; Bonan, G.B. Terrestrial gross carbon dioxide uptake: Global distribution and covariation with climate. Science 2010, 329, 834–838.
4. Zhang, H.; Chen, B.; Xu, G.; Yan, J.; Che, M.; Chen, J.; Fang, S.; Lin, X.; Sun, S. Comparing simulated atmospheric carbon dioxide concentration with GOSAT retrievals. Sci. Bull. 2015, 60, 380–386.
5. Jung, M.; Reichstein, M.; Margolis, H.A.; Cescatti, A.; Richardson, A.D.; Arain, M.A.; Arneth, A.; Bernhofer, C.; Bonal, D.; Chen, J. Global patterns of land-atmosphere fluxes of carbon dioxide, latent heat, and sensible heat derived from eddy covariance, satellite, and meteorological observations. J. Geophys. Res. Biogeosciences 2011, 116.
6. Xiao, J.; Ollinger, S.V.; Frolking, S.; Hurtt, G.C.; Hollinger, D.Y.; Davis, K.J.; Pan, Y.; Zhang, X.; Deng, F.; Chen, J. Data-driven diagnostics of terrestrial carbon dynamics over North America. Agric. For. Meteorol. 2014, 197, 142–157.
7. Zhu, X.-J.; Yu, G.-R.; Chen, Z.; Zhang, W.-K.; Han, L.; Wang, Q.-F.; Chen, S.-P.; Liu, S.-M.; Wang, H.-M.; Yan, J.-H. Mapping Chinese annual gross primary productivity with eddy covariance measurements and machine learning. Sci. Total Environ. 2023, 857, 159390.
8. Pei, Y.; Dong, J.; Zhang, Y.; Yuan, W.; Doughty, R.; Yang, J.; Zhou, D.; Zhang, L.; Xiao, X. Evolution of light use efficiency models: Improvement, uncertainties, and implications. Agric. For. Meteorol. 2022, 317, 108905.
9. Lin, S.; Hu, Z.; Wang, Y.; Chen, X.; He, B.; Song, Z.; Sun, S.; Wu, C.; Zheng, Y.; Xia, X. Underestimated interannual variability of terrestrial vegetation production by terrestrial ecosystem models. Glob. Biogeochem. Cycles 2023, 37, e2023GB007696.
10. Wang, S.; Zhang, Y.; Ju, W.; Qiu, B.; Zhang, Z. Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Sci. Total Environ. 2021, 755, 142569.
11. Zhang, Z.; Xin, Q.; Li, W. Machine learning-based modeling of vegetation leaf area index and gross primary productivity across North America and comparison with a process-based model. J. Adv. Model. Earth Syst. 2021, 13, e2021MS002802.
12. Guo, H.; Zhou, X.; Dong, Y.; Wang, Y.; Li, S. On the use of machine learning methods to improve the estimation of gross primary productivity of maize field with drip irrigation. Ecol. Model. 2023, 476, 110250.
13. Wolanin, A.; Camps-Valls, G.; Gómez-Chova, L.; Mateo-García, G.; Van Der Tol, C.; Zhang, Y.; Guanter, L. Estimating crop primary productivity with Sentinel-2 and Landsat 8 using machine learning methods trained with radiative transfer simulations. Remote Sens. Environ. 2019, 225, 441–457.
14. Bai, J.; Zhang, H.; Sun, R.; Li, X.; Xiao, J.; Wang, Y. Estimation of global GPP from GOME-2 and OCO-2 SIF by considering the dynamic variations of GPP-SIF relationship. Agric. For. Meteorol. 2022, 326, 109180.
15. Killeen, P.; Kiringa, I.; Yeap, T.; Branco, P. Corn Grain Yield Prediction Using UAV-Based High Spatiotemporal Resolution Imagery, Machine Learning, and Spatial Cross-Validation. Remote Sens. 2024, 16, 683.
16. Liu, Q.; Gui, D.; Zhang, L.; Niu, J.; Dai, H.; Wei, G.; Hu, B.X. Simulation of regional groundwater levels in arid regions using interpretable machine learning models. Sci. Total Environ. 2022, 831, 154902.
17. Yu, R.; Yao, Y.; Tang, Q.; Shao, C.; Fisher, J.B.; Chen, J.; Jia, K.; Zhang, X.; Li, Y.; Shang, K. Coupling a light use efficiency model with a machine learning-based water constraint for predicting grassland gross primary production. Agric. For. Meteorol. 2023, 341, 109634.
18. Schratz, P.; Muenchow, J.; Iturritxa, E.; Cortés, J.; Bischl, B.; Brenning, A. Monitoring forest health using hyperspectral imagery: Does feature selection improve the performance of machine-learning techniques? Remote Sens. 2021, 13, 4832.
19. Jin, J.; Hou, W.; Ma, X.; Wang, H.; Xie, Q.; Wang, W.; Zhu, Q.; Fang, X.; Zhou, F.; Liu, Y. Improved estimation of gross primary production with NIRvP by incorporating a phenophase scheme for temperate deciduous forest ecosystems. For. Ecol. Manag. 2024, 556, 121742.
20. Winship, C.; Western, B. Multicollinearity and model misspecification. Sociol. Sci. 2016, 3, 627–649.
21. Yu, T.; Pang, Y.; Sun, R.; Niu, X. Spatial downscaling of vegetation productivity in the forest from deep learning. IEEE Access 2022, 10, 104449–104460.
22. Gaber, M.; Kang, Y.; Schurgers, G.; Keenan, T. Using automated machine learning for the upscaling of gross primary productivity. Biogeosciences 2024, 21, 2447–2472.
23. Tian, Z.; Fu, Y.; Zhou, T.; Yi, C.; Kutter, E.; Zhang, Q.; Krakauer, N.Y. Estimating Forest Gross Primary Production Using Machine Learning, Light Use Efficiency Model, and Global Eddy Covariance Data. Forests 2024, 15, 1615.
24. Xu, W.; Zhang, B.; Xu, Q.; Gao, D.; Zuo, H.; Ren, R.; Diao, K.; Chen, Z. Enhanced Carbon Storage in Mixed Coniferous and Broadleaf Forest Compared to Pure Forest in the North Subtropical–Warm Temperate Transition Zone of China. Forests 2024, 15, 1520.
25. Zhang, D.; Zuo, X.; Zang, C. Assessment of future potential carbon sequestration and water consumption in the construction area of the Three-North Shelterbelt Programme in China. Agric. For. Meteorol. 2021, 303, 108377.
26. Yu, G.-R.; Wen, X.-F.; Sun, X.-M.; Tanner, B.D.; Lee, X.; Chen, J.-Y. Overview of ChinaFLUX and evaluation of its eddy covariance measurement. Agric. For. Meteorol. 2006, 137, 125–137.
27. Yu, G.; Chen, Z.; Piao, S.; Peng, C.; Ciais, P.; Wang, Q.; Li, X.; Zhu, X. High carbon dioxide uptake by subtropical forest ecosystems in the East Asian monsoon region. Proc. Natl. Acad. Sci. USA 2014, 111, 4910–4915.
28. Hu, C.; Hu, S.; Zeng, L.; Meng, K.; Liao, Z.; Wang, K. Estimation of Daily Maize Gross Primary Productivity by Considering Specific Leaf Nitrogen and Phenology via Machine Learning Methods. Remote Sens. 2024, 16, 341.
29. Ha, T.V.; Uereyen, S.; Kuenzer, C. Agricultural drought conditions over mainland Southeast Asia: Spatiotemporal characteristics revealed from MODIS-based vegetation time-series. Int. J. Appl. Earth Obs. Geoinf. 2023, 121, 103378.
30. Lee, B.; Kim, N.; Kim, E.-S.; Jang, K.; Kang, M.; Lim, J.-H.; Cho, J.; Lee, Y. An artificial intelligence approach to predict gross primary productivity in the forests of South Korea using satellite remote sensing data. Forests 2020, 11, 1000.
31. Ma, H.; Liang, S. Development of the GLASS 250-m leaf area index product (version 6) from MODIS data using the bidirectional LSTM deep learning model. Remote Sens. Environ. 2022, 273, 112985.
32. Huang, L.; Liu, M.; Jiang, Y.; Tang, R. Coupled Estimation Of daily Gross Primary Production and Evapotranspiration at 84 Global Forest Sites. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 3741–3744.
33. Li, X.; Xiao, J. Mapping photosynthesis solely from solar-induced chlorophyll fluorescence: A global, fine-resolution dataset of gross primary production derived from OCO-2. Remote Sens. 2019, 11, 2563.
34. Jung, M.; Reichstein, M.; Schwalm, C.R.; Huntingford, C.; Sitch, S.; Ahlström, A.; Arneth, A.; Camps-Valls, G.; Ciais, P.; Friedlingstein, P. Compensatory water effects link yearly global land CO2 sink changes to temperature. Nature 2017, 541, 516–520.
35. Liang, S.; Zhang, X.; Xiao, Z.; Cheng, J.; Liu, Q.; Zhao, X. Global Land Surface Satellite (GLASS) Products: Algorithms, Validation and Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
36. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
37. Feng, P.; Wang, B.; Li Liu, D.; Waters, C.; Xiao, D.; Shi, L.; Yu, Q. Dynamic wheat yield forecasts are improved by a hybrid approach using a biophysical model and machine learning technique. Agric. For. Meteorol. 2020, 285, 107922.
38. Little, M.P.; Rosenberg, P.S.; Arsham, A. Alternative stopping rules to limit tree expansion for random forest models. Sci. Rep. 2022, 12, 15113.
39. Aziz, G.; Minallah, N.; Saeed, A.; Frnda, J.; Khan, W. Remote sensing based forest cover classification using machine learning. Sci. Rep. 2024, 14, 69.
40. Li, M.; Zhu, Z.; Ren, W.; Wang, Y. Predicting Gross Primary Productivity under Future Climate Change for the Tibetan Plateau Based on Convolutional Neural Networks. Remote Sens. 2024, 16, 3723.
41. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
42. Chen, J.; Li, K.; Tang, Z.; Bilal, K.; Yu, S.; Weng, C.; Li, K. A parallel random forest algorithm for big data in a spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 2016, 28, 919–933.
43. Zhang, M.; Chen, E.; Zhang, C.; Han, Y. Impact of seasonal global land surface temperature (LST) change on gross primary production (GPP) in the early 21st century. Sustain. Cities Soc. 2024, 110, 105572.
44. Qadeer, A.; Shakir, M.; Wang, L.; Talha, S.M. Evaluating machine learning approaches for aboveground biomass prediction in fragmented high-elevated forests using multi-sensor satellite data. Remote Sens. Appl. Soc. Environ. 2024, 36, 101291.
45. Mitra, B.; Tiwari, S.P.; Uddin, M.S.; Mahmud, K.; Rahman, S.M. Decision tree ensemble with Bayesian optimization to predict the spatial dynamics of chlorophyll-a concentration: A case study in Bay of Bengal. Mar. Pollut. Bull. 2024, 199, 115945.
46. Ma, A.; Wan, Y.; Zhong, Y.; Wang, J.; Zhang, L. SceneNet: Remote sensing scene classification deep learning network using multi-objective neural evolution architecture search. ISPRS J. Photogramm. Remote Sens. 2021, 172, 171–188.
47. Liu, X.; Al-Shaibah, B.; Zhao, C.; Tong, Z.; Bian, H.; Zhang, F.; Zhang, J.; Pei, X. Estimation of the key water quality parameters in the surface water, middle of northeast China, based on Gaussian process regression. Remote Sens. 2022, 14, 6323.
48. Gaucherel, C.; Alleaume, S.; Hély, C. The comparison map profile method: A strategy for multiscale comparison of quantitative and qualitative images. IEEE Trans. Geosci. Remote Sens. 2008, 46, 2708–2719.
49. Yao, Y.; Wang, X.; Li, Y.; Wang, T.; Shen, M.; Du, M.; He, H.; Li, Y.; Luo, W.; Ma, M. Spatiotemporal pattern of gross primary productivity and its covariation with climate in China over the last thirty years. Glob. Change Biol. 2018, 24, 184–196.
50. Yao, Y.; Li, Z.; Wang, T.; Chen, A.; Wang, X.; Du, M.; Jia, G.; Li, Y.; Li, H.; Luo, W. A new estimation of China’s net ecosystem productivity based on eddy covariance measurements and a model tree ensemble approach. Agric. For. Meteorol. 2018, 253, 84–93.
51. Zhou, T.; Hou, Y.; Yang, Z.; Laffitte, B.; Luo, K.; Luo, X.; Liao, D.; Tang, X. Reducing spatial resolution increased net primary productivity prediction of terrestrial ecosystems: A Random Forest approach. Sci. Total Environ. 2023, 897, 165134.
52. Sarkar, D.P.; Shankar, B.U.; Parida, B.R. Machine learning approach to predict terrestrial gross primary productivity using topographical and remote sensing data. Ecol. Inform. 2022, 70, 101697.
53. Kalantar, B.; Ueda, N.; Idrees, M.O.; Janizadeh, S.; Ahmadi, K.; Shabani, F. Forest Fire Susceptibility Prediction Based on Machine Learning Models with Resampling Algorithms on Remote Sensing Data. Remote Sens. 2020, 12, 3682.
54. Ij, H. Statistics versus machine learning. Nat. Methods 2018, 15, 233.
55. Lee, B.; Kim, E.; Jang, K.; Cho, N.H.; Song, W.; Park, G.; Park, C.; Lim, J.-H. Estimating Gross Primary Production using machine-learning algorithms based on eddy covariance measurements and remote sensing in forest ecosystem. In Proceedings of the EGU General Assembly Conference Abstracts, Vienna, Austria, 1 April 2019; p. 9036.
56. Irvin, J.; Zhou, S.; McNicol, G.; Lu, F.; Liu, V.; Fluet-Chouinard, E.; Ouyang, Z.; Knox, S.H.; Lucas-Moffat, A.; Trotta, C. Gap-filling eddy covariance methane fluxes: Comparison of machine learning model predictions and uncertainties at FLUXNET-CH4 wetlands. Agric. For. Meteorol. 2021, 308, 108528.
57. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
58. Jiang, H.; Chen, A.; Wu, Y.; Zhang, C.; Chi, Z.; Li, M.; Wang, X. Vegetation monitoring for mountainous regions using a new integrated topographic correction (ITC) of the SCS+ C correction and the shadow-eliminated vegetation index. Remote Sens. 2022, 14, 3073.
59. Habeeb, H.N.; Mustafa, Y.T. Deep learning approaches for estimating forest vegetation cover and exploring influential ecosystem factors. Earth Sci. Inform. 2024, 17, 3379–3396.
60. Whitcraft, A.K.; Vermote, E.F.; Becker-Reshef, I.; Justice, C.O. Cloud cover throughout the agricultural growing season: Impacts on passive optical earth observations. Remote Sens. Environ. 2015, 156, 438–447.
61. Xie, M.; Li, L.; Liu, B.; Liu, Y.; Wan, Q. Responses of terrestrial ecosystem productivity and community structure to intra-annual precipitation patterns: A meta-analysis. Front. Plant Sci. 2023, 13, 1088202.
62. Chaparro, D.; Duveiller, G.; Piles, M.; Cescatti, A.; Vall-Llossera, M.; Camps, A.; Entekhabi, D. Sensitivity of L-band vegetation optical depth to carbon stocks in tropical forests: A comparison to higher frequencies and optical indices. Remote Sens. Environ. 2019, 232, 111303.
63. Zha, T.; Barr, A.; Bernier, P.-Y.; Lavigne, M.; Trofymow, J.; Amiro, B.; Arain, M.; Bhatti, J.; Black, T.; Margolis, H. Gross and aboveground net primary production at Canadian forest carbon flux sites. Agric. For. Meteorol. 2013, 174, 54–64.
64. Lee, H.; Park, J.; Cho, S.; Lee, M.; Kim, H.S. Impact of leaf area index from various sources on estimating gross primary production in temperate forests using the JULES land surface model. Agric. For. Meteorol. 2019, 276, 107614.
65. Chen, X.; Cai, A.; Guo, R.; Liang, C.; Li, Y. Variation of gross primary productivity dominated by leaf area index in significantly greening area. J. Geogr. Sci. 2023, 33, 1747–1764.
66. De Bock, A.; Belmans, B.; Vanlanduit, S.; Blom, J.; Alvarado-Alvarado, A.; Audenaert, A. A review on the leaf area index (LAI) in vertical greening systems. Build. Environ. 2023, 229, 109926.
67. Madani, N.; Parazoo, N.C.; Kimball, J.S.; Ballantyne, A.P.; Reichle, R.H.; Maneta, M.; Saatchi, S.; Palmer, P.I.; Liu, Z.; Tagesson, T. Recent amplified global gross primary productivity due to temperature increase is offset by reduced productivity due to water constraints. AGU Adv. 2020, 1, e2020AV000180.
68. Bao, S.; Wutzler, T.; Koirala, S.; Cuntz, M.; Ibrom, A.; Besnard, S.; Walther, S.; Šigut, L.; Moreno, A.; Weber, U. Environment-sensitivity functions for gross primary productivity in light use efficiency models. Agric. For. Meteorol. 2022, 312, 108708.
69. Yang, Q.; Liu, X.; Huang, Z.; Guo, B.; Tian, L.; Wei, C.; Meng, Y.; Zhang, Y. Integrating satellite-based passive microwave and optically sensed observations to evaluating the spatio-temporal dynamics of vegetation health in the red soil regions of southern China. GIScience Remote Sens. 2022, 59, 215–233.
70. Zhou, Z.; Ding, Y.; Liu, S.; Wang, Y.; Fu, Q.; Shi, H. Estimating the applicability of NDVI and SIF to gross primary productivity and grain-yield monitoring in China. Remote Sens. 2022, 14, 3237.
71. Liang, S.; Cheng, J.; Jia, K.; Jiang, B.; Liu, Q.; Xiao, Z.; Yao, Y.; Yuan, W.; Zhang, X.; Zhao, X. The global land surface satellite (GLASS) product suite. Bull. Am. Meteorol. Soc. 2021, 102, E323–E337.
72. Leinenkugel, P.; Wolters, M.L.; Kuenzer, C.; Oppelt, N.; Dech, S. Sensitivity analysis for predicting continuous fields of tree-cover and fractional land-cover distributions in cloud-prone areas. In Remote Sensing the Mekong; Routledge: London, UK, 2018; pp. 53–75.
73. Virnodkar, S.S.; Pachghare, V.K.; Patil, V.; Jha, S.K. Remote sensing and machine learning for crop water stress determination in various crops: A critical review. Precis. Agric. 2020, 21, 1121–1155.
74. Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67.
75. Wang, D.; Tan, D.; Liu, L. Particle swarm optimization algorithm: An overview. Soft Comput. 2018, 22, 387–408.
76. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40.
77. Habeeb, H.N.; Mustafa, Y.T. Deep Learning-Based Prediction of Forest Cover Change in Duhok, Iraq: Past and Future. Forestist 2025, 75, 68.
78. Ma, Y.; Chen, S.; Ermon, S.; Lobell, D.B. Transfer learning in environmental remote sensing. Remote Sens. Environ. 2024, 301, 113924.

Figure 1. Distribution of forest flux tower sites across China. The map shows the locations of four flux tower stations (marked by red pentagons): QYZ (Qianyanzhou), CBS (Changbai Mountain), DHS (Dinghushan), XSBN (Xishuangbanna), overlaid on an elevation map of China. The surrounding photographs show the flux tower installations at each site.

Figure 2. Schematic workflow of the Random Grid Search Algorithm (RGSA), illustrating the iterative process from hyperparameter space initialization through random sampling with cross-validation to fine-grained grid search, culminating in optimal parameter selection after stability verification.

Figure 3. Technical roadmap. The workflow integrates vegetation indices (NDVI: Normalized Difference Vegetation Index, EVI: Enhanced Vegetation Index), biophysical parameters (LAI: Leaf Area Index, PAR: Photosynthetically Active Radiation), radiation measurements (NR: Net Radiation, SR: Solar Radiation), and environmental variables (Temp: Temperature, VPD: Vapor Pressure Deficit) as features. Data was split chronologically (75%/25%) between training and test sets. Machine learning (RF, ANN, XGBoost) and linear models (LASSO: Least Absolute Shrinkage and Selection Operator, RIDGE: Ridge Regression) were implemented with optimization procedures (RGSA: Random Grid Search Algorithm) for GPP estimation.

Figure 4. Correlation among multiple variables.

Figure 5. Schematic diagram of the input variable combinations used for model development. The diagram illustrates seven different combinations (1–7) of remote sensing variables including vegetation indices (EVI, NDVI, LAI), radiation parameters (PAR, NR, SR), and climate factors (Temp, VPD). Each combination represents a unique input variable set tested in the modeling framework.

Figure 6. Comparison of predicted versus EC-based GPP (g Cm⁻² d⁻¹) using different machine learning models. (a) LASSO regression, (b) Ridge regression, (c) ANN, (d) RF base model (RF Base), (e) XGBoost base model (XGBoost Base), (f) RF with hyperparameter optimization (RF HO), and (g) XGBoost with hyperparameter optimization (XGB HO). The scatter plots show the density distribution of data points, with the red line representing the linear regression fit, the black dashed line indicating the 1:1 line, and the gray shaded area showing the 95% confidence interval.

Figure 7. Seasonal comparison of model performance in GPP estimation. The subpanels show (a) R², (b) RMSE (g Cm⁻² d⁻¹), and (c) MAE (g Cm⁻² d⁻¹) metrics across four seasons: spring (March–May), summer (June–August), autumn (September–November), and winter (December–February). RMSE and MAE are expressed in g Cm⁻² d⁻¹.

Figure 8. Performance evaluation of GPP estimation models across different variable combinations (1–7): (a) R², (b) RMSE (g Cm⁻² d⁻¹), (c) MAE (g Cm⁻² d⁻¹). Box plots show median (central line), interquartile range (box, 25th–75th percentiles), and data distribution within 1.5×IQR (whiskers). Mean values are represented by dots (•) and squares (□).

Figure 9. Comparison of GPP estimation models against eddy covariance (EC) measurements. (a) Temporal GPP dynamics (g Cm⁻² d⁻¹) at four flux sites (DHS, QYZ, CBS, XSBN) from 2009 to 2010, showing FLUXCOM, GOSIF, GLASS, and XGBoost model estimates versus EC observations (dotted line). (b) Taylor diagram summarizing statistical performance of the four models relative to EC measurements through correlation coefficient (R), standard deviation, and RMSE metrics.

Figure 10. Spatial distribution of correlation coefficient (CC, upper row) and distance (D, lower row) metrics derived from CMP analysis for GOSIF, GLASS, and FLUXCOM products across four study regions (a–d). Higher CC values indicate stronger correlation while higher D values represent greater dissimilarity between observed patterns.

Figure 11. Comparison between predicted and EC-based GPP under three optimization strategies: (a) Unoptimized (UO), (b) Hyperparameter Optimized (HO), and (c) Hyperparameter and Variable Combination Optimized (HVCO). Black solid lines represent fitted regression lines, red dashed lines show the 1:1 relationship, and color intensity indicates data point frequency. The statistical metrics (R², RMSE, MAE, and BIAS) demonstrate progressive model improvement from UO to HVCO.
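
For readers who want to reproduce the four metrics named in this caption, a minimal sketch follows. It rests on assumptions, because the article's own definitions (Section 2.3.4) are not reproduced here: R² is taken as the coefficient of determination (some studies report the squared Pearson correlation instead), and BIAS is taken as the mean of predicted minus observed values. The function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Compute R^2, RMSE, MAE, and BIAS for paired observations.

    Assumptions: R^2 = 1 - SS_res / SS_tot, and BIAS = mean(predicted - observed).
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "bias": sum(errors) / n,
    }


# Hypothetical GPP values (g C m^-2 d^-1), for illustration only.
observed_gpp = [1.0, 2.0, 3.0, 4.0]
predicted_gpp = [1.2, 1.9, 3.4, 3.8]
print(regression_metrics(observed_gpp, predicted_gpp))
```

RMSE, MAE, and BIAS carry the units of GPP, whereas R² is dimensionless.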

Figure 12. Relative contribution of biophysical factors influencing forest GPP. The pie chart illustrates the percentage contribution of five key factors: leaf area index (LAI, 59.9%), temperature (Temp, 18.5%), net radiation (NR, 14.0%), vapor pressure deficit (VPD, 4.3%), and normalized difference vegetation index (NDVI, 3.4%).

Figure 13. Model evaluation results for estimating site GPP using LAI, Temp, NR, and VPD factors without NDVI input (HVCO-No-NDVI). Black solid line shows the fitted regression, red dashed line represents the 1:1 relationship, and color intensity indicates data point frequency.

Figure 14. Temporal variation in GPP estimates across four Chinese provinces (2008–2016). Time series of gross primary productivity (GPP) derived from four products (FLUXCOM, GOSIF, GLASS, and XGBoost) for (a) Guangdong, (b) Yunnan, (c) Jilin, and (d) Jiangxi.

Table 1. Multicollinearity assessment results (variance inflation factor, VIF).

Row	Variable	VIF
1	PAR	119.89
2	SR	119.12
3	NR	8.46
4	EVI	8.08
5	VPD	6.97
6	Temp	6.76
7	NDVI	5.17
8	LAI	2.84
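
As an illustration of how values like those in Table 1 can be obtained, the sketch below computes each VIF as a diagonal element of the inverse correlation matrix, which equals 1 / (1 − R_j²) when predictor j is regressed on all the others. It is a minimal example under stated assumptions, not the authors' implementation; the helper names (`pearson`, `invert`, `vif`) and the synthetic series are hypothetical. A common rule of thumb treats a VIF above 10 as a sign of strong collinearity.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Map each predictor name to its VIF (diagonal of the inverse correlation matrix)."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Synthetic, made-up series (not the study's data): "sr" nearly duplicates "par".
demo = {
    "par": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sr": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "lai": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
for name, value in vif(demo).items():
    print(f"{name}: {value:.2f}")
```

In this synthetic case the two nearly identical series receive very large VIF values, mirroring the pattern of PAR and SR in Table 1.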

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Na, Q.; Lai, Q.; Bao, G.; Xue, J.; Liu, X.; Gao, R. Estimation of Gross Primary Productivity Using Performance-Optimized Machine Learning Methods for the Forest Ecosystems in China. Forests 2025, 16, 518. https://doi.org/10.3390/f16030518
