#### *2.3. Model*

#### 2.3.1. Deep Learning Model

In this study, we constructed our model based on the GeoMAN algorithm developed by Liang et al. in 2018, which was originally applied to predict air quality [21]. The GeoMAN algorithm can extract the spatial correlation of input variables and consider the influence of neighboring sites on the target site's GPP, which can help estimate the GPP of grasslands more accurately. The GeoMAN algorithm consists of an encoder and a decoder. The encoder contains a mechanism to consider the features within a site, a mechanism to consider the spatial features between sites, and an LSTM model to extract the local features of the site to be predicted and the spatial features of relevant surrounding sites. The decoder includes a temporal attention mechanism and an LSTM model, which decode the feature vector output by the encoder to predict grassland GPP.

There are complex correlations between environmental variables and GPP at each flux site. The inter-site feature attention mechanism of the GeoMAN algorithm dynamically captures the association between environmental variables and GPP within the site targeted for prediction. The inter-site feature attention mechanism for the target flux site is estimated as follows:

$$e\_{k,t} = \mathbf{v}\_0^T \tanh\left(\mathbf{W}\_0 \left[\mathbf{h}\_{t-1}; \mathbf{s}\_{t-1}\right] + \mathbf{U}\_0 \mathbf{I}\_k^0 + \mathbf{b}\_0\right) \tag{1}$$

In Equation (1), [*h*<sub>*t*−1</sub>; *s*<sub>*t*−1</sub>] denotes the concatenation (implemented with TensorFlow's concatenation operation) of *h*<sub>*t*−1</sub> and *s*<sub>*t*−1</sub>, the hidden state and the cell state of the LSTM network at time *t* − 1, respectively; they carry the information of the previous *t* − 1 time steps and thereby form the long-term and short-term memory of the LSTM network. *I*<sup>0</sup><sub>*k*</sub> denotes the *k*-th input time series at the flux site to be predicted. *v*<sub>0</sub>, *W*<sub>0</sub>, *U*<sub>0</sub>, and *b*<sub>0</sub> are the learnable parameters: during training, they are continuously updated via backpropagation according to the loss function. The environmental factors selected for predicting GPP in this study are air temperature (Ta), photosynthetically active radiation (PAR), enhanced vegetation index (EVI), normalized difference vegetation index (NDVI), and land surface water index (LSWI). The weighting value of each factor under the GeoMAN model's inter-site feature attention mechanism is calculated as follows:

$$\alpha\_{k,t} = \frac{\exp(e\_{k,t})}{\sum\_{j=1}^{N} \exp\left(e\_{j,t}\right)}\tag{2}$$

In Equation (2), *α*<sub>*k*,*t*</sub> is the weighting value of the *k*-th feature (Ta, PAR, EVI, NDVI, or LSWI) at time *t*, with the softmax normalization taken over the features, so the weighting values of all features sum to 1. Each weighting value is then multiplied by the corresponding feature value, allowing the GeoMAN model to distinguish the importance of the different features.
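Equations (1) and (2) can be sketched as follows with NumPy. The dimensions and the randomly initialized parameters are purely illustrative, not the trained values of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 5   # Ta, PAR, EVI, NDVI, LSWI
T = 10           # length of each input time series
hidden = 16      # LSTM hidden size (illustrative)

# Previous hidden and cell state of the encoder LSTM
h_prev = rng.standard_normal(hidden)
s_prev = rng.standard_normal(hidden)
hs = np.concatenate([h_prev, s_prev])          # [h_{t-1}; s_{t-1}]

# Learnable parameters (randomly initialized here for illustration)
v0 = rng.standard_normal(T)
W0 = rng.standard_normal((T, 2 * hidden))
U0 = rng.standard_normal((T, T))
b0 = rng.standard_normal(T)

# I[k]: the k-th input time series (length T), one per feature
I = rng.standard_normal((n_features, T))

# Equation (1): one attention score e_{k,t} per feature
e = np.array([v0 @ np.tanh(W0 @ hs + U0 @ I[k] + b0) for k in range(n_features)])

# Equation (2): softmax over the feature scores
alpha = np.exp(e) / np.exp(e).sum()
assert np.isclose(alpha.sum(), 1.0)   # the weights of all features sum to 1
```

The weighted inputs `alpha[k] * I[k]` are then what the encoder LSTM consumes at each time step.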

As shown in Figure 2, the LSTM unit is an important computational unit in the GeoMAN model, and Figure 3 shows its schematic diagram. The input at time *t* is *x*<sub>*t*</sub>, and *c*<sub>*t*−1</sub> and *h*<sub>*t*−1</sub> are the cell state and the hidden state at time *t* − 1, respectively. They pass through three main stages inside the LSTM unit: first, the forgetting stage, which selectively discards information carried over from the previous time step; second, the selective memory stage, which selectively stores the input of the current time step, emphasizing important parts and downplaying less important ones; and third, the output stage, which produces the new hidden state *h*<sub>*t*</sub> and cell state *c*<sub>*t*</sub> and passes them to the next time step.
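The three stages can be sketched as a single LSTM step in NumPy; the weight shapes and random values below are illustrative placeholders, not the model's trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the
    forget (f), input (i), candidate (g), and output (o) gates."""
    z = W @ x_t + U @ h_prev + b
    H = h_prev.size
    f = sigmoid(z[0:H])          # forgetting stage: what to drop from c_{t-1}
    i = sigmoid(z[H:2*H])        # selective memory: what to keep from x_t
    g = np.tanh(z[2*H:3*H])      # candidate cell values
    o = sigmoid(z[3*H:4*H])      # output gate
    c_t = f * c_prev + i * g     # updated cell state
    h_t = o * np.tanh(c_t)       # output stage: new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, H = 5, 8                       # five input features, hidden size 8
W = rng.standard_normal((4 * H, n_in))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((10, n_in)):   # unroll over 10 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```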

**Figure 2.** The structure of GeoMAN adapted from Liang et al. [21].

**Figure 3.** The structure of LSTM.

After the input data are assigned different weights by the inter-site feature attention mechanism and updated by the LSTM unit, it is essential to select the relevant historical time steps for GPP forecasting, which further improves prediction precision. Therefore, the GeoMAN model introduces a temporal attention mechanism. The attention weight of each hidden state at a historical time step is calculated as follows [35]:

$$u\_{t,\tau} = \mathbf{v}\_d^T \tanh\left(\mathbf{W}\_d \left[\mathbf{h}\_{\tau-1}'; \mathbf{s}\_{\tau-1}'\right] + \mathbf{W}\_d' \mathbf{h}\_t + \mathbf{b}\_d\right) \tag{3}$$

$$\gamma\_{t,\tau} = \frac{\exp(u\_{t,\tau})}{\sum\_{j=1}^{T} \exp(u\_{j,\tau})} \tag{4}$$

In Equation (3), *h*′<sub>*τ*−1</sub> and *s*′<sub>*τ*−1</sub> are the hidden state and the cell state of the decoder LSTM at time step *τ* − 1, where *τ* is the output prediction time step. *v*<sub>*d*</sub>, *W*<sub>*d*</sub>, *W*′<sub>*d*</sub>, and *b*<sub>*d*</sub> are the learnable parameters. The output vector of the temporal attention mechanism at this time step is as follows [35]:

$$\mathbf{c}\_{\tau} = \sum\_{t=1}^{T} \gamma\_{t,\tau} \mathbf{h}\_t \tag{5}$$
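Equations (3)–(5) can be sketched as follows; the hidden sizes and randomly initialized parameters are illustrative, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
T, hidden = 10, 16                         # encoder time steps and hidden size

H_enc = rng.standard_normal((T, hidden))   # encoder hidden states h_1 .. h_T
d_prev = rng.standard_normal(hidden)       # decoder hidden state h'_{tau-1}
s_prev = rng.standard_normal(hidden)       # decoder cell state  s'_{tau-1}

# Learnable parameters (random placeholders); m is the attention width
m = 12
vd  = rng.standard_normal(m)
Wd  = rng.standard_normal((m, 2 * hidden))
Wd2 = rng.standard_normal((m, hidden))     # W'_d in Equation (3)
bd  = rng.standard_normal(m)

ds = np.concatenate([d_prev, s_prev])      # [h'_{tau-1}; s'_{tau-1}]

# Equation (3): score of each encoder step t for decoder step tau
u = np.array([vd @ np.tanh(Wd @ ds + Wd2 @ H_enc[t] + bd) for t in range(T)])

# Equation (4): softmax over the T encoder steps
gamma = np.exp(u) / np.exp(u).sum()

# Equation (5): context vector as the weighted sum of encoder hidden states
c_tau = gamma @ H_enc
```

The context vector `c_tau` is what the decoder LSTM consumes to predict GPP at step *τ*.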

#### 2.3.2. Model Training and Evaluation

The variables in the flux site data used in this study differ greatly in magnitude. To ensure that model training is not affected by this, each variable is standardized (z-score normalized) using the following formula:

$$\mathbf{x}' = \frac{\mathbf{x} - \mu}{\sigma} \tag{6}$$

In Equation (6), *μ* and *σ* are the mean and standard deviation of the corresponding variable, respectively. The processed data have a mean of 0 and a variance of 1, which prevents the model from being biased toward variables with large value ranges. After this pre-processing, training begins. Since the observation data from the nine flux sites are limited in scale (1421 records in total), ten-fold cross-validation was applied to make full use of the data and to evaluate the model's prediction performance over the whole data set: for each fold, 10% of the data were held out as the test set, and the remaining 90% were used for training. The training portion was further divided into training and validation sets at a ratio of 9:1, and the data order was randomly shuffled to avoid over-fitting. The model uses Mean Squared Error (MSE) as the loss function and the Adam optimizer to update the model parameters. This process was repeated ten times to obtain predictions for all the data and evaluate the results.
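A minimal sketch of this pre-processing and splitting scheme, using synthetic data in place of the flux-site records (the variable names and the five-column layout are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 50, size=(1421, 5))    # 1421 records, 5 environmental factors

# Equation (6): z-score standardization, column by column
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma
assert np.allclose(Xs.mean(axis=0), 0) and np.allclose(Xs.std(axis=0), 1)

# Ten-fold cross-validation: each fold holds out ~10% as the test set
idx = rng.permutation(len(Xs))
folds = np.array_split(idx, 10)
for k, test_idx in enumerate(folds):
    train_val = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # The remaining 90% is further split 9:1 into training and validation sets
    split = int(0.9 * len(train_val))
    train_idx, val_idx = train_val[:split], train_val[split:]
```

In practice a library helper such as scikit-learn's `KFold` would do the fold bookkeeping; the loop above just makes the 10%/81%/9% partition explicit.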

This study used three common statistical indicators to evaluate model prediction accuracy: the coefficient of determination (R<sup>2</sup>), root mean squared error (RMSE), and mean absolute error (MAE). The relevant formulas are as follows:

$$R^2 = \left(\frac{\sum\_{i=1}^n \left(y\_i - \overline{y}\right)\left(y\_i' - \overline{y}'\right)}{\sqrt{\sum\_{i=1}^n \left(y\_i - \overline{y}\right)^2} \sqrt{\sum\_{i=1}^n \left(y\_i' - \overline{y}'\right)^2}}\right)^2\tag{7}$$

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( y\_i' - y\_i \right)^2} \tag{8}$$

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |y\_i' - y\_i| \tag{9}$$

where *y*′<sub>*i*</sub> and *y*<sub>*i*</sub> are the predicted and observed values of GPP, respectively; $\overline{y}'$ and $\overline{y}$ are their respective mean values; and *n* is the number of observation samples.
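Equations (7)–(9) translate directly into NumPy; the sample arrays below are made-up values for demonstration only:

```python
import numpy as np

def r_squared(y_obs, y_pred):
    # Equation (7): squared Pearson correlation between observed and predicted
    yo, yp = y_obs - y_obs.mean(), y_pred - y_pred.mean()
    return (np.sum(yo * yp)
            / (np.sqrt(np.sum(yo ** 2)) * np.sqrt(np.sum(yp ** 2)))) ** 2

def rmse(y_obs, y_pred):
    # Equation (8): root mean squared error
    return np.sqrt(np.mean((y_pred - y_obs) ** 2))

def mae(y_obs, y_pred):
    # Equation (9): mean absolute error
    return np.mean(np.abs(y_pred - y_obs))

# Made-up observed/predicted GPP values, just to exercise the functions
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y_obs, y_pred), rmse(y_obs, y_pred), mae(y_obs, y_pred))
```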

#### **3. Case Analysis**

In this section, we introduce the basic idea of our experimental design. We tested the performance of different models on the aggregate data to show that deep learning models can achieve highly accurate predictions. Moreover, since the model predictions must conform to basic ecological knowledge, we conducted multiple experiments controlling for the spatial distance between flux sites, vegetation types, and environmental factors.

#### *3.1. Prediction Accuracy with All Factors*

#### 3.1.1. Comparison of Model Performance with the Use of All Data

In this analysis, we used the data from all flux sites to train the Random Forest (RF), Support Vector Machine (SVM), Deep Belief Network (DBN), and GeoMAN models. The relationships between predicted and observed GPP under ten-fold cross-validation are shown in Figure 4 and Table 2, and the four models clearly differ in performance. The Random Forest model has the lowest prediction accuracy, with an RMSE of 0.954 g C m<sup>−2</sup> d<sup>−1</sup>, an MAE of 0.553 g C m<sup>−2</sup> d<sup>−1</sup>, and an R<sup>2</sup> of 0.810. The Deep Belief Network model improves on the Random Forest model, with an RMSE of 0.912 g C m<sup>−2</sup> d<sup>−1</sup>, an MAE of 0.559 g C m<sup>−2</sup> d<sup>−1</sup>, and an R<sup>2</sup> of 0.827. The Support Vector Machine model performs similarly to the Deep Belief Network model, with an RMSE of 0.910 g C m<sup>−2</sup> d<sup>−1</sup>, an MAE of 0.571 g C m<sup>−2</sup> d<sup>−1</sup>, and an R<sup>2</sup> of 0.827. The GeoMAN model has the highest prediction accuracy, with an RMSE of 0.788 g C m<sup>−2</sup> d<sup>−1</sup>, an MAE of 0.440 g C m<sup>−2</sup> d<sup>−1</sup>, and an R<sup>2</sup> of 0.870, indicating that it fits the GPP values of the nine flux sites on the Tibetan Plateau better than the other three models.

**Figure 4.** Performance of RF, SVR, DBN, and GeoMAN models in predicting GPP.


**Table 2.** Results of the four models with the use of all data.

#### 3.1.2. Performance at a Single Flux Site

As shown in Figure 1, the distribution map of the nine flux sites indicates that different sites have different vegetation and climate conditions. Therefore, it is necessary to test the GPP prediction accuracy of the GeoMAN model at different flux sites.

#### 1. Test site GPP against remaining sites

We used the target flux site as the test data and the remaining sites as the training data; all nine sites were tested in turn. As shown in Figure 5, performance differs considerably among the flux sites. Possible reasons for this difference are that (1) some sites have vegetation types different from the target site, and (2) some sites have climate conditions different from the target site because of the long distances between them.

**Figure 5.** Predicted GPP vs. labeled GPP at a single site.

#### 2. Test site GPP against the other sites within 500 km and 100 km

The distances between each flux site were calculated using the Haversine formula based on their latitude and longitude. The calculation results are shown in Table 3.
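The Haversine calculation can be sketched as follows; the example coordinates are hypothetical, not the actual locations of the flux sites:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees,
    assuming a spherical Earth with the given mean radius."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Hypothetical coordinates for two sites on the Tibetan Plateau
print(round(haversine_km(30.5, 91.0, 31.0, 92.0), 1))
```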


**Table 3.** Distances between the flux sites (km).

Based on the results in Table 3, each target flux site was predicted using the flux sites within 500 km and within 100 km as the training data. There are no flux sites within 100 km of the GL and ZF sites, so these two sites were excluded from the 100 km predictions. The prediction results are shown in Figures 6 and 7. Accuracy at the AR site increases as the range of selected sites decreases, while the DXST, NMC, and ZF sites show decreasing accuracy as the range decreases. The overall accuracy of the DXSW site is lower than when all sites are used for prediction, but it rebounds as the range decreases. According to Table 1, the AR site, which shows an increasing trend, has alpine Kobresia meadow as its vegetation type, while the DXST, NMC, and ZF sites, which show a decreasing trend, have alpine meadow steppe as their vegetation type. We therefore speculate that the prediction accuracy at each site is related to the vegetation types of the other selected sites.

#### 3. Selecting training data according to vegetation type

As shown in Table 1, the training data for each site to be predicted came from the other flux sites with the same vegetation type. The final predictions are shown in Figure 8. Selecting the training data by vegetation type yields higher overall prediction accuracy than selecting it by distance (no prediction results are available for the HBSH site because its vegetation type differs from all the other flux sites). Next, we combined the results of the previous three experiments to examine how selecting training data under different conditions affects prediction accuracy; the combined results are shown in Table 4.

As shown in Table 4, screening the training data by vegetation type is better overall than screening it by site distance. However, at the single-site level, the GL and HBKO sites show decreasing accuracy. For the GL site, this is because the AR and HBKO sites, which share its vegetation type, have significantly less data than the other sites, so the available training data are insufficient. The HBKO site has consistently maintained an R<sup>2</sup> above 0.9, leaving little room for further improvement. We also compared the prediction results obtained without screening the training data and with screening by vegetation type. Although some sites have lower accuracy under screening, the AR site improves markedly; as a result, the overall prediction accuracy of the latter method is not lower than that of the former. This shows that increasing the amount of training data with the same vegetation type can achieve results as good as increasing the overall amount of training data.

**Figure 6.** Predicted GPP vs. the labeled GPP at a single site (500 km).


**Figure 7.** Predicted GPP vs. labeled GPP at a single site (100 km).

**Figure 8.** Predicted GPP vs. labeled GPP at a single site based on vegetation type.


**Table 4.** Results comparing all previous experiments.

Numbers in red indicate that the prediction results for the corresponding site have increased in accuracy compared to the previous results without setting a distance range; numbers in blue indicate that the results have decreased in accuracy; and numbers in black indicate that the results have no significant changes.
