Wheat Yield Estimation Study Using Hyperspectral Vegetation Indices

Wu, Renhong; Fan, Yuqing; Zhang, Liuya; Yuan, Debao; Gao, Guitang

doi:10.3390/app14104245


by Renhong Wu ¹, Yuqing Fan ²,*, Liuya Zhang ², Debao Yuan ² and Guitang Gao ³

¹ National Nuclear Power Planning and Design Institute Co., Ltd., Beijing 100095, China

² College of Geoscience and Surveying Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China

³ National Nuclear Power Planning and Design Research Institute Co., Ltd., Survey Branch, Beijing 100095, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(10), 4245; https://doi.org/10.3390/app14104245

Submission received: 18 April 2024 / Revised: 7 May 2024 / Accepted: 12 May 2024 / Published: 16 May 2024

(This article belongs to the Section Agricultural Science and Technology)


Abstract: Wheat is the main grain crop in China, and the traditional wheat yield estimation method is time-consuming and laborious. By estimating wheat yield efficiently, quickly and non-destructively, agricultural producers can quickly obtain information about wheat yield, manage wheat fields more scientifically and accurately, and ensure national food security. Taking the Xinxiang Experimental Base of the Crop Science Research Institute, Chinese Academy of Agricultural Sciences as an example, hyperspectral data for the critical growth stages of wheat were pre-processed. A total of 27 vegetation indices were calculated from the experimental plots. These indices were then subjected to correlation analysis with measured wheat yield. Vegetation indices with Pearson correlation coefficients greater than 0.5 were selected. Five methods, namely multiple linear regression, stepwise regression, principal component regression, neural networks and random forests, were used to construct wheat yield estimation models. Among these methods, multiple linear regression, stepwise regression and principal component regression showed lower modelling accuracy and validation precision. However, the neural network and random forest methods both achieved a modelling accuracy R² greater than 0.6, with validation accuracy R² values of 0.729 and 0.946, respectively. In addition, the random forest method had a lower cross-validation RMSE of 869.8 kg/hm², indicating a higher model accuracy. In summary, the random forest method provided the best estimation of wheat yield, enabling timely and accurate pre-harvest wheat yield prediction, which has significant value for precision agriculture management and decision making.

Keywords: vegetation index; multiple regression; principal component regression; neural network; random forest

1. Introduction

The wheat industry plays a substantial role in the national economy. According to the China Statistical Yearbook, in recent years, wheat production in the country has accounted for approximately 28% of the total agricultural output. Traditional methods for wheat yield estimation have relied on manual [1,2], time-consuming and costly field sampling, which is susceptible to human error. Therefore, it is imperative to explore efficient and accurate approaches for estimating wheat yield. Hyperspectral remote sensing technology [3] allows for the continuous and rapid acquisition of spectral information related to various growth stages and physiological states of wheat. This technology enables timely and accurate crop [4] monitoring and yield prediction before harvest [5]. It empowers agricultural producers [6,7,8,9] to engage in the scientifically informed management of wheat fields.

In the field of crop yield estimation through remote sensing in China, some scholars have employed vegetation indices [10,11] as remote sensing feature parameters to build yield estimation models, while others have combined crop growth conditions [12,13] and physiological indicators. Although these studies have shown promising results for yield estimation, there is still room to optimize approaches for constructing wheat yield estimation models [12]. Using winter wheat in Xinxiang, Henan Province as a case study, this study pre-processed hyperspectral data collected during the wheat grain-filling stage [14] and calculated vegetation indices [15]. Correlation analysis was conducted between these indices and yield [16] to select the most suitable ones. Five modeling methods, including multiple linear regression (MLR) [17], stepwise multiple linear regression (SMLR), principal component analysis (PCA), artificial neural networks (ANN) and random forests (RF), were employed [14,18,19]. The study used 2/3 of the data (20 plots) for model development and reserved the remaining 1/3 (10 plots) for model validation to assess the accuracy of the wheat yield estimation models [20,21,22]. A comprehensive analysis was performed to evaluate the modeling and validation precision of each method and determine the most effective model for wheat yield estimation.

2. Materials and Methods

2.1. Study Area and Data

As depicted in Figure 1, the experimental data comprise hyperspectral canopy measurements of wheat, collected in mid-May 2020 at the Xinxiang Experimental Base of the Crop Science Research Institute, Chinese Academy of Agricultural Sciences (located at 113°51′ E, 35°18′ N). The study site benefits from a favorable agricultural climate, characterized by synchronized sunlight, temperature and water resources, making it conducive to wheat growth. The experimental plots (a total of 30 plots) were randomly designed [23,24,25,26], ensuring uniform management practices for wheat growth. Particular attention was given to disease and pest control, as well as weed removal.

The ASD FieldSpec4 is a portable spectral scanning hyperspectral imager from ASD (Analytical Spectral Devices, Longmont, CO, USA), which obtains hyperspectral information by scanning a light source such as a laser or optical fiber point by point in one dimension. Its performance indicators are shown in Table 1. The ASD FieldSpec4 was used to collect hyperspectral data from the wheat canopy during the filling phase. The spectral range covered an effective range of 350–2500 nm, with sampling intervals of 1.4 nm (350–1000 nm) and 2 nm (1001–2500 nm), subsequently resampled at 1 nm intervals. The spectrometer had a field of view of 25°. Data acquisition was carried out under clear-sky conditions with favorable lighting, typically between 10:00 a.m. and 2:00 p.m. local time. Acquisition on cloudy days can degrade spectral signal quality, blur spectral characteristics and increase shadow effects, all of which reduce the accuracy of the data, so cloudy days were avoided. Thirty experimental plots were selected for data collection. During data acquisition, the sensor was positioned vertically, approximately 1 m above the canopy. Five points within each plot were sampled evenly, with each point measured ten times. The average of these ten measurements for each point was computed as the representative spectral reflectance value for that specific area. Following the collection of data from five plots, an instrument calibration was performed using a diffuse reflectance standard white panel. Finally, the acquired hyperspectral data were preprocessed using the ViewSpecPro Version 5.6 software. Reflectance was calculated as follows:

R(\lambda) = \frac{I_{\mathrm{sample}}(\lambda)}{I_{\mathrm{white\,reference}}(\lambda)} \times 100\%

(1)

where I_{\mathrm{sample}}(\lambda) is the intensity of reflected light from the sample object at a specific wavelength, and I_{\mathrm{white\,reference}}(\lambda) is the intensity of reflected light from the white reference object at the same wavelength.
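As a minimal sketch of Equation (1) and of the per-point averaging described above, the following Python snippet may help. The function names and the intensity values are hypothetical and are not taken from the study, which carried out this step in ViewSpecPro.

```python
from statistics import fmean


def mean_spectrum(scans):
    """Average repeated scans of one sampling point, band by band."""
    return [fmean(band) for band in zip(*scans)]


def reflectance_percent(sample, white_reference):
    """Apply Eq. (1) per wavelength: R = I_sample / I_white_reference * 100."""
    if len(sample) != len(white_reference):
        raise ValueError("spectra must cover the same wavelengths")
    return [100.0 * s / w for s, w in zip(sample, white_reference)]


# Made-up intensities at three wavelengths, two repeated scans of one point.
scans = [[120.0, 300.0, 450.0], [124.0, 296.0, 454.0]]
white = [1000.0, 1000.0, 1000.0]

point_reflectance = reflectance_percent(mean_spectrum(scans), white)
```

In the field protocol each point was scanned ten times; two scans are used here only to keep the example short.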

Once the wheat reached maturity, the harvest process for each plot was initiated. Wheat from each plot was harvested separately, bagged, air-dried and weighed, and the yield of wheat in each individual plot was determined. After comprehensive consideration of various factors, a group of 27 vegetation indices was selected [24,27] in the wavelength range of 350–2500 nm to estimate wheat yield [28]. The chosen vegetation indices included the Normalized Difference Infrared Index, Enhanced Vegetation Index-2, Difference Vegetation Index-1, Green Atmospherically Resistant Index, Normalized Moisture Index, Normalized Red-Edge Spectral Index, Vegetation Dryness Index, Normalized Substance Index, Modified Red-Edge Ratio Index and Simple Ratio Index, among others. A list of the selected vegetation indices is provided in Table 2.
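Many of the indices named above share a normalized-difference form. The sketch below shows that form in Python. The band positions are assumptions chosen for illustration (Table 2 gives the definitions actually used), and the reflectance values are made up.

```python
def normalized_difference(reflectance, band_a, band_b):
    """Return (R_a - R_b) / (R_a + R_b) for two wavelengths given in nm."""
    r_a, r_b = reflectance[band_a], reflectance[band_b]
    return (r_a - r_b) / (r_a + r_b)


# Hypothetical canopy reflectance (fraction, 0-1) keyed by wavelength in nm.
spectrum = {670: 0.05, 800: 0.45, 860: 0.48, 1240: 0.32}

# Assumed band choices: a red/near-infrared pair and a near-infrared/
# shortwave-infrared pair of the kind used by water-sensitive indices.
ndvi_like = normalized_difference(spectrum, 800, 670)
ndwi_like = normalized_difference(spectrum, 860, 1240)
```

Because the spectra were resampled to 1 nm, any pair of wavelengths between 350 and 2500 nm could be looked up in this way.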

2.2. Methods

Multiple Linear Regression (MLR) analysis refers to a statistical method where one variable is considered the dependent variable, while one or more other variables are treated as independent variables within a set of correlated variables. It involves establishing a linear mathematical model that quantifies the relationships among these variables and uses sample data for analysis. On the other hand, stepwise multiple regression, after introducing new variables, systematically examines existing variables and eliminates any that prove to be statistically insignificant. This approach can be employed to enhance the precision of the wheat yield estimation model.

The main concept behind Principal Component Analysis (PCA) is dimensionality reduction. It achieves this by orthogonal transformation, converting correlated components of the original random vectors into new, uncorrelated random vectors. Algebraically, this transformation results in the covariance matrix becoming diagonal, while geometrically, it represents the conversion of the original coordinate system into a new orthogonal coordinate system, aligning with the p orthogonal directions that capture the maximum spread of the data points. Subsequently, PCA reduces the dimensionality of the multivariate data system while preserving a high degree of precision. It further simplifies the system into a one-dimensional representation through the construction of a value function. The central challenge in PCA is to obtain the projection matrix. Similar to other machine learning algorithms, PCA generalizes from vector space to one-dimensional space and then extends to the general case. One of the primary advantages of Principal Component Regression lies in the fact that the principal components are uncorrelated, thus mitigating issues of information overlap. In the research process, selecting an appropriate number of principal components for model fitting can enhance the precision of the wheat yield estimation model.

Neural networks have rapidly advanced in the field of big data and intelligent computing in recent years. They are distinguished by their capabilities for autonomous learning, adaptation and distributed storage of information. Consequently, they have found extensive applications in areas such as intelligent control, identification and monitoring. The primary approach utilized for building the wheat yield estimation model is Artificial Neural Networks (ANN), specifically the Multi-Layer Perceptron (MLP). The MLP is designed to process data by receiving inputs in the input layer and executing the wheat yield estimation task through the output layer. Neurons within the MLP can employ various activation functions and undergo training and learning algorithms via backpropagation. In an MLP, each linear combination is propagated to the next layer, with each layer providing its computed results for the subsequent layer. The backpropagation process enables the MLP to iteratively adjust the weights in the network until it achieves the weights that minimize the cost function, thereby enhancing the precision of the wheat yield estimation model.

Random Forest (RF) is a bagging ensemble method based on a collection of decision trees. To train Nc trees, Nc bootstrap samples are drawn from the original dataset with replacement, one for each tree. In contrast to a single decision tree, the splits in a Random Forest involve randomness. Rather than searching all features for the optimal choice at each split, the Random Forest considers a random subset of features (typically, the size of the subset is calculated using sqrt(·) or log(·) of the number of features) and looks for the threshold within that subset that best separates the data. Ultimately, many trees are trained in a somewhat weak manner, and each tree produces different predictions. Each tree focuses on a part of the sample space and may predict inaccurately in other regions. However, scikit-learn averages the results of the individual trees, which gives accurate predictions overall. Although implementations differ in how the individual predictions are combined, the differences in the averaged predictions are generally minimal. Therefore, this method often yields reliable results.
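The three ingredients described above (bootstrap samples, random feature subsets and averaging) can be sketched in pure Python. This is a deliberately simplified illustration that uses one-split trees (stumps). It is not scikit-learn's RandomForestRegressor and not the model built in this study, and all names and data are hypothetical.

```python
import random
from statistics import fmean


def fit_stump(rows, targets, features):
    """Pick the single threshold split with the lowest squared error."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= thr]
            right = [t for row, t in zip(rows, targets) if row[f] > thr]
            l_mean, r_mean = fmean(left), fmean(right)
            sse = (sum((t - l_mean) ** 2 for t in left)
                   + sum((t - r_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, l_mean, r_mean)
    if best is None:  # every candidate feature is constant in this sample
        mean = fmean(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def fit_forest(rows, targets, n_trees=50, n_features=1, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feats = rng.sample(range(p), n_features)    # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps."""
    return fmean(l if row[f] <= thr else r for f, thr, l, r in forest)


# Made-up data: the target depends on feature 0 only.
rows = [[0.1, 5.0], [0.2, 3.0], [0.3, 4.0], [0.7, 5.0], [0.8, 3.0], [0.9, 4.0]]
targets = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
forest = fit_forest(rows, targets)
low = predict(forest, [0.15, 4.0])
high = predict(forest, [0.85, 4.0])
```

Individual stumps that happen to draw the uninformative feature predict poorly, yet the averaged prediction still separates the two groups, which is the effect the paragraph describes.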

3. Results

3.1. Modeling Analysis and Model Evaluation

The IBM SPSS Statistics 23 software was used to construct the wheat yield estimation models based on 2/3 of the experimental data collected during the critical growth stage of wheat. This subset consisted of 20 randomly selected plots out of the 30 experimental plots. Five different methods, namely Multiple Linear Regression (MLR), Stepwise Multiple Regression (SMLR), Principal Component Analysis (PCA), Artificial Neural Networks (ANN) and Random Forest (RF), were employed to model the dependent variable (yield) from the selected independent variables (vegetation indices). The remaining 1/3 of the experimental data (the other 10 of the 30 experimental plots) was used to evaluate the accuracy of the wheat yield estimation models. This evaluation aimed to assess the stability and precision of each model in estimating wheat yield.
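A random 20/10 division of the 30 plots can be sketched as follows. The plot numbering, the seed and the function name are assumptions made for illustration. The study does not report how its random selection was drawn.

```python
import random


def split_plots(plot_ids, n_train, seed=0):
    """Shuffle the plot identifiers and cut them into two groups."""
    rng = random.Random(seed)
    shuffled = list(plot_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# Plots are assumed to be numbered 1..30.
train_plots, test_plots = split_plots(range(1, 31), n_train=20)
```

Fixing the seed makes the split reproducible, which matters with so few plots because different splits can give noticeably different validation scores.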

In the assessment of yield estimation model accuracy, three evaluation metrics are introduced: the coefficient of determination (R²), root mean square error (RMSE) and mean square error (MSE). These metrics are used to evaluate the precision of yield estimation models based on real yield measurements and predicted values. R² reflects the goodness of fit of the yield estimation model, with a higher value indicating a better fit. RMSE measures the dispersion between observed and predicted values and is particularly sensitive to outliers, making it valuable for assessing the accuracy of predicting both relatively low and high yields. Generally, higher R-squared values and lower values for RMSE and MSE indicate a higher model accuracy. The formulas for calculating these three evaluation metrics are as follows:

R^{2} = 1 - \frac{\sum_{i=1}^{n} (x_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}

(2)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - x_{i})^{2}}

(3)

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - x_{i})^{2}

(4)

where x_{i} is the measured yield of wheat, \bar{x} is the average measured yield of wheat, y_{i} is the wheat yield predicted by the model, and n is the number of data points.
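Equations (2) to (4) translate directly into code. The sketch below follows the same symbols (x for measured yield, y for predicted yield). The yield values are hypothetical and serve only to exercise the functions.

```python
from math import sqrt


def mse(measured, predicted):
    """Eq. (4): mean of the squared differences between y_i and x_i."""
    return sum((y - x) ** 2 for x, y in zip(measured, predicted)) / len(measured)


def rmse(measured, predicted):
    """Eq. (3): square root of the mean square error."""
    return sqrt(mse(measured, predicted))


def r_squared(measured, predicted):
    """Eq. (2): one minus residual sum of squares over total sum of squares."""
    mean_x = sum(measured) / len(measured)
    ss_res = sum((x - y) ** 2 for x, y in zip(measured, predicted))
    ss_tot = sum((x - mean_x) ** 2 for x in measured)
    return 1.0 - ss_res / ss_tot


# Made-up yields in kg/hm^2 for three plots.
measured = [6000.0, 7000.0, 8000.0]
predicted = [6100.0, 6900.0, 8200.0]

model_mse = mse(measured, predicted)
model_rmse = rmse(measured, predicted)
model_r2 = r_squared(measured, predicted)
```

RMSE carries the unit of the yield itself, whereas MSE is in squared units, which is why RMSE is usually the easier of the two to interpret.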

The model constructed using the Random Forest regression method can be evaluated using three key metrics. Firstly, the coefficient of determination (R²), which, when its value exceeds 0.6 but remains less than 1.0, signifies a higher level of model accuracy and yields a relatively ideal estimation performance. Secondly, the root mean square error (RMSE) can be used. A smaller RMSE value is indicative of heightened model accuracy. This metric gauges the degree of dispersion between observed and predicted values, revealing how closely the predictions align with the actual values. Lastly, the mean square error (MSE) is frequently selected as an accuracy evaluation metric. A smaller MSE indicates superior model precision. It quantifies the dispersion between observed and predicted values, serving as a measure of accuracy. In unison, these three metrics provide a comprehensive assessment of the model’s performance, aiding in the determination of its suitability for the specific task.

3.2. Correlation Analysis between Vegetation Index and Yield

The vegetation indices constructed from the canopy-level hyperspectral data during the wheat milk filling stage were subjected to a correlation analysis with the actual wheat yield for each plot, using the SPSS software. Among these indices, twelve exhibited correlation coefficients exceeding 0.5, as presented in Table 3. Most of them were significantly correlated with yield at the 0.005 significance level, while NDVIg, NVI, NDRSR and GNDVI were significantly correlated at the 0.02 significance level. Notably, NDII, NDWI, NDMI and VDI all had correlation coefficients exceeding 0.6, with NDWI showing the highest correlation with yield (0.705). NDRE had the lowest correlation coefficient among the selected indices (0.521).
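The screening step can be sketched as a Pearson correlation followed by a threshold filter. The index names and values below are hypothetical. Whether the study compared the signed coefficient or its absolute value against 0.5 is not stated, so the signed form used here is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_indices(index_table, yields, threshold=0.5):
    """Keep the indices whose correlation with yield exceeds the threshold."""
    scores = {name: pearson_r(vals, yields) for name, vals in index_table.items()}
    return {name: r for name, r in scores.items() if r > threshold}


# Made-up per-plot values for four hypothetical indices and the plot yields.
yields = [1.0, 2.0, 3.0, 4.0]
index_table = {
    "index_a": [1.0, 2.0, 3.0, 4.0],
    "index_b": [4.0, 3.0, 2.0, 1.0],
    "index_c": [1.0, 3.0, 2.0, 4.0],
    "index_d": [2.0, 1.0, 2.0, 1.0],
}
selected = select_indices(index_table, yields)
```

Replacing `r > threshold` with `abs(r) > threshold` would also retain strongly negative correlations, such as index_b in this example.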

3.3. Multiple Methods to Build Models Based on Vegetation Index

For the construction of wheat yield estimation models, Multiple Linear Regression, Stepwise Multiple Regression, Principal Component Analysis and Neural Network methods were implemented using the SPSS software. On the other hand, the Random Forest method was implemented using the Python programming language. This combination of tools allowed for a comprehensive analysis and modeling approach to estimate wheat yields.

3.3.1. Multiple Linear Regression Method

The selected vegetation indices with Pearson correlation coefficients exceeding 0.5 were utilized as independent variables, and yield was used as the dependent variable in the SPSS software. The Multiple Linear Regression method was employed to construct the wheat yield estimation model. The resulting model achieved a correlation coefficient of 0.835, and the significance (F) was 0.03, meeting the requirements for conducting regression analysis. This indicates a strong relationship between the selected vegetation indices and wheat yield, suggesting the model’s effectiveness in estimating wheat production.

3.3.2. Multiple Stepwise Regression Method

The Stepwise Multiple Regression method, similar to Multiple Linear Regression, is used to select vegetation indices without multicollinearity. In the final selection, NDWI and mSRI-2 were the two variables included in the regression model. The results indicate that the first model was built on NDWI and yield, while the second model involved NDWI, mSRI-2, and yield. It’s worth noting that the first model, with an R² value of 0.497, exhibited a lower accuracy compared to the second model, with an R² value of 0.612. In the subsequent evaluation of yield estimation model accuracy, the first model can be excluded, and an accuracy assessment can be performed using the second yield estimation model. This selection process ensures a more accurate estimation of wheat yield.

3.3.3. Principal Component Analysis Method

Principal Component Analysis (PCA) extracted three principal components, and when these three components were used for both Multiple Linear Regression and Stepwise Multiple Regression, the accuracy of the yield estimation models was found to be less than 0.6. This lower accuracy suggests that the models had limited predictive power and yielded less than ideal results. It may be necessary to explore alternative modeling techniques or consider additional variables to improve the accuracy of the yield estimation models.

3.3.4. Neural Network Method

In the SPSS software, 76.7% of the data was used as training data, while 23.3% was allocated for validation during the regression modeling process. The resulting model achieved a modeling accuracy of 0.743, which surpasses the threshold of 0.6, indicating a relatively high-quality model with favorable accuracy. To evaluate the generalization ability of the model more comprehensively, K-fold cross-validation was also performed, giving an accuracy of 0.702. This indicates that the model generalizes well and is suitable for estimating wheat yield.
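K-fold cross-validation partitions the samples into K folds and uses each fold once for validation while training on the rest. A minimal index generator is sketched below. The value K = 5 is an assumption, since the number of folds is not reported, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


# 30 plots split into an assumed 5 folds of 6 plots each.
folds = list(k_fold_indices(30, 5))
```

The reported cross-validation accuracy would then be the average of the score obtained on each validation fold. In practice the samples are usually shuffled before the folds are cut.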

3.3.5. Random Forest Method

Using the Python programming language in PyCharm, the Random Forest method was employed for modeling. After importing the necessary libraries (numpy, pandas, matplotlib and sklearn), the experimental data were loaded. The model was established with two kinds of variables, yield and vegetation indices. The data were split, with 70% allocated for training and 30% for testing. A Random Forest regressor was employed to build the regression model, and various evaluation parameters were set. Once the regression model was constructed, the predicted values of yield were compared to the actual values through a linear regression analysis, and a scatter plot was generated to visualize the relationship. Various evaluation parameters were also computed to assess the model's accuracy. The final result demonstrated that the model achieved a validation accuracy of 0.935 and a K-fold cross-validation accuracy of 0.897, indicating a highly accurate and effective model.

3.3.6. Computational Complexity and Noise Robustness of the Model

When constructing the yield estimation model, the multiple stepwise regression method has high computational complexity because it needs to evaluate multiple variable combinations, and it is relatively sensitive to noise in the data. The multiple linear regression method has low computational complexity, fits the linear relationship directly, and has good robustness to slight noise. Principal component analysis reduces the dimensionality of high-dimensional data with low computational complexity and has a certain noise robustness. Although neural networks have high computational complexity, their nonlinear fitting ability and flexibility of parameter adjustment give them a certain degree of noise robustness. Random forest has low computational complexity due to the integration of multiple decision trees, and has strong robustness to light noise. The computational complexity and noise robustness of the different methods are summarized in Table 4.

3.4. Modeling Accuracy Evaluation

The validation accuracy for Multiple Linear Regression, Stepwise Multiple Regression and Principal Component Regression was 0.199, 0.398 and 0.422, respectively, all of which were less than 0.6, and the p-values were relatively high, indicating less than ideal yield estimation results. On the other hand, the validation accuracy for the Neural Network and Random Forest methods was 0.729 and 0.946, respectively, with corresponding p-values of 0.2143 and 0.1405. As shown in Table 5, the model constructed using the Random Forest method exhibited the highest accuracy, yielding the most favorable experimental results.

4. Discussion and Conclusions

The primary focus of this research was to investigate the relationship between wheat canopy hyperspectral data during the crucial growth stage and the subsequent yield of mature wheat. Using hyperspectral data from the wheat grain-filling stage, 27 vegetation indices were calculated and correlated with the measured yield in the experimental area. The selected vegetation indices exhibited significant correlations at the 0.01 significance level, with correlation coefficients exceeding 0.5. Based on different water treatment tests, Xiao Lujie [23] measured the spectral reflectance of the canopy during the key growth period of winter wheat and calculated 29 hyperspectral vegetation indices. Winter wheat yield estimation models based on single vegetation indices and on combinations of multiple vegetation indices were established, and their accuracy was high. Those results showed that preferred vegetation indices can be used to construct wheat yield estimation models. Utilizing the selected superior vegetation indices, five different methods were employed to establish wheat yield prediction models: Multiple Linear Regression (MLR), Stepwise Multiple Regression (SMLR), Principal Component Analysis (PCA), Artificial Neural Network (ANN) and Random Forest (RF).

The Multiple Linear Regression model yielded an R² value of 0.835, and the second model established with Stepwise Multiple Linear Regression achieved an R² value of 0.612. However, when evaluating the better of the two methods, the R² value was only 0.199. This shows that multiple linear regression and multiple stepwise regression are less effective than machine learning models in wheat yield prediction, which is consistent with the research results of Zhao Xin et al. [20]. Because a general linear model is relatively simple, it struggles to model nonlinear data or data whose features are correlated with one another, and it cannot represent highly complex data well, so its fitting accuracy is low.

Despite Principal Component Analysis extracting three superior principal components, the accuracy of models built using these components remained relatively low, with the highest R² value being 0.398. This is consistent with the conclusion of Hong Xue [25]. For data with a nonlinear relationship, principal component analysis may not be able to accurately extract the main features. Their information may be masked by other variables, so that the model remains less accurate even after this method is applied. The specific reasons for this merit further exploration. In contrast, the Artificial Neural Network method achieved a more satisfactory modeling accuracy with an R² value of 0.729 and a K-fold cross-validation accuracy of 0.702. The Random Forest method, implemented using Python and relevant packages, resulted in the highest modeling accuracy with an R² value of 0.946 and a K-fold cross-validation accuracy of 0.897. The cross-validation produced an RMSE value of 869.8 kg/hm², signifying a highly accurate model and an ideal yield estimation effect. This is consistent with the research results of Hua Qilong [29].

The use of Principal Component Analysis resulted in a moderate improvement in the accuracy of the yield estimation models derived from Multiple Linear Regression and Stepwise Multiple Linear Regression when compared to models without Principal Component Analysis. However, the overall results remained suboptimal. In contrast, models built using the neural network and Random Forest methods achieved significantly improved accuracy, with Random Forest proving to be the most effective in yield estimation. This is consistent with the conclusion of Zhao Xin [30].

The high correlation between four of the vegetation indices (NDII, NDWI, NDMI and VDI) and wheat yield is mainly due to their ability to reflect the moisture status of soil and vegetation. High NDII [27] values reflect high vegetation moisture content, while NDWI, NDMI and VDI [24] reflect soil and vegetation moisture status. This is consistent with the research results of Fei Shuaipeng [27]. Since water is a key factor affecting wheat growth and yield, these indices show a high correlation with wheat yield. The low correlation between the vegetation index NDRE and wheat yield may be due to growth stage, soil type, climatic conditions, the accuracy of the remote sensing data and other environmental factors, which limit its predictive ability and thus reduce its correlation with wheat yield. Future studies can explore new vegetation indices, adjust the modeling algorithm or use emerging remote sensing technology to improve the accuracy of the wheat yield estimation model.

5. Limitations and Future Work

Traditional wheat yield estimation methods often yield less than ideal precision. In comparison, our study demonstrates that employing different modeling approaches can lead to higher accuracy in wheat yield estimation, particularly when using the Random Forest method. The approach has several key advantages. Firstly, data acquisition is easy, with straightforward and user-friendly operations. This makes it well suited to estimating wheat yield in smaller areas, with reduced labor and resource expenditures. Secondly, our use of Principal Component Analysis to extract key components, combined with a stepwise regression approach, not only enhances efficiency but also improves modeling precision. Thirdly, neural networks and Random Forest algorithms have shown remarkable compatibility with hyperspectral data, leading to more accurate yield estimations. Finally, our approach allows for flexibility in selecting the target estimation areas, ensuring high precision. The timely and accurate estimation of wheat yield before harvesting empowers agricultural producers by providing them with the information needed for effective and scientific field management. The method provided in this article, with its enhanced precision and efficiency, underscores the potential for advancing wheat yield estimation and optimizing agricultural practices.

The accuracy indicators of the model are discussed, but outliers and their effects on the results are not analyzed in depth. Future research will focus on the effects of outliers on the modeling results, as well as the effects of specific observations or field conditions on the model. Less accurate models such as multiple linear regression and principal component analysis have been discussed, but the potential value of these models in practical applications has not been explored, and future research will examine whether specific conditions exist under which their performance improves. In the future, the growth stage at which data are collected, the daily management of crops, the frequency of data collection and the influence of changes in the vegetation indices over time on yield estimation will be analyzed, new vegetation indices will be explored and modeling algorithms will be adjusted. This will help to understand in depth how environmental variables such as climatic conditions and soil characteristics influence the modeling accuracy of yield estimation during the whole growth period of wheat. This paper compares the performance of different modeling methods, with random forest and neural network methods performing better, which may be related to the nonlinear nature of the data or to the ability of different modeling methods to capture the correlation between vegetation indices and wheat yield. Future studies will explore these aspects in depth to improve understanding of the differences in modeling method performance and provide more targeted recommendations for model selection. To improve the usefulness of the study, these models will be applied to the day-to-day management of crops in future studies, including the frequency of data collection and integration with other agricultural management practices. This will increase the practical impact of the research and improve its operability.

In our study, the use of portable hyperspectral instruments for large-scale wheat yield estimation requires significant human and material resources. Additionally, the selection of only 30 wheat plots resulted in a relatively small dataset, which may not be entirely representative. Therefore, the results obtained with the neural network and Random Forest methods have certain limitations. Furthermore, there is room for improvement in terms of wheat variety selection, sample size and ensuring the comparability of samples from different regions. Hence, future research should focus on refining the methods to make the wheat yield estimation models more accurate, efficient and practical. This could involve optimizing data collection processes, increasing sample sizes and ensuring a more comprehensive representation of different wheat varieties and geographic regions. Addressing these aspects would make the models more suitable for widespread application.

Author Contributions

R.W.: Methodology, formal analysis, writing—original draft, visualization. Y.F.: Methodology, software, writing—original draft. L.Z.: Methodology, data curation, writing—original draft. G.G.: Data curation, writing—review and editing, formal analysis. D.Y.: Supervision, writing—review and editing, funding acquisition, data curation, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (No. 52174160).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

Author Renhong Wu was employed by the company Nuclear Power Planning and Design Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Bognár, P.; Kern, A.; Pásztor, S.; Lichtenberger, J.; Koronczay, D.; Ferencz, C. Yield Estimation and Forecasting for Winter Wheat in Hungary Using Time Series of MODIS Data. Int. J. Remote Sens. 2017, 38, 3394–3414.
Han, X.; Wei, Z.; Chen, H.; Zhang, B.; Li, Y.; Du, T. Inversion of Winter Wheat Growth Parameters and Yield Under Different Water Treatments Based on UAV Multispectral Remote Sensing. Front. Plant Sci. 2021, 12, 609876.
Yang, S.; Hu, L.; Wu, H.; Fan, W.; Ren, H. Estimation Model of Winter Wheat Yield Based on Uav Hyperspectral Data. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7212–7215.
Yang, Q.; Shi, L.; Han, J.; Zha, Y.; Zhu, P. Deep Convolutional Neural Networks for Rice Grain Yield Estimation at the Ripening Stage Using UAV-Based Remotely Sensed Images. Field Crops Res. 2019, 235, 142–153.
Reza, M.N.; Na, I.S.; Baek, S.W.; Lee, K.-H. Rice Yield Estimation Based on K-Means Clustering with Graph-Cut Segmentation Using Low-Altitude UAV Images. Biosyst. Eng. 2019, 177, 109–121.
Zhuo, W.; Fang, S.; Wu, D.; Wang, L.; Li, M.; Zhang, J.; Gao, X. Integrating Remotely Sensed Water Stress Factor with a Crop Growth Model for Winter Wheat Yield Estimation in the North China Plain during 2008–2018. Crop J. 2022, 10, 1470–1482.
Sun, Y.; Zhang, S.; Tao, F.; Aboelenein, R.; Amer, A. Improving Winter Wheat Yield Forecasting Based on Multi-Source Data and Machine Learning. Agriculture 2022, 12, 571.
Li, Y.; Ren, Y.Z.; Gao, W.L.; Tao, S.; Jia, J.D.; Liu, X.L. Analysis of Influencing Factors on Winter Wheat Yield Estimations Based on a Multisource Remote Sensing Data Fusion. Appl. Eng. Agric. 2021, 37, 991–1003.
Zhang, P.-P.; Zhou, X.-X.; Wang, Z.-X.; Mao, W.; Li, W.-X.; Yun, F.; Guo, W.-S.; Tan, C.-W. Using HJ-CCD Image and PLS Algorithm to Estimate the Yield of Field-Grown Winter Wheat. Sci. Rep. 2020, 10, 5173.
Jin, N.; Tao, B.; Ren, W.; He, L.; Zhang, D.; Wang, D.; Yu, Q. Assimilating Remote Sensing Data into a Crop Model Improves Winter Wheat Yield Estimation Based on Regional Irrigation Data. Agric. Water Manag. 2022, 266, 107583.
Ji, Y.; Chen, Z.; Cheng, Q.; Liu, R.; Li, M.; Yan, X.; Li, G.; Wang, D.; Fu, L.; Ma, Y.; et al. Estimation of Plant Height and Yield Based on UAV Imagery in Faba Bean (Vicia faba L.). Plant Methods 2022, 18, 26.
Aranguren, M.; Castellón, A.; Aizpurua, A. Wheat Yield Estimation with NDVI Values Using a Proximal Sensing Tool. Remote Sens. 2020, 12, 2749.
Xu, W.; Chen, P.; Zhan, Y.; Chen, S.; Zhang, L.; Lan, Y. Cotton Yield Estimation Model Based on Machine Learning Using Time Series UAV Remote Sensing Data. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 102511.
Wang, F.; Wang, F.; Hu, J.; Xie, L.; Yao, X. Rice Yield Estimation Based on an NPP Model With a Changing Harvest Index. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2953–2959.
Wang, F.; Yao, X.; Xie, L.; Zheng, J.; Xu, T. Rice Yield Estimation Based on Vegetation Index and Florescence Spectral Information from UAV Hyperspectral Remote Sensing. Remote Sens. 2021, 13, 3390.
Ge, H.; Ma, F.; Li, Z.; Du, C. Grain Yield Estimation in Rice Breeding Using Phenological Data and Vegetation Indices Derived from UAV Images. Agronomy 2021, 11, 2439.
Zhao, Y.; Han, S.; Meng, Y.; Feng, H.; Li, Z.; Chen, J.; Song, X.; Zhu, Y.; Yang, G. Transfer-Learning-Based Approach for Yield Prediction of Winter Wheat from Planet Data and SAFY Model. Remote Sens. 2022, 14, 5474.
Liu, Z.; Xu, Z.; Bi, R.; Wang, C.; He, P.; Jing, Y.; Yang, W. Estimation of Winter Wheat Yield in Arid and Semiarid Regions Based on Assimilated Multi-Source Sentinel Data and the CERES-Wheat Model. Sensors 2021, 21, 1247.
Duan, B.; Fang, S.; Zhu, R.; Wu, X.; Wang, S.; Gong, Y.; Peng, Y. Remote Estimation of Rice Yield With Unmanned Aerial Vehicle (UAV) Data and Spectral Mixture Analysis. Front. Plant Sci. 2019, 10, 204.
Fu, Z.; Jiang, J.; Gao, Y.; Krienke, B.; Wang, M.; Zhong, K.; Cao, Q.; Tian, Y.; Zhu, Y.; Cao, W.; et al. Wheat Growth Monitoring and Yield Estimation Based on Multi-Rotor Unmanned Aerial Vehicle. Remote Sens. 2020, 12, 508.
Shen, Z.; Odening, M.; Okhrin, O. Adaptive Local Parametric Estimation of Crop Yields: Implications for Crop Insurance Rate Making. Eur. Rev. Agric. Econ. 2018, 45, 173–203.
Song, R.; Cheng, T.; Yao, X.; Tian, Y.; Zhu, Y.; Cao, W. Evaluation of Landsat 8 Time Series Image Stacks for Predicitng Yield and Yield Components of Winter Wheat. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 6300–6303.
Xiao, L.; Yang, W.; Feng, M.; Sun, H.; Wang, C. Winter wheat yield estimation model based on hyperspectral vegetation index. Chin. J. Ecol. 2022, 41, 1433–1440.
Xiao, L. Hyperspectral Remote Sensing Monitoring of Winter Wheat Growth, Physiology and Yield under Drought Stress. Ph.D. Thesis, Shanxi Agricultural University, Jinzhong, China, 2019.
Hong, X. Study on Vegetation Index Yield Model Based on Hyperspectral Remote Sensing Data of Rice. Ph.D. Thesis, Shenyang Agricultural University, Shenyang, China, 2017.
Wang, D. Rice Yield Estimation by Hyperspectral and Multispectral Remote Sensing. Ph.D. Thesis, Wuhan University, Wuhan, China, 2017.
Fei, S.; Yu, X.; Lan, M.; Li, L.; Xia, X.; He, Z.; Xiao, Y. Winter wheat yield estimation based on hyperspectral remote sensing and ensemble learning. Sci. Agric. Sin. 2021, 54, 3417–3427.
Korohou, T.; Okinda, C.; Li, H.; Cao, Y.; Nyalala, I.; Huo, L.; Potcho, M.; Li, X.; Ding, Q. Wheat Grain Yield Estimation Based on Image Morphological Properties and Wheat Biomass. J. Sens. 2020, 2020, 1571936.
Hua, Q. Winter Wheat Yield Estimation Model Based on Machine Learning Research. Ph.D. Thesis, Northwest Agriculture and Forestry University of Science and Technology, Xianyang, China, 2023.
Zhao, X. Research on Wheat Yield Inversion Based on UAV Image Analysis by Machine Learning Algorithm. Ph.D. Thesis, Anhui University, Hefei, China, 2020.

Figure 1. Location of the study area.

Table 1. Performance metrics for ASD FieldSpec4.

Property	Value
Wavelength Range	350–2500 nm
Spectral Resolution	3 nm @ 700 nm; 6 nm @ 1400/2100 nm
Sampling Interval	1.4 nm @ 350–1000 nm; 2 nm @ 1001–2500 nm
Scanning Time	100 ms
Number of Channels	2151
Maximum radiation	VNIR 2 times the sun, SWIR 10 times the sun

Table 2. Vegetation indices selected in this study.

Vegetation Index	Calculation Formula
Normalized Difference Infrared Index (NDII)	$(R_{819} - R_{1649}) / (R_{819} + R_{1649})$
Normalized Difference Vegetation Index-1 (NDVI-1)	$(R_{800} - R_{670}) / (R_{800} + R_{670})$
Normalized Difference Vegetation Index-2 (NDVI-2)	$(R_{810} - R_{680}) / (R_{810} + R_{680})$
Modified Simple Ratio Index-1 (mSRI-1)	$(R_{750} - R_{445}) / (R_{705} - R_{445})$
Modified Simple Ratio Index-2 (mSRI-2)	$(R_{750} - R_{445}) / (R_{750} + R_{445})$
Enhanced Vegetation Index-2 (EVI-2)	$2.5 \times (R_{824} - R_{651}) / (1 + R_{824} + 2.4 \times R_{651})$
Difference Vegetation Index-1 (DVI-1)	$R_{824} - R_{651}$
Difference Vegetation Index-2 (DVI-2)	$R_{800} - R_{670}$
Difference Vegetation Index-3 (DVI-3)	$R_{810} - R_{680}$
New Vegetation Index (NVI)	$R_{810} / R_{560}$
Atmospherically Resistant Vegetation Index (ARVI)	$(R_{872} - (R_{661} - (R_{488} - R_{661}))) / (R_{872} + (R_{661} - (R_{488} - R_{661})))$
Green Atmospherically Resistant Vegetation Index (GARI)	$(R_{872} - (R_{559} - (R_{488} - R_{661}))) / (R_{872} + (R_{559} - (R_{488} - R_{661})))$
Normalized Difference Green Index (NDVIg)	$(R_{750} - R_{550}) / (R_{750} + R_{550})$
Normalized Difference Red-Edge Simple Ratio (NDRSR)	$(R_{872} - R_{712}) / (R_{872} + R_{712})$
Normalized Difference Water Index (NDWI)	$(R_{872} - R_{1245}) / (R_{872} + R_{1245})$
Normalized Difference Red Edge (NDRE)	$(R_{790} - R_{720}) / (R_{790} + R_{720})$
Modified Normalized Difference Vegetation Index (MNDVI)	$(R_{750} - R_{705}) / (R_{750} + R_{705} - 2 \times R_{445})$
Vegetation Dryness Index (VDI)	$(R_{970} - R_{900}) / (R_{970} + R_{700})$
Normalized Difference Moisture Index (NDMI)	$(R_{1649} - R_{1792}) / (R_{1649} + R_{1792})$
Modified Red-Edge Ratio (MSR)	$(R_{750} / R_{705} - 1) / (R_{750} / R_{705} + 1)$
Simple Ratio (SR)	$R_{872} / R_{661}$
Normalized Difference Green Vegetation Index (GNDVI)	$(R_{780} - R_{550}) / (R_{780} + R_{550})$
Vogelmann Red Edge Index (VREI)	$R_{742} / R_{722}$
Modified Triangular Vegetation Index (MTCI)	$(R_{754} - R_{709}) / (R_{709} + R_{681})$
Normalized Difference Red Edge Normalized Index (RENDVI)	$(R_{750} - R_{702}) / (R_{750} + R_{702})$
Three-Band Water Index (TBWI)	$(R_{973} - R_{1720}) / R_{1447}$
Photochemical Reflectance Index (PRI)	$(R_{529} - R_{569}) / (R_{529} + R_{569})$

Note: $R_i$ denotes the reflectance at wavelength $i$ (in nm).
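As an illustration of how indices of this kind can be evaluated, the following minimal Python sketch computes NDWI, DVI-1 and EVI-2 from a single reflectance spectrum. It assumes the spectrum has been resampled to 1 nm steps over 350–2500 nm, which is consistent with the 2151 channels listed in Table 1 but is an assumption here, not a description of the authors' processing chain. All function and constant names are hypothetical.

```python
# Minimal sketch with hypothetical helper names. It assumes a spectrum stored
# as a list of 2151 reflectance values at 1 nm steps from 350 to 2500 nm.
FIRST_NM = 350
LAST_NM = 2500
N_CHANNELS = LAST_NM - FIRST_NM + 1  # 2151 channels


def reflectance_at(spectrum, wavelength_nm):
    """Return R_i, the reflectance at wavelength i (nm)."""
    if len(spectrum) != N_CHANNELS:
        raise ValueError("expected a spectrum with 2151 channels")
    if not FIRST_NM <= wavelength_nm <= LAST_NM:
        raise ValueError("wavelength outside 350-2500 nm")
    return spectrum[wavelength_nm - FIRST_NM]


def normalized_difference(spectrum, band_a, band_b):
    """(R_a - R_b) / (R_a + R_b), the form shared by NDII, NDWI, NDRE, etc."""
    r_a = reflectance_at(spectrum, band_a)
    r_b = reflectance_at(spectrum, band_b)
    return (r_a - r_b) / (r_a + r_b)


def ndwi(spectrum):
    """Normalized Difference Water Index with the bands given in Table 2."""
    return normalized_difference(spectrum, 872, 1245)


def dvi_1(spectrum):
    """Difference Vegetation Index-1: R_824 - R_651."""
    return reflectance_at(spectrum, 824) - reflectance_at(spectrum, 651)


def evi_2(spectrum):
    """Enhanced Vegetation Index-2 with the coefficients given in Table 2."""
    nir = reflectance_at(spectrum, 824)
    red = reflectance_at(spectrum, 651)
    return 2.5 * (nir - red) / (1 + nir + 2.4 * red)
```

The other normalized-difference indices in the table follow by calling `normalized_difference` with their own band pair.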

Table 3. Correlation between vegetation index and yield.

Vegetation Index	Pearson Correlation (Absolute Value)	Significance (p-Value)
NDRE	0.521	0.006
GARI	0.526	0.003
PRI	0.533	0.014
DVI-1	0.564	0.001
DVI-2	0.567	0.001
DVI-3	0.568	0.001
EVI-2	0.571	0.001
TBWI	0.581	0.001
NDMI	0.602	0.002
NDII	0.676	0.001
VDI	0.678	0.001
NDWI	0.705	0.001
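The screening step behind this table (keeping the indices whose absolute Pearson correlation with measured yield exceeds 0.5) can be sketched as follows. This is a standard-library illustration under stated assumptions, not the authors' code: the function names and the dictionary layout of the inputs are hypothetical, and significance testing is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def select_indices(index_values, yields, threshold=0.5):
    """Keep indices with |r| > threshold, sorted by ascending |r|.

    index_values maps an index name to its per-plot values (hypothetical
    layout); yields holds the measured yield of the same plots.
    """
    scored = [(name, abs(pearson_r(values, yields)))
              for name, values in index_values.items()]
    kept = [(name, r) for name, r in scored if r > threshold]
    return sorted(kept, key=lambda item: item[1])
```

With 30 plots, each list in `index_values` would hold 30 values, one per plot, in the same order as `yields`.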

Table 4. Computational complexity and noise robustness of different methods.

Method	Computational Complexity	Noise Robustness
SMLR	High, involves evaluating multiple variable combinations	Sensitive to noise, the model is easily influenced by noise in the data
MLR	Low, capable of directly fitting linear relationships	Good robustness, has some tolerance to minor noise
PCA	Low, transforms high-dimensional data into low-dimensional data	Some robustness, able to resist some noise interference
ANN	High, requires training large numbers of neurons and multi-layer network structures	Some degree of robustness, but may be significantly affected by large amounts of noise or outliers
RF	Low, integrates multiple decision trees, each tree is relatively simple and can be processed in parallel	Strong robustness, able to resist minor noise interference

Table 5. Modeling accuracy and verification accuracy of different methods.

Modeling Method	Modeling Accuracy (R²)	Verification Accuracy (R²)	K-Fold Cross-Validation Accuracy	RMSE (kg/hm²)
PCA	0.382	0.422	0.488	1201.32
SMLR	0.612	0.398	0.493	1083.61
ANN	0.743	0.729	0.702	912.52
MLR	0.835	0.199	0.204	1123.45
RF	0.935	0.946	0.897	869.80
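For readers who want to reproduce metrics of this kind, the sketch below shows the common definitions of R² and RMSE and a simple contiguous k-fold split. The formulas are the usual textbook forms and may differ in detail from those the authors used. The function names and the choice of k = 5 in the usage note are assumptions made for illustration.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the same unit as the yield values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size
```

For example, `k_fold_indices(30, 5)` splits 30 plots into five folds of six test plots each. In practice the plots would normally be shuffled before splitting.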

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

