1. Introduction
Soybean [Glycine max (L.) Merr.] is one of the world’s most important crops, with uses such as cooking oil, livestock feed, and biofuel feedstock, as well as being a source of protein for the human diet. Such versatility makes soybean a pillar in the economy of countries such as the United States and Brazil, which are leaders in world production [
1]. Soybean yield prediction is of great interest for the global market, for guiding government policies, and for increasing global food security [
2]. Productivity is an average measure of production efficiency. Soybean yield levels are associated with factors such as weather conditions, soil properties, temperature, genotype and treatment, all of which affect the soybean crop’s development, as Wei and Molin pointed out [
3].
Remote sensing data have been used to predict and monitor yields, providing accurate information on crop status for early estimation of yield on a local/regional scale. New lightweight, handheld-size hyper/multispectral sensors can be carried by unmanned aerial vehicles (UAVs), allowing detailed crop data to be collected. According to Banerjee et al. [
4], crop canopy reflectance can be remotely sensed, providing information on characteristics of the biochemical composition (e.g., chlorophyll, moisture content, dry biomass, canopy), structural parameters (e.g., leaf area, leaf angle), and soil properties (e.g., soil moisture). Depending on the application, these characteristics can be related to vegetation vigour, growth, and nutritional status, among others. In addition, an early estimate of field-scale production also contributes to the phenotyping of high-yield plants and precision agriculture [
5]. In this context, machine learning algorithms have contributed to the analysis of the high spatial and spectral dimensionality of remote sensing data. Traditional regression methods often cannot capture complex and nonlinear relationships among data. Thus, aiming for more efficient modelling, machine learning-based methods allow for exploring larger datasets of contrasting data types [
6]. Many machine learning-based regression methods have been widely applied to achieve accurate yield predictions for various crops in recent years, e.g., cotton [7], wheat [8,9,10], maize [11,12,13], and soybean [2,5,14,15,16].
Wei and Molin [3] used machine learning approaches to estimate soybean productivity based on the number of grains and thousand-grain weights. The highest precision was obtained with a linear regression model adjusted by the number of grains, which achieved a determination coefficient (R²) of 0.70. Maimaitijiang et al. [5] evaluated the fusion of data extracted from RGB, multispectral and thermal sensors carried by UAVs to estimate soybean grain yield based on machine learning algorithms. The authors concluded that the fusion of multimodal data improved the yield estimate and was better adapted to spatial variations. The highest precision was obtained by a deep neural network with R² = 0.72.
Such remote sensing data have generally relied on vegetation spectral indices derived from mathematical formulations for different bands. Several studies have described the correlation between vegetation indices and crop yield derived from multi- or hyperspectral data. According to Zhao et al. [
9], the relationship between vegetation indices and crop yield can be seen as a function of canopy characteristics, such as chlorophyll content, biomass, and canopy architecture. Silva et al. [
2] concluded that the soil-adjusted vegetation index (SAVI) and normalised difference vegetation index (NDVI) were efficient in predicting productivity, with the highest values of these indices corresponding to the highest productivity observed in the field. In their experiments, Zhang et al. [
12] concluded that NDVI and the simple ratio (SR) index were the best vegetation indices for soybean yield prediction.
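The three indices named above have standard two-band definitions, which can be written out as a minimal sketch. The reflectance values and the soil adjustment factor of 0.5 are illustrative assumptions, not values taken from the cited studies.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index: (NIR - R) / (NIR + R)."""
    return (nir - red) / (nir + red)


def simple_ratio(nir, red):
    """Simple ratio index: NIR / R."""
    return nir / red


def savi(nir, red, soil_factor=0.5):
    """Soil-adjusted vegetation index.

    soil_factor is the canopy background adjustment term (often written L);
    0.5 is a commonly used default and is an assumption here.
    """
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


# Hypothetical canopy reflectances (fractions between 0 and 1).
nir, red = 0.45, 0.05
for name, value in (("NDVI", ndvi(nir, red)),
                    ("SR", simple_ratio(nir, red)),
                    ("SAVI", savi(nir, red))):
    print(f"{name}: {value:.3f}")
```

All three indices increase as near-infrared reflectance rises relative to red reflectance, which is why dense, vigorous canopies produce high values.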
Another critical issue to be considered in soybean productivity prediction is the phenological development stage to drive image acquisition. Phenological development can be separated into two stages, vegetative (Vi) and reproductive (Rj), with their respective subclasses, as defined by Fehr and Caviness [
17]. Ma et al. [
18] conducted field experiments with soybean canopy reflectance measurements using a multispectral handheld radiometer during the R2, R4, and R5 growth stages. The regression models showed a positive correlation between canopy reflectance near the 700–800 nm wavelengths (transformed into NDVI) and grain yield, indicating R4 and R5 as the most suitable stages for early crop yield prediction. In turn, Zhang et al. [
15] performed studies that indicated R5 as the best stage for single-period prediction modelling. Maimaitijiang et al. [
5] also reported that several studies have indicated an optimal window for the stage between flowering and the initial filling of the grains (or growth stages from R2 to R5). Eugenio et al. [
14] investigated the use of multispectral sensors carried by UAVs to collect images of irrigated soybean fields for yield estimation. The authors analysed the influence of the phenological stage on grain yield prediction models fitted with a multilayer perceptron algorithm. In that irrigated soybean area, images from the vegetative stage (V6) gave the best predictions. This result indicates that the type of treatment used on the crop can influence the data acquisition window for estimating production.
In addition to spectral data, crop height is a structural parameter that plays an important role in modelling crop growth, health status, production forecasting, and biomass estimation [
19]. The combination of canopy structure and spectral information has been tested to improve the performance of prediction models, including grain yield [
19,
20,
21]. For soybean, several studies have shown positive correlations between canopy structure (i.e., canopy height) and grain yield prediction [
21,
22,
23].
As commented by Zhang et al. [
15], hyperspectral remote sensing with lightweight sensors onboard UAVs can obtain continuous spectrum information and high-resolution images. Crop canopy spectra in narrow bands can be captured, and therefore, information on the biophysical/biochemical composition of the canopy status can be provided in more detail. Hyperspectral sensors, such as the Rikola camera (Senop Ltd., Kangasala, Finland) [
24], acquire frame-format images and stereo pairs for the generation of digital surface models (DSMs) and hyperspectral orthomosaics. Spectral and structural attributes of the crop canopy can then be derived. Compared with commercially available multispectral cameras that collect few bands with broad bandwidths, the Rikola camera captures narrow contiguous spectral bands. Thus, the entire spectrum can be used to analyse the crop canopy reflectance in grain yield estimation. Additionally, as productivity is a variable related to the local growing conditions, vegetation indices can be used to identify spatial variations. Guided by this spectral information, sample plots can be better placed to collect the data used to fit predictive models.
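One common way to derive a structural attribute such as canopy height from a DSM is to subtract a bare-ground elevation model from it. The sketch below assumes that such a ground model is available on the same grid; the function names and the elevation values are hypothetical, and this is not necessarily the procedure used in this study.

```python
def canopy_height(dsm, ground):
    """Per-pixel canopy height: surface elevation minus ground elevation.

    Negative differences (noise, bare soil) are clipped to zero.
    Both inputs are 2D lists of elevations in metres on the same grid.
    """
    return [[max(s - g, 0.0) for s, g in zip(dsm_row, ground_row)]
            for dsm_row, ground_row in zip(dsm, ground)]


def mean_height(chm, rows, cols):
    """Mean canopy height over the pixels of one sample plot."""
    values = [chm[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


# Hypothetical 2 x 3 elevation grids (metres above datum).
dsm = [[400.9, 401.0, 400.1],
       [400.8, 401.1, 400.1]]
ground = [[400.2, 400.2, 400.2],
          [400.2, 400.2, 400.2]]

chm = canopy_height(dsm, ground)
print(mean_height(chm, rows=range(2), cols=range(2)))  # plot covering columns 0-1
```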
The studies presented above demonstrate the importance of soybean productivity estimates for farmers and for government economic policies. However, predicting productivity with high accuracy remains challenging because of environmental, climatic, and biological factors. On the other hand, the availability of modern spectral sensors makes it possible to obtain crop images of high spatial and spectral resolutions, and machine learning algorithms contribute to complex and multivariate data analysis.
In this context, three hypotheses were raised. First, we assumed that the sampling design directly affects the quality of soybean yield models. Traditional sampling used in agriculture might not represent all the variability of the analysed phenomenon and can sometimes be biased. Therefore, to guarantee satisfactory representativity of the variable of interest, as well as a balanced number of samples for each class, judgement-based sampling that considers the spatial variability observed in spectral vegetation indices can improve the performance of the models. The second hypothesis concerns the small number of samples, since in situ data collection is laborious, time-consuming and expensive. The in situ sampling technique based on plots assumes that the plants’ spectral response within each plot is correlated with the soybean yield. Hence, it would be possible to apply a data augmentation technique in which the soybean yield value obtained from a sample plot is associated with several pixels. This technique is expected to increase the variance of the samples, reflecting the variability observed in high-resolution images. The last hypothesis is that soybean yield prediction models perform better when fitted from images taken near the end of the reproductive stage, before senescence reduces photosynthetic activity. In addition, in regions with high climate instability, drought events throughout crop development can decrease soybean yield, as well as the correlation between plant spectral response and productivity in the vegetative stage.
In this paper, we aimed to model soybean yield relying on spectral and geometric data derived from high-spatial-resolution images from the Rikola camera and machine learning-based regression. The specific objectives are (i) to propose a new method of sampling design based on judgement considering the spatial variability observed in spectral vegetation indices; (ii) to investigate the contribution of the Rikola camera bands in predicting soybean yield; (iii) to propose a method of data augmentation; and (iv) to assess the contribution of the canopy height as a feature in the input dataset for modelling.
The methodological concept is based on the plant height and the variability of vegetation during grain growth, which could be detected by spectral sensors and correlated with the productivity observed in samples surveyed in the field. The methodology was conducted in an area with a history of low yield to ensure soil variability and test the technique. In addition, the objective was to analyse which bands within the continuous spectrum [500–900 nm] contribute most to the estimates, and to assess the use of indices to guide the collection of sample plots.
Thus, this study proposes a new sampling design method based on the distribution of sample plots, considering the spatial variability of vegetation spectral indices. It aims for a better representation of the variable of interest and obtains balanced samples for classes. Another novelty is the data augmentation technique that increases the variance of samples relying on spectral variability observed in high-resolution images.
4. Discussion
This study investigated soybean grain yield estimation using UAV-based hyperspectral imaging and photogrammetric 3D modelling. Two machine learning regression techniques (MLR and RF) were used to produce estimates from in situ data. Ten-fold cross-validation was used, resulting in an average bias of 1.20% for MLR and −0.47% for RF.
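The cross-validated bias reported above can be illustrated with a minimal sketch. It assumes that bias is the mean difference between predicted and observed yield expressed as a percentage of the observed mean, and it uses a one-predictor least-squares line as a stand-in for the MLR and RF models. The data are synthetic and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_relative_bias(xs, ys, k=10, seed=0):
    """Average over folds of 100 * (sum(pred) - sum(obs)) / sum(obs)."""
    biases = []
    for test in kfold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        pred = [slope * xs[i] + intercept for i in test]
        obs = [ys[i] for i in test]
        biases.append(100.0 * (sum(pred) - sum(obs)) / sum(obs))
    return sum(biases) / len(biases)


# Synthetic example: yield (kg/ha) loosely driven by a vegetation index.
rng = random.Random(42)
index = [rng.uniform(0.4, 0.9) for _ in range(60)]
yield_kg_ha = [1000 + 3000 * v + rng.gauss(0, 150) for v in index]
print(f"mean relative bias: {cv_relative_bias(index, yield_kg_ha):.2f}%")
```

A bias close to zero indicates that over- and underestimates largely cancel out. It does not measure the spread of the individual errors, which is why it is normally reported alongside r or R².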
Significant accuracies were achieved due to the sampling procedure guided by evidence of spectral variability detected from image data and by applying a sample delineation technique, which allows the selection of random samples in a stratified dataset. This finding is essential to demonstrate that the spectral response of soybeans in the reproductive stage is related to productivity and can be captured by remote sensing techniques. There was an initial concern about whether the stratified random sampling could lead to biased results. However, the definition of the sample plot locations was determined by spectral evidence, in which three vegetation indices were adopted with slicing at class intervals, allowing the selection of representative elements of each class. Furthermore, the features derived from high-resolution imagery showed that spectral information combined with the vertical structure (height) significantly contributed to estimating the soybean productivity when the model was fitted using the RF algorithm. The model was also able to adjust to a condition of large dispersion of the effect of soil attributes on soybean yield due to the distribution of coefficients found by the multiple regression model applied.
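The idea of slicing a vegetation index into class intervals and drawing a balanced random sample from each class can be sketched as follows. This is a simplified illustration with a single index and equal-width classes; the study combined three indices, and its class limits are not given in this section, so the function names and values below are assumptions.

```python
import random


def slice_into_classes(values, n_classes):
    """Group candidate locations by equal-width class of their index value.

    Returns {class_id: [location indices]}.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return {0: list(range(len(values)))}
    width = (hi - lo) / n_classes
    classes = {}
    for i, v in enumerate(values):
        class_id = min(int((v - lo) / width), n_classes - 1)
        classes.setdefault(class_id, []).append(i)
    return classes


def balanced_sample(classes, per_class, seed=0):
    """Randomly pick the same number of locations from every class."""
    rng = random.Random(seed)
    return {class_id: sorted(rng.sample(members, min(per_class, len(members))))
            for class_id, members in sorted(classes.items())}


# Hypothetical index values at nine candidate plot locations.
index_values = [0.05, 0.10, 0.15, 0.45, 0.50, 0.55, 0.85, 0.90, 0.95]
classes = slice_into_classes(index_values, n_classes=3)
plots = balanced_sample(classes, per_class=2)
print(plots)
```

Drawing the same number of plots from each class keeps low-vigour and high-vigour zones equally represented, even when one of them covers only a small part of the field.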
The experiments demonstrated that adding height to the prediction model can produce better results, depending on the machine learning algorithm. For the MLR algorithm, the height attribute did not significantly improve the estimates, likely because the vertical structure is correlated with the spectral bands close to the R and IR bands (the estimate without height produced r = 0.77, and with height, r = 0.79). In contrast, the nonlinear regression model, fitted by the RF algorithm, performed better when using the height features (the estimate without height produced r = 0.83, and with height, r = 0.89, considering 25 bands and height). Thus, as noted by other studies (e.g., [5,12,19,42]), tree-based regression models are more suitable for estimating productivity. Furthermore, in this study, instead of using the information of only one pixel, each sample area was composed of nine pixels with productivity values proportional to the spatial dimension. Thus, each set of sample pixels allowed an increase in the number of local features for the various information layers, representing the variation in plant productivity.
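The nine-pixel sample areas described above can be illustrated with a minimal sketch of the augmentation step. It assumes a 3 x 3 window centred on each plot and assigns the plot's yield, expressed per unit area, to every pixel in the window. How the study scaled yield to pixel size is not detailed in this section, so the labelling rule and the function name are assumptions.

```python
def augment_plot(cube, row, col, plot_yield, half_window=1):
    """Turn one field plot into several training samples.

    cube is a list of bands, each a 2D list of pixel values (spectral bands,
    plus a height layer if used). Returns one (features, yield) pair for
    every pixel in the window centred on (row, col).
    """
    n_rows, n_cols = len(cube[0]), len(cube[0][0])
    if not (half_window <= row < n_rows - half_window
            and half_window <= col < n_cols - half_window):
        raise ValueError("window falls outside the image")
    samples = []
    for r in range(row - half_window, row + half_window + 1):
        for c in range(col - half_window, col + half_window + 1):
            features = [band[r][c] for band in cube]
            samples.append((features, plot_yield))
    return samples


# Hypothetical two-layer 5 x 5 image: layer 0 and layer 1 differ by 100.
band0 = [[r * 5 + c for c in range(5)] for r in range(5)]
band1 = [[100 + r * 5 + c for c in range(5)] for r in range(5)]
samples = augment_plot([band0, band1], row=2, col=2, plot_yield=3200.0)
print(len(samples), samples[4])
```

Each plot contributes nine feature vectors instead of one, so the within-plot spectral variability enters the training set while the yield label stays tied to the field measurement.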
In general, the experiments showed that a set with many spectral bands was not necessary. In the MLR algorithm, indices with some bands can be used rather than considering multiple bands. On the other hand, the nonlinear RF algorithm performed better with spectral bands along with the height attribute and was significantly superior to the MLR algorithm in estimating productivity. The results showed that using four spectral bands (552 nm, 672 nm, 701 nm, 810 nm) together with the height attribute produced a better result (r = 0.91) in the RF prediction model, which corresponds to R² = 0.828. It is worth highlighting that two of these bands are in the red-edge region, which is not commonly covered by multispectral sensors. The results obtained in this study were higher than the R² = 0.70 reached by Wei and Molin [3] to estimate soybean yield considering the number of grains and thousand-grain weight in a linear regression approach. The R² found in this research was similar to the value R² = 0.824 obtained by Maimaitijiang et al. [5] using a deep neural network with different types of images (multispectral, thermal, and RGB) in rainfed soybean. It is worth mentioning that machine learning approaches such as RF require a significantly lower number of samples than deep learning algorithms. The experiments conducted by Eugenio et al. [14] with multispectral images and machine learning resulted in R² = 0.84. However, those soybean yield estimates were carried out on irrigated soybean crops, that is, in a controlled area, while the approach studied in this research was developed with a rainfed soybean crop. The RF model demonstrated a better ability to identify and estimate the production potential within the study area, compared to the MLR model. The estimated value was compared with the weight of grains harvested at 13% moisture to obtain an accuracy measure. The difference between the estimated and collected weights was 1.5% in this case study.
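The comparison between estimated and harvested weight at 13% moisture can be illustrated with a short sketch. It uses the standard dry-matter conversion to a reference moisture content and a signed percentage difference. The weights and the field moisture value are hypothetical and are not the study's data.

```python
def weight_at_standard_moisture(weight, moisture_pct, standard_pct=13.0):
    """Convert grain weight at field moisture to weight at a reference moisture.

    Dry matter is conserved:
    weight * (100 - moisture) = standard_weight * (100 - standard).
    """
    return weight * (100.0 - moisture_pct) / (100.0 - standard_pct)


def percent_difference(estimated, measured):
    """Signed difference of the estimate relative to the measurement, in %."""
    return 100.0 * (estimated - measured) / measured


# Hypothetical harvest: 1000 kg of grain weighed at 16% moisture.
harvested_13 = weight_at_standard_moisture(1000.0, moisture_pct=16.0)
model_estimate = 950.0  # hypothetical model output, kg at 13% moisture
print(f"harvested at 13%: {harvested_13:.1f} kg")
print(f"difference: {percent_difference(model_estimate, harvested_13):.2f}%")
```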
Another relevant issue for more accurate modelling is the representativity of the samples. Traditional methods of soybean yield determination use flow meters in harvesters. Although this approach is the most precise method, it can be applied only at harvest. Our proposal for sampling design aims to minimise the bias and obtain statistically representative samples of the spatial variation in soybean yield. Due to the data augmentation technique applied, a significant number of samples were obtained, resulting in a satisfactory performance of the soybean prediction model. The plant growth stage also has an important role in model accuracy. The area of interest is located in a region where the rainfall regime is very irregular. In drier years, soybean development and ageing are accelerated, affecting the development of the pods and grains. In addition, heat waves cause the abortion of flowers. Other factors affecting productivity are the type of soil, soil compaction, and nematodes. Therefore, adopting the R5 stage seems adequate because by then the vegetation has already undergone the stress caused by external factors, so its spectral response correlates more strongly with the final productivity.