1. Introduction
The growing global population has heightened the need for reliable food sources and food security, underscoring the importance of advancing efficient and sustainable agricultural practices. The agriculture industry today faces substantial challenges, including rising global food demand, crop diseases, pest outbreaks, limited arable land, and the impacts of climate change. Addressing these issues is vital for ensuring a resilient and productive agricultural sector. Research by Tan and Reynolds indicates that in southwestern Ontario, water supply and demand pose the greatest challenge to the agricultural sector [1]. Interestingly, farmers in this region are less concerned about climate change compared to those in areas more frequently affected by extreme weather events [2]. The agriculture and agri-food sectors contributed approximately 7% to Canada’s gross domestic product (GDP) and accounted for one in every nine jobs in 2022 [3]. While climate change may not present an immediate threat to the Canadian agricultural industry, it is wise to stay informed and proactively prepare for potential future climate variations.
Precision agriculture (PA) employs advanced technologies and data analysis techniques to optimize crop yields while minimizing resource use. This approach involves evaluating quantified spatial and in situ plant data to inform agricultural practices such as the application of water, labor, and fuel, thereby reducing costs and preventing excessive waste, including pesticide and nutrient loss. PA integrates various spatial technologies, such as geographic information systems (GIS), handheld ground-based data collection devices, and remote sensing through ground-based or aerial vehicles, to develop and implement efficient agricultural strategies [3]. Given the high demand for data collection, remote sensing techniques are employed in crop management to precisely manage, produce, and predict crop data for analysis. Accurate crop yield prediction is crucial for helping farmers address production challenges and mitigate the effects of climate variability and change on crop yield [4].
Among the various platforms of surface spectral data collection in PA, space-borne satellites are one of the most stable platforms [5,6,7,8,9,10]. A key advantage of using optical satellite images for remote sensing is the ability to obtain spectral data over large land areas in a single snapshot with high resolution. Traditionally, researchers have faced challenges with optical satellite images due to their relatively lower spatial resolution compared to ground-collected data [11]. This limitation has restricted research to regional scales rather than local, field-scale studies. For example, Landsat 8, launched in 2013 by the United States, features the Operational Land Imager (OLI) with a spatial resolution of up to 30 m [12]. Similarly, Sentinel-2, launched in 2015, features 13 multispectral bands with spatial resolutions of 10 m, 20 m, and 60 m, and a revisit time of 5 days with its constellation of twin satellites [13]. In contrast, VENμS’s VSSC (Vegetation and Environment Monitoring on a New Micro-Satellite Super-Spectral Camera) captures optical images at a resolution as high as 5.3 m. Additionally, VENμS has a revisit time of 2 days, compared to Landsat 8’s 16 days [12,14]. These advantages in both spatial and temporal resolution make VENμS a superior choice for detailed crop monitoring and analyses, providing more frequent and precise data for agricultural applications.
The ease of access to satellite data offers a significant advantage over the other major remote sensing methods: ground-level and UAV-level remote sensing. Many satellite datasets, such as those from VENμS, the Landsat series, Sentinel-2, MODIS, and the SPOT series, are publicly available. VENμS imagery can be downloaded free of charge in its predefined areas, thereby reducing both labor and monetary research costs compared to ground sampling and UAV flight operations. Although crop monitoring has traditionally relied on satellite imagery, UAV-based systems often rival it because of their superior spatial and temporal resolution. Crop growth stages can vary week to week, making some satellite images unsuitable for timely analysis. For instance, Sentinel-2 data have yielded unsatisfactory crop yield prediction results due to cloud coverage and lower temporal resolution [15]. VENμS addresses this issue by providing higher spatial resolution data compared to most satellites, while maintaining frequent revisits of 2 days and offering a wide range of multispectral bands [14].
Furthermore, UAV operations are often constrained by weather conditions. Clear skies and low wind speeds are typically required to collect high-quality data. While UAVs offer flexible planning and scheduling, VENμS can achieve similar advantages by mitigating poor coverage with its short revisit period. Despite this, UAVs provide an edge over satellites by allowing researchers greater control over the location and timing of data collection. However, UAV flights with payloads such as multispectral cameras are often restricted under aviation regulations, and in most countries additional procedures or certifications are required if the flight is to be conducted in a regulated aerodrome. For instance, Transport Canada mandates the registration of any remotely piloted aircraft (RPA) weighing between 250 g and 25 kg, which encompasses most commercially available UAVs that can carry spectral sensors as payloads [16]. Additionally, operating RPAs in this category requires the pilot to hold different classes of operation licenses depending on the location of the flight. In contrast, satellite data can often be obtained online free of charge and without any operational requirements, making the data widely accessible. Thus, VENμS effectively combines the advantages of both satellite and UAV systems, offering high spatial resolution, frequent temporal coverage, and ease of data access.
With the spectral data collected from remote sensing imagery, vegetation index (VI) calculations become feasible. VIs are mathematical transformations of spectral bands widely used in agricultural research to determine specific plant properties, such as leaf area index (LAI), chlorophyll content, and nutrient levels [17,18,19]. Consequently, VIs are commonly employed for crop growth and health monitoring, including yield prediction [19,20]. For instance, vegetation indices that performed well in the study by Fu et al. were derived using the red absorption portion of the spectrum [21]. On multispectral cameras, this typically includes the red band and red-edge bands. Indices such as the normalized difference vegetation index (NDVI), normalized difference red edge (NDRE), and soil-adjusted vegetation index (SAVI) have been previously studied as effective indices in winter wheat yield monitoring [11,22]. VENμS is specifically designed for vegetation monitoring, offering more bands in the red-edge and near-infrared range than most publicly available satellite data. This enhanced spectral capability improves its ability to detect vegetation properties. Therefore, it is important to further explore VENμS’s yield prediction potential using a more diverse range of vegetation indices that may be unavailable from other satellites.
Recently, machine learning regression methods, such as Random Forest (RF) and Support Vector Regression (SVR), have been extensively investigated for biomass and yield estimation [23,24,25,26]. These machine learning methods can capture complex patterns and relationships in the data that traditional methods might miss, and have proven viable in yield prediction [27,28]. RF, for example, can handle a large number of input variables and is less likely to overfit due to its ensemble nature. Hunt et al. successfully mapped winter wheat yield using Sentinel-2 data and RF regression models, achieving a relatively low root mean square error of 0.66 t/ha [29]. This work suggests the potential of utilizing higher spatial resolution data to capture within-field yield variability with a common machine learning algorithm.
SVR, conversely, focuses on optimizing a margin around a hyperplane, which can result in better generalization on unseen data. While traditional regression methods are straightforward and easier to interpret, machine learning regression methods like RF and SVR offer significant advantages in terms of handling complexity, scalability, and adaptability, making them suitable for a wide range of modern data-driven applications.
Compared to most publicly available satellites, such as Sentinel-2, VENμS offers additional bands in the red-edge and near-infrared ranges, which are particularly advantageous for vegetation monitoring. It also provides relatively higher spatial and temporal resolution. Despite these benefits, VENμS has rarely been studied in yield estimation research. Therefore, to make a well-informed prediction of winter wheat yield at a local, field scale using VENμS data, it is essential to introduce an appropriate prediction model. The objectives of this study are (i) to investigate the relationships between yield and VIs at different growth stages, (ii) to evaluate the effectiveness of RF and SVR models in predicting yield, (iii) to determine the optimal combinations of dates (growth stages) for yield prediction in a winter wheat field located in southwestern Ontario, (iv) to uncover insights into the ranked importance of VIs from different growth stages, and (v) to produce a yield prediction map.
2. Materials and Methods
2.1. Study Area and Data Collection
The study site is in Strathroy-Caradoc, Ontario, Canada, near the village of Mount Brydges, which is about 23 km southwest of the urban center of London, Ontario (Figure 1). The studied period was from May to early July of 2020, during which the average recorded temperature was 22 °C and the relative humidity averaged 73%. The climate in the area is classified as a warm-summer humid continental climate (Dfb) according to the Köppen climate classification system. The area is predominantly agricultural croplands and its major field crops include winter wheat, corn, and soybeans [30]. Winter wheat was selected as the focus of this study. A winter wheat field covering 53.7 hectares in this region was designated as the specific area for investigation.
The cultivar in the studied field was soft red winter wheat, which was planted in October of 2019. In southwestern Ontario, winter wheat typically lies dormant over the winter after planting, then commences shooting in late April of the following year and is harvested from early to mid-July. VENμS imagery was acquired at each successive growth stage, starting at tillering, then stem elongation, booting, heading, flowering, early fruit development, and ripening. The growth stages were verified in the field by gauging the plant’s physical characteristics using the Biologische Bundesanstalt, Bundessortenamt and CHemical industry (BBCH) scale, matching the satellite overpass dates (Table 1). In total, 8 cloud-free VENμS images were acquired.
2.2. VENμS Satellite Imagery and Preprocessing
The data used in this research were collected by VENμS (Vegetation and Environment Monitoring on a New MicroSatellite), which was launched in August 2017. This satellite marks the first Earth observation collaboration between France and Israel, led by the Centre National d’Etudes Spatiales (CNES) and the Israeli Space Agency (ISA). The mission aims to monitor plant growth and health status, providing valuable insights into the impacts of environmental factors, human activities, and climate change on Earth’s land surface [14]. Since 2017, the VENμS VM1 mission has provided multispectral data from its 12 different bands, featuring a spatial resolution of 5.3 m, a revisiting period of 2 days, and operating at an altitude of 720 km above sea level (Table 2). As implied by its name, VENμS excels in monitoring Earth’s surface vegetation, which is facilitated by its extensive red-edge and near-infrared bands.
The imagery was categorized as level 2A (L2A) surface reflectance data, with each scene covering areas ranging from 27 × 27 km2 to 27 × 54 km2 at the spatial resolution of 5 × 5 m2. The satellite imagery was processed and distributed by Theia MUSCATE (MUlti SATellite, multi-CApteurs, for multi-TEmporelles data), a component of the Theia Land Data Centre. This French inter-agency organization aims to provide satellite data and value-added products for scientific communities and public policy actors. MUSCATE facilitates the processing and distribution of large volumes of satellite imagery, particularly from VENμS, Sentinel-2, and Landsat satellites. This includes tasks such as atmospheric corrections and creating cloud-free surface reflectance syntheses. The processed data are used in various applications, including agriculture, forestry, urban planning, and environmental monitoring. Each of the VENμS L2A products contained two versions of surface reflectance data for the 12 bands, from B01 to B12. The first version of the surface reflectance rasters is denoted as SRE.DBL (Surface Reflectance), which is atmospherically corrected. The second version is denoted as FRE.DBL (Flat Reflectance), which is SRE.DBL files further corrected for slope effects. This correction suppresses apparent reflectance variations due to the orientation of slopes with regard to the sun, making the corrected image appear as if the land surface were flat. For this study, the FRE.DBL raster files were adopted.
The L2A surface reflectance rasters were encoded as 16-bit signed integers, necessitating preprocessing before any manipulation by dividing pixel values of each channel by 1000. This preprocessing was conducted in Python 3.9.19 using packages such as “rasterio”, “gdal”, and “numpy” to extract and obtain surface reflectance values from each band at the study site. Subsequently, the 12-band raster values were normalized to a range between 0 and 1 for use in later calculations.
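The scaling step described above can be sketched in a few lines of NumPy. This is an illustration only, not the authors' script: the divisor of 1000 comes from the text, whereas the no-data flag of -10000 and the reading of "normalized to a range between 0 and 1" as clipping are assumptions that should be checked against the product metadata. The function name `to_reflectance` is hypothetical.

```python
import numpy as np

SCALE = 1000.0    # divisor stated in the text
NODATA = -10000   # ASSUMED no-data flag; verify in the L2A product metadata


def to_reflectance(dn):
    """Convert 16-bit signed digital numbers to float surface reflectance.

    Values are divided by SCALE, no-data pixels become NaN, and the result
    is clipped to [0, 1] (clipping is an assumed interpretation of the
    normalization step described in the text).
    """
    dn = np.asarray(dn)
    refl = dn.astype(np.float32) / SCALE
    refl[dn == NODATA] = np.nan
    return np.clip(refl, 0.0, 1.0)


# Toy 2 x 2 band: a valid pixel, an over-range pixel, no-data, a negative pixel
band = np.array([[250, 1200], [-10000, -5]], dtype=np.int16)
reflectance = to_reflectance(band)
```

In practice the array would come from reading one band of the FRE raster (for example with rasterio), and the same function would be applied to each of the 12 bands.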
2.3. Vegetation Indices
Vegetation indices (VIs) were used in this study as predictors of final harvested yield. The VIs were calculated as raster products using the 12 VENμS bands in Python 3.9.19, employing the same packages used in satellite image preprocessing. Additionally, several VIs that utilize spectral information in the red edge and near-infrared wavelengths, which are well represented in VENμS data, have demonstrated strong correlations with crop growth, health, and yield [17,31,32]. A total of 21 VIs were tested in this study, including 8 variations of existing VIs based on their original development formulas (Table 3). This was made possible by fitting the narrow bandwidth of VENμS bands into VI formulas initially developed with legacy sensors and satellites. For instance, NDVI was developed using the Landsat-1 Multispectral Scanner, where the NIR band 7 had a bandwidth range of 800 to 1100 nm. With the VSSC, both bands 11 and 12 fit within this NIR range, allowing for the inclusion of variations of existing VIs in the analysis.
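As an illustration of how a single index yields band-specific variations, the sketch below computes NDVI twice, once with band 11 and once with band 12 as the NIR input, plus SAVI. It uses the standard NDVI and SAVI definitions; the choice of the red band, the soil factor L = 0.5, the sample reflectance values, and all function and variable names are assumptions for illustration and do not reproduce Table 3.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = a + b
    out = np.full(denom.shape, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


def savi(nir, red, soil_factor=0.5):
    """Soil-adjusted vegetation index; soil_factor = 0.5 is an assumed default."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


# Hypothetical reflectance values for one vegetated pixel
red = np.array([0.05])       # red band (band choice is an assumption)
nir_b11 = np.array([0.45])   # band 11, within the 800-1100 nm NIR range
nir_b12 = np.array([0.40])   # band 12, within the 800-1100 nm NIR range

ndvi_b11 = normalized_difference(nir_b11, red)   # NDVI variation using band 11
ndvi_b12 = normalized_difference(nir_b12, red)   # NDVI variation using band 12
savi_b11 = savi(nir_b11, red)
```

Applied to full band rasters instead of single pixels, the same functions return one VI raster per band combination.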
2.4. Yield Dataset
The yield data were collected at harvest on 25 July 2020, with a combine harvester equipped with a 10 m-wide and 1.5 m-long header. Yield data were generated as a point shapefile, with yield data recorded approximately every second at the center of the harvester’s track. To ensure accuracy, potential outliers located at the edges of the field were removed. For this study, the shapefile was interpolated into a 5 × 5 m2 spatial resolution raster using QGIS 3.22 with inverse distance weighted (IDW) interpolation, matching the VENμS imagery and the derived vegetation indices (VIs). This approach was adopted to fully utilize the high-resolution advantage of VENμS data and to produce a detailed yield prediction map.
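The interpolation itself was performed in QGIS; the following is only a minimal NumPy sketch of the inverse distance weighting idea, in which each grid cell receives a weighted mean of the yield points with weights proportional to 1 / distance^power. The power of 2, the use of all points for every cell, and the function name `idw_grid` are assumptions, since the QGIS settings are not stated in the text.

```python
import numpy as np


def idw_grid(xy, values, grid_x, grid_y, power=2.0, eps=1e-12):
    """Interpolate scattered points onto a regular grid with IDW.

    xy      : (n, 2) point coordinates in metres
    values  : (n,) yield values at those points
    grid_x, grid_y : 1-D arrays of cell-centre coordinates
    power   : distance exponent (ASSUMED to be 2 here)
    """
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)
    cells = np.column_stack([gx.ravel(), gy.ravel()])                  # (m, 2)
    dist = np.linalg.norm(cells[:, None, :] - xy[None, :, :], axis=2)  # (m, n)
    weights = 1.0 / np.maximum(dist, eps) ** power   # eps avoids division by zero
    estimate = (weights @ values) / weights.sum(axis=1)
    return estimate.reshape(gx.shape)


# Two hypothetical yield points 10 m apart, interpolated at 5 m spacing
points = np.array([[0.0, 0.0], [10.0, 0.0]])
yields = np.array([2.0, 4.0])   # t/ha, made-up values
grid = idw_grid(points, yields, grid_x=np.array([0.0, 5.0, 10.0]), grid_y=np.array([0.0]))
```

This dense formulation stores a full cell-by-point distance matrix, so it only suits small examples; a field-scale run would normally restrict each cell to its nearest neighbours.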
2.5. Machine Learning Regression Modelling and Cross-Validation
In machine learning, regression models are used to predict continuous outcomes based on input variables. Two notable techniques in this domain are Random Forest (RF) regression and Support Vector Regression (SVR), both of which offer robust solutions for complex regression problems. Advantages of machine learning regression also include its ability to automatically learn from data without being explicitly programmed for each specific task. Given that the regression models in this study were based on pixel-level analysis, machine learning regression methods were ideal for our needs as they excel in handling large data sizes. In our study, we used three key metrics to evaluate the performance of our regression models: mean absolute error (MAE), R-squared (R²), and root mean squared error (RMSE). These metrics were employed during both the cross-validation stage and the calibration and validation of the final model to ensure a comprehensive assessment of model accuracy and reliability.
RF is an ensemble learning method that constructs multiple decision trees during calibration and outputs the mean prediction of these trees. By using multiple trees, RF reduces overfitting, a common issue in single-decision tree models. Each tree is built from a random sample of the data, with a random subset of features selected at each node to decide splits. This randomness helps make the model more resilient to noise and outliers. RF can handle large, high-dimensional datasets and identify important variables in the modeled relationships. Additionally, RF provides measures of feature importance, helping to understand the impact of each variable on the prediction.
SVR, on the other hand, extends the concepts of Support Vector Machines (SVMs) from classification to regression. Like RF, SVR is also generally robust to overfitting. This robustness results from its margin maximization, the use of kernel functions, the epsilon-insensitive loss function, and the reliance on support vectors. Unlike traditional methods that minimize the error between predicted and observed values, SVR attempts to fit the error within a certain threshold. It involves the creation of a hyperplane in a multidimensional space where the distance between the data points and the hyperplane is minimized, ensuring that errors do not exceed a defined threshold. This makes SVR particularly useful in cases where a margin of tolerance is specified in the predictions. SVR is highly effective in handling non-linear relationships through the use of kernel functions, making it adaptable to various types of data [46].
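The "error within a certain threshold" idea corresponds to the epsilon-insensitive loss, which is zero for residuals smaller than epsilon and grows linearly beyond it. The sketch below shows that loss in isolation; the epsilon value and the sample yields are illustrative assumptions, not the settings used in this study.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y_true - y_pred| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


# Hypothetical observed and predicted yields (t/ha)
observed = np.array([5.0, 5.0, 5.0])
predicted = np.array([5.05, 5.3, 4.5])
loss = epsilon_insensitive_loss(observed, predicted, epsilon=0.1)
```

The first prediction falls inside the tolerance tube and contributes no loss, so only the second and third points would influence the fitted function.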
In this study, the data collected over 8 dates were randomly divided into a 70% calibration set and a 30% validation set. A 10-fold cross-validation approach was applied to the calibration set to ensure the robustness and generalizability of the machine learning regression models. This method split the calibration data into 10 subsets (folds), using each fold in turn as the validation set while the remaining folds were used for training, so that each iteration used 90% of the calibration data for training and 10% for validation. This approach ensured that each training set was large enough to effectively train the model, while each validation set was sufficient to provide a reliable evaluation. During the cross-validation stage, MAE, R², and RMSE served as crucial indicators of model performance. MAE provided the average magnitude of errors in the predictions, indicating the overall accuracy of the model without considering the direction of errors. RMSE, which penalized larger errors more significantly due to its squared component, offered insight into the model’s ability to handle large deviations from observed values. R², representing the proportion of variance explained by the model, evaluated the goodness of fit, with values closer to 1 indicating a better fit. By averaging these metrics across all folds, we obtained a robust estimate of the model’s performance and its variability, thus mitigating the risk of overfitting or underfitting to specific subsets of data. For the RF models, the RMSE dictated the optimal cross-validated RF model with an optimal number of splits at each tree node. The MAE value of that optimal RF model represents its average magnitude of the errors in the prediction. MAE served the same purpose in the SVR models, but RMSE determined the optimal SVR model with the optimal regularization parameter. The equations are as follows:
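$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values, and;
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$
where $\hat{y}_i$ represents the predicted yield (t/ha), $y_i$ denotes the observed yield (t/ha), n is the total number of observations, and i serves as the summation index, incrementing by one.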
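For reference, the three metrics follow their standard definitions and can be computed directly, as in the sketch below. The function names and the sample yields are hypothetical and are not taken from the study's scripts.

```python
import numpy as np


def mae(observed, predicted):
    """Mean absolute error."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(predicted - observed)))


def rmse(observed, predicted):
    """Root mean squared error."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up yields in t/ha
obs = [4.0, 5.0, 6.0]
pred = [4.5, 5.0, 5.5]
metrics = {"MAE": mae(obs, pred), "RMSE": rmse(obs, pred), "R2": r_squared(obs, pred)}
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same data, which is why the two are reported together.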
After cross-validation, the final model was trained on the entire cross-validated calibration dataset and then evaluated on both the calibration and validation datasets using the same metrics. In the calibration stage, RMSE assessed the model’s fit to the data it was trained on, while R² measured how well the model captured the underlying data patterns. High R² values, coupled with low RMSE, suggest a good fit. However, it is crucial to compare these metrics with those from the validation stage. The validation stage involved assessing the model on unseen data, providing an indication of its generalization ability. Consistent performance across calibration and validation sets, characterized by similar R² and RMSE values, indicates a robust model.
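The data partitioning described in this section, a random 70/30 calibration and validation split followed by 10-fold cross-validation on the calibration set, can be sketched as follows. This is a NumPy illustration under stated assumptions: ordinary least squares stands in for the RF and SVR models purely to keep the example self-contained, the seed and data are synthetic, and none of the function names come from the study.

```python
import numpy as np


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Randomly split range(n) into calibration and validation indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return perm[n_test:], perm[:n_test]


def kfold_indices(indices, k=10, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(indices), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


def cross_validated_rmse(X, y, indices, k=10):
    """Average RMSE over k folds, using least squares as a STAND-IN model."""
    scores = []
    for train, val in kfold_indices(indices, k):
        design = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(val)), X[val]]) @ coef
        scores.append(np.sqrt(np.mean((pred - y[val]) ** 2)))
    return float(np.mean(scores))


# Synthetic "pixels": 3 VI predictors and a linear yield response
rng = np.random.default_rng(1)
X = rng.random((1000, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25])
cal_idx, val_idx = train_test_split_indices(len(y))
cv_rmse = cross_validated_rmse(X, y, cal_idx)
```

The validation indices are never touched during cross-validation, which is what allows the final comparison between calibration and validation metrics to indicate generalization.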
Figure 2 displays the workflow of the methodology. The models were implemented in the R programming language using RStudio, with the “randomForest” and “e1071” packages for RF and SVR, respectively. In both models, the independent variables were the VIs. Data collected over the 8 dates were run individually, then divided into two groups of “pre-heading” and “post-heading”. For each dataset, a 10-fold cross-validation was performed using packages “caret” and “kernlab”. The yield prediction raster was also created using the “raster”, “sp”, and “rasterVis” packages in RStudio.
5. Conclusions
This study evaluated the effectiveness of VENμS multispectral imagery in predicting winter wheat yield in southwestern Ontario using machine learning methods. A total of 21 VIs, including eight variations of existing VIs based on their original development formulas, were tested. The best prediction result demonstrated a high correlation between VENμS data and observed yield, with an R² = 0.86 and an RMSE = 0.3925 t/ha using an SVR model. According to our results, a reliable prediction of yield can be achieved two months prior to harvest using the combined pre-heading stage data, and the best result can be obtained 39 days prior when using all data from the pre- and post-heading stages. The findings suggest that VENμS data can offer superior yield prediction accuracy compared to other publicly available satellites, and could potentially serve as a viable alternative to UAV data for local, field-scale studies.
Though machine learning algorithms are effective at capturing complex patterns among variables, it is important to recognize the empirical nature of these models. They rely on existing datasets for validation, and can only approximate the observed yield, which is only verifiable at harvest. This intrinsic limitation highlights the potential discrepancies between predicted and actual outcomes. Such limitations underscore the necessity for ongoing calibration and testing of these models under varied agricultural conditions and across different crop cycles to ensure their reliability and accuracy.
Additionally, while k-fold cross-validation and a 70/30 train-test split were employed in this study due to their broad adoption and effective use of the data, future work could explore spatial splitting as a viable alternative. Spatial splitting, which divides the dataset based on geographic location rather than by random subsets, may provide a more realistic evaluation of the model’s robustness across different parts of the field by better addressing spatial autocorrelation. Investigating this approach could enhance the model’s performance in capturing spatial variability within the field.
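One possible form of spatial splitting is sketched below: pixels are grouped into square blocks by their coordinates, and whole blocks are held out for validation so that neighbouring pixels do not end up on both sides of the split. The block size, the held-out fraction, the synthetic pixel grid, and the function name are assumptions for illustration; this is not a method evaluated in this study.

```python
import numpy as np


def spatial_block_split(x, y, block_size, test_fraction=0.3, seed=0):
    """Hold out whole square blocks of pixels instead of random pixels.

    x, y       : 1-D arrays of pixel coordinates in metres
    block_size : side length of each square block in metres (ASSUMED value)
    Returns boolean masks (is_train, is_test) and the block id of each pixel.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    col = np.floor((x - x.min()) / block_size).astype(int)
    row = np.floor((y - y.min()) / block_size).astype(int)
    block_id = row * (col.max() + 1) + col
    blocks = np.unique(block_id)
    rng = np.random.default_rng(seed)
    rng.shuffle(blocks)
    n_test = max(1, int(round(len(blocks) * test_fraction)))
    is_test = np.isin(block_id, blocks[:n_test])
    return ~is_test, is_test, block_id


# Synthetic 20 x 20 pixel field at 5 m spacing, split into 25 m blocks
xs, ys = np.meshgrid(np.arange(20) * 5.0, np.arange(20) * 5.0)
train_mask, test_mask, block_id = spatial_block_split(xs.ravel(), ys.ravel(), block_size=25.0)
```

Larger blocks reduce leakage from spatial autocorrelation but leave fewer independent units to sample from, so the block size would need to be chosen relative to the autocorrelation range of the yield data.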
VENμS, as mentioned above, does not provide worldwide coverage, which is a significant drawback limiting the use of its superior high-resolution multispectral data. Although this study showed that Sentinel-2 is a less effective alternative, it remains the next best option for predicting yield with publicly available satellite data using our method. Its worldwide frequent coverage can produce comparable results to VENμS at the field-scale and potentially similar results at a regional scale. This research highlights the potential of high-resolution satellite data with multispectral cameras for yield prediction. Future studies may also consider using commercial satellites such as PlanetScope and WorldView-3 as alternatives for high-resolution multispectral data. Combining these satellite data with yield estimation models could enable low-labor, non-destructive, and highly accurate yield predictions, providing a more detailed understanding of crop yield potential and distribution.