1. Introduction
The vine is a widely cultivated plant. The global area under vines, which corresponds to the total area planted with vines for all uses (wine, table grapes, and raisins), including young vines not yet in production, was estimated at 7.3 million hectares in 2022. The International Organization of Vine and Wine (OIV) estimates that 258 million hectoliters of wine were produced in 2022, of which 161 million hectoliters were produced in Europe [1]. To avoid further interventions during vinification and to make wines with optimal organoleptic characteristics, the grapes must be harvested at their optimal maturity [2], given that grapes are non-climacteric fruits, meaning that they do not continue to ripen once harvested. Deciding when to harvest grapes at their ideal ripeness requires a thorough understanding of the grape composition factors that are essential for achieving a desired wine style. This decision should take into account various factors such as the grape variety and the desired organoleptic characteristics of the wine, local climate [3], topography [4], seasonal weather conditions, and vineyard management practices [5]. Conventional methods employed to assess grape ripeness involve measuring the concentration of sugar in the juice as an indicator of grape sugar accumulation or monitoring changes in grape acidity through titratable acidity or pH levels [6]. This requires various sampling procedures during ripening, specialized staff, and chemical analyses in the laboratory, which are very costly and time-consuming. Therefore, the assessment of grape maturity by accurate, contactless, non-destructive methods is desirable, especially if these could be carried out automatically by an agribot [7,8].
The visible and near-infrared (VNIR, 350–1000 nm, often referred to also as Vis–NIR) and the short-wave infrared (SWIR, 1000–2500 nm) regions of the electromagnetic spectrum have been recognized as critical ranges in grape production offering valuable spectral information for the estimation of different ripening parameters and phenolic composition [
9,
10]. While VNIR–SWIR spectroscopy enables the detection of molecular absorptions, including overtones of fundamental absorptions in the mid-infrared and combination bands thereof [
11], it is the analysis of these signals using advanced statistical methods (including machine learning and artificial intelligence techniques) that allows the quantification of indicators from the recorded spectra [
12,
13].
In the same context, subsets of the VNIR–SWIR range may also provide important features for ripeness estimation including the most critical ones, i.e., sugar content, pH, and titratable acidity (TA, also called total acidity); for example, the 1100–2300 nm range has been employed to estimate these and other grape quality parameters including the total phenolic content, total flavonoids, total anthocyanins, and total tannins [
14]. Along the same line, Daniels et al. (2019) [
15] explored the use of NIR spectroscopy to quantify total soluble solids (TSS), TA, TSS/TA, and pH non-destructively on intact bunches, achieving promising results using the Partial Least Squares (PLS) algorithm. Ping et al. (2023) [
16], using VNIR spectra, estimated the soluble solid content (SSC) and TA of the grapes, and also recorded the changes in chemical composition at different maturity levels. Vis, NIR, and SWIR spectroscopy was also explored by Meja-Correal et al. (2023) [17] for grape TSS estimation by applying a PLS regression model and selecting the best spectral range to avoid complex and potentially overfitted regression models; they concluded that the most suitable spectral range for TSS predictions was the NIR range (701–1000 nm).
In artificial intelligence (AI), attention layers [18], a fundamental component of modern deep learning, have seen significant adoption and innovation in recent years [
19], particularly in the context of natural language processing and computer vision tasks [
20]. These layers, often integrated within transformer architectures [
21], serve to capture intricate dependencies and relationships between input elements by assigning varying degrees of relevance to different parts of the input sequence. Multi-head attention mechanisms, an extension of attention layers, allow for the simultaneous consideration of multiple sets of attention weights, enabling the model to capture a spectrum of contextual information and multiple levels of abstraction [
21]. In the domain of computer vision, the fusion of multi-head attention with Convolutional Neural Networks (CNNs) has demonstrated remarkable performance in tasks such as image captioning and object detection [
22]. This combination leverages the spatial hierarchy captured by CNNs with the capacity of multi-head attention to model long-range dependencies, enhancing the model’s ability to comprehend complex visual scenes and exhibit state-of-the-art performance in a multitude of image-related tasks. Furthermore, the integration of multi-head attention with CNNs contributes to the burgeoning field of explainable AI [
23]. The attention mechanisms provide an inherent form of interpretability by revealing which parts of an image or input sequence the model focuses on when making predictions. This transparency can facilitate insights into the model’s decision-making process, aiding in the understanding and trustworthiness of the AI system, and thus enhancing its explainability and interpretability for both research and practical applications.
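To make the mechanism described above more concrete, the following is a minimal, illustrative sketch of scaled dot-product self-attention split over several heads, written in plain Python. It is not the architecture proposed in this paper: the function names, the toy input, and the simplification that each head is a plain slice of the input (instead of a learned projection) are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is the relevance that
    position i assigns to position j, and each row of weights sums to 1.
    """
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]
    width = len(values[0])
    outputs = [
        [sum(w * value[d] for w, value in zip(row, values)) for d in range(width)]
        for row in weights
    ]
    return outputs, weights


def multi_head_attention(sequence, num_heads):
    """Self-attention computed independently on num_heads slices of the input.

    Simplification (assumption): real layers apply learned projections per
    head; here each head simply sees a contiguous slice of the features.
    """
    width = len(sequence[0]) // num_heads
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        chunk = [vec[h * width:(h + 1) * width] for vec in sequence]
        out, w = attention(chunk, chunk, chunk)
        head_outputs.append(out)
        head_weights.append(w)
    # Concatenate the per-head outputs back to the original feature width.
    merged = [sum((head[i] for head in head_outputs), []) for i in range(len(sequence))]
    return merged, head_weights


# Toy sequence of three positions with four features each (made-up numbers).
toy = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 1.0, 0.4, 0.4]]
merged, weights = multi_head_attention(toy, num_heads=2)
```

Inspecting `weights` (one matrix per head) is what gives attention-based models their inherent form of interpretability: each row shows how strongly one position attends to every other position.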
Another key concept in artificial intelligence is the use of multi-output models which can simultaneously predict multiple targets from a given set of input features and patterns [
24,
25]. This differentiates them from classical single-output models that in traditional supervised learning learn a function that maps each of the inputs to a single corresponding output value. A closely related concept is multi-task learning, which involves training a model to perform multiple tasks simultaneously—the key difference between the two is that in multi-task learning, diverse tasks can be trained using distinct training sets or features, whereas in multi-output learning, the output variables typically share the same training data or features [
26]. In principle, the advantage of multi-output regression lies in its capacity to predict multiple correlated outputs simultaneously, enabling the model to leverage interrelations between the outputs, which enhances overall predictive accuracy and captures complex dependencies within the data.
This research builds upon our previous work and specifically the results presented in [
27], where only sugar content was determined in four different grape varieties using a CNN which outperformed standard machine learning algorithms, using data collected during the harvest and pre-harvest seasons of 2020 and 2021. In the summer of 2023, new data were collected from the same grapevines and two additional maturity indicators were determined in the laboratory, namely, pH and TA.
In this paper, the main research objectives are threefold: (1) expand the local grape spectral library to further include pH and titratable acidity maturity indicators; (2) validate the performance of the pre-trained CNN model that predicts °Brix from the in situ point spectra, which was developed using data from past growing seasons (i.e., 2020 and 2021) [27], using data from a completely new season (2023); (3) develop new models that further predict pH and total acidity, including examining the potential use of multi-output models that simultaneously predict all three maturity parameters. Additionally, the main novelties of the paper are the following: (1) Most studies focus on laboratory-collected spectra, whereas in this paper we focus exclusively on spectra collected in situ; (2) To our knowledge, no studies have examined the use of multi-output models to predict multiple oenological properties simultaneously; (3) A novel multi-input–multi-output CNN that incorporates a multi-head attention mechanism is proposed, combining the information of multiple spectral pre-treatments and predicting all maturity indicators simultaneously; the multi-head attention mechanism provides the model with the capacity to identify the wavelengths which are considered most important.
3. Results
3.1. Data Collected in the 2023 Growing Season
In this subsection, the results pertaining to the specific objective 1 noted in the Introduction are presented. The data collected and the major statistical moments thereof across the three years (2020, 2021, and 2023) are summarized in
Table 2. The results of 2020–2021 were reported in [
27], but are also presented in this table to aid the comparisons between the two campaigns (merged 2020–2021 campaign and 2023). To aid the comprehensibility of this table, the boxplots (overlaid with swarmplots) are also given in
Figure 3, and visually present the distributions for each variety for the 2023 data only. Grape samples were collected at different stages during ripening, so that the sample set included the widest possible range of values for °Brix, TA, and pH. Consequently, some out-of-range high pH values are observed in Malagouzia, as are some high titratable acidity values in Malagouzia and Syrah. This is due to the early harvesting of samples relative to the maturity of these two varieties.
The evolution of the grapes’ reflectance as maturity progresses is depicted in Figure 4. It appears that in Chardonnay and Syrah, higher reflectance values are obtained in the °Brix plots in the visible range, specifically at 500–700 nm. This is due to the maturity stage of the grapes: sampling started before veraison, when the grapes have a more greenish color, which consequently corresponds to a higher albedo (reflectance) for lower °Brix values.
The pairplot visualization of °Brix, pH, and TA (Figure 5), with kernel density estimates on the main diagonal and separation by grape variety, provides an overview of the relationships among these key parameters. The high correlations observed within this set of variables are indicative of their interconnected nature. Notably, the strongest correlation is found between °Brix (sugar content) and TA (total acidity), with a coefficient of 0.8 in absolute value, underlining the inverse relationship between sugar content and acidity in grapes. The correlation between pH and TA is slightly lower but still substantial at 0.72. This relationship emphasizes how acidity and pH levels are related, affecting the overall balance and taste of grapes.
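As a brief illustration of how such pairwise coefficients can be obtained, the sketch below computes the Pearson correlation coefficient in plain Python. The function name and the toy values are assumptions introduced for illustration only and are not measurements from this study; the correlation estimator actually used for the pairplot is not specified here. Note that for an inverse relationship the signed coefficient is negative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up ripening series: sugar content rises while acidity falls.
toy_brix = [12.0, 15.5, 18.0, 20.5, 23.0]
toy_ta = [14.0, 10.5, 8.0, 7.0, 5.5]

r_brix_ta = pearson(toy_brix, toy_ta)  # negative: inverse relationship
```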
3.2. Evaluation of the 2020–2021 Model on the 2023 Dataset
This section presents the performance of the CNN model developed in [
27] when applied to the new dataset collected in 2023, with a comparison with the original results of the independent set from the 2020–2021 dataset (
Table 3), addressing the specific objective 2. A key point that needs to be noted is that the results in 2023 are from the mean prediction across the five CNN models developed (due to the use of five-fold cross-validation). These results provide valuable insights into the model’s behavior under different varieties in a completely different cultivation year.
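The averaging step mentioned above can be sketched as follows. This is a minimal illustration of taking the sample-wise mean across the predictions of five fold-specific models; the function name and the numbers are hypothetical and do not come from the actual CNN models.

```python
def ensemble_mean(fold_predictions):
    """Average predictions sample-wise across several models.

    fold_predictions holds one list of predictions per model, all of the
    same length and in the same sample order.
    """
    n_models = len(fold_predictions)
    return [sum(sample) / n_models for sample in zip(*fold_predictions)]


# Hypothetical sugar content predictions of five fold models for two samples.
toy_fold_predictions = [
    [20.0, 18.0],
    [21.0, 18.5],
    [19.0, 17.5],
    [20.5, 18.0],
    [19.5, 18.0],
]
mean_prediction = ensemble_mean(toy_fold_predictions)
```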
Most notably, the CNN model demonstrated enhanced predictive performance when applied to the 2023 dataset for Chardonnay. In 2020–2021, the model achieved an RMSE of 2.10 °Brix, an R² of 0.63, and an RPIQ of 2.24. In 2023, however, these metrics improved, with an RMSE of 1.96 °Brix, an R² of 0.74, and an RPIQ of 3.39. This finding suggests that the model adapted well to the new vintage, providing more accurate predictions. In contrast to Chardonnay, the other wine grape varieties, Malagouzia, Sauvignon Blanc, and Syrah, experienced a decline in performance when the model was applied to the 2023 dataset. Sauvignon Blanc, in particular, exhibited a significant decrease in R² and RPIQ, dropping from 0.86 to 0.61 and from 4.11 to 2.48, respectively. These findings underscore the challenges of applying a model trained on historical data to new and independent datasets.
All in all, an interesting observation is the relatively consistent RMSE values across both time periods for all varieties. In 2020–2021, RMSE values ranged from 1.76 to 2.20, while in 2023, they ranged from 1.96 to 2.75. This consistency in RMSE values indicates that the model maintained a similar level of precision in predicting sugar content, despite variations in R² and RPIQ, which are also affected by the different variance and IQR values, respectively, found in the two datasets. Moreover, it should be noted that whereas the 2020–2021 dataset used a single berry whose sugar content was measured in situ with a portable refractometer, the 2023 dataset comprises spectra from three different berries within a single bunch whose sugar content was measured in the laboratory using conventional analytical methods.
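The dependence of the coefficient of determination and RPIQ on the spread of the reference values can be illustrated with a short sketch. The code below uses the commonly adopted definitions (R² as one minus the ratio of residual to total sum of squares, and RPIQ as the interquartile range of the observed values divided by the RMSE); these definitions, the quartile method, and the toy numbers are assumptions for illustration and may differ in detail from the exact computation used in this study.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean squared error."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpiq(observed, predicted):
    """Ratio of performance to interquartile range: IQR(observed) / RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Two made-up datasets with identical prediction errors (+1, -1, +1, -1, +1)
# but a different spread of the reference values.
errors = [1.0, -1.0, 1.0, -1.0, 1.0]
narrow_obs = [18.0, 19.0, 20.0, 21.0, 22.0]
wide_obs = [10.0, 15.0, 20.0, 25.0, 30.0]
narrow_pred = [o + e for o, e in zip(narrow_obs, errors)]
wide_pred = [o + e for o, e in zip(wide_obs, errors)]

# Same RMSE (1.0) in both cases, yet R² is 0.5 versus about 0.98 and
# RPIQ is 2.0 versus 10.0, purely because of the wider reference range.
narrow_metrics = (rmse(narrow_obs, narrow_pred), r_squared(narrow_obs, narrow_pred), rpiq(narrow_obs, narrow_pred))
wide_metrics = (rmse(wide_obs, wide_pred), r_squared(wide_obs, wide_pred), rpiq(wide_obs, wide_pred))
```

This is why a comparison across datasets with different variance is more informative when RMSE is read alongside the other two metrics.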
3.3. Prediction of Maturity (Brix, pH, and TA) on the 2023 Dataset
This subsection reports the results of the specific objective 3, and specifically examines the accuracy of different models (including multi-output models) on the 2023 dataset which includes all three maturity indicators. First, the results of the standard single-output machine learning models are reported, i.e., where each maturity indicator is predicted independently from the others. Then, multi-output models are examined where all maturity indicators are predicted simultaneously.
3.3.1. Standard Single-Output Machine Learning Models
The results presented in Table 4 offer insights into the performance of standard ML models when applied to the 2023 dataset for predicting the three critical oenological maturity indicators, namely, °Brix, pH, and titratable acidity (TA). The table summarizes the mean performance metrics for the best combination of learning algorithm and spectral pre-treatments for each maturity indicator and grape variety.
For the prediction of °Brix, the machine learning models generally exhibit consistent performance, with R² values ranging from 0.74 to 0.89 across different varieties. Chardonnay, utilizing the Random Forest (RF) model with ABS+SG1+SNV spectral pre-treatment, stands out with a notably high R² value of 0.86 and a high RPIQ of 4.49. This suggests that the model provides accurate and reliable predictions for the sugar content in Chardonnay grapes. Furthermore, Malagouzia with Partial Least Squares (PLS) modeling and REF spectral pre-treatment also exhibits a strong R² value of 0.89 and an RPIQ of 5.43, indicating good predictive performance. Overall, the machine learning models maintain consistent predictive accuracy for °Brix, similar to the results of the previous table (Table 3).
With respect to the pH predictions, the models perform reasonably well, with Chardonnay, utilizing the RF model with ABS+SG1+SNV spectral pre-treatment, achieving a remarkable R² value of 0.88 and a high RPIQ of 5.72. This suggests that the model excels in predicting the pH of Chardonnay grapes. However, Malagouzia displays comparatively lower predictive performance for pH, with an R² value of 0.44, indicating a less reliable prediction for this particular maturity indicator. The remaining varieties, Sauvignon Blanc and Syrah, exhibit moderate predictive performance. Overall, the machine learning models offer satisfactory pH predictions, except for Malagouzia, where improvements might be needed.
In the case of titratable acidity, the machine learning models perform well overall. It should be noted that in all cases the best learning algorithm was RF. The models exhibit R² values ranging from 0.67 to 0.83, indicating good predictive accuracy for TA. Chardonnay with RF and ABS+SG1+SNV pre-treatment lies at the lower end of this range, with an R² value of 0.67 and an RPIQ of 1.90. The other varieties attain higher R² values but also higher RMSE values; this is due to their higher variability, considering that the dataset includes samples that had high TA (not yet mature).
3.3.2. Multi-Output Models
The results of the multiple output models considered are reported in
Table 5. Reported are the best combination of learning algorithm (PLS or RF) and spectral pre-treatments (out of the nine considered) for the standard ML models, and the results of the proposed multi-input–multi-output CNN employing the multi-head attention mechanism. The best standard ML model was selected as the one minimizing the mean
across all three maturity indicators and five-folds. To aid the comparison between the single-output models and the multi-output ones,
Figure 6 provides the R² values in barplots.
Comparing these results, it is first clear that the proposed CNN model generally outperforms the best from the standard ML multi-output models, with only two exceptions (in Chardonnay, the Brix and pH predictions have lower accuracy; however, TA is predicted more accurately). However, comparing the best single-output models with the multi-output models, it may be seen that there is no clear winner. In general, the multi-output models have lower accuracy when predicting the Brix content, but perform slightly better in the other two maturity indicators. This may be attributed to the fact that the multi-output models strive to average out the prediction errors across all three maturity indicators, and the increase in model performance in pH and TA comes only to the detriment of the Brix content.
3.4. Model Interpretability
In the interpretability analysis of the proposed model, the focus was placed on identifying the spectral regions (or features) that the multi-head attention layer deems important (
Figure 7). This relative feature importance plot concerns all three maturity indicators simultaneously. Annotated on each plot are the five top-valued wavelengths for each variety.
For Chardonnay, it is evident that the focus is placed primarily between 550 and 900 nm, with distinct peaks at 640, 720, 800, 830, and 890 nm. Four more regions emerge as important, namely, around 1040, 1540, 2130, and 2390 nm. As far as Malagouzia is concerned, the model has not performed a very sparse feature identification, with focus placed on multiple regions of the spectrum. Still, some wavelengths that emerge are at 720, 1100, 1350, and 2040 nm. The relative importance plot for Sauvignon Blanc reveals that in this cultivar SWIR is the most important spectral region. In addition to bands in VNIR at 770 and 920 nm, the model focuses mostly on bands at 1510, around 2110, and at 2320 nm. With respect to the red wine variety (i.e., Syrah), the model tends to place particular emphasis between 550 and 1100 nm, with the top five peaks at 750, 770, 840, 930, and 1030 nm.
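One simple way to extract a handful of prominent wavelengths from a relative importance profile is sketched below. How the annotated wavelengths in Figure 7 were selected is not detailed here, so the rule used in this sketch (local maxima ranked by importance) is an assumption, as are the function name and the toy profile.

```python
def top_peaks(wavelengths, importance, k=5):
    """Return the wavelengths of the k most important local maxima.

    A point counts as a local maximum when it is higher than its left
    neighbor and at least as high as its right neighbor; the two end
    points of the profile are ignored.
    """
    peaks = [
        (importance[i], wavelengths[i])
        for i in range(1, len(importance) - 1)
        if importance[i - 1] < importance[i] >= importance[i + 1]
    ]
    peaks.sort(reverse=True)  # highest importance first
    return [wavelength for _, wavelength in peaks[:k]]


# Made-up importance profile sampled every 10 nm between 600 and 700 nm.
toy_wavelengths = list(range(600, 710, 10))
toy_importance = [0.1, 0.3, 0.2, 0.5, 0.4, 0.2, 0.6, 0.1, 0.2, 0.9, 0.3]

strongest_two = top_peaks(toy_wavelengths, toy_importance, k=2)
```

Ranking local maxima, instead of simply the highest individual values, avoids reporting several adjacent bands that belong to the same peak.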
4. Discussion
Building on our prior research, as detailed in [
27], where a CNN excelled in sugar content prediction for four grape varieties using data collected in 2020 and 2021, we present a novel approach. In the summer of 2023, new data were gathered from the same grapevines, encompassing in situ spectral measurements and laboratory determinations of pH and titratable acidity (TA). Our objectives were threefold: firstly, to enrich the local grape spectral library by incorporating pH and TA as additional maturity indicators; secondly, to validate the performance of the CNN model developed in previous seasons in predicting sugar content (
Brix) in the 2023 data; and lastly, to develop new models for pH and total acidity prediction, including an exploration of multi-output models that simultaneously predict all three maturity parameters. Practically speaking, these models may be accordingly applied to newly recorded in situ point spectra to ascertain in real time the maturity of wine grapes and thus help select the optimal harvest time. In tandem with the final point, a multi-input–multi-output CNN enhanced by a multi-head attention mechanism was proposed, which empowers the model to discern the most influential spectral wavelengths for prediction. To qualitatively describe the accuracy metrics below, “excellent fit” is used to describe
R² above 0.8 and RPIQ above 4, while “good fit” denotes R² above 0.6 and RPIQ above 2.
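These qualitative labels can be written as a small helper, shown below as an illustrative sketch with the coefficient of determination (R²) as the first metric. The function name and the fallback label for results below the “good fit” thresholds are hypothetical, since the text does not name that category; the example values are those reported in Sections 3.2 and 3.3.1.

```python
def describe_fit(r2, rpiq):
    """Map a pair of accuracy metrics to the qualitative labels used here."""
    if r2 > 0.8 and rpiq > 4:
        return "excellent fit"
    if r2 > 0.6 and rpiq > 2:
        return "good fit"
    # Label not defined in the text; used only as a placeholder.
    return "below good fit"


# Values reported in the Results section.
chardonnay_brix_single_output = describe_fit(0.86, 4.49)
chardonnay_brix_2023_validation = describe_fit(0.74, 3.39)
chardonnay_ta_single_output = describe_fit(0.67, 1.90)
```

Because both thresholds must be met, a model with an adequate R² but a low RPIQ (as in the third example) does not qualify as a good fit.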
With respect to the first goal, the grape spectral library was expanded in the 2023 field campaign to encompass pH and TA measurements, in addition to sugar content, for the four distinct grape varieties (i.e., Chardonnay, Malagouzia, Sauvignon Blanc, and Syrah), across the entire maturity period (
Figure 3). The cross-correlations observed in the data (
Figure 5) revealed a clear and consistent inverse relationship between acidity on the one hand and sugar content and pH on the other as the grapes ripened. Furthermore, the mean spectral plots (
Figure 4) indicated significant similarities across the maturity indicators, reinforcing the hypothesis that these intercorrelations could potentially be harnessed by a single multi-output model. This insight underscores the interplay between key oenological parameters and signifies the potential for more robust and unified predictive models that leverage these intricate relationships.
As far as the second goal is concerned, there is a notable difference in data acquisition methodologies between the 2020–2021 and 2023 datasets that should be acknowledged. In the former, we employed single-berry spectral acquisitions, whereas in the latter, we adopted a more complex approach, collecting spectra from three berries within the same bunch (subsequently extracting the mean reflectance), and conducted measurements of °Brix, pH, and TA from the must of the entire bunch after crushing. Despite this divergence in data acquisition, the results remained encouragingly consistent (Table 3). In particular, the RMSE increased by 20% in relative terms (mean across the varieties), with Chardonnay instead showing a decrease in RMSE. Across all varieties, the RPIQ metric was above 2, signifying a good fit. This consistency points to the potential for robust predictive models that can adapt to different data collection scenarios, in addition to their usage in a completely new growing season, lending further resilience to the system.
Turning to the third objective, we developed and evaluated a range of predictive models, encompassing both single-output machine learning models (such as PLSR, RF, and SVR) and multi-output models (PLSR and RF), in addition to our novel CNN equipped with a multi-head attention mechanism. The following observations may be noted from the single-output models. First, it is evident that models developed using the 2023 data were more robust in predicting the sugar content than the best model developed from the 2020–2021 data, with a mean RMSE of 1.66 °Brix compared to 2.40, respectively (Table 4). The R² and RPIQ values indicate excellent fit for Chardonnay, Malagouzia, and Syrah, with Sauvignon Blanc showing good fit. Second, the prediction performance for pH and TA is mixed and changes from one variety to the other, with the red grape variety (i.e., Syrah) producing the most robust results. For example, the Chardonnay and Syrah models have excellent fit for pH, while only a good fit is observed in TA for Sauvignon Blanc and Syrah. The worst results are observed for the pH predictions of Malagouzia and for the TA predictions of Chardonnay (R² of 0.44 and 0.67, respectively). Third, it is important to note that the spectral source used by the best model is consistent neither across the maturity indicators nor across the different varieties. Still, the first derivative and SNV transforms emerge as the ones able to produce more robust models.
Comparing the standard multi-output models (i.e., PLSR and RF) and the proposed model with the single-output results (
Figure 6), it is evident that the standard ML multi-output models in most cases perform slightly worse than their best single-output counterparts. It is noted that the best multi-output ML model uses a single spectral source from which it predicts all maturity indicators for each variety, while, crucially, the best single-output models are selected for each maturity indicator independently after considering all available spectral sources. Interestingly, it is the first and second derivatives that produce the most robust results in the standard ML multi-output models. The above indicates that, despite the intercorrelations and similar trends observed, if the goal is to maximize accuracy with standard ML models, then using the best combination of learning algorithm and spectral source for each maturity indicator may yield the best results. Moving to the proposed CNN model, the same figure shows that the model is slightly less accurate in terms of sugar content, but produces better results for pH and (most notably) for TA. Overall, there is no clear, consistent winner between the best single-output model per variety and indicator and the proposed model. It should be understood, however, that each CNN model (i.e., for each variety) is compared against the best out of 27 different models (three learning algorithms times nine spectral sources).
The outcomes of the third objective, while informative, underscored the complexity of simultaneously predicting multiple oenological properties. The multi-output models, while offering baseline performance, failed to unlock significant advantages in multi-parameter prediction. Intriguingly, the multi-output models displayed marginal improvements in predicting pH and TA, albeit at the expense of sugar content (Brix). These findings prompt a deeper exploration of multi-output models’ utility in viticulture and oenology and highlight the challenges associated with harmonizing predictions across diverse parameters.
Compared to other studies in the literature, our results are similar or better. Fadock et al. [47], in the Syrah variety, found lower R² and RMSE values in predicting °Brix, pH, and acidity, specifically an R² of 0.70 and RMSE of 1.09 for °Brix, an R² of 0.72 and RMSE of 0.06 for pH, and an R² of 0.31 and RMSE of 1.25 for TA. Pampuri et al. [48], in Chardonnay, reported an R² of 0.87 and RMSE of 1.90 for °Brix, an R² of 0.62 and RMSE of 0.14 for pH, and an R² of 0.80 and RMSE of 3.94 for TA. Rouxinol et al. [14] also report high R² values for °Brix (0.86) and TA (0.86) in red grape varieties using the 1100–2300 nm spectral range. Finally, Ping et al. [16], using the table grape variety Kyoto, reported an R² of 0.92 and RMSE of 1.01 for °Brix, and an R² of 0.94 and RMSE of 1.78 for TA.
Importantly, the proposed model incorporates a multi-head attention mechanism, which may be used to add a degree of interpretability to the black-box CNN model. Examining the relative feature importance ascribed to all maturity indicators (
Figure 7), and comparing them with the findings of our previous study that only examined sugar content [
27], the subsequent observations can be noted. First, the spectral region around 730 nm identified as important primarily for Malagouzia and Syrah (with noticeable peaks at the two other varieties as well) has been widely reported in the literature [
49,
50] and may be ascribed to the second overtone of a fundamental wide absorption band around 3390 cm⁻¹ in fructose and glucose, i.e., the main dissolved solids in the aqueous solution of the grape juice, due to the O–H bonds [51]. The 1510 nm band in Chardonnay and Sauvignon Blanc could be an overtone of the C=O band at 1730 cm⁻¹ associated with the carboxylic acid group of tartaric acid [52]. In the upper SWIR, the 2120 nm band identified in Chardonnay and Sauvignon Blanc may be potentially ascribed to a combination of the O–H bending and C–O stretching vibrations [53], while the 2320 nm band in Sauvignon Blanc may be due to the C–O bond in glucose, and specifically the second overtone of sharp absorption bands at 1080 cm⁻¹ in the fingerprint region [
51]. As pH is not associated directly with specific molecules, it is difficult to present similar interpretations.
With respect to the limitations of the present study, the focus was placed only on four different varieties collected from a single estate and (for pH and TA) only within a single growing season. Notably, there is some degree of variability in the prediction accuracy of pH and TA with respect to the variety examined. Thus, the results may not generalize well to different wine grape varieties and/or grape-growing regions, and future work is necessary to ascertain their generalizability. Moreover, three berries from the whole cluster were selected to measure their VNIR–SWIR spectrum, while the maturity indicators were estimated from the whole cluster. Therefore, despite our best efforts to select representative berries from the cluster, there may have been a small bias in the selection process, while the VNIR–SWIR spectrum from which the maturity indicators are estimated is only a partial snapshot of the whole cluster. This may only be ameliorated through the use of a setup that employs a hyperspectral imaging camera in order to capture the spectrum of the whole cluster.
In the future, this work can be extended to include more grape varieties and data from multiple growing seasons (particularly for pH and TA). Different multi-output models may also be studied to ascertain whether the simultaneous prediction can result in enhanced prediction accuracy. Another potential avenue of research is to examine whether models that incorporate multiple varieties in the calibration set are as robust as the variety-specific models that have been developed herein. Although most studies develop independent models for each cultivar, some studies focusing on table grapes have developed a single global model [
15] while others have compared the accuracy between (i) a global model and (ii) white- and red-grape-specific models [
54]. The use of in situ hyperspectral imaging is also a promising research avenue to transfer the models to automated agricultural robots that can determine the maturity degree of the entire field [
55].