1. Introduction
Wine grape variety identification poses a significant challenge for the wine industry [1]. Particularly when relying solely on a single grape leaf or fruit for varietal identification, traditional deep learning modeling methods often struggle to effectively distinguish varieties with a high degree of similarity [2]. The complexity of this challenge stems from the diversity of grapes, encompassing variations in morphology, color, size, and varietal characteristics influenced by different growing environments [3]. Traditional deep learning approaches falter in addressing this intricate and variable scenario [4].
To address this issue, a novel deep learning approach is proposed in this study to mitigate the challenges associated with using a single feature for variety identification [5]. By employing deep learning algorithms such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), a multifeature fusion strategy is introduced, integrating information from both grape leaves and fruits [6]. This methodology deviates from conventional deep learning techniques by emphasizing the amalgamation of information from diverse sources, thereby enhancing the accuracy of identifying similar varieties [7]. The significance of this study is that it fills a gap in the existing research on wine varietal recognition and proposes an innovative solution to the limitations of traditional deep learning models in this task. While traditional single-feature-dependent models perform poorly when facing varieties with high similarity, the multisource information fusion approach proposed in this paper provides new ideas for addressing this challenge. By optimizing the deep learning model and fusing multisource information into it, this study achieves more accurate identification of similar varieties, thus providing a more reliable varietal identification tool for the wine industry. This result is of great significance to grape growers, winemakers, and wine producers, as it can improve the accuracy and efficiency of varietal identification and thus promote improvements in wine quality and market competitiveness. Therefore, the work in this paper is not only academically innovative and practical, but also contributes positively to the development of the wine industry.
Deep learning has been extensively applied in varietal recognition across various agricultural and cash crops. Alper Taner [8] utilized a deep learning model for hazelnut varietal classification, achieving an accuracy of 98.63%. Karim Laabassi et al. [9] employed a deep learning approach for wheat variety classification, yielding classification accuracies ranging from 85% to 95.68%. Murat Koklu [10] achieved notable success in the classification of rice varieties using a deep learning approach. Chunguang Bi [11] devised a neural network capable of accurately and efficiently classifying corn seeds, meeting the stringent requirements of high-precision classification of corn seed images. Poornima Singh Thakur [12] demonstrated that their method outperformed several contemporary deep learning methods in crop disease identification, achieving an accuracy of 99.16% on the PlantVillage dataset. Resul Butuner [13] attained the highest classification success rate of 99.80% using an ANN algorithm that leveraged deep features obtained from the SqueezeNet model. Nurhadi Wijaya [14] employed a CNN for efficient citrus type classification, achieving a commendable accuracy of 96.0% in classifying oranges. Deep learning methods possess robust feature learning and representation capabilities, enabling the extraction of intricate characteristics of grape varieties from extensive datasets [15]. Bogdan Franczyk [16] developed a vineyard grape recognition model with a 99% accuracy rate in correctly identifying grape varieties that are relatively discernible. Amin Nasiri [17] employed a modified deep learning model that achieved an average classification accuracy of more than 99% in recognizing different grape varieties. Marco Sozzi [18] obtained promising results using the YOLOv3, YOLOv4, and YOLOv5 deep learning algorithms for the automatic detection of white grape varieties. Compared to traditional machine learning methods, deep learning exhibits superior performance in handling large-scale data, enabling the extraction of underlying patterns and enhancing model generalization capabilities [19]. In the realm of grape variety identification, deep learning not only enhances identification accuracy but also adapts well to variations in grape varieties across different regions, climates, and growing conditions, thereby exhibiting enhanced robustness [20]. Although deep learning has shown strong potential for grape variety identification, the wine industry still faces many challenges. Varietal mixing and cross-fertilization are common in the grape growing process, making it more difficult to distinguish between similar varieties [21]. In addition, the morphological characteristics of grape varieties are influenced by the growing environment, climate, and soil conditions, further increasing the complexity of identification [22]. Therefore, there is a need to develop a method to accurately identify grape varieties to meet the needs of the wine industry for quality assurance and product traceability.
Against this background, this study is dedicated to addressing the needs and challenges of varietal identification in the wine industry. Our goal is to construct an efficient and accurate grape variety identification model to provide reliable quality assurance and product traceability services. To address the challenges of recognizing grape varieties, we propose a novel approach that has the following three key contributions.
WineYOLO-RAFusion multiscale fusion network model: This innovative target detection model integrates multiscale fusion techniques to achieve more comprehensive information extraction, thus improving the recognition accuracy of wine grape varieties.
MultiFuseYOLO model: We propose a new model, MultiFuseYOLO, for recognizing wine grape varieties with high similarity, which integrates fruit and leaf features and improves the model’s ability to discriminate between varieties through comprehensive and accurate fusion of information from multiple sources.
SynthDiscrim algorithm: This algorithm is the core algorithm of the multisource information fusion method, which seamlessly integrates fruit and leaf information by using complex techniques such as weighted feature fusion, optimized threshold setting and combined discriminant score calculation.
2. Materials and Methods
2.1. Dataset
This experiment used the publicly available Embrapa WGISD dataset [23] as the primary experimental dataset and a self-harvested leaf dataset as the supplementary experimental dataset. Uses of the Embrapa WGISD include relaxation of the instance segmentation problem, classification, semantic segmentation, and object detection and counting; WGISD can also be used for grape variety identification. The dataset consists of 300 images containing 4432 grape bunches labeled with bounding boxes. Of these images, 137 were labeled with a binary mask to identify the pixels of each grape bunch, meaning that 2020 of the 4432 grape bunches were provided with binary masks for instance segmentation. The grape berries in WGISD are shown in Figure 1, and the number of images in WGISD is shown in Table 1.
This data collection took place in September 2023 at a wine estate in Yantai, Shandong Province, focusing on grape variety identification. The wine estate specializes in wine grapes, covering five varieties: Noble Aroma, Matheran, Cabernet Sauvignon, Sauvignon Blanc, and Chardonnay. Due to the limitations of the collection season, not all grape plants were fruiting at the time, so we focused our data collection on photographing the leaves.
During data collection, we used a selfie stick with a Sony lens and kept the shooting distance at 20–30 cm. This approach not only facilitated high-quality photography of the target, but also allowed a cell phone to be used for data supplementation, ensuring comprehensive and accurate collection. Camera lens parameters are shown in Table 2 below. The grape leaves in the self-collected dataset are shown in Figure 2, and the number of images in the self-harvested leaf dataset is shown in Table 3.
As can be seen from the datasets shown above, the publicly available dataset Embrapa WGISD contains mainly grape fruit bunches, so our annotation of it focused mainly on the bunches. The self-collected dataset, on the other hand, mainly contains grape leaves, so our labeling of the self-collected dataset mainly focused on the leaves.
These dataset characteristics required us to focus on different parts during the annotation process: for Embrapa WGISD we concentrated on the morphology and structure of fruit bunches, while for the self-collected dataset we paid more attention to leaf features and their related information. This annotation strategy helped ensure that our annotation work on the different datasets was precise and targeted.
2.2. WineYOLO-RAFusion Multiscale Fusion Network Modeling
First and foremost, we established a deep learning model as the cornerstone of integrated information fusion to distinguish it from other conventional deep learning models [24]. The model had to achieve high localization accuracy while also ensuring robust recognition accuracy for grape varieties.
The WineYOLO-RAFusion model, developed with YOLOv7 [25] as a benchmark, introduces the innovative Res-Attention module [26] and a CFP-centered feature pyramid multiscale fusion module to address the challenges associated with wine grape variety identification and fruit localization tasks.
The model inherits the real-time capabilities and high efficiency of YOLOv7, providing a strong foundation for the target detection task [27]. By incorporating the Res-Attention module, the model introduces residual connections and attention mechanisms during the feature extraction phase, significantly enhancing focus on the target region. This enhancement boosts the model’s sensitivity to grape variety features, thereby further improving the accuracy of variety identification.
Secondly, the WineYOLO-RAFusion model employs a CFP-centered feature pyramid multiscale fusion module to better accommodate the diversity of wine grapes. By constructing a centered feature pyramid, the model can fuse features at different scales in a more focused manner, enhancing its ability to represent grape berries across multiple scales [28]. This renders the model more adaptable to variations in fruit sizes and shapes in the fruit localization task, consequently improving the accuracy and robustness of fruit localization [29].
In summary, the WineYOLO-RAFusion model offers an efficient and accurate solution for varietal identification and fruit localization tasks in wine grapes by integrating YOLOv7, the Res-Attention module, and the CFP-centered feature pyramid multiscale fusion module. This model amalgamates the classical framework of target detection with advanced feature fusion techniques, providing robust support for intelligent production in the winemaking industry.
2.2.1. Res-Attention Module
The Res-Attention module is a new attention mechanism based on the CBAM [
30] (convolutional block attention module) module. The Res-Attention module aims to reduce the information loss that may occur during the information transmission process by introducing a residual structure to minimize the information loss that may occur during the transmission process. The structure of the Res-Attention module is shown in
Figure 3.
In the Res-Attention module, the feature map first enters the channel attention model, which retains some of the input information [31]. By introducing a spatial attention model after the channel attention model, some of the previously retained information is fused with the information output by the spatial attention module [32]. Eventually, the fused feature information is passed to the final output of the Res-Attention module.
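The residual wiring described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact placement of the residual additions is an assumption inferred from the text, and `channel_att`/`spatial_att` are placeholders for the two attention submodules.

```python
import numpy as np

def res_attention(F, channel_att, spatial_att):
    """Res-Attention sketch: residual adds around CBAM-style attention stages.

    F: input feature map (C, H, W).
    channel_att / spatial_att: callables returning reweighted feature maps.
    The residual placement here is an assumption based on the paper's description.
    """
    Fc = channel_att(F) + F    # retain part of the input via a residual add
    Fs = spatial_att(Fc) + Fc  # fuse retained info with the spatial attention output
    return Fs

# smoke check with "zero" attention: the residuals alone pass the input through
F = np.ones((2, 3, 3))
out = res_attention(F, lambda x: 0 * x, lambda x: 0 * x)
assert np.allclose(out, F)
```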
The channel attention mechanism first takes the input feature map F ∈ R^(C×H×W) and passes it through global maximum pooling and global average pooling to obtain two C × 1 × 1 feature maps, F_max^c and F_avg^c. The resulting feature maps then pass through a shared network consisting of a two-layer multilayer perceptron (MLP), with C/r neurons in the first hidden activation layer, where r is the shrinkage rate, and C neurons in the second layer. Afterwards, the output features of the shared network are summed element-wise, and the channel attention mechanism outputs the final complete feature vector M_c(F). The formula is shown below:

M_c(F) = σ(MLP(MaxPool(F)) + MLP(AvgPool(F))) = σ(W_1(W_0(F_max^c)) + W_1(W_0(F_avg^c)))

where σ denotes the sigmoid function; W_0 ∈ R^((C/r)×C) and W_1 ∈ R^(C×(C/r)); the MLP weights W_0 and W_1 are shared for both inputs; and W_0 is followed by the ReLU activation function.
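The channel attention computation can be sketched with NumPy as follows. This is a minimal sketch of the standard CBAM-style channel attention the text describes, not the paper's code; the weight shapes follow the W_0/W_1 definitions above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """CBAM-style channel attention sketch.

    F: (C, H, W) feature map; W0: (C//r, C); W1: (C, C//r), shared for both paths.
    """
    C = F.shape[0]
    f_max = F.reshape(C, -1).max(axis=1)   # global max pooling  -> (C,)
    f_avg = F.reshape(C, -1).mean(axis=1)  # global avg pooling  -> (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)  # shared MLP, ReLU after hidden layer
    Mc = sigmoid(mlp(f_max) + mlp(f_avg))         # channel weights in (0, 1)
    return F * Mc[:, None, None]                  # reweight each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C)) * 0.1
W1 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(F, W0, W1)
assert out.shape == F.shape
```

Because the attention weights pass through a sigmoid, each channel is scaled by a factor in (0, 1), so the output magnitudes never exceed the input's.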
The spatial attention mechanism takes the feature map output by the channel attention module as its input. The input feature map is first subjected to a maximum pooling operation and an average pooling operation along the channel axis to obtain two feature maps, F_max^s and F_avg^s, respectively. Then, the maximum-pooled and average-pooled features are concatenated along the channel dimension. Afterwards, the feature map is reduced to a single channel by a (7 × 7) convolution operation. Finally, after a sigmoid function, the spatial attention map M_s(F) is obtained. The formula is shown below:

M_s(F) = σ(f^(7×7)([F_avg^s; F_max^s]))

where σ denotes the sigmoid function and f^(7×7) denotes the convolution operation with a convolution kernel of size (7 × 7).
The design of this new Res-Attention module takes into account information integrity during feature transfer and emphasizes the importance of previously retained information by means of residual concatenation. This design allows the module to capture key features in the image more efficiently and achieve more comprehensive information fusion in both spatial and channel dimensions. The introduction of the Res-Attention module is expected to enhance the performance of the model in image processing tasks, especially in scenarios where a large amount of detailed information needs to be processed and contextual relationships need to be preserved.
2.2.2. FPN Feature Pyramid
The FPN (feature pyramid network) [33] is a deep learning network structure for target detection and semantic segmentation tasks, aiming to efficiently process feature information at different scales to enhance the model’s ability to perceive and localize targets.
The core idea of the FPN is to enable the network to focus on both details and overall information in an image by building a multiscale feature pyramid [34]. Its main components include the bottom feature map, the top feature map, and the lateral connections [35]. The bottom feature map is the original feature map generated by the backbone network (usually a pre-trained convolutional neural network such as ResNet [36] or VGG [37]). The top-layer feature map contains the high-level semantic features obtained from the bottom-layer feature map through multiple up-sampling and convolutional operations.
The lateral connections are responsible for connecting the bottom feature map to the top feature map in order to form a feature pyramid [38]. In this way, the FPN network is able to capture the multiscale features of the image at different levels, enabling the network to perceive both small and large targets well. In the task of target detection, the introduction of the FPN effectively solves the problem of uneven processing of different-sized targets by the network and improves the accuracy and robustness of target detection.
Overall, the FPN feature pyramid provides deep learning models with more comprehensive visual information by integrating multiple levels of feature information, making them more effective in processing multiscale objects. This design has enabled the FPN to achieve significant performance gains in tasks such as target detection and semantic segmentation, making it an important component in many visual tasks.
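The top-down pathway with lateral connections can be sketched as follows. This is a generic FPN illustration under stated assumptions: each pyramid level halves the spatial resolution, the 1 × 1 lateral convolutions are stood in for by random channel-mixing matrices, and upsampling is nearest-neighbor.

```python
import numpy as np

def fpn_top_down(features, out_ch=4, seed=0):
    """FPN top-down pathway sketch.

    features: list of (C_i, H_i, W_i) maps ordered bottom (high-res) to top
    (low-res), with each level half the spatial size of the one below.
    """
    rng = np.random.default_rng(seed)
    laterals = []
    for f in features:
        # stand-in for a learned 1x1 conv: mix channels down to out_ch
        Wlat = rng.standard_normal((out_ch, f.shape[0])) * 0.1
        laterals.append((Wlat @ f.reshape(f.shape[0], -1)).reshape(out_ch, *f.shape[1:]))
    outputs = [laterals[-1]]  # the top level passes straight through
    for lat in reversed(laterals[:-1]):
        # 2x nearest-neighbor upsampling of the coarser merged map
        up = outputs[0].repeat(2, axis=1).repeat(2, axis=2)
        outputs.insert(0, lat + up)  # merge: lateral connection + upsampled map
    return outputs

feats = [np.ones((3, 8, 8)), np.ones((5, 4, 4)), np.ones((7, 2, 2))]
outs = fpn_top_down(feats)
assert [o.shape for o in outs] == [(4, 8, 8), (4, 4, 4), (4, 2, 2)]
```

Every output level keeps its input's spatial resolution but carries information merged down from all coarser, more semantic levels, which is exactly the multiscale perception the text describes.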
2.2.3. CFP Centered Feature Pyramid Multiscale Fusion Module
Based on the structure and functions of the FPN introduced above, the CFP is a module that improves and enhances the FPN; the WineYOLO-RAFusion proposed in this paper incorporates an FPN-based CFP module. The centralized feature pyramid (CFP) [39] is an innovative target detection algorithm whose core idea is to introduce a global explicit centralized conditioning scheme to optimize the construction of the feature pyramid and the information extraction process. Compared with existing methods, the CFP not only focuses on the feature interactions between different layers but also considers the feature adjustment within the same layer, which shows significant advantages especially in dense prediction tasks.
This approach proposes a spatially explicit visual center scheme consisting of a lightweight multilayer perceptron (MLP) [40] for capturing global long-range dependencies, and a learnable visual center for aggregating local critical regions. By globally centralizing the supervision of commonly used feature pyramids in a top-down manner, the CFP effectively enhances the extraction of multiscale feature information and achieves consistent performance improvement on a strong object detection baseline.
The overall architecture includes the input image, the CNN backbone for extracting the visual feature pyramid, the introduced explicit visual center (EVC), the global centralized regulation (GCR), and the decoupled head network for target detection, which consists of a classification loss, a regression loss, and a segmentation loss, where C denotes the class size of the dataset used. The CFP module structure is shown in Figure 4.
The specific flow of the CFP implementation is as follows. The input image is fed to the backbone network to extract a five-layer feature pyramid X, where the spatial size of each layer of features is 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image, respectively. The top layer of the feature pyramid (i.e., X4) is processed by an EVC structure, with the proposed lightweight MLP architecture capturing the global long-range dependencies of X4 (the lightweight MLP architecture is not only simpler in structure, but also lighter in size and more computationally efficient than a transformer encoder based on a multi-head attention mechanism). A learnable visual center mechanism is used together with the lightweight MLP to aggregate the local corner regions of the input image based on the proposed EVC, so that the shallow features of the feature pyramid can simultaneously benefit from the visual centralization information of the deepest features in an efficient manner; the explicit visual centralization information obtained from the deepest in-layer features is used to simultaneously modulate all the shallower features (using GCR to modulate X1, X2, and X3). These features are then aggregated into a decoupled head network for classification and regression.
The EVC consists mainly of two blocks connected in parallel: the lightweight MLP and the LVC. The resulting feature maps of these two blocks are concatenated along the channel dimension as the output of the EVC used for downstream recognition. Between X4 and the EVC, a stem block is used for feature smoothing instead of operating directly on the original feature maps. The stem block consists of a 7 × 7 convolution with an output channel size of 256, followed by a batch normalization layer and an activation function layer.
The lightweight MLP consists of two residual modules: a depthwise separable convolution-based module and a channel MLP-based module. The input of the channel MLP module is the output of the depthwise separable convolution module. Both modules undergo channel scaling and DropPath operations to improve feature generalization and robustness. Compared with a spatial MLP, the channel MLP not only effectively reduces the computational complexity but also meets the requirements of general-purpose vision tasks. Finally, both modules implement channel scaling, DropPath, and residual connection operations. The LVC is an encoder with an intrinsic dictionary consisting of an intrinsic codebook B = {b_1, b_2, …, b_K} (where N = H × W is the total spatial number of input features, and H and W denote the spatial magnitudes of the height and width of the feature maps, respectively) and a set of learnable visual-center scale factors S = {s_1, s_2, …, s_K}. The processing of the LVC consists of two main steps: first, encoding the input features using a set of convolutional layers and further processing them using CBR blocks; and second, combining the encoded features with the intrinsic codebook through the set of learnable scale factors. For this purpose, a set of scale factors s_k is used in sequence such that x̃_i and b_k map the corresponding position information. The information about the kth codeword in the whole image can be computed in the following way:

e_k = Σ_{i=1}^{N} [ e^(−s_k ‖x̃_i − b_k‖²) (x̃_i − b_k) ] / [ Σ_{j=1}^{K} e^(−s_j ‖x̃_i − b_j‖²) ]

where x̃_i is the ith pixel point, b_k is the kth learnable visual codeword, and s_k is the kth scale factor, also set as a learnable parameter. The term e^(−s_k ‖x̃_i − b_k‖²) carries the position information of each pixel relative to the codeword, and K is the total number of visual centers. A fully connected layer and a 1 × 1 convolutional layer are then used to predict salient key class features. Finally, the input features from the stem block and the localized corner region features with scale factor coefficients are subjected to channel multiplication and channel addition.
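The codeword aggregation can be written compactly as a soft assignment of pixel residuals to codewords. The sketch below implements the e_k formula above over flattened features; the inputs are random stand-ins, not the paper's learned parameters.

```python
import numpy as np

def lvc_codewords(X, B, s):
    """LVC-style codeword aggregation sketch (soft-assignment encoding).

    X: (N, C) flattened pixel features, N = H * W.
    B: (K, C) learnable codebook; s: (K,) learnable scale factors.
    Returns e: (K, C), the aggregated information for each codeword.
    """
    R = X[:, None, :] - B[None, :, :]        # residuals x_i - b_k      -> (N, K, C)
    d2 = np.sum(R ** 2, axis=2)              # squared distances        -> (N, K)
    w = np.exp(-s[None, :] * d2)             # scaled similarities
    w = w / w.sum(axis=1, keepdims=True)     # normalize over the K codewords
    return np.einsum('nk,nkc->kc', w, R)     # weighted residual sum    -> (K, C)

rng = np.random.default_rng(2)
N, K, C = 16, 4, 8                           # e.g. a 4x4 map, 4 visual centers
e = lvc_codewords(rng.standard_normal((N, C)),
                  rng.standard_normal((K, C)),
                  np.abs(rng.standard_normal(K)))
assert e.shape == (K, C)
```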
The contribution of the CFP is to propose an intra-layer feature conditioning method for the feature pyramid and a top-down global centralized conditioning strategy, which provides a more effective means of information extraction and feature optimization for the target detection task. The introduction of this algorithm injects innovation into the research framework of this paper and provides strong support for improving target detection performance.
2.3. MultiFuseYOLO: A Study of Multisource Information Fusion Methods
In order to further improve the accuracy and robustness of grape variety identification, we added a novel combinatorial discriminative method based on multisource information fusion to the WineYOLO-RAFusion model, obtaining a new model, MultiFuseYOLO. This model aims to achieve more reliable varietal identification by combining optimized leaf and fruit information with effective processing of multitarget detection results. MultiFuseYOLO inherits the architecture of the WineYOLO-RAFusion model but incorporates a multisource information fusion mechanism in the final stage. The core of this mechanism is the SynthDiscrim algorithm, which uses a user-adjustable threshold: when the probability of classifying a fruit or leaf falls below the threshold, the composite discrimination is triggered. The flowchart of the multisource fusion method is shown in Figure 5.
2.3.1. SynthDiscrim Algorithm
In the integrated discrimination process, the multiple leaves and fruits detected in the image are first integrated. The integrated discrimination scores of leaves and fruits are obtained by accumulating the classification probabilities of all leaves and fruits for each variety. Assume that, for a particular variety, there are N leaf detections and M fruit detections with classification probabilities P_leaf,i (i = 1, …, N) and P_fruit,j (j = 1, …, M), respectively. Then, the averages of the classification scores for fruits and leaves, denoted S_fruit and S_leaf, respectively, are calculated to combine the effects of the two. The formulas for S_fruit and S_leaf are shown below:

S_fruit = (1/M) Σ_{j=1}^{M} P_fruit,j
S_leaf = (1/N) Σ_{i=1}^{N} P_leaf,i

After obtaining the composite judgment scores S_fruit and S_leaf for fruits and leaves, a final composite score S_final is obtained by multiplying the average classification scores of fruits and leaves by the corresponding weights. S_final is calculated using the following formula:

S_final = w_fruit · S_fruit + w_leaf · S_leaf

where S_final denotes the final composite judgment score of the grapes in the image; w_fruit and w_leaf denote the weights used for the fruit and leaf classification scores, respectively; and S_fruit and S_leaf denote the average fruit and leaf classification scores, respectively.
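The scoring rule just described can be sketched in plain Python. The weight values and the per-variety detection probabilities below are illustrative assumptions; the paper does not state the actual weights, which would be tuned or set by the user.

```python
def synth_discrim_score(fruit_probs, leaf_probs, w_fruit=0.6, w_leaf=0.4):
    """Combined discriminant score for one candidate variety (sketch).

    fruit_probs / leaf_probs: classification probabilities of the M fruit and
    N leaf detections for this variety. Weight values here are assumptions.
    """
    s_fruit = sum(fruit_probs) / len(fruit_probs) if fruit_probs else 0.0
    s_leaf = sum(leaf_probs) / len(leaf_probs) if leaf_probs else 0.0
    return w_fruit * s_fruit + w_leaf * s_leaf  # S_final = w_f*S_fruit + w_l*S_leaf

# hypothetical detections: pick the variety with the highest combined score
detections = {
    "Cabernet Sauvignon": ([0.7, 0.6], [0.8]),  # (fruit probs, leaf probs)
    "Sauvignon Blanc": ([0.3, 0.2], [0.4]),
}
scores = {v: synth_discrim_score(f, l) for v, (f, l) in detections.items()}
best = max(scores, key=scores.get)  # -> "Cabernet Sauvignon"
```

Averaging before weighting keeps the score comparable across images with different numbers of detections, so a variety is not favored merely because more of its organs were detected.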
2.3.2. EMASlideLoss Function
EMASlideLoss uses an exponential moving average (EMA) to smooth SlideLoss. EMA is an averaging method that gives more weight to the most recent data and progressively less weight to older data. In the formula, α is a parameter called the smoothing coefficient, which determines the weight of the newer observations relative to the older ones:

EMA_t = α · L_t + (1 − α) · EMA_{t−1}

where EMA_t is the EMA at time t, L_t is the SlideLoss at time t, and α is the smoothing factor, usually between 0 and 1, which determines the weight of the new data.
The role of EMA is to provide better smoothing; by introducing EMA, EMASlideLoss aims to reduce fluctuations in the loss values. This helps the model learn more consistently, especially when noise or outliers occur during training, and it also supports good generalization: smoothing the loss values may help the model generalize better to unseen data, as it relies more on stable trends rather than local noise. At each training step t, EMASlideLoss calculates the current SlideLoss value L_t and then updates the EMA value using the EMA formula. The EMA value is treated as the smoothed loss and is used to guide the update of the model parameters. This approach prevents the model from being overly sensitive to outliers in the training set and improves the stability of the model. As the smoothing coefficient α moves closer to 1, new observations have a greater effect on the EMA and the EMA is more sensitive; the closer α is to 0, the smoother the EMA is and the more resistant it is to noise and fluctuations.
Overall, the goal of EMASlideLoss is to improve the training of deep learning models by introducing exponential moving averages to improve the stability and generalization of training.
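The EMA update is a one-line recurrence; the sketch below shows only the smoothing step, not the SlideLoss weighting itself, and the choice to initialize the EMA with the first observation is an assumption.

```python
def ema_smooth(losses, alpha=0.1):
    """Exponentially smooth a sequence of per-step loss values (sketch).

    alpha near 1 tracks new values closely; alpha near 0 resists noise.
    Initializing with the first observation is an assumption for illustration.
    """
    ema = losses[0]
    smoothed = [ema]
    for loss in losses[1:]:
        ema = alpha * loss + (1.0 - alpha) * ema  # EMA_t = a*L_t + (1-a)*EMA_{t-1}
        smoothed.append(ema)
    return smoothed

# a noisy spike at step 3 is damped in the smoothed sequence
raw = [1.0, 0.9, 0.8, 5.0, 0.7]
smooth = ema_smooth(raw, alpha=0.2)
assert smooth[3] < raw[3]  # the outlier's influence is reduced
```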
4. Conclusions
In this study, we explored the task of identifying wine grape varieties. Through separate experimental analyses of fruit and leaf characteristics, we found that certain wine grape varieties posed greater difficulties when identified and classified based on fruit or leaf characteristics alone. Recognizing this challenge, this paper aimed to address this issue by improving the performance of grape variety identification models at the model level.
To address this problem, we developed a highly specialized target detection model, WineYOLO-RAFusion, based on the YOLOv7 model, which was carefully designed to excel in classifying wine grape varieties. However, since WineYOLO-RAFusion has difficulty in distinguishing between wine grape varieties that are highly similar, a new model, MultiFuseYOLO, was proposed to address this problem. MultiFuseYOLO is obtained by adding a multisource information fusion method to the WineYOLO-RAFusion model. The core of the multisource information fusion method is the SynthDiscrim algorithm, which seamlessly integrates fruit and leaf information using techniques such as weighted feature fusion, optimized threshold setting, and combined discriminant score calculation. These improvements greatly enhanced the model’s ability to accurately recognize various grape varieties.
On the experimental side, we rigorously compared the model proposed in this paper with other state-of-the-art models (e.g., YOLOv5, YOLOX, YOLOv7, and WineYOLO-RAFusion), and the results highlighted the superior performance of MultiFuseYOLO in terms of precision, recall, and F1 score. Notably, the model exhibited extremely high accuracy, especially in recognizing similar-looking varieties.
In conclusion, the focus of this study was to improve the performance of grape variety identification at the model level. The innovation of MultiFuseYOLO by combining the WineYOLO-RAFusion model with the multisource information fusion approach is a major advancement in the accurate identification of wine grape varieties. These contributions provide powerful and effective solutions to the challenges encountered in wine grape variety identification.