1. Introduction
Grapes are one of the most important fruits in the world, and the healthy and stable development of their industry is of great significance to the national economic development and farmers’ income increase. In the cultivation of grapes, the larger the planting area, the larger the scale of damage when disease occurs and the greater the economic loss caused. Among grape leaf diseases, black rot, brown spot, and verticillium are the most common, of which black rot is one of the most important grape diseases worldwide. Black rot is a fungal disease that causes yield loss in grapes, showing black spots on leaves and fruit, and it is prevalent in the wetter spring and early summer seasons and affects a wide range of areas. Therefore, the rapid and accurate identification of grape leaf diseases and implementation of preventive measures can greatly reduce the degree of its harm in favor of increasing grape production and income [
1]. At present, grape diseases mainly rely on agricultural experts for on-site identification, and manual identification is subjective, time-consuming, and labor-intensive, so it is important to develop a fast, accurate, and intelligent grape disease identification system [
2].
Computer vision is widely used in the field of agriculture, and with the development of image processing and computer technology, image segmentation methods have experienced three basic stages, as follows: classical segmentation methods, machine learning methods, and deep learning methods. These methods have been applied in agricultural disease detection.
Traditional image segmentation techniques, such as threshold segmentation, can distinguish lesions from the background by using color and texture properties. Then, each pixel in the image is compared; if its gray value is greater than the threshold, the pixel is classified into one category, and if its gray value is less than the threshold, it is classified into another category. Dutta et al. [
3] proposed a method for the efficient real-time segmentation of diseased leaves on kohlrabi plots by adjusting VI and Otsu thresholds. ZixiLiu et al. [
4] used the Otsu method, OpenCV morphological operations, and morphological transformations to extract the contours of corn gray leaf spot, corn rust, northern corn leaf blight, and healthy corn leaves, and used these contours to obtain the difference set between the corn leaves and the background, thereby obtaining complete corn leaf images. Classical image segmentation methods require high image quality, and if environmental conditions change during image acquisition, the recognition results become poor or even invalid. Therefore, the versatility and robustness of these methods are unsatisfactory, and their accuracy in practical applications cannot be guaranteed.
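The threshold-based pipeline described above can be sketched in a few lines. The following is a minimal illustration, assuming OpenCV and hypothetical file names; it is not the cited authors' implementation.

```python
# Minimal sketch of threshold-based lesion/background separation with Otsu's
# method. File names and the grayscale channel choice are illustrative
# assumptions, not the cited pipelines.
import cv2

img = cv2.imread("leaf.jpg")                      # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # compare gray values to a threshold
blur = cv2.GaussianBlur(gray, (5, 5), 0)          # light smoothing before Otsu

# Otsu automatically selects the threshold separating the two classes:
# pixels above it fall into one category, pixels below into the other.
thresh_val, mask = cv2.threshold(blur, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological opening removes small speckles, similar to the OpenCV
# morphological operations mentioned above.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
clean = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cv2.imwrite("mask.png", clean)
```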
With the development of machine learning, many researchers have started to apply it to disease spot segmentation to improve the accuracy and robustness of segmentation. Attique Khan et al. [
5] used a genetic algorithm (GA) to add a feature selection step that further speeds up the process of obtaining improved classification results using support vector machines. Ambarwari et al. [
6] used Support Vector Machine (SVM) with RBF kernel for plant species recognition with 82.67% accuracy. S. Appeltans et al. [
7] removed soil pixels from hyperspectral images through LDA classification and a custom noise filtering algorithm. Machine learning methods can yield satisfactory segmentation results using small sample sizes, but these methods require multiple steps of image preprocessing and are relatively complex to execute. In addition, machine learning-based segmentation methods are relatively weak in unstructured environments and require researchers to manually design feature extraction and classifiers, which makes the work more difficult.
With the improvement of computer hardware performance, deep learning has developed rapidly. Currently, common deep learning algorithms include the fully convolutional network (FCN) proposed by Long et al. [
8] to address the extremely high memory cost and low computational efficiency of patch-based CNNs. Zhang et al. [
9] established a full convolutional network (FCN)-based segmentation model for wheat spikelets, which effectively achieves the segmentation of wheat spikelets in the field environment. Badrinarayanan et al. [
10] proposed SegNet, which uses an inverse convolutional filter to replace the traditional up-sampling operation, eliminating the need to learn to increase the sampling rate. Zhao et al. [
11] proposed PSPNet, which exploits global prior information and is effective in obtaining high-quality results in scene semantic analysis. Deepak Kumar et al. [
12] used pyramid scene parsing network (PSPNet) and fuzzy rule model to develop an innovative multilevel model (PSGIC) for estimating wheat leaf rust and its infection level. Chen et al. successively proposed Deeplab [
13], DeeplabV2 [
14], DeeplabV3 [
15], and DeeplabV3+ [
16], which can efficiently extract multi-scale image semantic information. Cai et al. [
17,
18] used a modified DeeplabV3+ to segment maple leaves and spots, and then assessed the extent of disease damage. Ronneberger et al. [
19] proposed a new model called U-Net. The U-Net network improves the FCN network by combining encoding paths that capture contextual information and decoding paths used for precise positioning, which splices high-resolution features with decoder up-sampled output features by jumping structures. Yi, Liu, et al. [
20,
21] performed algorithm improvement based on U-Net for light bark birch and rice segmentation. Chen et al. [
22], based on the U-Net network, proposed BLSNet to improve the accuracy of rice lesion segmentation through an attention mechanism and multi-scale extraction. Aiming at the problems of low crop classification accuracy, insufficient plant disease feature extraction, and inaccurate disease edge segmentation in traditional plant classification models, this paper proposes an improved U-Net-based plant disease segmentation method, i.e., CVU-Net. The experimental results show that CVU-Net can balance various requirements, such as pixel accuracy and mean intersection over union, can segment small lesions well, and has a good segmentation effect on lesion edges.
The contribution of this paper mainly includes the following three parts:
A grape black rot spot segmentation model CVU-Net is proposed to achieve the accurate segmentation of grape black rot spots.
A dual-attention mechanism is incorporated into the U-Net encoding network, enabling the model to better capture the edge, texture, and semantic information of the target, thereby producing more accurate segmentation results.
The use of multiple atrous convolutions in the multi-scale ASPP module to sample the input features in parallel enriches the semantic information by expanding the receptive field, and encoding the global context with image-level features avoids segmentation errors caused by relying only on local features, thus improving the segmentation performance of the network.
Because some of the diseased spots on grape leaves are small and the lesion edges are blurry, traditional deep learning methods cannot identify them accurately. Therefore, to improve the accuracy of disease semantic segmentation, this paper proposes an improved U-Net network, called CVU-Net. In this network, we use the VGG network as the backbone feature extraction network [
23], add an attention mechanism module to the feature extraction part, and introduce the ASPP module to increase the field of view of the filters. We compared the segmentation performance of the traditional U-Net, DeeplabV3+, PSPNet, and the proposed CVU-Net on the grape disease dataset. The experimental results show that CVU-Net outperforms the other compared networks in terms of segmentation performance. The improvements in this paper significantly enhance the segmentation capability of the network and effectively improve the segmentation accuracy of disease images.
The remainder of this article is structured as follows. Section 2 describes the materials and methods of the experiments. Section 3 presents the experimental results with a detailed discussion. Section 4 discusses the experiments and suggests directions for future work. Finally, Section 5 presents the conclusions.
2. Materials and Methods
2.1. Data Acquisition and Preprocessing
For this study, we utilized the publicly available Plant Village dataset, which comprises 54,309 RGB images showcasing symptoms of 26 common diseases found on the leaves of 14 different plant species. Among these, this paper specifically selected 262 images of grape leaves afflicted with black rot as our test subjects. All of these images are verified by researchers with expertise in grape diseases.
To label and segment the dataset, this paper employed LabelMe 4.5.13. LabelMe is a visual annotation tool developed in Python 3.7.0 and designed using the Qt5 graphics library. It is used for labeling images for tasks such as semantic segmentation and object detection. Moreover, LabelMe supports label generation in VOC and COCO formats.
In this paper, LabelMe was used to delineate the shape and location of grape leaf spots by drawing closed polygonal regions. The labeled data were stored in JSON format, and the data labels were subsequently converted into label PNG images using the json_to_dataset command. In these label images, the black areas represent the background, while the red areas represent the leaf spots, as illustrated in
Figure 1.
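The conversion from polygon annotations to label images can be reproduced with a short script. The sketch below, assuming the standard LabelMe JSON keys (shapes, points, imageHeight, imageWidth) and hypothetical file names, draws each annotated polygon into an indexed PNG; it stands in for the json_to_dataset command mentioned above rather than reproducing it.

```python
# A minimal sketch of converting a LabelMe polygon annotation (JSON) into a
# label PNG. File names and the palette (black background, red lesions) are
# illustrative assumptions.
import json
import numpy as np
from PIL import Image, ImageDraw

with open("leaf_0001.json") as f:          # hypothetical LabelMe annotation
    ann = json.load(f)

h, w = ann["imageHeight"], ann["imageWidth"]
mask = Image.new("L", (w, h), 0)           # 0 = background
draw = ImageDraw.Draw(mask)
for shape in ann["shapes"]:                # each closed polygon = one lesion
    pts = [tuple(p) for p in shape["points"]]
    draw.polygon(pts, fill=1)              # 1 = lesion class

# Save as an indexed PNG so class 1 renders as red, as in Figure 1.
label = Image.fromarray(np.asarray(mask), mode="P")
label.putpalette([0, 0, 0, 128, 0, 0] + [0, 0, 0] * 254)
label.save("leaf_0001_label.png")
```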
In a real environment, collected image datasets may be affected by weather, light, dust, and other factors, so making the dataset more consistent with the real environment further improves the robustness and generalization ability of the model. This article therefore performs data augmentation through random transformation, adjustment of image brightness and contrast, noise addition, and translation. Subsequently, the images were resized to a resolution of 256 × 256 pixels. This process resulted in a total of 2096 experimental images, forming the grape leaf spot dataset PD1.
Figure 2 illustrates some of the augmentation outcomes.
Using the dataset construction methodology described above, the grape disease images were randomly divided, with 90% allocated to the training set and the remaining 10% designated as the test set. To account for the inherent randomness in this process, multiple tests were conducted to enhance accuracy.
2.2. Data Enhancement
- (1)
Random rotation transformation
To verify more possibilities, this article simulates multi-angle shooting datasets and uses rotation and mirror-flipping methods for data enhancement. Random rotation is calculated as follows:
Set the pixel coordinates of the image before rotation to (x, y). Expressing them in polar form with radius r and angle α, the coordinates are written as follows:
$$x = r\cos\alpha, \qquad y = r\sin\alpha \tag{1}$$
After rotating by angle θ, the coordinates of the corresponding pixel point in the image are (x′, y′), and the coordinates at this time are expressed as follows:
$$x' = r\cos(\alpha - \theta), \qquad y' = r\sin(\alpha - \theta) \tag{2}$$
The equivalent transformation is as follows:
$$x' = r\cos\alpha\cos\theta + r\sin\alpha\sin\theta, \qquad y' = r\sin\alpha\cos\theta - r\cos\alpha\sin\theta \tag{3}$$
Putting Formula (1) into Formula (3), we obtain the following:
$$x' = x\cos\theta + y\sin\theta, \qquad y' = y\cos\theta - x\sin\theta \tag{4}$$
Mirror flipping includes vertical mirror flipping and horizontal mirror flipping. Vertical mirror flipping uses the horizontal midline as the axis and flips the image vertically, while horizontal mirror flipping uses the vertical midline as the axis and flips the image horizontally.
- (2)
Brightness and contrast adjustment
Due to the influence of weather and light, the clarity of the dataset will be affected when collecting the dataset. To better fit the situation of grape diseases in the natural environment, this article expands the dataset by adjusting the brightness and contrast to make the dataset model as close as possible to various situations encountered in the natural environment.
Adjust brightness as follows: the brightness of an image is changed by directly applying addition, subtraction, multiplication, or division operations to each pixel value. Let R represent the original RGB value, R′ the adjusted RGB value, and g the adjustment factor; the brightness adjustment formula is shown in (5):
$$R' = g \cdot R \tag{5}$$
Adjust contrast as follows: fine and effective contrast adjustment can be achieved by learning a contrast transformation function. Assuming m to be the median of the image brightness, and with R, R′, and g having the same meanings as above, the specific calculation is shown in (6):
$$R' = m + (R - m)\cdot g \tag{6}$$
- (3)
Add noise as follows:
Adding noise simulates the interference factors that appear in the real world; the performance of the segmentation algorithm can thus be tested and evaluated on noisy images, and the robustness of the algorithm can be further improved.
Gaussian noise is random noise that obeys a Gaussian (normal) distribution. It adds random perturbations following a bell-shaped curve to the image and has two parameters, the mean and the variance: the mean determines the axis of symmetry of the curve, and the variance determines the width of the normal distribution curve. Its probability density function is shown in Equation (7), where x is the random variable, μ is the mathematical expectation, and σ² is the variance:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{7}$$
Salt-and-pepper noise, also known as impulse noise, is a type of noise commonly used in image processing. Its characteristic is that black pixels or white pixels will appear randomly in the image, or they may appear at the same time. The occurrence of salt-and-pepper noise can be introduced due to sensor failure, signal transmission errors, or other issues during image acquisition.
Salt-and-pepper noise usually causes obvious black and white spots in the image, seriously affecting the look and quality of the image. Gaussian noise will make the image as a whole feel blurry and distorted, reducing the clarity and contrast of the image. This paper uses a combination of Gaussian noise and salt-and-pepper noise to improve the robustness of the algorithm and greatly increase the generalization effect of the model.
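The augmentations described in this section can be combined in a short pipeline. The sketch below is a minimal illustration assuming OpenCV/NumPy, hypothetical file names, and illustrative parameter ranges (rotation angles, adjustment factors g, noise levels) that are not the paper's exact settings; the brightness and contrast steps follow Formulas (5) and (6) as reconstructed above.

```python
# Compact augmentation sketch: random rotation / mirror flips, brightness and
# contrast adjustment, and Gaussian plus salt-and-pepper noise.
import cv2
import numpy as np

def random_rotate_flip(img, rng):
    angle = rng.uniform(-30, 30)                          # assumed angle range
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)                            # horizontal mirror flip
    if rng.random() < 0.5:
        img = cv2.flip(img, 0)                            # vertical mirror flip
    return img

def adjust_brightness_contrast(img, g_b=1.2, g_c=1.3):
    out = np.clip(img.astype(np.float32) * g_b, 0, 255)   # R' = g * R        (5)
    m = np.median(out)
    out = np.clip(m + (out - m) * g_c, 0, 255)            # R' = m + (R - m)g (6)
    return out.astype(np.uint8)

def add_noise(img, rng, sigma=10.0, sp_ratio=0.01):
    noisy = img.astype(np.float32) + rng.normal(0, sigma, img.shape)  # Gaussian
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    coords = rng.random(img.shape[:2])
    noisy[coords < sp_ratio / 2] = 0                      # pepper pixels
    noisy[coords > 1 - sp_ratio / 2] = 255                # salt pixels
    return noisy

rng = np.random.default_rng(0)
img = cv2.imread("leaf.jpg")                              # hypothetical input
aug = add_noise(adjust_brightness_contrast(random_rotate_flip(img, rng)), rng)
aug = cv2.resize(aug, (256, 256))                         # final 256 x 256 size
cv2.imwrite("leaf_aug.png", aug)
```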
2.3. Experiment Platform and Evaluation Metrics
The hardware and software configurations used for the experiments in this paper are shown in
Table 1.
In our experiments, this paper employed two evaluation metrics, pixel accuracy (PA) and mean intersection over union (MIoU), to assess the segmentation performance on grape disease images. In the formulas below, k denotes the total number of categories, $p_{ij}$ denotes the number of pixels belonging to category i but predicted as category j, $p_{ii}$ denotes the number of correctly predicted pixels, and $p_{ij}$ and $p_{ji}$ denote false positive and false negative results, respectively.
- (1)
Pixel accuracy (PA)
PA represents the ratio of correctly predicted pixels to the total number of pixels. Its calculation formula is as follows:
$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$$
- (2)
Mean intersection over union (MIoU)
MIoU is a widely used evaluation metric in experimental studies of semantic segmentation. It involves calculating, for each category, the ratio of the intersection between the real and predicted sets to their union, and then averaging across all categories. The calculation formula is as follows:
$$MIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
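The two metrics are straightforward to compute from a confusion matrix. The sketch below is a small NumPy illustration of the definitions above, with hypothetical pixel counts; it is not tied to any particular framework's implementation.

```python
# PA and MIoU from a confusion matrix whose entry [i, j] counts pixels of true
# class i predicted as class j (background = 0, lesion = 1 in this two-class task).
import numpy as np

def pixel_accuracy(cm: np.ndarray) -> float:
    return float(np.diag(cm).sum() / cm.sum())

def mean_iou(cm: np.ndarray) -> float:
    tp = np.diag(cm).astype(np.float64)
    denom = cm.sum(axis=1) + cm.sum(axis=0) - tp   # union per class
    return float(np.mean(tp / denom))

# Hypothetical counts for one 256 x 256 image: rows = ground truth, cols = prediction.
cm = np.array([[60000, 1200],
               [800, 3536]], dtype=np.float64)
print(pixel_accuracy(cm), mean_iou(cm))
```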
2.4. Network Architecture
U-Net is a neural network model that consists of an encoder–decoder architecture as shown in
Figure 3. The encoder part uses the CNN architecture as a contraction path to extract image features and reduce resolution, and the contraction path has four sub-blocks, each of which consists of two consecutive 3 × 3 convolutions, the ReLU activation function, and the maximum pooling layer for down-sampling. Two 3 × 3 convolution operations can effectively reduce the neural network complexity and keep the original segmentation accuracy unchanged. In each down-sampling step, the number of feature channels is doubled. The decoder part consists of convolutional blocks containing up-sampling operations to form an extended path to repair the image detail information, locate the boundary of the segmented object, and gradually restore the spatial resolution of the feature map. In the expansion path, the sub-blocks contain two consecutive 3 × 3 convolutions, the ReLU activation function, and the up-sampling inverse convolution layer. Up-sampling expands the feature map to twice its original size and restores missing detail information. Splicing is a unique U-Net feature that clips the low-level detail features captured by the down-sampling process in the same layer and splices them into the high-level semantic features extracted using the up-sampling process. The final output segmentation result combines both the object category recognition basis provided by the low-resolution information and the accurate positioning segmentation basis provided by the high-resolution features, which improves the problem of insufficient information in the up-sampling process and achieves accurate segmentation.
The U-Net model achieves excellent segmentation results on a variety of datasets, but it also has some shortcomings. First, the redundancy is too large: each pixel requires its own patch, and the patches of two neighboring pixels are highly similar, which leads to a very large amount of redundant computation and very slow network training. Second, high classification accuracy and high localization accuracy cannot coexist: when a larger receptive field is chosen, the dimensionality reduction of the corresponding pooling layers increases, which lowers localization accuracy, whereas a smaller receptive field lowers classification accuracy. Third, feeding shallow network information directly into the decoder leads to poor segmentation of lesion edges. To improve the segmentation performance of the model and address the above shortcomings, this paper makes the following improvements to the traditional U-Net structure: (1) VGG replaces the U-Net feature extraction network: based on the U-Net framework, the feature extraction network is replaced with VGG, which greatly improves the training accuracy of the network and yields a more accurate segmentation algorithm. (2) An ASPP module is added: ordinary convolution is replaced with atrous convolution and a spatial pyramid pooling structure is added, which effectively reduces the loss of local information and the lack of long-distance correlation caused by the gridding effect, and features of different scales can be obtained without using pooling layers. (3) CA is added: CA is inserted into the feature extraction module and the ASPP module to reduce the loss of accuracy, train a more accurate segmentation method, and improve the segmentation accuracy of the proposed model on grape disease images. The improved model structure is shown in
Figure 4.
2.5. Optimizing the Feature Extraction Part
In this paper, U-Net is used as the basic framework for constructing the model, and the convolutional layers and max pooling layers of the VGG16 network are used as the encoder of the U-Net network. This improves the efficiency and accuracy of image feature interpretation, thereby improving the semantic segmentation accuracy of U-Net and reducing the influence of other factors on the interpretation accuracy of the model. The backbone feature extraction network is shown in
Figure 5.
The encoder uses VGG to obtain successive feature layers through stacked convolution and max pooling. Five initial effective feature layers are obtained from the backbone feature extraction part for the subsequent stacking and splicing.
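One convenient way to obtain these five feature levels is to cut the torchvision VGG16 convolutional stack at the end of each stage. The sketch below is an illustration under assumptions: the cut indices and the use of a recent torchvision API (weights=None) are my choices, not the paper's stated configuration.

```python
# Collect five intermediate VGG16 feature maps for use as U-Net skip features.
import torch
from torchvision.models import vgg16

features = vgg16(weights=None).features        # conv + ReLU + max-pool layers only
cuts = [4, 9, 16, 23, 30]                      # assumed ends of the five conv stages

def vgg_feature_levels(x):
    feats, start = [], 0
    for end in cuts:
        x = features[start:end](x)             # run one conv stage
        feats.append(x)                        # keep it for skip connections
        start = end
    return feats

levels = vgg_feature_levels(torch.randn(1, 3, 256, 256))
print([f.shape for f in levels])               # spatial sizes 256, 128, 64, 32, 16
```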
2.6. Attention Mechanism
Attention mechanisms in deep learning select the information that is most critical to the task from a large amount of information, and combining attention mechanisms with convolution can further improve the performance of semantic segmentation tasks. To date, the most popular attention mechanism is still squeeze-and-excitation (SE) attention [
24]. SENet is designed to enable the network to perform dynamic channel feature recalibration to improve the network’s representational capabilities, the structure of which is shown in
Figure 6. From the structure, it can be seen that, for an input X, a convolution first produces a feature map U, to which an SE module is attached to apply channel attention. For U, the spatial information of each channel is first compressed to a single value, i.e., a vector of size 1 × 1 × C is obtained from the U of size H × W × C. Then, a set of FC layers is applied to this vector to perform a weighting adjustment, yielding a 1 × 1 × C channel attention vector; finally, the channel attention vector is used to weight U, forming a weighted feature map. However, SENet only considers encoding inter-channel information and ignores positional information, which is crucial for capturing object structure in visual tasks.
Subsequent work, such as BAM [
25] and CBAM [
26], exploits positional information by reducing the number of channels and then applying a large-size convolution, which is used to compute spatial attention. However, convolution can only establish local relationships and cannot model the long-range dependencies necessary for visual tasks.
Coordinate attention (CA) [
27], on the other hand, enables lightweight networks to pay attention to a large area while avoiding incurring large computational overheads by embedding location information into the channel attention. To mitigate the loss of location information caused by 2D global pooling, CA decomposes the channel attention into two parallel 1D feature encoding processes to efficiently integrate spatial coordinates to input information into the generated attention graph. Specifically, CA uses two 1D global pooling operations to aggregate input features along vertical and horizontal directions into two separate direction-aware feature maps, respectively. These two feature maps embedded with direction-specific information are then encoded into two separate attention maps, each capturing the long-range dependencies of the input feature maps along one spatial direction. Thus, location information can be provided in advance in the generated attention maps. The two attention maps are then applied to the input feature map using multiplication to emphasize the representation of interest. Because this attention operation distinguishes spatial directions (i.e., coordinates) and generates coordinate-aware feature maps, the proposed method is referred to as coordinate attention.
The CA module encodes channel relations and long-range dependencies through precise position information, similar to the SE module, which is also divided into two steps, coordinate information embedding and coordinate attention generation; its specific structure is shown in
Figure 7.
For an input X, each channel is initially encoded using a pooling kernel of size (H, 1) along the horizontal coordinate direction or a pooling kernel of size (1, W) along the vertical coordinate direction. The output of the cth channel at height h is expressed as follows:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
Similarly, the output of the cth channel at width w is expressed as follows:
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
With the above transformations, the features are aggregated along the two directions, resulting in a pair of direction-aware feature maps that obtain a global receptive field and accurately encode positional information. They are then concatenated and transformed using the 1 × 1 convolutional transform function $F_1$ as follows:
$$f = \delta\big(F_1\big([z^h, z^w]\big)\big)$$
where [·,·] denotes the concatenation operation along the spatial dimension, δ is a nonlinear activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map that encodes spatial information in the horizontal and vertical directions; r is the reduction ratio used to control the block size, as in the SE block. Then, f is decomposed along the spatial dimension into two independent tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which are transformed using two additional 1 × 1 convolutions $F_h$ and $F_w$ into tensors with the same number of channels as the input X, yielding $g^h$ and $g^w$, respectively, as follows:
$$g^h = \sigma\big(F_h(f^h)\big), \qquad g^w = \sigma\big(F_w(f^w)\big)$$
where σ represents the sigmoid function. To reduce the computational overhead and model complexity, the number of channels in f is typically reduced using an appropriate reduction ratio r. Subsequently, $g^h$ and $g^w$ are expanded and employed as attention weights. Finally, the output Y of the CA module can be expressed as follows:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
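The computation above maps directly onto a small module. The following PyTorch sketch follows the published coordinate attention design (directional average pooling, a shared 1 × 1 convolution with reduction ratio r, then two 1 × 1 convolutions with sigmoids); details such as the normalization and activation choices are assumptions and may differ from the exact configuration used in this paper.

```python
# Coordinate attention (CA) sketch: g^h and g^w weight the input along H and W.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)                       # reduced channel count
        self.conv1 = nn.Conv2d(channels, mid, 1)          # shared transform F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                  # nonlinearity delta
        self.conv_h = nn.Conv2d(mid, channels, 1)         # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)         # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                 # (n, c, h, 1): pool along W
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)          # split back into two tensors
        g_h = torch.sigmoid(self.conv_h(f_h))             # (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * g_h * g_w                              # y = x * g^h(i) * g^w(j)

y = CoordinateAttention(64)(torch.randn(2, 64, 32, 32))   # same shape as the input
```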
2.7. FPN-Based Feature Fusion Branching
During the process of CNN learning image features, the image resolution gradually decreases due to deep convolutional operations. This can result in lower-resolution deep features at the output, leading to recognition errors for objects that occupy a relatively small percentage of pixels in the image. To enhance multi-scale detection accuracy, it is beneficial to combine features from different network layers during training.
Feature pyramid network (FPN) [
28] is a method used for fusing feature maps from different layers to enhance the feature extraction process. Its specific structure is depicted in
Figure 8. FPN can fuse feature maps that capture different scales of information. As illustrated in the figure, FPN generates a new set of deep features by up-sampling the deep features by a factor of two, stacking them with the shallow features, and then convolving them. Feature fusion occurs sequentially, allowing the prediction network to incorporate the five preliminary effective feature maps generated by the VGG component of the U-Net backbone network. The fused feature map contains richer semantic and spatial information because it incorporates features from various levels, which contributes to the improved segmentation performance of the U-Net network.
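A single fusion step of this kind is short to express in code. The sketch below, with illustrative channel sizes and a bilinear up-sampling choice that are assumptions rather than the paper's exact settings, up-samples the deeper map by two, stacks it with the shallower map, and convolves the result.

```python
# One FPN-style fusion step: upsample the deep feature, stack, convolve.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(shallow, deep, conv):
    up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
    return conv(torch.cat([shallow, up], dim=1))   # stack, then 3 x 3 convolution

conv = nn.Conv2d(512 + 256, 256, 3, padding=1)     # assumed channel sizes
shallow = torch.randn(1, 256, 64, 64)              # higher-resolution feature
deep = torch.randn(1, 512, 32, 32)                 # lower-resolution feature
fused = fuse(shallow, deep, conv)                  # -> (1, 256, 64, 64)
```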
2.8. Multi-Scale Feature Fusion with Atrous Spatial Pyramid Pooling (ASPP)
The pooling operation of a semantic segmentation network easily loses positional and dense semantic information in the process of expanding the receptive field and aggregating contextual information, whereas atrous convolution reduces the dependence on parameters and computation while preserving the image resolution. It requires fewer parameters to achieve an effective expansion of the convolution kernel's receptive field and effectively aggregates contextual information. Consider a 2D atrous convolution that, for each position i of the output feature map y and filter w, is applied to the input feature map x as follows:
$$y[i] = \sum_{k} x[i + r\cdot k]\, w[k]$$
where k indexes the positions of the convolution kernel and r denotes the sampling rate. The above formula indicates that a new filter is obtained by inserting r − 1 zero values along each spatial dimension between two consecutive filter values; the feature map x is then convolved with this filter to obtain the final feature map. Consequently, atrous convolution can control the receptive field of the filter and the density of the network's output features by adjusting the sampling rate, without increasing the number of parameters or computational effort.
Multi-scale fusion’s atrous spatial pyramid pooling ASPP uses atrous convolution with multi-level atrous sampling rates to sample feature maps in parallel, allowing the ASPP module to learn image features from different receptive fields [
29]. Because atrous convolution with a large sampling rate degenerates into a 1 × 1 convolution, as the image boundary response cannot capture long-range information, image-level features obtained through global average pooling are integrated into the ASPP module. The feature maps output by the four convolution branches and the image-level branch are fed into a 1 × 1 convolution layer and then bilinearly up-sampled to the required spatial dimension. The calculation process is as shown in the following formula:
$$y = C_{1\times 1}\Big(\mathrm{Concat}\big[C_{1\times 1,\,1}(x),\ C_{3\times 3,\,r_1}(x),\ C_{3\times 3,\,r_2}(x),\ C_{3\times 3,\,r_3}(x),\ \mathrm{image}(x)\big]\Big)$$
In the formula, $C_{n\times n,\,r}(x)$ represents the atrous convolution with sampling rate r and convolution kernel size n × n applied to the input features, image(x) represents extracting image-level features from the input x using global average pooling, and $r_1$, $r_2$, and $r_3$ are the sampling rates of the three parallel 3 × 3 branches. The ASPP structure is shown in
Figure 9.
The ASPP structure expands the receptive field and enhances semantic information through parallel sampling with atrous convolutions at multiple sampling rates. Additionally, image-level features effectively capture global contextual information and account for context relationships, thereby preventing segmentation errors arising from overreliance on local features and ultimately improving target segmentation accuracy. Therefore, before up-sampling, the feature map containing high-level semantic information is input to the ASPP module to obtain features of different scales, which helps to improve the network's lesion extraction performance.
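The branch layout described above can be summarized in a compact module. The PyTorch sketch below uses the rates (6, 12, 18) commonly associated with DeeplabV3+ as an assumption; the paper's actual rates and channel widths may differ.

```python
# ASPP sketch: 1x1 conv, three parallel atrous 3x3 convs, and an image-level
# global-average-pooling branch, concatenated and fused by a final 1x1 conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        img = F.interpolate(self.image_pool(x), size=(h, w),
                            mode="bilinear", align_corners=False)   # image-level feature
        return self.project(torch.cat(feats + [img], dim=1))        # fuse all branches

out = ASPP(512)(torch.randn(1, 512, 16, 16))   # -> (1, 256, 16, 16)
```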
3. Results
3.1. Determination of Training Parameters
Because a learning rate that is too small leads to very slow model convergence and one that is too large leads to non-convergence, it is necessary to determine an appropriate initial learning rate. This article tests the accuracy of the U-Net model trained with four initial learning rates. The results are shown in
Figure 10. It can be seen that when the learning rate is 0.0001 and the number of epochs is 100, the mean intersection over union of the method on the PD1 dataset is 86.81%, which achieves good segmentation results. On this basis, drawing on empirical values of commonly used network training hyperparameters and repeated testing, the starting hyperparameters for the subsequent CVU-Net experiments were set as shown in
Table 2.
3.2. Comparison of Different Attention Mechanisms
To verify the effect of different attention mechanisms on the detection performance of the algorithm, while keeping other variables consistent, this experiment adds four attention mechanisms (SENet, CBAM, ECA, and CA) to the baseline model for comparison and analysis. The baseline model is the improved U-Net with the VGG and ASPP modules added. Using MIoU and PA as indicators, segmentation experiments were conducted on the grape disease image test set with complex backgrounds.
Table 3 shows the comparison results of different attention mechanisms. As can be seen from the table, the MIoU and PA indicators of the CA attention mechanism are the highest, reaching 91.09% and 94.33%, respectively. Therefore, this paper selects CA as the most appropriate attention mechanism based on its performance and uses the training set to evaluate the segmentation performance of the CVU-Net model.
3.3. Ablation Experiments
To assess the performance of the proposed CVU-Net method in the task of grape disease semantic segmentation, it was compared with traditional semantic segmentation methods such as FCN, PSPNet, U-Net, and DeeplabV3+. MIoU and PA were selected as metrics to evaluate the segmentation performance of each method.
To test the generalization ability of CVU-Net and verify its robustness, segmentation and comparison experiments were conducted on the constructed training and test sets. To confirm the effectiveness of the CVU-Net design, which includes replacing the original feature extraction network with VGG (which provides better segmentation), incorporating an ASPP module into the skip connection section, and adding CA to both the enhanced feature extraction module and the ASPP module, the following ablation experiments were performed on the test set.
- (1)
VU-Net: Based on the traditional U-Net architecture, the feature extraction network is replaced with the VGG network, which has a superior segmentation effect.
- (2)
AVU-Net: Building upon VU-Net, an ASPP module is integrated into the skip connection layer.
- (3)
CVU-Net1: Extending AVU-Net, CA is introduced into the enhanced feature extraction module.
- (4)
CVU-Net: Further enhancing AVU-Net, CA is integrated into both the enhanced feature extraction module and the ASPP module.
Table 4 presents the experimental results of different configurations on the PD1 dataset. It is evident that CVU-Net outperforms the other configurations, indicating that the addition of the CA module after the feature extraction module and ASPP module effectively enhances the model’s segmentation capabilities.
3.4. Fivefold Cross Validation
To further compare the performance of different parameter settings and find the best parameter configuration, fivefold cross-validation experiments were performed for different parameter selections. We divided the dataset into five equally sized subsets. In each iteration of the fivefold cross-validation, four of the five subsets were used to train the model, while the remaining subset was used to test its performance. This process was repeated five times, ensuring that each subset was used once as the test set. We chose four parameter schemes, as shown in
Table 5. For the performance indicators MIoU and PA, the mean of the five cross-validation runs was calculated, and the experimental results are shown in
Table 6.
It can be seen from the experimental results that when the batch size is 16 and the learning rate is 0.0001, the values of MIoU and PA are the highest, reaching 91.18% and 94.40%, respectively. After weighing the evaluation indicators of the different schemes, we finally chose scheme 4 for the subsequent experiments.
3.5. Performance Comparison of Different Segmentation Methods
This paper compared CVU-Net with traditional U-Net, PSPNet, and DeeplabV3+. The comparison results of different segmentation algorithms are presented in
Table 7. As shown in the table, the improved method in this paper achieves a pixel accuracy (PA) of 94.33%, which is 3.67%, 3.57%, and 5.41% higher than that of the traditional U-Net algorithm, PSPNet algorithm, and DeeplabV3+ algorithm, respectively. Regarding the mean intersection over union (MIoU), the improved method in this paper attains a value of 91.13%. In terms of MIoU, it outperforms the traditional U-Net algorithm, PSPNet algorithm, and DeeplabV3+ algorithm by 4.97%, 5.51%, and 5.44%, respectively. These experimental results demonstrate that the incorporation of the depth attention mechanism in this paper’s method enhances the model’s feature extraction capability and significantly improves the accuracy of grape semantic segmentation. The visualization of the segmentation results is shown in
Figure 11. In
Figure 11, the first column represents the original grape leaf images, the second column depicts the manually labeled images, the third column displays the segmentation results from the DeeplabV3+ model, the fourth column shows the segmentation results from the PSPNet model, the fifth column exhibits the segmentation results from the U-Net model, and the sixth column demonstrates the segmentation results from the CVU-Net model proposed in this paper. It can be seen from the visualization results that the U-Net model segments relatively accurately, but small lesions are missed or misidentified; the PSPNet model is less effective, identifying dense small lesions as one large lesion, and its lesion edge detection is not accurate enough; and the DeeplabV3+ model misses small lesions and produces unclear edge segmentation of large lesions. CVU-Net segments lesion edges and small lesions more accurately, in a manner basically consistent with the annotations, and achieves very good accuracy. The visualization results demonstrate that adding the ASPP module enhances the model's perception of the input image and captures a wider range of contextual information, and that adding CA to the feature extraction module and ASPP module helps the model further learn the correlation between features and focus on important feature channels, so as to more accurately segment the lesion areas and lesion edges.
3.6. Disease Spot Grading and Comparison Experiments
Because there is no clear grading standard for the degree of grape leaf spots, to more accurately analyze the grading of the degree of grape leaf black rot spots, this paper takes the standard Grapevine Downy Mildew Disease classification method [
30], a national standard of the People's Republic of China, as a reference to develop a grading standard for grape black rot leaf spots. Based on the principle of pixel statistics, Python is used to compute the area of the disease spots, and the leaves are divided into three levels: level 1, level 2, and level 3. The specific grading standards are shown in
Table 8.
Here, k is the proportion of the diseased area to the whole image, and the calculation formula is as follows:
$$k = \frac{S_{lesion}}{S_{image}}$$
In the formula, $S_{lesion}$ is the area of the lesion region and $S_{image}$ is the area of the whole image.
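The pixel-statistics grading can be sketched in a few lines. In the example below, the two thresholds t1 and t2 are placeholders; the actual level cut-offs come from Table 8 and are not reproduced here.

```python
# Lesion-area ratio k and a three-level grading from a predicted lesion mask.
import numpy as np

def lesion_level(mask: np.ndarray, t1: float = 0.05, t2: float = 0.15) -> int:
    k = np.count_nonzero(mask) / mask.size      # k = S_lesion / S_image
    if k <= t1:
        return 1
    return 2 if k <= t2 else 3

mask = np.zeros((256, 256), dtype=np.uint8)     # hypothetical predicted lesion mask
mask[40:80, 40:80] = 1
print(lesion_level(mask))                       # ratio ~0.024 -> level 1 with these thresholds
```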
To measure the effectiveness of this model, a comparison experiment based on the graded PD1 dataset was conducted between the traditional U-Net, VGG + U-Net, ASPP + VGG + U-Net, and the method of this paper, as shown in the following table, where VU-Net denotes the traditional U-Net with the VGG network introduced, and AVU-Net denotes the traditional U-Net with both the VGG network and the ASPP module introduced.
As can be seen from
Table 9, all models achieve their highest segmentation accuracy for the level 3 category, which may be due to the larger area of level 3 leaf spots. A comparison of the segmentation accuracy of each model for the level 3 lesions is shown in
Figure 12.
It can be clearly seen from
Figure 12 and
Table 9 that compared with the U-Net model, VU-Net model, and AVU-Net model, the segmentation accuracy of the level 3 category of this model has increased by 4.0%, 2.43%, and 1.16%, respectively. For the other two types of lesion levels, this model is improved compared to the U-Net model, VU-Net model, and AVU-Net model.
4. Discussion
The follow-up work of this article includes the following four parts: first, further improving the segmentation accuracy of the algorithm, especially for low-level (mild) disease categories; second, carrying out further research and experiments on grape leaves with different degrees of disease; third, studying how to reduce the interference of uncertain factors, such as noise and shadows in the image, on the segmentation accuracy of the algorithm; and fourth, conducting further research on the unclear segmentation of lesion edges and the misdetection or missed detection of small lesions.
It can be seen from the experiments that the proposed method, CVU-Net, extracts the diseased areas in the images more effectively than methods such as U-Net, DeeplabV3+, and PSPNet. The PA on the whole grape disease image dataset reaches 94.33% and the MIoU reaches 91.09%, which are 3.67% and 4.93% higher than those of the traditional U-Net network, respectively. The robustness of CVU-Net was fully verified by comparing it with the other three semantic segmentation methods on the grape disease test set. Although CVU-Net segments grape disease images more accurately than the other tested methods, its segmentation of occluded regions is not accurate for diseased leaves that are occluded by other leaves in more complex scenes. Therefore, we recommend constructing a relevant dataset and conducting further experimental studies in the future to address this issue.
5. Conclusions
In response to the low accuracy of grape disease image segmentation, this paper proposes a deep learning-based segmentation method, CVU-Net. Our method combines the U-Net model with the VGG network, significantly improving the training accuracy of the network and achieving more precise segmentation results.
We incorporate the ASPP module into the skip connection part, expanding the receptive field and aggregating context information to avoid the loss of position information and dense semantic information caused by pooling operations while reducing the dependence on parameters and calculation processes. It can help the model better capture the edge information of the image and retain the detailed features of the image, allowing the model to produce more refined and accurate segmentation results.
In this paper, we introduce CA into the feature extraction module and ASPP module, which can better restore the edge information of objects and further improve the feature extraction capabilities of the method, reducing missed objects. Experiments on PD1 show that our method can effectively extract the areas of grape leaf black rot disease spots and achieve more accurate and efficient segmentation of disease spots. However, the segmentation effect on other disease images of grape leaves is unknown. In the next step, we will pre-train the model on other grape disease image datasets to achieve the segmentation and recognition of different diseases in real environments.