3.1. Evaluation Indicators
The performance of the model is evaluated using four metrics: Pixel Accuracy (PA), IoU, Dice, and Recall. Pixel Accuracy is the proportion of correctly predicted pixels among all pixels. IoU is the ratio of the intersection to the union of the ground-truth and predicted pixel sets for each category. PA and IoU are calculated as follows:
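Both take the conventional confusion-matrix forms; the equations below are reconstructed from the definitions that follow, and the reported IoU is assumed to be averaged over the k + 1 classes:

\[
\mathrm{PA} = \frac{\sum_{i=0}^{k} P_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} P_{ij}},
\qquad
\mathrm{IoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}
\]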
where $P_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$, and $P_{ii}$ denotes the number of pixels of class $i$ predicted as class $i$, i.e., the number of correctly classified pixels. The value of $k$ in each stage of the two-stage model is 1: in the first stage, $k = 1$ represents the leaf, while in the second stage, it represents the lesion.
Dice measures the similarity between two samples and ranges over [0, 1]. A Dice value close to 1 indicates high set similarity, that is, the target is well separated from the background, whereas a value close to 0 indicates that the target cannot be effectively segmented from the background. Recall is the ratio of the number of samples correctly predicted as positive to the total number of positive samples. Dice and Recall are calculated as follows:
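Both take their standard forms in terms of the confusion-matrix counts defined below:

\[
\mathrm{Dice} = \frac{2TP}{2TP + FP + FN},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
\]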
where TP denotes true positives, FP denotes false positives, and FN denotes false negatives.
3.2. Comparison of Different Segmentation Models
To verify the effectiveness of TRNet and U-Net(ResNet50), U-Net, U-Net(MobileNet), DeepLabV3+(ResNet50), DeepLabV3+(MobileNet), SETR, and PSPNet(ResNet50) were chosen as control models for the first and second stages in this study, and comparisons of the results are shown in Table 4 and Table 5 [32,33]. All of the above models were trained and evaluated on the created dataset. The weights with the best training performance were saved and used for testing, and the predicted mask was overlaid on the original image to obtain the segmentation result. The quantitative results are reported in tables, and the qualitative results are visualized as renderings.
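For illustration, the overlay step can be sketched as follows; the file paths, overlay color, and blending weights are placeholders rather than settings from this study.

```python
import cv2

# Illustrative sketch of overlaying a predicted binary mask on the original image.
# "image.jpg" and "pred_mask.png" are placeholder paths, not files from the paper.
image = cv2.imread("image.jpg")                             # original image (BGR)
mask = cv2.imread("pred_mask.png", cv2.IMREAD_GRAYSCALE)    # predicted mask, 0 = background

overlay = image.copy()
overlay[mask > 0] = (0, 255, 0)                             # paint the predicted region green
result = cv2.addWeighted(image, 0.6, overlay, 0.4, 0)       # blend the mask onto the original
cv2.imwrite("segmentation_result.jpg", result)
```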
Disease images collected in a production environment suffer from overlapping leaves and complex backgrounds, which makes it difficult to separate leaves from the background. For the model to accurately segment the target leaf, it must take global features into account while also attending to local features. TRNet combines the advantages of the Transformer and the convolutional neural network. The Transformer's ability to model global features allows the network to better attend to the entire image, increases the attention weight on the target leaves, and reduces segmentation errors caused by complex backgrounds. At the same time, the attention to local features makes TRNet equally sensitive to detailed features within the target leaves. Therefore, TRNet achieved the best leaf segmentation performance, with a PA of 93.94%, an IoU of 96.86%, a Dice coefficient of 72.25%, and a Recall of 98.60%. Compared with the SETR model, which uses the Transformer as the encoder, the PA improved by 2.38%, the IoU by 4.25%, the Dice coefficient by 1.13%, and the Recall by 2.46%. Among the segmentation networks using convolutional encoders, DeepLabV3+(ResNet50) achieved the highest metrics: 92.90%, 95.49%, 71.65%, and 97.42% for the PA, IoU, Dice coefficient, and Recall, respectively. Compared with DeepLabV3+(ResNet50), the PA, IoU, Dice coefficient, and Recall of TRNet increased by 1.04%, 1.37%, 0.60%, and 1.18%, respectively. The segmentation performance of TRNet was thus significantly improved, which further shows that the combination of the Transformer and the CNN is effective.
In the second-stage task, the model needed to extract complete disease spots from the target leaf, which required finer feature extraction. Since ResNet50 is deeper and wider than the original U-Net encoder, it can extract more comprehensive disease spot information. Therefore, in the fine segmentation of lesions, U-Net with ResNet50 as the feature extraction network achieved the best performance, with the IoU, Dice coefficient, and Recall reaching 52.52%, 68.14%, and 73.46%, respectively. These results are 2.87%, 3.14%, and 7.45% higher than those of the original U-Net, and 8.04%, 7.88%, and 14.63% higher than those of the Transformer-based SETR network. The indicators of the proposed TRNet were slightly lower than those of U-Net(ResNet50), because the global features extracted by its Transformer branch had a slight negative impact on the fine segmentation of lesions.
To further demonstrate the superiority of TRNet and U-Net(ResNet50), we visualized the first-stage and second-stage segmentation results, as shown in Figure 8. It can be seen that, in the first stage, the CNN-based models could completely segment the target leaf but were inevitably affected by complex backgrounds, resulting in some degree of over-segmentation. The SETR model, which relies purely on the Transformer as the feature extractor, was clearly less affected by overlapping leaves, largely because the Transformer mainly focuses on global features. On the other hand, the SETR model was significantly weaker than the CNN-based models in extracting local features of the cucumber leaf. TRNet, which combines the advantages of both, could more completely segment the target leaf from complex backgrounds and suffered less interference from environmental factors.
In the second stage, the image containing disease spots has a simple background without external interference, so attention to local features becomes more important. Except for the original U-Net and U-Net(ResNet50), all the other models mistakenly segmented the connection between two adjacent disease spots, while U-Net also missed some minor disease spots. It can be seen that the U-Net model had a significant advantage in fusing multi-scale features for the segmentation of small disease spots. Moreover, ResNet50, as a feature extractor, provided precise extraction of local features. Overall, TRNet and U-Net(ResNet50) achieved the best performance on the test set compared with the control models. Therefore, the remainder of this paper focuses on the fusion of these two models.
3.3. Comparison of Model Fusion Methods
The method proposed in this paper first segments the complete leaf from the complex background and then segments the disease spots from the target leaf against a simple background, eventually achieving disease severity grading. The intention of the two-stage segmentation was not only to remove complex interfering factors but also to exploit the complementary advantages of different models to improve the segmentation accuracy. Therefore, the fusion of appropriate models was crucial. In this study, TRNet and U-Net(ResNet50), which delivered the best performance in the first and second stages, respectively, were selected: TRNet segments the target leaf, and the extracted mask map is further processed to segment the disease spots. To verify the advantage of the TUNet model, we also chose the models delivering the second-best performance in the first and second stages, i.e., DeepLabV3+(ResNet50) and TRNet, and fused them with the best performers. In the end, four combination schemes were formed for comparative analysis.
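A minimal sketch of this two-stage pipeline is given below, assuming the trained leaf and lesion networks (e.g., TRNet and U-Net(ResNet50) in Scheme 2) are available as `leaf_model` and `lesion_model`; the 0.5 binarization threshold and the tensor interface are illustrative assumptions.

```python
import numpy as np
import torch


def two_stage_segmentation(image: np.ndarray, leaf_model, lesion_model, device="cpu"):
    """Sketch of the two-stage pipeline: segment the leaf first, then lesions on the masked leaf.

    `leaf_model` and `lesion_model` are assumed to take a (1, 3, H, W) float tensor
    and return a single-channel probability map of the same spatial size.
    """
    x = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0).to(device) / 255.0

    with torch.no_grad():
        # Stage 1: segment the target leaf from the complex background.
        leaf_prob = leaf_model(x)
        leaf_mask = (leaf_prob > 0.5).float()            # assumed binarization threshold

        # Remove the background so that only the target leaf enters stage 2.
        leaf_only = x * leaf_mask

        # Stage 2: segment disease spots on the masked leaf.
        lesion_prob = lesion_model(leaf_only)
        lesion_mask = (lesion_prob > 0.5).float() * leaf_mask  # lesions must lie on the leaf

    return leaf_mask.squeeze().cpu().numpy(), lesion_mask.squeeze().cpu().numpy()
```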
As shown in Table 6, Scheme 1 used TRNet for the segmentation in both stages; Scheme 2 used TRNet in the first stage and U-Net(ResNet50) in the second stage; Scheme 3 used DeepLabV3+(ResNet50) in the first stage and TRNet in the second stage; and Scheme 4 used DeepLabV3+(ResNet50) in the first stage and U-Net(ResNet50) in the second stage.
A comparison of the results is shown in Table 7. It can be seen that the indicators of the fusion models on these two categories were similar, because the lesions of cucumber downy mildew and cucumber anthracnose are similar. For both diseases, the performance of Scheme 1 was slightly better than that of Scheme 3 and Scheme 4, because Scheme 1 used TRNet as the first-stage model and its leaf segmentation was more accurate. Scheme 2 outperformed all the other fusion schemes on every metric (PA, IoU, Dice coefficient, and Recall). It was also noted that the indicators of Scheme 1, Scheme 3, and Scheme 4 were all lower than they were before fusion, and only Scheme 2 yielded higher values for all the indicators after fusion than before. In contrast to the declines observed for the other combinations, the complementary advantages of the two models were fully reflected in Scheme 2.
The segmentation results of the various fusion schemes are shown in Figure 9. It can be seen that Scheme 3 and Scheme 4, which used DeepLabV3+ for segmentation in the first stage, mistakenly segmented some leaves with colors similar to the target leaf, resulting in the segmentation of disease spots from non-target leaves in the second stage and thus reducing the final accuracy. For Scheme 1 and Scheme 2, the TRNet model performed well in the first stage and fully segmented the contour of the target leaf. However, for disease spots of varying sizes, the multi-scale segmentation of U-Net clearly outperformed the other schemes. Based on the advantages and disadvantages of the four schemes and the actual production needs, Scheme 2 was ultimately chosen as the cucumber disease segmentation model in this study.
3.5. Disease Severity Grading
At present, there is no unified standard for the severity grading of cucumber downy mildew. According to the relevant literature, commonly used methods for the severity grading of cucumber downy mildew are mainly based on (1) the ratio of the total area of disease spots to the area of the entire leaf and (2) the number of disease spots per unit leaf area. In this study, the first method was adopted. The disease severity was divided into five levels, as detailed in Section 3.4.
Figure 11 shows the images of cucumber downy mildew and cucumber anthracnose from severity Level 1 to Level 5.
We used TRNet and U-Net(ResNet50) to segment the target leaf and disease spots, respectively, and calculated the ratio of the pixel area of the disease spots to the pixel area of the leaf. The severity of cucumber downy mildew and cucumber anthracnose was then graded according to the specified grading standard. In this study, 90 cucumber downy mildew images and 94 cucumber anthracnose images were selected as test objects, and the predicted disease severity was compared with the manually labelled severity to evaluate the classification accuracy of the model. The experimental results are shown in Table 8 and Table 9. It can be seen from Table 8 that the classification accuracy of cucumber downy mildew for Levels 1, 2, 3, 4, and 5 was 100.00%, 100.00%, 94.44%, 92.31%, and 85.71%, respectively, with an average accuracy of 94.49%. According to Table 9, the classification accuracy of cucumber anthracnose for Levels 1, 2, 3, 4, and 5 was 100.00%, 96.00%, 100.00%, 92.85%, and 83.33%, respectively, with an average accuracy of 94.43%. In general, the model had a high prediction accuracy for disease severity at Levels 1 to 3 but performed suboptimally at Levels 4 and 5. This is because the edges of leaves with Level 4–5 cucumber downy mildew or cucumber anthracnose were mostly withered, and the model might recognize such edges as background in the first-stage segmentation, resulting in a lower accuracy.
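The grading step itself reduces to an area ratio followed by a threshold lookup, as sketched below; the level boundaries shown are placeholders, since the actual thresholds follow the grading standard in Section 3.4.

```python
import numpy as np


def disease_severity(leaf_mask: np.ndarray, lesion_mask: np.ndarray) -> int:
    """Grade severity from the ratio of lesion pixel area to leaf pixel area.

    The level boundaries below are placeholders; the thresholds actually used
    follow the grading standard described in Section 3.4.
    """
    leaf_area = np.count_nonzero(leaf_mask)
    lesion_area = np.count_nonzero(lesion_mask)
    ratio = lesion_area / max(leaf_area, 1)      # avoid division by zero for empty masks

    bounds = [0.05, 0.10, 0.25, 0.50]            # placeholder upper bounds for Levels 1-4
    for level, upper in enumerate(bounds, start=1):
        if ratio <= upper:
            return level
    return 5                                     # anything above the last bound is Level 5
```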
A comparison of the results of the proposed TUNet model and the existing models is shown in Table 10. Ref. [34] uses the two-stage method DUNet to segment diseased leaves and lesions, and Ref. [13] uses an improved U-Net model to segment leaves and lesions simultaneously. As can be seen in Table 10, TUNet achieves a higher accuracy in disease severity grading than Ref. [34]. The one-stage model in Ref. [13] has a speed advantage, but its accuracy is much lower than that of the two-stage models.
As can be seen in Figure 12, both Refs. [13,34] have problems with over-segmentation, that is, the lesions on the edge of the leaves are classified as background, resulting in an incorrect classification of disease severity. DUNet failed to segment lesions due to the incorrect segmentation of leaves in the first stage, resulting in an incorrect input to the second stage, which illustrates the importance of the first-stage model in the two-stage method. Our method adds global features to the first-stage model for context modeling so that it can correctly determine whether an edge lesion is part of the leaf, thus avoiding the over-segmentation problem. However, TUNet still has shortcomings in the segmentation of small lesions, which needs further improvement.