Article

DS-DETR: A Model for Tomato Leaf Disease Segmentation and Damage Evaluation

1 College of Information and Technology, Jilin Agricultural University, Changchun 130118, China
2 College of Agriculture, Nanjing Agricultural University, Nanjing 210095, China
3 Institute for the Smart Agriculture, Jilin Agricultural University, Changchun 130118, China
4 College of Food, Agricultural and Natural Resource Sciences, University of Minnesota, St. Paul, MN 55108, USA
* Authors to whom correspondence should be addressed.
Agronomy 2022, 12(9), 2023; https://doi.org/10.3390/agronomy12092023
Submission received: 20 July 2022 / Revised: 17 August 2022 / Accepted: 24 August 2022 / Published: 26 August 2022

Abstract

Early blight and late blight are important factors restricting tomato yield. However, accurately and objectively detecting and segmenting crop diseases in order to evaluate disease damage remains a challenge. In this paper, the Disease Segmentation Detection Transformer (DS-DETR) is proposed to segment leaf disease spots efficiently based on several improvements to DETR, and disease damage is assessed by the area ratio of the segmented disease spots to the leaves. First, an unsupervised pre-training method was introduced into DETR with the Plant Disease Classification Dataset (PDCD) to solve the problem of the long training time and slow convergence of DETR. This method trains the Transformer structure in advance to obtain leaf disease features; loading the pre-trained weights into DS-DETR accelerates the convergence of the model. Then, Spatially Modulated Co-Attention (SMCA) was used to assign Gaussian-like spatial weights to the query boxes of DS-DETR, so that query boxes at different positions in the image are trained with different weights to improve the accuracy of the model. Finally, an improved relative position code was added to the Transformer structure of DS-DETR. Relative position coding helps the Transformer capture the sequence order of input tokens, and the spatial location feature is strengthened by establishing the location relationship between different instances. Based on these improvements, the DS-DETR model was tested on the Tomato leaf Disease Segmentation Dataset (TDSD) constructed by us. The experimental results show that DS-DETR achieved an APmask of 0.6823, an improvement of 12.87%, 8.25%, 3.67%, 1.95%, 10.27%, and 9.52% over the state-of-the-art models Mask RCNN, BlendMask, CondInst, SOLOv2, ISTR, and DETR, respectively. In addition, the disease grading accuracy reached 0.9640 according to the segmentation results given by our proposed model.

1. Introduction

Crop diseases are a key constraint to sustainable agricultural development [1]. As one of the world’s most important cash crops, tomato is not only rich in nutrients but also has specific pharmacological effects. China has a huge tomato planting area, and its tomato yield ranks first in the world [2]. However, tomato diseases can seriously reduce yield and cause significant losses to the agricultural economy. Early blight and late blight are the most common diseases of tomato [3]. Therefore, disease area segmentation and damage evaluation are critical to disease estimation, prevention, and control, which in turn are essential to ensuring the quality and yield of tomatoes. Disease assessment was originally performed by manual evaluation, but manual evaluation requires a great deal of time and labor [4,5], and the number of trained plant protection and agricultural technicians is insufficient. Manual evaluation therefore seriously limits the efficiency of disease prevention and control.
Traditional machine vision technology has been applied to identify crop diseases over the past decade; it can overcome the problems of manual detection and save on labor costs to a certain extent. It mainly relies on machine algorithms to perform segmentation, feature extraction, and classification of disease area images [6,7]. Wen et al. [8] proposed an improved Pulse Coupled Neural Network (PCNN) that incorporates a modified artificial bee colony algorithm and an income-degree evaluation function. The improved network was used for disease spot segmentation of red–green–blue (RGB) images of corn leaves, and the chromaticity misclassification of the segmented area was reduced by 7.32% compared with the original algorithm. Anam and Fitriah [9] proposed a K-means algorithm combined with Particle Swarm Optimization (PSO) to segment early blight spots from tomato images converted to hue–saturation–value (HSV) color, reaching an F-score of 0.90. However, because traditional machine vision methods are staged algorithms with human intervention, their effectiveness depends heavily on human design experience.
With the rapid development of deep learning, convolutional neural networks (CNNs) have been widely used in agricultural disease identification [10,11,12,13,14]. Mohanty et al. [15] compared the classification performance of AlexNet and GoogleNet on 14 crops and 26 diseases in 38 categories in the PlantVillage dataset; the best accuracy was 99.35%. Liu and Wang [16] used You Only Look Once version 3 (YOLOv3) to detect 12 diseases on 15,000 tomato images collected in natural environments, employing feature pyramids to fuse multi-scale features; the final detection accuracy was 92.39%. Lin et al. [17] applied U-Net to segment disease regions in laboratory-collected cucumber powdery mildew images; the mAP reached 96.08%, the IoU reached 72.11%, and the Dice accuracy reached 83.45%. Ngugi et al. [18] used U-Net to perform foreground and background segmentation on 1408 tomato leaf images collected in wild conditions; the mIoU was 0.96, and the F1 was 0.91. An automatic plant disease detection and segmentation framework, DPD-DS, based on Mask R-CNN was proposed in [19]. DPD-DS replaces the activation function of the backbone network from ReLU with Swish to speed up model convergence, and the anchor ratios in the Region Proposal Network (RPN) were modified to make it more suitable for plant disease detection. The mAP on five crops (apple, grape, mango, pomegranate, and water hyacinth) reached 0.7810. The above CNN-based studies were all carried out on plant leaves and treat the whole leaf as the target. However, the occurrence of diseases is affected by many factors, such as season, environment, and image collection conditions, so it is quite challenging to segment leaves and leaf disease spots accurately.
Although CNNs have produced many highly successful vision models, critical challenges remain. Because the receptive field of each convolutional filter is limited, CNNs struggle to connect spatially distant concepts. Moreover, not all pixels are equally important in image classification, object detection, and image segmentation, yet convolutions process all image patches uniformly regardless of importance, which leads to low spatial efficiency in computation and representation [20,21]. To overcome these challenges, many scholars have tried to introduce the Transformer into the vision field, although this requires complex engineering to be implemented efficiently on hardware accelerators [22,23,24,25]. At the end of 2020, the Transformer [26] brought a revolutionary performance improvement to computer vision by using the self-attention mechanism to extract global features and model context. Several vision networks based on the Transformer have since been proposed, such as the Visual Transformer [21], the DEtection TRansformer (DETR) [27,28,29,30,31,32,33], and the Vision Transformer (ViT) [20,34,35,36,37]. Chohan et al. [38] used a Visual Transformer to classify plant leaf diseases with a final accuracy of 97.98%. Reedha et al. [39] used ViT to classify weed and crop images collected by Unmanned Aerial Vehicles (UAVs); the F1 was 99.28%, which was better than that of the CNN. Therefore, applying the Transformer to more efficient and accurate segmentation of crop disease regions is very promising [40].
The Disease Segmentation Detection Transformer (DS-DETR) model proposed in this paper segments tomato leaf diseases based on the DETR network, a novel architecture with excellent performance. With the help of the segmentation results, the damage grades of tomato early blight and late blight were also evaluated. Specifically, we trained the Transformer structure of DETR in advance using unsupervised pre-training [29]. Then, we introduced Spatially Modulated Co-Attention (SMCA) [30] into the DETR model to learn the importance of query boxes at different positions; by constructing a suitable Gaussian-like spatial weight map, the model attends to the appropriate location features faster. Finally, we introduced the improved Relative Position Encoding (iRPE) [41] into the Transformer to better preserve the positional features between different instances.
The main contributions of our work are as follows:
  • Two datasets were constructed. The Plant Disease Classification Dataset (PDCD), with 86,023 images covering 27 diseases of 14 plants, was collected, and the Tomato leaf Disease Segmentation Dataset (TDSD), containing 1700 images, was annotated by ourselves.
  • The DS-DETR model was proposed to segment leaf disease spots efficiently based on several improvements to DETR; DS-DETR outperforms several other state-of-the-art segmentation networks.
  • A disease damage evaluation method was proposed for early blight and late blight of tomato by calculating the ratio of the disease spot area to the leaf area. This method can provide technical support for the precise prevention and control of crop diseases in production.

2. Materials and Methods

2.1. Image Data Acquisition

For the training and learning processes of the proposed DS-DETR, we constructed two datasets: one for disease segmentation training and testing, and the other for unsupervised pre-training of the model. No disease segmentation dataset suitable for training and testing this model is publicly available, and constructing a pixel-by-pixel annotation dataset is a time-consuming and laborious task, so we labeled and constructed the image segmentation dataset ourselves. The construction process was as follows. First, we selected disease images from datasets such as AI Challenger 2018 and PlantVillage to assemble the Plant Disease Classification Dataset (PDCD), which was used for unsupervised pre-training of the model. The PDCD images were identified and collected from Google, Flickr, Bing, and other agriculture and plant science websites by searching with keywords [42]. The constructed PDCD has 86,023 images, including 27 diseases of 14 plants (apple, cherry, grape, citrus, peach, pepper, corn, raspberry, blueberry, soybean, pumpkin, strawberry, tomato, and potato). Second, we constructed a disease segmentation dataset by selecting tomato early blight and late blight images, a total of 1700 images that were segmented and labeled pixel by pixel. LabelImg was chosen as the labeling tool, and three labeling categories were used: leaves, early blight spots, and late blight spots. The labeled images are shown in Figure 1. The Tomato leaf Disease Segmentation Dataset (TDSD) was formed after labeling.

2.2. Dataset Pre-Processing

The PDCD was used for unsupervised pre-training of the large-scale model, DETR, in advance to obtain pre-trained weights encoding common features of multiple plants and diseases. Online augmentation techniques such as horizontal and vertical flips, rotation, resizing, and normalization were used to increase the number and diversity of training samples. This method does not require the augmented images to be stored, which saves data storage space and provides high flexibility. The 1700 labeled images of tomato early blight and late blight used for the segmentation task were divided into a training set and a test set at a ratio of 8:2; after division, the training set included 1428 images and the test set included 358 images. Furthermore, to address the problem of insufficient training samples for the segmentation task, a simple copy–paste method [43] was used to augment the training set. The copy–paste method randomly selects two images, applies random scale jittering and random horizontal flipping to each, and then randomly selects a subset of objects from one image and pastes them onto the other. There were 2856 images after augmentation, and this augmented collection forms the TDSD training set. Example images after random augmentation are shown in Figure 2.
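To make the augmentation step concrete, the following is a minimal sketch of the copy–paste idea, assuming same-sized images with per-instance binary masks; the function name is illustrative, and the scale jittering and flipping applied before pasting in [43] are omitted here.

```python
import numpy as np

def copy_paste(img_a, masks_a, img_b, masks_b, rng=None):
    """Paste a random subset of instances from image B onto image A.
    img_*: (H, W, 3) uint8 arrays of equal size; masks_*: lists of (H, W) bool arrays."""
    rng = rng or np.random.default_rng()
    out_img = img_a.copy()
    out_masks = [m.copy() for m in masks_a]
    if not masks_b:
        return out_img, out_masks
    k = rng.integers(1, len(masks_b) + 1)              # number of instances to paste
    chosen = rng.choice(len(masks_b), size=k, replace=False)
    pasted = np.zeros(img_a.shape[:2], dtype=bool)
    for idx in chosen:
        m = masks_b[idx]
        out_img[m] = img_b[m]                          # copy the instance pixels
        pasted |= m
        out_masks.append(m.copy())
    # occlude the original instance masks where pasted instances now cover them
    out_masks[:len(masks_a)] = [m & ~pasted for m in out_masks[:len(masks_a)]]
    return out_img, out_masks
```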

2.3. Overall Design of DS-DETR

Figure 3 depicts the overall process of the proposed Disease Segmentation Detection Transformer (DS-DETR). First, an unsupervised pre-training method, UP-DETR, was introduced into DETR; it trains the Transformer weights in advance to obtain common leaf disease features and accelerates the convergence of the large-scale model. Second, the DS-DETR model was trained and tested on the TDSD. The backbone network of DS-DETR, ResNet50, extracts image features and converts them into feature sequence vectors. Then, relative position encoding and absolute position encoding are added to the feature sequence vectors of the Transformer in DS-DETR, which establishes the overall location features of each object and the location relationships between instances. The multi-head self-attention in the encoder then captures richer features, which are passed to the decoder. Gaussian-like spatial weights are assigned to the query boxes in the decoder to improve the model’s attention to local features, and the query boxes are matched against the feature set to find the targets. Finally, the decoder results are fed to the detection heads, composed of MLPs, to complete the detection and classification tasks. The detection head results are input into the segmentation prediction head, where the pixels are classified to obtain the final segmented image.

2.3.1. Unsupervised Pre-Training

DETR focuses on spatial location learning, and self-attention has no biased assumption about the input sequence, so a large number of training epochs and a large amount of data are needed to fit the model. We therefore used the PDCD described above for unsupervised pre-training with UP-DETR to improve the convergence speed and reduce the number of training epochs [27]. The unsupervised pre-training freezes the CNN weights and adds a branch for block feature reconstruction to DETR. The task of this branch is to randomly crop several small patches from the image; the small patches are then used as queries in the decoder, while the original image is fed into the encoder. The pre-training task is to locate these small patches within the original image. The specific network process is shown in Figure 4.
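As a rough illustration of this pretext task, the sketch below samples random patches from an image and builds the normalized box targets that the corresponding decoder queries must regress; the patch-size range and function name are assumptions, and the frozen-CNN feature extraction of the patches and the DETR matching loss are omitted.

```python
import torch

def sample_patch_queries(image, num_patches=10, patch_frac=(0.1, 0.3)):
    """Sample random patches from `image` (C, H, W) and return the crops together
    with their normalized (cx, cy, w, h) boxes, which act as the localization
    targets of the corresponding decoder queries during pre-training."""
    _, H, W = image.shape
    crops, boxes = [], []
    for _ in range(num_patches):
        ph = max(1, int(torch.empty(1).uniform_(*patch_frac).item() * H))
        pw = max(1, int(torch.empty(1).uniform_(*patch_frac).item() * W))
        y0 = torch.randint(0, H - ph + 1, (1,)).item()
        x0 = torch.randint(0, W - pw + 1, (1,)).item()
        crops.append(image[:, y0:y0 + ph, x0:x0 + pw])
        # normalized center-size box: the regression target for this patch query
        boxes.append(torch.tensor([(x0 + pw / 2) / W, (y0 + ph / 2) / H, pw / W, ph / H]))
    return crops, torch.stack(boxes)
```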

2.3.2. Spatially Modulated Co-Attention

In DETR, a series of query boxes is responsible for object detection at different spatial positions. During a query, each query box interacts with the features at every position of the feature set, and the features of each position are used to estimate the detection box and class of the object. This process is called the co-attention mechanism. However, in the DETR decoder, many of the location features participating in this co-attention have little or nothing to do with the prediction of a given query box, and this interaction with useless features seriously slows down the convergence of DETR [30,40]. To solve this problem, we adopt Spatially Modulated Co-Attention (SMCA) [30] for the query operation of the decoder. The idea is to use a spatial prior to modulate the co-attention maps between the object query boxes and the self-attention encoded features: the co-attention in the decoder gives more weight to the region around each target query box, so the model focuses on the features of that region. This reduces the number of training epochs and improves the convergence speed of the model. The specific calculation is given in Equation (1).
$$
\begin{aligned}
E &= \mathrm{Softmax}\left(K_i^{T} Q_i / \sqrt{d}\right) V_i, \\
c_h^{\mathrm{norm}}, c_w^{\mathrm{norm}} &= \mathrm{sigmoid}\left(\mathrm{MLP}(O_q)\right), \\
s_h, s_w &= \mathrm{FC}(O_q), \\
G(i,j) &= \exp\left(-\frac{(i - c_w)^2}{\beta s_w^2} - \frac{(j - c_h)^2}{\beta s_h^2}\right), \\
C_i &= \mathrm{Softmax}\left(K_i^{T} Q_i / \sqrt{d} + \log G\right) V_i,
\end{aligned}
\tag{1}
$$
where $Q_i$, $K_i$, and $V_i$ denote the $i$-th group of the Query ($Q$), Key ($K$), and Value ($V$) features; $O_q \in \mathbb{R}^{N \times C}$ is the object query vector and $E$ is the self-attention encoded feature, where $N$ denotes the number of pre-specified object queries and $C$ is the number of feature channels; $\mathrm{FC}$ is a fully connected layer; $\mathrm{MLP}$ is a multi-layer perceptron; $\mathrm{sigmoid}$ is an activation function; $c_h^{\mathrm{norm}}$, $c_w^{\mathrm{norm}}$, $s_h$, and $s_w$ are the dynamically predicted center and scale of each object query box; the Gaussian-like two-dimensional spatial weight map $G(i,j)$ is generated from this center and scale, where $(i,j)$ are the spatial indices of the weight map $G$; $\beta$ adjusts the bandwidth of the Gaussian-like weight map; and the co-attention $C_i$ between $O_q$ and $E$ is modulated with the spatial prior $G$.
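A minimal sketch of the Gaussian-like spatial prior of Equation (1) follows; the function name and argument layout are illustrative, and the query/key projections and multi-head bookkeeping of the full SMCA module are omitted.

```python
import torch

def smca_log_prior(center, scale, h, w, beta=1.0):
    """Gaussian-like spatial weight map of Equation (1) for one object query.
    center: (c_w, c_h) in [0, 1], predicted by sigmoid(MLP(O_q));
    scale:  (s_w, s_h), predicted by FC(O_q).
    Returns log G with shape (h, w); it is added to the co-attention logits
    K^T Q / sqrt(d) before the softmax, as in the last line of Equation (1)."""
    c_w, c_h = center[0] * w, center[1] * h
    s_w, s_h = scale
    i = torch.arange(w, dtype=torch.float32).view(1, w)   # horizontal index
    j = torch.arange(h, dtype=torch.float32).view(h, 1)   # vertical index
    return -((i - c_w) ** 2) / (beta * s_w ** 2) - ((j - c_h) ** 2) / (beta * s_h ** 2)
```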

2.3.3. Improving Relative Position Encoding

The position encoding in DETR establishes a sequential relationship for the input sequence. Absolute position encoding is used in DETR; that is, each position in the sequence has a fixed position vector. The formula is presented in Equation (2).
$$
\begin{aligned}
\mathrm{PE}(pos_x, 2i)   &= \sin\left(pos_x / 10{,}000^{2i/128}\right), \\
\mathrm{PE}(pos_x, 2i+1) &= \cos\left(pos_x / 10{,}000^{2i/128}\right), \\
\mathrm{PE}(pos_y, 2i)   &= \sin\left(pos_y / 10{,}000^{2i/128}\right), \\
\mathrm{PE}(pos_y, 2i+1) &= \cos\left(pos_y / 10{,}000^{2i/128}\right),
\end{aligned}
\tag{2}
$$
where $\mathrm{PE}$ is the position encoding, $pos_x$ is the horizontal position of the token in the sequence, $pos_y$ is the vertical position of the token in the sequence, and $i$ is the dimension index of the position encoding.
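As an illustration of Equation (2), a minimal sketch of the 2D sinusoidal encoding is given below; the function name is an assumption, and implementation details of DETR such as padding masks and coordinate normalization are omitted.

```python
import torch

def sine_position_encoding(h, w, num_pos_feats=128, temperature=10000):
    """2D sinusoidal absolute position encoding as in Equation (2): sine/cosine
    pairs over the x and y coordinates, concatenated to (h, w, 2 * num_pos_feats)."""
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)
    x = torch.arange(w, dtype=torch.float32).unsqueeze(0).expand(h, w)
    dim_t = torch.arange(num_pos_feats)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_pos_feats)
    pos_x = x.unsqueeze(-1) / dim_t                       # (h, w, num_pos_feats)
    pos_y = y.unsqueeze(-1) / dim_t
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((pos_y, pos_x), dim=-1)              # (h, w, 2 * num_pos_feats)
```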
However, as the projection matrices are learned during training, the distance relationship imposed by the position encoding is broken, so absolute encoding cannot preserve relative position features. This problem affects the detection accuracy of DETR. Relative position encoding can explicitly model the relationship between any two positions. Therefore, we combine the improved Relative Position Encoding (iRPE) [41] with absolute position encoding, which makes the model pay more attention to local features while preserving the absolute position features of the target. The calculation of the relative position encoding is presented in Equation (3).
$$
r_{ij} = p_{\,I_{\tilde{x}}(i,j),\, I_{\tilde{y}}(i,j)}, \qquad
I_{\tilde{x}}(i,j) = g(\tilde{x}_i - \tilde{x}_j), \qquad
I_{\tilde{y}}(i,j) = g(\tilde{y}_i - \tilde{y}_j),
\tag{3}
$$
where $r_{ij}$ is the relative position encoding; $p$ is the learnable relative position encoding table for the input elements; $I_{\tilde{x}}(i,j)$ and $I_{\tilde{y}}(i,j)$ are the bucketed $x$ and $y$ distances, respectively, and their combination forms a 2D index into $p$; and $g(x)$ is defined in Equation (4).
$$
g(x) =
\begin{cases}
[x], & |x| \le \alpha \\[4pt]
\mathrm{sign}(x) \times \min\left(\delta,\ \left[\alpha + \dfrac{\ln(|x|/\alpha)}{\ln(\gamma/\alpha)}(\delta - \alpha)\right]\right), & |x| > \alpha
\end{cases}
\tag{4}
$$
where $[\cdot]$ is a rounding operation; $\delta$ limits the output to $[-\delta, \delta]$; $\alpha$ is the piecewise point; $\mathrm{sign}(\cdot)$ returns 1 for a positive input, −1 for a negative input, and 0 otherwise; and $\gamma$ adjusts the curvature of the logarithmic part.
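A direct transcription of Equation (4) is shown below; the default values of α, δ, and γ are illustrative, not the hyper-parameters used in our experiments. The bucketed indices $I_{\tilde{x}}(i,j)$ and $I_{\tilde{y}}(i,j)$ of Equation (3) are then simply $g(\tilde{x}_i - \tilde{x}_j)$ and $g(\tilde{y}_i - \tilde{y}_j)$.

```python
import math

def g(x, alpha=8, delta=16, gamma=32):
    """Piecewise index function of Equation (4): relative distances with
    |x| <= alpha map to themselves (rounded), while larger distances are
    compressed logarithmically and clipped so the output stays within
    [-delta, delta]."""
    if abs(x) <= alpha:
        return round(x)
    sign = 1 if x > 0 else -1
    val = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (delta - alpha)
    return sign * min(delta, round(val))
```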
The method of adding the relative position encoding is shown in Figure 5. The calculation of the embedding method of relative position coding is shown in Equation (5).
$$
e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{T} + b_{ij}}{\sqrt{d}}, \qquad
b_{ij} = (x_i W^K)\, r_{ij}^{T},
\tag{5}
$$
where $x_i$ and $x_j$ are the feature vectors at two arbitrary positions $i$ and $j$; $W^Q$, $W^K$, and $W^V$ are parameter matrices; and $b_{ij}$ is the relative position bias applied on the keys.
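The sketch below shows how the relative position bias enters the attention logits of Equation (5); the single-head, batch-free layout is a simplification, and the (n, n) bias matrix is assumed to have been gathered from the learnable table $p$ using the 2D indices of Equation (3).

```python
import torch

def attention_with_relative_bias(q, k, v, rel_bias):
    """q, k, v: (n, d) projected features (x W^Q, x W^K, x W^V);
    rel_bias: (n, n) matrix of the b_ij terms from Equation (5), added to the
    dot-product logits before scaling and softmax."""
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1) + rel_bias) / d ** 0.5
    return logits.softmax(dim=-1) @ v
```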

3. Training the Tomato Leaf Disease Segmentation and Damage Evaluation Model

3.1. Computational Hardware and Platform

All model training parameters in this paper were set as follows. The hardware was a DELL Precision T7920 Tower deep learning workstation with a GeForce RTX 3090 graphics card, an Intel Xeon Gold 4210 CPU, and a 1 TB hard disk. The operating environment was Ubuntu 18.04, Python 3.7, and PyTorch 1.7.0. The learning rate was 1 × 10−4, the batch size was 4, and 100–300 epochs were trained. AdamW was used to optimize the training model. The same platform was also used for Mask RCNN [44], BlendMask [45], CondInst [46], SOLOv2 [47], and ISTR [48], which are publicly available, for comparison.

3.2. Model Training

The proposed DS-DETR model was trained as follows:
Input: an image $D$ to be detected and segmented.
Output: a vector $L$ of sample categories, boundary coordinates $R$, and pixel categories $M$.
Step 1: Unsupervised pre-training. Feed the PDCD into the unsupervised pre-training model and freeze the CNN weights. Then, add a branch for block feature reconstruction. The task of this branch is to randomly crop several small patches from the image; the small patches are then used as queries in the decoder, and the original image is fed into the encoder. After 300 epochs of training, unsupervised pre-trained model weights containing crop disease features are obtained.
Step 2: Feature extraction. Input the image $D$ into the feature extraction network ResNet50, which outputs a top-layer feature set of shape $C \times H \times W$ ($C$ is the number of channels, here 2048; $H$ and $W$ are the height and width of the features, usually 1/32 of the original image). Then, use a 1 × 1 convolution to reduce the channel dimension of the feature set and flatten the spatial dimensions. The shape of the final output feature set is $1 \times d \times HW$ ($d$ is the number of channels after dimensionality reduction, usually 512).
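A minimal sketch of this step is given below, assuming a recent torchvision build; the class name is illustrative, and the output width d is left configurable (the text above gives 512, while the encoder in Step 3 works with 256-dimensional tokens).

```python
import torch.nn as nn
from torchvision.models import resnet50

class BackboneProjection(nn.Module):
    """Extract the top-layer ResNet50 feature map (2048 x H/32 x W/32), project it
    with a 1 x 1 convolution to d channels, and flatten the spatial dimensions
    into a sequence of length H*W for the Transformer encoder."""
    def __init__(self, d=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
        self.proj = nn.Conv2d(2048, d, kernel_size=1)

    def forward(self, x):                        # x: (B, 3, H0, W0)
        feat = self.proj(self.body(x))           # (B, d, H0/32, W0/32)
        return feat.flatten(2).transpose(1, 2)   # (B, H*W, d) token sequence
```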
Step 3: Transformer encoder. The sequence output by the feature extraction network is added to the absolute position encoding and then fed into multi-head self-attention. In the multi-head self-attention, the relative position encoding is added to $QK^T$ (it is computed as the product of the relative position encoding and $K$) and can be shared among the attention heads. After the multi-head self-attention calculation, the result is added to the feature set that entered the attention layer (skip connection), followed by layer normalization. After the MLP, skip connection and layer normalization are performed again. This step is repeated $N$ times. The encoder then outputs $K$ and $V$ with shape $(HW, 256)$.
Step 4: Transformer decoder. The decoder has two parts. The first is the generation of $Y$ query boxes ($Y = (100, 256)$). Absolute position encoding is added to the $Q$ and $K$ of the query boxes; $V$ is the input of the decoder (the first decoder layer has no input for $V$, so it is a tensor of zeros). $Q$, $K$, and $V$ are fed into multi-head self-attention, followed by a skip connection and layer normalization. Then, multi-head attention is performed, where the multi-head attention uses spatially modulated co-attention: $Q$ predicts the object center and scale through a two-layer MLP followed by a sigmoid activation function, $K$ generates an appropriate attention map for each target query through co-attention, and a matrix multiplication of the two parts generates a two-dimensional Gaussian-like weight map. The product of $Q$ and $K$ and the $V$ output by the encoder are then used for attention. Skip connections and layer normalization are applied to the results of the spatially modulated co-attention and the decoder’s first skip connection. This step is repeated $N$ times. Finally, the decoder results pass through a fully connected layer to predict the category $L$ and through an MLP to output the box coordinates $R$.
Step 5: Mask head. The decoder results are input into the mask head to achieve pixel-level segmentation of the detected targets. The outputs of the encoder and decoder pass through a set of multi-head attention layers to obtain a feature set. The result of the feature extraction network is reduced by a 1 × 1 convolution and then concatenated with this feature set. The result then goes through three up-sampling stages, each consisting of a 3 × 3 convolution, group normalization, and ReLU activation. Finally, the pixels output by a 1 × 1 convolution are classified to obtain the final segmentation result $M$.
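A sketch of the up-sampling stack described in this step follows; the channel width, group count, bilinear up-sampling, and single-channel mask output are assumptions, and the attention-based feature fusion that precedes the stack is omitted.

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Three up-sampling stages, each a 3x3 convolution with group normalization
    and ReLU, followed by a 1x1 convolution producing per-pixel mask logits."""
    def __init__(self, in_ch, hidden=128, groups=8):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [nn.Conv2d(ch, hidden, 3, padding=1),
                       nn.GroupNorm(groups, hidden),
                       nn.ReLU(inplace=True),
                       nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)]
            ch = hidden
        self.blocks = nn.Sequential(*layers)
        self.out = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, x):                # x: fused (B, in_ch, h, w) feature map
        return self.out(self.blocks(x))  # (B, 1, 8h, 8w) mask logits
```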

3.3. Network Evaluations

To test the effectiveness of the model, we divided all predictions into four conditions according to the Intersection over Union (IoU) between the output box/mask and the label box/mask. True positives (TPs) are correctly detected/segmented targets, false positives (FPs) are falsely detected/segmented targets, and false negatives (FNs) are targets that the model failed to detect/segment; everything else is a true negative (TN). We chose precision, recall, and average precision (AP) as the main evaluation indices. Precision (P) and recall (R) were computed using Equations (6) and (7), and AP using Equation (8).
$$P = \frac{TP}{TP + FP} \tag{6}$$
$$R = \frac{TP}{TP + FN} \tag{7}$$
Since the evaluation indices focus mainly on the positive samples, AP is defined in Equation (8) as the area under the precision–recall (P–R) curve, which balances precision against recall for a single class based on the number of correctly predicted pixels relative to the ground truth. APbox is the AP of the predicted bounding boxes, and APmask is the AP of the predicted segmentation masks. AP is an important index for measuring the detection and segmentation performance of a network and reflects its sensitivity to the target object: the higher the AP value, the more targets are predicted correctly and the better the model. The average test time and the number of model parameters were also recorded to evaluate the performance of the model:
$$AP = \int_0^1 P(R)\, dR \tag{8}$$
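A minimal sketch of Equations (6)–(8) at a fixed IoU threshold is given below; the matching of predictions to ground truth (which determines is_tp) is assumed to have been done beforehand, and the function name is illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Sort predictions by confidence, accumulate TP/FP counts to build the
    precision-recall curve (Equations (6) and (7)), and integrate precision
    over recall (Equation (8)).
    scores: (n,) confidences; is_tp: (n,) bools; num_gt: number of ground truths."""
    order = np.argsort(-np.asarray(scores))
    tp_flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / max(num_gt, 1)
    # step-wise integration of P over R
    ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)
    return ap, precision, recall
```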

4. Results

4.1. Ablation Study

The training loss of the unsupervised pre-training model on the PDCD is shown in Figure 6. We trained for 300 epochs; the loss value reached 7.19, and the model gradually converged.
The model results were evaluated on the test set. As shown in Table 1, the DETR model trained on the original data reached APbox and APmask values of 0.5516 and 0.5361 at 150 epochs, and 0.6114 and 0.5871 at 350 epochs, respectively. When we used copy–paste to augment the data, APbox and APmask improved by 3.11% and 3.42%, respectively, which shows that image augmentation can increase image diversity and improve the performance of the model. When we used the unsupervised pre-trained model to train DETR, the model converged at 100 epochs, and APbox and APmask improved by a further 5.01% and 1.01%. These results show that using an unsupervised pre-trained model reduces the number of training epochs and achieves higher detection and segmentation accuracy, because the Transformer has already been trained on leaf disease features. Moreover, the unsupervised model trained on the PDCD performed well on tomato diseases, so it can also be used for other diseases and generalizes well. When SMCA was added, APbox and APmask improved by a further 0.89% and 2.17%, because the SMCA module trains a Gaussian-like spatial weight map for the detection boxes at different positions, so that boxes at different positions are weighted to pay more attention to useful features. Finally, the addition of relative position encoding further improved APbox and APmask by 3.78% and 2.92%, respectively. The relative position encoding can better locate the disease spots and establish the relative position between the leaf and the spots, making detection and segmentation of the spots more accurate. Comparing the run time and parameter scale, DS-DETR increases the parameter count and run time only slightly while achieving better detection and segmentation results.

4.2. Compared with the State-of-the-Art Instance Segmentation

Among mainstream instance segmentation models, the CNN-based comparisons were Mask RCNN [44], BlendMask [45], CondInst [46], and SOLOv2 [47], and the Transformer-based comparison was ISTR [48]. These are either classic segmentation models or excellent models proposed in 2021. As shown in Table 2, DS-DETR achieved the best results in APbox and APmask: 11.47% and 12.87% better than Mask RCNN, 12.46% and 8.25% better than BlendMask, 13.22% and 3.67% better than CondInst, 4.92% and 1.95% better than SOLOv2, and 1.47% and 10.27% better than ISTR, respectively. For the recall value, our model also achieved the best result, 2–8% higher than the comparison models. The excellent AP and recall values indicate that our model detects and segments the targets better than the other models and is better suited to detecting and segmenting disease spots and leaves.
Figure 7 shows the segmentation results of all models. In Figure 7B, the disease spots are small and indistinct. In Figure 7C,D, some of the disease spots are located on the leaf stem, and in Figure 7E, the disease spots cover almost the whole leaf. Such leaves and spots are difficult to detect and segment: some models cannot identify the disease spots well, and some produce rough edges when segmenting leaves and spots, whereas our model segments disease spots and leaves more accurately. The parameter scale of our model is second only to BlendMask and CondInst, but its APbox and APmask are much higher. The run time of the model was 0.0371 s per image, faster only than ISTR; compared with the other models, each image took about one-third more time, a consequence of the heavy computation common to Transformer models. Considering its overall performance, DS-DETR is more suitable for leaf and disease spot segmentation.

4.3. Disease Damage Evaluation

Finally, the area ratio of the segmented disease spots to the leaf was calculated and used as the index for disease damage evaluation. The index was classified into healthy, mild early blight (early blight spots covering less than 10% of the leaf), severe early blight (more than 10%), mild late blight (less than 10%), and severe late blight (more than 10%) [3]. The area ratio was calculated from the segmentation results on the test set. As shown in Table 3, the classification accuracy reached 96.40%.
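The grading rule can be written directly from the thresholds above; a minimal sketch follows, in which the tie-breaking between the two diseases when both kinds of spots appear on one leaf is our own assumption and is not specified in the text.

```python
def grade_damage(leaf_area, early_blight_area, late_blight_area, threshold=0.10):
    """Assign one of the five damage classes from segmented pixel areas,
    using the 10% spot-to-leaf area ratio threshold."""
    if leaf_area <= 0:
        raise ValueError("leaf area must be positive")
    early_ratio = early_blight_area / leaf_area
    late_ratio = late_blight_area / leaf_area
    if early_ratio == 0 and late_ratio == 0:
        return "healthy"
    if early_ratio >= late_ratio:                     # assumed tie-break: dominant disease
        return "severe early blight" if early_ratio > threshold else "mild early blight"
    return "severe late blight" if late_ratio > threshold else "mild late blight"
```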
Figure 8 shows the confusion matrix of the disease damage evaluation results. The probability of confusion between early blight and late blight is very low: only one early blight image and two late blight images were misclassified as healthy leaves. This demonstrates that using DS-DETR to grade disease damage is feasible, because the damage evaluation is calculated from the area ratio, which provides a unified evaluation standard and eliminates the error caused by subjective judgment.

5. Conclusions and Discussion

This paper developed a tomato leaf disease segmentation model, DS-DETR. We accelerated convergence and improved the model’s accuracy through an unsupervised pre-training model. SMCA was then used to weight the features at different spatial positions, and the improved relative position encoding better establishes the positional relationship between disease spots and leaves and retains the positional features between different instances during training. With this series of improvements, the unsupervised pre-trained model was obtained on the PDCD, and the model performance was validated on the TDSD. The experimental results show that DS-DETR achieved excellent results: APbox and APmask of 0.7393 and 0.6823, respectively, increases of 20.92% and 16.22% over the original network. In addition, the number of training epochs required by the model was significantly reduced. We then calculated the area ratio of the segmented disease spots to evaluate the level of disease damage, and the classification accuracy of the disease damage evaluation reached 96.40%. This paper provides technical support for evaluating the damage level of early blight and late blight in tomato.
However, some problems with this method still need to be solved. Experiments showed that the segmentation of small spots needs to be improved; in future work, we will introduce multi-scale features into the model. We will also continue to enrich the tomato disease segmentation dataset, expanding it to include more disease classes and images. In addition, our model has currently been validated only on leaf spot images with simple backgrounds; we will further validate the segmentation of leaf spots in images with complex backgrounds.

Author Contributions

Conceptualization, C.W. and C.Y.; methodology, C.W. and J.W.; software, C.W. and J.W.; validation, C.W. and J.W.; formal analysis, C.W. and J.W.; investigation, H.S. and T.Z.; resources, Z.M. and J.W.; data curation, C.Y. and J.W.; writing—original draft preparation, Z.M. and H.C.; writing—review and editing, C.W. and J.W.; visualization, H.C. and J.W.; supervision, C.W. and C.Y.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Key Program) (No. U19A2061), the Industrial Technology and Development Project of Development and Reform Commission of Jilin Province (No. 2021C044-8), the Natural Science Foundation of Jilin Province of China (No. 20180101041JC), the Social Sciences project of Jilin Provincial Education Department (No. JJKH20220376SK), and Science and technology research project of Education Department of Jilin Province (No. JJKH20190924KJ).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Part of the data is from open-source data, which can be obtained from [42], and part of the data is available upon request, due to privacy reasons.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Wang, X.; Wen, H.; Li, X.; Fu, Z.; Lv, X.; Zhang, L. Research Progress Analysis of Mainly Agricultural Diseases Detection and Early Warning Technologies. Trans. Chin. Soc. Agric. Mach. 2016, 47, 266–277. [Google Scholar]
  2. Mi, G.; Tang, Y.; Niu, L.; Ma, K.; Yang, F.; Shi, X.; Zhao, X.; Wang, J. Important virus diseases of tomato in China and their prevention control measures. China Cucurbits Veg. 2021, 34, 8–14. [Google Scholar] [CrossRef]
  3. Blancard, D. Tomato Diseases: A Colour Handbook; Manson Publishing Ltd.: London, UK, 2012. [Google Scholar]
  4. Laterrot, H. Disease resistance in tomato: Practical situation. Acta Physiol. Plant. 2000, 22, 328–331. [Google Scholar] [CrossRef]
  5. Martinelli, F.; Scalenghe, R.; Davino, S.; Panno, S.; Scuderi, G.; Ruisi, P.; Villa, P.; Stroppiana, D.; Boschetti, M.; Goulart, L.R. Advanced methods of plant disease detection. A review. Agron. Sustain. Dev. 2015, 35, 1–25. [Google Scholar] [CrossRef]
  6. Khirade, S.D.; Patil, A. Plant disease detection using image processing. In Proceedings of the 2015 International Conference on Computing Communication Control and Automation, Pune, India, 26–27 February 2015; pp. 768–771. [Google Scholar]
  7. Ud Din, Z.; Adnan, S.M.; Ahmad, R.W.; Aziz, S.; Ismail, W.; Iqbal, J. Classification of Tomato Plants’ Leaf Diseases using Image Segmentation and SVM. Tech. J. 2018, 23, 81–88. [Google Scholar]
  8. Wen, C.; Wang, S.; Yu, H.; Su, H. Image segmentation method for maize diseases based on pulse coupled neural network with modified artificial bee algorithm. Trans. Chin. Soc. Agric. Eng. 2013, 29, 142–149. [Google Scholar]
  9. Anam, S.; Fitriah, Z. Early Blight Disease Segmentation on Tomato Plant Using K-means Algorithm with Swarm Intelligence-based Algorithm. Comput. Sci. 2021, 16, 1217–1228. [Google Scholar]
  10. Chen, Z.; Wu, R.; Lin, Y.; Li, C.; Chen, S.; Yuan, Z.; Chen, S.; Zou, X. Plant disease recognition model based on improved YOLOv5. Agronomy 2022, 12, 365. [Google Scholar] [CrossRef]
  11. Hassan, S.M.; Jasinski, M.; Leonowicz, Z.; Jasinska, E.; Maji, A.K. Plant disease identification using shallow convolutional neural network. Agronomy 2021, 11, 2388. [Google Scholar] [CrossRef]
  12. Peng, Y.; Zhao, S.; Liu, J. Fused-Deep-Features Based Grape Leaf Disease Diagnosis. Agronomy 2021, 11, 2234. [Google Scholar] [CrossRef]
  13. Yang, K.; Zhong, W.; Li, F. Leaf segmentation and classification with a complicated background using deep learning. Agronomy 2020, 10, 1721. [Google Scholar] [CrossRef]
  14. Yin, C.; Zeng, T.; Zhang, H.; Fu, W.; Wang, L.; Yao, S. Maize Small Leaf Spot Classification Based on Improved Deep Convolutional Neural Networks with a Multi-Scale Attention Mechanism. Agronomy 2022, 12, 906. [Google Scholar] [CrossRef]
  15. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 1419. [Google Scholar] [CrossRef] [PubMed]
  16. Liu, J.; Wang, X. Tomato diseases and pests detection based on improved Yolo V3 convolutional neural network. Front. Plant Sci. 2020, 11, 898. [Google Scholar] [CrossRef]
  17. Lin, K.; Gong, L.; Huang, Y.; Liu, C.; Pan, J. Deep learning-based segmentation and quantification of cucumber powdery mildew using convolutional neural network. Front. Plant Sci. 2019, 10, 155. [Google Scholar] [CrossRef]
  18. Ngugi, L.C.; Abdelwahab, M.; Abo-Zahhad, M. Tomato leaf segmentation algorithms for mobile phone applications using deep learning. Comput. Electron. Agric. 2020, 178, 105788. [Google Scholar] [CrossRef]
  19. Kavitha Lakshmi, R.; Savarimuthu, N. DPD-DS for plant disease detection based on instance segmentation. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 7559–7568. [Google Scholar] [CrossRef]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar]
  22. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 68–80. [Google Scholar]
  23. Hu, H.; Zhang, Z.; Xie, Z.; Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3464–3473. [Google Scholar]
  24. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
  25. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10076–10085. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  28. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2988–2997. [Google Scholar]
  29. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1601–1610. [Google Scholar]
  30. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3621–3630. [Google Scholar]
  31. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
  32. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  33. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  34. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  35. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  36. Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 32–42. [Google Scholar]
  37. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 579–588. [Google Scholar]
  38. Chohan, M.; Khan, A.; Chohan, R.; Katpar, S.H.; Mahar, M.S. Plant disease detection using deep learning. Int. J. Recent Technol. Eng. 2020, 9, 909–914. [Google Scholar] [CrossRef]
  39. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer Neural Network for Weed and Crop Classification of High Resolution UAV Images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  40. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. arXiv 2021, arXiv:2111.06091. [Google Scholar]
  41. Wu, K.; Peng, H.; Chen, M.; Fu, J.; Chao, H. Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10033–10041. [Google Scholar]
  42. Hughes, D.; Salathé, M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 2015, arXiv:1511.08060. [Google Scholar]
  43. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2918–2928. [Google Scholar]
  44. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  45. Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581. [Google Scholar]
  46. Tian, Z.; Shen, C.; Chen, H. Conditional convolutions for instance segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 282–298. [Google Scholar]
  47. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
  48. Hu, J.; Cao, L.; Lu, Y.; Zhang, S.; Wang, Y.; Li, K.; Huang, F.; Shao, L.; Ji, R. Istr: End-to-end instance segmentation with transformers. arXiv 2021, arXiv:2105.00637. [Google Scholar]
Figure 1. TDSD labeling examples, green is the leaf, red is the early blight spot, and blue is the late blight spot.
Figure 2. Examples of the copy–paste augmented TDSD; green is the leaf, red is the early blight spot, blue is the late blight spot, and black is the background.
Figure 3. The schematic layout of the DS-DETR with backbone, unsupervised pre-training weight, encoder (including position encoding, multi-head self-attention), decoder (including multi-self-attention and spatial module), detect prediction head, and mask prediction heads.
Figure 4. Unsupervised pre-training process.
Figure 5. Process of adding the relative position encoding.
Figure 6. Unsupervised pre-training model training loss.
Figure 7. Segmentation effect visualization results. (A–E) represent the index numbers of the five selected samples and their results, respectively.
Figure 8. Classification results confusion matrix of disease damage evaluation: the vertical ordinate is the actual result, and the horizontal ordinate is the predicted result; class 0 is healthy, class 1 is mild early blight, class 2 is severe early blight, class 3 is mild late blight, and class 4 is severe late blight.
Table 1. Average precision of bounding boxes (APbox), average precision of segmentation (APmask), runtime per image, and model parameters of DS-DETR for tomato disease spots and leaves.

| Method | Backbone | Epoch | Improvements | Box AP | Box AP0.5 | Box AP0.75 | Box Recall | Mask AP | Mask AP0.5 | Mask AP0.75 | Mask Recall | Parameters | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DETR | ResNet50 | 150 | – | 0.5516 | 0.7823 | 0.6050 | 0.6127 | 0.5361 | 0.7512 | 0.5613 | 0.6337 | 163.36 M | 0.0308 |
| DETR | ResNet50 | 350 | – | 0.6114 | 0.8199 | 0.6371 | 0.6830 | 0.5871 | 0.7795 | 0.5840 | 0.6777 | 163.36 M | 0.0308 |
| DETR | ResNet50 | 150 | Aug | 0.5891 | 0.8211 | 0.6196 | 0.6586 | 0.5750 | 0.7631 | 0.5970 | 0.6670 | 163.36 M | 0.0308 |
| DETR | ResNet50 | 350 | Aug | 0.6425 | 0.8268 | 0.6859 | 0.7096 | 0.6213 | 0.8208 | 0.6142 | 0.7011 | 163.36 M | 0.0308 |
| DETR | ResNet50 | 100 | Aug + UP | 0.6926 | 0.8292 | 0.7160 | 0.7505 | 0.6341 | 0.8199 | 0.6628 | 0.7275 | 163.36 M | 0.0308 |
| DETR | ResNet50 | 100 | Aug + UP + SMCA | 0.7015 | 0.8376 | 0.7160 | 0.7532 | 0.6531 | 0.8459 | 0.6689 | 0.7172 | 164.06 M | 0.0336 |
| Ours (DS-DETR) | ResNet50 | 100 | Aug + UP + SMCA + iRPE | 0.7393 | 0.8682 | 0.7633 | 0.7685 | 0.6823 | 0.8669 | 0.7042 | 0.7325 | 164.12 M | 0.0371 |

Aug: copy–paste data augmentation; UP: unsupervised pre-training; SMCA: spatially modulated co-attention; iRPE: improved relative position encoding.
Table 2. Average precision of bounding boxes (APbox), average precision of segmentation (APmask), runtime per image, and model parameters of instance segmentation models for tomato disease spots and leaves.

| Method | Backbone | Epoch | Box AP | Box AP0.5 | Box AP0.75 | Box Recall | Mask AP | Mask AP0.5 | Mask AP0.75 | Mask Recall | Parameters | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN [44] | ResNet50-FPN | 100 | 0.5997 | 0.8090 | 0.6407 | 0.6579 | 0.5242 | 0.7495 | 0.5719 | 0.6151 | 167.60 M | 0.0256 |
| Mask R-CNN [44] | ResNet101-FPN | 100 | 0.6246 | 0.8403 | 0.6679 | 0.6866 | 0.5536 | 0.7729 | 0.6034 | 0.6585 | 239.85 M | 0.0285 |
| BlendMask [45] | ResNet50-FPN | 100 | 0.6147 | 0.8013 | 0.6562 | 0.6727 | 0.5998 | 0.7838 | 0.6178 | 0.6800 | 137.22 M | 0.0262 |
| CondInst [46] | ResNet50-FPN | 100 | 0.6071 | 0.8016 | 0.6549 | 0.6719 | 0.6456 | 0.8013 | 0.6540 | 0.7024 | 130.12 M | 0.0270 |
| SOLOv2 [47] | ResNet50-FPN | 100 | 0.6901 | 0.8362 | 0.6978 | 0.7378 | 0.6628 | 0.8281 | 0.6838 | 0.7165 | 176.18 M | 0.0243 |
| ISTR [48] | ResNet50-FPN | 300 | 0.7246 | 0.8532 | 0.7589 | 0.7582 | 0.5802 | 0.8147 | 0.6561 | 0.6803 | 413.11 M | 0.0397 |
| Our method (DS-DETR) | ResNet50 | 100 | 0.7393 | 0.8682 | 0.7633 | 0.7685 | 0.6823 | 0.8669 | 0.7042 | 0.7325 | 164.12 M | 0.0371 |
Table 3. Disease classification accuracy.

| Class | Accuracy (%) |
|---|---|
| Healthy | 98.73 |
| Mild early blight | 95.00 |
| Severe early blight | 97.50 |
| Mild late blight | 93.33 |
| Severe late blight | 97.44 |
| Average | 96.40 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
