Article

A Study on Tomato Disease and Pest Detection Method

1 Department of Computer and Network Security, Chengdu University of Technology, Chengdu 610059, China
2 School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10063; https://doi.org/10.3390/app131810063
Submission received: 27 July 2023 / Revised: 3 September 2023 / Accepted: 5 September 2023 / Published: 6 September 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In recent years, with the rapid development of artificial intelligence technology, computer vision-based pest detection technology has been widely used in agricultural production. Tomato diseases and pests are serious problems affecting tomato yield and quality, so it is important to detect them quickly and accurately. In this paper, we propose a tomato disease and pest detection model based on an improved YOLOv5n to overcome the problems of low accuracy and large model size in traditional pest detection methods. First, we use the Efficient Vision Transformer as the feature extraction backbone network to reduce model parameters and computational complexity while improving detection accuracy, thus alleviating the problems of poor real-time performance and difficult model deployment. Second, we replace the original nearest-neighbor interpolation upsampling module with the lightweight, general-purpose upsampling operator Content-Aware ReAssembly of FEatures to reduce feature information loss during upsampling. Finally, we use Wise-IoU instead of the original CIoU as the regression loss function of the target bounding box to improve the regression prediction accuracy of the predicted bounding box while accelerating the convergence of the regression loss. We perform statistical analysis on the experimental results of tomato diseases and pests under data augmentation conditions. The results show that the improved algorithm improves mAP50 and mAP50:95 by 2.3% and 1.7%, respectively, while reducing the number of model parameters by 0.3 M and the computational complexity by 0.9 GFLOPs. The improved model has a parameter count of only 1.6 M and a computational complexity of only 3.3 GFLOPs, demonstrating an advantage over other mainstream object detection algorithms in terms of detection accuracy, model parameter count, and computational complexity. The experimental results show that this method is suitable for the early detection of tomato diseases and pests.

1. Introduction

Tomatoes are an important vegetable crop that is widely grown throughout the world [1]. The tomato crop is important in terms of food and nutrition, economics, ecosystem services, and cultural and historical values. First, as a nutrient-rich vegetable, tomatoes play an important role in human health. They are rich in nutrients such as vitamin C, folate, and potassium [2], and they are also a low-calorie food suitable for weight loss and maintaining a healthy diet. In addition, tomatoes can be used in a variety of sauces and condiments to enhance the taste and flavor of food. Second, tomato crops make a significant contribution to the economy and agricultural industry [3]. The cultivation and sale of tomatoes drive many related industries and employment opportunities, including seed production, greenhouse construction, transportation, and sales. Tomato crops also make a positive contribution to the ecosystem. Tomatoes can absorb a large amount of carbon dioxide, reducing greenhouse gases in the atmosphere [4], providing important habitat and food resources, and supporting local biodiversity. Tomatoes are also widely used as research objects and model plants, with significant research value in genetics, cell biology, biotechnology, molecular biology, and genomics [5]. However, tomato production is often associated with pests and diseases that can cause significant yield losses [6], particularly late blight, which severely affects tomato production in humid areas [7]. Prevention and control of tomato pests and diseases are therefore key to improving tomato yield and quality, and early detection of diseases is extremely important for selecting the correct control methods and stopping the spread of diseases [8]. For this reason, it is of great significance to design a simple and efficient real-time detection model for tomato pests and diseases to improve tomato yield.
The main contributions of this paper include the following three points:
  • In this paper, we propose a lightweight model for tomato pest and disease detection called YOLOv5n-VCW. This model improves the YOLOv5n architecture by replacing the original backbone network with Efficient Vision Transformer (EfficientViT) [9], replacing the original upsampling method with the lightweight and general-purpose Content-Aware ReAssembly of FEatures (CARAFE) algorithm [10], and replacing Complete-IoU (CIoU) Loss with Wise-IoU (WIoU) Loss [11]. All these improvements are effective in improving the performance of the model in tomato disease and pest detection tasks.
  • This paper evaluates and compares the performance of mainstream object detection models, including YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, SSD, Faster R-CNN, and the proposed YOLOv5n-VCW model, in the task of detecting tomato pests and diseases. The evaluation results show that YOLOv5n-VCW achieves mAP50 and mAP50:95 scores of 98.1% and 84.8%, respectively, which is a 2.3% and 1.7% improvement over YOLOv5n and even outperforms other models such as YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x.
  • Another contribution of this paper is that it reduces the model parameter count to 1.6 M, a reduction of 0.3 M compared with YOLOv5n, while the computational complexity is reduced by 0.9 GFLOPs (from 4.2 to 3.3 GFLOPs), making the YOLOv5n-VCW model much smaller than the other evaluated models. This makes the YOLOv5n-VCW model more suitable for use on devices with limited computational resources.
This paper is organized as follows: Following the introduction, Section 2 briefly reviews previous work related to the detection of tomato pests and diseases using deep learning methods. Section 3 describes the base model used in this paper. Section 4 describes the method that improves on the base model. Section 5 provides experimental results. Section 6 discusses the experimental results and the limitations of this study, and finally, Section 7 summarizes some comments and future work.

2. Related Works

In recent years, with the development of machine learning and deep learning, the application of computer vision in agriculture has achieved remarkable results, especially in the field of plant disease recognition. Traditional object detection algorithms require researchers to manually design features and use machine learning algorithms to classify the extracted features. Representative feature descriptors include Haar features [12], Histograms of Oriented Gradients (HOG) features [13], and the Deformable Parts Model (DPM) [14]. However, these algorithms rely greatly on manually extracted features, resulting in poor generalization, low robustness, and high computational complexity. Compared with traditional methods, deep learning-based object detection algorithms improve detection speed and accuracy and have become a focus of research in pest and disease detection. Mokhtar et al. [15] proposed a method to identify tomato yellow leaf curl virus and tomato spotted wilt virus, achieving an average diagnostic accuracy of 90%. Lin et al. [16] proposed a feature pyramid structure based on Faster R-CNN that fully exploits the features of each layer, enabling the detector to maintain high detection accuracy for small targets. Fuentes et al. [17] proposed incorporating refined filter banks into deep neural networks to address the class imbalance of the tomato dataset. Ale et al. [18] proposed a lightweight deep neural network-based plant disease detection method that reduces the model size and parameter set. Zhao et al. [19] proposed the use of the YOLOv2 algorithm for tomato disease detection. Latif et al. [20] improved the ResNet model for pest and disease detection and increased the detection accuracy. Prabhakar et al. [21] proposed the use of ResNet 101 to determine the severity of early leaf blight in tomatoes. Pattnaik et al. [22] proposed a transfer learning-based Convolutional Neural Network (CNN) framework for tomato pest classification. Jiang et al. [23] used a deep learning method to extract features of tomato leaf diseases such as yellow leaf curl virus, bacterial spot, and late blight. Liu et al. [24] proposed a tomato pest and disease detection algorithm based on the YOLOv3 convolutional neural network. Wang et al. [25] proposed the YOLOv3-tiny-IRB algorithm, which improves the feature extraction network, ameliorates the gradient disappearance phenomenon caused by excessive depth of the network structure, and improves the detection accuracy of tomato pests and diseases under occlusion and overlap conditions in real natural environments. To address the problem of lost feature information for small targets during transmission, Huang et al. [26] proposed an automatic identification and detection method for crop leaf diseases based on a fully convolutional-switchable normalized dual-path network (FC-SNDPN) to reduce the influence of complex backgrounds on the image identification of crop diseases and pests.
Although the above studies have strongly demonstrated the effectiveness of CNN structures in the field of plant pest and disease recognition, these models inevitably have problems due to their large number of parameters and high computational complexity. Therefore, many studies have focused on the design of lightweight CNNs. Kamal et al. [27] used a MobileNet architecture based on depth-separable convolution constructed on public datasets for plant disease recognition. Albahli et al. [28] proposed the use of DenseNet-77 as a backbone network for CornerNet, which reduces the model parameters and improves accuracy. Zhong et al. [29] designed LightMixer, a lightweight tomato leaf disease recognition model, to improve the computational efficiency of the entire network architecture and reduce the loss of disease feature information. Chen et al. [30] optimized the MobileNetV2 model using an augmented loss function approach for recognizing rice diseases and pests in complex contexts. The above studies performed well in the field of plant disease recognition, but the lightweight design and accuracy of the models can still be further improved.

3. YOLOv5 Object Detection Algorithm

The YOLOv5 object detection algorithm was released by Ultralytics in June 2020 and has since maintained a fast iteration pace, currently at version 7.0. Compared with the newer YOLOv8, YOLOv5 has a simpler model architecture and faster inference because its models are smaller and simpler, whereas YOLOv8 achieves higher accuracy at the cost of a more complex model. After weighing detection accuracy, model size, and detection speed, we chose to carry out the research and improvement work in this paper based on YOLOv5n because it is more suitable for real-time detection tasks. The YOLOv5 family contains five network models, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Each model has the same structure but differs in depth and width; YOLOv5n has the smallest depth and width and therefore the smallest number of parameters and lowest computational complexity, resulting in faster inference. Considering practical application scenarios, this paper selects YOLOv5n as the model to improve, and the whole network structure is shown in Figure 1.
The network structure of YOLOv5n mainly consists of four parts: the input, the backbone, the neck, and the prediction head. The backbone consists of three modules: CBS, C3_1, and SPPF. The CBS module consists of a convolution, batch normalization (BN), and the SiLU activation function. The C3_1 module is a stack of CBS modules with a residual structure that divides the input feature map into two parts: one part is processed by a small convolutional network, while the other part is passed directly onward, and the two parts are then concatenated as the input of the next layer. This design reduces the parameters and computation of the network while improving the efficiency of feature extraction, thus speeding up model training and inference, as shown in Figure 2.
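To make this structure concrete, the following is a minimal PyTorch sketch of a CBS block and a simplified C3-style block; the class names, channel counts, and bottleneck depth are illustrative assumptions and do not reproduce the exact Ultralytics implementation.

import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block described above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C3Like(nn.Module):
    """Simplified C3-style block: split into two branches, apply bottleneck CBS
    stacks with a residual connection to one branch, then concatenate and fuse."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.branch1 = CBS(c_in, c_hid)
        self.branch2 = CBS(c_in, c_hid)
        self.bottlenecks = nn.Sequential(
            *[nn.Sequential(CBS(c_hid, c_hid, 1), CBS(c_hid, c_hid, 3)) for _ in range(n)])
        self.fuse = CBS(2 * c_hid, c_out)

    def forward(self, x):
        y1 = self.branch1(x)
        y1 = y1 + self.bottlenecks(y1)                 # residual path
        y2 = self.branch2(x)                           # shortcut path
        return self.fuse(torch.cat([y1, y2], dim=1))   # concatenate and fuse

x = torch.randn(1, 32, 80, 80)
print(C3Like(32, 64)(x).shape)   # torch.Size([1, 64, 80, 80])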
The SPPF module is a spatial pyramid pooling structure that achieves adaptive output sizes. Unlike traditional pooling structures, the output size of SPPF is independent of the input size, and it can achieve a fixed-dimensional output. The SPPF structure first performs pooling operations of different sizes on the input feature map, then uses convolutional operations to fuse the results of different scales, and finally outputs the fused feature map. The neck uses the PAN (Path Aggregation Network) + FPN (Feature Pyramid Networks) [31] network structure, which mainly consists of CBS, C3_2, Concat, and upsampling modules. C3_1 and C3_2 are used in the backbone and neck of the network, respectively, and the only difference between them is the BottleNeck structure inside, as shown in Figure 3.
The purpose of FPN and PAN is to achieve feature fusion. The FPN network generates a feature pyramid with different resolutions by connecting feature maps from different levels of the network. These feature pyramids can capture object information at different scales. The PAN network generates a feature pyramid with different resolutions and semantic information by successively merging higher-level feature maps with lower-level ones. In complex scenarios, using FPN and PAN networks, YOLOv5 can detect objects of different shapes and sizes, improving the robustness and reliability of object detection. The prediction part uses the three feature maps of different scales obtained in the neck to predict large, medium, and small objects. First, these feature maps are divided into grid cells, and then predictions are made for each cell using three anchor boxes. Finally, each detection box outputs a feature vector that includes the classification probabilities and the object confidence. When an object has multiple prediction boxes, NMS (non-maximum suppression) [32] is used to filter the target boxes.
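As an illustration of this final filtering step, the snippet below applies torchvision's built-in non-maximum suppression to a handful of made-up candidate boxes; the coordinates, scores, and IoU threshold are illustrative only, not outputs of the model described in this paper.

import torch
from torchvision.ops import nms

# Hypothetical candidate detections for one class, boxes in (x1, y1, x2, y2) format.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],    # heavily overlaps the first box
                      [150., 150., 220., 220.]])
scores = torch.tensor([0.90, 0.75, 0.60])        # objectness x class probability

keep = nms(boxes, scores, iou_threshold=0.45)    # indices of the boxes that survive
print(keep)                                      # tensor([0, 2]): the redundant box is removed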

4. Methods

4.1. Backbone Network Improvements

Although YOLOv5’s CSPDarknet53 backbone network performs well in object detection tasks, it has some drawbacks. For example, it has high computational and storage costs because CSPDarknet53 contains multiple convolutional and pooling layers that require large amounts of computational and storage resources for training and inference. Deploying CSPDarknet53 can be difficult on resource-constrained devices such as edge and mobile devices. The network structure of CSPDarknet53 is relatively fixed, making it difficult to customize and extend. As a result, CSPDarknet53 may not be able to meet the requirements of some specific application scenarios, such as those requiring specific receptive fields or specific feature extraction capabilities. In addition, although YOLOv5 provides several versions of models, including YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, these models may still have high computational and storage costs on resource-constrained devices. To address these issues, the Google Brain team proposed the Vision Transformer (ViT) model [33] in 2020. The ViT model is an image classification model based on the Transformer architecture. It divides the image into a fixed number of image patches and uses each patch as an input to the model. ViT uses the multi-head self-attention mechanism of the Transformer encoder to process image data, so the model does not require convolutional operations to process images and therefore has better scalability and generalization ability. In the ViT model, each image patch is converted to a vector and then input to a multi-layer Transformer encoder for processing. The encoder maps the input vector sequence to another vector sequence, where each vector contains all the information in the sequence. To make the ViT model adaptable to images of different sizes, a deformable attention mechanism is introduced to adapt to different spatial structures when processing image patches. The structure is shown in Figure 4.
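The patch-and-encode pipeline described above can be sketched in a few lines of PyTorch; the patch size, embedding width, head count, and depth below are illustrative placeholders rather than the configuration of any particular ViT variant.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token vector."""
    def __init__(self, patch_size=16, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 192)

# Multi-head self-attention encoder applied to the sequence of patch tokens.
# (A full ViT also adds learned positional embeddings and, for classification, a class token.)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True), num_layers=4)
features = encoder(tokens)                            # (1, 196, 192)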
The ViT model partially compensates for the shortcomings of traditional CNN backbone networks, such as the ability to process large images, computational and memory costs, generalization, and flexibility. Subsequently, the MIT-IBM Watson AI Lab improved the traditional ViT model to obtain the EfficientViT model. The EfficientViT model can maintain high accuracy while having lower computational and memory costs. It also enhances the multi-scale attention mechanism and has greater scalability, which can be adapted to different tasks and devices. For example, by increasing or decreasing the number of attention heads, accuracy can be increased or decreased, and the depth of the model can be adjusted to balance accuracy and computational cost. The structure of EfficientViT is shown in Figure 5.
MBConv [34] is a module for building efficient neural networks on mobile and embedded devices. It can reduce the number of parameters and computation while improving the accuracy of the model by using Depthwise Separable Convolution (DSC) and Inverted Residual Connection.
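A simplified version of the MBConv idea (an inverted residual built from a 1 x 1 expansion, a depthwise 3 x 3 convolution, and a 1 x 1 projection) might look like the following PyTorch sketch; the expansion ratio and activation choice are assumptions made for illustration.

import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified inverted residual block: expand -> depthwise conv -> project, with a skip."""
    def __init__(self, channels, expand=4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),   # depthwise conv
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels), # linear projection
        )

    def forward(self, x):
        return x + self.block(x)   # residual connection (valid for stride-1, equal-channel blocks)

print(MBConv(32)(torch.randn(1, 32, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])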
Lightweight MSA is a lightweight multi-scale attention mechanism in the EfficientViT model that is used to capture features at different scales. It consists of scale grouping modules, lightweight attention modules, and scale fusion modules, which are responsible for partitioning the input feature tensor into several groups of different scales, with each group containing features of the same scale. Attention weighting is performed on the features of each scale to capture feature information at different scales. This module uses a set of lightweight convolutional layers and attention mechanisms to reduce computational and memory costs. The weighted feature tensors are then fused in the channel dimension to produce the final multi-scale feature representation. This multi-scale attention mechanism can effectively capture feature information at different scales in the input image and fuse them together to generate more expressive and rich feature representations. In addition, this mechanism has lower computational and memory costs due to the use of lightweight attention modules. The structure of the lightweight MSA is shown in Figure 6.
Therefore, in this article, the EfficientViT, a model improved from the traditional ViT model, is used as the backbone network, replacing the original backbone network in YOLOv5.

4.2. Up-Sampling Improvements

In the feature fusion network, YOLOv5 uses nearest-neighbor interpolation for upsampling, which only considers the position of the pixel points and does not fully exploit the information in the feature map. This can reduce the quality of the upsampled feature map. To address this issue, this paper proposes to use a lightweight and general upsampling operator called Content-Aware ReAssembly of Features (CARAFE) to replace nearest-neighbor interpolation and obtain a higher-quality upsampled feature map. The CARAFE operator consists of two blocks: the feature reassembly module and the up-sampling kernel prediction module. The upsample kernel prediction module analyzes and encodes the input feature map to predict the upsample kernels corresponding to different positions of the feature points. The feature reorganization module then performs upsampling using the predicted upsample kernels. Compared with nearest-neighbor interpolation, the CARAFE operator makes better use of the semantic information in the feature map during the upsampling process, thereby improving the quality of the upsampled feature map. The structure of the CARAFE module is shown in Figure 7.
The upsampling kernel prediction module consists of a content encoding sub-module, a kernel normalization sub-module, and a channel compression sub-module. First, the channel compression sub-module reduces the computational cost by using a $1 \times 1$ convolutional layer to compress the channel dimension of the input feature map. Then, the content encoding sub-module uses a convolutional layer of size $k_{encoder} \times k_{encoder} \times \sigma^{2} k_{up}^{2}$ to encode the deep semantic information contained in each feature point and its surrounding points in the input feature map, generating an upsampling kernel of shape $H \times W \times \sigma^{2} k_{up}^{2}$. The channel dimension of the upsampling kernel is then expanded into the width and height dimensions, resulting in an expanded upsampling kernel of shape $\sigma H \times \sigma W \times k_{up}^{2}$, where $k_{up}^{2}$ is the size of the upsampling kernel for a single feature point and $\sigma$ is the upsampling rate. Finally, the kernel normalization sub-module uses the softmax function to normalize the predicted upsampling kernel. Overall, the upsampling kernel prediction module uses the semantic information in the input feature map to adaptively generate corresponding upsampling kernels for different feature points.
The role of the feature reassembly module is to map each feature point in the output feature map back to the input feature map and extract a region of size $k_{up} \times k_{up}$ centered on that feature point. The feature reassembly module then performs a dot product operation between the extracted region and the upsample kernel predicted by the upsample kernel prediction module for that point, generating the upsampled feature for that point. Since the feature reassembly module pays more attention to the information contained in the relevant feature points in the local region during the reassembly process, the reassembled feature map usually contains richer semantic information and is more expressive than the original feature map.
Compared with the original nearest-neighbor interpolation upsampling, the CARAFE upsampling operator can aggregate contextual semantic information within a larger receptive field and perform adaptive upsampling operations for different feature points using the predicted upsampling kernels. This operation effectively reduces the loss, ensures the integrity of the feature information, and improves the quality of the upsampled feature map.
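The two modules can be put together in a simplified PyTorch sketch as below; it follows the kernel-prediction-then-reassembly scheme described above but is not the official CARAFE implementation, and the compressed channel width, kernel sizes, and upsampling rate are assumed values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCARAFE(nn.Module):
    """Content-aware upsampling: predict a k_up x k_up kernel for every output position,
    then reassemble the corresponding input neighborhood as a weighted sum."""
    def __init__(self, c, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                           # channel compression
        self.encode = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2, k_enc,
                                padding=k_enc // 2)                      # content encoding
        self.shuffle = nn.PixelShuffle(scale)                            # -> (k_up^2, sH, sW)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = F.softmax(self.shuffle(self.encode(self.compress(x))), dim=1)  # kernel normalization
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)        # k_up x k_up neighborhoods
        patches = patches.view(b, c, self.k_up ** 2, h, w)
        patches = patches.repeat_interleave(self.scale, dim=3)          # each output pixel maps back
        patches = patches.repeat_interleave(self.scale, dim=4)          # to its source neighborhood
        return (patches * kernels.unsqueeze(1)).sum(dim=2)              # weighted reassembly

up = SimpleCARAFE(c=128)
print(up(torch.randn(1, 128, 20, 20)).shape)   # torch.Size([1, 128, 40, 40])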

4.3. Bounding Box Regression Loss Function Improvement

The loss function of YOLOv5 is composed of three sub-loss functions, namely the confidence loss function, the bounding box regression loss function, and the classification loss function, which are calculated as follows:
$$L_{v5} = \sum_{i}^{N}\left(\lambda_{1} L_{cls} + \lambda_{2} L_{obj} + \lambda_{3} L_{box}\right) = \sum_{i}^{N}\left(\lambda_{1} \sum_{j \in B_{i}} L_{cls}^{j} + \lambda_{2} \sum_{j \in B_{i}} L_{CIoU}^{j} + \lambda_{3} \sum_{j \in S_{i} \times S_{i}} L_{obj}^{j}\right)$$

$L_{box}$ is the bounding box regression loss, $L_{obj}$ is the object confidence loss, and $L_{cls}$ is the object classification loss; $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights of the three losses. The object classification loss and the object confidence loss are calculated using the Binary Cross Entropy (BCE) loss function, as follows:

$$L_{BCE} = -\frac{1}{n}\sum_{i}^{n}\left[y_{i} \log \sigma(x_{i}) + \left(1 - y_{i}\right) \log\left(1 - \sigma(x_{i})\right)\right]$$

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
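As a quick sanity check of the BCE formula above, the following sketch compares a manual computation with PyTorch's built-in numerically stable version; the logits and labels are arbitrary example values.

import torch
import torch.nn.functional as F

logits = torch.tensor([0.8, -1.2, 2.5])     # raw model outputs x_i
targets = torch.tensor([1.0, 0.0, 1.0])     # labels y_i

p = torch.sigmoid(logits)                   # sigma(x_i)
manual = -(targets * p.log() + (1 - targets) * (1 - p).log()).mean()
builtin = F.binary_cross_entropy_with_logits(logits, targets)
print(manual.item(), builtin.item())        # the two values agree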
The bounding box regression loss in version 5.0 and later uses the CIoU loss as the bounding box regression loss function. The formula for calculating the CIoU loss is as follows:
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$

$$IoU = \frac{\left|b \cap b^{gt}\right|}{\left|b \cup b^{gt}\right|}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

$$\alpha = \frac{v}{1 - IoU + v}$$
Compared with the Generalized-IoU (GIoU) loss [35] used in previous versions of YOLOv5, the CIoU loss adds penalty terms for the distance between the centers of the predicted and ground truth boxes and for the consistency of their aspect ratios, on top of the IoU term that measures their overlap. As a result, for the same IoU, a predicted box with a smaller center distance and a more consistent aspect ratio incurs a smaller loss. The CIoU loss thus penalizes incomplete overlap between the boxes more effectively, resulting in a more accurate regression of the predicted box to the shape and size of the ground truth box. However, when the aspect ratio of the predicted box and the ground truth box is linearly related, the aspect ratio penalty term becomes ineffective, which affects the regression of the predicted box. Therefore, in this paper, we propose to replace the CIoU loss with the Wise-IoU (WIoU) loss. The WIoU loss focuses more on the aspect ratio, position offset, and scale changes of the bounding box than the traditional IoU loss and other improved versions such as GIoU and CIoU. In addition, the WIoU loss has an important parameter, the outlier weight, which adjusts the degree of penalty for different IoU values in the bounding box matching process. Specifically, the larger the value of the outlier weight, the stronger the penalty for low-IoU matching results and the weaker the penalty for high-IoU matching results. This allows the bounding box regression to focus on anchor boxes of normal quality, preventing low-quality examples from producing large harmful gradients and making the model more accurate in matching the shape and size of the ground truth box, thereby improving the accuracy of object detection.
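The following PyTorch sketch implements the CIoU loss exactly as written in the equations above for boxes in (x1, y1, x2, y2) format; it is shown for illustration only, and the WIoU loss, whose exact formulation is not reproduced in this paper, should be taken from Tong et al. [11].

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss following the equations above; pred and target are (..., 4) tensors."""
    # IoU term.
    ix1 = torch.max(pred[..., 0], target[..., 0]); iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2]); iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Normalized squared distance between box centers (rho^2 / c^2).
    cpx = (pred[..., 0] + pred[..., 2]) / 2; cpy = (pred[..., 1] + pred[..., 3]) / 2
    ctx = (target[..., 0] + target[..., 2]) / 2; cty = (target[..., 1] + target[..., 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    ex1 = torch.min(pred[..., 0], target[..., 0]); ey1 = torch.min(pred[..., 1], target[..., 1])
    ex2 = torch.max(pred[..., 2], target[..., 2]); ey2 = torch.max(pred[..., 3], target[..., 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps    # squared diagonal of the enclosing box

    # Aspect-ratio consistency term v and its weight alpha.
    wp = pred[..., 2] - pred[..., 0]; hp = pred[..., 3] - pred[..., 1]
    wt = target[..., 2] - target[..., 0]; ht = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(torch.tensor([[20., 20., 80., 90.]]), torch.tensor([[25., 25., 85., 85.]])))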

4.4. Improved Network Structure

As a result of the above discussion, an improved model was designed, as shown in Figure 8. It consists of three parts: the backbone is EfficientViT; the neck retains the YOLOv5 neck structure but with its upsampling replaced by CARAFE; and the head is consistent with YOLOv5. The bounding box loss function is changed to WIoU.

5. Experiments and Results

5.1. Datasets

The experimental dataset in this paper uses the tomato dataset from Kaggle, which contains eight categories of tomato diseases and pests, some of which are shown in Figure 9. In order to increase the amount of training data and improve the generalization ability of the network model, this paper expands the tomato diseases and pests dataset by data augmentation based on the original data. The final dataset contains 20,413 images, including 16,332 images in the training set and 4081 images in the test set. Table 1 shows the statistics for all data. The specific data augmentation methods used in this paper include (1) flipping the original images vertically or horizontally with a probability of 0.3; (2) translating the images over the range 1:1:10; (3) randomly rotating the images with angles ranging from −10° to 10° with a step size of 1°; (4) enhancing the contrast of the images with an enhancement range of 1:1:10; (5) injecting noise into the images with a mean of 0 and a standard deviation in the range 1:1:5; (6) partially masking the original image with a probability of 0.3; and (7) applying a random light transformation. Through these data augmentation methods, the dataset can be expanded, the model’s generalization ability can be improved, and overfitting problems can be effectively avoided.
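A minimal image-level sketch of such an augmentation pipeline, written with torchvision, is shown below; the probabilities and ranges echo the list above but are illustrative assumptions, and in a real detection pipeline the bounding box annotations must be transformed together with the images, which this sketch does not do.

import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(degrees=10),                    # rotations in [-10, 10] degrees
    transforms.ColorJitter(brightness=0.3, contrast=0.3),     # random light / contrast changes
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0.0, 1.0)),  # additive noise
])

# Example usage on a PIL image:
# from PIL import Image
# augmented = augment(Image.open("tomato_leaf.jpg").convert("RGB"))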

5.2. Experimental Environment

The experimental environment in this paper is Ubuntu 20.04, with 80 GB of system memory, an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60 GHz, and an NVIDIA GeForce RTX 2080 Super GPU (8 GB). The deep learning framework used in this paper is PyTorch 1.11.0, and the CUDA version is 11.5.

5.3. Model Evaluation Metrics and Training Parameter Settings

The experimental part of this paper uses mean average precision (mAP) at an IoU threshold of 0.5 (mAP50) and averaged over IoU thresholds from 0.5 to 0.95 (mAP50:95) as metrics for model detection accuracy. FLOPs are used as a measure of model computation; the higher the FLOPs, the higher the computational complexity of the model and the more computational resources are required for training and inference. Parameters denotes the number of parameters to be learned in the model; in general, the more parameters, the more complex the model and the more data and computational resources are required for training and inference. mAP50 and mAP50:95 were chosen as accuracy metrics because they simultaneously assess the model’s ability to locate and classify targets, which reflects detection capability better than precision and recall alone. AP50 is the average precision of each category at an IoU threshold of 0.5, and similarly, AP95 is the average precision of each category at an IoU threshold of 0.95. mAP is the average of the APs of all categories at a given IoU threshold, calculated as follows:
$$mAP = \frac{1}{n}\sum_{j=1}^{n} AP_{j}$$

$$AP_{50} = \frac{1}{n}\sum_{i=1}^{n} P_{i}\big|_{IoU=0.5}\, R_{i}\big|_{IoU=0.5}$$

$$AP_{50:95} = \frac{1}{10}\left(AP_{50} + AP_{55} + \cdots + AP_{95}\right)$$
In these formulas, n is the number of categories; R is the recall rate, i.e., the ratio of the number of samples that are both truly positive and predicted positive to the total number of truly positive samples; and P is the precision rate, i.e., the ratio of the number of samples that are both truly positive and predicted positive to the total number of predicted positive samples.
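For illustration, the short sketch below averages per-class AP values over the ten IoU thresholds 0.50:0.05:0.95 in the way the formulas above describe; the AP values fed in are hypothetical numbers, not results from this paper.

import numpy as np

def summarize_map(ap):
    """ap: array of shape (10, num_classes) holding AP at IoU = 0.50, 0.55, ..., 0.95."""
    ap = np.asarray(ap)
    map_per_threshold = ap.mean(axis=1)                      # mAP at each IoU threshold
    return map_per_threshold[0], map_per_threshold.mean()    # (mAP50, mAP50:95)

# Hypothetical AP values for 3 classes at the 10 IoU thresholds.
aps = np.linspace(0.95, 0.60, 10)[:, None] * np.array([1.00, 0.98, 0.96])
map50, map50_95 = summarize_map(aps)
print(round(map50, 3), round(map50_95, 3))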
For the experimental part, the official YOLOv5 recommended hyperparameters were used for the model training parameter settings. These hyperparameters were selected based on experiments conducted on the MS COCO dataset.

5.4. Experimental Results and Analysis

5.4.1. Experimental Analysis of Improved Backbone Networks

To verify the effectiveness of the improved backbone network, the original CSPDarknet53 backbone network was improved to EfficientViT, leaving the rest of the network unchanged. The model with the improved backbone network is renamed YOLOv5n-V. Experimental comparisons were performed between the two models before and after the improvement, and the results are shown in Table 2.
Figure 10 and Figure 11 show the mAP curves of the original model and the model with the EfficientViT.

5.4.2. Experimental Analysis of the Improved Upsampling Operator

To verify the effectiveness of the improved upsampling operator, the nearest neighbor interpolation upsampling was improved to the CARAFE upsampling, leaving the rest of the network unchanged. The model with the improved upsampling operator is renamed YOLOv5n-C. Experimental comparisons were made between the two models before and after the improvement, and the results are shown in Table 3.
Figure 12 and Figure 13 show the mAP curves for the original model and the model using CARAFE upsampling.

5.4.3. Experimental Analysis of the Improved Bounding Box Regression Loss Function

In order to verify the effectiveness of improving the box loss function to WIoU Loss, the box loss function of the original YOLOv5n was improved from CIoU Loss to WIoU Loss while keeping the rest of the network unchanged. The model with the improved loss function is called YOLOv5n-W. Experimental comparisons were made between the two models before and after the improvement, and the results are shown in Table 4.
Figure 14 and Figure 15 show the mAP curves of the original model and the model with the WIoU Loss replacement.

5.4.4. Ablation Experiments

In this paper, three improvement methods are proposed, namely V (EfficientViT), C (CARAFE upsampling operator), and W (WIoU Loss). In order to test the effectiveness of the three improvement methods, the ablation experiments in this paper are designed in the following two directions:
(1)
Using the original YOLOv5n as a base, only one of the above improvements was added to each group of experiments separately to verify the effectiveness of each improvement method on the original algorithm.
(2)
Based on the finally obtained improved algorithm, YOLOv5n-VCW, each experimental group eliminated only one of the above improvement methods separately to verify the effect of each improvement method on the final improved algorithm.
“✓” indicates the introduction of the corresponding method. The design of the ablation experiments is shown in Table 5, the mean average precision curves during training are shown in Figure 16 and Figure 17, and the PR curve is shown in Figure 18.

5.4.5. Comparison Experiments

To further verify that the improvement algorithm proposed in this paper has certain advantages over other mainstream object detection algorithms in terms of detection accuracy, model size, and detection speed, we compared the proposed improvement algorithm YOLOv5n-VCW with Faster R-CNN, SSD, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x algorithms on this dataset, and the experimental results are shown in Table 6.
Finally, the validation set images are detected using the base model and the improved model. Figure 19 shows the validation set with ground truth labels, and Figure 20 and Figure 21 show the corresponding detection results. From the figures, it can be seen that the improved model generally detects objects with higher confidence and detects more objects.

6. Discussion

Based on the experimental results of this study, the following conclusions can be drawn:
First, by replacing the backbone network with EfficientViT, we found that the number of parameters in the model was significantly reduced while the accuracy was slightly improved. Specifically, mAP@0.5 and mAP@0.5:0.95 increased by 1.1% and 0.6%, respectively, while the model’s floating-point operations decreased by 1.2 GFLOPs, a 28% decrease, and the number of parameters decreased by 0.4 M, a 21% decrease. This indicates that the EfficientViT backbone network adopted in this paper is highly effective in tomato disease and pest detection tasks.
Secondly, we found that improving the original nearest neighbor interpolation upsampling to CARAFE upsampling can improve the model’s detection accuracy with a slight increase in model size and computational complexity. Specifically, mAP@0.5 and mAP@0.5:0.95 increased by 1.3% and 1.0%, respectively, while the increases in the model’s floating-point operations and number of parameters were within an acceptable range. This indicates that the CARAFE upsampling operator adopted in this paper is effective in improving model accuracy in tomato disease and pest detection tasks.
Thirdly, we found that improving the original CIoU Loss to the WIoU Loss can improve model detection accuracy without affecting model size or detection speed. Specifically, mAP@0.5 and mAP@0.5:0.95 increased by 0.4% and 0.6%, respectively. This indicates that the WIoU Loss adopted in this paper is highly effective in tomato disease and pest detection tasks.
Additionally, we found that the proposed improved algorithm, YOLOv5n-VCW, has the highest detection accuracy while ensuring model lightness. Compared with other mainstream object detection algorithms, the proposed algorithm in this paper can maintain high detection accuracy even with a decrease in the number of model parameters and computational complexity, making it highly applicable.
Finally, we found that the proposed YOLOv5n-VCW algorithm has significant advantages in resource-limited situations. Compared with other algorithms, this algorithm has the highest detection accuracy with a smaller number of model parameters and computational complexity. This indicates that the YOLOv5n-VCW algorithm proposed in this paper is significantly effective and superior for tomato disease and pest detection tasks.
However, our study has some limitations. First, this study conducted experiments only on tomato pest and disease detection and did not consider other types of object detection tasks. Second, this study was based on a single dataset, and we did not consider the differences and influences between different datasets. Finally, the lighting and occlusion effects in the experiments were produced through data augmentation, which may differ from real scenes, so the corresponding results remain theoretical; validating them under real field conditions is a direction for future research.
In conclusion, the results of this study show that the proposed improved algorithm theoretically has significant effectiveness and superiority in tomato disease and pest detection tasks. These results provide useful clues for future research and feasible solutions for practical applications. Future research can go a step further to explore how to apply the proposed improved algorithm to other types of object detection tasks and conduct experimental verification, as well as how to improve the real-time performance of the model and better deal with the differences and influences between different datasets.

7. Conclusions

To address the problems of large model size and low detection performance of existing models, this paper proposes an improved tomato pest and disease detection algorithm called YOLOv5n-VCW, based on YOLOv5n. First, EfficientViT is used to replace the feature extraction module of the original YOLOv5, which significantly reduces the computational and parameter costs of the model. Second, the nearest neighbor interpolation upsampling module is replaced with the CARAFE upsampling module to reduce the loss of feature information during upsampling. Finally, the WIoU Loss is used to replace the CIoU Loss as the new target box loss function to optimize the calculation of the loss function. The hyperparameters for training were set as follows: 0.01 for lr0, 0.01 for lrf, 0.937 for momentum, 0.0005 for weight_decay, 3.0 for warmup_epochs, 0.8 for warmup_momentum, 0.1 for warmup_bias_lr, 0.05 for box, and 0.5 for cls.
The experimental results on the tomato pest and disease detection dataset show that the YOLOv5n-VCW model proposed in this paper achieves significantly better detection accuracy than the original YOLOv5n while using fewer parameters and less computation. With only 1.6 M model parameters and 3.3 GFLOPs of computation, its mAP50 and mAP50:95 reach 98.1% and 84.8%, respectively. Compared with other mainstream object detection algorithms in terms of detection accuracy, model size, and computational cost, YOLOv5n-VCW has clear advantages and is more practical and feasible in real-world applications. Next, deployment of the proposed model in mobile or embedded device environments for accurate identification and detection of tomato pests and diseases is our main goal, and further exploration of the proposed model for the detection and identification of other plant pests and diseases will be part of our future plans.

Author Contributions

Conceptualization, S.L., W.H. (Wei Hong) and W.H. (Wenyi Hu); methodology, W.H. (Wei Hong); software, W.H. (Wei Hong); validation, M.L. and W.H. (Wenyi Hu); formal analysis, S.L. and W.H. (Wenyi Hu); investigation, W.H. (Wei Hong); resources, H.W.; data curation, H.W.; writing—original draft preparation, W.H. (Wei Hong) and M.L.; writing—review and editing, W.H. (Wei Hong) and S.L.; visualization, W.H. (Wei Hong) and S.L.; supervision, W.H. (Wenyi Hu); project administration, M.L.; funding acquisition, W.H. (Wenyi Hu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Sichuan Science and Technology Program (2023YFSY0026, 2023YFH0004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In this paper, we use publicly available datasets, including Kaustubh B’s tomato leaf disease detection dataset (https://www.kaggle.com/datasets/kaustubhb999/tomatoleaf (accessed on 27 July 2023)) and Nouaman Lamrhi’s open-source tomato dataset (https://www.kaggle.com/datasets/noulam/tomato (accessed on 27 July 2023)).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, J. Research on tomato bacterial pith necrosis. Plant Dis. Pests 2012, 3, 9. [Google Scholar]
  2. Takayama, M.; Ezura, H. How and why does tomato accumulate a large amount of GABA in the fruit? Front. Plant Sci. 2015, 6, 612. [Google Scholar] [CrossRef] [PubMed]
  3. Manríquez-Altamirano, A.; Sierra-Pérez, J.; Muñoz, P.; Gabarrell, X. Analysis of urban agriculture solid waste in the frame of circular economy: Case study of tomato crop in integrated rooftop greenhouse. Sci. Total Environ. 2020, 734, 139375. [Google Scholar] [CrossRef] [PubMed]
  4. Rehman, A.; Ulucak, R.; Murshed, M.; Ma, H.; Işık, C. Carbonization and atmospheric pollution in China: The asymmetric impacts of forests, livestock production, and economic progress on CO2 emissions. J. Environ. Manag. 2021, 294, 113059. [Google Scholar] [CrossRef]
  5. Li, N.; Yu, Q. Tomato super-pangenome highlights the potential use of wild relatives in tomato breeding. Nat. Genet. 2023, 55, 744–745. [Google Scholar]
  6. Wang, X.Y.; Feng, J.; Zang, L.Y.; Yan, Y.L.; Yang, Y.Y.; Zhu, X.P. Natural occurrence of Tomato chlorosis virus in cowpea (Vigna unguiculata) in China. Plant Dis. 2018, 102, 254. [Google Scholar] [CrossRef]
  7. Arafa, R.A.; Kamel, S.M.; Taher, D.I.; Solberg, S.; Rakha, M.T. Leaf Extracts from Resistant Wild Tomato Can Be Used to Control Late Blight (Phytophthora infestans) in the Cultivated Tomato. Plants 2022, 11, 1824. [Google Scholar] [CrossRef]
  8. Ferrero, V.; Baeten, L.; Blanco-Sánchez, L.; Planelló, R.; Díaz-Pendón, J.A.; Rodríguez-Echeverría, S.; Haegeman, A.; Peña, E. Complex patterns in tolerance and resistance to pests and diseases underpin the domestication of tomato. New Phytol. 2020, 226, 254–266. [Google Scholar] [CrossRef]
  9. Han, C.; Gan, C.; Han, S. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv 2022, arXiv:2205.14756. [Google Scholar]
  10. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  12. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154. [Google Scholar] [CrossRef]
  13. Tan, P.S.; Lim, K.M.; Lee, C.P. Human action recognition with sparse autoencoder and histogram of oriented gradients. In Proceedings of the 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Malaysia, 26–27 September 2020. [Google Scholar]
  14. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  15. Mokhtar, U.; Ali, M.A.; Hassanien, A.E.; Hefny, H. Identifying two of tomatoes leaf viruses using support vector machine. In Information Systems Design and Intelligent Applications, Proceedings of the Second International Conference INDIA 2015, Kalyani, India, 8–9 January 2015; Springer: Berlin/Heidelberg, Germany, 2015; Volume 1. [Google Scholar]
  16. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  17. Fuentes, A.F.; Yoon, S.; Lee, J.; Park, D.S. High-performance deep neural network-based tomato plant diseases and pests diagnosis system with refinement filter bank. Front. Plant Sci. 2018, 9, 1162. [Google Scholar] [CrossRef] [PubMed]
  18. Ale, L.; Sheta, A.; Li, L.; Wang, Y.; Zhang, N. Deep learning based plant disease detection for smart agriculture. In Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  19. Zhao, J.; Qu, J. Healthy and diseased tomatoes detection based on YOLOv2. In Proceedings of the Human Centered Computing: 4th International Conference, HCC 2018, Mérida, Mexico, 5–7 December 2018; Revised Selected Papers 4. Springer International Publishing: New York City, NY, USA, 2019. [Google Scholar]
  20. Latif, G.; Alghazo, J.; Maheswar, R.; Vijayakumar, V.; Butt, M. Deep learning based intelligence cognitive vision drone for automatic plant diseases identification and spraying. J. Intell. Fuzzy Syst. 2020, 39, 8103–8114. [Google Scholar] [CrossRef]
  21. Prabhakar, M.; Purushothaman, R.; Awasthi, D.P. Deep learning based assessment of disease severity for early blight in tomato crop. Multimed. Tools Appl. 2020, 79, 28773–28784. [Google Scholar] [CrossRef]
  22. Pattnaik, G.; Shrivastava, V.K.; Parvathi, K. Transfer learning-based framework for classification of pest in tomato plants. Appl. Artif. Intell. 2020, 34, 981–993. [Google Scholar] [CrossRef]
  23. Jiang, D.; Li, F.; Yang, Y.; Yu, S. A tomato leaf diseases classification method based on deep learning. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020. [Google Scholar]
  24. Liu, J.; Wang, X. Tomato diseases and pests detection based on improved Yolo V3 convolutional neural network. Front. Plant Sci. 2020, 11, 898. [Google Scholar] [CrossRef]
  25. Wang, X.; Liu, J.; Liu, G. Diseases detection of occlusion and overlapping tomato leaves based on deep learning. Front. Plant Sci. 2021, 12, 792244. [Google Scholar] [CrossRef]
  26. Huang, X.; Chen, A.; Zhou, G.; Zhang, X.; Wang, J.; Peng, N.; Yan, N.; Jiang, C. Tomato leaf disease detection system based on FC-SNDPN. Multimed. Tools Appl. 2023, 82, 2121–2144. [Google Scholar] [CrossRef]
  27. Kc, K.; Yin, Z.; Wu, M.; Wu, Z. Depthwise separable convolution architectures for plant disease classification. Comput. Electron. Agric. 2019, 165, 104948. [Google Scholar] [CrossRef]
  28. Albahli, S.; Nawaz, M. DCNet: DenseNet-77-based CornerNet model for the tomato plant leaf disease detection and classification. Front. Plant Sci. 2022, 13, 957961. [Google Scholar] [CrossRef] [PubMed]
  29. Zhong, Y.; Teng, Z.; Tong, M. LightMixer: A novel lightweight convolutional neural network for tomato disease detection. Front. Plant Sci. 2023, 14, 1166296. [Google Scholar] [CrossRef]
  30. Chen, J.; Zhang, D.; Zeb, A.; Nanehkaran, Y.A. Identification of rice plant diseases using lightweight attention networks. Expert Syst. Appl. 2021, 169, 114514. [Google Scholar] [CrossRef]
  31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  32. He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  34. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  35. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Figure 1. Structure of the YOLOv5n.
Figure 2. C3_1 structure diagram.
Figure 3. C3_2 structure diagram.
Figure 4. ViT structure.
Figure 5. EfficientViT structure.
Figure 6. Lightweight MSA structure.
Figure 7. CARAFE module structure.
Figure 8. Improved network structure.
Figure 9. Tomato pest and disease example diagram.
Figure 10. mAP@0.5 curve.
Figure 11. mAP@0.5:0.95 curve.
Figure 12. mAP@0.5 curve.
Figure 13. mAP@0.5:0.95 curve.
Figure 14. mAP@0.5 curve.
Figure 15. mAP@0.5:0.95 curve.
Figure 16. mAP@0.5 curve.
Figure 17. mAP@0.5:0.95 curve.
Figure 18. PR curve.
Figure 19. Labeled images.
Figure 20. YOLOv5n predictive images.
Figure 21. YOLOv5n-VCW predictive images.
Table 1. Statistical table of the dataset.

Class                     Training Set (Images)    Test Set (Images)
healthy                   1700                     425
bacterial spot            1900                     475
early blight              1900                     475
late blight               1850                     462
leaf mold                 1880                     470
powdery mildew            1827                     456
septoria leaf spot        1740                     435
spider mites              1740                     435
mosaic virus              1790                     447
yellow leaf curl virus    1965                     491
Table 2. Improved backbone network validation experiment.

Models        mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n       95.8         83.1              1.9         4.2 G
YOLOv5n-V     96.9         83.7              1.5         3.0 G
Table 3. Improved upsampling operator validation experiment.

Models        mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n       95.8         83.1              1.9         4.2 G
YOLOv5n-C     97.1         84.1              2.0         4.4 G
Table 4. Improved WIoU Loss validation experiment.

Models        mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n       95.8         83.1              1.9         4.2 G
YOLOv5n-W     96.2         83.7              1.9         4.2 G
Table 5. Ablation experiment results.

Models         V    C    W    mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n                       95.8         83.1              1.9         4.2 G
YOLOv5n-V      ✓              96.9         83.7              1.5         3.0 G
YOLOv5n-C           ✓         97.1         84.1              2.0         4.4 G
YOLOv5n-W                ✓    96.2         83.7              1.9         4.2 G
YOLOv5n-VC     ✓    ✓         97.8         84.5              1.6         3.3 G
YOLOv5n-VW     ✓         ✓    97.3         84.1              1.5         3.0 G
YOLOv5n-CW          ✓    ✓    97.7         84.6              2.0         4.4 G
YOLOv5n-VCW    ✓    ✓    ✓    98.1         84.8              1.6         3.3 G
Table 6. Comparison of experimental results with mainstream object detection algorithms.

Models              mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n             95.8         83.1              1.9         4.2 G
YOLOv5s             96.8         83.7              7.2         16.5 G
YOLOv5m             97.1         84.1              21.2        49.0 G
YOLOv5l             97.4         84.3              46.5        109.1 G
YOLOv5x             97.5         84.7              86.7        205.7 G
YOLOv3              92.3         75.9              61.5        155.4 G
SSD                 78.5         59.7              23.6        273.1 G
Faster R-CNN        81.7         64.5              136.6       369.7 G
YOLOv5n-VCW (Ours)  98.1         84.8              1.6         3.3 G
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
