Article

Improved Tomato Leaf Disease Recognition Based on the YOLOv5m with Various Soft Attention Module Combinations

1 Department of Microbiology, Pukyong National University, Busan 48513, Republic of Korea
2 Industry University Cooperation Foundation, Pukyong National University, Busan 48513, Republic of Korea
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(9), 1472; https://doi.org/10.3390/agriculture14091472
Submission received: 17 July 2024 / Revised: 26 August 2024 / Accepted: 27 August 2024 / Published: 29 August 2024
(This article belongs to the Special Issue Machine Vision Solutions and AI-Driven Systems in Agriculture)

Abstract

To reduce production costs, environmental effects, and crop losses, tomato leaf disease recognition must be accurate and fast. Early diagnosis and treatment are necessary to cure and control diseases and to ensure tomato output and quality. The YOLOv5m model was improved by using C3NN modules and the Bidirectional Feature Pyramid Network (BiFPN) architecture. The C3NN modules were designed by integrating several soft attention modules into the C3 module: the Convolutional Block Attention Module (CBAM), Squeeze and Excitation Network (SE), Efficient Channel Attention (ECA), and Coordinate Attention (CA). The C3 modules in the Backbone and Head of the YOLOv5 model were replaced with C3NN modules to improve feature representation and object detection accuracy. The BiFPN architecture was implemented in the Neck of the YOLOv5 model to effectively merge multi-scale features and improve the accuracy of object detection. Among the various combinations of the improved YOLOv5m model, the C3ECA-BiFPN-C3ECA-YOLOv5m achieved a precision (P) of 87.764%, a recall (R) of 87.201%, an F1 of 87.482, an mAP.5 of 90.401%, and an mAP.5:.95 of 68.803%. In comparison with the YOLOv5m and Faster-RCNN models, the improved model showed improvements in P by 1.36% and 7.80%, R by 4.99% and 5.51%, F1 by 3.18% and 6.86%, mAP.5 by 1.74% and 2.90%, and mAP.5:.95 by 3.26% and 4.84%, respectively. These results demonstrate that the improved models have effective tomato leaf disease recognition capabilities and are expected to contribute significantly to the development of plant disease detection technology.

1. Introduction

The tomato (Solanum lycopersicum L.) is one of the most widely cultivated and consumed fruits and vegetables worldwide, with an estimated annual market value exceeding $60 billion [1]. Tomato diseases restrict improvements in crop quality and development, and their control accounts for a significant percentage of total production costs [2]. In order to lower processing costs, lessen the environmental impact of chemicals, and decrease the danger of crop loss, precise and rapid disease recognition is crucial [3]. Early disease detection and diagnosis are required for selecting appropriate treatments and are essential both for curing diseases and for stopping their spread [4]. Leaves are the part of the plant that first displays disease symptoms, and the type of disease affecting a plant can be determined accurately from the lesions that different diseases cause on the leaves [4,5].
Currently, farmers can readily identify and categorize plant diseases due to the progress made in engineering disciplines such as artificial intelligence (AI) and image processing [6]. Additionally, the availability of databases containing various images has also contributed to this breakthrough. Machine learning, a subfield of AI, evaluates images from databases and classifies plant diseases more quickly and accurately than farmers are able to do [7].
Object detection algorithms based on deep learning are classified into two main categories according to whether they involve a candidate region extraction step: two-stage algorithms, exemplified by Faster R-CNN, and single-stage algorithms, such as You Only Look Once (YOLO). Although two-stage algorithms such as Faster R-CNN generally display slower detection speeds as a result of their two-step structure, they frequently attain superior detection accuracy. Conversely, YOLO offers a better balance of detection accuracy and speed, owing to its ongoing network architecture enhancements [8]. The concept of YOLOv1 was first introduced by Redmon et al. [9]. During prediction, this system directly applies region generation and target classification, dividing the feature map into a grid. This approach leads to a significant improvement in detection speed. YOLOv2 [10] represents an advancement over YOLOv1, incorporating a Batch Normalization layer after each convolutional layer and eliminating the use of dropout. YOLOv3 [11] is an enhanced iteration that introduces the residual module Darknet53 and the FPN architecture. Its key feature is the ability to predict objects of varying scales and achieve multiscale fusion. YOLOv4 [12], which is an improvement on YOLOv3, features the Path Aggregation Network (PANet), mosaic data augmentation, and the Complete Intersection over Union (CIoU) loss function. Following that, YOLOv5 [13] has gained significant popularity due to its ability to deliver high performance while being lightweight and efficient. Optimization of the network topology improves running speed and reduces memory consumption. It performs exceptionally well in real-time applications and scenarios with restricted resources. YOLOv6 [14] integrates the segmentation and detection tasks with a hybrid matching strategy, which effectively reduces the probability of false detections by incorporating both global and local information. However, this approach may lead to increased model complexity and processing demands. YOLOv7 [15] possesses straightforward yet potent characteristics, featuring a relatively uncomplicated structure that is easy to understand and implement while providing benefits in terms of accuracy and operational speed. YOLOv8 [16] incorporates a hybrid matching method to enhance accuracy, which bears resemblance to YOLOv7 but varies on several crucial points. The architecture of YOLOv8 is possibly more intricate and demands greater computing resources.
Attention is a complex cognitive process that is essential to humans [17,18]. An essential characteristic of perception is that people often do not process whole pieces of information at once. Instead, humans tend to focus on a specific portion of the information when and where it is necessary, while simultaneously disregarding other observable information [17]. This is a way for people to efficiently use their limited processing resources to choose high-value information from huge amounts of data. Attention makes processing visual information far more efficient and accurate [17,19]. Inspired by this, attention modules are incorporated into computer vision models in order to replicate this particular characteristic of the human visual system. Attention modules can be divided into hard and soft attention mechanisms based on their differentiability [20]. The hard attention module is non-differentiable, necessitates reinforcement learning or sampling techniques for training, and focuses on specific input components [21]. The hard attention module is computationally efficient and can concentrate on specific components of the input; however, its training is intricate and may be unstable [22]. In contrast, the soft attention module is differentiable and simple to train, as it allocates weights to all components of the input simultaneously and considers the entire input [23]. The computational costs of the soft attention module are higher, but it offers stable performance and a simple training procedure [24]. Various soft attention modules have been developed and applied in a range of computer vision fields, including the CBAM [25], SE [26], ECA [27], and CA [28].
These soft attention modules have been applied to YOLOv5 to improve image recognition of various fruits, vegetables, and insects in the agricultural field. For example, Lv and Su [29] utilized a modified version of the YOLOv5 framework, incorporating both the CBAM attention module and a transformer encoder, referred to as YOLOv5-CBAM-C3TR, to improve the recognition of apple leaf diseases. The YOLOv5-CBAM-C3TR achieved a mean average precision (mAP.5) of 73.4% and precision and recall rates of 70.9% and 69.5%, respectively, for the recognition of apple leaf diseases. Appe et al. [30] introduced CAM-YOLO, a combination of YOLOv5 and CBAM. Non-Maximum Suppression (NMS) and Distance Intersection over Union (DIoU) were utilized to improve the accuracy of tomato recognition. The CAM-YOLO demonstrates high efficiency in accurately identifying tomatoes that are both overlapped and tiny, with an average precision rate of 88.1%. Jing et al. introduced BC-YOLOv5, a novel approach that incorporates the weighted Bi-directional Feature Pyramid Network (BiFPN) and CBAM to improve tomato disease recognition; additionally, they modified the input channel of the detection layer to further improve its performance. The BC-YOLOv5 obtained a precision of 96.4%, a recall of 98.8%, an mAP.5 of 99.1%, and an mAP.5:.95 of 88.3%. Qi et al. [31] introduced SE-YOLOv5, which combines YOLOv5m with SE, as a means to improve the recognition of tomato virus disease. The SE-YOLOv5 demonstrated a notable accuracy of 91.07% and an mAP.5 of 94.10%. Chen et al. [32] improved the plant disease recognition of YOLOv5 by using an Involution Bottleneck module, SE, and Efficient Intersection over Union (EIoU). The improved YOLOv5 achieved an mAP of 70%, a notable improvement of 5.4% compared with the original YOLOv5. Touko Mbouembe et al. [33] introduced SBCS-YOLOv5s, which integrated SE, BiFPN, Content-Aware Reassembly of Features (CARAFE), and Soft-NMS into YOLOv5s to improve the recognition of tomatoes in natural environments. Dong et al. [34] introduced PestLite, which combined YOLOv5 with Multi-Level Spatial Pyramid Pooling (MTSPPF), ECA, and CARAFE for crop pest detection. The mAP of PestLite increased from 87.9% to 90.7%. Chen et al. [35] introduced YOLO-COF to improve the precision of Camellia oleifera fruit detection in the natural environment. The YOLO-COF utilized the K-means++ clustering algorithm and integrated CA into YOLOv5s, achieving an mAP of 94.10%. Wang et al. [36] introduced YOLO-BLBE, which integrated an innovative I-MSRCR (Improved Multi-Scale Retinex with Color Restoration) method, CA, BiFPN, and Alpha-EIoU for identifying blueberry fruits of different maturities. The YOLO-BLBE achieved an average identification accuracy of 99.58% for mature, 96.77% for semi-mature, and 98.07% for immature fruit. Li et al. [37] introduced MTC-YOLOv5n, which integrated CA, a transformer, a multi-scale training strategy, and a feature fusion network to improve multi-scale cucumber disease detection, achieving an mAP of 84.9%.
Despite extensive research into tomato leaf disease recognition, there is still a critical need to improve detection accuracy and robustness to meet practical needs. To address these persisting issues, this study proposes an enhanced tomato disease detection approach based on the YOLOv5 model. The model aims to improve recognition accuracy across a large number of tomato leaf disease classes and to balance identification performance across the different disease types. The main contributions of this study are summarized as follows:
  • The original C3 module in the Backbone and Head is merged with various soft attention modules (CBAM, SE, ECA, and CA) to form C3NN modules (C3CBAM, C3SE, C3ECA, and C3CA). This strategy enhances the ability of the YOLOv5m model to identify critical patterns and improves overall recognition accuracy.
  • The PANet architecture in the Neck of the YOLOv5m model has been substituted with the BiFPN architecture to improve the performance of the model. The BiFPN architecture enhances small-object recognition accuracy by combining shallow feature maps with detailed spatial information, while maintaining a fast inference speed. Moreover, the bidirectional feature fusion of the BiFPN architecture efficiently captures intricate interactions among features at various scales, resulting in enhanced overall detection performance.
  • Various factorial combinations are produced by combining the C3NN modules in the Backbone and Head with the BiFPN architecture in the Neck to increase the performance of the models.

2. Materials and Methods

2.1. Tomato Leaf Disease Dataset

The data used to make predictions about tomato leaf diseases were obtained from the Kaggle open repository (https://www.kaggle.com/datasets/cookiefinder/tomato-disease-multiple-sources, accessed on 30 December 2023) [38]. The data were augmented with rotations at various angles, mirroring, reduction of image brightness, etc. The dataset contains more than 20,000 images of tomato leaves covering 10 diseases and 1 healthy class. Images of various sizes were resized to a uniform 640 × 640 format and rotated by ±15° to replace the originals, in order to maintain the balance of the various tomato disease images in the dataset and to analyze the identification performance of the model on the various diseases more effectively.
The LabelImg tool (v1.8.1, Label Studio) was utilized to accurately label the images in the dataset. Healthy was assigned the label “HN”, Tomato_mosaic_virus was assigned the label “TM”, Spider_mites Two-spotted_spider_mite was assigned the label “SM”, Target_Spot was assigned the label “TS”, Tomato_Yellow_Leaf_Curl_Virus was assigned the label “TY”, Leaf_Mold was assigned the label “LM”, Septoria_leaf_spot was assigned the label “SL”, Early_blight was assigned the label “EB”, powdery Mildew was assigned the label “PM”, Late_blight was assigned the label “LB”, and Bacterial_spot was assigned the label “BS”. Finally, a total of 5500 images were collected and divided into training sets, validation sets, and test sets in an 8:1:1 ratio. The specific division is shown in Table 1.
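As a concrete illustration of the 8:1:1 split described above, the following minimal sketch divides a flat directory of images and matching YOLO-format label files into training, validation, and test folders. The directory names and paths are hypothetical assumptions, not taken from the paper.

```python
# Hedged sketch of an 8:1:1 train/valid/test split; paths are hypothetical.
import random
import shutil
from pathlib import Path

random.seed(0)
image_dir = Path("tomato_leaf/images")   # hypothetical image folder
label_dir = Path("tomato_leaf/labels")   # matching YOLO-format .txt labels
images = sorted(image_dir.glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.8 * n)],
    "valid": images[int(0.8 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}

for split, files in splits.items():
    for img in files:
        label = label_dir / f"{img.stem}.txt"
        for src, sub in ((img, "images"), (label, "labels")):
            dst = Path("dataset") / split / sub / src.name
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dst)
```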

2.2. Improvement of the YOLOv5 Model

2.2.1. YOLOv5-6.1 Model Highlights

The YOLO model significantly improves the execution speed of the model while maintaining detection accuracy at a level comparable to the target detection network employed in the R-CNN series. Although the various state-of-the-art YOLO models have been developed to improve object recognition accuracy and speed, researchers have used the YOLOv5 model in a variety of fields [39,40,41]. There are many papers on model customization, such as changing structures, which are relevant to this study. Furthermore, it is a lightweight model that is appropriate for real-time applications with limited hardware environments. As a result, we used the YOLOv5 model as the baseline model. The YOLOv5 model is an upgraded version of the YOLOv3 that incorporates a multi-scale prediction methodology. This network enables the concurrent detection of image elements with different dimensions [13,31]. The YOLOv5 model encompasses many model versions that have been specifically designed to achieve a balance between speed, accuracy, and computational efficiency. The YOLOv5s model is very efficient and lightweight, making it ideal for real-time applications. The YOLOv5m model provides a harmonious combination of speed and accuracy. The YOLOv5l model improves accuracy by incorporating additional layers, making it highly suitable for complex image analysis. The YOLOv5x model provides exceptional precision, making it well-suited for applications that need outstanding accuracy. The YOLOv5n model is a very efficient and lightweight version specifically optimized for mobile and edge devices. The primary emphasis is on optimizing inference speed and efficiency rather than accuracy. These variations enable YOLOv5 to exhibit versatility and adaptability to fulfill a wide range of application needs. The YOLOv5m model was chosen for this experiment because it has low resource demands and achieves high detection accuracy. The YOLOv5 model (Figure 1) consists of four components: the Input, Backbone, Neck, and Head.
The Input of the YOLOv5 model is designed to preprocess the raw images before their input into the neural network. To optimize processing efficiency and preserve uniformity, the method involves altering the scale of the images to a predetermined dimension. Subsequently, normalization is implemented to rescale pixel values, which facilitates the training process and enhances convergence [12]. To enhance the capacity of the model to manage fluctuations in the input data, random flipping, cropping, and color jittering are implemented as data augmentation techniques [42]. A recently developed approach in the YOLOv5 model, Mosaic augmentation, allows for the merging of four images into a single image during training, resulting in a significant improvement in the identification of small objects [13]. This preprocessing pipeline ensures that the input data is appropriately prepared for feature extraction by the Backbone.
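For illustration, the snippet below sketches image-level preprocessing and augmentation in the spirit of the steps described above (resizing, normalization, random flips, crops, and color jittering). It is only an approximation: the actual YOLOv5 pipeline is driven by its hyperparameter file, uses letterboxing and Mosaic augmentation, and transforms the bounding boxes along with the images, which this image-only sketch omits.

```python
# Illustrative image-only preprocessing/augmentation sketch (not the exact YOLOv5 pipeline).
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((640, 640)),                                          # fixed input size
    T.RandomHorizontalFlip(p=0.5),                                 # random flipping
    T.RandomResizedCrop(640, scale=(0.8, 1.0)),                    # random cropping
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # color jittering
    T.ToTensor(),                                                  # rescales pixels to [0, 1]
])
```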
The Backbone of the YOLOv5 model is responsible for extracting crucial information from the input images. It achieves this by employing a sequence of convolutional layers with varied kernel sizes, which enables it to capture various levels of abstraction and detail. The Backbone mostly consists of the Conv, C3, and SPPF modules. The C3 module incorporates multiple bottleneck layers, which aid in the acquisition of complex patterns while maintaining computing efficiency. Furthermore, it employs a cross-stage hierarchy to combine partial features, thus enhancing the flow of gradients and improving the propagation of features [43]. The model includes CSPDarknet53, which improves the gradient flow and reduces computing complexity while still achieving high accuracy. CSPDarknet53 is an upgraded version of the original Darknet53 model that incorporates Cross-Stage Partial Networks (CSPNet) [44]. This integration aims to enhance the model’s learning capability by separating the feature map into two sections and integrating them through a cross-stage hierarchy. In addition, the SPPF module achieves multi-scale feature fusion by employing max-pooling operations with different kernel sizes and then combining the results by concatenation [45]. This technique enlarges the receptive field and improves feature representation, all while maintaining computational efficiency. The architecture guarantees that abundant and hierarchical features are extracted for subsequent processing by the Neck and Head components.
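To make the Backbone components concrete, the following is a simplified PyTorch sketch of the Conv, Bottleneck, and C3 blocks in the style of the Ultralytics YOLOv5 implementation; expansion ratios, grouping, and auto-padding details are omitted, so treat it as an approximation rather than the exact code.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Standard YOLOv5-style block: Conv2d -> BatchNorm -> SiLU."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck used inside C3."""
    def __init__(self, c1, c2, shortcut=True):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c_, c2, 3)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    """CSP-style block: two parallel 1x1 branches, one passing through n bottlenecks, then concatenated."""
    def __init__(self, c1, c2, n=1, shortcut=True):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c1, c_, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```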
The Neck of the YOLOv5 model plays a crucial role in collecting and improving the features obtained by the Backbone. The Neck incorporates Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) architectures to enhance the representation of features of different dimensions [46]. FPN architecture enables the model to utilize both high-level semantic information and low-level detailed information, which is essential for accurately detecting objects of different dimensions [47]. PAN architecture improves information transmission by creating connections across several levels using both hierarchical and non-hierarchical channels, leading to enhanced integration of features [48]. The Neck is fitted with additional convolutional layers to enrich the features before transmitting them to the Head for optimal detection. Utilizing multi-scale feature aggregation enhances the model’s capacity to precisely detect objects of different dimensions.
The Head of the YOLOv5 model is responsible for predicting the bounding boxes, objectness scores, and class probabilities for every detected object. It comprises a sequence of convolutional layers that analyze the refined features from the Neck to generate the final predictions [49]. To efficiently manage objects of varying proportions, the Head utilizes anchor boxes with a variety of sizes and ratios. Non-Maximum Suppression (NMS) is employed to eliminate duplicate detections, thereby ensuring that only the most confident predictions are retained [50]. The CIoU loss function is also included in the Head of the YOLOv5 model, which provides a more precise evaluation of the overlap between predicted and ground truth boxes. This leads to improved localization performance [51]. The YOLOv5 model achieves exceptional precision in object detection tasks by combining these approaches in the Head.

2.2.2. Attention Modules

Attention modules have lately been crucial in the development of deep learning models, particularly when dealing with complex image issues [52]. By giving distinct weights to the input components of the network, the attention module allows the model to ignore unimportant information and concentrate on what matters. This can significantly enhance the capacity of the model to extract features from complicated backgrounds [53].
The CBAM [25] (Figure 2a) is a versatile and efficient module that can easily be integrated into any convolutional neural network (CNN) without significant additional computational costs. It can be trained together with the underlying CNNs, ensuring a smooth end-to-end training process. It is made up of two sub-modules: a channel attention module (CAM) and a spatial attention module (SAM) [54]. The CAM enhances the network’s focus on the prominent features and relevant regions of the image. The SAM enhances the network’s focus on areas that have abundant contextual information from the entire image. The combination of these two sub-modules produces more comprehensive and trustworthy attention information, which in turn provides better guidance for computer resource allocation [55].
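A minimal PyTorch sketch of CBAM following the description above (channel attention from pooled descriptors, followed by spatial attention from a 7 × 7 convolution); the reduction ratio and kernel size are the commonly used defaults, not values reported in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: channel weights from global average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """SAM: spatial map from channel-wise average and max, fused by a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```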
The SE [26] (Figure 2b) incorporates a channel-wise attention mechanism to improve the network’s ability to represent information by explicitly capturing the relationships between different channels. The process consists of a squeeze operation that combines global spatial information using global average pooling, in addition to an excitation operation that captures channel-wise dependencies across two completely linked layers [56]. This process adjusts the feature maps by assigning new weights to the channels, allowing the network to prioritize more informative features. The SE greatly enhances the performance of diverse convolutional neural networks across a range of tasks [31].
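The squeeze and excitation steps map directly to a few lines of PyTorch; the sketch below uses the usual reduction ratio of 16, which is an assumption rather than a value specified here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling -> two FC layers -> channel re-weighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # squeeze: global spatial information
        self.fc = nn.Sequential(                        # excitation: channel-wise dependencies
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                    # assign new weights to the channels
```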
The ECA [27] (Figure 2c) module is intended to enhance the channel attention process by explicitly modeling local cross-channel interactions and avoiding dimensionality reduction. In contrast to other attention methods that utilize fully connected layers, ECA utilizes a 1D convolution with a meticulously selected kernel size to effectively capture channel dependencies. This method maintains accuracy while reducing computational load, thereby achieving a balance between performance and complexity [57]. ECA improves the representational capability of CNNs, resulting in improved performance across a variety of visual recognition tasks [54].
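A compact sketch of ECA as described above: a 1D convolution over the pooled channel descriptor replaces the fully connected layers and avoids dimensionality reduction. The original paper derives the kernel size adaptively from the channel count; a fixed kernel size of 3 is assumed here for brevity.

```python
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: 1D convolution across channels, no dimensionality reduction."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (B, C, 1, 1) -> (B, 1, C): treat the channel descriptor as a 1D sequence
        y = self.pool(x).squeeze(-1).transpose(1, 2)
        y = self.sigmoid(self.conv(y)).transpose(1, 2).unsqueeze(-1)
        return x * y
```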
The CA [28] module (Figure 2d) improves feature representation by encoding exact positional information and detecting long-range correlations across spatial dimensions. Unlike other attention methods, CA divides attention into two parallel processes along the height and width axes. This enables the network to aggregate characteristics more effectively while still retaining spatial information. CA enhances the performance of CNN in a variety of tasks by including coordinated information in the channel attention process. This approach is especially useful for mobile networks, where computing performance is critical.
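A simplified PyTorch sketch of Coordinate Attention following the description above: features are pooled separately along the height and width axes, encoded jointly, and split back into direction-aware channel weights. The reduction ratio and the use of ReLU instead of the original hard-swish activation are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate Attention: direction-aware channel weights from H- and W-wise pooling."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)          # the original uses hard-swish
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # pool along width  -> (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # pool along height -> (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w
```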

2.2.3. BiFPN Architecture

The BiFPN was used in the YOLOv5 model rather than the PANet. These two advanced feature pyramid designs (Figure 3) aim to improve multi-scale feature fusion and, thus, object recognition [58]. The PANet adds an additional bottom-up path augmentation to improve information flow [12]. By offering a supplementary channel that combines data from lower to higher levels, this structure enhances the top-down pathway included in the conventional FPN and improves localization accuracy for tiny objects. Furthermore, adaptive feature pooling is used by the PANet to further improve feature maps before detection, boosting the model’s accuracy and resilience. On the other hand, the BiFPN is renowned for its effectiveness and simplicity [47]. It uses a bidirectional structure with learnable weights to fuse features at various scales, improving gradient flow and feature refinement. It introduces the idea of weighted feature fusion, in which the relative importance of each input feature is learned and adjusted during training, enabling more adaptable and efficient feature integration from different backbone network levels.
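The core of BiFPN's weighted feature fusion can be sketched as a small module: each incoming feature map receives a learnable non-negative weight, and the weights are normalized before summation (the "fast normalized fusion" of the EfficientDet paper). The shapes in the usage line are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of feature maps that share the same shape."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # learnable importance per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)             # keep weights non-negative
        w = w / (w.sum() + self.eps)       # normalize so they sum to ~1
        return sum(wi * xi for wi, xi in zip(w, inputs))

# Usage sketch with placeholder shapes: fusing a top-down and a lateral feature map.
fuse = WeightedFusion(num_inputs=2)
p_out = fuse([torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)])
```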

2.2.4. The Proposed Model

We improved the YOLOv5m model to enhance the detection accuracy of tomato leaf diseases. This study introduces a novel approach for addressing the problem by implementing the C3NN-BiFPN-C3NN-YOLOv5m models, which incorporate the C3NN and BiFPN components into the YOLOv5m model. The C3NN modules merge various soft attention modules, including the CBAM, SE, ECA, and CA, with the C3 module of the YOLOv5m model. The C3 modules in the Backbone and Head of the YOLOv5m model were replaced by C3NN modules in a factorial design. The feature extraction characteristics of the various C3NN modules were compared and are shown in Figure 4; among the various feature extraction channels, only channel 1 is shown. Then, the BiFPN was adopted as a replacement for the initial PANet architecture in the YOLOv5m model because of its notable benefits, such as its efficient design, minimal processing requirements, and excellent gradient flow [46]. The C3NN-BiFPN-C3NN-YOLOv5m models improve the capability of the model to acquire and integrate data at several scales, resulting in enhanced accuracy and resilience in detecting small and complex objects. The architecture of the C3NN-BiFPN-C3NN-YOLOv5m model is shown in Figure 1.
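As a rough illustration of the C3NN idea, the sketch below builds a C3ECA block by attaching the ECA module to a C3 block, reusing the C3 and ECA sketches given earlier. The exact point at which the attention module is inserted inside C3 is not detailed here, so this composition is an assumption rather than the authors' implementation.

```python
import torch.nn as nn

class C3ECA(nn.Module):
    """Hypothetical C3NN composition: a C3 block followed by ECA channel attention."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        self.c3 = C3(c1, c2, n)   # C3 block from the earlier sketch
        self.att = ECA()          # ECA module from the earlier sketch

    def forward(self, x):
        return self.att(self.c3(x))
```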

2.3. Experimental Environment and Setting

The experiments in this study ran on Windows 10 with the following specifications: 64 GB of RAM (SK Hynix, Icheon, Republic of Korea), an Nvidia GeForce RTX 4090 GPU (MSI, Taipei, Taiwan), and an Intel Core i7-14700KF CPU (Intel, Santa Clara, CA, USA) running at a base clock speed of 3.4 GHz. The chosen model framework was PyTorch, with Compute Unified Device Architecture (CUDA) 12.1 and Python 3.8.10 for implementation. The Adam optimizer was utilized. The input image size was configured to 640 × 640 pixels. The learning rate, number of epochs, and batch size were set to 0.001, 400, and 16, respectively. During training, early stopping was triggered when there was no improvement for 100 epochs.
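For reference, the training settings above can be summarized as a plain configuration dictionary. The key names loosely mirror common YOLOv5 training options, but the exact flag names and file paths are illustrative assumptions rather than the authors' actual commands.

```python
# Training configuration summarized from the text; names and paths are illustrative.
train_cfg = {
    "weights": "yolov5m.pt",        # YOLOv5m baseline checkpoint
    "data": "tomato_leaf.yaml",     # hypothetical dataset description file
    "imgsz": 640,                   # input image size (pixels)
    "epochs": 400,
    "batch_size": 16,
    "optimizer": "Adam",
    "lr0": 0.001,                   # initial learning rate
    "patience": 100,                # early stopping after 100 epochs without improvement
    "device": 0,                    # single RTX 4090 GPU
}
```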
This study used several evaluation metrics—precision (P) [59], recall (R) [60], average precision (AP) [61], mean average precision (mAP) [62], F1 score [63], detection speed, GFlops, and model weight—to obtain a full understanding of how well the enhanced model performed.
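For clarity, the precision, recall, and F1 metrics reported below reduce to simple ratios of true positives (TP), false positives (FP), and false negatives (FN) at a given IoU threshold; AP averages precision over recall levels per class, and mAP averages AP over classes (and over IoU thresholds 0.5 to 0.95 for mAP.5:.95). The example counts in the sketch are made up for illustration.

```python
# Minimal definitions of P, R, and F1 from detection counts (example numbers are illustrative).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

p, r = precision(87, 12), recall(87, 13)   # 87 correct detections, 12 false alarms, 13 misses
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```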

3. Results and Discussion

To improve tomato leaf disease recognition, the YOLOv5m model was enhanced using C3NN modules, which combine the C3 module with different soft attention modules (CBAM, SE, ECA, and CA). The research involved integrating the C3NN modules into the Backbone and Head components of the model individually and then comparing the outcomes with those of the original YOLOv5 and the proposed models. The results of the testing process are presented in Table 2.

3.1. Insertion of C3NN into the Backbone or Head of the YOLOv5m Model

The integration of C3NN into the Backbone showed the following outcomes: The C3CBAM-YOLOv5m model achieved P of 82.442%, R of 87.025%, F1 of 84.672, mAP.5 of 88.823%, and mAP.5:.95 of 65.530%. The C3SE-YOLOv5m model achieved P of 84.057%, R of 87.787%, F1 of 85.882, mAP.5 of 89.891%, and mAP.5:.95 of 66.876%. The C3ECA-YOLOv5m model achieved P of 86.397%, R of 85.554%, F1 of 85.973, mAP.5 of 89.670%, and mAP.5:.95 of 66.703%. The C3CA-YOLOv5m model achieved P of 84.014%, R of 86.743%, F1 of 85.357, mAP.5 of 89.724%, and mAP.5:.95 of 67.518%. These results indicated that C3NN modules effectively enhance feature representation in the Backbone, leading to improved object detection capability.
On the other hand, the integration of C3NN into the Head also showed improved outcomes, but these improvements were not as significant as those observed with Backbone integration. The H-C3CBAM-YOLOv5m model achieved P of 86.727% (a 0.16% improvement), R of 86.321% (3.93%), F1 of 86.524 (2.05%), mAP.5 of 89.763% (1.02%), and mAP.5:.95 of 66.475%.
When comparing the two methods, it was observed that the YOLOv5 model exhibited a significant performance improvement when C3NN modules were integrated into the Backbone. This improvement may be attributed to the critical role performed by the C3NN modules in early feature extraction.

3.2. Insertion of BiFPN into the YOLOv5m Model

To further enhance the performance of the YOLOv5m model for tomato leaf disease recognition, we experimented with integrating the BiFPN architecture in combination with various C3NN modules (C3CBAM, C3SE, C3ECA, and C3CA).
When C3NN modules were integrated into the Backbone together with BiFPN, the C3CA-BiFPN-YOLOv5m model showed notable improvements, achieving P of 85.973%, R of 84.363%, F1 of 85.160, mAP.5 of 90.691%, and mAP.5:.95 of 68.552%. Compared with the original YOLOv5m, the C3NN-BiFPN models improved the main evaluation metrics, such as R (by 1.57–5.92%), F1 (0.27–1.01%), mAP.5 (0.17–2.07%), and mAP.5:.95 (0.51–2.89%). This suggested that BiFPN’s ability to efficiently merge features at multiple scales significantly enhances feature representation, leading to better performance in object detection tasks.
When the C3NN modules were integrated into the Head together with BiFPN, the BiFPN-C3SE-YOLOv5m model achieved P of 83.791%, R of 86.977%, F1 of 85.354, mAP.5 of 89.308%, and mAP.5:.95 of 67.160%. Among the BiFPN-C3NN-YOLOv5m models, only BiFPN-C3SE-YOLOv5m demonstrated improvements in the main evaluation metrics compared with the original YOLOv5m: an increase in R by 4.72%, F1 by 0.67%, mAP.5 by 0.51%, and mAP.5:.95 by 0.80%.
The results showed that adding BiFPN to the YOLOv5m model improves feature fusion and multi-scale feature representation, which leads to better recognition.

3.3. Insertion of C3NN into Backbone and Head with BiFPN of the YOLOv5m Model

To further improve the performance of YOLOv5m for tomato leaf disease recognition, various combinations of the C3NN modules in the Backbone and Head of YOLOv5m with the BiFPN architecture were additionally analyzed. We compared the evaluation metrics of the C3NN-BiFPN-C3NN-YOLOv5m models with those of the original YOLOv5m.
Among the C3CBAM-BiFPN-C3NN-YOLOv5m models, the C3CBAM-BiFPN-C3SE-YOLOv5m and C3CBAM-BiFPN-C3CA-YOLOv5m models demonstrated significant improvement. The C3CBAM-BiFPN-C3SE-YOLOv5m achieved F1 of 85.705 (improvement of 1.08%) and mAP.5 of 90.254% (1.58%). Additionally, the C3CBAM-BiFPN-C3CA-YOLOv5m achieved R of 87.535% (5.39%), F1 of 86.527 (2.05%), and mAP.5 of 90.519% (1.87%).
Among the C3SE-BiFPN-C3NN-YOLOv5m models, the C3SE-BiFPN-C3CBAM-YOLOv5m and C3SE-BiFPN-C3SE-YOLOv5m models demonstrated significant improvement. The C3SE-BiFPN-C3CBAM-YOLOv5m model achieved R of 88.833% (improvement of 6.95%), mAP.5 of 90.279% (1.61%), and mAP.5:.95 of 67.861% (1.85%). The R of 88.833% for the C3SE-BiFPN-C3CBAM-YOLOv5m model was the best among all models examined in this study. Additionally, the C3SE-BiFPN-C3SE-YOLOv5m model achieved F1 of 86.546 (improvement of 2.07%), mAP.5 of 90.780% (2.17%), and mAP.5:.95 of 67.840% (1.82%). The mAP.5 of 90.780% for the C3SE-BiFPN-C3SE-YOLOv5m model was the best among all models examined in this study.
Among the C3ECA-BiFPN-C3NN-YOLOv5m models, the C3ECA-BiFPN-C3ECA-YOLOv5m and C3ECA-BiFPN-C3CA-YOLOv5m models demonstrated significant improvement. The C3ECA-BiFPN-C3ECA-YOLOv5m model achieved P of 87.764% (improvement of 1.36%), F1 of 87.482 (3.18%), mAP.5 of 90.401% (1.74%), and mAP.5:.95 of 68.803% (3.26%). The C3ECA-BiFPN-C3ECA-YOLOv5m model had the best P, F1, and mAP.5:.95 values among all the combined models examined in this study. In particular, its P was the only one among the C3NN-BiFPN-C3NN combinations that exceeded that of the original YOLOv5m (86.591%).
Among the C3CA-BiFPN-C3NN-YOLOv5m models, the C3CA-BiFPN-C3SE-YOLOv5m and C3CA-BiFPN-C3CA-YOLOv5m models demonstrated significant improvement. The C3CA-BiFPN-C3SE-YOLOv5m model achieved R of 88.642% (improvement of 6.72%), F1 of 87.074 (2.70%), and mAP.5 of 90.718% (2.10%). Additionally, the C3CA-BiFPN-C3CA-YOLOv5m model achieved F1 of 85.455 (improvement of 0.79%).
Based on these results, the C3ECA-BiFPN-C3ECA-YOLOv5m model was recommended as the most improved model due to its superior performance across multiple metrics. Specifically, it demonstrated the highest P (87.764%), F1 (87.482), and mAP.5:.95 (68.803%) among the proposed combinations, together with a high R (87.201%) and mAP.5 (90.401%), indicating excellent detection accuracy. Among the evaluation metrics, the F1 curves of the proposed models are shown in Figure 5. These curves show that the F1 score gradually increased as the structure of the model became more complex.
These improvements highlighted the model’s effectiveness in reducing both false positives and false negatives, making it highly suitable for tomato leaf disease recognition tasks. This significant enhancement in performance can be attributed to several key factors. Firstly, the model leverages the ECA module, which improves feature representation by capturing channel-wise dependencies without dimensionality reduction. This allows the model to emphasize critical features while suppressing less important ones, leading to better object detection accuracy.
By integrating ECA in both the Backbone and Head, the model effectively utilizes the benefits of channel attention at various phases of feature extraction and classification. This dual application serves to retain and improve crucial features across the network, resulting in increased precision, recall, and overall detection performance. Furthermore, the BiFPN architecture’s ability to robustly fuse multi-scale features greatly enhances the model’s performance. BiFPN is specifically engineered to efficiently merge features at multiple scales by utilizing bidirectional cross-scale connections and trainable weighted fusion. This enhances the model’s capability to efficiently handle objects of different sizes, thus improving its proficiency in detecting and classifying objects in intricate scenarios. The combined use of ECA and BiFPN ensures that the model leverages the advantages of both precise recalibration of channel-wise features and robust fusion of features across several scales. This synergy enables the model to preserve high-quality feature maps, resulting in improved object localization and categorization.
Figure 6 and Figure 7 present comparative visualizations of the evaluation results of the proposed models for single images and a multi-class image, respectively. The 16 single images were evaluated with the proposed models; the results of predictions and bounding boxes are shown in a 4 × 4 format. The single-image predictions show that the prediction accuracy improves when the structure of the model becomes more complex. The 16 single images were combined to create a multi-class image with increasing levels of confusion. The resulting multi-class image was used to evaluate the proposed models. The multi-class image predictions show that the prediction accuracy and bounding box accuracy also improved. These results show that the proposed model works well in more complex environments.
Previous studies using attention modules and feature pyramid networks support these findings, demonstrating their ability to enhance the performance of object detection and recognition models [64,65].

3.4. Comparison of Detection Performance of Proposed Models

The proposed C3ECA-BiFPN-C3ECA-YOLOv5m model was compared with the other models, and the results are shown in Table 2. Compared with Faster-RCNN, it achieved an improvement in P by 7.80%, R by 5.51%, F1 by 6.86%, mAP.5 by 2.90%, and mAP.5:.95 by 4.84%. Compared with YOLOv6m, it achieved an improvement in P by 2.48%, R by 8.93%, F1 by 5.85%, mAP.5 by 4.44%, and mAP.5:.95 by 4.77%. Compared with YOLOv8m, it achieved an improvement in R by 3.63%, F1 by 1.84%, mAP.5 by 1.32%, and mAP.5:.95 by 1.63%. These results revealed that the proposed model demonstrates excellent performance in tomato leaf disease recognition.

3.5. Comparison of Detection Speed and Performance of Proposed Models

Table 3 presents the results of evaluating the detection speed and performance of the C3ECA-BiFPN-C3ECA-YOLOv5m model against the original YOLOv5m, C3ECA-YOLOv5m, H-C3ECA-YOLOv5m, C3ECA-BiFPN-YOLOv5m, and BiFPN-C3ECA-YOLOv5m models.
The C3ECA-BiFPN-C3ECA-YOLOv5m model achieves a FLOPs value of 35.5 G, the lowest among the compared models, which translates to efficient computational resource usage. Its inference time of 2.910 ms, which implies rapid processing capability, is the fastest of the compared models except for the original YOLOv5m. The NMS time is 0.901 ms, the second-longest among the compared models. The C3ECA-BiFPN-C3ECA-YOLOv5m model operates at a respectable 248.26 FPS with a parameter count of 15.946 million, highlighting its efficiency and speed. This makes the C3ECA-BiFPN-C3ECA-YOLOv5m model a highly efficient and fast choice for applications requiring robust object detection without incurring high computational demands.

4. Conclusions

In this study, various models were designed by adding the C3NN modules to the Backbone and Head of YOLOv5m and the BiFPN architecture to the Neck to improve the recognition of tomato leaf diseases. The C3NN modules were created by integrating the soft attention modules (CBAM, SE, ECA, and CA) into the C3 module of the YOLOv5m model. The C3ECA-BiFPN-C3ECA-YOLOv5m model demonstrated the highest performance improvement among the various C3NN-BiFPN-C3NN-YOLOv5m models. The significant performance improvement of the C3ECA-BiFPN-C3ECA-YOLOv5m model can be attributed to the effective use of ECA for channel attention, robust multi-scale feature fusion through BiFPN, and the synergistic effects of combining these techniques. These components work together to enhance feature representation and aggregation, leading to superior object detection performance across various evaluation metrics. In the future, the proposed model will undergo further optimization to improve the performance of the tomato leaf disease recognition model.

Author Contributions

Y.-S.L.: conceptualization, resources, methodology, software, writing—original draft preparation. M.P.P.: formal analysis, validation, writing—review and editing. J.G.K.: data curation, methodology, software. S.S.C.: formal analysis, investigation. Y.B.S.: writing—review and editing, supervision. G.-D.K.: project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1I1A1A01051968).

Data Availability Statement

The data presented in this study are openly available in Tomato Disease Multiple Sources at https://www.kaggle.com/datasets/cookiefinder/tomato-disease-multiple-sources/data, accessed on 30 December 2023, reference number [38].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, X.; Liu, J. Tomato anomalies detection in greenhouse scenarios based on YOLO-Dense. Front. Plant Sci. 2021, 12, 634103. [Google Scholar] [CrossRef]
  2. Wspanialy, P.; Moussa, M. A detection and severity estimation system for generic diseases of tomato greenhouse plants. Comput. Electron. Agric. 2020, 178, 105701. [Google Scholar] [CrossRef]
  3. Saeed, A.; Abdel-Aziz, A.; Mossad, A.; Abdelhamid, M.A.; Alkhaled, A.Y.; Mayhoub, M. Smart Detection of Tomato Leaf Diseases Using Transfer Learning-Based Convolutional Neural Networks. Agriculture 2023, 13, 139. [Google Scholar] [CrossRef]
  4. Ebrahimi, M.; Khoshtaghaza, M.H.; Minaei, S.; Jamshidi, B. Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 2017, 137, 52–58. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Huang, S.; Zhou, G.; Hu, Y.; Li, L. Identification of tomato leaf diseases based on multi-channel automatic orientation recurrent attention network. Comput. Electron. Agric. 2023, 205, 107605. [Google Scholar] [CrossRef]
  6. Astani, M.; Hasheminejad, M.; Vaghefi, M. A diverse ensemble classifier for tomato disease recognition. Comput. Electron. Agric. 2022, 198, 107054. [Google Scholar] [CrossRef]
  7. Ferentinos, K.P. Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 2018, 145, 311–318. [Google Scholar] [CrossRef]
  8. Sirisha, U.; Praveen, S.P.; Srinivasu, P.N.; Barsocchi, P.; Bhoi, A.K. Statistical analysis of design aspects of various YOLO-based deep learning models for object detection. Int. J. Comput. Intell. Sys. 2023, 16, 126. [Google Scholar] [CrossRef]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  10. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  11. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  12. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  13. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D. ultralytics/YOLOv5: v7.0-YOLOv5 Sota Realtime Instance Segmentation; Zenodo: Geneve, Switzerland, 2022. [Google Scholar]
  14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  15. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  16. Ultralytics. YOLOv8 Docs. 2023. Available online: https://docs.ultralytics.com/ (accessed on 3 July 2024).
  17. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  18. Chikkerur, S.; Serre, T.; Tan, C.; Poggio, T. What and where: A Bayesian inference theory of attention. Vis. Res. 2010, 50, 2233–2247. [Google Scholar] [CrossRef]
  19. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  20. Xu, X.; Chen, X.; Liu, C.; Rohrbach, A.; Darrell, T.; Song, D. Fooling vision and language models despite localization and attention mechanism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4951–4961. [Google Scholar]
  21. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  22. Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  23. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  29. Lv, M.; Su, W.H. YOLOV5-CBAM-C3TR: An optimized model based on transformer module and attention mechanism for apple leaf disease detection. Front. Plant Sci. 2024, 14, 1323301. [Google Scholar] [CrossRef]
  30. Appe, S.N.; Arulselvi, G.; Balaji, G. CAM-YOLO: Tomato detection and classification based on improved YOLOv5 using combining attention mechanism. PeerJ Comput. Sci. 2023, 9, e1463. [Google Scholar] [CrossRef]
  31. Qi, J.; Liu, X.; Liu, K.; Xu, F.; Guo, H.; Tian, X.; Li, M.; Bao, Z.; Li, Y. An improved YOLOv5 model based on visual attention mechanism: Application to recognition of tomato virus disease. Comput. Electron. Agric. 2022, 194, 106780. [Google Scholar] [CrossRef]
  32. Chen, Z.; Wu, R.; Lin, Y.; Li, C.; Chen, S.; Yuan, Z.; Chen, S.; Zou, X. Plant disease recognition model based on improved YOLOv5. Agronomy 2022, 12, 365. [Google Scholar] [CrossRef]
  33. Touko Mbouembe, P.L.; Liu, G.; Park, S.; Kim, J.H. Accurate and fast detection of tomatoes based on improved YOLOv5s in natural environments. Front. Plant Sci. 2024, 14, 1292766. [Google Scholar] [CrossRef]
  34. Dong, Q.; Sun, L.; Han, T.; Cai, M.; Gao, C. PestLite: A Novel YOLO-Based Deep Learning Technique for Crop Pest Detection. Agriculture 2024, 14, 228. [Google Scholar] [CrossRef]
  35. Chen, S.; Zou, X.; Zhou, X.; Xiang, Y.; Wu, M. Study on fusion clustering and improved YOLOv5 algorithm based on multiple occlusion of Camellia oleifera fruit. Comput. Electron. Agric. 2023, 206, 107706. [Google Scholar] [CrossRef]
  36. Wang, C.; Han, Q.; Li, J.; Li, C.; Zou, X. YOLO-BLBE: A Novel Model for Identifying Blueberry Fruits with Different Maturities Using the I-MSRCR Method. Agronomy 2024, 14, 658. [Google Scholar] [CrossRef]
  37. Li, S.; Li, K.; Qiao, Y.; Zhang, L. A multi-scale cucumber disease detection method in natural scenes based on YOLOv5. Comput. Electron. Agric. 2022, 202, 107363. [Google Scholar] [CrossRef]
  38. Khan, Q. Tomato Disease Multiple Sources; Kaggle: San Francisco, CA, USA, 2022. [Google Scholar]
  39. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 for small object detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6614–6619. [Google Scholar]
  40. Hui, Y.; You, S.; Hu, X.; Yang, P.; Zhao, J. SEB-YOLO: An Improved YOLOv5 Model for Remote Sensing Small Target Detection. Sensors 2024, 24, 2193. [Google Scholar] [CrossRef] [PubMed]
  41. Sun, L.; Yao, J.; Cao, H.; Chen, H.; Teng, G. Improved YOLOv5 Network for Detection of Peach Blossom Quantity. Agriculture 2024, 14, 126. [Google Scholar] [CrossRef]
  42. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  43. Jing, J.; Li, S.; Qiao, C.; Li, K.; Zhu, X.; Zhang, L. A tomato disease identification method based on leaf image automatic labeling algorithm and improved YOLOv5 model. J. Sci. Food Agric. 2023, 103, 7070–7082. [Google Scholar] [CrossRef] [PubMed]
  44. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  45. Arifando, R.; Eto, S.; Wada, C. Improved YOLOv5-based lightweight object detection algorithm for people with visual impairment to detect buses. Appl. Sci. 2023, 13, 5802. [Google Scholar] [CrossRef]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  47. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  49. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  50. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; pp. 850–855. [Google Scholar]
  51. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  52. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
  53. Wen, C.; Guo, H.; Li, J.; Hou, B.; Huang, Y.; Li, K.; Nong, H.; Long, X.; Lu, Y. Application of improved YOLOv7-based sugarcane stem node recognition algorithm in complex environments. Front. Plant Sci. 2023, 14, 1230517. [Google Scholar] [CrossRef]
  54. Jiang, K.; Xie, T.; Yan, R.; Wen, X.; Li, D.; Jiang, H.; Jiang, N.; Feng, L.; Duan, X.; Wang, J. An attention mechanism-improved YOLOv7 object detection algorithm for hemp duck count estimation. Agriculture 2022, 12, 1659. [Google Scholar] [CrossRef]
  55. Yao, H.; Fan, Y.; Wei, X.; Liu, Y.; Cao, D.; You, Z. Research and optimization of YOLO-based method for automatic pavement defect detection. Electron. Res. Arch. 2024, 32, 1708–1730. [Google Scholar] [CrossRef]
  56. Wang, F.; Jiang, J.; Chen, Y.; Sun, Z.; Tang, Y.; Lai, Q.; Zhu, H. Rapid detection of Yunnan Xiaomila based on lightweight YOLOv7 algorithm. Front. Plant Sci. 2023, 14, 1200144. [Google Scholar] [CrossRef]
  57. Xu, L.; Shi, X.; Tang, Z.; He, Y.; Yang, N.; Ma, W.; Zheng, C.; Chen, H.; Zhou, T.; Huang, P. Asfl-yolox: An adaptive spatial feature fusion and lightweight detection method for insect pests of the papilionidae family. Front. Plant Sci. 2023, 14, 1176300. [Google Scholar] [CrossRef]
  58. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  59. Streiner, D.L.; Norman, G.R. “Precision” and “accuracy”: Two terms that are neither. J. Clin. Epidemiol. 2006, 59, 327–330. [Google Scholar] [CrossRef]
  60. Gillund, G.; Shiffrin, R.M. A retrieval model for both recognition and recall. Psychol. Rev. 1984, 91, 1–67. [Google Scholar] [CrossRef]
  61. He, K.; Lu, Y.; Sclaroff, S. Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 596–605. [Google Scholar]
  62. Henderson, P.; Ferrari, V. End-to-end training of object class detectors for mean average precision. In Proceedings of the Computer Vision—ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part V 13. pp. 198–213. [Google Scholar]
  63. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91. [Google Scholar]
  64. He, L.; Wei, H.; Wang, Q. A new target detection method of ferrography wear particle images based on ECAM-YOLOv5-BiFPN network. Sensors 2023, 23, 6477. [Google Scholar] [CrossRef]
  65. Pengcheng, Y.; Xuyue, K.; Heng, Z.; Fengxiang, C.; Wenchang, W.; Ruokai, W. Recognition and location of coal gangue based on BiFPN and ECA attention mechanism. Int. J. Coal Prep. Util. 2023, 44, 1–13. [Google Scholar] [CrossRef]
Figure 1. The overall structure of the improved YOLOv5 model. The flow added by using BiFPN is shown by the red line.
Figure 2. Structure of the various soft attention modules. (a), convolutional block attention module; (b), squeeze and excitation network; (c), efficient channel attention; (d), coordinate attention.
Figure 3. Structure of the PANet and BiFPN architectures. (a), PANet; (b), BiFPN.
Figure 4. Comparison of the feature extractions from the input images of C3NN modules.
Figure 5. F1 curves of the proposed models after the test process. The curves are color-coded as follows: HN, brown; TV, orange; SM, green; TS, red; TY, purple; LM, light brown; SL, pink; EB, dark gray; PM, chartreuse; LB, light blue; BS, cyan; all classes, thick blue.
Figure 6. Single-image predictions of the proposed models after the test process.
Figure 7. Multi-class image predictions of the proposed models after the test process.
Table 1. Details of the utilized tomato leaf disease dataset.

Disease Name | Abbreviation | Kaggle Train | Kaggle Valid | Input Train | Input Valid | Input Test
healthy | HN | 3051 | 806 | 400 | 50 | 50
Tomato_mosaic_virus | TM | 2153 | 584 | 400 | 50 | 50
Spider_mites Two-spotted_spider_mite | SM | 1747 | 435 | 400 | 50 | 50
Target_Spot | TS | 1827 | 457 | 400 | 50 | 50
Tomato_Yellow_Leaf_Curl_Virus | TY | 2039 | 498 | 400 | 50 | 50
Leaf_Mold | LM | 2754 | 739 | 400 | 50 | 50
Septoria_leaf_spot | SL | 2882 | 746 | 400 | 50 | 50
Early_blight | EB | 2455 | 643 | 400 | 50 | 50
powdery Mildew | PM | 1004 | 252 | 400 | 50 | 50
Late_blight | LB | 3113 | 792 | 400 | 50 | 50
Bacterial_spot | BS | 2826 | 732 | 400 | 50 | 50
Total | - | 25,851 | 6684 | 4400 | 550 | 550
Table 2. Comparison of the detection results of the various proposed models.

Model | Backbone | Neck | Head | P (%) | R (%) | F1 | mAP.5 (%) | mAP.5:.95 (%)
Faster-RCNN | - | - | - | 81.418 | 82.654 | 81.863 | 87.854 | 65.627
YOLOv5m | - | - | - | 86.591 | 83.059 | 84.788 | 88.854 | 66.628
YOLOv6m | - | - | - | 85.283 | 78.272 | 81.635 | 85.966 | 64.032
YOLOv8m | - | - | - | 87.805 | 83.576 | 85.640 | 89.083 | 67.175
YOLOv5m | C3CBAM | - | - | 82.442 | 87.025 | 84.672 | 88.823 | 65.530
YOLOv5m | C3SE | - | - | 84.057 | 87.787 | 85.882 | 89.891 | 66.876
YOLOv5m | C3ECA | - | - | 86.397 | 85.554 | 85.973 | 89.670 | 66.703
YOLOv5m | C3CA | - | - | 84.015 | 86.743 | 85.357 | 89.724 | 67.518
YOLOv5m | - | - | C3CBAM | 86.727 | 86.321 | 86.524 | 89.763 | 66.475
YOLOv5m | - | - | C3SE | 84.035 | 84.943 | 84.487 | 87.551 | 64.638
YOLOv5m | - | - | C3ECA | 84.786 | 85.874 | 85.327 | 88.854 | 66.676
YOLOv5m | - | - | C3CA | 81.232 | 87.500 | 84.250 | 88.487 | 65.698
YOLOv5m | C3CBAM | BiFPN | - | 85.036 | 86.264 | 85.646 | 89.930 | 68.168
YOLOv5m | C3SE | BiFPN | - | 85.741 | 85.250 | 85.495 | 90.025 | 67.331
YOLOv5m | C3ECA | BiFPN | - | 82.246 | 87.978 | 85.015 | 89.002 | 66.969
YOLOv5m | C3CA | BiFPN | - | 85.973 | 84.363 | 85.160 | 90.691 | 68.552
YOLOv5m | - | BiFPN | C3CBAM | 86.117 | 82.084 | 84.052 | 87.623 | 64.339
YOLOv5m | - | BiFPN | C3SE | 83.791 | 86.977 | 85.354 | 89.308 | 67.160
YOLOv5m | - | BiFPN | C3ECA | 83.403 | 82.988 | 83.195 | 86.967 | 64.428
YOLOv5m | - | BiFPN | C3CA | 82.373 | 86.887 | 84.570 | 88.549 | 66.464
YOLOv5m | C3CBAM | BiFPN | C3CBAM | 83.559 | 87.352 | 85.413 | 89.522 | 67.488
YOLOv5m | C3CBAM | BiFPN | C3SE | 84.662 | 86.775 | 85.705 | 90.254 | 68.026
YOLOv5m | C3CBAM | BiFPN | C3ECA | 82.034 | 87.247 | 84.560 | 89.096 | 67.285
YOLOv5m | C3CBAM | BiFPN | C3CA | 85.542 | 87.535 | 86.527 | 90.519 | 67.613
YOLOv5m | C3SE | BiFPN | C3CBAM | 83.392 | 88.833 | 86.027 | 90.279 | 67.861
YOLOv5m | C3SE | BiFPN | C3SE | 85.940 | 87.160 | 86.546 | 90.780 | 67.840
YOLOv5m | C3SE | BiFPN | C3ECA | 83.840 | 85.427 | 84.626 | 88.582 | 66.905
YOLOv5m | C3SE | BiFPN | C3CA | 86.588 | 85.322 | 85.950 | 89.892 | 67.617
YOLOv5m | C3ECA | BiFPN | C3CBAM | 80.176 | 87.045 | 83.469 | 88.112 | 64.912
YOLOv5m | C3ECA | BiFPN | C3SE | 86.415 | 84.864 | 85.632 | 88.966 | 67.400
YOLOv5m | C3ECA | BiFPN | C3ECA | 87.764 | 87.201 | 87.482 | 90.401 | 68.803
YOLOv5m | C3ECA | BiFPN | C3CA | 84.699 | 87.749 | 86.197 | 90.420 | 68.529
YOLOv5m | C3CA | BiFPN | C3CBAM | 84.781 | 86.358 | 85.562 | 90.202 | 66.595
YOLOv5m | C3CA | BiFPN | C3SE | 85.560 | 88.642 | 87.074 | 90.718 | 68.543
YOLOv5m | C3CA | BiFPN | C3ECA | 82.628 | 87.404 | 86.949 | 89.170 | 66.789
YOLOv5m | C3CA | BiFPN | C3CA | 84.596 | 86.332 | 85.455 | 90.505 | 68.425

Values in bold indicate the highest performance for each metric across all models. The symbol ‘-’ indicates the use of the original model without the addition of any module.
Table 3. Detection speed and performance comparison of proposed models.

Model | Backbone | Neck | Head | FLOPs (G) | Inference Time (ms) | NMS Time (ms) | FPS | Parameters (M)
YOLOv5m | - | - | - | 48.0 | 2.876 | 0.861 | 257.40 | 20.893
YOLOv5m | C3ECA | - | - | 39.8 | 2.926 | 0.866 | 252.14 | 18.104
YOLOv5m | - | - | C3ECA | 43.3 | 3.177 | 0.874 | 230.57 | 18.588
YOLOv5m | C3ECA | BiFPN | - | 40.2 | 3.054 | 0.788 | 238.15 | 18.251
YOLOv5m | - | BiFPN | C3ECA | 43.8 | 3.125 | 1.133 | 227.27 | 18.735
YOLOv5m | C3ECA | BiFPN | C3ECA | 35.5 | 2.910 | 0.901 | 248.26 | 15.946
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
