Article

MSPB-YOLO: High-Precision Detection Algorithm of Multi-Site Pepper Blight Disease Based on Improved YOLOv8

by Xiaodong Zheng 1, Zichun Shao 1, Yile Chen 1, Hui Zeng 2 and Junming Chen 1,*
1 Faculty of Humanities and Arts, Macau University of Science and Technology, Macao 999078, China
2 School of Design, Jiangnan University, Wuxi 214122, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(4), 839; https://doi.org/10.3390/agronomy15040839
Submission received: 14 February 2025 / Revised: 19 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025

Abstract

In response to the challenges of low accuracy in traditional pepper blight identification under natural complex conditions, particularly in detecting subtle infections on early-stage leaves, stems, and fruits, this study proposes a multi-site pepper blight disease image recognition algorithm based on YOLOv8, named MSPB-YOLO. This algorithm effectively locates different infection sites on peppers. By incorporating the RVB-EMA module into the model, we can significantly reduce interference from shallow noise in high-resolution depth layers. Additionally, the introduction of the RepGFPN network structure enhances the model’s capability for multi-scale feature fusion, resulting in a marked improvement in multi-target detection accuracy. Furthermore, we optimized CIOU to DIOU by integrating the center distance of bounding boxes into the loss function; as a result, the model achieved an impressive mAP@0.5 score of 96.4%. This represents an enhancement of 2.2% over the original algorithm’s mAP@0.5. Overall, this model provides effective technical support for promoting intelligent management and disease prevention strategies for peppers.

1. Introduction

Chili peppers, as one of the most important economic crops worldwide, are widely used in the production of food, pharmaceuticals, and seasonings. The market demand and economic value of chili peppers have been steadily increasing [1]. Chili peppers serve as an important vegetable and cash crop with extensive cultivation areas and a solid consumer base. However, the growth process of chili peppers is susceptible to various diseases, among which Phytophthora blight poses one of the most severe threats to pepper production. Pepper blight is caused by Phytophthora capsici, which can rapidly multiply and infect the roots, stems, leaves, and fruits of the pepper plant [2], causing the affected plant to wilt, rot, and even die. The disease spreads through complex routes, including soil, water, and air [3]; this transmission mechanism makes prevention and control challenging. With the advent of modern agriculture and large-scale cultivation, managing Phytophthora blight has become increasingly difficult. This is evident with regard to early intervention strategies and precision management techniques, where traditional approaches have begun to show their limitations. Currently, disease management relies mainly on agricultural practices and the application of chemical pesticides. However, due to the highly covert and diverse nature of the disease, especially since infection usually begins at the root without obvious early symptoms, traditional prevention methods often fail to provide timely interventions [4]. Consequently, this adversely affects both yield and quality. Therefore, identifying a technology capable of real-time monitoring along with efficient precision identification to manage pepper blight has emerged as a critical issue facing modern agriculture.
In recent years, the convergence of machine vision and image processing technologies has introduced a new approach to these challenges. Chen et al. [5] proposed a chili pepper pest recognition approach using the HSV color space and convolutional neural networks (CNNs), demonstrating improved precision, recall, and generalization compared with RGB-based models. Bhagat et al. [6] proposed a hybrid feature fusion approach using local binary pattern (LBP) and VGG-16 features combined with a random forest classifier for bell pepper disease detection. However, despite the progress made using this method, the final results did not achieve the required accuracy due to the cumbersome and time-consuming feature extraction process. However, with the rapid advancement of deep learning in the field of computer vision, particularly with respect to convolutional neural networks [7] and improvements in object detection models, new opportunities for the detection of agricultural diseases have been created [8,9,10,11]. Biplob Dey et al. [12] have demonstrated the effectiveness of deep learning models in identifying multiple stress factors in rice, including pest infestations, fungal diseases, and nutrient deficiencies, highlighting the potential of CNN-based approaches for precision agriculture. The YOLO (You Only Look Once) series of object detection algorithms [13,14,15,16,17], recognized as one of the representative models in deep learning, has been widely adopted for image recognition and object detection tasks due to its efficiency, real-time processing capabilities, and accuracy. The core advantage of the YOLO algorithm lies in its ability to process an entire image through a single forward pass, enabling swift identification of multiple objects while providing real-time localization information, particularly suitable for multi-object detection tasks within agricultural disease monitoring. With the release of YOLOv8, further optimizations have been made in both detection accuracy and inference speed, offering more robust technical support for intelligent agricultural disease surveillance. Nevertheless, two-stage object detectors, such as Mask R-CNN [18], Libra R-CNN [19], and Cascade R-CNN [20], among others, have also been explored for certain agricultural scenarios. By first generating region proposals and then refining them through classification and bounding-box regression, these methods can achieve high accuracy but typically incur increased computational complexity and longer inference times. For example, Sharma et al. [21] proposed a Faster R-CNN-based model for the detection and multi-classification of Pepper Leaf Blight Disease (PLBD), achieving detection and classification accuracies of 99.39% and 98.38%, respectively, with an average inference time of 0.23 s per image. Zhang et al. [22] developed a multi-feature fusion Faster R-CNN (MF3 R-CNN) for soybean leaf disease detection in complex scenes, achieving an optimal mean average precision of 83.34% on a real test dataset and demonstrating the effectiveness of synthetic training data. Consequently, although two-stage approaches remain valuable in some specialized contexts, the demand for rapid and large-scale monitoring often positions the YOLO series, especially its YOLOv8 version, as a more practical choice for real-world agricultural disease detection.
The deep learning-based object detection models have demonstrated significant success in the automated identification of plant diseases, providing reliable and scalable solutions for precision agriculture. Various studies have focused on optimizing feature extraction, enhancing small object detection, and improving computational efficiency to achieve high accuracy in disease recognition under complex environmental conditions. For instance, Alhwaiti et al. [23] explored the use of YOLO deep learning models for plant disease identification, demonstrating that YOLOv4 outperforms YOLOv3 with a detection accuracy of 98% and a mean average precision (mAP) of 98%, significantly improving real-time plant disease diagnosis. Bezabh et al. [24] proposed a concatenated CNN model combining VGG16 and AlexNet for pepper disease classification, achieving a testing accuracy of 95.82% and demonstrating its effectiveness in identifying pepper leaf and fruit diseases. Nidhi Kundu et al. [25] proposed a deep learning-based framework for automatic maize disease detection, severity prediction, and crop loss estimation, achieving an accuracy of 98.50% with their customized MaizeNet model. Maurmo et al. [26] leveraged transfer learning and explainable AI to enhance cassava disease detection, achieving 92% accuracy in identifying four major cassava diseases and demonstrating 98% accuracy in detecting Cassava Mosaic Disease. Ankita Gangwar et al. [27] proposed a space- and time-efficient convolution vision transformer model for tomato disease detection, achieving 93.51% accuracy across 13 categories while significantly reducing the model size and training time, making it suitable for deployment on IoT-enabled devices and mobile platforms. Li et al. proposed TPSAO-AMWNet for pepper leaf disease detection, incorporating Adaptive Residual Pyramid Convolution and a Minor Triplet Disease Focus Attention mechanism to enhance micro-feature extraction. The model also introduced a novel optimization strategy to determine the optimal learning rate, achieving a classification accuracy of 93.52% [28]. Similarly, Liu et al. developed a MobileNetv2-YOLOv3 framework for early detection of tomato gray leaf spot disease, enabling real-time monitoring via mobile applications and improving disease prevention strategies [29]. Furthermore, Yang et al. proposed an improved YOLOv8 model for corn leaf spot disease detection, integrating Slim-neck and GAM attention modules to enhance feature representation and localization precision. Their approach significantly outperformed conventional YOLO models, with a precision of 95.18% and a recall of 89.11% [30]. Recent advancements in object detection models, such as YOLOv9 to YOLOv11, have demonstrated significant improvements in various applications, including agriculture inspection [31,32,33,34].
Building on these advancements, recent studies have sought to refine YOLO-based architectures for specific plant species and disease types. Sun et al. [35] employed the YOLOv8 model to evaluate the germination status of maize seeds, achieving a peak accuracy of 0.93 and underscoring the significant potential of the model for the detection of plant growth. To confront the difficulties of recognizing tomatoes in varying stages of ripeness, Gao et al. proposed the YOLOv8n-CA model, which embeds the Coordinate Attention mechanism [36] in the backbone network to improve the extraction and representation of characteristics of tomato fruits. Furthermore, by integrating the structure of C2f-FN in both the backbone and the neck networks, an AP of 97.3% was achieved [37]. In another study aimed at improving the precise detection and monitoring of flax diseases and pest infestations, Zhong et al. [38] extended the YOLOv8 framework with a SimAM attention mechanism, an EIOU loss function, and a BIFPN structure. This configuration not only increased the detection accuracy of flax diseases and pests but also preserved high computational efficiency, offering a promising strategy for rapid, accurate assessments. Likewise, Guo et al. refined Tiny-YOLOv7 by introducing the YOLOv7-TMRTM model for detecting multi-scale rice leaf diseases. Using MobileNetV3 as the backbone to reduce computational complexity and replacing the original ELAN-1 module with RCS-OSA, the approach significantly improved both detection and localization [39]. Duan et al. [40] developed the YOLOv8-GDCI model for the detection of Phytophthora blight on different parts of chili plants, particularly under complex field conditions. Using RepGFPN for feature fusion, Dysample for upsampling, CA attention for feature extraction, and Inner-MPDIoU loss to identify small targets, the model achieved a mean accuracy of 88.9%.
Although YOLOv8 exhibits excellent performance in general object detection tasks, it faces significant limitations when identifying the subtle features of chili pepper diseases in complex field environments, particularly early-stage, pale-brown wet rot patches at the roots and faint pest traces. Currently, variations in illumination, cluttered backgrounds, and the diverse morphologies of various pests and pathogens hinder accurate capture of these fine-scale symptoms, leading to elevated rates of missed detections and false positives. The recall rate for early or inconspicuous infections is especially low. Furthermore, because chili pepper diseases often affect multiple regions, slight variations among different disease types impose higher demands on the model’s generalization and feature discrimination capabilities.
To address these limitations, we propose MSPB-YOLO, a novel approach that enhances multi-scale feature extraction and pyramid-based contextual aggregation for superior pepper blight detection. Unlike previous YOLOv8 modifications, MSPB-YOLO enables robust detection of small, dispersed, and early-stage infections. By effectively capturing cross-scale feature dependencies, MSPB-YOLO surpasses existing models in detecting early-stage and multi-focal disease manifestations, offering a more holistic and precise pepper disease detection framework. The proposed model not only outperforms traditional YOLO-based architectures in detecting Phytophthora capsici infections but also addresses neglected aspects of disease impact across different plant structures.
The key contributions of MSPB-YOLO include (1) the C2f-RVB-EMA block that focuses on critical disease-related features, improving the model’s ability to distinguish patterns in pepper blight; (2) the RepGFPN, which effectively captures multi-scale features for robust detection of objects at various sizes; and (3) the DIoU loss function, which optimizes the localization accuracy of the predicted bounding boxes. These innovations result in higher detection precision, particularly in complex agricultural environments, where challenges such as varying object sizes, cluttered backgrounds, and fine-grained disease patterns are common.

2. Materials and Methods

2.1. Construction of Pepper Blight Dataset

2.1.1. Data Collection

A specialized dataset containing a total of 1053 images has been developed to train and evaluate symptom detection models for pepper blight in different sites. The images were mainly derived from two sources: the iFLYtek 2022 Pepper Pest Image Recognition Competition https://aistudio.baidu.com/datasetdetail/153190 (accessed on 20 December 2024) and on-site photography at farms within Changji Hui Autonomous Prefecture, Xinjiang Uygur Autonomous Region. The second part of the dataset was captured through on-site photography at Changji Farm, where chili plants were photographed under varying cultivation conditions to represent different lighting and growth stages. The field data collection for this study was conducted from June to September 2023. This period was selected based on the growth cycle of chili peppers, covering the seedling, flowering, and fruiting stages, which are critical phases for the high incidence of Phytophthora blight. The image acquisition device used in this study was a Huawei P20 Pro, which is equipped with a Leica triple-camera system that supports optical image stabilization and a multispectral sensor. This setup enables the capture of high-resolution disease features under complex lighting conditions, ensuring data quality. The experimental samples consisted of line pepper, specifically the two main cultivated varieties, ‘Xiangla 17’ and ‘Su Jiao 5’, which were grown in open fields and greenhouses, respectively, to account for different cultivation environments. The distribution of pepper target data is shown in Table 1.
Following the collection of images, a comprehensive annotation procedure was performed utilizing the LabelImg tool. The annotation specifically focused on identifying pepper blight lesions and delineating the infected regions on leaves, fruits, and stems through bounding boxes to accurately pinpoint their locations. All annotated labels were categorized into three classes, namely leaf, fruit, and stem, to facilitate a precise detection of lesions at these distinct anatomical sites.
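As a concrete illustration of the annotation format, the sketch below parses a YOLO-style label line of the kind LabelImg produces in YOLO mode; the class-index order (0 = leaf, 1 = fruit, 2 = stem) and the sample coordinates are assumptions for illustration, not the authors' released labels.

```python
# Hedged sketch of a YOLO-format label line:
# "<class_id> <x_center> <y_center> <width> <height>", all normalized to [0, 1].
# The class-index mapping and the example values below are illustrative assumptions.
CLASSES = ["leaf", "fruit", "stem"]


def parse_label_line(line: str):
    cls_id, xc, yc, w, h = line.split()
    return CLASSES[int(cls_id)], float(xc), float(yc), float(w), float(h)


print(parse_label_line("0 0.512 0.430 0.180 0.220"))   # ('leaf', 0.512, 0.43, 0.18, 0.22)
```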

2.1.2. Image Augmentation

The specific effect of data augmentation is depicted in Figure 1. To improve the model’s robustness and generalization, an online data augmentation strategy was employed during the training phase. Initial transformations, including brightness variation, contrast alteration, and flipping, were applied to the original images, thereby introducing variations in orientation and size that help mitigate overfitting. Subsequently, a mosaic augmentation technique was utilized by randomly cropping four source images and recombining them into a single composite, which further enriched the morphological diversity and contextual complexity of the dataset. These integrated data augmentation methods not only expanded the effective training sample size but also allowed the model to learn from a broader array of lesion representations, ultimately enhancing detection performance across different chili plant parts.
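As a simple illustration of the photometric and geometric transformations mentioned above, the following sketch uses torchvision; the jitter magnitudes are illustrative assumptions, boxes must be transformed together with the images in a detection pipeline, and mosaic augmentation is typically handled inside the detector's dataloader rather than here.

```python
import torchvision.transforms as T

# Illustrative image-level augmentations: brightness/contrast variation and flipping.
# Parameter values are assumptions; bounding boxes must be flipped consistently in practice.
basic_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
])
```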
After data enhancement, the dataset contains 4212 images. The dataset was divided into three subsets: a training set, a validation set, and a testing set, following a ratio of 7:1.5:1.5. Specifically, the training set was 70% of the data, which was used to train the model. The test set constituted 15% of the total data, serving to evaluate the model’s performance and accuracy. Similarly, the validation set also accounted for 15%, which was utilized for hyperparameter tuning and assessing the model’s generalization capabilities.
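The 7:1.5:1.5 split can be reproduced with a few lines of Python; the file names below are placeholders rather than the actual dataset layout.

```python
import random

# Illustrative 70 / 15 / 15 split of the 4212 augmented images (placeholder file names).
images = [f"img_{i:04d}.jpg" for i in range(4212)]
random.seed(0)
random.shuffle(images)

n_train, n_val = int(0.70 * len(images)), int(0.15 * len(images))
train_set = images[:n_train]
val_set = images[n_train:n_train + n_val]
test_set = images[n_train + n_val:]
print(len(train_set), len(val_set), len(test_set))   # 2948 631 633
```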

2.2. Methods

2.2.1. YOLOv8 Model

YOLOv8 was selected as the foundational detection framework due to its balanced trade-off between accuracy and computational efficiency, which makes it especially well suited for real-world scenarios. The structure of the YOLOv8 network is shown in Figure 2. The structure of YOLOv8 comprises an input module, a backbone, a neck, and a head, and each component focuses on a distinct phase of the detection pipeline. The input module pre-processes images and converts them into a format compatible with subsequent processing stages. The backbone, built around feature extraction, replaces the conventional C3 modules with more advanced C2f modules. This substitution results in richer gradient flows, ultimately boosting the capacity to extract discriminative features from input images. The neck component then refines and merges the extracted feature maps, ensuring robust detection across objects of varying scales. Finally, the head transforms these integrated features into detection results by separately handling category and positional information, thus enhancing both accuracy and robustness. Additionally, the anchor-free design adopted by YOLOv8 eliminates the reliance on predefined anchor boxes, allowing faster inference and reduced computational overhead. Among the multiple variants of the model, namely YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, the present work uses the lightweight YOLOv8n to achieve real-time detection of pepper blight in practical agricultural settings.
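As a point of reference, the stock YOLOv8n baseline can be loaded and run through the Ultralytics API as sketched below; the weight file and image path are placeholders, and MSPB-YOLO itself is built from a modified model configuration rather than these stock weights.

```python
from ultralytics import YOLO

# Hedged sketch: running the stock YOLOv8n baseline with the Ultralytics API.
# "pepper_field.jpg" is a placeholder image path, not part of the paper's dataset.
model = YOLO("yolov8n.pt")
results = model.predict("pepper_field.jpg", conf=0.25)
for r in results:
    print(r.boxes.cls, r.boxes.conf, r.boxes.xyxy)   # class ids, confidences, boxes
```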

2.2.2. Backbone Network Optimization RVB-EMA Module

The introduction of the RVB-EMA module [41] combines advanced feature extraction techniques with an optimized attention mechanism that improves the model’s sensitivity to critical features, significantly improving the capabilities of traditional convolutional networks. This approach demonstrates remarkable advantages in the detection of agricultural diseases, such as pepper blight, due to its ability to ensure high precision and enhance the accuracy of multi-site pepper detection.
The RVB-EMA module achieves significant performance improvements by integrating the RepViT Block, a lightweight yet powerful feature extraction module, with the EMA attention mechanism [42]. The RVB and RVB-EMA structure is shown in Figure 3. The RepViT Block combines depthwise convolution and channel interaction to optimize feature extraction; unlike the original RepViT Block, which relies on the SE module for attention [43], the variant adopted here is paired with the EMA attention mechanism. This architecture integrates a multi-pronged topology during the training phase by relocating the 3 × 3 depth-wise (DW) convolution from the base MobileNetV3 module to an earlier stage. Subsequently, it consolidates into a unified-branch configuration during the inference phase. Furthermore, 1 × 1 extended convolutional layers (EConv) and 1 × 1 projection layers (Player) are employed for inter-channel interaction, which significantly enhances feature diversity and expressiveness. The MSPB-YOLO model incorporates a more efficient attention mechanism within its modules. This enhances the ability to capture spatial and contextual information at different scales, making the model more adept at learning features, especially in scenarios commonly encountered in agricultural image analysis, such as variations in object size and partial occlusion.
The EMA attention mechanism immediately follows the RepViT block and plays a crucial role in the RVB-EMA module. The EMA is depicted in Figure 4. The EMA attention mechanism is designed to enhance feature representations by employing a multi-branch attention strategy. Initially, fused features derived from image modalities are divided into g groups of sub-features, which are processed by parallel sub-networks. The mechanism incorporates three attention pathways: two 1 × 1 convolutional branches and one 3 × 3 convolutional branch. The 1 × 1 branches apply one-dimensional global average pooling to encode channel information, followed by a shared 1 × 1 convolutional layer and Sigmoid activation to generate channel attention maps. These maps are multiplied with the original sub-features to emphasize critical features. The 3 × 3 branch captures multi-scale feature representations using 3 × 3 convolutions. Subsequently, the outputs of all branches undergo two-dimensional global average pooling and SoftMax activation to derive spatial attention weights. Finally, spatial attention weights are combined with the original input features via matrix dot product, element-wise addition, and Sigmoid activation to produce the final processed feature map. The mathematical formulation proposed by Liu et al. [44] is adopted in Equations (1)–(3).
To capture global information in channel c, 1-dimensional global average pooling is performed along the height H, as shown in Equation (1):
z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H, i)
Similarly, Equation (2) represents the 1-dimensional global average pooling along the vertical direction at width W:
z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} x_c(j, W)
The two-dimensional global average pooling is given in Equation (3):
z_c = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} x_c(i, j)
Here, c indicates the total number of input channels, while H and W denote the spatial dimensions of the features. i represents the spatial position index of the feature map in the width direction, and j represents the spatial position index of the feature map in the height direction.
This hierarchical process effectively captures both channel-wise and spatial information, enhancing the representation power of the model. Its feature grouping and parallel sub-networks allow the EMA mechanism to efficiently and effectively focus the model on salient regions, reducing interference from noise or irrelevant features. This capability is essential for crop disease detection, in which image quality variations and lighting conditions can often lead to the introduction of noise or misleading features. The RVB-EMA module not only improves accuracy in object detection but also enhances the robustness and generalization of the model in varying imaging conditions and application domains.
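For readers who want a concrete reference point, the following is a simplified PyTorch sketch of an EMA-style attention block along the lines described above (grouped sub-features, two 1 × 1 pooling branches, a 3 × 3 branch, and cross-spatial weighting); the layer sizes and grouping factor are illustrative assumptions, not the exact configuration used in MSPB-YOLO.

```python
import torch
import torch.nn as nn


class EMASketch(nn.Module):
    """Simplified EMA-style attention; an illustrative sketch, not the authors' code."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # Eq. (1): 1-D GAP along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # Eq. (2): 1-D GAP along the height
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)   # shared 1x1 conv for both directions
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(c, c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        g = self.groups
        xg = x.reshape(n * g, c // g, h, w)              # split channels into g sub-feature groups

        # 1x1 branch: directional pooling, shared conv, sigmoid gating of the sub-features
        x_h = self.pool_h(xg)                            # (n*g, c/g, h, 1)
        x_w = self.pool_w(xg).permute(0, 1, 3, 2)        # (n*g, c/g, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(xg * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())

        # 3x3 branch: local multi-scale context
        x2 = self.conv3x3(xg)

        # Cross-spatial weighting: Eq. (3)-style 2-D GAP, softmax, and matrix products
        a1 = torch.softmax(x1.mean(dim=(2, 3)), dim=1).unsqueeze(1)   # (n*g, 1, c/g)
        a2 = torch.softmax(x2.mean(dim=(2, 3)), dim=1).unsqueeze(1)
        y1 = torch.matmul(a1, x2.reshape(n * g, c // g, h * w))
        y2 = torch.matmul(a2, x1.reshape(n * g, c // g, h * w))
        weights = (y1 + y2).reshape(n * g, 1, h, w).sigmoid()
        return (xg * weights).reshape(n, c, h, w)


# Quick shape check with an illustrative feature map
feat = torch.randn(2, 64, 40, 40)
print(EMASketch(64)(feat).shape)   # torch.Size([2, 64, 40, 40])
```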
The C2f-RVB-EMA module builds on the foundational C2f-RVB by introducing the EMA attention mechanism, which dynamically aggregates temporal features across layers. The C2f-RVB and C2f-RVB-EMA architectures are depicted in Figure 5. The design ensures smoother feature transitions and improves robustness to noisy inputs. Specifically, the C2f-RVB-EMA module starts with a convolutional module for initial feature extraction, followed by a split operation to parallelize processing across multiple RepViT blocks. These blocks operate at various receptive fields, enabling the model to effectively capture both global and local contexts. By integrating EMA blocks within this architecture, the model further refines its attention to temporally correlated features, ensuring improved spatial and temporal feature consistency. The final concatenation operation aggregates these refined features before passing them through another convolutional module for downstream tasks. This design not only strengthens feature fusion but also ensures computational efficiency, making it ideal for real-world applications.
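To make the split-process-concatenate pattern concrete, here is a minimal sketch of a C2f-style block in PyTorch; plain 3 × 3 convolutions stand in for the RepViT/EMA blocks, so the channel sizes and block count are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn


class C2fLikeSketch(nn.Module):
    """Minimal C2f-style block: split, refine through stacked blocks, then concatenate."""

    def __init__(self, c_in: int, c_out: int, n_blocks: int = 2):
        super().__init__()
        c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * c, kernel_size=1)
        # Plain convs stand in for the RepViT / RVB-EMA blocks described in the text.
        self.blocks = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1) for _ in range(n_blocks))
        self.cv2 = nn.Conv2d((2 + n_blocks) * c, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = list(self.cv1(x).chunk(2, dim=1))     # split into two parallel branches
        for block in self.blocks:
            y.append(block(y[-1]))                 # each block refines the previous output
        return self.cv2(torch.cat(y, dim=1))       # concatenate all branches and fuse


print(C2fLikeSketch(64, 128)(torch.randn(1, 64, 40, 40)).shape)   # torch.Size([1, 128, 40, 40])
```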

2.2.3. Multi-Scale Feature Fusion Network RepGFPN

The RepGFPN module [45] introduced in this paper aims to enhance multi-scale feature fusion within the network neck. The structure of RepGFPN is illustrated in Figure 6. By improving the interaction between features from various scales and minimizing computational overhead, RepGFPN proves particularly effective for complex agricultural detection scenarios. The RepGFPN module builds upon the traditional feature pyramid network (FPN) architecture [46] and integrates several key innovations to address the limitations associated with the standard FPN framework. A significant challenge with conventional FPNs is that features across different scales share identical channel dimensions, resulting in inefficient feature fusion. In contrast, RepGFPN utilizes a non-shared channel configuration that allows each scale’s feature map to preserve its original spatial information. This design eliminates the need for scaling and resizing operations common in traditional methods. Moreover, RepGFPN employs reparameterization techniques to optimize the network’s multi-branch structure. During training, multiple branches are utilized by the network to capture diverse feature representations from various scales. However, during inference, this multi-branch structure is reparameterized into a single-branch format, significantly reducing computation and memory consumption.
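The reparameterization idea can be illustrated with a small RepVGG-style sketch: two parallel convolution branches used during training are fused into a single 3 × 3 convolution for inference. This is a generic sketch of the principle (without batch-normalization folding), not the exact RepGFPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Fuse a parallel 3x3 + 1x1 branch pair into one 3x3 conv (biases assumed present)."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, kernel_size=3, padding=1)
    # Pad the 1x1 kernel to 3x3 (centred) and add it to the 3x3 kernel; sum the biases.
    fused.weight.data = conv3x3.weight.data + F.pad(conv1x1.weight.data, [1, 1, 1, 1])
    fused.bias.data = conv3x3.bias.data + conv1x1.bias.data
    return fused


x = torch.randn(1, 8, 16, 16)
b3, b1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
y_train = b3(x) + b1(x)               # multi-branch path used during training
y_infer = fuse_branches(b3, b1)(x)    # equivalent single-branch path for inference
print(torch.allclose(y_train, y_infer, atol=1e-5))   # True
```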
In addition to optimizing the feature fusion process, RepGFPN also introduces an enhanced cross-scale feature interaction mechanism. The structures of the CSPStage module and the Rep module are shown in Figure 7. The RepGFPN module accepts additional node inputs through five CSPStage modules, facilitating the fusion of multi-scale features from both adjacent layers and within the same layer. This mechanism improves accuracy without adding extra computational cost, thus enhancing feature reuse and expressive power.

2.2.4. DIOU

In the task of multi-region disease boundary box regression for chili pepper disease detection using MSPB-YOLO, the DIoU loss function is employed [47]. When non-maximum suppression (NMS) is employed to eliminate redundant detection boxes, the criterion for judgment is based on the IoU between a given detection box and the box with the highest predicted score. If the IoU exceeds a predetermined threshold, the corresponding predicted detection box will be discarded. However, in scenarios involving densely distributed targets, occlusion frequently results in significant overlap among detection boxes, leading to an increased incidence of false negatives due to erroneous removals by NMS.
The original loss function in YOLOv8 integrates a classification loss function and a regression loss function. The classification loss employs the VFL loss Equation (4), while the regression loss utilizes the CIoU loss combined with the DFL Equation (5).
VFL(p, q) = \begin{cases} -q \left( q \log(p) + (1 - q) \log(1 - p) \right), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}
The DFL function is expressed as follows:
DFL(S_i, S_{i+1}) = -\left( (y_{i+1} - y) \log(S_i) + (y - y_i) \log(S_{i+1}) \right)
Here, p and q are probabilities, S_i and S_{i+1} denote predicted scores, and α and γ are balancing factors.
To improve the accuracy of boundary box predictions, we introduce the DIoU loss function during the model training process. DIoU is an extension of the traditional IoU, which incorporates the geometric distance between the centers of the predicted and ground truth boxes. This method provides effective gradient signals even in the case of non-overlapping boxes, thereby accelerating model convergence. DIoU is computationally simple, requiring no complex aspect ratio adjustments, which improves training efficiency. In the DIoU calculation, b and b^gt represent the coordinates of the centroids of the predicted and ground-truth bounding boxes, respectively. The term ρ²(b, b^gt) denotes the squared Euclidean distance between these centroids, and c is the diagonal length of the smallest enclosing rectangle that encompasses both bounding boxes. The formula is presented as follows:
DIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2}
IoU = \frac{|A \cap B|}{|A \cup B|}
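A compact sketch of this computation is given below, assuming boxes in (x1, y1, x2, y2) format; the corresponding regression loss would then be 1 - DIoU.

```python
import torch


def diou(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Minimal DIoU sketch for boxes in (x1, y1, x2, y2) format."""
    # IoU term
    lt = torch.max(pred[..., :2], gt[..., :2])
    rb = torch.min(pred[..., 2:], gt[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)
    # rho^2: squared distance between the box centres
    centre_p = (pred[..., :2] + pred[..., 2:]) / 2
    centre_g = (gt[..., :2] + gt[..., 2:]) / 2
    rho2 = ((centre_p - centre_g) ** 2).sum(dim=-1)
    # c^2: squared diagonal of the smallest enclosing rectangle
    enc_lt = torch.min(pred[..., :2], gt[..., :2])
    enc_rb = torch.max(pred[..., 2:], gt[..., 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=-1) + eps
    return iou - rho2 / c2


print(diou(torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
           torch.tensor([[20.0, 20.0, 60.0, 60.0]])))   # roughly 0.35
```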

2.2.5. MSPB-YOLO

The MSPB-YOLO model demonstrates significant performance advantages in complex object detection tasks. The structure of the MSPB-YOLO network is shown in Figure 8. We introduce MSPB-YOLO, which integrates the C2f-RVB-EMA module to significantly enhance feature representation and improve detection accuracy. The C2f-RVB-EMA module builds on the C2f-RVB design by integrating an EMA mechanism to improve feature fusion and robustness.
The MSPB-YOLO architecture consists of three main sections: the backbone network, the neck, and the detection head. The backbone network employs alternating C2f-RVB and C2f-RVB-EMA modules to effectively capture global semantic and local detail features through multi-scale feature fusion and lightweight design. The introduction of the SPPF module maximizes the multiresolution information aggregation while keeping the computation cost low. Neck networks are developed with CSPStage and feature concatenation operations to strengthen the interaction of shallow features with deep features for better adaptability to multi-scale object detection. The detection head, optimized with the components producing bounding boxes and class confidence scores, gives efficient bounding box prediction at a balance between real-time inference and accuracy.

3. Results

3.1. The Indicators of Evaluation

To rigorously assess the performance of the proposed pepper blight disease detection model, four key evaluation metrics were utilized: precision (P), recall (R), average precision (AP), and mean average precision (mAP). Precision measures the proportion of correctly identified disease instances out of all instances predicted by the model, indicating the accuracy of positive detections. A high precision value means that the model generates few false positives, which is essential to reduce unnecessary interventions in agricultural management. Recall, on the other hand, quantifies the model’s ability to identify all relevant disease instances within the dataset. A high recall rate ensures that most actual disease cases are detected, reducing the risk of missing infected plants. Average precision provides a comprehensive measure by averaging precision between different recall levels for each class, leaf, stem, and fruit, thus offering a balanced evaluation of the model’s detection capabilities. The mean Average Precision aggregates these AP scores across all classes, providing an overall performance indicator that reflects the effectiveness of the model in multiclass disease detection. These metrics collectively ensure a thorough evaluation of the model’s accuracy, reliability, and generalizability in real-world agricultural settings. The formula is presented as follows:
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} p(r) \, dr
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i
In this scenario, TP represents the count of positive samples that are accurately recognized. False positive (FP) indicates the number of negative samples wrongly classified as positive, while FN stands for the number of positive samples misidentified as negative. AP is calculated as the area beneath the precision–recall curve. Here, n represents the number of categories. mAP is obtained by averaging the AP values across various categories.
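The sketch below puts these definitions into code with toy numbers; the counts and the precision–recall curve are illustrative, not results from this study.

```python
import numpy as np

# Toy example of the metrics above (illustrative numbers only).
tp, fp, fn = 90, 6, 10
precision = tp / (tp + fp)                 # P = TP / (TP + FP)
recall = tp / (tp + fn)                    # R = TP / (TP + FN)

recall_pts = np.linspace(0.0, 1.0, 101)
precision_curve = np.clip(1.0 - 0.3 * recall_pts, 0.0, 1.0)         # placeholder p(r)
# AP as the trapezoidal area under the precision-recall curve
ap = np.sum((precision_curve[1:] + precision_curve[:-1]) / 2 * np.diff(recall_pts))

ap_per_class = {"leaf": 0.97, "stem": 0.95, "fruit": 0.96}          # illustrative AP values
map_50 = sum(ap_per_class.values()) / len(ap_per_class)             # mean over the n classes
print(precision, recall, ap, map_50)
```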
The number of parameters (Params) and the number of floating-point operations (GFLOPs) are used as metrics to evaluate the complexity of the model. Params refer to the total number of trainable parameters in the model, which is typically used to measure storage requirements. GFLOPs quantify the computational complexity of the model, representing the number of billions of floating-point operations required for a single forward pass.

3.2. Experimental Parameter Configuration

All experimental environments are summarized in Table 2. This table includes details such as processor, GPU type, memory capacity, operating system, and relevant software versions.
In this unified setup, the MSPB-YOLO model was trained for 400 epochs with a learning rate of 0.01, a momentum coefficient of 0.937, and a weight decay coefficient of 0.0005. A batch size of 24 was employed, and eight worker threads were utilized for data loading. The AdamW optimizer was selected to facilitate stable convergence and to handle large-scale image data efficiently.
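For reference, these hyperparameters map onto the Ultralytics training API roughly as follows; the dataset file name and the base configuration are placeholders, since MSPB-YOLO uses a modified architecture rather than the stock YOLOv8n definition.

```python
from ultralytics import YOLO

# Hedged sketch of the training configuration described above; "pepper_blight.yaml"
# is a placeholder dataset definition, not the authors' released file.
model = YOLO("yolov8n.yaml")   # stock YOLOv8n config standing in for the modified MSPB-YOLO
model.train(
    data="pepper_blight.yaml",
    epochs=400,
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    batch=24,
    workers=8,
    optimizer="AdamW",
)
```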
The experimental results in Table 3 compare the performance of several object detection models, including Faster-RCNN, Cascade-RCNN, YOLOv5n, YOLOv7n-Tiny, YOLOv10n, and the proposed model, based on parameters, GFLOPs, and mAP@0.5.
Faster-RCNN and Cascade-RCNN achieved competitive precision, with rates of 80.7% and 82.8%, respectively. However, these models required significant computational resources, with 41.39 million and 69.29 million parameters. In contrast, the YOLOv5n, YOLOv7n-Tiny, and YOLOv10n models achieved high accuracy, reaching 87.7%, 86.6%, and 93.8%, respectively, while maintaining a parameter count of 2.5 million, 6.0 million, and 2.7 million. This makes them more suitable for environments with limited computational resources. The MSPB-YOLO outperformed all others with a mAP@0.5 of 96.4%, maintaining a compact architecture with only 2.9 M parameters and 7.3 GFLOPs. This combination of high accuracy and low computational cost makes it ideal for real-world detection tasks, particularly in agricultural disease detection.
The ablation experiments presented in Table 4 highlight the contributions of various components, including RVB-EMA, GFPN, and DIoU, to the model performance. The results show that each component enhances the accuracy of the model progressively.
In Experiment 1, where no attention mechanism is applied, the model achieves a mAP@0.5 of 94.2%. This performance level reflects the basic capacity of the network without any enhancement in feature fusion or spatial attention. The model’s ability to localize and detect objects is constrained by its inability to adaptively focus on key regions of interest.
Experiment 2 indicates that incorporating the RVB module reduces the number of parameters from 3.0 M to 2.3 M and GFLOPs from 8.1 to 6.4. At the same time, P, R, and mAP@0.5 all improve, with P increasing to 92.7%, R to 90%, and mAP@0.5 to 94.4%. This demonstrates that the RVB module effectively reduces computational complexity while enhancing detection accuracy, contributing to a more efficient model.
In Experiment 3, the RVB-EMA module is introduced. This module, which combines the RepViT Block with the EMA attention mechanism, allows for more effective feature selection and smoother attention dynamics. The introduction of RVB-EMA results in a slight improvement in mAP@0.5, increasing the score to 94.9%. The RVB-EMA mechanism helps the model better capture important spatial features while suppressing irrelevant information, leading to improved accuracy in detection tasks, especially in complex environments.
In Experiment 4, the GFPN module is added in conjunction with RVB-EMA. The GFPN further enhances the feature fusion process by addressing multi-scale information, which is particularly beneficial for detecting objects of varying sizes. This combination increases mAP@0.5 to 95.9%, demonstrating that both RVB-EMA and GFPN complement each other in improving model robustness and feature utilization across different scales.
Finally, in Experiment 5, the DIoU loss function is incorporated alongside the previous components. DIoU improves the geometric accuracy of a model by highlighting the distance between the centers of the predicted and ground-truth bounding boxes, which is critical for accurate localization. The result is a mAP@0.5 of 96.4%, the highest among all experiments. This improvement highlights the importance of combining advanced attention mechanisms with loss functions that optimize spatial and geometric alignment for more precise object detection.
The contrastive experimental results of different attention mechanisms in Table 5 focus on the combination of YOLOv8 and REPGFPN networks. In Experiment 1, without any additional attention mechanisms, the YOLOv8 model achieves a mAP@0.5 of 95.8%. This demonstrates the strong baseline performance of YOLOv8 while also highlighting the potential for further enhancement through the integration of specialized attention modules, which could help improve the model’s performance in more complex scenarios.
In Experiment 2, when the MLCA attention mechanism [48] is introduced, the model achieves 93.5% precision but shows a notable decrease in recall to 89.9% and mAP@0.5 to 95.2%. This indicates that while MLCA improves focus on critical image areas, it may also lead to missed detections due to a narrower attention focus. In Experiment 3, the introduction of the SimAM attention mechanism [49] leads to improvements in precision at 94.4% but a decrease in recall at 90.5%, with a slight decrease in mAP@0.5 to 95.3%. Experiment 4 incorporates the SE attention mechanism, which improves channel-wise feature recalibration but causes a slight decrease in mAP@0.5 to 94.5%, suggesting that channel-wise recalibration alone may not always enhance overall performance. In Experiment 5, the introduction of the ELA attention mechanism [50] maintains precision at 92.7%, but recall drops to 89.3%, resulting in a mAP@0.5 of 94.6%. While ELA focuses on local spatial information, it reduces recall by limiting the model’s ability to capture global context. In Experiment 6, we evaluate the performance of the EMA attention mechanism, which enhances precision to 94.1% but reduces recall to 91.0%, achieving the highest mAP@0.5 of 95.9%. The EMA mechanism improves the stability and robustness of attention, allowing the model to focus on significant features while disregarding irrelevant ones. This results in better detection across various image conditions, establishing the EMA attention mechanism as the most effective approach within this set of experiments.
The experimental results of contrastive loss functions are shown in Table 6.
The CIoU loss function achieved a precision of 94.1%, a recall of 91%, and mAP@0.5 of 95.9%. Although it incorporates center distance, aspect ratio, and overlap, CIoU faces certain limitations in handling complex spatial relationships. For instance, it struggles to address challenges like rotation and relative positioning between bounding boxes, which can hinder its performance in complex scenarios. The GIoU loss function [51] demonstrated comparable performance, with a precision of 93.3%, a recall of 91.3%, and a mAP@0.5 of 95.8%. By introducing the concept of the minimum enclosing box, GIoU provides a more accurate measure of geometric overlap. This improvement effectively reduces the impact of misalignment in the bounding boxes, making it particularly advantageous for tasks that require precise spatial localization. The EIoU loss [52], with a precision of 86.5%, a recall of 83.3%, and a mAP@0.5 of 89.7%, underperformed compared with the other losses. Although EIoU improves on IoU-based metrics by factoring in aspect ratio, it lacks the ability to model complex spatial relationships, especially when dealing with rotated or occluded objects, which reduces its effectiveness in detection tasks. The SIoU loss [53], which incorporates spatial and geometric considerations, achieved a precision of 94.4%, a recall of 91.1%, and mAP@0.5 of 95.8%. The inclusion of rotational angle handling gives SIoU an edge in scenarios where object orientation is important, such as in detecting agricultural objects where rotation and scale variation are frequent. The DIoU loss outperformed all other loss functions, with precision at 93.1%, recall at 92.6%, and mAP@0.5 at 96.4%. DIoU refines the IoU loss by introducing a distance-based penalty that focuses on reducing the center distance between predicted and ground-truth boxes. Unlike CIoU or GIoU, DIoU directly optimizes the distance between the box centers, leading to more accurate localization. The additional benefit of DIoU is its robustness in handling various bounding-box distortions and occlusions, which is crucial for object detection tasks with complex geometric variations, such as those in agricultural imagery.

3.3. Training Curve

The training and validation loss curves exhibit a smooth and consistent downward trend, indicating stable model convergence without signs of divergence, as shown in Figure 9. Notably, after 400 epochs, both the classification loss and regression loss reach a plateau, suggesting that the model has sufficiently learned the feature representations. The precision, recall, and mAP metrics continue to improve and stabilize, demonstrating the model’s strong generalization capability. Moreover, no evident overfitting is observed, as the validation performance aligns closely with the training performance.

3.4. Visualization

To demonstrate the superiority of the MSPB-YOLO model, a comparative visualization of the detection results is presented in Figure 10. The results are juxtaposed with the ground truth and baseline model predictions to provide a comprehensive analysis.
From the visualization, it is evident that MSPB-YOLO consistently achieves superior detection accuracy and robustness across a variety of environmental conditions. For example, in scenario (a), where disease spots are distributed over complex foliage backgrounds, MSPB-YOLO demonstrates exceptional performance in accurately localizing and identifying infections. In contrast to the baseline model, which exhibits a higher incidence of missed detections and false positives, MSPB-YOLO reliably identifies and delineates the infection regions. In scenario (b), involving the disease spot on fruiting bodies, MSPB-YOLO excels by generating more precise bounding boxes with minimal overlap, thus reducing ambiguity in high-density areas. The baseline model tends to produce imprecise bounding boxes, often merging nearby objects, which impairs the clarity of detection. Conversely, MSPB-YOLO’s multi-scale prediction and DIoU loss function effectively separate individual objects, leading to more reliable and accurate outcomes. Finally, in scenario (c), MSPB-YOLO showcases its remarkable capability in addressing occlusion challenges, a critical factor for agricultural applications where overlapping leaves and fruits frequently obstruct clear views. By employing multi-scale feature fusion and an optimized backbone structure, MSPB-YOLO adeptly handles such complexities, delivering accurate detections where the baseline model falters. MSPB-YOLO maintains robustness by capturing finer details and mitigating errors in overlapping regions, highlighting its practicality in real-world scenarios. Overall, the visual comparison underscores MSPB-YOLO’s superiority in terms of precision, robustness, and adaptability, making it an exceptionally effective tool for pepper blight detection in complex agricultural environments.

4. Discussion

This study presents MSPB-YOLO, an optimized deep-learning model designed for the accurate detection of multi-site pepper blight disease in complex agricultural environments. By integrating the RVB-EMA module, RepGFPN multi-scale feature fusion, and DIoU loss function, the model achieves superior feature representation and spatial attention, enhancing its ability to detect disease symptoms under varying conditions. The experimental results demonstrate the effectiveness of MSPB-YOLO in real-world agricultural scenarios. Compared with baseline models, MSPB-YOLO achieves significant improvements in precision and recall, indicating enhanced robustness in detecting disease-affected regions despite challenges such as occlusions and illumination. The advanced multi-scale feature fusion mechanism enables the model to capture fine-grained details of disease symptoms, ensuring high detection accuracy across different growth stages of pepper plants. In the meantime, the high accuracy of MSPB-YOLO suggests strong potential for practical deployment in precision agriculture. The model can be integrated into automated disease monitoring systems to provide early warnings and support intelligent decision-making for crop management.
However, despite the promising performance of MSPB-YOLO in detecting diseases in pepper plants, the model has several limitations that need to be addressed for broader applicability. First, while it improves accuracy, its computational complexity could limit real-time deployment on low-power edge devices such as drones or smartphones. Second, the model is trained on static images, which may not fully capture the temporal dynamics of disease progression. Addressing these challenges requires further optimization in both algorithm efficiency and dataset expansion. The practical implications of this study are significant. The ability to detect early-stage symptoms across different parts of the pepper plant (leaves, stems, and fruits) enhances early intervention strategies in precision agriculture. By reducing the dependency on manual inspection and chemical pesticide use, MSPB-YOLO contributes to sustainable disease management practices. A further key limitation is its limited ability to generalize across different crop species. Trained on a specific set of pepper varieties, the model’s performance on other crops with differing morphological traits and disease patterns remains untested, necessitating further fine-tuning or dataset expansion. Another challenge is detecting rare or other pepper diseases, as the model is primarily trained on pepper blight diseases and lacks sufficient labeled data for rare conditions, hindering early-stage recognition. Additionally, the model’s robustness is influenced by environmental factors such as humidity and temperature. However, the current dataset does not capture a wide range of environmental variations, limiting the model’s ability to perform consistently under diverse field conditions.
More broadly, the application of AI in agriculture faces data accessibility constraints, computational resource demands, and integration challenges with existing farming practices [54]. Future research should focus on expanding datasets to include diverse crop species and environmental conditions, optimizing the model for real-time deployment on edge devices, and enhancing its interpretability for practical agricultural applications. Addressing these challenges will be crucial for improving the reliability and scalability of AI-driven disease detection systems in precision agriculture.

5. Conclusions

In this study, we proposed MSPB-YOLO, an advanced detection model that integrates the RVB-EMA attention mechanism, RepGFPN multi-scale feature fusion, and DIoU loss function to enhance feature integration, spatial attention, and geometric localization. Experimental results indicate that MSPB-YOLO achieves a mAP@0.5 of 96.4%, outperforming previous models by 2.2% and demonstrating its robustness in detecting multi-site pepper blight under diverse agricultural conditions. The key contributions of this work include the following: (1) Enhanced detection accuracy, particularly in challenging environments with occlusions and variable lighting; (2) Improved feature fusion and spatial attention, allowing precise localization of disease sites; (3) Effective early detection of blight symptoms on leaves, stems, and fruits, which is critical for real-time disease management. Future research will focus on extending MSPB-YOLO’s applicability to other agricultural diseases and pests. Further improvements may include incorporating temporal information through video-based analysis, optimizing the model for deployment on edge devices, and enhancing robustness under different environmental conditions.

Author Contributions

Conceptualization, X.Z. and J.C.; Methodology, J.C.; Software, J.C.; Validation, X.Z. and J.C.; Formal analysis, J.C.; Investigation, J.C.; Resources, X.Z.; Data curation, X.Z. and J.C.; Writing—original draft, X.Z., Z.S., Y.C., H.Z. and J.C.; Writing—review and editing, X.Z., Z.S., Y.C., H.Z. and J.C.; Visualization, J.C.; Supervision, X.Z. and J.C.; Project administration, X.Z.; Funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from the Second Batch of the Industry-University Cooperation and Collaborative Education Project of the Ministry of Education in 2022 (Grant No. 220900275195015) and the Young and Middle-aged Backbone Teacher Training Project of Nantong Institute of Technology (Grant No. ZQNGGJS202120).

Data Availability Statement

The data presented in this study are openly available in Iflytek Co., Ltd. at https://aistudio.baidu.com/datasetdetail/153190 (accessed on 20 December 2024), reference number GPL 2.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Idrees, S.; Hanif, M.A.; Ayub, M.A.; Hanif, A.; Ansari, T.M. Chapter 9—Chili Pepper. In Medicinal Plants of South Asia; Hanif, M.A., Nawaz, H., Khan, M.M., Byrne, H.J., Eds.; Elsevier: Amsterdam, The Netherlands, 2020; pp. 113–124. [Google Scholar] [CrossRef]
  2. Moreira-Morrillo, A.A.; Monteros-Altamirano, Á.; Reis, A.; Garcés-Fiallos, F.R. Phytophthora capsici on Capsicum Plants: A Destructive Pathogen in Chili and Pepper Crops. In Capsicum; Yllano, O.B., Ed.; IntechOpen: Rijeka, Croatia, 2022; Chapter 4. [Google Scholar] [CrossRef]
  3. Gevens, A.; Donahoo, R.; Lamour, K.; Hausbeck, M. Characterization of Phytophthora capsici from Michigan surface irrigation water. Phytopathology 2007, 97, 421–428. [Google Scholar] [CrossRef] [PubMed]
  4. Granke, L.L.; Quesada-Ocampo, L.; Lamour, K.; Hausbeck, M.K. Advances in research on Phytophthora capsici on vegetable crops in the United States. Plant Dis. 2012, 96, 1588–1600. [Google Scholar] [PubMed]
  5. Chen, W.; Gao, H.; Ding, D.; Dong, X.; Luo, X. Chili pepper pests recognition based on HSV color space and convolutional neural networks. In Proceedings of the 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 26–28 May 2023; pp. 241–245. [Google Scholar]
  6. Bhagat, M.; Kumar, D.; Kumar, S. Bell pepper leaf disease classification with LBP and VGG-16 based fused features and RF classifier. Int. J. Inf. Technol. 2023, 15, 465–475. [Google Scholar] [CrossRef]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
  8. Dai, G.; Hu, L.; Fan, J. DA-ActNN-YOLOV5: Hybrid YOLO v5 Model with Data Augmentation and Activation of Compression Mechanism for Potato Disease Identification. Comput. Intell. Neurosci. 2022, 2022, 6114061. [Google Scholar]
  9. Srivastava, A.; Rawat, B.S.; Bajpai, P.; Dhondiyal, S.A. Potato Leaf Disease Detection Method Based on the YOLO Model. In Proceedings of the 2024 4th International Conference on Data Engineering and Communication Systems (ICDECS), Bangalore, India, 15–16 March 2024; pp. 1–5. [Google Scholar]
  10. Huang, Y.; Zhong, Y.; Zhong, D.; Yang, C.; Wei, L.; Zou, Z.; Chen, R. Pepper-YOLO: An lightweight model for green pepper detection and picking point localization in complex environments. Front. Plant Sci. 2024, 15, 1508258. [Google Scholar]
  11. Mathew, M.P.; Mahesh, T.Y. Leaf-based disease detection in bell pepper plant using YOLO v5. Signal Image Video Process. 2022, 16, 123–130. [Google Scholar]
  12. Dey, B.; Haque, M.M.U.; Khatun, R.; Ahmed, R. Comparative performance of four CNN-based deep learning variants in detecting Hispa pest, two fungal diseases, and NPK deficiency symptoms of rice (Oryza sativa). Comput. Electron. Agric. 2022, 202, 107340. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  15. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D.; et al. ultralytics/yolov5: v6.2 - YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai Integrations. Zenodo 2022. [Google Scholar] [CrossRef]
  18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  19. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  20. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  21. Sharma, R.; Kukreja, V.; Bordoloi, D. Deep learning meets agriculture: A faster RCNN based approach to pepper leaf blight disease detection and multi-classification. In Proceedings of the 2023 4th International Conference for Emerging Technology (INCET), Belgaum, India, 26–28 May 2023; pp. 1–5. [Google Scholar]
  22. Zhang, K.; Wu, Q.; Chen, Y. Detecting soybean leaf disease from synthetic image using multi-feature fusion faster R-CNN. Comput. Electron. Agric. 2021, 183, 106064. [Google Scholar] [CrossRef]
  23. Alhwaiti, Y.; Khan, M.; Asim, M.; Siddiqi, M.H.; Ishaq, M.; Alruwaili, M. Leveraging YOLO deep learning models to enhance plant disease identification. Sci. Rep. 2025, 15, 7969. [Google Scholar] [CrossRef] [PubMed]
  24. Bezabh, Y.A.; Salau, A.O.; Abuhayi, B.M.; Mussa, A.A.; Ayalew, A.M. CPD-CCNN: Classification of pepper disease using a concatenation of convolutional neural network models. Sci. Rep. 2023, 13, 15581. [Google Scholar] [CrossRef] [PubMed]
  25. Kundu, N.; Rani, G.; Dhaka, V.S.; Gupta, K.; Nayaka, S.C.; Vocaturo, E.; Zumpano, E. Disease detection, severity prediction, and crop loss estimation in MaizeCrop using deep learning. Artif. Intell. Agric. 2022, 6, 276–291. [Google Scholar] [CrossRef]
  26. Maurmo, D.; Gagliardi, M.; Ruga, T.; Zumpano, E.; Vocaturo, E. Boosting Agricultural Diagnostics: Cassava Disease Detection with Transfer Learning and Explainable AI. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Los Alamitos, CA, USA, 15–18 December 2024; pp. 4702–4710. [Google Scholar] [CrossRef]
  27. Gangwar, A.; Dhaka, V.S.; Rani, G.; Khandelwal, S.; Zumpano, E.; Vocaturo, E. Time and Space Efficient Multi-Model Convolution Vision Transformer for Tomato Disease Detection from Leaf Images with Varied Backgrounds. Comput. Mater. Contin. 2024, 79, 117–142. [Google Scholar] [CrossRef]
  28. Wan, L.; Zhu, W.; Dai, Y.; Zhou, G.; Chen, G.; Jiang, Y.; Zhu, M.; He, M. Identification of Pepper Leaf Diseases Based on TPSAO-AMWNet. Plants 2024, 13, 1581. [Google Scholar] [CrossRef]
  29. Liu, J.; Wang, X. Early recognition of tomato gray leaf spot disease based on MobileNetv2-YOLOv3 model. Plant Methods 2020, 16, 123–130. [Google Scholar] [CrossRef]
  30. Yang, S.; Yao, J.; Teng, G. Corn leaf spot disease recognition based on improved YOLOv8. Agriculture 2024, 14, 666. [Google Scholar] [CrossRef]
  31. Huang, Y.; Huang, H.; Qin, F.; Chen, Y.; Zou, J.; Liu, B.; Li, Z.; Liu, C.; Wan, F.; Qian, W.; et al. YOLO-IAPs: A Rapid Detection Method for Invasive Alien Plants in the Wild Based on Improved YOLOv9. Agriculture 2024, 14, 2201. [Google Scholar] [CrossRef]
  32. Huang, Y.; Liu, Z.; Zhao, H.; Tang, C.; Liu, B.; Li, Z.; Wan, F.; Qian, W.; Qiao, X. YOLO-YSTs: An Improved YOLOv10n-Based Method for Real-Time Field Pest Detection. Agronomy 2025, 15, 575. [Google Scholar] [CrossRef]
  33. Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus Detection Algorithm Based on YOLOv11. Agronomy 2025, 15, 687. [Google Scholar] [CrossRef]
  34. Zhang, C.; Yun, L.; Yang, C.; Chen, Z.; Cheng, F. LRNTRM-YOLO: Research on Real-Time Recognition of Non-Tobacco-Related Materials. Agronomy 2025, 15, 489. [Google Scholar] [CrossRef]
  35. Sun, W.; Xu, M.; Xu, K.; Chen, D.; Wang, J.; Yang, R.; Chen, Q.; Yang, S. CSGD-YOLO: A Corn Seed Germination Status Detection Model Based on YOLOv8n. Agronomy 2025, 15, 128. [Google Scholar] [CrossRef]
  36. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  37. Gao, X.; Ding, J.; Zhang, R.; Xi, X. YOLOv8n-CA: Improved YOLOv8n Model for Tomato Fruit Recognition at Different Stages of Ripeness. Agronomy 2025, 15, 188. [Google Scholar] [CrossRef]
  38. Zhong, M.; Li, Y.; Gao, Y. Research on Small-Target Detection of Flax Pests and Diseases in Natural Environment by Integrating Similarity-Aware Activation Module and Bidirectional Feature Pyramid Network Module Features. Agronomy 2025, 15, 187. [Google Scholar] [CrossRef]
  39. Guo, F.; Li, J.; Liu, X.; Chen, S.; Zhang, H.; Cao, Y.; Wei, S. Improved YOLOv7-Tiny for the Detection of Common Rice Leaf Diseases in Smart Agriculture. Agronomy 2024, 14, 2796. [Google Scholar] [CrossRef]
  40. Duan, Y.; Han, W.; Guo, P.; Wei, X. YOLOv8-GDCI: Research on the Phytophthora Blight Detection Method of Different Parts of Chili Based on Improved YOLOv8 Model. Agronomy 2024, 14, 2734. [Google Scholar] [CrossRef]
  41. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15909–15920. [Google Scholar]
  42. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  44. Liu, Y.; Wang, Q.; Zheng, Q.; Liu, Y. YOLO-Wheat: A More Accurate Real-Time Detection Algorithm for Wheat Pests. Agriculture 2024, 14, 2244. [Google Scholar] [CrossRef]
  45. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  47. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  48. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar]
  49. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  50. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar]
  51. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  52. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar]
  53. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  54. Vocaturo, E.; Rani, G.; Dhaka, V.S.; Zumpano, E. AI-driven agriculture: Opportunities and challenges. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 3530–3537. [Google Scholar]
Figure 1. Data augmentation examples on pepper images: (a) original image, (b) brightness variation, (c) image flipping, and (d) contrast alteration.
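For readers who wish to reproduce augmentations of the kind illustrated in Figure 1, the following is a minimal sketch using torchvision transforms; the probability, jitter ranges, and file names are illustrative assumptions, not the exact settings used in this study.

```python
# Minimal sketch of the augmentation types shown in Figure 1:
# (b) brightness variation, (c) image flipping, (d) contrast alteration.
# The parameter values and file names below are illustrative assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),   # (b) brightness variation
    transforms.RandomHorizontalFlip(p=0.5),   # (c) image flipping
    transforms.ColorJitter(contrast=0.3),     # (d) contrast alteration
])

image = Image.open("pepper_sample.jpg")       # hypothetical input image
augmented = augment(image)
augmented.save("pepper_sample_aug.jpg")
```

For detection training, a horizontal flip must of course also be applied to the bounding-box labels; the sketch above only illustrates the image-level operations.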
Figure 2. The structure of YOLOv8.
Figure 3. The structures of RVB and RVB-EMA.
Figure 4. The structure of the EMA module.
Figure 5. The structures of C2f-RVB and C2f-RVB-EMA.
Figure 6. The structure of RepGFPN.
Figure 7. The structures of the CSPStage and Rep modules.
Figure 8. The structure of MSPB-YOLO.
Figure 9. Training curve results.
Figure 10. Visual comparison of disease detection results among the Ground Truth, the baseline model, and MSPB-YOLO under different scenarios: (a) early-stage infections on leaves, (b) mature infections on fruits, and (c) complex occlusion conditions. Blue boxes indicate leaf disease, green boxes indicate fruit disease, and white boxes indicate stem disease.
Table 1. Distribution of pepper target data.
Target   Leaf   Fruit   Stem
Totals   1184   812     824
Table 2. Experimental software and hardware configuration.
Item                             Specification
Deep Learning Framework          PyTorch 1.13.1
Data Processing Environment      Python 3.8.19
Operating System                 Windows 11
GPU Parallel Computing Library   CUDA 11.1
GPU                              NVIDIA RTX 3070 Laptop GPU
CPU                              AMD Ryzen 7 5800H
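As a quick way to confirm that a local setup matches the configuration in Table 2 before attempting to reproduce the experiments, a small check such as the following can be run (a minimal sketch; it only reports the installed versions and the visible GPU).

```python
# Minimal sketch: report framework, CUDA, and GPU details so they can be
# compared against Table 2. Values in comments are the ones listed there.
import platform
import torch

print("Python:", platform.python_version())       # e.g., 3.8.19
print("PyTorch:", torch.__version__)              # e.g., 1.13.1
print("CUDA (build):", torch.version.cuda)        # e.g., 11.1
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g., NVIDIA RTX 3070 Laptop GPU
```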
Table 3. Results of the comparison experiments.
Model          Params (M)   GFLOPs   P (%)   R (%)   mAP@0.5 (%)
Faster-RCNN    41.39        208      78.2    83.5    80.7
Cascade-RCNN   69.29        236      81.6    85.1    82.8
YOLOv5n        2.5          7.1      85.9    89.3    87.7
YOLOv7-Tiny    6.0          13.2     84.2    88.7    86.6
YOLOv10n       2.7          8.4      91.2    87.2    93.8
YOLOv11n       2.8          6.3      91.0    82.9    92.9
Ours           2.9          7.3      93.1    92.6    96.4
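As a point of reference for how the P, R, and mAP@0.5 values of the YOLO-family models in Table 3 can be computed, the following is a minimal sketch using the Ultralytics validation API; the weight file and dataset YAML names are hypothetical placeholders, not files released with this study.

```python
# Minimal sketch of obtaining P, R, and mAP@0.5 for a trained YOLO model
# with the Ultralytics framework. "best.pt" and "pepper_blight.yaml" are
# hypothetical placeholders, not artifacts released with this study.
from ultralytics import YOLO

model = YOLO("best.pt")                          # trained weights (placeholder)
metrics = model.val(data="pepper_blight.yaml")   # validation split defined in the YAML

print("Precision:", metrics.box.mp)      # mean precision over all classes
print("Recall:   ", metrics.box.mr)      # mean recall over all classes
print("mAP@0.5:  ", metrics.box.map50)   # mAP at an IoU threshold of 0.5
```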
Table 4. Ablation experiments.
RVB   RVB-EMA   GFPN   DIoU   Params (M)   GFLOPs   P (%)   R (%)   mAP@0.5 (%)
-     -         -      -      3.0          8.1      90.9    88.9    94.2
✓     -         -      -      2.3          6.4      92.7    90.0    94.4
-     ✓         -      -      2.6          7.1      93.1    89.0    94.9
-     ✓         ✓      -      2.9          7.3      94.1    91.0    95.9
-     ✓         ✓      ✓      2.9          7.3      93.1    92.6    96.4
Table 5. Comparative experimental results of different attention mechanisms.
YOLOv8   MLCA   SimAM   SE   ELA   EMA   P (%)   R (%)   mAP@0.5 (%)
✓        -      -       -    -     -     92.7    93.1    95.8
✓        ✓      -       -    -     -     93.5    89.9    95.2
✓        -      ✓       -    -     -     94.4    90.5    95.3
✓        -      -       ✓    -     -     90.5    91.7    94.5
✓        -      -       -    ✓     -     92.7    89.3    94.6
✓        -      -       -    -     ✓     94.1    91.0    95.9
Table 6. Comparative experimental results of different loss functions.
CIoU   GIoU   EIoU   SIoU   DIoU   P (%)   R (%)   mAP@0.5 (%)
✓      -      -      -      -      94.1    91.0    95.9
-      ✓      -      -      -      93.3    91.3    95.8
-      -      ✓      -      -      86.5    83.3    89.7
-      -      -      ✓      -      94.4    91.1    95.8
-      -      -      -      ✓      93.1    92.6    96.4
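For reference, the DIoU loss compared in Table 6 penalizes the normalized distance between box centers in addition to the IoU term [47]. The following is a minimal PyTorch sketch of that computation, assuming boxes in (x1, y1, x2, y2) format; it illustrates the published formula, not the exact MSPB-YOLO training code.

```python
# Minimal sketch of the DIoU loss [47]: L_DIoU = 1 - IoU + d^2 / c^2,
# where d is the distance between box centers and c is the diagonal of the
# smallest enclosing box. pred and target are (N, 4) tensors in xyxy format.
# This illustrates the published formula, not the exact MSPB-YOLO code.
import torch

def diou_loss(pred, target, eps=1e-7):
    # Intersection area
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h

    # Union area and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers (d^2)
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    center_dist2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    # Squared diagonal of the smallest enclosing box (c^2)
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    diag2 = enc_w ** 2 + enc_h ** 2 + eps

    return (1 - iou + center_dist2 / diag2).mean()
```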
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
