Article

A Robust YOLOv5 Model with SE Attention and BIFPN for Jishan Jujube Detection in Complex Agricultural Environments

Hao Chen, Lijun Su, Yiren Tian, Yixin Chai, Gang Hu and Weiyi Mu
1 School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China
2 Shaanxi Key Laboratory for Network Computing and Security Technology, Xi’an 710048, China
3 School of Science, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(6), 665; https://doi.org/10.3390/agriculture15060665
Submission received: 13 February 2025 / Revised: 15 March 2025 / Accepted: 17 March 2025 / Published: 20 March 2025
(This article belongs to the Section Digital Agriculture)

Abstract

This study presents an improved detection model based on the YOLOv5 (You Only Look Once version 5) framework to enhance the accuracy of Jishan jujube detection in complex natural environments, particularly under varying degrees of occlusion and dense foliage. To improve detection performance, we integrate an SE (squeeze-and-excitation) attention module into the backbone network to enhance the model’s ability to focus on target objects while suppressing background noise. Additionally, the original neck network is replaced with a BIFPN (bi-directional feature pyramid network) structure, enabling efficient multiscale feature fusion and improving the extraction of critical features, especially for small and occluded fruits. The experimental results demonstrate that the improved YOLOv5 model achieves a mean average precision (mAP) of 96.5%, outperforming the YOLOv3, YOLOv4, SSD (Single-Shot Multibox Detector), and YOLOv5 models by 7.4%, 9.9%, 2.5%, and 0.8%, respectively. Furthermore, the proposed model improves precision (95.8%) and F1 score (92.4%), reducing false positives and achieving a better balance between precision and recall. These results highlight the model’s effectiveness in addressing missed detections of small and occluded fruits while maintaining higher confidence in predictions.

1. Introduction

Jishan jujube (Ziziphus jujuba Mill. cv. Jishan), a high-value fruit crop in China, is renowned for its thin skin, thick flesh, and nutritional richness, particularly in vitamins C and P [1]. However, its short harvesting window (typically 2–3 weeks) and susceptibility to physical damage during manual picking pose significant economic challenges. Current manual harvesting practices not only incur high labor costs but also lead to post-harvest losses due to bruising and rapid decay [2]. These limitations underscore the urgent need for automated harvesting systems, where robust fruit detection serves as the cornerstone for efficient robotic operations.
The rapid development of deep learning has profoundly impacted agricultural object detection, which is commonly categorized into two-stage methods (e.g., the Region-based Convolutional Neural Network (R-CNN) [3], Fast R-CNN [4], and Faster R-CNN [5]) and one-stage approaches (e.g., the SSD (Single-Shot Multibox Detector) [6] and the YOLO (You Only Look Once) family [7,8,9,10]). While two-stage methods excel in accuracy, their computational complexity and slow inference speed hinder real-time deployment in dynamic orchard environments [11]. Conversely, one-stage detectors achieve a balance between speed and accuracy, making them highly efficient. For instance, YOLOv5, with its lightweight architecture and adaptive training strategies, has been widely adopted for fruit detection tasks, such as the recognition of mangos [12], apples [13], and others [14,15]. Nevertheless, detecting small, densely clustered fruits such as Jishan jujubes under natural conditions remains challenging due to significant occlusions, varying scales, and complex backgrounds.
Prior studies have attempted to address these challenges through algorithmic enhancements. For example, Liu et al. [16] integrated an SE attention module into YOLOv3 to improve winter jujube detection, achieving an mAP of 92.1%. Similarly, Liang et al. [17] optimized YOLOv4 with an L1-norm NMS for tomato defect detection, reducing false positives in occluded scenarios. Peng et al. [18] enhanced SSD for lychee detection by removing deep convolutional layers and incorporating receptive field modules to improve small target detection. Despite these efforts, existing models struggle with Jishan jujubes’ unique growth patterns; the fruits grow in irregular clusters, with upper branches bearing heavier loads than lower ones and foliage frequently obscuring targets [19].
To bridge these gaps, this study proposes an improved YOLOv5 detection model specifically for Jishan jujube. Our contributions are threefold:
  • Architecture Optimization: We integrate the SE attention module into the backbone network to improve feature discriminability, enabling the model to suppress background noise while emphasizing occluded or small targets.
  • Efficient Multiscale Fusion: The original neck network is substituted with a BIFPN structure, which dynamically weights features across scales to improve detection accuracy for densely packed fruits.
  • Comprehensive Validation: Experimental results show that our model achieves a state-of-the-art mAP of 96.5%, outperforming existing YOLO variants and SSD under diverse occlusion levels and lighting conditions.
This work not only advances precision agriculture but also provides a scalable framework for the precise detection of other clustered fruits, such as grapes or cherries, in unstructured environments. By addressing the challenges of occlusion, scale variation, and lighting changes, our model enhances the reliability and accuracy of fruit detection systems in diverse agricultural settings.

2. Materials and Methods

2.1. Dataset Preparation

This study mainly focuses on the detection of Jishan jujubes, with image data collected from the Jishan orchard base in Shanxi Province between late July and mid-September 2022. Images were captured using Huawei Nova 6 (48 MP, f/1.8 aperture; manufactured by Huawei Technologies Co., Ltd., Shenzhen, China) and Vivo S16 (50 MP, f/1.9 aperture; manufactured by Vivo Communication Technology Co., Ltd., Dongguan, China) devices, targeting jujube trees with intact and densely packed fruits. The dataset encompasses various angles (0°–360°), distances (0.5–3 m), lighting conditions (sunny, cloudy, and shaded), and fruit densities (low, medium, and high) to ensure diversity and complexity. A total of 200 raw images were initially collected, with an average resolution of 4000 × 3000 pixels.
To enhance the generalization capability of the detection model, we applied several data augmentation techniques, including flipping, contrast enhancement, and brightness adjustment, among others. These augmentations introduce variability in lighting, orientation, and scale, expanding the dataset to 1400 images and strengthening the model’s robustness to varied imaging conditions.
The images were annotated using LabelImg, an open-source annotation tool, generating XML files that include the coordinates of the bounding boxes (top-left and bottom-right corners) for each target. As YOLO requires labels in TXT format, we converted the XML files into the desired format using a Python script. The generated TXT files included the normalized coordinates of the bounding box center, as well as the normalized width and height of the bounding boxes.
The final annotated dataset was split into a training set and a test set in a ratio of 8:2, resulting in 1120 images for training and 280 images for testing. To prevent data leakage, images generated from the same raw image through augmentation were assigned to the same set (training or test). The bounding box normalization was performed using Equations (1)–(3).
$$x_c = \frac{x_{\max} + x_{\min}}{2}, \qquad y_c = \frac{y_{\max} + y_{\min}}{2} \tag{1}$$
$$x = \frac{x_c}{W_{img}}, \qquad y = \frac{y_c}{H_{img}} \tag{2}$$
$$w = \frac{x_{\max} - x_{\min}}{W_{img}}, \qquad h = \frac{y_{\max} - y_{\min}}{H_{img}} \tag{3}$$
here, $x_c$ and $y_c$ represent the coordinates of the bounding box center in pixels, $x$ and $y$ are the normalized coordinates of the center, $w$ and $h$ denote the normalized width and height of the bounding box, and $W_{img}$ and $H_{img}$ are the width and height of the image, respectively.
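To make the conversion concrete, the following minimal Python sketch reads one LabelImg XML file and emits YOLO-format lines according to Equations (1)–(3); the function name, the file handling, and the single class index (0 for jujube) are illustrative assumptions rather than the authors’ actual script.

```python
import xml.etree.ElementTree as ET

def voc_to_yolo_lines(xml_path, class_id=0):
    """Convert one LabelImg XML annotation to YOLO TXT lines (Equations (1)-(3))."""
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    img_w = float(size.find("width").text)   # W_img
    img_h = float(size.find("height").text)  # H_img
    lines = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        x_min, y_min = float(box.find("xmin").text), float(box.find("ymin").text)
        x_max, y_max = float(box.find("xmax").text), float(box.find("ymax").text)
        # Equation (1): bounding box center in pixel coordinates
        x_c, y_c = (x_max + x_min) / 2, (y_max + y_min) / 2
        # Equation (2): normalized center; Equation (3): normalized width/height
        x, y = x_c / img_w, y_c / img_h
        w, h = (x_max - x_min) / img_w, (y_max - y_min) / img_h
        lines.append(f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}")
    return lines
```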

2.2. Traditional YOLOv5 Model

The YOLO (You Only Look Once) series represents a class of one-stage, regression-based object detection models known for their simplicity, compact model size, and real-time inference capabilities. YOLOv5, the fifth iteration of the YOLO network, further optimizes the architecture to enhance both detection speed and accuracy [7,8,9,10].
Compared with its predecessors (YOLOv3 and YOLOv4), YOLOv5 introduces several improvements, including adaptive anchor box computation, advanced data augmentation techniques, and a more efficient backbone design, enhancing its capability for detecting jujube fruits in complex environments [12,18].
As shown in Figure 1, YOLOv5 is composed of four main components: the input layer, the backbone network, the neck network, and the output layer.

2.2.1. Input Layer

The input layer preprocesses images using various algorithms, such as data augmentation, adaptive anchor box computation, and adaptive image scaling. Unlike earlier YOLO versions that rely on external programs for anchor box computation, YOLOv5 embeds this process directly within its code. By iteratively comparing predicted anchor boxes with ground-truth boxes, the model identifies optimal anchor sizes tailored to each dataset [7]. Additionally, adaptive image scaling adjusts input images to a fixed size with minimal padding, reducing computational overhead and improving inference speed.
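Adaptive image scaling is essentially a letterbox operation: resize with the aspect ratio preserved, then pad only up to the next multiple of the network stride rather than to a full square. The OpenCV sketch below illustrates the idea under assumed defaults (640-pixel target, stride 32, gray padding); it is a simplification of, not a substitute for, YOLOv5’s own preprocessing.

```python
import cv2

def letterbox(img, target=640, stride=32, color=(114, 114, 114)):
    """Aspect-preserving resize with minimal stride-aligned padding."""
    h, w = img.shape[:2]
    r = min(target / h, target / w)                  # single scale factor
    new_w, new_h = int(round(w * r)), int(round(h * r))
    img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    # Pad only to the next multiple of the stride, not to a full square
    pad_w = (stride - new_w % stride) % stride
    pad_h = (stride - new_h % stride) % stride
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```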

2.2.2. Backbone Network

The backbone network focuses on feature extraction from input images. As shown in Figure 1, the C3 module functions through two branches: one branch processes features using multiple bottleneck stacks combined with three standard convolutional layers, while the other branch consists of a basic convolutional module. The outputs from these two branches are concatenated to create enriched feature representations [7,9].

2.2.3. Neck Network

The neck network integrates FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) structures to enhance feature fusion. The FPN passes high-level semantic features from top to bottom, allowing the model to capture contextual information [20]. Conversely, the PAN propagates strong localization features from bottom to top, mitigating the loss of low-level information caused by the FPN [21]. By combining information from different detection layers across the backbone, this structure effectively aggregates parameters for improved localization accuracy, particularly for small and occluded objects.
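To illustrate how the two pathways compose, the PyTorch sketch below wires a three-scale top-down (FPN) pass followed by a bottom-up (PAN) pass; the channel widths and the bare additive fusion are simplifying assumptions and omit the C3 blocks of the real YOLOv5 neck.

```python
import torch.nn as nn
import torch.nn.functional as F

class FpnPanNeck(nn.Module):
    """Minimal FPN + PAN fusion over three feature scales (strides 8/16/32)."""
    def __init__(self, channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in channels])
        self.down = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                   for _ in channels[:-1]])

    def forward(self, c3, c4, c5):
        p3, p4, p5 = (lat(x) for lat, x in zip(self.lateral, (c3, c4, c5)))
        # FPN: push high-level semantics top-down by upsampling and adding
        p4 = p4 + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = p3 + F.interpolate(p4, scale_factor=2, mode="nearest")
        # PAN: push strong localization features bottom-up with strided convs
        n4 = p4 + self.down[0](p3)
        n5 = p5 + self.down[1](n4)
        return p3, n4, n5   # detection heads attach to these three scales
```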

2.2.4. Output Layer

In the output layer, the class and probability of each detected object are predicted. YOLOv5 uses CIOU_LOSS (Complete Intersection-over-Union Loss) as the loss function for bounding boxes. This loss function calculates the distance between the centers of the predicted and ground-truth boxes, as well as their aspect ratios, to ensure precise alignment between the predictions and the actual objects. The CIOU_LOSS is defined as follows:
$$\mathrm{CIOU\_LOSS} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \tag{4}$$
here, $IoU$ refers to the intersection over union, $\rho$ denotes the Euclidean distance between the centers of the predicted box $b$ and the ground-truth box $b^{gt}$, $c$ represents the diagonal length of the smallest enclosing box, $\alpha$ is a balancing parameter, and $v$ evaluates the consistency of aspect ratios.
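Written directly from Equation (4) and the standard definition of the aspect-ratio term $v$, the loss can be sketched in PyTorch as follows; the (x1, y1, x2, y2) box layout and the function name are assumptions for illustration.

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """CIoU loss (Equation (4)) for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    inter_w = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # rho^2: squared distance between the two box centers
    rho2 = ((pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) ** 2) / 4
    # c^2: squared diagonal of the smallest box enclosing both boxes
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency term; alpha: balancing parameter
    v = (4 / math.pi ** 2) * (
        torch.atan((gt[:, 2] - gt[:, 0]) / (gt[:, 3] - gt[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```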

2.3. Improved YOLOv5 Model

2.3.1. SE Attention Mechanism

The SE (squeeze-and-excitation) attention mechanism [22], as shown in Figure 2, is widely adopted due to its simplicity, computational efficiency, and ability to enhance the representational capacity of convolutional neural networks. It has demonstrated exceptional performance in object detection tasks, particularly in scenarios where the network needs to prioritize informative channels while suppressing less relevant ones.
In the Squeeze stage, global spatial information is embedded by compressing an input feature map of dimensions c × h × w into a c × 1 × 1 feature map through global average pooling. This operation generates channel-wise statistics, capturing global contextual information. In the Excitation stage, two fully connected layers are used to adaptively recalibrate the feature map. The first layer reduces the c-dimensional vector to c/r dimensions via ReLU, and the second layer restores it to c dimensions via Sigmoid, generating a channel-wise weight matrix. The original feature map is rescaled using this matrix, highlighting crucial channels while diminishing the significance of less relevant ones. From a mathematical perspective, the SE mechanism can be formulated as:
$$z = F_{sq}(U) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U(i, j), \tag{5}$$
$$s = F_{ex}(z, W) = \sigma\left(W_2\, \delta(W_1 z)\right), \tag{6}$$
here, $U$ represents the input feature map, $z$ is the squeezed feature vector, and $s$ denotes the excitation output. The weights of the fully connected layers are $W_1$ and $W_2$, while $\delta$ and $\sigma$ correspond to the ReLU and Sigmoid activation functions, respectively.
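Equations (5) and (6) translate almost line for line into a PyTorch module. The sketch below is a generic SE block; the reduction ratio r = 16 and the exact insertion points in the backbone are assumptions, since the paper does not state them.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block implementing Equations (5) and (6)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # F_sq: global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: reduce c -> c/r
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2: restore c/r -> c
            nn.Sigmoid(),                                # sigma: weights in (0, 1)
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)        # z = F_sq(U), shape (b, c)
        s = self.excite(z).view(b, c, 1, 1)   # s = F_ex(z, W)
        return u * s                          # rescale each channel of U by s
```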

2.3.2. BIFPN

The FPN (feature pyramid network) and PAN (path aggregation network) are widely used for multiscale feature extraction and localization in object detection tasks. However, these networks have limitations in efficiently fusing features across different scales. The FPN, as shown in Figure 3a, combines features through simple summation or concatenation, which does not consider the varying contributions of different features. On the other hand, the PAN, depicted in Figure 3b, improves upon the FPN by introducing a top-down and bottom-up structure for feature propagation. Despite this, it still struggles with effectively weighting features during the fusion process.
To overcome these limitations, a weighted BIFPN (bidirectional feature pyramid network) [23] was proposed. As shown in Figure 3c, the BIFPN introduces a weighted fusion mechanism, where each input feature map is assigned a unique weight during the fusion process. This allows the network to prioritize more relevant feature maps and suppress less useful ones.
The BIFPN performs weighted fusion across multiple scales by combining feature maps from different network layers. It achieves this through bidirectional fusion (both top-down and bottom-up), transferring information across scales more efficiently.
Given the input feature maps $F_1, F_2, \ldots, F_n$ (each corresponding to a different scale), the weighted fusion process in the BIFPN can be expressed as follows:
$$F_{fused} = \sum_{i=1}^{n} w_i \cdot F_i \tag{7}$$
where $F_i$ is the $i$-th feature map (each feature map is a 2D matrix representing features at a specific scale), $w_i$ is the learned weight representing the importance of each feature map in the fusion process, and $F_{fused}$ is the final fused feature map after weighted summation.
The BIFPN also computes the weights for the feature maps to adjust their contributions to the fused result. The weight calculation can be represented as follows:
$$w_i = \frac{1}{1 + e^{-\alpha_i \cdot (F_i - F_{avg})}} \tag{8}$$
where $\alpha_i$ is a learnable parameter that adjusts the weight of each feature map, $F_i$ is the $i$-th feature map, and $F_{avg}$ is the average feature map across all scales.
Additionally, the BIFPN introduces extra connections between the input and output nodes at the same level, facilitating the aggregation of features across different scales. This design significantly enhances the efficiency and effectiveness of multiscale feature fusion, particularly for detecting small and occluded objects.
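Equation (8) gives the paper’s sigmoid-based weighting; the original BIFPN [23] instead uses “fast normalized fusion”, in which ReLU-clipped learnable weights are divided by their sum. The sketch below implements that normalized variant for inputs already brought to the same resolution, as a generic illustration of learned weighted fusion (cf. Equation (7)) rather than the authors’ exact code.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learned weighted sum of same-resolution feature maps (cf. Equation (7))."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # one weight per input map
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)            # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, feats))
```

In a full BIFPN node, each fused map is typically followed by a convolution, and the same fusion module is reused in both the top-down and bottom-up passes.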

2.3.3. YOLOv5 Model Based on the SE and BIFPN Mechanisms

Figure 4 shows the proposed model’s architecture, integrating the SE attention mechanism and BIFPN into YOLOv5. The SE mechanism is added to the backbone network to enhance the focus on important channels in the global feature space. The SE mechanism adaptively weights channels, highlighting informative ones and reducing noise, thereby enhancing fruit detection in cluttered backgrounds.
The YOLOv5 neck structure is upgraded to BIFPN for improved feature fusion. BIFPN adds cross-scale connections, eliminates less impactful nodes, and links input and output nodes at the same levels, enhancing multiscale feature coupling and enabling better detection of small targets like occluded or varying-sized jujube fruits.
The integration of these mechanisms significantly reduces the missed detection rate for smaller jujube fruits, particularly in complex orchard scenarios. Experimental results show the proposed model attains a 96.5% mAP, exceeding the baseline YOLOv5 by 0.8%.

3. Results

3.1. Experimental Setup

To ensure the reproducibility and reliability of the results, all the experiments were conducted in a consistent and controlled computing environment. The detailed hardware and software configurations are summarized in Table 1.
All the experiments were conducted under identical software and hardware conditions to ensure fair comparisons. The computational resources provided sufficient capacity for model training and evaluation.

3.2. Evaluation Metrics

The proposed model’s performance was assessed using precision (P), recall (R), mean average precision (mAP), and F1 score. These metrics were selected for their common application in object detection, offering a thorough evaluation of the model’s accuracy in detecting and classifying objects.
Precision (P) indicates the accuracy of positive predictions, recall (R) the detection rate of actual positives, and the F1 score balances both through their harmonic mean, defined as:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R} \tag{9}$$
Here, true positives (TP) denote correctly identified positive samples, false positives (FP) represent incorrectly identified positive samples, and false negatives (FN) are actual positive samples that were not identified by the model.
The average precision (AP) is calculated as the area under the precision–recall (PR) curve, which reflects the model’s detection performance at various thresholds. The mean average precision (mAP) is then obtained by averaging the AP values across all detection categories. These are computed as follows:
$$AP = \int_{a}^{b} P(R)\, dR, \qquad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{10}$$
In the formula above, $P(R)$ is the precision as a function of recall, $a$ and $b$ are the integration limits (usually 0 to 1), $AP_i$ is the AP for the $i$-th class, and $n$ is the total number of classes.
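As a quick numerical check of Equations (9) and (10), the sketch below computes the metrics from raw counts and approximates AP by integrating a precision–recall curve; the counts are hypothetical, chosen so that P ≈ 95.8% and R = 89.2% reproduce the reported F1 of 92.4%.

```python
import numpy as np

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (Equation (9))."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(precision, recall):
    """AP as the area under the PR curve (Equation (10)), via trapezoidal rule."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])

p, r, f1 = prf1(tp=892, fp=39, fn=108)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")  # P=0.958, R=0.892, F1=0.924
```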

3.3. Hyperparameter Settings

The hyperparameters used in our experiments are summarized in Table 2. We employed the SGD optimizer with a momentum of 0.937 and a weight decay of 0.0005. The learning rate followed the one-cycle scheduling strategy, starting from an initial learning rate of 0.01. The model was trained for 500 epochs with a batch size of 8 and an image input size of 640 × 640 pixels.
This setup ensures a balance between model generalization and computational efficiency. The chosen values for these hyperparameters are based on empirical validation and commonly used settings in object detection tasks.

3.4. Performance Evaluation Using Different Models

Comparative experiments under consistent conditions evaluated the detection performance of mainstream object detection networks on the Jishan jujube dataset. Metrics including precision (P), recall (R), mAP, and F1 score are listed in Table 3, while Figure 5 visually compares their qualitative performance.
Table 3 reveals that YOLOv3, YOLOv4, and SSD consistently underperform relative to YOLOv5 across all metrics. Notably, the proposed improved YOLOv5 model demonstrates superior performance, with the mean average precision (mAP) reaching 96.5%, an increase of 0.8% compared with the standard YOLOv5. This improvement highlights the effectiveness of the SE and BIFPN mechanisms incorporated into the proposed model.
P and R of the proposed model also improved significantly, reaching 95.8% and 89.2%, respectively. Furthermore, the F1 score of the proposed model reached 92.4%, indicating a balanced performance in precision and recall. The results highlight the proposed model’s robustness and accuracy in detecting Jishan jujubes under challenging natural conditions.
The detection results in Figure 5 across different scenarios demonstrate significant differences in performance among the evaluated models. For images with unobstructed jujube fruits (Figure 5a), all models successfully detected the fruits, with the proposed model achieving the highest average confidence score of 0.94.
For images with slight occlusions (Figure 5b), the SSD model showed the poorest detection performance, slightly trailing the YOLO-series models. Among these, the YOLOv5 model achieved a relatively high detection rate, and the proposed model further improved the average confidence score relative to YOLOv5.
In scenarios with medium-sized fruits and severe occlusion (Figure 5c), none of the models were able to detect fruits occluded by leaves at the edges of the image. However, the YOLOv5 model achieved an average confidence score of 0.89, which was higher than those of YOLOv3 (0.85), YOLOv4 (0.86), and SSD (0.82). The proposed model was further improved, increasing the average confidence score to 0.93.
For densely packed fruits with some occlusions (Figure 5d), all models exhibited varying degrees of missed detections. YOLOv4 performed moderately well but with a low average confidence score. YOLOv5 detected most fruits but struggled with blurred and smaller targets, achieving an average confidence score of 0.86. The proposed model outperformed all other models by detecting more fruits with higher confidence, including those that were blurred or smaller. While not all fruits were detected, the detected fruits showed a higher average confidence score of 0.91, indicating greater reliability.

3.5. Performance Evaluation of Improved YOLOv5 Model Variants

The training set underwent unified offline augmentation, and four YOLOv5 variants—the original YOLOv5, YOLOv5-SE, YOLOv5-BIFPN, and the proposed enhanced model—were trained and evaluated on the same test set. Results are shown in Table 4.
Compared to the original YOLOv5, the YOLOv5-SE network showed a 1.0% decrease in precision but a 3.4% gain in recall, as indicated in Table 4. In contrast, the YOLOv5-BIFPN network achieved a 1.4% increase in precision, while the recall decreased by 1.5%. The proposed enhanced model outperformed all other variants, with a precision gain of 3.5% and a slight improvement of 0.4% in recall. The proposed model achieved the highest mAP, rising by 0.8%, and enhanced the F1 score by 1.9%. These results demonstrate the effectiveness of the proposed model in balancing precision and recall, ultimately achieving superior detection performance across all metrics.
Figure 6 provides a comparative visualization of the detection results across different models under various conditions: Figure 6a depicts unobstructed images, Figure 6b depicts slight occlusion, Figure 6c depicts severe occlusion, and Figure 6d depicts densely packed Jishan jujubes.
For unobstructed images of Jishan jujubes (Figure 6a), all models, including YOLOv5, YOLOv5-SE, YOLOv5-BIFPN, and the proposed model, successfully detected the targets with high confidence. While the differences in confidence scores are marginal, the proposed model exhibited the highest precision (95.8%) among all models (Table 4), reducing false positives.
In the case of images with slight occlusion (Figure 6b), all models experienced a decrease in recall due to occlusion. Notably, YOLOv5-BIFPN misclassified some leaves as Jishan jujubes, while the proposed model demonstrated superior robustness, achieving the highest F1 score (92.4%), balancing precision and recall effectively.
For images with severe occlusion (Figure 6c), the YOLOv5 model had a recall of 88.8%, whereas the proposed model improved recall to 89.2%, outperforming YOLOv5-SE and YOLOv5-BIFPN. This indicates that the proposed model retains detection capability even in challenging scenarios.
In densely packed images where Jishan jujubes were partially obscured (Figure 6d), YOLOv5 suffered from significant under-detection, leading to a lower recall. While YOLOv5-SE and YOLOv5-BIFPN showed moderate improvements, the proposed model successfully detected more obscured and small fruits, achieving the highest mAP (96.5%) and improving overall detection performance.
These results demonstrate that the proposed model consistently outperforms other YOLOv5 variants, particularly in challenging scenarios, such as those with severe occlusion and dense object distributions. Its robustness highlights its potential for real-world applications in complex environments.

3.6. Evaluation of Models with Different Attention Mechanisms

To investigate the impact of attention mechanisms on Jishan jujube detection, the CA (coordinate attention) [22], ECA (efficient channel attention) [24], and CBAM (convolutional block attention module) [25] modules were each embedded into the backbone network of YOLOv5. These attention mechanisms were chosen for their ability to enhance feature representation by focusing on critical spatial and channel information, which is particularly important for detecting small and occluded fruits in complex orchard environments.
The models were evaluated under identical experimental conditions, with results shown in Table 5, and detection effects illustrated in Figure 7.
As shown in Table 5, compared with the original YOLOv5 network, the addition of the CA, ECA, and CBAM modules increased the recall rates by 2.4%, 1.3%, and 2.3%, respectively. However, the precision rates decreased by 4.6%, 1.0%, and 2.9%, respectively, and the mean average precision (mAP) also decreased. This trade-off between precision and recall suggests that while these attention mechanisms improve the model’s ability to detect more fruits, they may also introduce false positives, particularly in complex scenes with dense foliage and occlusions.
In contrast, the proposed model, which integrates the SE (squeeze-and-excitation) attention mechanism, demonstrated improvements in both precision and recall rates compared with the original network, along with increases in the overall mAP and F1 score. The SE mechanism’s ability to adaptively recalibrate channel-wise feature responses allows the model to focus on informative features while suppressing background noise, leading to more accurate and robust detection.
Figure 7 shows the detection results on unobstructed images of Jishan jujubes (Figure 7a), images with slight occlusion (Figure 7b), images with severe occlusion (Figure 7c), and images with densely packed fruits (Figure 7d).
All the models with embedded attention mechanisms successfully detected Jishan jujubes in these images. In terms of average confidence scores, the SE attention mechanism embedded in the proposed model yielded the best detection results. For images with densely packed Jishan jujubes and some occlusion (Figure 7d), YOLOv5-CA, YOLOv5-ECA, YOLOv5-CBAM, and the proposed model all exhibited improvements in reducing missed detections, successfully identifying Jishan jujubes that YOLOv5 failed to detect. Additionally, the proposed model not only enhanced detection accuracy but also detected the highest number of Jishan jujubes, particularly in challenging scenarios with severe occlusion and dense object distributions.
These experimental comparisons demonstrate that the improved model proposed in this study enhances detection accuracy for Jishan jujubes, particularly for small and occluded Jishan jujubes. The enhancements significantly contribute to precise detection, making the proposed model a promising solution for automated jujube harvesting in complex orchard environments.

3.7. Error Analysis

To further evaluate the detection performance of different models, we conducted an error analysis based on 200 experimental trials for each model. The false positive (FP) rate and false negative (FN) rate were computed to assess the model’s ability to correctly detect Jishan jujubes while minimizing misclassifications. The results are presented in Table 6.
Table 6 shows that the proposed model achieved the lowest false positive rate (3.1%), significantly reducing the misclassification of background elements as Jishan jujubes. In contrast, YOLOv4 exhibited the highest false positive rate (6.1%), suggesting a tendency to incorrectly classify non-target areas as Jishan jujubes. YOLOv3, SSD, and YOLOv5 showed moderate false positive rates of 4.2%, 5.2%, and 4.5%, respectively, but they were still higher than that of the proposed model.
Similarly, the proposed model demonstrated the lowest false negative rate (1.9%), indicating that it effectively identified the majority of Jishan jujubes while minimizing missed detections. On the other hand, YOLOv4 had the highest false negative rate (4.5%), implying that it failed to detect a considerable number of Jishan jujubes, which negatively impacted recall. YOLOv3, SSD, and YOLOv5 recorded false negative rates of 2.7%, 3.3%, and 2.1%, respectively, performing reasonably well but not as effectively as the proposed model.
Each model was evaluated across 200 trials, and the standard deviation values reflect the consistency of the detection performance. The proposed model not only achieved the best false positive and false negative rates but also exhibited the lowest standard deviation, suggesting high robustness across multiple test scenarios.

4. Discussion

The findings of this study demonstrate the effectiveness of the proposed improved YOLOv5-based algorithm in detecting Jishan jujubes under complex natural conditions. By integrating the SE (squeeze-and-excitation) attention mechanism and the BIFPN (bidirectional feature pyramid network) structure, the model enhances feature extraction and fusion, significantly improving detection accuracy for occluded and densely packed fruits. These results highlight the potential of advanced attention mechanisms and multiscale feature fusion techniques in agricultural object detection.
However, certain limitations should be noted. First, the model was trained and evaluated on a dataset specifically curated for Jishan jujube detection. Although the dataset includes diverse lighting conditions, occlusion levels, and fruit densities, its generalizability to other fruit detection tasks (e.g., apples and citrus fruits) requires further investigation. Second, the inclusion of the BIFPN structure increases computational complexity, leading to longer inference times.

5. Conclusions

In this study, an improved YOLOv5-based detection algorithm was developed to enhance the accuracy and robustness of Jishan jujube detection in complex natural environments. By incorporating advanced techniques such as data augmentation, the SE attention mechanism, and the BIFPN structure, the proposed model effectively addressed challenges related to dense occlusions, scale variations, and background interference. The experimental results showed that the model achieved a mean average precision (mAP) of 96.5%, outperforming other YOLOv5-based models incorporating attention mechanisms such as CA, ECA, and CBAM. Additionally, the proposed model demonstrated improved precision (95.8%) and F1 score (92.4%), effectively reducing false positives and achieving a better balance between precision and recall, particularly in detecting small and occluded fruits. The SE attention mechanism enhanced the model’s ability to capture contextual information by adaptively recalibrating channel-wise feature responses, while the BIFPN structure improved multiscale feature fusion, enabling more accurate detection of small and occluded fruits. Finally, the proposed model demonstrated strong generalization performance and robustness across diverse image datasets with varying complexities.
These findings contribute to the advancement of agricultural object detection, highlighting the potential of attention mechanisms and multiscale feature fusion in improving fruit detection accuracy.

Author Contributions

Conceptualization, H.C. and L.S.; data curation, Y.T.; formal analysis, H.C. and G.H.; funding acquisition, L.S.; investigation, Y.T. and Y.C.; methodology, H.C. and L.S.; project administration, L.S.; resources, G.H. and W.M.; software, Y.T.; supervision, L.S.; validation, H.C. and G.H.; visualization, H.C.; writing—original draft, H.C. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Science and Technology Project of Xinjiang Uyghur Autonomous Region (2023A02002-4).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, L.; Guo, L.; Li, Q. Development and Conservation of Banzao Production System in Jishan. Agric. Technol. Equip. 2021, 47, 44–45. [Google Scholar]
  2. Hu, Y.G. Design of Automatic Strawberry Picking Machinery. For. Mach. Woodwork. Equip. 2021, 49, 26–30. [Google Scholar]
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  4. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Zhang, Z.; Luo, M.; Guo, S.; Liu, G.; Li, S.; Zhang, Y. Cherry Fruit Detection Method in Natural Scene Based on Improved YOLO v5. Trans. Chin. Soc. Agric. Mach. 2022, 53, 232–240. [Google Scholar]
  12. Xue, Y.; Huang, N.; Tu, S.; Mao, L.; Yang, A.X.; Yang, X.; Chen, P. Immature mango detection based on improved YOLOv2. Trans. Chin. Soc. Agric. Eng. 2018, 34, 173–179. [Google Scholar]
  13. Zhang, S.F. Research on Apple Target Recognition and Location Algorithm Based on Deep Learning. Master’s Thesis, Zhejiang University of Technology, Hangzhou, China, 2020. [Google Scholar]
  14. Wang, Y.; Xue, J. Lightweight object detection method for Lingwu long jujube images based on improved SSD. Trans. Chin. Soc. Agric. Eng. 2021, 37, 173–182. [Google Scholar]
  15. Hao, J.; Bing, Z.; Yang, S.; Yang, J.; Sun, L. Detection of green walnut by improved YOLOv3. Trans. Chin. Soc. Agric. Eng. 2022, 38, 183–190. [Google Scholar]
  16. Zhao, H.; Qiao, Y.; Wang, H.; Yue, Y. Apple fruit recognition in complex orchard environment based on improved YOLOv3. Trans. Chin. Soc. Agric. Eng. 2021, 37, 127–135. [Google Scholar]
  17. Liang, X.; Pang, Q.; Yang, Y.; Wen, C.; Li, Y.; Huang, W.; Zhang, C.; Zhao, C. Online detection of tomato defects based on YOLOv4 model pruning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 283–292. [Google Scholar]
  18. Peng, H.; Li, J.; Xu, H.; Chen, H.; Xing, Z.; He, H.; Xiong, J. Litchi detection based on multiple feature enhancement and feature fusion SSD. Trans. Chin. Soc. Agric. Eng. 2022, 38, 169–177. [Google Scholar]
  19. Gao, F.; Fu, L.; Zhang, X.; Majeed, Y.; Li, R.; Karkee, M.; Zhang, Q. Multi-class fruit-on-plant detection for apple in SNAP system using Faster R-CNN. Comput. Electron. Agric. 2020, 176, 105634. [Google Scholar] [CrossRef]
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  22. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717. [Google Scholar]
  23. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
Figure 1. Architecture of the YOLOv5 model: The input layer preprocesses images. The backbone network extracts features, while the neck network combines an FPN and PAN for multiscale feature fusion. The output layer predicts bounding boxes and class probabilities.
Figure 2. SE attention mechanism: The Squeeze stage compresses spatial dimensions, while the Excitation stage adaptively recalibrates channel weights.
Figure 3. Structures of different feature fusion networks. (a) FPN: A feature pyramid network focusing on multiscale feature extraction. (b) PAN: A path aggregation network designed for better feature localization. (c) BIFPN: A combination of the FPN and PAN structures to enhance both multiscale and localization features.
Figure 4. Architecture of the improved YOLOv5 with SE and the BIFPN. The SE mechanism is integrated into the backbone network, while the BIFPN replaces the original neck structure for efficient multiscale feature fusion.
Figure 5. Detection results of different mainstream models under varying conditions: (a) unobstructed fruits, (b) slightly occluded fruits, (c) moderately occluded and medium-sized fruits, and (d) densely packed and occluded fruits.
Figure 6. Comparison of the detection results under varying conditions using YOLOv5, YOLOv5-SE, YOLOv5-BIFPN, and the proposed model. (a) Unobstructed fruits, (b) slightly occluded fruits, (c) moderately occluded and medium-sized fruits, and (d) densely packed and occluded fruits.
Figure 7. Detection effects of different attention mechanism models under varying conditions: (a) unobstructed fruits, (b) slightly occluded fruits, (c) moderately occluded and medium-sized fruits, and (d) densely packed and occluded fruits.
Table 1. Experimental setup.
Category | Specification
Operating System | Windows 10 (64-bit)
Programming Language | Python 3.6
Deep Learning Framework | PyTorch 1.7.0
Processor | Intel Core i7-9700K (3.60 GHz)
GPU | NVIDIA GeForce RTX 2080 (8 GB)
Memory | 32 GB
Table 2. Hyperparameter settings used in the experiments.
Hyperparameter | Value | Description
Batch Size | 8 | Number of images per batch
Initial Learning Rate (lr0) | 0.01 | Starting learning rate
Learning Rate Scheduler | one_cycle | Strategy for adjusting the learning rate
Optimizer | SGD | Stochastic Gradient Descent
Momentum | 0.937 | Momentum parameter for SGD
Weight Decay | 0.0005 | L2 regularization strength
Number of Epochs | 1000 | Total training iterations
Warmup Epochs | 3 | Number of epochs for learning rate warm-up
Warmup Momentum | 0.8 | Initial momentum during warm-up
Box Loss Gain | 0.05 | Weight for bounding box loss
Class Loss Gain | 0.5 | Weight for classification loss
Object Loss Gain | 1.0 | Weight for object detection loss
Anchor Threshold (T_anchor) | 4.0 | Threshold for anchor matching
Image Size | 640 | Input image resolution (pixels)
Label Smoothing | 0.0 | Degree of label smoothing
Freeze Layers | 0 | Number of frozen layers during training
Evolve Generations | 300 | Number of generations for hyperparameter evolution
Table 3. Comparison of the performance of detection models on the Jishan jujube dataset.
Model | P (%) | R (%) | mAP (%) | F1 (%)
YOLOv3 | 89.0 | 85.8 | 89.1 | 87.4
YOLOv4 | 93.2 | 72.6 | 86.6 | 81.6
SSD | 91.9 | 84.3 | 94.0 | 87.9
YOLOv5 | 92.3 | 88.8 | 95.7 | 90.5
Proposed Model | 95.8 | 89.2 | 96.5 | 92.4
Table 4. Quantitative comparison of YOLOv5 variants in terms of detection metrics.
Model | P (%) | R (%) | mAP (%) | F1 (%)
YOLOv5 | 92.3 | 88.8 | 95.7 | 90.5
YOLOv5-SE | 91.3 | 92.2 | 94.7 | 91.7
YOLOv5-BIFPN | 93.7 | 87.3 | 95.8 | 90.4
Proposed Model | 95.8 | 89.2 | 96.5 | 92.4
Table 5. Comparison of performance with different attention mechanisms.
Model | P (%) | R (%) | mAP (%) | F1 (%)
YOLOv5 | 92.3 | 88.8 | 95.7 | 90.5
YOLOv5-CA | 87.7 | 91.2 | 94.5 | 89.4
YOLOv5-ECA | 91.3 | 90.1 | 94.9 | 90.7
YOLOv5-CBAM | 89.4 | 91.1 | 94.7 | 90.2
Proposed Model | 95.8 | 89.2 | 96.5 | 92.4
Table 6. Error analysis: false positive (FP) and false negative (FN) rates of different detection models.
Model | FP Rate (%) | FP Count | FN Rate (%) | FN Count | Std Dev (FP) | Std Dev (FN)
YOLOv3 | 4.2 | 168 | 2.7 | 108 | 0.35 | 0.28
YOLOv4 | 6.1 | 244 | 4.5 | 180 | 0.52 | 0.47
SSD | 5.2 | 208 | 3.3 | 132 | 0.41 | 0.33
YOLOv5 | 4.5 | 180 | 2.1 | 84 | 0.38 | 0.25
Proposed Model | 3.1 | 124 | 1.9 | 76 | 0.27 | 0.19