
Early Sweet Potato Plant Detection Method Based on YOLOv8s (ESPPD-YOLO): A Model for Early Sweet Potato Plant Detection in a Complex Field Environment

Kang Xu, Wenbin Sun, Dongquan Chen, Yiren Qing, Jiejie Xing and Ranbing Yang
1 College of Information and Communication Engineering, Hainan University, Haikou 570228, China
2 Key Laboratory of Tropical Intelligent Agricultural Equipment, Ministry of Agriculture and Rural Affairs, Hainan University, Danzhou 571737, China
3 College of Mechanical and Electrical Engineering, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(11), 2650; https://doi.org/10.3390/agronomy14112650
Submission received: 19 October 2024 / Revised: 3 November 2024 / Accepted: 8 November 2024 / Published: 11 November 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Traditional pest control methods for sweet potatoes waste pesticides and pollute the land, whereas deep-learning-based object detection can guide the precise spraying of pesticides onto sweet potato plants and keep most of the pesticide out of the soil. To address the low detection accuracy of sweet potato plants in natural environments and the complexity of existing detection models, an improved algorithm based on YOLOv8s is proposed that accurately identifies early sweet potato plants. First, the method uses an efficient network model to enhance information flow in the channels, obtain more effective global features in the high-level semantic structure, and reduce model parameters and computational complexity. Then, cross-scale feature fusion and a generalized efficient aggregation architecture are used to further enhance the network's feature extraction capability. Finally, the loss function is replaced with InnerFocaler-IoU (IFIoU) to improve the convergence speed and robustness of the model. Experimental results showed that the mAP0.5 and model size of the improved network reached 96.3% and 7.6 MB. Compared with the YOLOv8s baseline, the number of parameters was reduced by 67.8%, the computational cost was reduced by 53.1%, and mAP0.5:0.95 increased by 3.5%. The improved algorithm achieves higher detection accuracy with fewer parameters and less computation. This method realizes the accurate detection of sweet potato plants in the natural environment and provides technical support and guidance for reducing pesticide waste and pesticide pollution.

1. Introduction

Sweet potato is characterized by high and stable yield, strong adaptability and rich nutrition. With the large-scale development of sweet potato planting and the cross-regional transportation of seedlings, the prevalence of diseases and pests has intensified, posing a serious threat to the sweet potato industry. With traditional pest control methods, only 30% of pesticides reach the crops, and most of the rest enters the soil, air and water [1], contaminating the sweet potatoes. Currently, sweet potato plants can be detected with deep learning models, and the sprayer can then be adjusted to spray the plants precisely, preventing most of the pesticide from entering the soil. However, because sweet potatoes grow irregularly in the field, plant detection accuracy is low, which hinders precise pesticide spraying.
Traditional target detection algorithms are mainly based on manually extracted features. However, traditional algorithms have slow detection speed, weak robustness and limited versatility in complex natural environments [2], and their recognition efficiency fails to meet the requirements of spraying machinery in agricultural production [3]. Target detection algorithms based on deep convolutional networks detect well and can be applied in natural field environments. At present, deep convolutional networks develop along two lines. One is two-stage networks, such as R-CNN [4] and Faster R-CNN [5,6]; the other is one-stage networks, such as the YOLO [7,8,9] and SSD [10] algorithms. Since one-stage networks detect faster, they are better suited to field crops and offer real-time detection capability. We focus on optimizing a one-stage detection algorithm to enhance the recognition of sweet potato plants.
In research on one-stage networks, there are currently two main improvement strategies: using attention mechanisms, residual structures and dedicated modules to enhance the network's ability to learn key features, and designing more efficient feature fusion networks. Some researchers have integrated feature extraction networks with other plug-and-play deep learning modules so that the model can capture more complex and diverse information, thereby achieving more accurate and robust detection in various tasks. Tian et al. [11] combined DenseNet to recognize crops under overlapping and occluded conditions. Lu et al. [12] fused the Swin-Transformer model, achieving an average improvement of 4% over YOLOv5. Tian et al. [13] combined DenseNet blocks with adaptive attention modules (AAMs), achieving an mAP0.5 of 86.2% and an accuracy improvement of 1.6%. Gai et al. [14] integrated DenseNet layers into the YOLOv4 backbone and changed the shape of the prior boxes, improving the accuracy of the network by 0.15. Li et al. [15] used ConvNeXt and the Swin Transformer to improve YOLOv7. Other researchers have integrated attention mechanisms with existing modules to focus on regions of interest. Chen et al. [16] incorporated the SE attention mechanism into YOLOv4 to mitigate the conflicts arising from multi-scale feature fusion. Du et al. [17] introduced the Shuffle Attention mechanism and Deformable Convolution v3 into the YOLOv7 model. Ye et al. [18] introduced the second-order channel attention (SOCA) mechanism and simplified the spatial pyramid pooling (SPP) structure; the improved YOLOv5 gained 2.5% in mAP0.5. Some researchers have also developed innovative feature fusion modules that enable the model to capture finer details and contextual information: Res2Net was combined with a dual feature pyramid network [19], and the ELAN module was combined with a bidirectional feature fusion pyramid [20] to integrate features as fully as possible. Although current one-stage networks have achieved certain advantages, their structures are still relatively complex and not well suited to recognizing sweet potatoes in natural environments.
Therefore, there is an urgent need to design network structures with low parameter counts and computational complexity. MobileNet [21], ShuffleNet [22] and GhostNet [23] are representative low-complexity neural networks, and other researchers have applied low-complexity operations to mainstream algorithms such as R-CNN, YOLO and SSD, mainly in three ways. (1) Some researchers have introduced techniques such as Ghost convolution and depthwise separable convolution (DSC) into deep models. Ghost convolution reduces the parameter scale by generating fewer redundant features, while DSC greatly reduces the computational cost by simplifying the convolution process. Dong et al. [24] introduced C3Ghost and the Ghost module into YOLOv5; FLOPs were reduced by 15.24% and parameters by 19.37%. Yang et al. [25] used DSC to replace the ordinary convolutions of YOLOv8, reducing parameters by 27.7%. Zhang et al. [26] introduced Ghost and Transformer blocks, reducing parameters by 10% and improving detection accuracy by 1.3%. Tian et al. [27] introduced the C3Faster module into YOLOv5s and replaced conventional convolution with DSC, reducing the parameters and the amount of computation. (2) Some researchers have compressed the model size by adjusting the number of object detection layers. Li et al. [28] adjusted the number of detection layers, introduced depthwise separable convolution and added Ghost modules to reduce parameters. Cao et al. [29] replaced the three detection heads with two, reducing the network parameters and computational complexity. (3) In addition, some researchers have replaced the backbone with a lightweight network or pruned the network, which greatly reduces model parameters but can degrade other indicators. Deng et al. [30] combined the ideas of VoVNet and ShuffleNetV2 to make the network lighter, with mAP0.5 increased by 8.3% and parameters reduced by 10%, but the computational complexity increased by 82%. Li et al. [31] combined GhostNet and depthwise separable convolution, improving the model's accuracy by 1.08% while reducing parameters by 82.36% and FLOPs by 89.11%. Lyu and Li [32] introduced a channel attention module to prune the network and optimized the model hyperparameters; the parameters and FLOPs were reduced by 14.25% and 2.52%, respectively. Tang et al. [33] combined a pruned MobileNetV3 with YOLOv4, reducing the parameters by 34.37%, but the accuracy dropped by 0.82%.
Although deep convolutional networks can better capture the key features of crops, are more robust to external interference in field environments and better meet the field detection requirements of sweet potatoes, existing methods still detect sweet potato plants with low accuracy and efficiency. This is because sweet potatoes are planted in a single-ridge, double-row pattern with high density and severe occlusion, and the growth conditions of individual plants vary greatly. In addition, only Silvia et al. [34] have optimized the Faster R-CNN model to identify sweet potato leaves; no study has detected whole sweet potato plants. Therefore, for early sweet potato plants in complex natural environments, we propose an early sweet potato plant detection method based on YOLOv8s (ESPPD-YOLO). The main contributions are as follows: (1) To reduce computational complexity and model parameters, an efficient network based on the coordinate attention (CA) mechanism is proposed to optimize the backbone of YOLOv8. (2) To reduce the model size while further improving accuracy, the neck of the original YOLOv8s is replaced with a feature fusion framework based on an efficient aggregation network. (3) Based on Inner-IoU and Focaler-IoU, a new detection box loss function, InnerFocaler-IoU (IFIoU), is proposed to improve the convergence speed. (4) By introducing efficient feature extraction and fusion modules, the model achieves higher accuracy and greater robustness for sweet potato plants under complex backgrounds. (5) The proposed method significantly reduces network parameters and computational complexity while maintaining high accuracy and robustness.
The rest of this paper is organized as follows. Section 2 introduces the sweet potato plant dataset and model details. Section 3 provides evaluation results. Section 4 discusses future research directions. The last section summarizes the effects of the model improvements.

2. Materials and Methods

2.1. Sweet Potato Plant Dataset

We took early sweet potato plants as the research object; the images were collected in Haitou Town, Danzhou City, Hainan Province. The data collection period was from 4 March 2024 to 9 May 2024. The image acquisition device was a Redmi K70 mobile phone (Xiaomi Company, Beijing, China) with a 50-megapixel camera. The original images were collected in JPG format at a resolution of 3598 × 2022 pixels. Sweet potato images were collected from different shooting angles, and a total of 1980 photos were obtained, as shown in Figure 1.
The collected sweet potato plant images were manually annotated; plants occluded by more than two-thirds were not annotated. The annotated dataset was divided into training, validation and test sets at a ratio of 80%, 10% and 10%, as shown in Table 1. The images were then expanded and enhanced using cropping, rotation, translation, brightness adjustment, noise, HSV shifts, cutout and vertical flipping. Each image was augmented with a random combination of these eight methods and expanded into three images. The numbers of images in the expanded training, validation and test sets were 4749, 594 and 597, respectively.
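As an illustration of this augmentation pipeline, the sketch below composes the eight operations with the Albumentations library; the library choice, the probabilities and the crop size are assumptions, since the paper does not report its implementation. Bounding boxes in YOLO format are transformed together with the image.

```python
# Hedged sketch: Albumentations is assumed here for illustration only.
import cv2
import albumentations as A

augment = A.Compose(
    [
        A.RandomCrop(height=1600, width=2880, p=0.3),              # cropping
        A.Rotate(limit=15, p=0.5),                                  # rotation
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=0, p=0.5),                  # translation
        A.RandomBrightnessContrast(brightness_limit=0.2, p=0.5),    # brightness
        A.GaussNoise(p=0.3),                                        # noise
        A.HueSaturationValue(p=0.5),                                # HSV shift
        A.CoarseDropout(p=0.3),                                     # cutout
        A.VerticalFlip(p=0.5),                                      # up-down flip
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

def expand_image(path, bboxes, labels, copies=3):
    """Generate `copies` randomly augmented versions of one annotated image."""
    image = cv2.imread(path)
    results = []
    for _ in range(copies):
        t = augment(image=image, bboxes=bboxes, class_labels=labels)
        results.append((t["image"], t["bboxes"], t["class_labels"]))
    return results
```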

2.2. The Network Structure of YOLOv8

This paper makes improvements based on YOLOv8s, as shown in Figure 2. The YOLOv8 network consists mainly of a backbone network, SPPF, a path aggregation network–feature pyramid network (PAN-FPN) and a decoupled head. YOLOv8 adopts a variety of optimization modules and structural designs, including the C2f module, SPPF (spatial pyramid pooling, fast version), residual connections and the bottleneck structure in the backbone. These improvements not only effectively reduce model parameters but also significantly enhance feature extraction capabilities. By combining upsampling and downsampling operations in the PAN-FPN part, semantic features are deeply integrated with spatial information, improving the expressiveness of the feature map. The head generates the bounding box, category and confidence score of the object. However, YOLOv8 still suffers from missed, false and repeated detections of sweet potato plants. Improving YOLOv8 raises the detection accuracy of sweet potato plants so that each plant can be accurately sprayed with pesticides, avoiding missed spraying and over-spraying.

2.3. Improved Network Structure

Methods based on deep neural networks have relatively complex structures, so accuracy needs to be improved while parameters are reduced. This study therefore proposes an early sweet potato detection method based on YOLOv8s, composed mainly of an efficient model based on coordinate attention (EMCA), an efficient feature fusion framework (EFFF) and the head, as shown in Figure 3. EMCA consists of inverted residual mobile blocks based on CA (RMBCA), which enhance the extraction of features of interest while exploiting the efficiency of the structure to reduce unnecessary computation and structural complexity. The Transformer structure and CA are applied in RMBCA to obtain more effective global features in high-level semantic structures. In the EFFF module, convolution is applied to the three feature layers O3, O4 and O5, and the fused features are then extracted using the generalized efficient layer aggregation network (GELAN). The efficiency of this architecture strengthens the network's feature extraction capability while reducing unnecessary parameters and computation. Finally, combining the characteristics of Inner-IoU and Focaler-IoU, a new detection box loss function, IFIoU, is proposed. The IFIoU loss improves the robustness of the model by measuring the overlap between boxes more accurately. ESPPD-YOLO reduces model parameters and computation and improves the recognition accuracy of occluded plants.

2.3.1. Feature Extraction Network Based on Inverted Residual Mobile Block

The backbone of the YOLOv8 model cannot dynamically interact with long-range features, and the connections between regions of interest are limited. The efficient model (EMO) exploits the characteristics of convolutional networks and Transformer structures to build a low-complexity module, iRMB [35], which has both the local feature capture ability of convolution and the long-range information interaction ability of the Transformer. Convolutional neural networks tend to focus on the detailed features of objects, which makes them perform well at extracting features in local areas. Transformer-based models focus on the global information of objects and can effectively capture long-range dependencies in the input. EMO can therefore focus on object features more accurately while maintaining the ability to perceive global areas.
The EMO model is built entirely from the inverted residual mobile block (iRMB), which consists of a depthwise separable convolution and an attention mechanism that model short-range and long-range dependencies, respectively. Two 1 × 1 convolutions are used only to expand and reduce the number of channels, as shown in Figure 4a. To improve attention to the local details of objects, we introduce CA [36] into the iRMB. As shown in Figure 4b, the improved iRMB is called RMBCA.
CA enhances feature representation by weighting feature channels, emphasizing important channels and suppressing unimportant channels, and paying more attention to key information when processing data. As shown in Figure 5, CA uses two one-dimensional global pooling operations to aggregate the information of the input features in the vertical and horizontal directions, respectively. Then, the aggregated information in different directions is generated into two feature maps through two-dimensional convolution, effectively capturing the feature distribution in different directions. Next, these two feature maps are encoded into two attention maps, and the important channels are weighted, which can enhance the selectivity of key features and reduce redundant information. The core idea of CA is to introduce position information and capture the spatial distribution features, thereby more comprehensively capturing the dependency between features.
As shown in Figure 4c, we combine five groups of RMBCA modules into an EMCA module to replace the backbone in YOLOv8. EMCA pays more attention to the region of interest, provides strong semantic information for the network, and further achieves better accuracy with fewer parameters.
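The following PyTorch sketch illustrates coordinate attention and how it can be slotted into an inverted-residual block in the spirit of RMBCA. It is a simplified illustration under stated assumptions: the reduction ratio, the channel widths and the omission of iRMB's attention branch are ours, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def h_swish(x):
    return x * F.relu6(x + 3) / 6

class CoordAtt(nn.Module):
    """Coordinate attention: directional pooling, shared 1x1 transform, channel re-weighting."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # aggregate along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # aggregate along the height
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (b, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (b, c, w, 1)
        y = h_swish(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # attention along width
        return x * a_h * a_w

class RMBCA(nn.Module):
    """Simplified inverted-residual block with CA inserted (the paper's RMBCA also
    carries the iRMB attention branch, omitted here for brevity)."""
    def __init__(self, channels, expand=4):
        super().__init__()
        mid = channels * expand
        self.expand = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                    nn.BatchNorm2d(mid), nn.SiLU())
        self.dw = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1, groups=mid),
                                nn.BatchNorm2d(mid), nn.SiLU())
        self.ca = CoordAtt(mid)
        self.project = nn.Sequential(nn.Conv2d(mid, channels, 1),
                                     nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.project(self.ca(self.dw(self.expand(x))))

# quick shape check
print(RMBCA(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```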

2.3.2. Efficient Feature Fusion Framework

We propose a new efficient feature fusion framework (EFFF) based on generalized efficient layer aggregation network (GELAN) [37]. As shown in Figure 6, we add 1 × 1 convolution modules to the output ends of O3, O4 and O5 in the ESPPD-YOLO to enable the network to learn more advanced semantic feature information. Then, GELAN is used to replace C2f to further reduce the complexity of the network structure.
GELAN combines the designs of Cross-Stage Partial Network (CSPNet) [38] and ELAN [6], adopts the concepts of segmentation and reorganization of CSPNet, and introduces the hierarchical convolution processing method of ELAN in each part. CSPNet passes part of the input feature map through a more complex convolution path, while the other part passes through a relatively simple bypass. By separating the paths, different parts can propagate gradients in different ways, which makes the gradient information richer. After the features are propagated through different paths, they are re-fused together through the cross-layer hierarchical structure to ensure effective feature combination and gradient flow integration. The ELAN is mainly composed of VoVNet [39] combined with CSPNet, and uses a stack structure in the computing block to optimize the gradient length, which effectively solves the problem that the convergence of deep models will gradually deteriorate when the model is scaled. GELAN optimizes the flow of information in the network through a carefully designed gradient path and reduces the loss of information during network transmission. By using the Cross-Stage Partial Network structure in CSPNet, GELAN can effectively aggregate features at different stages of the network and enhance the expressiveness of features. This feature aggregation mechanism not only improves the model’s ability to detect targets but also improves the model’s generalization ability for targets of different scales and shapes.
The GELAN structure is shown in Figure 7. First, the channels of the input data are widened through a 1 × 1 convolution, and the output then passes through two different paths: one part goes through a residual structure composed of RepConv, while the other part is directly concatenated with the RepConv output; finally, the channels are compressed through another convolution. GELAN combines CSPNet and ELAN, in which RepConv extracts features through a multi-branch structure during training and is re-parameterized into a dedicated single-branch structure during inference, so inference speed is not affected. RepConv has three branches: the left branch applies a 3 × 3 convolution to capture local spatial information, the middle branch passes the input through directly, and the last branch applies a 1 × 1 convolution to compress and reduce the dimension. Finally, the outputs of the three branches are added together. GELAN has higher parameter utilization and shows clear advantages in being light, fast and accurate.
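To make the split-and-reparameterize idea concrete, the sketch below shows the training-time form of RepConv (3 × 3, identity and 1 × 1 branches summed) and a GELAN-style split/concatenate/compress block following the description above. It is an illustrative approximation under our own channel-width assumptions, not the reference YOLOv9 implementation, and the inference-time fusion into a single 3 × 3 convolution is omitted.

```python
import torch
import torch.nn as nn

class RepConv(nn.Module):
    """Training-time RepConv: a 3x3 branch, a 1x1 branch and an identity (BN) branch
    whose outputs are summed; at inference these can be fused into one 3x3 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(c_out))
        self.conv1 = nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                   nn.BatchNorm2d(c_out))
        self.identity = nn.BatchNorm2d(c_out) if c_in == c_out else None
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.conv3(x) + self.conv1(x)
        if self.identity is not None:
            y = y + self.identity(x)          # pass-through branch
        return self.act(y)

class GELANBlock(nn.Module):
    """GELAN-style block: widen channels, split into a RepConv path and a bypass,
    concatenate all paths, then compress with a 1x1 convolution."""
    def __init__(self, c_in, c_out, hidden=None):
        super().__init__()
        hidden = hidden or c_out // 2
        self.widen = nn.Conv2d(c_in, 2 * hidden, 1)
        self.rep = nn.Sequential(RepConv(hidden, hidden), RepConv(hidden, hidden))
        self.compress = nn.Conv2d(3 * hidden, c_out, 1)

    def forward(self, x):
        a, b = self.widen(x).chunk(2, dim=1)   # split the widened feature map
        c = self.rep(b)                        # path through stacked RepConv blocks
        return self.compress(torch.cat([a, b, c], dim=1))
```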

2.3.3. Loss Function Improvement

Since sweet potatoes are affected by the surrounding environment, the growth conditions of each plant differ. In some areas, sweet potatoes grow densely and occlude one another, so the collected data have an uneven distribution of difficult and easy samples. The CIoU [40] loss function used by YOLOv8 is not well suited to data with mutual occlusion and uneven distributions of difficult and easy samples. We therefore merge the Inner-IoU [41] and Focaler-IoU [42] loss functions into IFIoU to enhance the recognition performance for sweet potatoes.
Inner-IoU controls the generation of auxiliary bounding boxes through the scaling factor $ratio$. As shown in Figure 8, the blue solid box represents the ground-truth target box, and the blue dashed box represents the original prediction box. The yellow solid box represents the generated auxiliary target box InnerTarget, and the yellow dashed box represents the generated auxiliary prediction box InnerAnchor. As shown in Equations (1)–(4), the InnerTarget and InnerAnchor boxes are generated by scaling with $ratio$.
$$x_l^{gt} = x_c^{gt} - \frac{w^{gt} \cdot ratio}{2}, \quad x_r^{gt} = x_c^{gt} + \frac{w^{gt} \cdot ratio}{2} \tag{1}$$
$$y_u^{gt} = y_c^{gt} - \frac{h^{gt} \cdot ratio}{2}, \quad y_d^{gt} = y_c^{gt} + \frac{h^{gt} \cdot ratio}{2} \tag{2}$$
$$x_l = x_c - \frac{w \cdot ratio}{2}, \quad x_r = x_c + \frac{w \cdot ratio}{2} \tag{3}$$
$$y_u = y_c - \frac{h \cdot ratio}{2}, \quad y_d = y_c + \frac{h \cdot ratio}{2} \tag{4}$$
where $(x_c^{gt}, y_c^{gt})$ and $(x_c, y_c)$ are the center coordinates of the target box and the anchor box, respectively; $(x_l^{gt}, y_c^{gt})$, $(x_r^{gt}, y_c^{gt})$, $(x_c^{gt}, y_u^{gt})$ and $(x_c^{gt}, y_d^{gt})$ are the midpoints of the four sides of the InnerTarget box; $(x_l, y_c)$, $(x_r, y_c)$, $(x_c, y_u)$ and $(x_c, y_d)$ are the midpoints of the four sides of the InnerAnchor box; $w^{gt}$ and $h^{gt}$ are the width and height of the target box; and $w$ and $h$ are the width and height of the anchor box.
The bounding box regression loss function is calculated by generating auxiliary boxes InnerTarget and InnerAnchor, as shown in Equations (5)–(7).
$$inter = \left(\min(x_r^{gt}, x_r) - \max(x_l^{gt}, x_l)\right)\left(\min(y_d^{gt}, y_d) - \max(y_u^{gt}, y_u)\right) \tag{5}$$
$$union = w^{gt} h^{gt} (ratio)^2 + w h (ratio)^2 - inter \tag{6}$$
$$IoU_{inner} = \frac{inter}{union} \tag{7}$$
where $IoU_{inner}$ denotes the Inner-IoU, and $inter$ and $union$ denote the intersection and union of the InnerTarget box and the InnerAnchor box, respectively.
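A direct translation of Equations (1)–(7) into code is sketched below; the clamps guarding against non-overlapping or degenerate boxes are additions for numerical safety and are not part of the equations.

```python
import torch

def inner_iou(target, anchor, ratio=0.75):
    """Inner-IoU: IoU of auxiliary boxes obtained by scaling the target and anchor
    boxes about their centres by `ratio`. Boxes are (xc, yc, w, h) tensors."""
    xc_gt, yc_gt, w_gt, h_gt = target.unbind(-1)
    xc, yc, w, h = anchor.unbind(-1)

    # auxiliary (inner) target box, Eqs. (1)-(2)
    xl_gt, xr_gt = xc_gt - w_gt * ratio / 2, xc_gt + w_gt * ratio / 2
    yu_gt, yd_gt = yc_gt - h_gt * ratio / 2, yc_gt + h_gt * ratio / 2
    # auxiliary (inner) anchor box, Eqs. (3)-(4)
    xl, xr = xc - w * ratio / 2, xc + w * ratio / 2
    yu, yd = yc - h * ratio / 2, yc + h * ratio / 2

    # intersection, union and Inner-IoU, Eqs. (5)-(7)
    inter_w = (torch.minimum(xr_gt, xr) - torch.maximum(xl_gt, xl)).clamp(min=0)
    inter_h = (torch.minimum(yd_gt, yd) - torch.maximum(yu_gt, yu)).clamp(min=0)
    inter = inter_w * inter_h
    union = w_gt * h_gt * ratio ** 2 + w * h * ratio ** 2 - inter
    return inter / union.clamp(min=1e-7)
```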
The distribution of difficult and easy samples has a great impact on target detection. When difficult samples dominate, we need to focus on them to improve detection performance; when simple samples dominate, the opposite is true. We use the Focaler-IoU method to remap the Inner-IoU through a linear interval mapping so that the loss focuses on difficult or easy samples as needed. IFIoU is calculated according to Equation (8):
$$IoU_{IF} = \begin{cases} 0, & IoU_{inner} < d \\ \dfrac{IoU_{inner} - d}{u - d}, & d \le IoU_{inner} \le u \\ 1, & IoU_{inner} > u \end{cases} \tag{8}$$
where $IoU_{IF}$ is the fused IFIoU, and $d, u \in [0, 1]$ are the lower and upper bounds of the linear interval.
Combining Equations (7) and (8), the loss function of IFIoU is
$$L_{IFIoU} = 1 - IoU_{IF} \tag{9}$$
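Building on the Inner-IoU sketch above, the function below applies the linear interval mapping of Equation (8) and returns the loss of Equation (9). The default values of ratio, d and u are placeholders for illustration; the paper does not report its settings here.

```python
def ifiou_loss(target, anchor, ratio=0.75, d=0.0, u=0.95):
    """IFIoU loss: Inner-IoU remapped through the Focaler interval [d, u], then 1 - IoU_IF."""
    iou_inner = inner_iou(target, anchor, ratio)             # from the sketch above
    iou_if = ((iou_inner - d) / (u - d)).clamp(0.0, 1.0)     # piecewise-linear mapping, Eq. (8)
    return 1.0 - iou_if                                       # Eq. (9)
```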

3. Analysis of the Experiment and Results

3.1. The Indicators of Evaluation

The model is evaluated from two dimensions: performance and complexity. Performance indicators include precision (P), which measures the proportion of positive samples detected by the model that are actually positive samples; average precision (AP), which is a combination of precision and recall, obtained by plotting the precision–recall curve and calculating its area; recall (R), which measures the proportion of positive samples detected by the model to all true positive samples; and mean average precision (mAP) in multiple categories, in which the AP of each category is calculated and the average is taken. The mAP is calculated under different IoU thresholds. The equations are shown in (10)–(13):
$$R = \frac{TP}{TP + FN} \times 100\% \tag{10}$$
$$P = \frac{TP}{TP + FP} \times 100\% \tag{11}$$
$$AP = \int_0^1 P(R)\, dR \tag{12}$$
$$mAP = \frac{\sum_{i=1}^{n} AP_i}{n} \tag{13}$$
where $TP$ denotes correctly detected positive samples, $FN$ denotes positive samples that were not detected, $FP$ denotes incorrectly detected negative samples, and $n$ is the number of categories.
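The sketch below shows one way these metrics can be computed numerically; the monotonic interpolation of the precision–recall curve is a common convention and an assumption here, not the exact evaluation code used in the paper.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, Eqs. (10)-(11), in percent."""
    return tp / (tp + fp) * 100, tp / (tp + fn) * 100

def average_precision(precisions, recalls):
    """Area under the precision-recall curve, Eq. (12); inputs sorted by increasing recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])            # make precision monotonically decreasing
    return np.trapz(p, r)

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class AP values, Eq. (13)."""
    return sum(ap_per_class) / len(ap_per_class)
```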
In complexity evaluation, three main indicators are considered: the number of model parameters, floating-point operations [28] and the model size.

3.2. Experimental Parameter Configuration

All experiments were conducted in the same environment to ensure the consistency and comparability of the results. The detailed information is listed in Table 2, including parameters such as processor, graphics card model, memory size, operating system and related software versions.
The parameter settings of ESPPD-YOLO were as follows: 300 training epochs, a learning rate of 0.01, a momentum coefficient of 0.936, a weight decay coefficient of 0.0006, a batch size of 32, and the use of 16 workers. SGD was selected as the optimizer.
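For reference, a training call with these hyperparameters might look like the sketch below, assuming the Ultralytics YOLOv8 Python API; the dataset YAML name is hypothetical, and the modified ESPPD-YOLO architecture file is not shown.

```python
from ultralytics import YOLO

# Hyperparameters taken from the text; "sweet_potato.yaml" is a placeholder path.
model = YOLO("yolov8s.yaml")
model.train(
    data="sweet_potato.yaml",
    epochs=300,
    batch=32,
    workers=16,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.936,
    weight_decay=0.0006,
)
```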
Figure 9 shows the ESPPD-YOLO training results. The precision of the ESPPD-YOLO model is 92.0%, the recall rate is 91.4%, the mAP0.5 is 96.3%, and the mAP0.5:0.95 is 80.6%.

3.3. Comparative Experiments of Different Attention

We introduced a backbone network EMO in YOLOv8s to reduce parameters and computation. At the same time, to improve the model effect, the CA module is integrated into EMO. To analyze the improvement effect of different attention mechanisms on the model, the original EMO model is compared with the EMO model that integrated the CA, CBAM, SE, and EMA attention mechanisms. The models after the introduction of the module are called YOLO-EMCA, YOLO-EMCBAM, YOLO-EMSE, and YOLO-EMEMA, respectively.
Integrating attention mechanisms into the EMO model helps improve detection, but it can increase the number of parameters and the amount of computation, as shown in Table 3. The numbers of parameters of the models integrating the CBAM, SE and EMA attention mechanisms increased by 38.8%, 81.9% and 22.6%, respectively. In addition, the amount of computation of the model integrating the EMA attention mechanism increased by 63.1%. As shown in Table 3, the model incorporating the CA mechanism improves mAP0.5 by 0.8% and mAP0.5:0.95 by 2.5% while keeping the model complexity essentially unchanged. Therefore, the EMO model with the CA attention mechanism gives the best improvement.

3.4. Ablation Experiment

We analyze the impact of removing or adding specific modules on the overall performance of the model. To fully verify the improvements, this study selected five network variants and conducted ablation experiments against the original network. This process clarifies the specific contribution of each module to model performance and ensures that the improved network performs well in practical applications. Improvement 1 replaces the backbone network with EMCA, Improvement 2 replaces the neck with the EFFF framework, Improvement 3 replaces the loss function with IFIoU, Improvement 4 adds the EMCA and EFFF modules, and Improvement 5 adds the EMCA, EFFF and IFIoU modules. The ablation results are shown in Table 4, where "×" indicates that the model does not use the module and "√" indicates that it does.
As shown in Table 4, after adding EMCA to YOLOv8s, the complexity of the model is significantly reduced: FLOPs and parameters decrease by 29.3% and 30.8%, respectively. Detection accuracy also improves, with mAP0.5 and mAP0.5:0.95 increasing by 0.8% and 2.5%, respectively. This is because EMCA, built from the efficient aggregation module RMBCA, extracts key information from the input image at very low computational overhead, while the CA module effectively captures the spatial relationship between distant pixels and organically integrates local and global information. While reducing computational complexity, the model can still mine useful information deep in the features, thereby improving the network's perception ability and overall performance. After replacing the neck of YOLOv8s with the EFFF framework, mAP0.5 and mAP0.5:0.95 increase by 0.5% and 0.4%, respectively, and the number of parameters and the amount of computation decrease by 37.2% and 24.4%, respectively. The EFFF framework uses the efficient aggregation module GELAN, which fuses multi-scale features while enhancing the feature extraction ability of the model. After replacing the detection box loss function in YOLOv8s with IFIoU, there is no additional increase in model parameters or computation, and the model accuracy is slightly improved.
Combining the EMCA module with the EFFF framework allows the improved YOLO model to significantly reduce the complexity of the model while maintaining detection performance. mAP0.5 and mAP0.5:0.95 are improved by 0.9% and 3.0%, respectively, and the number of parameters, the amount of calculation, and the model size are reduced by 67.8%, 53.1%, and 66.2%, respectively. Finally, the original YOLOv8s model is combined with EMCA, EFFF, and IFIoU, and mAP0.5:0.95 is further improved by 0.5% without affecting the complexity and calculation of the model.
By replacing the EMCA backbone network and EFFF framework and improving the loss function, ESPPD-YOLO outperforms YOLOv8s in terms of accuracy, number of parameters, FLOPs, and model size. Compared with YOLOv8s, the parameters, FLOPs, and model size of ESPPD-YOLO are reduced by 67.8%, 53.1%, and 66.2%, respectively. At the same time, the comprehensive evaluation accuracy indicators mAP0.5 and mAP0.5:0.95 are increased by 0.9% and 3.5%, respectively. The impact of adding different modules on the accuracy of the model is shown in Figure 10. In the process of continuous optimization and adding various modules, the average precision of the model showed a trend of gradual improvement. The introduction of each module had a positive effect on the performance of the model and significantly enhanced the ability to extract and process features.

3.5. Comparison Experiment of the Mainstream One-Stage Algorithm

In order to verify the effectiveness of the improved network, seven models—Faster R-CNN, SSD, YOLOv5s, YOLOv7, YOLOv8s, YOLOv9s and YOLOv10s—are selected for comparison with ESPPD-YOLO. As shown in Table 5, considering accuracy, the number of parameters, the amount of computation and model size, the ESPPD-YOLO model outperforms the other networks in terms of accuracy and complexity. Specifically, the mAP0.5 of ESPPD-YOLO is 7.0%, 4.4%, 0.5%, 0.1%, 0.9%, 0.5% and 1.0% higher than that of Faster R-CNN, SSD, YOLOv5s, YOLOv7, YOLOv8s, YOLOv9s and YOLOv10s, respectively; its mAP0.5:0.95 is 24.5%, 19.9%, 5.9%, 6.4%, 3.5%, 1.9% and 3.7% higher, respectively; its number of parameters is 87.3%, 86.3%, 49.0%, 90.3%, 67.8%, 50.1% and 55.4% lower, respectively; and its computational cost is 98.5%, 78.6%, 15.1%, 87.2%, 53.1%, 51.4% and 46.6% lower, respectively. These results demonstrate the effectiveness of ESPPD-YOLO in detecting sweet potato plants. ESPPD-YOLO performs very well in terms of model complexity and accuracy and has good practical application potential in production scenarios.

3.6. Performance Analysis of Algorithms in Complex Environments

In order to verify the applicability of the proposed ESPPD-YOLO model for sweet potato detection, the detection results of different models in typical complex scenes are compared, as shown in Figure 11 and Figure 12. Faster R-CNN, SSD, YOLOv5s, YOLOv7, YOLOv8s, YOLOv9s, YOLOv10s and ESPPD-YOLO can all accurately detect most sweet potatoes. However, when the sweet potato plants are dense or the targets are small, all models except ESPPD-YOLO show problems such as missed detection, false detection, repeated recognition and incomplete target recognition.
As shown by the black ellipses in Figure 11, Faster R-CNN, YOLOv5s, YOLOv8s and YOLOv9s identify one sweet potato as multiple plants when the plants are lush, span a large area or are connected with neighbouring plants, resulting in duplicate recognition. The yellow ellipses show that Faster R-CNN, SSD, YOLOv7, YOLOv8s, YOLOv9s and YOLOv10s miss sweet potato detections; these models tend to miss small targets located close to normal-sized targets. SSD can hardly detect small-target sweet potatoes, and YOLOv7 misses some sweet potato objects at the edge of the image. As shown by the blue ellipses, Faster R-CNN, SSD, YOLOv7, YOLOv8s, YOLOv9s and YOLOv10s identify two sweet potatoes that are close to each other as one. In the purple ellipses in Figure 11b,e,g, SSD, YOLOv8s and YOLOv10s cannot accurately delineate a single sweet potato, either missing part of the plant or including the image background as part of it. ESPPD-YOLO can accurately recognize occluded sweet potatoes, sweet potatoes at the edge of the image and small-target sweet potatoes, and its detection performance and robustness are higher than those of the other models.
In Figure 12, Faster R-CNN, SSD, YOLOv5, YOLOv8s and YOLOv9s identify one sweet potato as multiple plants, resulting in repeated recognition, as shown by the black ellipses. In the first image on the left of Figure 12a,b,d–g, a sweet potato is missed: two plants are identified as one and the other plant is not detected, as shown by the yellow ellipses. Faster R-CNN does not identify the small-target sweet potato at the edge of the image, and SSD cannot accurately identify small-target sweet potatoes. In the rightmost images of Figure 12b,g, the sweet potato plant in the lower right corner is ignored by SSD and YOLOv10s; YOLOv7 and YOLOv8s show the same missed detections. As shown by the purple ellipses, although Faster R-CNN, SSD, YOLOv5, YOLOv8s, YOLOv9s and YOLOv10s detect the sweet potato, they also include the background as part of the plant, so the recognition is inaccurate.
As can be seen from Figure 11 and Figure 12, ESPPD-YOLO achieves a higher detection rate, lower missed detection rate, and lower false detection rate compared to the original YOLOv8s network. When objects are densely distributed, ESPPD-YOLO’s prediction boxes overlap less. When objects are severely occluded, ESPPD-YOLO’s detection results are relatively more accurate. In general, ESPPD-YOLO achieves better detection results with lower parameters and computational complexity.

3.7. Model Visualization Analysis

Heatmaps [43] represent the distribution information of objects through different shades of colors, where hot tones (such as red and yellow) represent high-activity or high-importance areas, and cold tones (such as blue and green) represent low-activity or low-importance areas. The darker the hot tones in the heatmap, the greater the impact of the location on the detection performance. This section analyzes the feature extraction capabilities of the YOLOv8s and ESPPD-YOLO algorithms by comparing their heatmaps.
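A minimal sketch of the rendering step is given below; it assumes a 2-D activation or saliency map (e.g., a Grad-CAM output) has already been computed for the detector, and only shows the colour-mapped overlay with OpenCV.

```python
import cv2
import numpy as np

def overlay_heatmap(image_bgr, activation, alpha=0.4):
    """Overlay a normalised activation map on the image; warm colours mark regions
    that contribute most to the detection."""
    act = activation.astype(np.float32)
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)      # normalise to [0, 1]
    act = cv2.resize(act, (image_bgr.shape[1], image_bgr.shape[0]))
    heat = cv2.applyColorMap(np.uint8(255 * act), cv2.COLORMAP_JET)
    return cv2.addWeighted(heat, alpha, image_bgr, 1 - alpha, 0)
```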
The heatmap comparison between the two algorithms is shown in Figure 13, where Figure 13a,c show the heatmaps of ESPPD-YOLO for sweet potatoes at different shooting angles, and Figure 13b,d show the heatmaps of YOLOv8s at the same angles. As can be seen from Figure 13, both ESPPD-YOLO and YOLOv8s focus on the target object, but the hot tones in Figure 13a,c are brighter than those in Figure 13b,d, indicating that the improved ESPPD-YOLO extracts target features better. Comparing Figure 13a and Figure 13b, Figure 13b focuses only on part of the target features, which easily leads to false and repeated detections, while Figure 13a focuses more on the overall features of the target and improves the detection accuracy of the whole sweet potato plant. In addition, in Figure 13d the heat source is not concentrated on the object and the background interferes with the model, whereas in Figure 13c the heat source is concentrated entirely on the target object.
As shown in Figure 13, at different shooting angles, the heat map generated by ESPPD-YOLO almost completely covers all sweet potato targets, and the hot spots in the heat map are widely distributed throughout the sweet potato plant. This shows that ESPPD-YOLO has a stronger ability to extract sweet potato features and can accurately capture and identify the details of sweet potatoes in the environment. It further proves its superiority and adaptability in sweet potato target detection in natural environments.

4. Discussion

To address the contamination of sweet potatoes and farmland caused by pesticide waste, an early sweet potato plant detection method based on YOLOv8s is proposed. The method is aimed mainly at the field recognition of sweet potato plants and has broad application prospects for crop recognition in complex natural environments.
During the development of this work, no other datasets similar to the sweet potato plant class evaluated here were found in the literature. To evaluate the performance of the ESPPD-YOLO algorithm, we therefore compared our results with several recently published studies that used methods similar to ours. Peng et al. [44] used an improved YOLOv5s model to detect tea buds in outdoor environments; the model achieved 83.6% and 54.8% in mAP0.5 and mAP0.5:0.95, respectively, with 8.39 M parameters and a computational cost of 17.9 G. Wu et al. [45] used YOLO-RGBDtea to detect tea seedlings in a complex outdoor scene, achieving an mAP0.5:0.95 of 91.12% with 49.19 M parameters and 170.82 G of computation. Sun [46] proposed S-YOLO, a lightweight greenhouse tomato detection model based on YOLOv8s, with an mAP0.5 of 92.46%, 9.11 M parameters and a computational cost of 15.22 G. Du et al. [17] proposed the DSW-YOLO network based on YOLOv7; its mAP0.5 and mAP0.5:0.95 were 86.7% and 57.2%, respectively, with 32.4 M parameters and 99.5 G of computation. Qi et al. [47] proposed a method for detecting tomato buds and stems based on YOLOv8s, with 11.86 M parameters, 42.90 G of computation, and mAP0.5 and mAP0.5:0.95 of 66.4% and 66.5%, respectively. Compared with these results, the improved method proposed in this study performs very well, with mAP0.5 and mAP0.5:0.95 reaching 96.3% and 80.6%, respectively, which are markedly higher than those of the studies above. In terms of model complexity, the number of parameters and the amount of computation of our model are 3.6 M and 13.4 G, respectively, which are 67.8% and 53.1% lower than those of the original YOLOv8s network and also lower than those of the models above. This shows that our algorithm maintains high accuracy with lower computational complexity and has good application prospects.
However, when the sweet potatoes grow very lushly or multiple plants occlude each other severely, false and missed detections still occur, and a richer dataset is needed to improve the detection performance of the model. During data collection, some of the captured images suffered from background blur, subject displacement and overexposure. Therefore, before data processing, preprocessing operations such as defogging and noise reduction could first be applied to make the images clearer and richer in detail. In addition, collecting images in different time periods, such as the soft light of early morning, the long shadows of evening and low-light conditions at night, can effectively enhance the diversity of the dataset; it not only increases the variety of sweet potato appearances under different lighting and shadow conditions but also helps the model learn to adapt to complex illumination. Combining defogging, noise reduction and other techniques with a multi-period image acquisition strategy can provide the model with higher-quality, richer training data and further improve its robustness, enabling it to maintain high detection performance in various complex natural environments.
Therefore, in future research, we will further optimize the data set and data processing methods to enhance the model’s adaptability in various complex natural environments. Then, we will try to use technologies such as knowledge distillation and network pruning to further optimize the model structure, thereby significantly reducing the model complexity while improving the model detection effect.

5. Conclusions

To address the problems of pesticide waste and pollution in sweet potato disease and pest control, a sweet potato plant detection method, ESPPD-YOLO, based on the YOLOv8s network is proposed. It can detect early sweet potato plants under natural conditions, with mAP0.5 and mAP0.5:0.95 reaching 96.3% and 80.6%, respectively. The number of parameters and the amount of computation are 3.6 M and 13.4 G, respectively, and the model size is only 7.6 MB. The experimental results confirm the feasibility and effectiveness of this method for detecting sweet potato plants. This study is of great significance for reducing pesticide waste, improving pesticide utilization, and preventing crops and land from being polluted by pesticides. In future research, we will consider fusing data from different time periods and different sensors to obtain richer environmental information while improving the performance of the model in complex and changing natural environments.

Author Contributions

Methodology, K.X., W.S. and D.C.; resources, K.X.; writing—original draft preparation, K.X., W.S., D.C. and J.X.; writing—review and editing, K.X., W.S., Y.Q. and R.Y.; supervision, J.X. and R.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the “Key R&D Projects in Hainan Province, grant number ZDYF2023XDNY039” and “National Talent Foundation Project of China, grant number T2019136”.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, W.; Jiang, F.; Ou, J. Global pesticide consumption and pollution: With China as a focus. Proc. Int. Acad. Ecol. Environ. Sci. 2011, 1, 125. [Google Scholar]
  2. Wei, J.; Gong, H.; Li, S.; You, M.; Zhu, H.; Ni, L.; Luo, L.; Chen, M.; Chao, H.; Hu, J.; et al. Improving the Accuracy of Agricultural Pest Identification: Application of AEC-YOLOv8n to Large-Scale Pest Datasets. Agronomy 2024, 14, 1640. [Google Scholar] [CrossRef]
  3. Darbyshire, M.; Salazar-Gomez, A.; Gao, J.; Sklar, E.I.; Parsons, S. Towards practical object detection for weed spraying in precision agriculture. Front. Plant Sci. 2023, 14, 1183277. [Google Scholar] [CrossRef] [PubMed]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  5. Mai, X.; Zhang, H.; Jia, X.; Meng, M.Q.H. Faster R-CNN with classifier fusion for automatic detection of small fruits. IEEE Trans. Autom. Sci. Eng. 2020, 17, 1555–1569. [Google Scholar] [CrossRef]
  6. Wan, S.; Goudos, S. Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput. Netw. 2020, 168, 107036. [Google Scholar] [CrossRef]
  7. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  8. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  9. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 21–37. [Google Scholar] [CrossRef]
  11. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  12. Lu, S.; Liu, X.; He, Z.; Zhang, X.; Liu, W.; Karkee, M. Swin-Transformer-YOLOv5 for real-time wine grape bunch detection. Remote Sens. 2022, 14, 5853. [Google Scholar] [CrossRef]
  13. Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
  14. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906. [Google Scholar] [CrossRef]
  15. Li, Q.; Ma, W.; Li, H.; Zhang, X.; Zhang, R.; Zhou, W. Cotton-YOLO: Improved YOLOV7 for rapid detection of foreign fibers in seed cotton. Comput. Electron. Agric. 2024, 219, 108752. [Google Scholar] [CrossRef]
  16. Chen, J.; Wang, H.; Zhang, H.; Luo, T.; Wei, D.; Long, T.; Wang, Z. Weed detection in sesame fields using a YOLO model with an enhanced attention mechanism and feature fusion. Comput. Electron. Agric. 2022, 202, 107412. [Google Scholar] [CrossRef]
  17. Du, X.; Cheng, H.; Ma, Z.; Lu, W.; Wang, M.; Meng, Z.; Jiang, C.; Hong, F. DSW-YOLO: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric. 2023, 214, 108304. [Google Scholar] [CrossRef]
  18. Ye, X.; Pan, J.; Shao, F.; Liu, G.; Lin, J.; Xu, D.; Liu, J. Exploring the potential of visual tracking and counting for trees infected with pine wilt disease based on improved YOLOv5 and StrongSORT algorithm. Comput. Electron. Agric. 2024, 218, 108671. [Google Scholar] [CrossRef]
  19. Zhao, C.; Shu, X.; Yan, X.; Zuo, X.; Zhu, F. RDD-YOLO: A modified YOLO for detection of steel surface defects. Measurement 2023, 214, 112776. [Google Scholar] [CrossRef]
  20. Wang, Y.; Zou, H.; Yin, M.; Zhang, X. SMFF-YOLO: A Scale-Adaptive YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes. Remote Sens. 2023, 15, 4580. [Google Scholar] [CrossRef]
  21. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  22. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  23. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar] [CrossRef]
  24. Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914. [Google Scholar] [CrossRef]
  25. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A lightweight YOLOv8 tomato detection algorithm combining feature enhancement and attention. Agronomy 2023, 13, 1824. [Google Scholar] [CrossRef]
  26. Zhang, J.; Jin, J.; Ma, Y.; Ren, P. Lightweight object detection algorithm based on YOLOv5 for unmanned surface vehicles. Front. Mar. Sci. 2023, 9, 1058401. [Google Scholar] [CrossRef]
  27. Tian, S.; Fang, C.; Zheng, X.; Liu, J. Lightweight Detection Method for Real-Time Monitoring Tomato Growth Based on Improved YOLOv5s. IEEE Access 2024, 12, 29891–29899. [Google Scholar] [CrossRef]
  28. Li, S.; Zhang, S.; Xue, J.; Sun, H. Lightweight target detection for the field flat jujube based on improved YOLOv5. Comput. Electron. Agric. 2022, 202, 107391. [Google Scholar] [CrossRef]
  29. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-based lightweight yolo network for UAV small object detection. Remote Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
  30. Deng, L.; Bi, L.; Li, H.; Chen, H.; Duan, X.; Lou, H.; Zhang, H.; Bi, J.; Liu, H. Lightweight aerial image object detection algorithm based on improved YOLOv5s. Sci. Rep. 2023, 13, 7817. [Google Scholar] [CrossRef]
  31. Li, J.; Li, J.; Zhao, X.; Su, X.; Wu, W. Lightweight detection networks for tea bud on complex agricultural environment via improved YOLO v4. Comput. Electron. Agric. 2023, 211, 107955. [Google Scholar] [CrossRef]
  32. Lyu, S.; Li, Z. YOLO-SCL: A lightweight detection model for citrus psyllid based on spatial channel interaction. Front. Plant Sci. 2023, 14, 1276833. [Google Scholar] [CrossRef]
  33. Tang, J.; Wang, Z.; Zhang, H.; Li, H.; Wu, P.; Zeng, N. A lightweight surface defect detection framework combined with dual-domain attention mechanism. Expert Syst. Appl. 2024, 238, 121726. [Google Scholar] [CrossRef]
  34. Silvia, R.; Rahman, A.Y.; Priyandoko, G. Quality Detection of Sweet Potato Leaves Using YOLOv4-Tiny. In Proceedings of the 2023 International Seminar on Application for Technology of Information and Communication, Semarang, Indonesia, 16–17 September 2023; pp. 446–451. [Google Scholar] [CrossRef]
  35. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 1389–1400. Available online: https://doi.ieeecomputersociety.org/10.1109/ICCV51070.2023.00134 (accessed on 18 October 2024).
  36. Hou, Q.B.; Zhou, D.Q.; Feng, J.S. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar] [CrossRef]
  37. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  38. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar] [CrossRef]
  39. Lee, Y.; Hwang, J.W.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar] [CrossRef]
  40. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  41. Zhang, H.; Xu, C.; Zhang, S. Inner-iou: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar] [CrossRef]
  42. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar] [CrossRef]
  43. Zuo, Z.; Gao, S.; Peng, H.; Xue, Y.; Han, L.; Ma, G.; Mao, H. Lightweight Detection of Broccoli Heads in Complex Field Environments Based on LBDC-YOLO. Agronomy 2024, 14, 2359. [Google Scholar] [CrossRef]
  44. Peng, J.; Zhang, Y.; Xian, J.; Wang, X.; Shi, Y. YOLO Recognition Method for Tea Shoots Based on Polariser Filtering and LFAnet. Agronomy 2024, 14, 1800. [Google Scholar] [CrossRef]
  45. Wu, Y.; Chen, J.; Wu, S.; Li, H.; He, L.; Zhao, R.; Wu, C. An improved YOLOv7 network using RGB-D multi-modal feature fusion for tea shoots detection. Comput. Electron. Agric. 2024, 216, 108541. [Google Scholar] [CrossRef]
  46. Sun, X. Enhanced tomato detection in greenhouse environments: A lightweight model based on S-YOLO with high accuracy. Front. Plant Sci. 2024, 15, 1451018. [Google Scholar] [CrossRef]
  47. Qi, Z.; Hua, W.; Zhang, Z.; Deng, X.; Yuan, T.; Zhang, W. A novel method for tomato stem diameter measurement based on improved YOLOv8-seg and RGB-D data. Comput. Electron. Agric. 2024, 226, 109387. [Google Scholar] [CrossRef]
Figure 1. Sample images of sweet potato.
Figure 2. YOLOv8 network structure.
Figure 3. ESPPD-YOLO structure.
Figure 4. Efficient model based on coordinate attention (EMCA). (a) Inverted residual mobile block (iRMB). (b) The iRMB based on the coordinate attention mechanism (RMBCA). (c) EMCA structure.
Figure 5. Coordinate attention mechanism.
Figure 6. EFFF structure.
Figure 7. GELAN structure.
Figure 8. Description of Inner-IoU.
Figure 9. ESPPD-YOLO model training results.
Figure 10. The mAP curves for different improvement stages.
Figure 11. Detection results of different models at horizontal angles. (a) Faster R-CNN. (b) SSD. (c) YOLOv5s. (d) YOLOv7. (e) YOLOv8s. (f) YOLOv9s. (g) YOLOv10s. (h) ESPPD-YOLO. The red rectangle indicates the detection frame, the yellow ellipse indicates missed detection, the black ellipse indicates that one sweet potato plant is identified as two or more, the purple ellipse indicates incomplete detection, and the blue ellipse indicates that two sweet potato plants are identified as one.
Figure 12. Detection results of different models at tilt angles. (a) Faster R-CNN. (b) SSD. (c) YOLOv5s. (d) YOLOv7. (e) YOLOv8s. (f) YOLOv9s. (g) YOLOv10s. (h) ESPPD-YOLO. The red rectangle indicates the detection frame, the yellow ellipse indicates missed detection, the black ellipse indicates that one sweet potato plant is identified as two or more, the purple ellipse indicates incomplete detection, and the blue ellipse indicates that two sweet potato plants are identified as one.
Figure 13. Heat map. (a,c) are heat maps of ESPPD-YOLO. (b,d) are heat maps of YOLOv8s. Hot tones (such as red and yellow) represent high-importance areas, and cold tones (such as blue and green) represent low-importance areas.
Table 1. Sweet potato plant dataset composition.
| Dataset | Target Box (Sweet Potato) | Number of Images |
|---|---|---|
| Train | 61,212 | 4749 |
| Validation | 7704 | 594 |
| Test | 7230 | 597 |
| Total | 76,146 | 5940 |
Table 2. Hardware and software environment.
| System | CPU | GPU | RAM | Python | PyTorch | CUDA |
|---|---|---|---|---|---|---|
| Ubuntu 20 | i5 | RTX 3060 | 16 GB | 3.8 | 1.11.0 | 11.3 |
Table 3. The comparative results of the different attention mechanisms.
| Model | P (%) | R (%) | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|
| YOLO-EMO | 92.4 | 88.9 | 95.4 | 77.1 | 7,702,801 | 20.1 | 15.8 |
| YOLO-EMCBAM | 91.8 | 90.7 | 95.7 | 78.4 | 9,093,551 | 20.2 | 18.5 |
| YOLO-EMSE | 92.7 | 90.8 | 95.4 | 78.3 | 10,638,507 | 20.2 | 21.6 |
| YOLO-EMEMA | 91.8 | 90.7 | 95.7 | 77.8 | 8,518,135 | 28.6 | 17.4 |
| YOLO-EMCA | 92.1 | 91.2 | 96.2 | 79.6 | 7,702,801 | 20.2 | 15.8 |
Table 4. Results of ablation experiments.
| Model | EMCA | EFFF | IFIoU | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOv8s | × | × | × | 95.4 | 77.1 | 11,125,971 | 28.6 | 22.5 |
| Improvement 1 | √ | × | × | 96.2 | 79.6 | 7,702,801 | 20.2 | 15.8 |
| Improvement 2 | × | √ | × | 95.9 | 77.5 | 6,976,979 | 21.6 | 14.2 |
| Improvement 3 | × | × | √ | 95.4 | 77.3 | 11,125,971 | 28.4 | 22.5 |
| Improvement 4 | √ | √ | × | 96.3 | 80.1 | 3,576,337 | 13.4 | 7.6 |
| Improvement 5 | √ | √ | √ | 96.3 | 80.6 | 3,576,337 | 13.4 | 7.6 |
Table 5. Comparison experiment of the ESPPD-YOLO with different detection algorithms.
| Model | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|
| Faster R-CNN | 89.3 | 56.1 | 28,275,328 | 940.9 | 113.4 |
| SSD | 91.9 | 60.7 | 26,285,486 | 62.7 | 95 |
| YOLOv5s | 95.8 | 74.7 | 7,012,822 | 15.8 | 14.4 |
| YOLOv7 | 96.2 | 74.2 | 37,196,556 | 105.1 | 149.2 |
| YOLOv8s | 95.4 | 77.1 | 11,135,971 | 28.6 | 22.5 |
| YOLOv9s | 95.8 | 78.7 | 7,167,475 | 27.6 | 15.2 |
| YOLOv10s | 95.3 | 76.9 | 8,035,734 | 25.1 | 16.5 |
| ESPPD-YOLO | 96.3 | 80.6 | 3,576,337 | 13.4 | 7.6 |
