Article

Research on Detection Algorithm of Green Walnut in Complex Environment

1 School of Information, Yunnan Normal University, Kunming 650500, China
2 Computer Vision and Intelligent Control Technology Engineering Research Center, Yunnan Provincial Department of Education, Kunming 650500, China
3 Yunnan Woody Oilseed (Walnut) Full Industry Chain Innovation Research Institute, Kunming 650500, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(9), 1441; https://doi.org/10.3390/agriculture14091441
Submission received: 16 July 2024 / Revised: 6 August 2024 / Accepted: 22 August 2024 / Published: 24 August 2024
(This article belongs to the Section Digital Agriculture)

Abstract

The growth environment of green walnuts is complex. In the actual picking and identification process, interference from near-background colors, occlusion by branches and leaves, and excessive model complexity place high demands on the performance of walnut detection algorithms. Therefore, a lightweight walnut detection algorithm suitable for complex environments is proposed based on YOLOv5s. First, the backbone network is reconstructed using the lightweight GhostNet network, laying the foundation for a lightweight model architecture. Next, the C3 structure in the feature fusion layer is optimized by proposing a lightweight C3 structure to enhance the model’s focus on important walnut features. Finally, the loss function is improved to address the problems of target loss and gradient adaptability during training. To further reduce model complexity, the improved algorithm undergoes pruning and knowledge distillation and is then deployed and tested on a small edge device. Experimental results show that, compared with the original YOLOv5s model, the improved algorithm reduces the number of parameters by 72.9% and the amount of computation by 84.1%, while mAP0.5 increases by 1.1%, precision by 0.7%, and recall by 0.3%, and the FPS reaches 179.6% of the original model. The algorithm meets the real-time detection needs of walnut recognition and provides a reference for walnut harvesting identification.

1. Introduction

As one of the world’s four major dried fruits, walnuts have high comprehensive development and utilization value and are widely used in health foods, industrial raw materials, medicinal treatments, and furniture manufacturing. China, as the main producer of walnuts, accounts for 34.6% of the world’s total walnut supply [1]. In China, walnut cultivation is mainly distributed in North China, Northwest China, Southwest China, Central China, South China, and East China, at altitudes ranging from 400 to 1800 m, mainly on slopes and in hilly areas. However, most walnut harvesting in China still relies on manual picking by farmers or mechanical shaking; these methods involve high labor intensity, low efficiency, and high costs, and mechanical shaking can also easily damage the trees. With the development of modern agriculture, fruit harvesting is increasingly moving towards intelligence, and fruit-picking robots are gradually being put into production and use. Therefore, intelligent fruit picking, which can improve the harvesting efficiency of picking equipment and reduce picking costs, has become an important research direction [2].
Due to the complexity and uncertainty of the fruit-picking environment, fruit-picking robots often struggle to meet the demands for precision, speed, and stability in practical applications. Therefore, achieving high-efficiency fruit picking relies on a precise and stable recognition and positioning system [3]. With the development of artificial intelligence technology, recognition and positioning technologies based on image recognition and machine learning are gradually being applied to the agricultural picking field. Chaivivatrakul et al. [4] proposed a technique based on texture analysis for detecting green fruits on plants, which showed good results for pineapples and bitter melons. Payne et al. [5,6] used color segmentation in the RGB and YCbCr color ranges and texture segmentation based on adjacent pixel variability to segment pixels into fruit pixels and background pixels, achieving a mango detection accuracy of 78.83%. Bai et al. [7] used Hough circle detection to fit the contour lines of tomato fruits, and applied spatial symmetric spline interpolation and set analysis for peduncle estimation, contour fitting, and picking point positioning, achieving precise tomato picking. Although traditional machine vision methods can achieve basic fruit recognition and positioning, the unstructured orchard environment increases the difficulty of fruit recognition. In practical working environments, visual obstructions caused by branches, leaves, and overlapping fruits, as well as changes in lighting that alter fruit colors, make traditional object recognition technologies limited in complex natural environments, leading to low detection efficiency and poor robustness, which cannot meet the needs of practical work. With the maturity of hardware devices such as GPUs, deep learning technologies based on convolutional neural networks have made new progress in the field of fruit picking [8,9,10]. Zhang et al. [11] addressed the problem of high hardware requirements and poor convenience of chili-picking robots by pruning the improved YOLOv5s algorithm, making it easier to deploy. Nan et al. [12] proposed a dragon fruit detection algorithm called WGB-YOLO, which performed well in densely planted dragon fruit orchards for multi-category fruit detection. Xu et al. [13] proposed an improved YOLOv5s algorithm for pepper detection to improve the work efficiency and adaptability of pepper picking in natural environments. Chen et al. [14] proposed the GA-YOLO grape detection algorithm to address the serious problem of missed detections in dense grape clusters with existing detection algorithms.
With the rise of research in the field of fruit picking, walnut picking recognition has gradually attracted more attention from scholars. In the field of walnut picking, an accurate and stable recognition and positioning system directly affects the operational efficiency of walnut-picking robots [15]. However, in actual harvesting operations, the accuracy of target positioning and recognition is affected by various factors, such as interference from complex backgrounds, changes in natural lighting, occlusion by fruits and leaves, and issues with real-time detection. Wu et al. [16] used drone remote sensing images to study walnut recognition under complex lighting conditions, improving the detection accuracy of small walnut targets under these conditions. Hao et al. [17] addressed the problem of green-husk walnuts being difficult to detect due to their color similarity to leaves and their small size in natural environments by proposing a YOLOv3-based visual detection method for green-husk walnuts. The improved algorithm achieved an average detection accuracy of 94.5% and a model size of 88.6 MB. Zhong et al. [18] proposed a walnut recognition algorithm based on Swin Transformer multi-layer feature fusion improved YOLOX-S to solve the issues of missed and false detections in walnut recognition by target detection algorithms in natural environments. The improved algorithm achieved a mAP0.5 of 96.72% and a model parameter size of 20.55 MB. Fan et al. [19] studied a convolutional neural network detection method based on Faster R-CNN for the precise recognition of green-husk walnuts in natural environments. The results showed that this method achieved a detection accuracy of 97.71%, but the detection time for a single image was relatively long at 227 ms. Fu et al. [20] addressed the issue of walnuts and leaves having very similar shapes, colors, and textures by using multispectral synthetic images for walnut detection, significantly improving detection performance.
These studies have filled the research gap in the field of walnut picking recognition, but most have focused on improving detection accuracy while neglecting model complexity. This paper addresses green-husk walnut detection in unstructured orchard environments, where color similarity between fruits and leaves, overlapping fruits, and leaf occlusion lead to missed and false detections, and where overly large models are not conducive to deployment on edge devices. The main contributions of this study are as follows:
  • A drone walnut dataset is constructed to collect walnut data under complex environments from different angles. The dataset includes various complex situations such as fruits with similar colors to the background under different lighting conditions, and occlusion by leaves and overlapping fruits.
  • We propose a walnut detection algorithm, GDAD-YOLOv5s, designed for complex environments. It includes a lightweight feature extraction module, a more efficient feature fusion module DE_C3, and an improved loss function Alpha CIoU. These enhancements aim to improve the model’s ability to focus on critical information about walnuts, enhance feature extraction performance, accelerate model convergence, and increase the accuracy of walnut recognition.
  • The model compression techniques of pruning and knowledge distillation were introduced to obtain a more lightweight and efficient walnut detection model, which was then deployed on a small edge device for inference testing. The results indicate that the model processes each image in an average of 201.95 ms on this device, showing a significant improvement in detection performance compared to the baseline model YOLOv5s.

2. Materials and Methods

2.1. Dataset

2.1.1. Study Area

The walnut dataset was collected from Baizhang Village, Yangbi County, Dali Bai Autonomous Prefecture, Yunnan Province (99°51′ E, 25°42′ N). This area belongs to the subtropical plateau monsoon climate of North Yunnan. The annual average temperature is 16.6 °C, and the annual rainfall is 827.6 mm. Located in a low-latitude and high-altitude zone, with an average elevation of 2166 m, the terrain is mainly characterized by mountains and hills. It is part of the western Yunnan high mountain and canyon region of the Hengduan Mountains, with significant terrain undulations. The mountainous area accounts for 98.4% of the total area. The topographic map is shown in Figure 1. The geographical environment of this region is suitable for walnut growth, with walnuts known for their large size, thin shells, white kernels, and fragrant taste, enjoying a good reputation both domestically and internationally.

2.1.2. Drone Data Collection

Drone technology, as a multidisciplinary and versatile technology, has been widely used in agricultural research [21]. In this study, walnut data collection was carried out using a DJI Matrice-300-RTK drone (DJI, Shenzhen, China) equipped with a Zenmuse P1 camera (DJI, Shenzhen, China) with a resolution of 5472 × 3268. The data were captured in July 2022, between 09:00 and 19:00. During capture, the drone flew approximately 100 m above the ground with the camera pointing vertically downward (90 degrees), a side overlap rate of 70%, a forward overlap rate of 80%, and a ground sample distance (GSD) of 1.26 cm/pixel. The captured images were manually screened to remove those rendered indiscernible to the naked eye by shooting issues and to eliminate duplicates, resulting in a total of 180 walnut images. The walnut collection process is illustrated in Figure 2.

2.1.3. Dataset Creation

The captured walnut images were cropped to a size of 640 × 640 pixels. After screening, a total of 2490 walnut dataset images were obtained, covering various complex scenarios such as bright light, dim light, occlusion, close-range views, and far-range views. The number of walnut images in each scenario is shown in Table 1, and Figure 3 shows sample images from the dataset. The walnuts were labeled in YOLO format using the LabelImg 1.8.6 annotation software; in total, 12,147 walnut targets were annotated. After annotation, the dataset was divided into a training set and a validation set in an 8:2 ratio, resulting in 1992 images for training and 498 images for validation.
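For readers who want to reproduce the split, the following minimal sketch performs the 8:2 train/validation division described above; the directory layout, file names, and random seed are illustrative assumptions rather than the authors’ exact preprocessing script.

```python
import random
from pathlib import Path

# Hypothetical layout: cropped 640x640 images plus YOLO-format .txt labels
# produced by LabelImg live under dataset/images and dataset/labels.
random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

split = int(0.8 * len(images))            # 8:2 train/validation ratio
subsets = {"train.txt": images[:split], "val.txt": images[split:]}

for list_name, subset in subsets.items():
    # YOLOv5 accepts plain text files listing one image path per line.
    Path(list_name).write_text("\n".join(str(p) for p in subset))
```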

2.2. YOLOv5 Object Detection Algorithm

Currently, mainstream object detection algorithms are mainly divided into two categories: one-stage [22] and two-stage [23]. One-stage algorithms, such as SSD [24], RetinaNet [25], and the YOLO series [26,27,28,29], are known for their fast detection speed and low complexity. Two-stage algorithms, such as R-CNN [30], SPP-Net [31], Fast R-CNN [32], Faster R-CNN [33], and R-FCN [34], offer higher detection accuracy but come with larger model complexity and slower detection speed. The YOLO series, known for its fast detection speed and high accuracy, has been widely applied in various fields. Although YOLOv8 [35] and YOLOv9 [36] have been developed to improve accuracy, their complexity is higher compared to YOLOv5. Considering the real-time nature and ease of deployment of model detection, this study chooses the lower-complexity YOLOv5s as the base model for improvement.
The YOLOv5 algorithm consists of four parts: input, backbone, neck, and head. The input end preprocesses input images using Mosaic data augmentation techniques. YOLOv5 utilizes CSPDarkNet53 as its backbone network for feature extraction. The neck network incorporates Feature Pyramid Network (FPN) [37] and Path Aggregation Network (PAN) [38] for feature fusion. The head network predicts features from three dimensions to obtain class and position information.

2.3. Algorithm Improvement

Although the YOLOv5 algorithm performs well in walnut detection, challenges remain. First, walnuts grow in unstructured orchard environments where varying lighting conditions and the similarity between walnut color and foliage can lead to missed detections and false positives, so the detection accuracy of the walnut detection algorithm needs to be improved. Moreover, overly large models are impractical to deploy, highlighting the need for lightweight improvements to the algorithm. We propose a lightweight algorithm, GDA-YOLOv5, based on the YOLOv5s algorithm. Its model structure is shown in Figure 4. Firstly, we reconstruct the backbone network by introducing the lightweight GhostNet network to lay the foundation for model lightweighting. At the feature fusion layer, we optimize the C3 structure to address the small feature differences caused by the similarity between walnut skin and leaves: we propose a lightweight C3 structure, DE_C3, which integrates the ECA attention mechanism into C3 to enhance the focus on walnut information. We also replace the original CIoU loss function with Alpha CIoU, which adaptively reweights the loss and gradients of high-IoU targets to improve bounding box regression accuracy. Finally, we compress GDA-YOLOv5 using pruning followed by distillation to achieve the lightweight model goals. The distilled model is named GDAD-YOLOv5.

2.3.1. Reconstruction of Backbone Network

To reduce model complexity and achieve lightweighting, this study restructures the original backbone network of YOLOv5s using GhostNet [39]. After feature extraction, a single image generates many similar feature maps, leading to redundancy in feature maps. GhostNet performs simple linear operations on one feature map to generate more similar features, thereby using fewer parameters to generate more feature maps. The Ghost module is illustrated in Figure 5. GhostNet employs Ghost modules to replace traditional convolutions. Initially, a regular 1 × 1 convolution compresses the number of channels and generates a portion of actual feature layers. Then, depthwise separable convolution is applied to obtain more feature layers. Finally, the different feature layers from the two parts are concatenated for output.
Assuming n output feature maps, the 1 × 1 regular convolution first reduces the number of channels to m = n/s intrinsic feature maps. The depthwise separable (cheap) convolution then generates m(s − 1) additional feature maps. Finally, through identity mapping, concatenating the two parts yields s × m = n output channels. The acceleration ratio r_s and the parameter compression ratio r_c of Ghost modules relative to standard convolutions can be derived as
$$ r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx s $$

$$ r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx s $$
where c is the number of input channels, h and w are the height and width of the input feature maps, h′ and w′ are the height and width of the output feature maps, s controls the proportion of feature maps generated by the cheap linear operations (and thus the compression of the primary convolution), d is the size of the linear-transformation (depthwise) convolution kernel, and k is the size of the regular convolution kernel.
The Ghost bottleneck is a bottleneck structure composed of Ghost modules, as illustrated in Figure 6. The left side of the figure shows the bottleneck with stride = 1, which deepens the network without compressing the height and width of the input feature maps. As depicted in the figure, it consists of two concatenated Ghost modules. The first Ghost module expands the number of channels, while the second Ghost module reduces the number of channels to match the input channel count. On the right side of the figure is the bottleneck with stride = 2, which alters the shape of the input feature maps. Compared to the bottleneck on the left, it includes an additional depthwise convolution to compress the height and width of the feature maps, reducing their size to half of the input.
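To make the structure concrete, the following PyTorch sketch shows a Ghost module under the description above: a 1 × 1 primary convolution produces the n/s intrinsic maps and a depthwise convolution generates the remaining ghost maps. It is a minimal illustration (with ReLU activations and s = 2 as assumed defaults), not the exact GhostNet implementation used in the paper.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a 1x1 primary convolution produces n/s
    intrinsic feature maps, a cheap depthwise convolution generates the
    remaining (s-1)*n/s ghost maps, and the two parts are concatenated."""
    def __init__(self, in_channels, out_channels, s=2, dw_kernel=3):
        super().__init__()
        intrinsic = out_channels // s              # n/s intrinsic maps
        ghost = out_channels - intrinsic           # (s-1) * n/s ghost maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, intrinsic, kernel_size=1, bias=False),
            nn.BatchNorm2d(intrinsic),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                # cheap linear transformation (depthwise)
            nn.Conv2d(intrinsic, ghost, dw_kernel, padding=dw_kernel // 2,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(ghost),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # n feature maps in total

# Quick check: same output shape as a standard convolution, far fewer parameters.
x = torch.randn(1, 64, 80, 80)
print(GhostModule(64, 128)(x).shape)               # torch.Size([1, 128, 80, 80])
```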

2.3.2. Depthwise Separable Convolution

Depthwise separable convolution consists of depthwise convolution and pointwise convolution. It first performs depthwise convolution on each input channel and then merges all channels into output feature maps through pointwise convolution, thereby reducing computational complexity and improving efficiency. In depthwise convolution, the kernel operates in single-channel mode: an independent kernel is applied to each input channel, producing output feature maps with the same number of channels as the input. Pointwise convolution then applies 1 × 1 kernels to the output of the depthwise convolution to integrate and exchange information between channels, producing the final feature output. Figure 7 illustrates standard convolution and depthwise separable convolution. As shown in the figure, suppose the standard convolution kernel size is D_K × D_K, the number of input channels is M, the number of output channels is N, and the size of the output feature maps is D_F × D_F. The parameter count and computational cost of standard convolution are then
$$ Param_s = D_K \times D_K \times M \times N $$

$$ GFLOPs_s = D_K \times D_K \times M \times N \times D_F \times D_F $$
The depthwise convolution uses M kernels of size D_K × D_K × 1, and the pointwise convolution uses N kernels of size 1 × 1 × M. Both the depthwise and pointwise convolutions are applied at each of the D_F × D_F output locations. Therefore, the parameter count and computational cost of depthwise separable convolution are calculated as
$$ Param_d = D_K \times D_K \times M + M \times N $$

$$ GFLOPs_d = D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F $$
The ratios of depthwise separable convolution to standard convolution, in terms of both parameter count and computational cost, follow from the formulas above.
$$ \frac{Param_d}{Param_s} = \frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2} $$

$$ \frac{GFLOPs_d}{GFLOPs_s} = \frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2} $$
When N is large, the 1/N term can be neglected, where D_K is the kernel size. Therefore, with a 3 × 3 kernel, the parameter count and computational cost of depthwise separable convolution are approximately 1/9 of those of standard convolution.
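The ratio above can be verified numerically. The short PyTorch sketch below builds a standard 3 × 3 convolution and its depthwise separable counterpart and compares their parameter counts; the channel sizes are arbitrary example values.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

M, N, K = 64, 128, 3   # input channels, output channels, kernel size (example values)

standard = nn.Conv2d(M, N, K, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    nn.Conv2d(M, M, K, padding=1, groups=M, bias=False),  # depthwise: one KxK kernel per input channel
    nn.Conv2d(M, N, kernel_size=1, bias=False),           # pointwise: N kernels of size 1x1xM
)

print(count_params(standard))                 # K*K*M*N = 73,728
print(count_params(depthwise_separable))      # K*K*M + M*N = 576 + 8,192 = 8,768
print(count_params(depthwise_separable) / count_params(standard))  # ~0.119 = 1/N + 1/K^2
```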

2.3.3. ECA Attention Mechanism

During feature extraction in convolutional neural networks, all parts of the feature maps are often treated as equally important, which can lead to inaccurate extraction of important features and to extracting more irrelevant ones, introducing feature redundancy into the model. By introducing attention mechanisms, the model can focus more on locally important information and extract the crucial features. The ECA attention mechanism [40] is a channel attention mechanism that improves upon the SE attention mechanism [41]; its principle is illustrated in Figure 8. After global average pooling, a one-dimensional convolution realizes local cross-channel interaction without dimensionality reduction, and the kernel size of this one-dimensional convolution is selected adaptively to determine the coverage of the interaction. The non-dimension-reduction strategy is crucial for learning channel attention, and appropriate cross-channel interaction keeps model complexity low while maintaining performance.
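The following PyTorch sketch illustrates the ECA mechanism described above: global average pooling, a 1-D convolution across channels with an adaptively chosen odd kernel size, and a sigmoid gate that reweights the channels. It is a generic illustration of ECA rather than the exact module used in the authors’ code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch: channel weights are learned by a 1-D
    convolution over the pooled channel descriptor, with no dimensionality
    reduction; the kernel size k is chosen adaptively from the channel count."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                       # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.pool(x).view(n, 1, c)                  # (N, 1, C): channels act as sequence length
        y = self.gate(self.conv(y)).view(n, c, 1, 1)
        return x * y                                    # reweight each channel

x = torch.randn(2, 128, 40, 40)
print(ECA(128)(x).shape)                                # torch.Size([2, 128, 40, 40])
```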

2.3.4. Lightweight C3

The C3 module plays a crucial role in both feature fusion and downsampling. For feature fusion, the C3 module uses convolutional layers at different scales to extract features from various semantic levels and integrates them to better capture semantic information across scales, thereby enhancing the accuracy and robustness of object detection. The C3 module can also perform downsampling on features, reducing the size of feature maps and enlarging the receptive field, which helps improve the detection of small objects. Because walnuts and leaves are similar in color, the differences in their feature information are small and the extraction of important walnut features is compromised. Therefore, this study improves the C3 of the feature fusion layer and introduces the DE_C3 structure. DE_C3 employs depthwise separable convolution to reduce the parameters required by the convolutional layers, thereby achieving a lightweight C3. To prioritize the extraction of important features, the ECA channel attention mechanism is introduced: as shown in Figure 9c, the ECA attention mechanism is incorporated in the bottleneck layer after two rounds of depthwise separable convolution, enhancing the model’s ability to capture feature information and focus on critical walnut details. The improved C3 not only reduces the number of parameters and computations but also enhances the feature extraction performance of the feature fusion layer.
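The sketch below illustrates the idea behind the DE_C3 bottleneck as described above: two depthwise separable convolutions followed by channel attention (the ECA module from the previous listing can be plugged in) and a residual connection. It is only an interpretation of the description; the authors’ exact DE_C3 layer ordering, activations, and channel widths may differ.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution block: depthwise 3x3 + pointwise 1x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class DEBottleneck(nn.Module):
    """Sketch of the DE_C3 bottleneck idea: two depthwise separable convolutions,
    then channel attention (pass attention=ECA(channels) from the listing above),
    with an optional residual connection as in the original C3 bottleneck."""
    def __init__(self, channels, attention=None, shortcut=True):
        super().__init__()
        self.conv1 = DWSeparableConv(channels, channels)
        self.conv2 = DWSeparableConv(channels, channels)
        self.attn = attention if attention is not None else nn.Identity()
        self.add = shortcut

    def forward(self, x):
        y = self.attn(self.conv2(self.conv1(x)))
        return x + y if self.add else y

x = torch.randn(1, 128, 40, 40)
print(DEBottleneck(128)(x).shape)    # torch.Size([1, 128, 40, 40])
```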

2.3.5. Loss Function Improvement

The loss function is primarily used to evaluate the deviation between predicted values and ground truth, and its choice is crucial for model performance during training. IoU-based losses are widely used in object detection, where the Intersection over Union (IoU) measures the overlap between the predicted and ground-truth bounding boxes. DIoU extends IoU by adding a distance term: it divides the squared Euclidean distance between the centers of the predicted and ground-truth boxes by the squared diagonal of the smallest region enclosing both boxes, thus measuring how well the predicted box is centered on the ground truth. DIoU considers overlap, center distance, and scale, which stabilizes bounding box regression, but it does not incorporate the aspect ratio. CIoU was therefore proposed as an improvement over DIoU, adding an aspect-ratio penalty term. The IoU and CIoU metrics are defined as
$$ IoU = \frac{|A \cap B|}{|A \cup B|} $$

$$ CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \beta v $$
Here, A and B are the regions of the ground-truth box and the predicted box, ρ(b, b^gt) denotes the Euclidean distance between the centers b and b^gt of the predicted and ground-truth boxes (shown as d in Figure 10), and c is the diagonal length of the minimum enclosing region that contains both boxes (shown as c in Figure 10). The weight β and the aspect-ratio term v are given by
$$ \beta = \frac{v}{1 - IoU + v} $$

$$ v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 $$
where w^gt and h^gt are the width and height of the ground-truth box, and w and h are the width and height of the predicted box. The final CIoU loss function is
$$ CIoU_{Loss} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \beta v $$
When the CIoU loss function is used, every detected target is assigned the same weight, which makes it difficult for the loss and gradients of individual targets to adapt during training. Alpha IoU [42] is a new IoU-based weighted loss obtained by applying a unified power transformation to existing losses. By choosing an appropriate Alpha coefficient, it up-weights the loss and the gradients of high-IoU targets, improving bounding box regression precision. The mAP0.5 comparison curves in Figure 11 show that training with Alpha CIoU converges better than with CIoU alone. Applying the Box-Cox transformation generalizes the IoU loss to the Alpha IoU loss:
$$ L_{\alpha\text{-}IoU} = \frac{1 - IoU^{\alpha}}{\alpha}, \quad \alpha > 0 $$
Adding the CIoU penalty terms to this power-transformed IoU yields the improved Alpha CIoU loss function:
$$ L_{\alpha\text{-}CIoU} = 1 - IoU^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\beta v)^{\alpha} $$
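For illustration, the following PyTorch sketch computes the Alpha CIoU loss for boxes in (x1, y1, x2, y2) format following the formulas above. Setting alpha = 1 recovers the ordinary CIoU loss; alpha = 3, the default here, is the value recommended in the Alpha-IoU paper and is an assumption rather than the exact training configuration reported in this work.

```python
import math
import torch

def alpha_ciou_loss(pred, target, alpha=3.0, eps=1e-7):
    """Sketch of the Alpha CIoU loss for boxes given as (x1, y1, x2, y2)."""
    # intersection and union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # squared center distance d^2 and squared enclosing-box diagonal c^2
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    d2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    ex1 = torch.min(pred[..., 0], target[..., 0])
    ey1 = torch.min(pred[..., 1], target[..., 1])
    ex2 = torch.max(pred[..., 2], target[..., 2])
    ey2 = torch.max(pred[..., 3], target[..., 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # aspect-ratio term v and its trade-off weight beta
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        beta = v / (1 - iou + v + eps)

    # power transform of each term (Box-Cox style generalization)
    return 1 - iou ** alpha + (d2 / c2) ** alpha + (beta * v) ** alpha

# toy usage
pred = torch.tensor([[50.0, 50.0, 150.0, 160.0]], requires_grad=True)
gt = torch.tensor([[60.0, 55.0, 155.0, 165.0]])
loss = alpha_ciou_loss(pred, gt)
loss.sum().backward()
```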

2.4. Model Compression

To further enhance the performance of the improved model, we adopt a model compression scheme of pruning followed by distillation for GDA-YOLOv5. This approach involves trimming and optimizing redundant network parameters and structures to accelerate inference speed, reduce floating-point operations, and decrease memory usage.

2.4.1. Pruning

Current models typically contain millions to hundreds of millions of parameters, so not every weight is crucial in practice. Pruning removes “unimportant weights” from the model in order to satisfy practical constraints, alleviate overfitting, enhance interpretability, or deepen the understanding of neural network training. Pruning methods can be categorized into structured pruning and unstructured pruning. Structured pruning reduces the parameters of the network by removing entire layers, channels, or filters according to certain rules. Unstructured pruning, in contrast, can remove any weight without being constrained by the network structure, resulting in irregular sparse models. This paper introduces the unstructured pruning method LAMP (Layer-Adaptive Magnitude-based Pruning) [43], which improves pruning efficiency by selecting per-layer sparsity levels according to LAMP scores.
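As a reference for how the LAMP score is defined, the sketch below computes, for each weight, its squared magnitude divided by the sum of squared magnitudes of all weights in the same layer that are at least as large, and then keeps the globally highest-scoring weights. This is a generic illustration of the LAMP criterion, not the pruning pipeline actually used for GDA-YOLOv5.

```python
import torch

def lamp_scores(weight):
    """LAMP score sketch: within a layer, each weight's squared magnitude is
    divided by the sum of squared magnitudes of all weights in that layer
    that are at least as large as it."""
    flat = weight.detach().flatten() ** 2
    sorted_vals, order = torch.sort(flat)                       # ascending
    # suffix sum: for each sorted position, itself plus all larger weights
    suffix = torch.flip(torch.cumsum(torch.flip(sorted_vals, [0]), 0), [0])
    scores = torch.empty_like(flat)
    scores[order] = sorted_vals / suffix
    return scores.view_as(weight)

def global_lamp_masks(weights, sparsity=0.5):
    """Because LAMP scores are comparable across layers, a single global
    threshold yields a layer-adaptive sparsity allocation."""
    all_scores = torch.cat([lamp_scores(w).flatten() for w in weights])
    k = max(1, int(sparsity * all_scores.numel()))
    threshold = torch.kthvalue(all_scores, k).values
    return [lamp_scores(w) > threshold for w in weights]

# toy usage: a global threshold typically keeps a different fraction of weights per layer
layers = [torch.randn(32, 16), torch.randn(64, 64)]
masks = global_lamp_masks(layers, sparsity=0.5)
print([m.float().mean().item() for m in masks])   # fraction of weights kept in each layer
```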

2.4.2. Knowledge Distillation

After pruning GDA-YOLOv5, the model complexity was effectively reduced, but the reduction also caused a loss of accuracy, which is not conducive to walnut detection in unstructured orchard environments. Therefore, we applied knowledge distillation to the pruned model. Knowledge distillation is a lightweight method based on transfer learning: a structurally simpler student model is trained on the outputs of a teacher model, transferring knowledge from the teacher to the student. Knowledge distillation mainly falls into two categories: logit distillation and feature distillation. Logit distillation typically uses soft labels instead of hard labels to train the smaller model and focuses on the probability distribution of the output layer. Feature distillation usually acts on the middle layers of the model and focuses on intermediate features. In this study, we primarily employed logit distillation, selecting GDA-YOLOv5 as the teacher model and the pruned GDA-YOLOv5 as the student model. The distilled model is named GDAD-YOLOv5.
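The sketch below shows the general form of a logit-distillation objective: the student minimizes a weighted sum of its ordinary hard-label loss and a KL divergence to the teacher’s temperature-softened outputs. It is written for a generic classification head as an illustration only; the temperature and the 0.008 weighting (the distillation loss rate found best in Section 3.1.6) are illustrative placements, not the authors’ exact YOLO distillation implementation.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, hard_targets,
                            temperature=4.0, distill_rate=0.008):
    """Generic logit distillation sketch: hard-label cross-entropy plus a
    KL term between temperature-softened teacher and student distributions."""
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                      # rescale so gradients match the hard loss
    return hard_loss + distill_rate * soft_loss

# toy usage: the teacher's (frozen) logits guide the student's update
student_logits = torch.randn(8, 5, requires_grad=True)   # toy batch of 8 samples, 5 classes
teacher_logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
loss = logit_distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```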

2.5. Experimental Environment and Parameter Setting

The experiments were conducted on a Windows 10 operating system, utilizing an NVIDIA GeForce RTX 4060 Ti GPU (Shenzhen Yingjiaxun Industry Co., Ltd., Shenzhen, China) with 16 GB of memory, an Intel(R) Core(TM) i5-13490F CPU (Intel, Santa Clara, CA, USA), and 32 GB of RAM (Shenzhen Jinbaida Technology Co., Ltd., Shenzhen, China). PyTorch deep learning framework version 1.13.1 was used, along with Python version 3.7 and CUDA version 12.2. The training initialization parameters are listed in Table 2.

2.6. Metrics

Object detection algorithms are primarily evaluated based on precision (P), recall (R), and average precision ( m A P ) to assess detection accuracy. Model complexity is typically assessed by parameters, computation, and model weight size. Real-time detection performance is evaluated based on the computation of detection frame rate ( F P S ).
Precision is used to measure the accuracy of the model in predicting positive samples. Its calculation formula is
$$ P = \frac{TP}{TP + FP} \times 100\% $$
where TP is the number of positive samples correctly predicted as positive, and FP is the number of negative samples incorrectly predicted as positive.
Recall measures the proportion of all actual positive samples that the model detects. Its calculation formula is
$$ R = \frac{TP}{TP + FN} \times 100\% $$
where FN is the number of positive samples incorrectly predicted as negative.
The AP value is the area under the precision-recall (P-R) curve, and mAP is the average of the AP values over all classes. Its formula is
$$ mAP = \frac{\sum_{i=1}^{N} \int_{0}^{1} P(R)\, dR}{N} \times 100\% $$
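As a concrete reference for how the area under the P-R curve is usually computed in practice, the short sketch below applies the all-point interpolation used by most YOLO evaluation tooling; the toy precision/recall points are illustrative values, not measurements from this study.

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP sketch: area under the precision-recall curve with all-point
    interpolation (precision made monotonically non-increasing from the right)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # interpolate precision
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# toy P-R points collected by sweeping the confidence threshold
recalls = np.array([0.2, 0.4, 0.6, 0.8])
precisions = np.array([0.95, 0.92, 0.85, 0.70])
print(average_precision(recalls, precisions))     # single-class case: AP equals mAP
```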
The parameter count (Param) refers to the number of learnable parameters in the model, including the weights and bias terms in the network. For a convolutional layer it is calculated as
$$ Param = K \times K \times C_{in} \times C_{out} $$
where K is the kernel size, C_in is the number of input channels, and C_out is the number of output channels.
The computational complexity of a model is typically measured in FLOPs, the number of floating-point operations required for model inference. For a convolutional layer it is calculated as
$$ FLOPs = K \times K \times C_{in} \times C_{out} \times H \times W $$

where H and W are the height and width of the output feature map.

3. Experiment and Analysis

3.1. Experiments

3.1.1. Comparison Experiment of Lightweight Backbone Network

To validate the effectiveness of the proposed lightweight model, a series of lightweight backbone networks are compared, and the experimental results are shown in Table 3. From the table, it can be observed that although the MobileNetv3 and ShuffleNetv2 networks exhibit good lightweight performance, they suffer significant accuracy loss, which is not conducive to walnut detection research. On the other hand, while the GhostNet and EfficientNetv2 networks have relatively high model complexity, they incur smaller detection accuracy loss. Among these lightweight backbone networks, GhostNet achieves the highest detection accuracy. Therefore, GhostNet is selected as the backbone network for lightweight research in this paper.

3.1.2. Optimization Experiment of Feature Fusion Layer

In the feature fusion layer, this study mainly improved the C3 module. By embedding attention mechanisms into the C3 module to enhance the model’s perception of important features, the model focuses more on extracting locally important features. As shown in Figure 12, the detection accuracy of different attention mechanisms is compared, and it can be seen from the graph that mAP0.5 fluctuates around 94.3%. Among them, SE, ECA, SimAM, and Triplet attention mechanisms achieve the best mAP0.5, reaching 94.5%. The specific analysis of these four attention mechanisms is presented in Table 4. It can be observed from the table that the mAP0.5 of the four attention mechanisms is 0.2% higher than the original value. However, comparing comprehensively, the parameter volume of the ECA attention mechanism is the smallest, and its P value is also the highest. Therefore, we select the ECA attention mechanism to be integrated into the improved C3 module to optimize the performance of C3.

3.1.3. Ablation Experiment

To evaluate the effectiveness of the improved algorithm in walnut detection, we conducted the following ablation experiments, and the results are shown in Table 5. From the table, it can be seen that after introducing the GhostNet network, the model complexity is effectively reduced, but there is also a certain loss in mAP value. However, with the optimized DE_C3, not only did mAP0.5 increase by 0.6%, but the parameter count and computation also saw significant reductions. Upon replacing the loss function with Alpha CIoU, the model’s precision, recall, and mAP values all increased while maintaining the original complexity. After pruning, the model’s parameter count decreased by 2.2 M, computation reduced by 3.6 GFLOPs, and the model weight file size decreased by 4.3 MB, determining the final complexity of the model. Knowledge distillation improved the model’s detection accuracy while maintaining its small complexity, even surpassing the teacher model. Ultimately, our proposed walnut detection algorithm GDAD-YOLOv5 outperformed the baseline model YOLOv5s, with mAP0.5 increased by 1.1%, precision by 0.7%, recall by 0.3%, parameter count reduced by 72.9%, computation reduced by 84.1%, and model weight file size reduced by 70.1%. Not only did it enhance detection accuracy, but it also effectively reduced model complexity, thereby improving the walnut detection performance of the model. The performance comparison of the algorithm before and after the improvement is shown in the radar chart in Figure 13. The smaller the area of the radar region, the better the performance of the algorithm. It is also intuitively demonstrated from the chart that the performance of our improved algorithm has been significantly enhanced.

3.1.4. Comparison Experiments of Different Models

To further validate the performance of the improved algorithm, we conducted several comparative experiments with the currently mainstream lightweight YOLO series algorithms, and the experimental results are shown in Table 6. From the table, it can be observed that the improved algorithm is superior both in terms of model complexity and detection accuracy. Although the earlier model YOLOv3-tiny performs well in real-time detection, it has shortcomings in terms of detection accuracy and model complexity. Compared with YOLOv7-tiny, YOLOv8s, and YOLOv9t, our model GDAD-YOLOv5 achieved higher mAP0.5 by 1.0%, 0.7%, and 0.6%, respectively. Moreover, compared to the relatively lightweight YOLOv9t, GDAD-YOLOv5 reduced computation by 8.2 GFLOPs, parameter count by 0.8 M, and weight file size by 1.8 MB, with the best FPS value. Our improved GDA-YOLOv5 and pruned algorithms also reached the level of mainstream algorithms, significantly reducing model complexity while effectively improving accuracy.

3.1.5. Pruning Optimization Experiment

To validate the pruning effects under different parameters, we conducted several experiments as follows. Firstly, we compared the effects of different pruning methods on our model, and the experimental results are shown in Table 7. From the table, it can be observed that the model achieves the best mAP0.5 when using the LAMP pruning method. Therefore, we employed the LAMP pruning method for pruning in this study. The “Speed up” parameter specifies the acceleration ratio of the pruned model’s inference speed compared to the original model. Different “Speed up” values correspond to different pruning rates of the model. Figure 14 reflects the influence of different “Speed up” values on model performance. From Figure 14a,b, it can be seen that as the “Speed up” increases, the complexity of the model shows a decreasing trend. Figure 14c,d show that precision, recall, and mAP0.5 exhibit oscillating changes with different “Speed up” values, but relatively better performance is achieved when “Speed up” is set to 2.0.

3.1.6. Knowledge Distillation Optimization Experiment

Logit distillation and feature distillation are two common knowledge distillation methods, each with different emphases. To compare which distillation method is more suitable for our model compression, this study conducted a comparison between the two distillation methods. The experimental results are shown in Figure 15. The horizontal axis in the figure represents the distillation loss rate, indicating the variation of the values of feature distillation and logit distillation at different loss rates. As observed from the figure, whether in terms of precision and recall or mAP0.5, the performance of logit distillation is significantly better than that of feature distillation across different loss rates. Therefore, this study adopts the logit distillation method, and the best value is achieved when the distillation loss rate is set to 0.008.

3.2. Analysis of Detection Effect in Complex Environment

3.2.1. Comparison of Occlusion Detection

To evaluate the improved algorithm’s performance in detecting occluded walnuts in complex environments, we selected 100 images containing occluded walnuts for testing. The results are shown in Table 8. From the table, it can be observed that the improved algorithm demonstrates better occlusion detection performance in different complex environments. Some selected visualized results are shown in Figure 16. It can be observed from the figure that, whether in well-lit or slightly dim environments, the improved algorithm outperforms its predecessor in detecting occluded walnuts of varying degrees.

3.2.2. Comparison of Near Background Detection

To evaluate the improved algorithm’s performance in detecting walnuts against similar background foliage, we selected 100 images where the walnut skin color is similar to that of the leaves for testing. The detection results are presented in Table 9. From the table, it can be observed that the improved algorithm shows some improvement in detection performance in complex environments where the walnut color is similar to that of the leaves. Selected visualized results are shown in Figure 17. From the figure, it can be seen that due to variations in lighting conditions, the walnut color and background foliage can be similar, leading to confusion between the leaves and the walnuts to some extent. The improved algorithm demonstrates a reduction in both missed detections and false positives, indicating an enhancement in walnut detection performance.

4. Model Deployment

To validate the effectiveness of the proposed algorithm, we deployed it on a Raspberry Pi 5, a small edge device, for inference verification. The device is shown in Figure 18; it features a 64-bit quad-core Arm Cortex-A76 processor running at 2.4 GHz and a VideoCore VII GPU that supports OpenGL ES 3.1 and Vulkan 1.2. We converted the trained model into ONNX format and wrote Python code to conduct inference testing on the Raspberry Pi. We selected 110 walnut images and recorded the detection results. Using YOLOv5s, the average processing time per image was 319.92 ms, whereas using GDAD-YOLOv5 it was 201.95 ms. The visualized inference results in Figure 19 compare the detections of YOLOv5s and GDAD-YOLOv5. The improved model not only reduces missed detections and false positives but also improves inference speed, effectively enhancing walnut detection performance on the edge device.
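A minimal sketch of the edge-side inference loop is given below, using ONNX Runtime’s CPU execution provider. The model filename, input resolution, and preprocessing follow a typical YOLOv5 ONNX export and are assumptions for illustration; letterboxing and non-maximum suppression of the raw outputs are omitted.

```python
import time
import cv2
import numpy as np
import onnxruntime as ort

# Hypothetical paths; a real run would point at the exported GDAD-YOLOv5 model and a test image.
session = ort.InferenceSession("gdad_yolov5.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

img = cv2.imread("walnut.jpg")
blob = cv2.resize(img, (640, 640))[:, :, ::-1]              # BGR -> RGB (letterboxing omitted)
blob = blob.transpose(2, 0, 1)[None].astype(np.float32) / 255.0

start = time.perf_counter()
outputs = session.run(None, {input_name: blob})             # raw predictions before NMS
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"inference time: {elapsed_ms:.1f} ms, output shape: {outputs[0].shape}")
```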

5. Conclusions

In the research process of automated walnut harvesting, an accurate and stable recognition and localization system is crucial. Given the advantages of deep learning models and drone technology in agricultural production, we used drones to capture walnut data in complex environments, creating a walnut dataset that includes various field conditions. This dataset contains a total of 2490 walnut images, covering scenarios such as bright light, low light, occlusion, close-up views, distant views, and multiple fruits. Addressing the unsatisfactory detection results caused by the complex growth environment of walnuts, we proposed a green walnut detection algorithm, GDAD-YOLOv5, based on YOLOv5s. Our main improvements focus on the backbone network, feature fusion, and the loss function, including the lightweight GhostNet backbone, the efficient DE_C3, and the better-performing Alpha CIoU loss. Additionally, we compressed the model using pruning and knowledge distillation techniques. The final model, compared to YOLOv5s, reduced the number of parameters by 72.9%, computation by 84.1%, and model weight size by 70.1%, while increasing mAP0.5 by 1.1%, precision by 0.7%, and recall by 0.3%, and raising the detection speed (FPS) to 179.6% of the original model. Compared to the current mainstream algorithms YOLOv7-tiny, YOLOv8s, and YOLOv9t, our GDAD-YOLOv5 algorithm outperforms them in mAP0.5 by 1.0%, 0.7%, and 0.6%, respectively, reaching the level of current mainstream algorithms. In complex environmental recognition tasks, GDAD-YOLOv5 achieved a detection accuracy of 95.2%, a model size of only 4.3 MB, and an FPS of up to 135.1. It not only offers higher detection accuracy and faster detection speed but also has lower model complexity. Notably, in scenarios where the fruit’s background color is similar to the leaf color and in occlusion situations, GDAD-YOLOv5’s recognition performance has significantly improved, greatly reducing the number of missed and false detections. To meet practical deployment needs, we tested its deployment on a small edge device, where its detection performance remained stable and effective, meeting the requirements of practical deployment.
Although our algorithm has significantly improved walnut detection performance in complex environments, there are still some shortcomings. When deployed on small edge devices, due to limited hardware performance, the average detection time per image reached 201.95 ms, which presents some delay in real-time detection scenarios. In the next stage of our research, we will focus on model acceleration to enhance real-time detection on small edge devices. Additionally, we will enrich our walnut dataset, extending our research to include multispectral and LiDAR data to provide more valuable information for walnut agricultural research.

Author Contributions

Conceptualization, C.Y. and L.Y.; methodology, M.W.; software, C.Y.; validation, Y.X., Z.C. (Zaiqing Chen) and C.Y.; formal analysis, C.Y.; investigation, L.Y.; resources, Z.C. (Zhengda Cai); data curation, C.Y.; writing—original draft preparation, C.Y.; writing—review and editing, L.Y.; visualization, Z.C. (Zhengda Cai); supervision, M.W.; project administration, L.Y.; funding acquisition, Z.C. (Zhengda Cai). All authors have read and agreed to the published version of the manuscript.

Funding

Yunnan Province Applied Basic Research Program Key Project (202401AS070034); Yunnan Province Forest and Grassland Science and Technology Innovation Joint Project (202404CB090002).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are available by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Manterola-Barroso, C.; Padilla Contreras, D.; Ondrasek, G.; Horvatinec, J.; Gavilán CuiCui, G.; Meriño-Gergichevich, C. Hazelnut and Walnut Nutshell Features as Emerging Added-Value Byproducts of the Nut Industry: A Review. Plants 2024, 13, 1034. [Google Scholar] [CrossRef] [PubMed]
  2. Hua, X.; Li, H.; Zeng, J.; Han, C.; Chen, T.; Tang, L.; Luo, Y. A review of target recognition technology for fruit picking robots: From digital image processing to deep learning. Appl. Sci. 2023, 13, 4160. [Google Scholar] [CrossRef]
  3. Sa, I.; Zong, G.; Feras, D.; Ben, U.; Tristan, P.; Chris, M.C. Deepfruits: A Fruit Detection System Using Deep Neural Networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef] [PubMed]
  4. Supawadee, C.; Dailey, M.N. Texture-based fruit detection. Precis. Agric. 2014, 15, 662–683. [Google Scholar]
  5. Payne, A.B.; Walsh, K.B.; Subedi, P.P.; Jarvis, D. Estimation of mango crop yield using image analysis–segmentation method. Comput. Electron. Agric. 2013, 91, 57–64. [Google Scholar] [CrossRef]
  6. Payne, A.; Walsh, K.; Subedi, P.; Jarvis, D. Estimating mango crop yield using image analysis using fruit at ‘stone hardening’stage and night time imaging. Comput. Electron. Agric. 2014, 100, 160–167. [Google Scholar] [CrossRef]
  7. Bai, Y.; Mao, S.; Zhou, J.; Zhang, B. Clustered tomato detection and picking point location using machine learning-aided image analysis for automatic robotic harvesting. Precis. Agric. 2023, 24, 727–743. [Google Scholar] [CrossRef]
  8. Montoya-Cavero, L.E.; de León Torres, R.D.; Gómez-Espinosa, A.; Cabello, J.A.E. Vision systems for harvesting robots: Produce detection and localization. Comput. Electron. Agric. 2022, 192, 106562. [Google Scholar] [CrossRef]
  9. Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit detection and recognition based on deep learning for automatic harvesting: An overview and review. Agronomy 2023, 13, 1625. [Google Scholar] [CrossRef]
  10. Wang, Z.; Xun, Y.; Wang, Y.; Yang, Q. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54. [Google Scholar]
  11. Zhang, S.; Mingshan, X. Real-time recognition and localization based on improved YOLOv5s for robot’s picking clustered fruits of chilies. Sensors 2023, 23, 3408. [Google Scholar] [CrossRef] [PubMed]
  12. Nan, Y.; Zhang, H.; Zeng, Y.; Zheng, J.; Ge, Y. Intelligent detection of Multi-Class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  13. Xu, Z.; Huang, X.; Huang, Y.; Sun, H.; Wan, F. A real-time zanthoxylum target detection method for an intelligent picking robot under a complex background, based on an improved YOLOv5s architecture. Sensors 2022, 22, 682. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, J.; Ma, A.; Huang, L.; Su, Y.; Li, W.; Zhang, H.; Wang, Z. GA-YOLO: A Lightweight YOLO Model for Dense and Occluded Grape Target Detection. Horticulturae 2023, 9, 443. [Google Scholar] [CrossRef]
  15. Hou, G.; Chen, H.; Jiang, M.; Niu, R. An Overview of the Application of Machine Vision in Recognition and Localization of Fruit and Vegetable Harvesting Robots. Agriculture 2023, 13, 1814. [Google Scholar] [CrossRef]
  16. Wu, M.; Yun, L.; Xue, C.; Chen, Z.; Xia, Y. Walnut Recognition Method for UAV Remote Sensing Images. Agriculture 2024, 14, 646. [Google Scholar] [CrossRef]
  17. Hao, J.; Bing, Z.; Yang, S.; Yang, J.; Sun, L. Detection of green walnut by improved YOLOv3. Trans. Chin. Soc. Agric. Eng. 2022, 38, 183–190. (In Chinese) [Google Scholar]
  18. Zhong, Z.; Yun, L.; Yang, X.; Chen, Z. Research on Walnut Recognition Algorithm in Natural Environment Based on Improved YOLOX. J. Henan Agric. Sci. 2024, 53, 152–161. (In Chinese) [Google Scholar]
  19. Fan, X.; Xu, Y.; Zhou, J.; Liu, X.; Tang, J.; Wei, Y. Green Walnut Detection Method Based on Improved Convolutional Neural Network. Trans. Chin. Soc. Agric. Mach. 2021, 52, 149–155. (In Chinese) [Google Scholar]
  20. Fu, K.; Lei, T.; Halubok, M.; Bailey, B.N. Walnut Detection Through Deep Learning Enhanced by Multispectral Synthetic Images. arXiv 2023, arXiv:2401.03331. [Google Scholar]
  21. Rejeb, A.; Abdollahi, A.; Rejeb, K.; Treiblmaier, H. Drones in agriculture: A review and bibliometric analysis. Comput. Electron. Agric. 2022, 198, 107017. [Google Scholar] [CrossRef]
  22. Kecen, L.; Xiaoqiang, W.; Hao, L.; Leixiao, L.; Yanyan, Y.; Chuang, M.; Jing, G. Survey of one-stage small object detection methods in deep learning. J. Front. Comput. Sci. Technol. 2022, 16, 41. [Google Scholar]
  23. Staff, A.C. The two-stage placental model of preeclampsia: An update. J. Reprod. Immunol. 2019, 134, 1–10. [Google Scholar] [CrossRef] [PubMed]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  26. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Joseph, R.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  29. Joseph, R.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  30. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  32. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  34. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  35. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
  36. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  37. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  38. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  39. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  40. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  41. Hu, J.; Li, S.; Gang, S. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  42. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. Alpha-IoU: A family of power intersection over union losses for bounding box regression. arXiv 2021, arXiv:2110.13675. [Google Scholar]
  43. Lee, J.; Park, S.; Mo, S.; Ahn, S.; Shin, J. Layer-adaptive sparsity for the magnitude-based pruning. arXiv 2020, arXiv:2010.07611. [Google Scholar]
  44. Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  45. Molchanov, P.; Mallya, A.; Tyree, S.; Frosio, I.; Kautz, J. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  46. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. arXiv 2016, arXiv:1608.08710. [Google Scholar]
Figure 1. Geographic location of the study area.
Figure 2. Flowchart of walnut collection process.
Figure 3. Partial walnut dataset. (a) Bright light. (b) Dim light. (c) Occlusion. (d) Close-range. (e) Far-range. (f) Multiple fruits.
Figure 4. Improved model structure diagram. CBS represents Conv+BN+SiLU, UpSample denotes the upsampling operation, Concat represents the feature fusion operation, and C3_DE refers to the C3 structure proposed in this paper.
Figure 5. Ghost module.
Figure 6. Ghost bottleneck. (a) shows the bottleneck structure with stride = 1, and (b) shows the bottleneck structure with stride = 2.
Figure 7. Standard convolution and depthwise separable convolution. (a) Standard convolution. (b) Depthwise separable convolution.
Figure 8. ECA attention mechanism.
Figure 9. Improvement structure diagram of C3. (a) is the C3 bottleneck structure, (b) is the C3 structure, (c) is the DE_C3 bottleneck structure, and (d) is the DE_C3 structure.
Figure 10. CIoU bounding box loss.
Figure 11. CIoU and Alpha CIoU mAP0.5 curves.
Figure 12. mAP0.5 for different attention mechanisms. A represents SE, B represents CBAM, C represents CA, D represents ECA, E represents SimAM, F represents SKAttention, G represents DoubleAttention, H represents Triplet, I represents SpaticalGroupEnhance, J represents NAM, K represents ParNet, L represents GAM, M represents ParallelPolarized, N represents ParallelPolarized, O represents NO Attention.
Figure 13. Comparison of performance before and after improvement.
Figure 14. Pruning effect under different model speed-up ratios. (a) Parameter number variation. (b) Variation of computation. (c) Variation of P/R. (d) Variation of mAP0.5.
Figure 15. Distillation comparison results. (a) Variation of P/R. (b) Variation of mAP0.5.
Figure 16. Occlusion detection results. (a,c,e,g) represent the detection results of YOLOv5s, while (b,d,f,h) denote the detection results of GDAD-YOLOv5. Purple circles represent missed detections, and purple squares denote false detections.
Figure 17. Results of near-background detection. (a,c,e,g) represent the detection results of YOLOv5s, while (b,d,f,h) denote the detection results of GDAD-YOLOv5. Purple circles represent missed detections, and purple squares denote false detections.
Figure 18. Raspberry Pi 5.
Figure 19. Visualization of the inference results. (a,c,e,g) represent the detection results of YOLOv5s, while (b,d,f,h) denote the detection results of GDAD-YOLOv5. Purple circles represent missed detections.
Table 1. The number of walnut images in different scenarios.

Bright Light | Dim Light | Occlusion | Close-Range | Far-Range | Multiple Fruits
316 | 420 | 861 | 641 | 159 | 93
Table 2. Initialization parameter table.

Parameter | Value
epoch | 300
lr | 0.01
momentum | 0.937
weight_decay | 0.0005
batch_size | 16
optimizer | SGD
Table 3. Comparison of lightweight backbone networks.

Model | mAP0.5 (%) | GFLOPs (10^9) | Param (10^6)
YOLOv5s | 94.1 | 15.8 | 7.0
MobileNetv3 | 92.9 | 2.5 | 1.3
GhostNet | 93.9 | 7.7 | 4.9
ShuffleNetv2 | 92.1 | 6.9 | 3.3
EfficientNetv2 | 93.4 | 5.6 | 5.9
Table 4. Comparison results of different attention mechanisms.

Attention | mAP0.5 (%) | P (%) | R (%) | Parameters | GFLOPs (10^9)
no attention | 94.3 | 91.7 | 86.7 | 4,045,114 | 6.1
ECA | 94.5 | 91.0 | 87.8 | 4,045,126 | 6.1
SE | 94.5 | 90.4 | 88.8 | 4,057,914 | 6.1
Triplet | 94.5 | 90.9 | 87.7 | 4,046,314 | 6.1
SimAM | 94.5 | 89.9 | 88.5 | 4,045,144 | 6.1
Table 5. Results of ablation experiments.

Components | 1 | 2 | 3 | 4 | 5 | 6
GhostNet | | √ | √ | √ | √ | √
DE_C3 | | | √ | √ | √ | √
Alpha CIoU | | | | √ | √ | √
Prune | | | | | √ | √
KD | | | | | | √
mAP0.5 (%) | 94.1 | 93.9 | 94.5 | 94.8 | 94.3 | 95.2 (+1.1%)
P (%) | 90.9 | 89.8 | 91.0 | 91.4 | 89.9 | 91.6 (+0.7%)
R (%) | 88.4 | 88.8 | 86.7 | 88.2 | 88.2 | 88.7 (+0.3%)
Param (10^6) | 7.0 | 4.8 | 4.0 | 4.0 | 1.8 | 1.8 (−72.9%)
GFLOPs (10^9) | 15.8 | 7.7 | 6.1 | 6.1 | 2.5 | 2.5 (−84.1%)
Model Size/MB | 14.4 | 10.3 | 8.6 | 8.6 | 4.3 | 4.3 (−70.1%)

The check mark “√” indicates that this module has been included.
Table 6. Comparison of the experimental results.

Model | Param (10^6) | GFLOPs (10^9) | mAP0.5 (%) | Model Size/MB | FPS
YOLOv3-tiny | 8.7 | 12.9 | 92.3 | 17.4 | 105.3
YOLOv4-tiny | 5.8 | 16.2 | 90.8 | 23.6 | -
YOLOv5s | 7.0 | 15.8 | 94.1 | 14.4 | 75.2
YOLOv6s | 18.5 | 45.17 | 93.1 | 38.7 | 91.5
YOLOv7-tiny | 6.0 | 13.0 | 94.2 | 12.3 | 61.7
YOLOv8s | 11.1 | 28.4 | 94.5 | 22.5 | 90.9
YOLOv9t | 2.6 | 10.7 | 94.6 | 6.1 | 52.9
GDA-YOLOv5 | 4.0 | 6.1 | 94.8 | 8.6 | 126.6
GDA-YOLOv5 (prune) | 1.8 | 2.5 | 94.3 | 4.3 | 128.2
GDAD-YOLOv5 | 1.8 | 2.5 | 95.2 | 4.3 | 135.1
Table 7. Comparison of different pruning methods.

Prune_method | Param | GFLOPs (10^9) | mAP0.5 (%) | Model Size/MB
LAMP | 1,894,721 | 2.5 | 94.3 | 4.3
DepGraph [44] | 1,894,721 | 2.5 | 93.9 | 4.3
Taylor Pruning [45] | 1,894,721 | 2.5 | 93.7 | 4.3
L1-norm Pruning [46] | 1,894,721 | 2.5 | 93.4 | 4.3
Table 8. Occlusion detection results.

Model | Target Numbers | Missed Numbers | Wrong Numbers
YOLOv5s | 889 | 128 | 16
GDAD-YOLOv5s | 899 | 62 | 6
Table 9. Results of near-background detection.

Model | Target Numbers | Missed Numbers | Wrong Numbers
YOLOv5s | 918 | 119 | 18
GDAD-YOLOv5s | 918 | 63 | 6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
