*2.4. Methodologies*

The YOLO series is effective in single-stage object detection, and its lightweight detection models achieve high accuracy with fast speed and few parameters. The lightweight YOLO detection models are therefore well suited to embedded devices for developing mobile agricultural equipment. However, because of the complexity of the agricultural production environment and the harsh working conditions, a simple detection algorithm can hardly meet the demands of agricultural production. In this research, based on YOLOv5s, the original backbone network was replaced by the ShuffleNet V2 backbone, which significantly reduced the number of parameters of the network. The Focus module was replaced by the Stem to resist partial information loss in the feature map. PANet was replaced by BiFPN to enhance the feature fusion capability and improve the accuracy of the model. Finally, the improved YOLOv5s detection network was used to identify images and count red jujubes.

#### 2.4.1. YOLOv5s Network

YOLOv5 improves on YOLOv4 by adding several new ideas, and its detection accuracy and speed are greatly improved. YOLOv5 can be divided into four types according to model size: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, among which YOLOv5s is the smallest. YOLOv5s mainly consists of four parts: Input, Backbone, Neck, and Prediction.

To improve the speed and accuracy of the network, YOLOv5 uses Mosaic data augmentation to stitch images by random cropping, scaling, and arrangement. YOLOv5s uses adaptive anchor box calculation to set the initial anchor boxes for different datasets and computes the difference between the predicted bounding boxes and the ground truth. YOLOv5s updates the anchor boxes during the backward iterations to adaptively compute the best anchor boxes for different training sets. To adapt to images of different sizes in the dataset, YOLOv5 uses adaptive image scaling to fill the scaled image with the least amount of black border, which reduces computation and improves speed. The Backbone extracts information from the feature maps. It mainly includes Focus, CBS, and C3. The input image is sliced by the Focus module and then processed by one convolution with 32 kernels, as shown in Figure 4. CBS consists of a convolution, a batch normalization, and the SiLU activation. The SiLU is defined as follows:

$$SiLU(\mathbf{x}) = \frac{\mathbf{x}}{1 + \exp(-\mathbf{x})} \tag{1}$$

where *x* represents the feature map.
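For illustration, Equation (1) can be sketched in a few lines of NumPy (a hedged sketch, not the paper's code; the `silu` helper name is ours):

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation of Equation (1): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

# SiLU is zero at x = 0 and approaches x for large positive x.
print(silu(np.array([0.0, 5.0])))
```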

**Figure 4.** Focus structure.
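The slicing in Figure 4 can be illustrated with NumPy: the input is sampled at every other pixel in four phases, halving the spatial size and quadrupling the channels before the 32-kernel convolution (a sketch under our own naming, not the paper's implementation):

```python
import numpy as np

def focus_slice(x: np.ndarray) -> np.ndarray:
    """Slice a (C, H, W) map into four phase-shifted sub-maps and stack
    them along the channel axis, as the Focus module does before its
    convolution: (C, H, W) -> (4C, H/2, W/2)."""
    return np.concatenate(
        [x[:, ::2, ::2], x[:, 1::2, ::2], x[:, ::2, 1::2], x[:, 1::2, 1::2]],
        axis=0,
    )

x = np.zeros((3, 640, 640))
print(focus_slice(x).shape)  # (12, 320, 320)
```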

As a new structure of BottleneckCSP, C3 contains three CBS modules and several Bottlenecks. C3 is used repeatedly in YOLOv5s to extract more information. As shown in Figure 5, the SPP (spatial pyramid pooling) module introduces three pooling kernels of 5 × 5, 9 × 9, and 13 × 13 and concatenates the resulting feature maps to expand the receptive field, which effectively separates the most important features and improves the accuracy of the model.

**Figure 5.** SPP structure.
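The SPP operation can be sketched in NumPy as stride-1 max pooling with "same" padding followed by concatenation; this is an illustrative re-implementation under our own helper names, not the YOLOv5 source:

```python
import numpy as np

def maxpool_same(x: np.ndarray, k: int) -> np.ndarray:
    """Stride-1 max pooling with 'same' padding over a (C, H, W) map."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(x: np.ndarray) -> np.ndarray:
    """Concatenate the input with its 5x5, 9x9, and 13x13 max-pooled
    copies, quadrupling the channels while keeping the spatial size."""
    return np.concatenate([x] + [maxpool_same(x, k) for k in (5, 9, 13)], axis=0)

x = np.random.rand(8, 20, 20)
print(spp(x).shape)  # (32, 20, 20)
```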

To utilize most of the backbone information, the Neck of YOLOv5 uses FPN + PAN. The Feature Pyramid Network (FPN) solves the problem of different input feature map sizes by constructing an image pyramid on the feature maps. PAN, the innovation of the path aggregation network (PANet) [37], downsamples the images from FPN and then concatenates them. To improve image recognition and localization, FPN carries semantic features of the image down from the top, while PAN carries localization features of the image up from the bottom.

Several regression loss functions are used in object detection tasks, such as the Smooth L1 Loss function [16], IOU Loss function [38], GIOU Loss function [39], DIOU Loss function [40], and CIOU\_Loss function [41]. In the Prediction part, YOLOv5 uses CIOU\_Loss as the loss function of the bounding box. The CIOU\_Loss function is defined as follows:

$$L\_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{2}$$

where *IOU* represents the intersection over union of the prediction box and the object box; *b* represents the center point of the prediction box; *b^gt* represents the center point of the object box; *ρ*2(*b*, *b^gt*) represents the squared Euclidean distance between the center points of the prediction box and the object box; *c* represents the diagonal length of the smallest box enclosing both boxes; *α* represents a positive trade-off parameter; and *υ* measures the consistency of the aspect ratios.
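Equation (2) can be made concrete with a small sketch of the CIOU terms for axis-aligned boxes given as (x1, y1, x2, y2); the helper name and box format are our own assumptions for illustration:

```python
import numpy as np

def ciou_loss(box_p, box_g, eps=1e-9):
    """CIOU loss of Equation (2) for two (x1, y1, x2, y2) boxes."""
    # Intersection over union.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared center distance rho^2 and enclosing-box diagonal c^2.
    rho2 = ((box_p[0] + box_p[2]) / 2 - (box_g[0] + box_g[2]) / 2) ** 2 \
         + ((box_p[1] + box_p[3]) / 2 - (box_g[1] + box_g[3]) / 2) ** 2
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency v and trade-off parameter alpha.
    v = (4 / np.pi ** 2) * (np.arctan((box_g[2] - box_g[0]) / (box_g[3] - box_g[1]))
                            - np.arctan((box_p[2] - box_p[0]) / (box_p[3] - box_p[1]))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# Identical boxes give IOU = 1 and a loss of (approximately) zero.
print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))
```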

#### 2.4.2. ShuffleNet V2 Backbone

YOLOv5s reduces the parameters of the model through C3 and improves the speed of the model, but C3 is still complicated, computationally expensive, and memory hungry. A lightweight YOLOv5 model based on ShuffleNet V2 was therefore designed, which greatly reduced the model parameters. The ShuffleNet V2 backbone was built from ShuffleNet V2 units [42], and the backbone of the original model was replaced with it.

ShuffleNet V2, a lightweight convolutional neural network suitable for mobile devices, was first proposed in 2018. Compared with ShuffleNet V1, ShuffleNet V2 adopts channel split and channel shuffle: the feature channels are divided into two parts, keeping the numbers of input and output channels equal; one part enters the bottleneck branch, and the other part passes through unchanged. Because excessive pointwise convolution increases computational complexity, ShuffleNet V2 replaces the grouped pointwise convolution with standard pointwise convolution. ShuffleNet V2 places the channel shuffle after the channel concatenation to prevent fragmentation of the model, and it replaces element-wise addition with concatenation to reduce detection time. The basic units of ShuffleNet V2 are of two types, shown in Figure 6. In the first unit, the channels of the input feature map are split into two branches: one branch connects directly to the concatenation, while the other contains two 1 × 1 pointwise convolution layers and a 3 × 3 depthwise convolution layer. The convolution layers include batch normalization and ReLU. The second unit has no channel split, and both branches process the input: one branch contains a 3 × 3 depthwise convolution layer with a stride of 2 followed by a 1 × 1 pointwise convolution layer, and the other contains a 1 × 1 pointwise convolution, a 3 × 3 depthwise convolution with a stride of 2, and a 1 × 1 pointwise convolution. The two branches produce feature maps of the same size, which are spliced together. To extract information from feature maps of different sizes, the ShuffleNet V2 backbone was designed to replace the original backbone using 16 ShuffleNet V2 units in YOLOv5s.

**Figure 6.** The structure of ShuffleNet-v2 Units. (**a**) the structure of ShuffleNet-v2 Unit1. (**b**) the structure of ShuffleNet-v2 Unit2.
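The channel shuffle that follows the concatenation can be sketched as a reshape-transpose-reshape over the channel axis (an illustrative NumPy sketch; the function name is ours):

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Channel shuffle on a (C, H, W) feature map: view the channel axis
    as (groups, C // groups), transpose, and flatten back, so channels
    from the two concatenated branches are interleaved."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w) \
            .transpose(1, 0, 2, 3) \
            .reshape(c, h, w)

# Channels 0-1 from branch A and 2-3 from branch B become 0, 2, 1, 3.
x = np.arange(4).reshape(4, 1, 1)
print(channel_shuffle(x, 2).ravel())  # [0 2 1 3]
```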

#### 2.4.3. Stem Construction

Inception-v4 [43], proposed in 2017, confirmed that residual connections largely accelerate the training of Inception networks; it ultimately achieved a top-5 error rate of 3.08% on ILSVRC. In Inception-v4, a Stem block rapidly reduces the resolution of the input feature maps from 299 × 299 down to 35 × 35, and its many convolution layers benefit feature extraction in complex tasks. However, detecting the single target of red jujube is a simpler task, for which such a Stem would cause excessive computation, so the model could be pruned to reduce its parameters. The Stem used in this research is shown in Figure 7. Inspired by the idea of fast feature map resolution reduction, four CBS blocks were adopted to make the feature map size suitable for the network: 3 × 3 convolutions with a stride of 2 were used in the first and third CBS, and 1 × 1 convolutions were used in the second and fourth. In contrast to the Focus, which sliced the feature map into four sub-maps before concatenation, the Stem used two 3 × 3 convolutions with a stride of 2 to reduce the feature map size and concatenated the result with the feature map of a maximum pooling layer, so the number of parameters was reduced while the feature extraction ability and accuracy of the network were improved.
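The resolution reduction of the four CBS blocks can be checked with simple shape arithmetic (a sketch; the padding of 1 for the 3 × 3 convolutions is our assumption, as the paper does not state it):

```python
def conv_out(size: int, k: int, s: int, p: int) -> int:
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

# Four CBS blocks of the Stem: the two stride-2 3x3 convolutions each
# halve the feature map, and the 1x1 convolutions keep its size.
size = 640
for k, s, p in [(3, 2, 1), (1, 1, 0), (3, 2, 1), (1, 1, 0)]:
    size = conv_out(size, k, s, p)
    print(size)  # 320, 320, 160, 160
```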

#### 2.4.4. BiFPN

As the network deepens, the semantic information of image features gradually changes from low-dimensional to high-dimensional. As shown in Figure 8, the PANet structure was used to fuse the multi-scale features of images in the original YOLOv5s detection network. To improve the detection accuracy of red jujubes, BiFPN, a weighted bidirectional feature pyramid network, was applied to the detection of red jujubes. Compared with the traditional feature fusion network, BiFPN introduces weights that make it more sensitive to important features and make better use of feature information at different scales.

**Figure 7.** The structure of the Stem.

**Figure 8.** Bi-directional feature fusion network. (**a**) PANet with bi-directional feature fusion network, (**b**) BiFPN with bi-directional feature fusion network.

In this research, BiFPN was introduced in the neck of YOLOv5s, as shown in Figure 9. A node that has only one input edge performs no feature fusion and contributes little to the feature fusion of the network, so deleting it has little effect. When the original input node and the output node are in the same layer, an extra edge is added between them, and feature fusion is realized without much additional computational overhead. Unlike the PANet structure of YOLOv5s, each bidirectional path is treated as one feature network layer during feature fusion, and this layer is reused at the same level, thus realizing a higher level of feature fusion.
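The weighted fusion at each BiFPN node can be sketched as the fast normalized fusion proposed with BiFPN, O = Σ(w·I) / (ε + Σw), with the weights kept non-negative; the fixed example weights below are illustrative, since in practice they are learned:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN's fast normalized fusion: O = sum(w_i * I_i) / (eps + sum(w_j)),
    with the learnable weights clipped to be non-negative."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))

# Two same-size feature maps fused with equal weights average out.
a, b = np.ones((8, 8)), 3 * np.ones((8, 8))
print(fast_normalized_fusion([a, b], [1.0, 1.0])[0, 0])  # ~2.0
```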

**Figure 9.** The structure of the improved YOLOv5s model.

#### 2.4.5. Counting Method of Red Jujube

The counting method of red jujubes was based on the improved jujube target detection algorithm. This research used ROS to count red jujubes. The detection steps were as follows: (1) starting the ROS core and publishing topics; (2) using the improved YOLOv5s to detect jujube fruits and obtain the target detection boxes and corresponding features; (3) counting the number of target detection boxes, as shown in Figure 10a. The detection results are shown in Figure 10b.

**Figure 10.** Counting method of red jujube. (**a**) the process of the red jujube counting method; (**b**) the results of the jujube counting method.

#### *2.5. Test Platform*

The experiment was conducted on the improved YOLOv5s architecture with PyTorch based on Python 3.8. The details of the experimental setup are shown in Table 2.

**Table 2.** Experimental environment.


The batch size was 4, and the number of epochs was 400. The adaptive moment estimation algorithm (Adam) was used to optimize the model. The initial learning rate was 0.001, and the momentum was 0.9. The model weights were saved after each training epoch, and the best weights were also saved.

#### *2.6. Evaluation of Model Performance*

To evaluate the performance of our red jujube model, Precision (P), Recall (R), F1-score, Average Precision (AP), Parameters, Model Size, and detection speed (Fps) were chosen in this article, and root mean square error (RMSE) and mean absolute percentage error (MAPE) were used as evaluation indexes of jujube counting. Precision, Recall, F1-score, RMSE, and MAPE were defined as follows:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \times 100\% \tag{3}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \times 100\% \tag{4}$$

$$\text{F1} - \text{score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

$$\text{RMSE} = \sqrt{\frac{1}{m} \sum\_{i=1}^{m} \left(\mathbf{y}\_i - \mathbf{\hat{y}}\_i\right)^2} \tag{6}$$

$$\text{MAPE} = \frac{1}{m} \sum\_{i=1}^{m} \left| \frac{\mathbf{y}\_i - \mathbf{\hat{y}}\_i}{\mathbf{y}\_i} \right| \times 100\text{\%} \tag{7}$$

where TP represents the number of true positive samples, FP the number of false positive samples, and FN the number of false negative samples. The variable *y_i* represents the actual number of red jujubes in each image, *ŷ_i* represents the number of red jujubes predicted by the model for each image, and *m* represents the number of image samples.
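Equations (3)–(7) translate directly into code; the sketch below (our own helper names) computes the detection metrics from TP/FP/FN counts and the counting errors from per-image counts:

```python
import numpy as np

def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1-score of Equations (3)-(5), in percent."""
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def counting_errors(y_true, y_pred):
    """RMSE and MAPE of Equations (6) and (7) over m images."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return rmse, mape

print(detection_metrics(tp=90, fp=10, fn=10))  # (90.0, 90.0, 90.0)
print(counting_errors([20, 25, 30], [18, 25, 33]))
```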

#### **3. Results and Discussion**

#### *3.1. Performance Comparison Using the Different Improvement Methods*

As shown in Table 3, Recall and Precision were based on a 0.5 threshold. The area under the Precision-Recall curve is one of the important indicators for evaluating a model: the larger the area, the higher the AP of the model.
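That relationship between curve area and AP can be sketched numerically: AP is the area under the sampled precision-recall points (a simplified trapezoidal sketch with made-up sample points, not the interpolation YOLOv5's evaluator actually uses):

```python
import numpy as np

def average_precision(recall, precision) -> float:
    """Trapezoidal area under a sampled precision-recall curve,
    after sorting the points by recall."""
    order = np.argsort(recall)
    r = np.asarray(recall, dtype=float)[order]
    p = np.asarray(precision, dtype=float)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

# A model whose precision stays high as recall grows has a larger area.
r = [0.0, 0.5, 1.0]
print(average_precision(r, [1.0, 1.0, 1.0]))  # 1.0
print(average_precision(r, [1.0, 0.8, 0.4]))  # 0.75
```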


**Table 3.** The model performance with a different module.

Using ShuffleNet V2 as the backbone of the network reduced the model parameters by a factor of 14.41 and increased the Fps from 35.10 to 35.47, showing that the improved network could reduce model parameters and increase detection speed. BiFPN was then applied to the red jujube detection network. The experimental results showed that BiFPN improved the average precision of the network without increasing its parameters and, at the same time, improved the detection speed of the model: the average precision increased by 0.20%, and the Fps increased to 39.40. Therefore, BiFPN could enhance the feature fusion ability of YOLOv5s and speed up detection. When the Focus was replaced by the Stem, the network improved in Recall, F1-score, AP, model size, and Fps, with Recall increasing by 3.60%, so the Stem is more effective than the Focus in jujube detection. When the Stem and BiFPN were used at the same time, the AP increased by 0.60% compared with YOLOv5s, but the parameters increased, which raised the computational pressure on the testing equipment. When the Stem and ShuffleNet V2 were applied at the same time, the parameters were greatly reduced compared with YOLOv5s, but the detection accuracy was also lower. Our method not only reduced the model parameters but also improved the detection accuracy. The parameters and model size of the improved model were 6.25% and 8.33% of the original network, respectively, and the Precision, Recall, F1-score, AP, and Fps increased by 4.30%, 2.00%, 3.10%, 0.60%, and 3.99%, respectively.

As a lightweight network model, YOLOv5s has high accuracy and can detect small targets in complex environments, but it struggles to identify and localize red jujubes under limited computation. When localizing and recognizing overlapping fruits, the original YOLOv5s tended to identify two mutually occluded red jujubes as one, as shown in Figure 11b. The main reason was that the differences between mutually occluded fruits were small, and the original YOLOv5s did not extract enough of their feature information, causing false detection. When recognizing small red jujube targets, the original YOLOv5s easily missed jujubes that were obscured by large areas of leaves or that appeared small because the camera was too far away, as shown in Figure 11e. The main reason was that the outdoor environment was complex and the appearance of the red jujubes varied widely. The improved model could accurately detect red jujubes, including occluded ones, as shown in Figure 11c, and the number of missed jujubes was obviously smaller than with the original YOLOv5s, as shown in Figure 11f.

**Figure 11.** The results of different algorithms for the recognition of red jujube. (**a**) the original image of a dense jujube sample. (**b**) dense jujube detection by the original model. (**c**) dense jujube detection by the improved model. (**d**) the original image of leaf-obscured jujube. (**e**) leaf-obscured jujube detection by the original model. (**f**) leaf-obscured jujube detection by the improved model. The red boxes are the manually marked label boxes, and the blue boxes are the model's detection results.

#### *3.2. Performance Comparison Using the Different Lightweight Backbone Networks*

For deployment on embedded mobile devices, the ShuffleNet V2 backbone network was used in YOLOv5s in this research. MobileNet V3, the improved version of MobileNet V1 and MobileNet V2, offers a large improvement in detection efficiency. To verify the detection performance of the improved model, the MobileNet V3 network was also used as the backbone of YOLOv5s and compared with the improved YOLOv5s using the ShuffleNet V2 backbone and with the original YOLOv5s. The results show that, with MobileNet V3 as the backbone, the network improves greatly in Precision but drops greatly in Recall, so that the MobileNet V3 variant and the original YOLOv5s reach the same AP, as shown in Table 4. In addition, it misses some jujube fruits, as shown in Figure 12. Compared with YOLOv5s, the MobileNet V3 variant significantly reduces the Parameters and Model Size. Therefore, using a lightweight network as the backbone reduces the size of the model while maintaining accuracy.


**Table 4.** The comparison of different backbone networks.

**Figure 12.** Test results of different lightweight backbone networks. The blue boxes are the model's detection results.

The Precision and AP of the model using ShuffleNet V2 as the backbone were slightly lower than those of the original YOLOv5s and of the improved YOLOv5s using MobileNet V3 as the backbone. However, using ShuffleNet V2 as the backbone provided more comprehensive red jujube detection: when MobileNet V2 and GhostNet were used as backbones, some red jujubes were missed, as shown in Figure 12. Compared with the other four detection models, the number of parameters with ShuffleNet V2 as the backbone was only 7.14% of that of YOLOv5s, obviously smaller than the other networks, and the detection speed of the ShuffleNet V2 backbone model was also faster, as shown in Table 4. Using ShuffleNet V2 as the backbone not only greatly reduced the number of model parameters but also improved the detection speed, making it more suitable for red jujube counting and related embedded mobile devices.

#### *3.3. Performance Comparison in Counting Jujubes Using the Different Algorithms*

To verify the effectiveness of the improved YOLOv5s for target detection, YOLOv3-tiny, YOLOv4-tiny, Faster R-CNN, SSD, YOLOvx-tiny, and YOLOv7-tiny were selected for comparison with the improved YOLOv5s. This research trained the selected comparison models on datasets of the same size with the same training and test sets. To ensure the reliability of the test, the number of epochs was set to 400, and the batch size was set to 4. In addition, three orchard jujube images were selected to test the yield estimation method. The comparison results are shown in Table 5, and the P-R curves of the models are shown in Figure 13.


**Table 5.** Detection results of red jujubes with different target detection algorithms.

**Figure 13.** The PR curve of red jujubes with different target detection algorithms.

The P-R curve plots precision against recall, and the area under the curve reflects the comprehensive performance of a target detection model on red jujubes. Figure 13 shows that the curve areas of YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOvx-tiny, and YOLOv7-tiny are larger than those of SSD and Faster R-CNN, illustrating that the YOLO series detection networks have higher accuracy and better recognition of red jujubes. Although YOLOv5s is an improved detection network relative to YOLOv3-tiny and YOLOv4-tiny, it does not obtain the best detection results for red jujubes, as shown in Table 5. YOLOv4-tiny has better detection results, but YOLOv5s is smaller in model size and more suitable for agricultural mobile devices. Compared with the classical networks, the improved network not only maintains better detection performance but also greatly reduces the model size.

Different detection algorithms were used to count red jujubes. YOLOvx-tiny, YOLOv5s, SSD, and Faster R-CNN all counted fewer red jujubes than the actual number, as shown in Figure 14 image1. YOLOv7-tiny, YOLOv5s, and Faster R-CNN produced repeated recognitions while counting red jujubes, which made the counting results higher than the actual number, as shown in Figure 14 image2 and image3. Erroneous counting occurred when SSD counted red jujubes, as shown in Figure 14 image3. For image4, only YOLOv4-tiny and our model counted accurately. Our model also missed some red jujubes, but compared with the other algorithms, the number of missed detections was smaller, as shown in Figure 14 image5. When counting shaded red jujubes, all algorithms could count effectively, as shown in Figure 14 image6.


**Figure 14.** Test results of different algorithms. The blue boxes are the model's detection results.

According to the experimental results, YOLOv5s, YOLOv4-tiny, and Faster R-CNN all miss detections of red jujube, which reduces the counted number of red jujubes, while YOLOv3-tiny, SSD, and Faster R-CNN all produce erroneous recognitions, which increases the model's estimation error for jujube yield, as shown in Figure 14. Faster R-CNN, one of the representative two-stage detection models, has good overall detection performance for red jujubes, but its AP is lower than that of the other detection networks, and its RMSE and MAPE are the largest, as shown in Table 5. This difference mainly appears as difficulty in recognizing fruits heavily shaded by leaves and poor recognition of overlapping fruits. The reason is that Faster R-CNN does not build an image feature pyramid and cannot sufficiently extract features of small targets, making it insensitive to small target recognition. The single-stage detection models, both the YOLO series and SSD, perform better overall than Faster R-CNN. Comparing SSD with YOLOv5s, the Precision of SSD is lower by 0.80%, the Recall by 3.20%, and the AP by 5.10%, while the RMSE is higher by 45.75% and the MAPE by 6.86%. The main reasons are: (1) YOLOv5s introduces FPN + PAN, and its detection layers fuse three levels of feature layers, whereas all six feature pyramid layers of SSD come from the last part of the backbone, so YOLOv5s detects red jujubes better than SSD; (2) because of the limited number of red jujubes and the severe occlusion between them, it is difficult for the model to learn their various states. Compared with YOLOvx-tiny and YOLOv7-tiny, the AP of the improved network increased by 0.50% and 1.10%, the RMSE decreased by 2.20 and 1.44, and the MAPE decreased by 18.15% and 10.92%, respectively.
Compared with YOLOv5s, we introduced the ShuffleNet V2 backbone to reduce the size of the model, but this limited the feature extraction ability, so the idea of resizing images by convolution layers was adopted and the Stem was added to enhance the feature extraction ability of the network. The improved model outperforms YOLOv5s overall, with Precision, Recall, and AP improving by 4.30%, 2.00%, and 0.60%, and the model size, RMSE, and MAPE decreasing by 91.82%, 20.87%, and 5.18%, respectively. Among the comparison networks, the improved model has the highest Precision, Recall, F1-score, and AP and the smallest model size, RMSE, and MAPE.

#### **4. Conclusions**

In this research, a counting method for red jujubes based on an improved YOLOv5s was proposed to achieve accurate detection and counting of red jujubes in a complex environment while reducing the model size. To reduce the number of parameters, ShuffleNet V2 was used as the backbone to make the model lightweight. In addition, the Stem module was designed as an intermediate module between the input and the backbone to prevent the information loss caused by the change in feature map size. PANet was replaced by BiFPN for multi-scale feature fusion to enhance the feature fusion capability and improve the accuracy of the model. Finally, the improved YOLOv5s detection model was used to count red jujubes. To verify the efficiency of the proposed model, YOLOv5s, YOLOv3-tiny, YOLOv4-tiny, SSD, Faster R-CNN, YOLOvx-tiny, and YOLOv7-tiny were compared with the improved model. The results showed that the improved model not only greatly reduced the model size but also outperformed the comparison networks in detection results. Compared with YOLOv5s, Precision, Recall, and AP improved by 4.30%, 2.00%, and 0.60%, respectively, and the model size, RMSE, and MAPE decreased by 91.82%, 42.21%, and 11.47%, respectively. Therefore, the improved YOLOv5s model can not only effectively improve the detection performance for red jujubes but also accomplish the task of counting red jujubes in agricultural production. The method can provide a basis for vision-based estimation of jujube yield.

In summary, a counting method of red jujube based on improved YOLOv5s was proposed in this research, and the counting effectiveness of the method was verified by experiments. The future work of the red jujube counting method is as follows:


**Author Contributions:** Data curation, methodology, project administration, writing—original draft, writing—review and editing, Y.Q.; review & editing, supervision, funding acquisition, and project administration, Y.H.; data curation, Z.Z.; formal analysis, H.Y.; formal analysis, K.Z.; review & editing, supervision, funding acquisition, and project administration, J.H.; review & editing, supervision, J.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Talent start-up Project of Zhejiang A&F University Scientific Research Development Foundation (2021LFR066) and the National Natural Science Foundation of China (C0043619, C0043628).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

