Fruit detection using deep learning is a computer vision method used to localize and classify targets in an image or video. It has been widely utilized for ripeness recognition, yield prediction [1], harvesting/picking robot applications [2,3,4], fruit-quality detection [5], fruit estimation, fruit counting [6], etc. Nevertheless, the complex and changeable natural environment can make fruit detection challenging: leaf occlusion, overlapping fruits, variation of illumination, color, or brightness, plant parts with an appearance similar to the background, and nonstructural fields [7], among other factors, are commonly encountered in fruit detection. For jujube fruits in particular, intelligent perception and dataset acquisition for sorting and counting are among the most difficult fruit-detection tasks, owing to the fruits' multiple ripeness stages, clustered growth, and the complex background of leaves, branches, and stems. Notably, the fruit detection model is a vital component of an automatic recognition system, which comprises a computer vision and a computational platform; in other words, the success of the detection, sorting, and counting of target fruit in an image or video depends on the robustness of the fruit detection model.
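To make the sorting-and-counting stage described above concrete, the toy sketch below tallies detections per ripeness class once a detector has emitted (class, confidence) outputs. The class names, the 0.5 confidence threshold, and the tuple format are illustrative assumptions, not taken from any cited system.

```python
from collections import Counter

def count_by_class(detections, conf_threshold=0.5):
    """Tally detections per class label, keeping only confident ones.

    `detections` is a list of (class_label, confidence) pairs, such as a
    YOLO-style detector might emit after non-maximum suppression.
    The 0.5 threshold is an illustrative default, not a cited setting.
    """
    return Counter(label for label, conf in detections if conf >= conf_threshold)

# Hypothetical detector output for one jujube image: three confident
# detections and one low-confidence detection that is dropped.
dets = [("ripe", 0.91), ("ripe", 0.88), ("half-ripe", 0.74), ("ripe", 0.32)]
counts = count_by_class(dets)
```

In a full system, a robust detection model would supply these (class, confidence) pairs per frame, and the counting step inherits any detection errors, which is why model robustness dominates the pipeline's overall accuracy.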
You Only Look Once (YOLO) is a single-stage object detector that has shown remarkable performance for fruit detection. The quest for speed, low computational cost, and applicability to low-power computing devices led to the introduction of the tiny YOLO structure. Based on YOLOv3-tiny [8], a fast and accurate kiwifruit detection method developed by Fu et al. [9] reported an average precision (AP) of 90.05% and a detection speed of 29.4 frames per second (fps). However, the model weight-size was still large and the detection speed slow. The FL-YOLOv3-tiny proposed by Tan et al. [
10] for underwater object detection achieved a 10.9% AP and 29% fps improvement over YOLOv3-tiny. Nevertheless, the robustness of the proposed model for generalization remains questionable. Gai et al. [11] reported an AP of 95.56% and a speed of 35.5 fps, an improvement on YOLOv3-tiny; however, the model was not tested on fruits with complex backgrounds and retained a large weight-size with a slower detection speed. Bochkovskiy et al. [
12] upgraded YOLOv3 to YOLOv4. Similarly, Lawal [2] and Liu et al. [13] demonstrated that the illumination-variation and occlusion factors of fruit detection are solvable using an improved YOLOv3 model. The YOLO-Oleifera model proposed by Tang et al. [14], based on a modified YOLOv4-tiny, achieved an AP of 92.07% with a weight-size of 29 MB and an average detection speed of 32.3 fps per fruit image. However, the detection speed is still slow and the weight-size large. A robust real-time pear fruit counter for mobile applications using YOLOv4-tiny by Parico et al. [15] recorded a speed of more than 50 fps and an AP of 94.19%; however, its weight-size of 22.97 MB implies a high computational cost. The GCS-YOLOv4-tiny [16] model for multi-stage fruit detection achieved an AP of 93.42% with a model weight-size of 20.70 MB. Nevertheless, the model was not tested for real-time detection, missed detections in dense small-target images, and its weight-size remained large. YOLOv5 [
17], which offers an excellent detection speed relative to YOLOv4, was subsequently presented. The apple target detection model based on a modified YOLOv5 developed by Yan et al. [18] reported an accuracy of 86.75% and a detection speed of 66.7 fps; however, its detection speed still needs further improvement. Building on YOLOv5s, Zhang et al. [19] introduced a ghost module (Han et al. [20]) and the SCYLLA-IoU (SIoU) loss function into the network to improve the detection of dragon fruit in the natural environment, achieving an AP of 97.4% with a weight-size of 11.5 MB. However, in this case the target is clearly distinct from the background; the generalization of the model has therefore yet to be ascertained on fruit datasets with complex backgrounds, particularly jujube fruits, and the weight-size remains large. The YOLOv5s-cherry model proposed by Gai et al. [21] for cherry detection achieved F1 scores 0.08 and 0.03 higher than those of YOLOv4 and YOLOv5s, respectively, but also needs further improvement. A counting method for red jujube based on a modified YOLOv5s proposed by Qiao et al. [
22] recorded an AP of 94% and a speed of 35.5 fps using a ShuffleNetv2 [23] backbone. Nevertheless, the robustness of that model is uncertain, because it was tested only on fully mature red jujube fruits, and its detection speed remains slow. Recently, YOLOv7, introduced by Wang et al. [24], was reported to surpass other known object detectors, including YOLOv4 and YOLOv5, achieving the highest accuracy of 56.8% at detection speeds ranging from 5 to 160 fps on the MS COCO dataset. The extended efficient layer aggregation networks (E-ELAN) utilized by YOLOv7 primarily focus on the parameters and computational density of the model for performance improvement. Apart from Zhang et al. [19], who experimented with YOLOv7 and YOLOv7-tiny for dragon fruit detection, and Chen et al. [25], with a modified YOLOv7 for citrus detection, investigations of jujube fruit detection using YOLOv7-tiny are seldom reported. The large weight-sizes and parameter counts common to all the mentioned YOLO variants are a major obstacle to realizing faster detection speeds and deployment on low-power computing devices. Moreover, these methods were aimed at relatively sparse fruits to justify their high AP and cannot solve the detection problem of small, dense fruits against a similar image background. Therefore, it is necessary to address the limiting factors of the fruit detection model, including large weight-size and parameter count, low detection speed, and accuracy, and to investigate this seldom-explored architecture using a jujube fruit dataset of small and dense images.
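Since every method above is compared by its AP, it may help to recall that AP evaluation rests on intersection-over-union (IoU) matching between predicted and ground-truth boxes: a prediction typically counts as a true positive when its IoU with an unmatched ground-truth box exceeds a threshold such as 0.5. The sketch below is a minimal IoU computation under the common (x1, y1, x2, y2) corner convention; it is an illustrative aid, not code from any cited model.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in [0, 1]; 0 for disjoint boxes, 1 for identical boxes.
    """
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Losses such as SIoU extend this plain IoU with angle, distance, and shape penalty terms during training, while the reported AP figures use IoU only at evaluation time for matching.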
This study designed a detection network based on YOLOv5 to reduce the model size and improve detection speed and accuracy for the real-time detection, sorting, and counting of jujube fruit in complex scenes. The main contributions of this paper are as follows: