1. Introduction
Dragon fruit (Hylocereus undatus), also known as pitahaya or pitaya, is a tropical and subtropical fruit that has gained popularity among consumers due to its nutritional value and pleasant taste [1,2]. China is the leading global producer, with an annual output of 1.297 million tons in 2019. Accurate monitoring of flower and fruit quantities at various growth stages is vital for growers to forecast yield and plan orders. Nevertheless, dragon fruit growers in different regions still rely largely on labor-intensive, time-consuming, and inefficient manual counting [3,4].
With the advancement of information technology, machine vision has emerged as a significant tool for plant monitoring and yield estimation, propelled by its cost-effectiveness, ease of data acquisition, and high accuracy. Orchards typically feature wide row spacing, enabling the collection of video data from ground-based moving platforms and thus facilitating large-scale detection. Video-based target counting involves two essential steps: target recognition and target tracking. Traditional recognition algorithms such as the Deformable Part Model (DPM) [5] and the Support Vector Machine (SVM) have gradually been replaced by deep learning algorithms because of their limited detection accuracy, robustness, and speed.
Currently, deep learning-based object detection algorithms can be broadly categorized into two-stage algorithms based on candidate regions and one-stage algorithms based on regression. Two-stage algorithms such as Faster R-CNN [6] have found extensive application in the detection of fruits (prickly pear [7], apples [8], passion fruit [9], etc.), crops (sugarcane seedlings [10], weeds [11]), and diseases (sweet pepper diseases [12], soybean leaf diseases [13]), achieving accuracy levels ranging from 55.5% to 93.67% depending on task complexity and difficulty. However, two-stage algorithms have obvious drawbacks: they are large, slow, and difficult to integrate into mobile devices.
One-stage detection algorithms, prominently represented by the YOLO [14,15,16] series, are preferred when real-time performance is required owing to their compact size, rapid speed, and high accuracy. YOLO series models have been built for variable objects in agriculture, such as the detection of weeds [17] or diseases [18] with similar appearances or features. YOLO models have also shown balanced performance in fruit detection, although the biggest challenges lie in similar colors and occlusion under some conditions. Wang [19] employed YOLOv4 [20] for apple detection, achieving an average detection accuracy of 92.23%, surpassing Faster R-CNN by 2.02% under comparable conditions. Yao [21] utilized YOLOv5 to identify defects in kiwi fruits, achieving a detection accuracy of 94.7% with a processing time of only 0.1 s per image. Yan [22] deployed YOLOv5 to automatically distinguish graspable and ungraspable apples in apple tree images for a picking robot, achieving an average recognition time of 0.015 s per image. Different versions of YOLO also differ in performance. Comparative analysis revealed that YOLOv5 improved the mean Average Precision (mAP) by 14.95% and 4.74% over YOLOv3 and YOLOv4, respectively, while compressing the model size by 94.6% and 94.8%; moreover, its average recognition speed was 1.13 and 3.53 times that of YOLOv4 and YOLOv3, respectively. Thus, the YOLOv5 algorithm effectively balances speed and accuracy among one-stage detection algorithms.
Conventional target tracking methods, including optical flow [23] and frame difference [24], exhibit limited real-time performance, high complexity, and susceptibility to environmental factors. Conversely, tracking algorithms based on Kalman filtering [25] and Hungarian matching [26], such as SORT [27], DeepSORT [28], and ByteTrack [29], offer rapid tracking speed and high accuracy, meeting the demands of video detection. SORT improves tracking efficiency by associating detections between consecutive frames, yet it struggles with occluded targets. DeepSORT improves occluded-target tracking by leveraging deep appearance features extracted with a convolutional neural network (CNN), strengthening feature matching, albeit at the expense of detection speed. Like SORT, ByteTrack does not employ deep appearance features but relies solely on inter-frame information association. However, ByteTrack effectively addresses occluded-target recognition by also considering low-confidence detections, thereby achieving a good balance between tracking accuracy and speed. In summary, ByteTrack surpasses the other tracking algorithms in performance, yet it is restricted to single-class target tracking and does not directly support multi-class classification tracking.
To address the existing challenges of current tracking algorithms and meet practical demands, this study introduces an enhanced ByteTrack tracking algorithm for real-time recognition, tracking, and counting of dragon fruit at various growth stages in inspection videos. The proposed method comprises three core modules: (1) a multi-class dragon fruit recognition model based on YOLOv5; (2) an improved ByteTrack tracker that takes the multi-class detection results as input, enabling end-to-end tracking of dragon fruit across different growth stages; and (3) regions of interest (ROI) defined to perform per-class counting of dragon fruit at distinct growth stages. This method can be further integrated into mobile devices, facilitating automated inspection of dragon fruit orchards and offering a viable approach for timely prediction of dragon fruit yield and maturity.
The organization of the rest of this paper is as follows. In Section 2, we present the proposed method and performance metrics in detail. Section 3 is dedicated to the discussion of our results. In Section 4, we discuss future work. Finally, in Section 5, we draw our conclusions.
Our contributions are twofold: (1) an enhanced ByteTrack tracking algorithm is proposed to simultaneously recognize, track, and count dragon fruit of different ripeness on both sides of the inspection path; (2) the YOLOv5 object detector is employed as the detection component of the ByteTrack tracker, and multi-class tags are introduced into ByteTrack, achieving efficient and rapid tracking of multiple classes.
2. Materials and Methods
2.1. Image Acquisition for Dragon Fruit
A series of videos was recorded using several handheld smartphones along the inter-row paths of a dragon fruit plantation in Long’an, Guangxi, on 7 November 2021. Based on the observed conditions, the plots can be categorized into three scenarios: (a) plots with coexisting green fruits (immature dragon fruit) and red fruits (mature dragon fruit), as depicted in Figure 1a; (b) plots with coexisting dragon fruit flowers and green fruits, as shown in Figure 1b; and (c) plots with coexisting flowers, green fruits, and red fruits, as illustrated in Figure 1c. The recordings were made in the afternoon and at night (3:00 p.m.–9:00 p.m.), with supplementary illumination provided by grow lights at night. The smartphones were handheld on a gimbal at a height of 1.0–1.5 m above the ground, with the camera lens pointing straight ahead so that the dragon fruit plants on both sides of the path were recorded. Lighting conditions during the day included both front light and backlight. The acquisition speed was about 1 m/s. The collected videos were stored in MP4 format with a resolution of 1920 (horizontal) × 1080 (vertical) pixels and a frame rate of 30 frames per second. In total, 61 videos were acquired, with a combined duration of approximately 150 min.
2.2. The Proposed Technical Framework
The technical framework of this study, as illustrated in Figure 2, encompasses the following five key components:
(1) Construction of a dragon fruit detection dataset: Curating a dataset by selecting videos captured under diverse environmental conditions, extracting relevant key frames, and annotating them to establish a comprehensive dragon fruit detection dataset.
(2) Training a dragon fruit detection model: Utilizing the constructed dataset to train a detection algorithm, thereby developing a dragon fruit recognition model capable of identifying dragon fruit at distinct growth stages. The recognition outcomes serve as an input for subsequent object tracking processes.
(3) Tracking dragon fruit at different growth stages: Adding multi-classification information on the basis of multi-object tracking technology and integrating the results of dragon fruit detection to achieve real-time tracking of dragon fruit across various growth stages.
(4) Counting dragon fruit using the ROI region: Incorporating a region of interest (ROI) within the video and utilizing the results obtained from object tracking to perform accurate counting of dragon fruit at different growth stages within the defined ROI.
(5) Result analysis: Conducting a comprehensive assessment of the dragon fruit counting method by evaluating the performance of object detection, object tracking, and the effectiveness of counting within the ROI region.
2.3. Construction of the Dragon Fruit Dataset
Twenty videos, each lasting between 30 s and 3 min, were randomly selected for the study. Sixteen of them were used to build a raw image dataset by extracting one frame for every 30 frames. Images depicting figs or suffering from blurriness were manually eliminated. The resulting dataset comprised 5500 images, which were numbered and then manually annotated for flowers, green fruits, and red fruits using the LabelImg software. Targets with occluded areas exceeding 90% or with blurry appearance were excluded from labeling. Finally, the annotated dataset was divided into a training set (5000 images) and a validation set (500 images) at a 9:1 ratio. The remaining four videos, ranging from 1800 to 5400 frames in length, were used for system testing. Videos 1 and 2 were captured at night, while videos 3 and 4 were captured in daylight; all four were filmed in the same way, recording two rows of dragon fruit plants simultaneously. The basic information of the dataset is listed in Table 1.
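For illustration, a minimal sketch of the one-frame-in-30 extraction step is given below, assuming OpenCV; the paths and file-naming scheme are placeholders rather than those used in the study.

```python
# Illustrative sketch of the key-frame extraction step (Section 2.3):
# one frame is saved for every 30 frames of each selected video.
import cv2
import os

def extract_frames(video_path: str, out_dir: str, step: int = 30) -> int:
    """Save every `step`-th frame of `video_path` as a JPEG in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if index % step == 0:           # keep one frame per `step` frames
            stem = os.path.splitext(os.path.basename(video_path))[0]
            cv2.imwrite(os.path.join(out_dir, f"{stem}_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (placeholder paths): extract_frames("videos/plot_01.mp4", "dataset/raw_images")
```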
2.4. Multi-Object Tracking with Multi-Class
The ByteTrack algorithm, which comprises object detection and object tracking stages, faces two limitations in this application: limited real-time detection performance and the absence of class information in tracking. To address these limitations, this study presents an enhanced version of the algorithm for efficient detection and tracking of dragon fruit at various growth stages. Notably, we replaced the original detector in ByteTrack with YOLOv5 to improve both detection accuracy and speed. Additionally, multi-class information was integrated into the tracking module so that the number of objects in each class could be counted.
2.4.1. Object Detection
The YOLO (You Only Look Once) algorithm is a renowned one-stage object detection algorithm that identifies objects within an image in a single pass. Unlike region-based convolutional neural network (RCNN)-type algorithms that generate candidate box regions, YOLO directly produces bounding box coordinates and class probabilities of each bounding box through regression. This unique approach enables faster detection speeds, as it eliminates the need for multiple passes or region proposal steps.
The YOLOv5 architecture offers four variants, distinguished by the network’s depth and width: YOLOv5-s, YOLOv5-m, YOLOv5-l, and YOLOv5-x. Among these, YOLOv5-s is the smallest model with the fastest inference speed, making it well suited for deployment on edge devices. In this study, we employed the YOLOv5-s network, as depicted in Figure 3, which comprises four key components: Input, Backbone, Neck, and Output. The Input module performs essential pre-processing on the dragon fruit images, including adaptive scaling and Mosaic data augmentation. The Backbone module, consisting of C3 and SPPF structures, extracts features from the input image. The C3 module splits the incoming feature layer into two parts: one part undergoes convolutional operations, while the other is fused with the result of those operations through a cross-layer connection to form the output feature layer. The SPPF structure aggregates the multi-scale features produced by C3 to effectively enlarge the receptive field. In the Neck module, an FPN+PAN structure fuses features extracted at different layers: FPN propagates features from top to bottom, while PAN propagates features from bottom to top, and combining both mitigates information loss. The Output module generates feature maps of various sizes, from which the location, class, and confidence of each dragon fruit are predicted. In this study, 5000 images containing dragon fruit flowers and fruits were used as the training dataset; the dataset details are listed in Table 1.
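A minimal inference sketch for a YOLOv5-s model of this kind is shown below, assuming weights trained with the ultralytics/yolov5 repository; the weight file name and image path are placeholders, not artifacts from this study.

```python
# Illustrative YOLOv5-s inference on a single frame via torch.hub.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="dragonfruit_yolov5s.pt")
model.conf = 0.25          # confidence threshold for kept detections (assumed value)
model.iou = 0.45           # NMS IoU threshold (assumed value)

results = model("frames/plot_01_000300.jpg")   # forward pass on one image

# results.xyxy[0] is an (n, 6) tensor: x1, y1, x2, y2, confidence, class id
for *box, conf, cls in results.xyxy[0].tolist():
    print(f"class={int(cls)} conf={conf:.2f} box={[round(v, 1) for v in box]}")
```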
2.4.2. Object Tracking
Multiple Object Tracking (MOT) aims to estimate the bounding boxes and identify objects within video sequences. Many existing methods for MOT achieve this by associating bounding boxes with scores exceeding a predefined threshold to establish object identities. However, this approach poses a challenge when dealing with objects that have low detection scores, including occluded instances, as they are often discarded, thus resulting in the loss of object trajectories.
The ByteTrack algorithm introduces an innovative data association approach called BYTE, which offers a simple, efficient, and versatile solution. In contrast to conventional tracking methods that primarily focus on high-scoring bounding boxes, BYTE adopts a different strategy. It retains most of the bounding boxes and classifies them based on their confidence scores into high and low categories. Rather than discarding the low-confidence detection targets outright, BYTE leverages the similarity between the bounding boxes and existing tracking trajectories. This enables the algorithm to effectively distinguish foreground targets from the background, mitigating the risk of missed detections and enhancing the continuity of object trajectories.
The BYTE workflow, depicted in Figure 4, encompasses three key steps: (1) partitioning the bounding boxes into high- and low-scoring categories; (2) matching the high-scoring boxes with existing tracking trajectories first, and resorting to the low-scoring boxes only for trajectories that remain unmatched; (3) generating new tracking trajectories for high-scoring bounding boxes that fail to find a suitable match among existing trajectories, and discarding a trajectory if no suitable match is found within 30 frames. Through this streamlined data association approach, ByteTrack demonstrates exceptional performance on MOT tasks, and its attention to low-scoring bounding boxes allows it to handle scenarios with significant object occlusion.
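A simplified sketch of these three steps is given below. This is a schematic illustration rather than the ByteTrack source code; `match_fn` stands for an IoU-based matcher (Hungarian matching in practice) assumed to return (matched pairs, unmatched tracks, unmatched detections), and the thresholds are ByteTrack's commonly used defaults.

```python
# Schematic BYTE association: score split, two-stage matching, delayed removal.
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2)
    score: float
    cls: int

@dataclass
class Track:
    box: tuple
    cls: int
    track_id: int
    lost: int = 0       # consecutive frames without a match

HIGH_THRESH, LOW_THRESH, MAX_LOST = 0.6, 0.1, 30

def byte_step(tracks, detections, match_fn, next_id):
    high = [d for d in detections if d.score >= HIGH_THRESH]
    low = [d for d in detections if LOW_THRESH <= d.score < HIGH_THRESH]

    # Steps (1)+(2): match high-score boxes first, then give the remaining
    # tracks a second chance with low-score (often occluded) boxes.
    matched, unmatched_tracks, unmatched_high = match_fn(tracks, high)
    matched_low, still_unmatched, _ = match_fn(unmatched_tracks, low)

    for track, det in matched + matched_low:    # refresh matched tracks
        track.box, track.lost = det.box, 0

    # Step (3): unmatched high-score boxes start new tracks; tracks that
    # stay unmatched for more than MAX_LOST frames are dropped.
    new_tracks = [Track(d.box, d.cls, next_id + i) for i, d in enumerate(unmatched_high)]
    for t in still_unmatched:
        t.lost += 1
    kept = [t for t in tracks if t.lost <= MAX_LOST]
    return kept + new_tracks
```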
In this study, we incorporated multi-class information into the tracking process of ByteTrack, as depicted in Figure 5, where dashed boxes indicate the added multi-class information. Based on the category of the bounding box, a Kalman filter augmented with classification information was used to improve prediction accuracy; its state vector is given by Equation (1):

x = [u, v, s, r, u̇, v̇, ṡ, class]ᵀ,  (1)

where u and v denote the center point of the bounding box, s represents the aspect ratio, r corresponds to the height of the bounding box, u̇, v̇, and ṡ denote their respective rates of change, and class indicates the category information of the bounding box. After the predicted objects are generated, tracking trajectories are matched according to category, and upon successful matching, detection boxes with classification and identity IDs are output.
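The class gate itself is simple: a detection may only be associated with a trajectory of the same category. The sketch below, which reuses the hypothetical Track/Detection structures from the previous sketch, illustrates this with a greedy same-class matcher; the study's tracker relies on Hungarian matching, and greedy matching is used here only to keep the example short.

```python
# Class-aware association: detections are matched only to same-class tracks,
# so a flower can never inherit the ID of a green or red fruit.
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match_by_class(tracks, detections, iou_thresh=0.3):
    """Greedy same-class matching (a simplification of Hungarian matching)."""
    pairs, used = [], set()
    for t in tracks:
        best, best_iou = None, iou_thresh
        for i, d in enumerate(detections):
            if i in used or d.cls != t.cls:   # class gate added in this study
                continue
            overlap = iou(t.box, d.box)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best is not None:
            used.add(best)
            pairs.append((t, detections[best]))
    return pairs
```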
2.5. Counting Method Using the ROI Region
Considering the characteristics of inspection videos, which involve single-side and double-side inspections, this paper presents an ROI counting method capable of counting on either one side or both sides. Here, we focus on the double-side ROI counting method, which captures the dragon fruit on both sides of the aisle simultaneously. The layout of the double-side ROI region is shown in the inspection video schematic in Figure 6. In this schematic, a counting belt is positioned on each side of the video frame, highlighted by a blue translucent mask overlay. Additionally, the current video frame number, the processing frequency, and the statistics for each category of flowers and fruits are displayed in the upper-left corner of the video.
The counting method, as illustrated in Figure 6, involves two primary steps. (1) Frame-by-frame analysis: the center coordinates of each identified dragon fruit target box are assessed to determine whether they fall within the designated ROI counting region. If they do not, the process is repeated for the next target; if they do, the procedure proceeds to the next step. (2) Categorization and tracking: the category of the flower/fruit target entering the ROI region is confirmed, and the corresponding category’s tracking list is checked for the presence of the target’s ID. If the ID is not already in the tracking list, it is added and the counter for that category is incremented by 1; if the ID is already present, the target is not counted again. Once all video frames have been analyzed, the tracking lists for all categories within the ROI region are cleared.
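As a concrete illustration of these two steps, the sketch below maintains one per-class set of already-counted track IDs; the ROI band positions, class IDs, and frame width are placeholder values, not the settings used in the study.

```python
# Minimal sketch of the double-side ROI counting logic described above.
FRAME_W = 1920
ROI_BANDS = [(0, 300), (FRAME_W - 300, FRAME_W)]   # left/right counting belts (pixels, assumed widths)
CLASS_NAMES = {0: "flower", 1: "green fruit", 2: "red fruit"}

counted_ids = {c: set() for c in CLASS_NAMES}      # per-class tracking lists
counts = {c: 0 for c in CLASS_NAMES}

def update_counts(tracked_objects):
    """tracked_objects: iterable of (track_id, cls, (x1, y1, x2, y2)) for one frame."""
    for track_id, cls, (x1, y1, x2, y2) in tracked_objects:
        cx = (x1 + x2) / 2                          # center of the target box
        if not any(lo <= cx <= hi for lo, hi in ROI_BANDS):
            continue                                # outside both counting belts
        if track_id not in counted_ids[cls]:        # ID not yet counted for this class
            counted_ids[cls].add(track_id)
            counts[cls] += 1

# After the last frame: report per-class totals, then clear the tracking lists.
# for c, n in counts.items(): print(CLASS_NAMES[c], n)
```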
2.6. Evaluation Metrics
The Intersection over Union (IOU) is a widely used metric for evaluating the accuracy of object detection. It quantifies the overlap between the predicted bounding boxes and the ground truth boxes, as expressed in Equation (2). In this study, we employed an IOU threshold of 0.5, indicating that a detection is deemed correct if the IOU is greater than or equal to 0.5, and incorrect otherwise. In Equation (2), the numerator is the intersection area between the predicted bounding box and the ground truth box, and the denominator is their union area. By comparing the predicted and ground truth results, the samples can be categorized into four groups: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). From these groups, various evaluation metrics can be derived, including precision (P), recall (R), average precision (AP), and mean average precision (mAP), which are expressed in Equations (3)–(6).
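The typeset equations are not reproduced in this excerpt; the standard definitions assumed here for Equations (2)–(6) are summarized below, where A_I and A_U denote the intersection and union areas and n is the number of classes (these symbols are introduced only for illustration).

```latex
% Standard forms assumed for Equations (2)-(6).
\mathrm{IOU} = \frac{A_I}{A_U}                        \quad (2) \\
P   = \frac{TP}{TP + FP}                              \quad (3) \\
R   = \frac{TP}{TP + FN}                              \quad (4) \\
AP  = \int_{0}^{1} P(R)\,\mathrm{d}R                  \quad (5) \\
mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i                  \quad (6)
```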
The evaluation of the counting results is conducted using counting accuracy, counting precision, and mean counting precision. These metrics are defined in Equations (7)–(9), in which the automated counting value produced by the proposed method is compared with the manually obtained true value for each class, and N denotes the total number of tested videos.
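Equations (7)–(9) are likewise not reproduced in this excerpt. One plausible formulation, stated purely as an assumption, uses C_a for the automated count, C_t for the manual true count of a class, and CP_j for the counting precision obtained on the j-th test video:

```latex
% Assumed (not verbatim) forms of Equations (7)-(9); C_a, C_t, and CP_j
% are illustrative symbols, not taken from the original paper.
CA  = \frac{C_a}{C_t} \times 100\%                                      \quad (7) \\
CP  = \left(1 - \frac{\lvert C_a - C_t \rvert}{C_t}\right) \times 100\% \quad (8) \\
MCP = \frac{1}{N}\sum_{j=1}^{N} CP_j                                    \quad (9)
```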
2.7. Experimental Environment and Parameter Setting
In this study, the models were built and improved using the PyTorch deep learning framework. Model training and validation were conducted on Ubuntu 18.04. The computer used had an Intel(R) Core(TM) i7-8700K CPU at 3.70 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. GPU acceleration was employed to expedite network training, using CUDA version 11.3.0 and cuDNN version 8.2.0. Stochastic Gradient Descent (SGD) was chosen as the optimizer to accelerate the training process, with the momentum set to 0.9 and the weight decay set to 0.0005. The input images were uniformly resized to 512 × 512 pixels, and a batch size of 16 was used during training. The trained model was then tested on another computer equipped with an NVIDIA GeForce GTX 1080 Ti GPU with 16 GB of memory.
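For reference, a minimal PyTorch sketch of the optimizer configuration stated above is given below; the stand-in module and the learning rate are assumptions, since the text does not specify the network object or the initial learning rate.

```python
# Sketch of the SGD configuration described above (momentum 0.9, weight decay 0.0005).
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)   # stand-in for the YOLOv5-s network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,                 # assumed initial learning rate (not given in the text)
    momentum=0.9,            # momentum parameter from Section 2.7
    weight_decay=0.0005,     # weight decay parameter from Section 2.7
)
```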
4. Discussion and Future Work
The proposed counting method based on ROI regions achieves over 90% counting accuracy and a detection speed of 56 FPS, results obtained on a high-performance GPU platform (GTX 1080 Ti). However, in practical orchard applications, detection tasks often need to be executed on mobile devices, where the speed will inevitably decrease. We have conducted preliminary experiments on a smartphone (Huawei P20), on which the detection frame rate typically ranges from 18 to 24 fps, largely meeting real-time detection requirements. Therefore, future research will aim to implement the recognition, tracking, and counting of various flowers and fruits in orchards on mobile devices, providing a more portable solution for estimating orchard yields.
In the field of agricultural detection, mobile platform-based detection has been a hot research topic. For instance, Huang [32] deployed an object detection model on edge computing devices to detect citrus images in orchards, achieving a detection accuracy of 93.32% at a processing speed of 180 ms/frame. Mao [33] implemented fruit detection on CPU platforms, achieving a detection speed of 50 ms/frame on smartphone platforms. This demonstrates the significant potential of fruit detection on mobile platforms.
Considering these promising developments, our future work will focus on optimizing our model and adapting it for mobile platforms. This adaptation will enable the real-time counting of fruits in the field, serving as a valuable direction for meeting the needs of orchard management.