2. Materials and Methods
Algorithms based on Deep Learning methods are the most effective way to perform object detection [17,18]. The YOLO (You Only Look Once) approach proposes a neural network that simultaneously predicts bounding boxes and class probabilities, distinguishing itself from earlier object detection algorithms that repurposed classifiers for this purpose. By adopting a fundamentally different approach to object detection, YOLO has achieved state-of-the-art results, significantly outperforming other real-time object detection algorithms [19].
In contrast to methods such as Faster R-CNN [30], which identify potential regions of interest using a Region Proposal Network and subsequently perform recognition on each of those regions separately, YOLO performs all its predictions through a single fully connected layer. While approaches employing Region Proposal Networks require multiple passes over the same image, YOLO accomplishes the task in a single pass.
Considering speed and accuracy, YOLOv4 has recently presented strong performance among object detection models [31]. The YOLOv4 architecture underwent significant modifications, with a renewed emphasis on systematically comparing training techniques, leading to substantial performance improvements. Its defining characteristic is the integration of various components, resulting in notably high performance. In essence, YOLOv4 can be described as a combination of CSPDarknet53, SPP, PANet, and YOLOv3 [32].
The primary contributions of YOLOv4 include the introduction of an efficient target detection model, an investigation into the impact of state-of-the-art (SOTA) techniques during training, and the optimization of SOTA methods for single-GPU training. YOLOv4 also reorganizes the target detector framework into Input, Backbone, Neck, and Head components, utilizing multiple anchors for a single ground truth. Key improvements in YOLOv4 include SPP, the Mish activation function, data augmentation techniques such as Mosaic and MixUp, and the adoption of the GIoU (Generalized Intersection over Union) loss function [33].
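The GIoU mentioned above admits a compact definition: it extends IoU with a penalty based on the smallest box C enclosing both the prediction and the ground truth, so that disjoint boxes still receive a useful gradient (the corresponding loss is typically 1 − GIoU). The sketch below assumes boxes in (x1, y1, x2, y2) form; it is an illustration of the general formula, not code from this work.

```python
def giou(a, b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty intersections clamp to zero area).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C of the two boxes.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU subtracts the fraction of C not covered by the union.
    return iou - (c_area - union) / c_area
```

For identical boxes GIoU equals 1; for disjoint boxes it becomes negative, growing more negative the farther apart they are, which is precisely what makes it usable as a regression loss.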
Recent studies have harnessed YOLO-based models, including YOLOv3, YOLOv4, and YOLOv5, for fruit detection, demonstrating their significant potential in accurately identifying fruit directly on trees [34,35,36,37]. Notably, the results achieved with YOLOv4 [32] have been found to be similar to those obtained with YOLOv5. Given this similarity in performance, YOLOv4 was chosen as the model for orange detection in this work.
2.1. Creation of the Green Oranges Dataset
Training YOLOv4 to detect pre-harvest green oranges requires a database of duly annotated images of oranges directly on the trees. The creation of such a pre-harvest, duly annotated green orange dataset is one of the contributions of the present study. Among the orange varieties available at the site where the images were acquired, the variety chosen for this work was "folha murcha", a Valencia-type orange [38], with data collection occurring between seven and six months before harvest.
Data were obtained in the field using a Xiaomi Redmi Note 9 Pro smartphone camera in both portrait and landscape orientations, while the camera operator walked in a straight line parallel to the row of orange trees being filmed. Data collection took place on 15 March 2021 and 18 April 2021. The orange trees were filmed at 1920 × 1080 resolution and 60 frames per second. The images for the database were extracted from the video frames at 3 s intervals.
The oranges were divided into three categories: green oranges, ripe oranges, and spoiled oranges. They were annotated using the online tool CVAT (Computer Vision Annotation Tool) [39], totaling 644 images with 43,109 annotated oranges, of which 532 images were separated for training and 112 for testing. Among these annotations, 42,710 belong to the green orange class, 368 to the spoiled orange class, and 31 to the ripe orange class. It is important to emphasize that both vertical and horizontal image orientations were incorporated to increase the dataset's generality. Unlike Rauf and Chan's dataset [40], which focuses exclusively on 1465 images of ripe oranges, our dataset includes a significant number of green oranges.
Figure 2 shows the oranges annotated using the CVAT tool: green oranges are annotated with blue bounding boxes, ripe oranges with yellow bounding boxes, and spoiled oranges with pink bounding boxes.
2.2. YOLOv4 Model Training
The YOLOv4 training was performed using the Darknet framework on the Google Colab platform with a Tesla P100-PCIE-16GB GPU. We configured and tuned the YOLOv4 architecture for our custom database. The main source code of the Darknet framework was prepared by [32]. Following the authors' guidelines [32], we modified the last three layers of YOLOv4 to be compatible with the number of classes in our database. The original YOLOv4 was trained on 80 classes; we therefore changed the number of classes to three: "green orange", "ripe orange", and "spoiled orange". We set the width × height of the network input image to 1056 × 1056. This value was chosen to account for the size of the oranges in the image: if an orange becomes smaller than 11 × 11 pixels after the input image is resized, the quality of the network's detections can be compromised. Data augmentation techniques and network hyperparameters were kept at their default values. In addition, the maximum number of training iterations was set to 6000, following the formula provided by the authors (number of classes × 2000) [32].
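The layer changes described above can be sketched numerically. The filter count of the convolutional layers that feed the YOLO heads follows the standard Darknet convention of anchors × (classes + 5); that convention, and the variable names below, are assumptions about the exact cfg edits rather than a copy of them.

```python
# Sketch of the Darknet cfg values implied by the guidelines above.
# Assumption: the standard convention of 3 anchors per YOLO layer.

num_classes = 3          # "green orange", "ripe orange", "spoiled orange"
anchors_per_layer = 3    # Darknet default for each of the three YOLO heads

# Each convolutional layer before a YOLO layer outputs, per anchor,
# 4 box coordinates + 1 objectness score + one probability per class.
filters = anchors_per_layer * (num_classes + 5)

# Training length recommended by the Darknet authors: classes x 2000.
max_batches = num_classes * 2000

# Network input chosen so oranges stay larger than about 11 x 11 pixels.
width, height = 1056, 1056

print(filters, max_batches)   # 24 6000
```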
TensorFlow [41] is a machine learning framework for data processing that offers a wide variety of libraries and resources, allowing the latest Deep Learning algorithms and models to be used in a flexible way. After the YOLOv4 model had been trained, it was converted to a TensorFlow model, using the source code in [42] as the basis for the conversion.
2.3. Orange Counting System
It is possible to count objects using detection alone. However, as discussed earlier, detection systems often fail in certain occlusion and lighting situations. Relying solely on the number of detections in an image to count the oranges would therefore be unreliable, especially in a pre-harvest scenario where such situations are quite common. For this reason, a counting system must compensate for these limitations to ensure counting accuracy. One way to achieve this is to assign a unique ID to each detected orange and track it through the video frames. In this way, it is possible to obtain more reliable counting results even when the detection system fails.
The system used in this work tracks the oranges across frames with a unique ID using two methods. The first uses the Euclidean distance between the centroids of the objects detected by YOLOv4 in subsequent frames to relate the IDs of the oranges in the previous frame to the oranges detected by YOLOv4 in the current frame. The second uses object-tracking algorithms to match the IDs of oranges from the previous frame to oranges in the current frame that YOLOv4 has not detected.
The first method, based on the Euclidean distance between centroids, is divided into four steps. Step 1: Obtain the coordinates of the bounding boxes and calculate their centroids. Step 2: Calculate the Euclidean distance between objects in the previous frame and objects in the current frame. Step 3: Update object coordinates. Step 4: Register objects with new IDs.
In step 1, the algorithm receives the coordinates of the bounding boxes and calculates their respective centroids. Assuming this is the first set of bounding boxes received, each centroid, or object, is assigned a unique ID, as shown in the left image of Figure 3. The objects' centroids are calculated from the bounding boxes at each video frame. However, instead of assigning new unique IDs to the objects detected in the current frame, it is first necessary to determine whether the centroids of the new objects can be associated with the centroids of the objects in the previous frame. This is done in step 2.
In step 2, the Euclidean distance between each pair of object centroids from the previous frame and the current frame is calculated. In the right image of Figure 3, the centroids of the objects from the previous frame are represented by the two blue dots, and the yellow dots represent the centroids of the objects in the current frame. The arrows represent the Euclidean distances.
The central assumption of the first method is that objects in consecutive frames tend to move little. Consequently, the distance between the centroids of the same object in two consecutive frames will be smaller than the distances between all other pairs of centroids.
In step 3, the centroids of objects with the smallest Euclidean distance between them have their IDs related, as shown in the left image of Figure 4. However, if the number of new objects is greater than the number of objects in the previous frame, or if a new object is more than 70 pixels away from every object that has not yet been associated with an ID, then it is not possible to associate IDs from the previous frame with all objects in the current frame, as shown by the isolated yellow dot in the left image of Figure 4. These new objects must therefore be assigned new IDs in step 4.
In step 4, objects that have not been associated with existing IDs are assigned new IDs, provided the detection confidence of the object is 85% or higher, as shown in the right image of Figure 4. This situation happens when an object that was not part of the previous frames is detected.
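The four steps above can be condensed into a minimal sketch. The class and method names are illustrative, and the greedy nearest-pair matching is one possible realization of steps 2 and 3; the 70-pixel distance limit and the 85% confidence threshold follow the text.

```python
import numpy as np

MAX_DIST = 70     # max centroid distance for ID association (pixels)
MIN_CONF = 0.85   # min detection confidence to register a new ID

class CentroidTracker:
    def __init__(self):
        self.next_id = 0
        self.objects = {}  # id -> centroid (x, y)

    @staticmethod
    def centroid(box):
        # Step 1: centroid of an (x1, y1, x2, y2) bounding box.
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    def update(self, boxes, confidences):
        cents = [self.centroid(b) for b in boxes]
        unmatched = set(range(len(cents)))
        if self.objects and cents:
            ids = list(self.objects)
            prev = np.array([self.objects[i] for i in ids])
            cur = np.array(cents)
            # Step 2: pairwise Euclidean distances (previous x current).
            d = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
            # Step 3: greedily relate the closest pairs, up to MAX_DIST.
            for _ in range(min(len(ids), len(cents))):
                r, c = np.unravel_index(np.argmin(d), d.shape)
                if d[r, c] > MAX_DIST:
                    break
                self.objects[ids[r]] = cents[c]   # keep the existing ID
                unmatched.discard(c)
                d[r, :] = np.inf                  # remove the matched pair
                d[:, c] = np.inf
        # Step 4: register confident unmatched detections with new IDs.
        for c in unmatched:
            if confidences[c] >= MIN_CONF:
                self.objects[self.next_id] = cents[c]
                self.next_id += 1
        return dict(self.objects)
```

Running `update` once per frame with the YOLOv4 boxes reproduces the behavior described above: a box that moves a few pixels keeps its ID, while a confident detection far from every existing centroid receives a fresh one.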
Figure 5 illustrates the first method over ten consecutive frames. In frame 163, there are three oranges with IDs 81, 80, and 77, each marked with a circle of a different color to facilitate identifying oranges with the same ID across frames. The smaller black circle inside an orange indicates that YOLOv4 detected that orange in that frame. It is possible to notice that the oranges move very little between consecutive frames, which facilitates the assignment of IDs frame by frame.
2.4. Object-Tracking Algorithm
The second method of tracking oranges through video frames with a unique ID is used when it is impossible to relate an object’s ID from the previous frame to an object that YOLOv4 has detected in the current frame. In this case, we use object-tracking algorithms to make this relationship.
In this process, the objective of the object-tracking algorithms is to estimate the object's state (position) over time [43]. When the environment does not change, tracking objects is not overly complex, but this is usually not the case. Various real-world disturbances can disrupt tracking, including occlusion, variations in lighting, changes of viewpoint, rotation, and motion blur [44]. The steps used to track an object in a video involve:
Selecting the object (target) in the initial frame with a bounding box;
Initializing the tracker with information about the frame and the bounding box of the object;
Using subsequent frames to find the new bounding box of the object in these frames.
In this work, the following object trackers are used for comparison: the Dlib correlation tracker (Dlib tracker) [45], Boosting [46], Multiple Instance Learning (MIL) [47], MedianFlow [48], Kernelized Correlation Filter (KCF) [44], and Channel and Spatial Reliability Tracker (CSRT) [49]. These algorithms were selected for their robustness and because they are implemented in the OpenCV and Dlib-ml libraries. Except for the Dlib tracker, which is implemented in the Dlib-ml library [50], all other object trackers are implemented in the OpenCV (Open Source Computer Vision) library [44,51].
The Dlib correlation tracker focuses on robust scale estimation, while the Kernelized Correlation Filter (KCF) adapts channel characteristics and introduces color-name (CN) features for tracking. The Channel and Spatial Reliability Tracker (CSRT) combines discriminative correlation filters (DCF) with spatial and channel reliability, enabling adaptable and accurate short-term tracking, even for non-rectangular objects. In contrast, Boosting treats tracking as a binary classification task and continually updates the classifier for the target object, thereby enhancing its robustness. MedianFlow identifies tracking failures through the comparison of forward and backward trajectories, ensuring the robustness of tracking through discrepancy measurements. Additionally, Multiple Instance Learning (MIL) enhances the robustness of the discriminative classifiers used in tracking-by-detection techniques, effectively mitigating inaccuracies in the tracker.
Object trackers are used to estimate the object's position when YOLOv4 cannot detect the object in the current frame. This is possible because the trackers only need the previous frame and the object's bounding box. After initialization, the object trackers can follow the object frame by frame without the help of YOLOv4. The entire process is shown in the flowchart in Figure 6.
Figure 7 shows an example of this situation using the Dlib tracker. In frame 196, the orange with ID 85 was detected by YOLOv4, as indicated by the black circle inside it. In frame 197, the orange with ID 85 was not detected by YOLOv4; the white circle inside it indicates that the object-tracking algorithm is tracking the orange. The orange is then tracked through the frames even if YOLOv4 fails to detect it again. This tracking process lasts for up to 35 frames; if YOLOv4 cannot detect the orange again within those 35 frames, the ID of that orange is deleted, and the object-tracking algorithm stops tracking it.
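The 35-frame rule can be sketched as a per-orange counter of consecutive undetected frames. The class and function names are illustrative assumptions; only the 35-frame limit comes from the text.

```python
MAX_LOST = 35   # frames an orange may go undetected before its ID is deleted

class TrackedOrange:
    def __init__(self, oid):
        self.oid = oid
        self.lost_frames = 0   # consecutive frames without a YOLOv4 match

def advance_frame(tracked, detected_ids):
    """Advance one frame, dropping oranges unseen for more than MAX_LOST frames."""
    survivors = []
    for orange in tracked:
        if orange.oid in detected_ids:
            orange.lost_frames = 0        # YOLOv4 re-detected the orange
        else:
            orange.lost_frames += 1       # tracker-only frame
        if orange.lost_frames <= MAX_LOST:
            survivors.append(orange)      # keep tracking this ID
    return survivors

# An orange missed for 35 consecutive frames is still tracked;
# on the 36th undetected frame its ID is deleted.
tracked = [TrackedOrange(85)]
for _ in range(35):
    tracked = advance_frame(tracked, detected_ids=set())
print(len(tracked))   # 1
tracked = advance_frame(tracked, detected_ids=set())
print(len(tracked))   # 0
```

A re-detection at any point resets the counter, which is what lets the system survive short occlusions without double counting.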
3. Results and Discussions
To test the counting algorithm, ten seconds of a video whose images were not used to compose the database were chosen. To provide a comparison for the performance of the algorithm using YOLOv4, a situation was simulated in which the detection algorithm detects all oranges in all video frames without committing any detection error. This was achieved by manually annotating all oranges with bounding boxes in all 600 frames of the 10 s video; this information was used frame by frame to simulate an optimal detector. The actual number of oranges visible in the video, 208, was also counted. The counting algorithm was then tested using YOLOv4 and the simulated optimal object detector, each combined with each of the object-tracking algorithms mentioned above.
After training, the network achieved the parameters displayed in Table 1. The training process utilized 532 images with the classes "green orange", "ripe orange", and "spoiled orange". It is noteworthy that the parameters presented in the table reflect the network's performance on a separate set of 112 test images. These test images were not part of the training dataset, ensuring an impartial evaluation of the model's performance.
True Positives (TP) represent instances where the model correctly predicts a label that matches the ground truth. True Negatives (TN) occur when the model correctly abstains from predicting a label that is indeed not part of the ground truth. On the other hand, False Positives (FP) arise when the model erroneously predicts a label that is not supported by the ground truth. Conversely, False Negatives (FN) occur when the model fails to predict a label that is present in the ground truth.
Intersection over Union (IoU) is a metric that quantifies the overlap between the coordinates of the predicted bounding box and the ground truth box. A higher IoU value indicates that the predicted bounding box coordinates closely resemble those of the ground truth box. When evaluating multiple detections, the Average IoU is calculated as the mean IoU of all the detections made. This metric provides an overall measure of how well the predicted bounding boxes align with the ground truth boxes on average.
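The IoU described above reduces to a few lines of code. The sketch below assumes boxes in (x1, y1, x2, y2) form, which is an assumption about the annotation format rather than a detail from the text.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (clamped to zero area if the boxes are disjoint).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Identical boxes give an IoU of 1.0 and disjoint boxes give 0.0; averaging this value over all detections yields the Average IoU reported above.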
The Precision metric assesses how many of all positive predictions (TP + FP) are true positives (TP), while Recall evaluates the ability to locate true positives (TP) among all ground-truth instances (TP + FN). Both Precision and Recall are fundamental in evaluating the performance of object detection algorithms. Additionally, the F1-Score, their harmonic mean, offers a balanced single metric that requires both Precision and Recall to be simultaneously high for the F1-Score to be high. This provides a comprehensive assessment of a detector's performance, considering its ability to minimize both false positives and false negatives. Notably, the Precision, Recall, and F1-Score values presented in Table 1 are computed with an Intersection over Union (IoU) threshold of 50%.
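These definitions translate directly into code. The TP/FP/FN counts below are illustrative placeholders, not the values from Table 1.

```python
def precision(tp, fp):
    # Fraction of positive predictions that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of ground-truth instances that were found.
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean: both precision and recall must be high for F1 to be high.
    return 2 * p * r / (p + r)

# Illustrative counts only.
p = precision(tp=90, fp=10)   # 0.9
r = recall(tp=90, fn=30)      # 0.75
f1 = f1_score(p, r)
```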
Average Precision (AP) quantifies the detection model’s performance by calculating a weighted mean of precision scores across different confidence thresholds, accounting for the increase in recall from the previous threshold. Mean Average Precision (mAP) serves as a comprehensive metric to assess the overall performance of a detection model. It extends the evaluation by calculating Average Precision (AP) at multiple Intersection over Union (IoU) thresholds. The mAP is derived by averaging the AP scores across these IoU thresholds, offering a thorough assessment of the model’s accuracy across a spectrum of bounding box overlap conditions. Specifically, mAP0.5 represents the mAP calculated at an IoU threshold of 0.5, while mAP0.5:0.95 extends the analysis across a range of IoU thresholds from 0.5 to 0.95, with 0.05 increments.
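The threshold averaging behind mAP@0.5:0.95 can be made concrete; the per-threshold AP values below are illustrative placeholders, not results from this work, and only the threshold range follows the text.

```python
import numpy as np

# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values, 0.05 increments).
thresholds = 0.5 + 0.05 * np.arange(10)

# Placeholder AP values per threshold: AP typically falls as the IoU
# requirement tightens (these numbers are purely illustrative).
ap_per_threshold = np.linspace(0.9, 0.4, len(thresholds))

map_50 = ap_per_threshold[0]        # mAP@0.5: AP at the 0.5 threshold alone
map_50_95 = ap_per_threshold.mean() # mAP@0.5:0.95: mean over all thresholds
```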
It is important to emphasize that the 10 s video, comprising 600 frames, is distinct from the training and test datasets. The test images were used exclusively to evaluate the network's detection performance on still images, whereas the 10 s video is used to assess the counting algorithm, which is specifically tailored to video sequences. This distinction keeps the counting evaluation clearly separate from the image detection task within the scope of this study.
The counting algorithm using the frame-by-frame corrected detections obtained the results shown in Table 2. Because the detections were corrected frame by frame, for comparison with the results obtained using the YOLOv4 network, there are no detection omissions (oranges that were not detected in frames) and no false detections (objects detected as oranges that are not). The number of oranges counted by the algorithm remained close to the correct number regardless of which object tracker was used, or even if none was used.
This is because, with the detections corrected manually frame by frame, every orange visible in a frame is detected. Consequently, when the object trackers are used, they never actually track the oranges: if an orange is not detected, it is because it is no longer visible in the frame, not because the object detector failed. Thus, even if no object tracker is used, the number of oranges counted is still very close to the correct value, although the number of double counts and of ID repetitions (situations in which the same ID is used for two oranges, generally when the first orange is occluded and shortly afterwards a new orange is detected nearby and receives the occluded orange's ID) is higher.
When the counting algorithm is used with YOLOv4 to perform the detection of oranges, there is a larger discrepancy in the number of oranges counted, as shown in Table 3. The number of detection omissions was 15, meaning that 15 oranges were never detected by YOLOv4 in any frame. This does not mean that YOLOv4 failed to detect oranges 15 times in total: if the same orange goes undetected in multiple frames, it still counts as a single omission, and an orange only counts as an omission if it was never detected at all. If an orange appeared in 100 frames but was detected by YOLOv4 in 20 of them, it does not count as an omission, since the object trackers make it possible to follow the orange in the frames where YOLOv4 was unable to detect it.
The number of erroneous detections was two, meaning that the network detected two objects as oranges that were not. Likewise, even if YOLOv4 detected the same object erroneously in several frames, it counts as a single false detection. The numbers of double counts and of ID repetitions for different oranges were also higher when no tracker was used than when any of the proposed object trackers was used.
This was expected because, unlike with the frame-by-frame corrected detections, there were moments when an orange was visible in the frame but YOLOv4 did not detect it. Having an object tracker to follow the oranges in the frames where YOLOv4 fails therefore helps reduce double counting: if YOLOv4 detects the orange again in a future frame, the ID of the orange being tracked by the object tracker is reassigned instead of a new ID being created.
Although frame-by-frame corrected detections are more accurate than YOLOv4-produced detections, the orange counting algorithm came closer to the correct number of oranges when using YOLOv4 detections than when using frame-by-frame corrected detections. This is because when using the corrected detections, there are no omissions of detections or wrong detections, so the only errors present are double counting errors and ID repetition errors. ID repetition errors are much rarer than double-counting errors, especially when using an object tracker. Thus, double counting errors constitute the majority of errors when using corrected detections. By their nature, double-counting errors tend to overestimate the number of oranges counted. That is, they tend to introduce a positive error in the final count of the number of oranges.
When using the detections made by YOLOv4, there are some detection failures: some oranges are never detected in any frame, because of occlusion or lighting, and some objects are detected as oranges when they are not. In total, 15 oranges were not detected and two objects were erroneously detected as oranges, so the error due to undetected oranges is larger than the error due to wrongly detected objects. Unlike double counting, the error introduced by detection omissions tends to underestimate the correct number of oranges; that is, it introduces a negative error in the final count. Thus, when the detections made by YOLOv4 are used, the negative error introduced by detection omissions tends to balance the positive error introduced by double counting. That is why the counting algorithm arrived at a count closer to the actual value when using YOLOv4 detections than when using the frame-by-frame corrected detections.
It is also possible to notice that there was no significant difference in the count when different object trackers were used. This is because an object tracker only follows an object for a maximum of 35 frames after the object detector can no longer detect it. If, after these 35 frames, the object detector has not detected a nearby orange that can be assigned to the tracked orange, the ID of that orange is deleted. Since the video was acquired at 60 frames per second, there are no significant variations from one frame to the next, as seen in Figure 7. Therefore, over a span of 35 frames there is little variation in the orange's appearance between the first and last frame, so the tracking task is not difficult, and the object trackers achieve similar performances.
Figure 8 shows 20 frames spaced from the first to the last frame of the 10 s video, where the counting algorithm runs with YOLOv4 for detection and the Dlib tracker for object tracking. The upper-left corner of each frame indicates, from top to bottom:
The number of that frame;
The total number of oranges already counted by the algorithm;
The total number of oranges present in that frame;
The type of object tracker used;
The number of oranges being tracked by the object tracker.
In the upper-left corner of each orange's bounding box, a legend indicates the ID number assigned to that orange. The larger colored circles visually distinguish oranges with different IDs. The smaller black and white circles within the larger colored circles indicate whether the object detector detected that orange in that frame: black if it did, and white if it did not, in which case the object tracker is being used to track the orange through the respective frames.
Figure 9 shows a situation of occlusion by a branch in which the algorithm is able to reassign the same IDs to the respective oranges, thus avoiding double counting. In frame 208, the algorithm is tracking the oranges with IDs 13 and 91, and the black circle marking shows that the object detector detected both in that frame. In frame 228, neither orange is detected by the object detector, as indicated by the white circle marking; at this point, the oranges are being tracked by the Dlib object tracker. In frame 248, both oranges are detected again by the object detector, and their IDs are reassigned without double counting.
Unlike the situation shown in Figure 9, in which there was occlusion but the counting algorithm reassigned the orange IDs and avoided double counting, Figure 10 shows an occlusion situation in which the algorithm was unable to avoid double counting. The orange with ID 84 was being tracked until it was occluded by foliage and was no longer detected by the object detector in frame 156. After it was not detected again in the following frames, the algorithm stopped tracking this orange in frame 186. In frame 280, however, the same orange is detected again by the object detector after the foliage stops obstructing its line of sight. In this case, there is no nearby ID 84 bounding box to allow the same ID to be reassigned, since the algorithm stopped tracking it at frame 186; a new ID, 124, is therefore assigned, and the same orange is counted twice.
The same occlusion and double-counting situation shown in Figure 10 occurs again in Figure 11. The orange in frame 126 is being tracked by the algorithm and has ID 11. In frame 144, this orange is no longer detected by the object detector and is tracked by the Dlib object tracker instead. In frame 162, as the object detector could not detect the orange again due to the occlusion caused by the foliage, the algorithm stopped tracking the orange and stopped showing its ID and bounding box in subsequent frames. In frame 218, however, the same orange is detected again by the object detector and receives ID 102, so the same orange was counted twice.
Figure 12 shows another double-counting situation, this time caused by occlusion by another orange rather than by foliage. The orange with ID 99 was being tracked by the algorithm, as can be seen in frames 256 and 268. In frame 280, however, it is obstructed by the orange with ID 103 and is no longer detected by the object detector; at that moment, the Dlib object tracker takes over the tracking. As the object detector cannot detect the orange with ID 99 again in subsequent frames, the algorithm stops tracking this orange and stops showing its bounding box and ID in the frames, as seen in frame 301. The orange that had been obstructed by the orange with ID 103 becomes visible again in frame 316, but since ID 99 was discarded in previous frames and cannot be reassigned, the algorithm assigns a new ID, 133, and the same orange is counted twice.