*2.4. Object Detection Pipelines*

To create robust object detectors that remain reliable on unseen test images, it is usually necessary to use a training pipeline with implicit regularization, such as data augmentation, applied during training without constraining the model capacity. Data augmentation in computer vision is a set of transformations applied to the original images to increase the variety of images seen by the model during training. As studied in several works [36,37], choosing the right data augmentation pipeline is critical for agricultural computer vision systems to work effectively. As shown in Figure 7, the two augmentation families with the most impact are (i) geometrical transformations and (ii) colour distortions. In this work, the geometrical augmentations comprised horizontal and vertical flipping and random cropping with resizing, while the colour distortions comprised slight modifications of brightness, contrast, hue and saturation. Each augmentation was applied with a probability of 50%.

**Figure 7.** Examples of the original images and the same images after applying the augmentation transformations.
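The augmentation scheme described above can be sketched as follows. This is an illustrative NumPy version, not the TensorFlow implementation used in the experiments; the crop ratio and brightness range are assumptions, and contrast, hue and saturation distortions would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, p=0.5):
    """Apply each augmentation independently with probability p (50% in this work)."""
    # Geometrical transforms: horizontal / vertical flips.
    if rng.random() < p:
        img = img[:, ::-1]                       # horizontal flip
    if rng.random() < p:
        img = img[::-1, :]                       # vertical flip
    # Random crop followed by a resize back to the original size.
    if rng.random() < p:
        h, w = img.shape[:2]
        ch, cw = int(h * 0.8), int(w * 0.8)      # illustrative crop ratio
        y0 = rng.integers(0, h - ch + 1)
        x0 = rng.integers(0, w - cw + 1)
        crop = img[y0:y0 + ch, x0:x0 + cw]
        # Nearest-neighbour resize back to (h, w); enough for a sketch.
        ys = (np.arange(h) * ch / h).astype(int)
        xs = (np.arange(w) * cw / w).astype(int)
        img = crop[ys][:, xs]
    # Colour distortion: slight brightness shift (contrast/hue/saturation analogous).
    if rng.random() < p:
        delta = rng.uniform(-0.1, 0.1)           # illustrative brightness range
        img = np.clip(img + delta, 0.0, 1.0)
    return img
```

Every branch preserves the image shape, so the augmented batch can be fed to the detector unchanged.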

Several object detectors have arisen in recent years, each with different advantages and disadvantages. Some are faster but less accurate, while others achieve higher performance at the cost of more computational resources, which may be unsuitable depending on the deployment platform. In this work, five object detectors were used: Faster R-CNN [8], SSD [10], CenterNet [38], RetinaNet [11] and EfficientDet-D1 [39]. Table 1 shows the specific detectors evaluated in this work. Except for the input size, which was set to the value closest to the original image size (500 × 500), the most promising configurations were selected after some early experiments. It can be observed that ResNet-152 [22] is used as the backbone in two of the detectors (Faster R-CNN and RetinaNet); furthermore, HourGlass-104 [40], MobileNet-V2 [41] and EfficientNet-D1 [39] were evaluated. Three of the detectors use a Feature Pyramid Network (FPN), which is expected to improve the detection of small objects. These architectures were selected because they represent different detection "families". For instance, one-stage detectors (SSD and, with architectural improvements, RetinaNet) were compared against a two-stage detector (Faster R-CNN) to verify whether the latter leads to better performance than the basic one-stage approach (SSD with MobileNet-V2); however, the improvements in RetinaNet (e.g., focal loss) could invert this assumption. Inference time is out of the scope of this paper, in which real-time detection is not discussed; nevertheless, SSD could theoretically lead to faster inference. Additionally, with these detectors, the anchorless approach (CenterNet) was contrasted against traditional anchor-based detection. Again, the most important theoretical gain would be inference time, which is not discussed here; however, related works have reported promising performance with CenterNet, which could make it the detector of choice for future deployment in the field.
Table 1 also reports the mAP (mean Average Precision) obtained on the COCO dataset. Based on these figures, CenterNet and EfficientDet-D1 are expected to be the best detectors, while SSD (MobileNet) and Faster R-CNN are expected to be the least promising ones.


**Table 1.** The selected object detection architectures used for the experiment.

When developing an object detection pipeline, it is important to fine-tune the different hyper-parameters in order to select the values that best fit a specific dataset. This means that different datasets could lead to different hyper-parameter configurations for each detector. Table 2 presents the hyper-parameter space evaluated in this work, from which the most promising configurations were chosen after 5 runs with different train-validation-test splits (see Table 3). The selected configurations were then executed for 5 additional runs to complete the experimental trials and extract statistics.
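The selection procedure (evaluate every configuration over 5 different splits, then keep the one with the best average score) can be sketched as a small grid search. The grid values and the scoring function below are placeholders for illustration, not the actual space from Table 2; a real `evaluate` would train the detector and return its validation mAP.

```python
import itertools
import statistics

# Illustrative hyper-parameter grid (placeholder values, not Table 2).
grid = {
    "optimizer": ["sgd", "adam"],
    "lr": [0.05, 0.01, 0.001],
    "warmup_steps": [500, 1000],
}

def evaluate(config, split):
    """Placeholder score: a real implementation trains the detector on the
    given train-validation-test split and returns its validation mAP."""
    return config["lr"] * 10 + split * 0.01

def select_best(grid, n_splits=5):
    """Average each configuration's score over n_splits splits; keep the best."""
    best_cfg, best_score = None, -1.0
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = statistics.mean(evaluate(cfg, s) for s in range(n_splits))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The winning configuration would then be re-run for the 5 additional trials mentioned above to collect statistics.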

**Table 2.** Hyper-parameters evaluated for each detector.


**Table 3.** Hyper-parameters selected for the final experiments.


Two different optimizers were evaluated: Adam and SGD. All of the detectors performed better with SGD except CenterNet, which obtained its best performance with Adam. Closely related to the optimizer, the learning rate (LR) also played a major role: Adam worked better with the smallest evaluated LR (0.001), while SGD obtained the best performance with higher LRs (0.05 and 0.01). Two additional hyper-parameters that completely changed the training behaviour were the warmup steps and the batch size. Warmup is the period of training during which the LR starts small and increases smoothly until it reaches the selected LR (0.05, 0.01 or 0.001). Every detector needed a different combination to obtain its best performance, but it is worth remarking that using fewer than 500 warmup steps (around 10 epochs) led to poor performance. Furthermore, all of the detectors except CenterNet, which uses an anchorless approach, needed to run the Non-Maximum Suppression (NMS) algorithm to remove redundant detections of the same object. Two important values configure this algorithm: the overlap ratio above which two predicted bounding boxes are considered to point to the same object (the IoU threshold in the tables) and the maximum number of detections allowed (Max. detections in the tables). As can be observed, each detector again found a different combination to be the most promising one, as presented in Table 3. Finally, an early stopping technique was implemented to avoid overfitting: if the difference between the training and validation performances is greater than 5% for 10 epochs, the training process stops. All of the detectors were trained for a maximum of 50 epochs.
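Two of these ingredients lend themselves to compact sketches: the warmup schedule and greedy NMS with an IoU threshold and a maximum number of detections. The versions below are illustrative, not the TensorFlow implementations used in the experiments; in particular, the linear warmup shape, the initial LR and the constant-after-warmup behaviour are assumptions (the actual schedules may decay afterwards).

```python
import numpy as np

def warmup_lr(step, warmup_steps=500, base_lr=0.01, init_lr=1e-4):
    """Linear warmup: the LR grows from init_lr to base_lr over warmup_steps,
    then stays at base_lr (post-warmup decay omitted for brevity)."""
    if step < warmup_steps:
        return init_lr + (base_lr - init_lr) * step / warmup_steps
    return base_lr

def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5, max_detections=100):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop every
    remaining box whose IoU with it exceeds the threshold."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order and len(keep) < max_detections:
        best = order[0]
        keep.append(best)
        order = [i for i in order[1:] if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

With an IoU threshold of 0.5, two heavily overlapping predictions of the same fruit collapse to the single highest-scoring box, while distant detections survive.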

In this work, the experiments were carried out on a GeForce RTX 3090 GPU under Ubuntu 18.04, and TensorFlow 2.6.0 was used to implement the object detection pipelines. Early experiments to identify the most promising detectors were run as parallel tasks on an HPC cluster (ARIS infrastructure, https://hpc.grnet.gr/en/, accessed on 13 January 2022) with 2 NVIDIA V100 GPU cards.
