**4. Discussion**

The results show that, across all experiments, Faster R-CNN and CenterNet were the best-performing architectures for broccoli maturity detection (Tables 4–7 and Figure 9). Their performance was, on average, always above 80% for mAP@50 and 70% for mAP@75 in the three-class problem, regardless of the hyperparameter configuration and augmentation techniques, which is a solid result considering the open-field nature of the dataset. The same pattern emerged in the detection of small objects; however, the differences there were small and the standard deviations high, making it difficult to single out a best detector for small objects. According to Table 1, the strong performance of CenterNet was expected, given its higher mAP on the COCO dataset compared to the other detectors. Faster R-CNN, in contrast, was in theory less promising than more recent architectures such as EfficientDet-D1 and RetinaNet. Additionally, Faster R-CNN and RetinaNet shared the same backbone, which allowed a closer comparison between single-shot and two-shot detectors; in this case, the two-shot approach surpassed both SSD methods. This result highlights that evaluating older architectures on new domains remains worthwhile, and it raises the question of whether SSD may need a deeper backbone (e.g., ResNet 164) in order to consistently surpass the two-shot approach. Such an experiment is feasible because the inference time of SSD is lower, leaving room for a deeper backbone that could match Faster R-CNN's times and performance. This approach has a limit, however: the embedded systems on which the detector would be deployed have hardware constraints, where smaller architectures such as MobileNet would be preferable. Given the poor results obtained by MobileNet V2 in this work, the version 3 architectures (both small and large) should be evaluated to shed some light on this problem.

Another important aspect of this research was the evaluation of different data augmentation approaches, given the relevance this technique currently has in the related literature. Without any form of augmentation, the best-performing architectures were Faster R-CNN (ResNet) for mAP@50 and CenterNet for mAP@75, achieving over 82% and 73%, respectively (Table 4). However, as hypothesized, training without augmentation did not yield the best performance. Moreover, as can be observed in the presented tables, augmentation narrowed the gap between training and testing performance, which translates into a lower chance of overfitting. Regarding the most promising techniques, geometrical augmentation, with or without colour distortions, led to the highest performances overall (see Figure 10). Using only geometrical augmentation, Faster R-CNN (ResNet) scored the highest mAP@50 across all configurations and architectures (84.19%). It was quite remarkable, on the other hand, that colour distortion by itself decreased performance; for example, SSD-MobileNet dropped from an mAP@50 of 78.89% (no augmentation) and 82.82% (geometrical augmentation) to 77.38% (colour augmentation). Finally, with both colour and geometrical augmentations active, a surprising result was the improvement of RetinaNet, which not only obtained the second-best performance across all architectures but also achieved the highest mAP@small across all architectures and augmentation configurations. In contrast, Faster R-CNN (ResNet) performed worse with the combined augmentations than with either colour or geometrical augmentation applied separately.
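
The contrast between the two augmentation families can be illustrated with a minimal sketch. The functions below are not the pipeline used in this work, only a simplified illustration: a geometric transform (here, a horizontal flip) must also update the bounding boxes, while a photometric transform (here, a brightness shift) leaves them untouched.

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its [x_min, y_min, x_max, y_max] boxes."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()
    boxes = boxes.copy()
    # Mirror the x coordinates and swap min/max so boxes stay well-formed.
    boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    return flipped, boxes

def brightness_shift(image, rng, max_delta=30):
    """Photometric transform: shift brightness; boxes need no update."""
    delta = rng.integers(-max_delta, max_delta + 1)
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)
```

Augmentation libraries typically chain many such transforms with randomized parameters per image; the key distinction remains that geometric transforms change the annotation geometry while photometric ones do not.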

The improvement from geometrical augmentation is in line with the results presented in [31,43], although the specific transformations differed. As discussed by those authors, the geometrically transformed images resemble the test set because the test images also contain broccoli heads of different sizes, scales, and positions. As a result, these transformed images allowed the neural networks to generalize better and to detect the broccoli heads in the test images with a higher mAP. With the colour transformations, however, the transformed images were less similar to the broccoli heads of the test set: unrealistically dark or bright images could appear, and even changes in texture that differed from the textural patterns learned during training could result in a lower mAP (see Figure 7). This leads to the conclusion that standard augmentations, which usually work on popular datasets such as COCO, may not work in specific domains such as agriculture and maturity-level classification, where colour plays an important role. Another relevant remark on data augmentation is that every type of augmentation reached performances of around 87.5% in some specific train–test splits and pipeline configurations. This reinforces the need to run several experiments with different splits in order to thoroughly understand the general behaviour of a given detection pipeline.
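
The recommendation to average over several splits can be captured in a small harness. This is a generic sketch, not the evaluation code of this work; `train_fn` and `eval_fn` are placeholder hooks for an actual training routine and mAP computation.

```python
import random
import statistics

def repeated_split_eval(samples, train_fn, eval_fn,
                        n_runs=10, test_frac=0.2, seed=0):
    """Report mean and standard deviation of a metric over several
    random train/test splits, one split per run."""
    scores = []
    for run in range(n_runs):
        rng = random.Random(seed + run)        # reproducible per-run split
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        scores.append(eval_fn(model, test))
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting the standard deviation alongside the mean, as done in the tables of this work, is what makes split-to-split variability visible.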

For the two-class approach, the performance was significantly higher for all tested detectors, indicating that merging the two lower-maturity classes into one, essentially only separating them from the ready-to-harvest plants, was successful. This is a logical outcome considering that the "boundaries" between these classes were an approximation devised for field labelling, not an actual measurable metric; as a result, some confusion might occur even for a human observer tasked with separating instances of the two classes. From an agronomic point of view, quantifying maturity in the early stages holds some value for the early planning of upcoming harvesting operations. Nevertheless, the most important aspect of maturity detection is the identification of already-mature crops, in order to avoid their prolonged exposure and delayed harvesting, which might lead to quality degradation. Additionally, in the simplified two-class problem, larger broccoli heads naturally cover more space in the image and are therefore more distinguishable in the first place. It can be inferred that the detection of class 3 (ready-to-harvest) outperforms the other two because the model can detect these heads more easily, despite the fact that it is not the largest class (see Figure 6).

Focusing on the performances, the two best-performing detectors both achieved over 91% mAP@50 and 83% mAP@75. These were again CenterNet and Faster R-CNN, both with geometrical data augmentation. SSD and RetinaNet showed lower performances, but still improved significantly over their three-class counterparts. It is worth remarking that, as in the three-class version of the problem, CenterNet pulled further ahead on mAP@75. This raises the question of whether learning box centres instead of box coordinates is the sole factor boosting localization accuracy: Faster R-CNN, which is anchor-based, also reached high mAP@75 values, but the high variance of its results severely lowered its average.
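
The gap between mAP@50 and mAP@75 comes down to the IoU threshold at which a predicted box counts as a true positive. A minimal IoU function (the standard definition, not tied to any particular framework) makes the distinction concrete: the example boxes below overlap with IoU ≈ 0.67, so the prediction counts as correct at the 0.5 threshold but is rejected at 0.75.

```python
def iou(a, b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction shifted by 2 px against a 10x10 ground-truth box:
print(iou([0, 0, 10, 10], [2, 0, 12, 10]))  # -> 0.666..., a TP at IoU=0.5 but not at 0.75
```

This is why mAP@75 rewards tighter localization: small box shifts that survive the 0.5 threshold are penalized at 0.75.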

One possible limitation of the presented work is the dataset size (288 images with 640 annotations), which could be a cause of overfitting. However, according to the results presented, the mAP@50 performances on the training and testing sets were quite similar. This means that, over 10 runs, the detectors did not overfit the training data and were able to generalize to the unseen test set. Additionally, the use of early stopping based on the difference between the training and validation performances further reduced the chance of overfitting. These results are promising, but real-world conditions could produce more complex and diverse images. The presented research is therefore an initial experimental run of an open research subject, which is planned to be extended further (using larger data volumes and various learning techniques) in order to ensure its suitability in production.
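
A gap-based early-stopping criterion of the kind mentioned above can be sketched as follows. The thresholds `max_gap` and `patience` are hypothetical illustrations, not the concrete values used in this work.

```python
def should_stop(train_scores, val_scores, max_gap=0.05, patience=3):
    """Trigger early stopping once the train-validation gap has exceeded
    max_gap for `patience` consecutive epochs (hypothetical thresholds).
    Scores are per-epoch metric values, e.g. mAP@50 in [0, 1]."""
    gaps = [t - v for t, v in zip(train_scores, val_scores)][-patience:]
    return len(gaps) == patience and all(g > max_gap for g in gaps)
```

Monitoring the gap, rather than the validation score alone, targets overfitting directly: training halts when the model keeps improving on the training set while the validation set stops following.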

On a more technical note, the present study was an opportunity to gain knowledge on the proper planning and deployment of such experiments. First, during the orthomosaicing process, it is common for the flight lines around the perimeter of the flight plan, which often also mark the boundaries of the experimental field, to have the lowest overlap values. The reason is that this area is scanned only once during the entire flight, so the surrounding areas are captured in the images of a single flight line. As a result, poor mosaicking quality can easily occur there, and if most of the targets are placed on the perimeter, the entire mission is at risk of failure. On the other hand, deploying ground-truth targets on the perimeter of the field is the easiest option: they are readily accessible, not covered by vegetation, and the perimeter's lower humidity can extend the time until the targets become soggy or covered by water droplets if they are not protected properly. The distribution of the ground-truth targets is therefore a factor that should always be considered: targets should be properly scattered towards the middle of the experimental area, and ideally across a large area. Additionally, to ensure their survivability in high-humidity conditions, the targets should be made of waterproof materials and ideally placed on a slight incline to avoid the formation of droplets on their surface, which would decrease their visibility.

Finally, as the timeframe during which data collection can be performed for this specific type of experiment is very strict, the utmost attention should be given to minimizing as many risks as possible before committing to the field visit. One of the most unpredictable factors for all UAV missions is the weather. Certain UAVs can fly under harsher conditions, while known operating thresholds always give pilots a safety switch to cancel high-risk missions in time if they judge that the conditions do not allow a safe flight. Weather forecasts can mitigate this risk to a certain extent by giving pilots enough time to adjust or select the appropriate fleet for each mission based on the conditions they expect to encounter, although drastic changes, especially in wind speed and direction, are anything but rare.
