Review Reports - Assessment of Different Object Detectors for the Maturity Level Classification of Broccoli Crops Using UAV Imagery

Round 1

Reviewer 1 Report

The authors can summarize their contribution at the end of the introduction to attract the reader.
Please explicitly clarify the role of UAV. It seems UAV is responsible for collecting data and other computations are done by the other terminal device.
Is it possible to deploy the trained model into UAV and UAV can send real-time feedback based on current observation to a central server with the GPS location? Please explain the challenges and future direction regarding this envision.
The authors have mentioned that " Table 1 shows the specific hyper-parameters used in this work.". However, Table 1 does not provide sufficient enough information so that the interested reader can reproduce the model. Therefore, please add sufficient information regarding parameters and neural network settings that can help to reproduce the model.
Could the author deploy a Capsule network model for comparing their performance since the Capsule network can capture the sharp changes of the correlated feature to address the Picasso problem?

Author Response

Comment 1: The authors can summarize their contribution at the end of the introduction to attract the reader.

Response 1: We are grateful for this comment. We have updated the final paragraph of the ‘Introduction’ section (Line 133), and also updated the abstract accordingly.

Comment 2: Please explicitly clarify the role of UAV. It seems UAV is responsible for collecting data and other computations are done by the other terminal device.

Response 2: We thank the reviewer for this valuable comment. That is correct, in our present work, the UAV is only the data collection tool, and all computations are done in separate units, as described in ‘2.3 Object Detection Pipelines’. To make sure that this aspect of the experiment is clear, and no real-time operations take place in our case, we have added an introductory sentence at the start of “2.2 Data Pre-processing” clarifying this (Line 193).

Comment 3: Is it possible to deploy the trained model into UAV and UAV can send real-time feedback based on current observation to a central server with the GPS location? Please explain the challenges and future direction regarding this envision.

Response 3: Interestingly enough, the greater limitation of integrating this process in a real-time system, is not the inference of the images through the object detectors, but the photogrammetric process required in the preprocessing, which is something we are currently working on and hopefully can present in future work. If this photogrammetric process is omitted, and the problem is simplified to just sending or even inferencing the data in real time, we then encounter the problem of overlapping images, where a single crop can trigger multiple detections in consecutive images. In turn, if we follow another approach, one that intentionally reduces overlaps, from our personal experience, the accuracy of the flight and the collected data is decreased significantly, unless a high-end (RTK) positioning system is on board (which is what we are currently experimenting with). To cover this subject, a new sentence and some slight adjustments have been added towards the end of the ‘Conclusions’ section (Line 578).

Comment 4: The authors have mentioned that "Table 1 shows the specific hyper-parameters used in this work.". However, Table 1 does not provide sufficient enough information so that the interested reader can reproduce the model. Therefore, please add sufficient information regarding parameters and neural network settings that can help to reproduce the model.

Response 4: Thank you for this comment. We definitely agree that Table 1 is not sufficient for reproducibility purposes. For that reason, we have divided Table 1 into three different tables. Table 1 presents the detectors and the hyper-parameters that have not been fine-tuned. Table 2 presents the different hyper-parameters that have been evaluated in preliminary experiments to find the most promising configuration setting for each detector. And finally, Table 3 shows the selected hyper-parameters for every architecture, which have been the ones used for running the 10 runs presented in this manuscript. Moreover, additional text for explaining deeply these tables has been provided:

Line 278: When developing an object detection pipeline, it is important to fine-tune different hyper-parameters to select those ones that fit better a specific dataset. This means that different datasets could lead to different hyper-parameter configurations in every detector. Table 2 presents the evaluated hyper-parameter space used in this work to choose the most promising ones after 5 runs with different train-validations-test splits (See Table 3). The selected configurations were executed for 5 additional runs in order to complete the experimental trials and extract some statistics.

Line 287: Two different optimizers were evaluated: Adam and SGD. All detectors performed better with SGD except CenterNet, which obtained its best performances with Adam. Related to the optimizer, the learning rate (LR) also played a major role. Adam worked better with the smallest evaluated LR (0.001), while SGD obtained the best performance with higher LRs (0.05 and 0.01). Two additional hyper-parameters that completely changed the training behavior were the warm-up steps and the batch size. The warmup is a period of the training where the LR starts small and smoothly increases until it reaches the selected LR (0.05, 0.01 or 0.001). Every detector had a different combination to obtain their best performance, but it is important to remark that using less than 500 steps (around 10 epochs) for warming obtained poor performances. Finally, all detectors, except CenterNet, which uses an anchorless approach, needed to run the Non-Maximum-Suppression (NMS) algorithm to remove redundant detections of the same object. Two important values configure this algorithm: the ratio to consider that two predicted bounding boxes point to the same object (IoU threshold in the tables) and the maximum number of detections allowed (Max. detections in the tables). As it can be observed, again, all the detectors found a different combination as the most promising one. Finally, an early stopping technique to avoid overfitting has been implemented. Specifically, if the difference between training and validation performances is greater than 5% for 10 epochs, the training process stops. All the detectors were trained for a maximum of 50 epochs.

Comment 5: Could the author deploy a Capsule network model for comparing their performance since the Capsule network can capture the sharp changes of the correlated feature to address the Picasso problem?

Response 5: Once again, thank you for the comment. Although capsule networks were not evaluated in this work, we definitely foresee how their use could find more reliable models with more semantic features and relationships among them. Therefore, in the present manuscript, we have prescribed them as an interesting work to address in the future:

Line 578: Furthermore, different machine learning techniques, such as semi-supervised learning approaches, capsule networks [45] and transfer learning from backbones previously trained on agricultural datasets (i.e.: domain transfer) are already under experimentation in similar trials, while integration of the entire pipeline in a real-time system is also currently tested.

Reviewer 2 Report

presented manuscript deals with an interesting topic of object detection in agriculture. Proposed study introduces the process of detection of broccoli plants.

First, author describe the location of the experiment. I really appreciate that the procedure is based on real (not synthetic) data.

Then, data pre-processing and analysis using CNN is described.

I have to state that the manuscript is clearly written, the level of English is good.

I have these minor comments:

1) How did you solve the image registration?

2) What abou overlaping images?

3) Did you compare your aproach to other standard methods of image processing (any integral transform)?

Owing facts mentioned above I recommend to accept the manuscript after MINOR REVISION.

Author Response

Comment 1: How did you solve the image registration?

Response 1: We thank the reviewer for this comment. All images have been captured with a single camera, and the georeferencing metadata are generated from a common positioning system (described in our Materials and Methods section), ensuring consistency. The mosaicing process was performed using a standard RGB photogrammetric pipeline (Line 194), with the bright index objects mentioned in the 2.1 Data Acquisition section and other landmarks (such as a weather station) mainly towards the center of the field as GCPs. This approach (GCP objects in the center of the field) was selected to avoid using the low-overlap images and thus low appearance rate of the objects in the perimeter) being used as GCPs.

Comment 2: What about overlapping images?

Response 2: When designing the flights, the frontal and lateral overlaps of each image were both selected to be 80% as described in the Materials and Methods section 2.1 Data Acquisition’ (Line 166). These images were then used to create a georeferenced orthomosaic, as described in ‘2.2 Data Pre-processing’ (Line 194), with naturally 0% overlap. This mosaic was then ‘cut’ into the individual (500x500 resolution) images that would form our final dataset. Therefore, these steps ensure that the final dataset has no overlapping images, as the raw data have undergone orthomosaicing and were then split into smaller images, so that they can be used by the object detectors.

Comment 3: Did you compare your approach to other standard methods of image processing (any integral transform)?

Response 3: As a matter of fact, we are currently experimenting on multiple different image processing techniques (such as various integral transforms), as well as different learning approaches (i.e. semi-supervised learning) for this experimental case. However, we have decided not to include them in the present document, as it would deviate from the objectives of this paper, but also provide far too much information for a single manuscript.

Reviewer 3 Report

This manuscript compares state-of-the-art object detection architectures and data augmentation techniques and assess their potential in maturity classification of open-field broccoli crops, using a high-Ground Sampling Distance (GSD) RGB image dataset collected from low altitude UAV flights. In summary, the research is interesting and provides valuable results, but the current document has several weaknesses that must be strengthened in order to obtain a documentary result that is equal to the value of the publication.

General considerations:

Figures are not attractive and detailed enough.
Theoretical analysis of the techniques used in the experiment is not detailed enough.

Title, Abstract, and Keywords:

The innovation is not evident, and keywords do not well summarize the key technologies of this paper. The keywords are too general to be professional.

Chapter 1: Introduction

In the introduction, the key technologies of the research are insufficiently introduced, such as object detectors and UAV images processing. The research gap is not presented.

Chapter 2: Materials and Methods

Analysis of object detection network is too few, and network evaluation metrics are not comprehensive enough.

Chapter 3: Results

Analysis of experimental results is too superficial to be professional.

Chapter 4: Discussion

Datasets and settings are unclear. For deep learning, datasets are too few to over-fit.

Chapter 5: Conclusion

I suggest including a more professional analysis of the object detection network structure, so as to deepen the understanding of experimental phenomena and obtain better detection results. This section of the file can be improved and more rigorously completed.

Author Response

Comment 1: Figures are not attractive and detailed enough.

Response 1: We thank the reviewer for this comment. To address it, Figures 11 and 12 (Line 412 and Line 428 respectively) have been replaced with new ones where the confidence scores, labels and overall detection boxes are clearer.

Comment 2: Theoretical analysis of the techniques used in the experiment is not detailed enough.

Response 2: Thanks for the remark. In the present manuscript, we have extended theoretical and practical descriptions of the techniques used. Besides the text, Tables 2 and 3 have been added to describe not only the theoretical aspects of the detectors, but also the practical hyper-parameters shaping them and evaluated in this work.

Line 250: Several object detectors have arisen in the last few years, each of them having different advantages and disadvantages. Some of them are faster and less accurate, while others have higher performances, but use more computational resources, which sometimes are not suitable depending on the deployment platform. In this work, five different object detectors were used: Faster R-CNN [8], SSD [10], CenterNet [38], RetinaNet [11] and EfficientDet-D1 [39]. Table 1 shows the specific detectors evaluated in this work. Except for the input size, whose value was the closest to the original image size (500x500), the most promising ones were selected after some early experiments. It can be observed that ResNet-152 [22] is used as the backbone in two object detectors (Faster R-CNN and RetinaNet). Furthermore, HourGlass-104 [40], MobileNet-V2 [41] and EfficientNet-D1 [39] were evaluated. On the other hand, three of them use a Feature-Pyramid-Network (FPN), which is supposed to provide better performance at detecting small objects. The reason for selecting these architectures is that different detection "families" are represented. For instance, SSD (e.g.: RetinaNet) and two-shot (Faster R-CNN) were compared to verify that the second one may lead to better performances over the basic approach (SSD and MobileNet-V2). However, the architectural improvements of RetinaNet (e.g.: focal loss) could invert this assumption. On the other hand, the inference time was out of the scope of this paper, where the real-time detection is not discussed; but, theoretically, SSD, could lead to faster inferences. Additionally, with these detectors, the anchorless approach was contrasted against the traditional anchor-based detection. Again, the most important theoretical gain could be the inference time, which is not discussed; but related works presented the promising performances of CenterNet that could lead to being the chosen detector for future deployment on the field. Table 1 also reports the mAP obtained on the COCO dataset. From this performance, it can be estimated that CenterNet and EfficientDet-D1 will be the best detectors, while SSD (MobileNet) and Faster R-CNN will be the least promising ones.

Comment 3: The innovation is not evident, and keywords do not well summarize the key technologies of this paper. The keywords are too general to be professional.

Response 3: We certainly agree with the comment of the reviewer, as the initial keywords were indeed very generic. The keywords have been updated to better represent and describe our work. Moreover, adjustments have been done in the paragraphs of the Introduction section (Line 127 – Line 135), to properly highlight the innovation of our work.

Comment 4: In the introduction, the key technologies of the research are insufficiently introduced, such as object detectors and UAV images processing. The research gap is not presented.

Response 4: The research gap has been better highlighted in the Introduction section, as mentioned in [Response 1].

The methods’ section has been significantly expanded, as explained in great detail in the [Response 2], regarding the Theoretical Analysis.

For the use of UAV image processing, a detailed description of the UAV image processing exists in section ‘2.2 Data Pre-processing’ (Line 192). We believe that is an appropriate location, as the methodological pipeline used in this experiment contains many case-specific elements aiming to better address the challenges of our experiment, starting from raw-data handling and covering every step of the dataset preparation phase.

Comment 5: Analysis of object detection network is too few, and network evaluation metrics are not comprehensive enough.

Response 5: Thank you for this valuable comment. Although we encourage the reader to read the original sources, we agree that the description of the detectors is not enough and should be enriched for contextualizing the rest of the paper. As it was previously answered in a previous comment, in the current manuscript, we have extended the introduction of detectors, making some assumptions according to their theoretical skills, which have been further discussed in the Discussion Section (See next comment).

On the other hand, we agree that more information should be provided to describe the evaluation/metrics part. Therefore, in the present manuscript, we have extended the Evaluation Metrics Section, with additional explanations and a figure as depiction (See FIgure 8).

Line 321: Precision is defined as the number of true positives divided by the sum of true positives (TP) and false positives (FP), while AP is the precision averaged across all unique recall levels. Since the calculation of AP only involves one class and, in object detection, there are usually several classes (3 in this paper), mean Average Precision (mAP) is defined as the mean of AP across all classes. To decide what a TP is, Intersection over Union (IoU) threshold is used. IoU is defined as the area of the intersection divided by the area of the union of a predicted bounding box. For example, Figure 8 shows different IoU in the same image and ground truth (red box) by varying the prediction (yellow box). If an IoU > 0.5 is configured, only the third image will contain a TP, while the other two, will contain an FP. On the other hand, if an IoU > 0.4 is used, the central image will also count as a TP. In case multiple predictions correspond to the same ground-truth, only the one with the highest IoU counts as a TP, while the remaining are considered FPs. Specifically, COCO reports AP for two detection thresholds: IoU > 0.5 (traditional) and IoU > 0.75 (strict), which are the same thresholds used in this work. Additionally, mAP with small objects (area less than 32x32 pixels) was also reported.

Comment 6: Analysis of experimental results is too superficial to be professional.

Response 6: Thank you for the suggestion. We would say that the analysis found in the Results section is indeed a bit dry. However, this is a section where we try to be as objective and "cold" as possible, just putting in words what the tables and charts summarize; and providing some early insights that will be further extended in the Discussion section, where we try to be more creative developing hypothesis according to the results and our experience in previous researches and related literature. Therefore, we have rewritten the discussion section by trying to dig deeper.

Comment 7: Datasets and settings are unclear. For deep learning, datasets are too few to over-fit.

Response 7: We thank the reviewer for this comment.

Regarding the information of the dataset, a dedicated sub-section in section ‘2.2 Data Pre-processing’ (Line 192) already exists. We believe that the location of these paragraphs is appropriate, as the methodological pipeline used in this experiment contains many case-specific elements aiming to better address the challenges of our experiment, starting from raw-data handling and covering every step of the dataset preparation phase.

In regards to overfitting, we agree that we have not provided enough information to explain how we avoid it. Thus, some extra information has been added to the current manuscript. Although we can claim that overfitting, described as the difference in performance between training and validation/testing sets, is not present in our results, we should warn the readers about the small size of the dataset and the preliminary scope of the research: we are not presenting a pipeline ready for being deployed into a production setting, but just providing insights about how different detection pipelines perform on this problem under the described restrictions.

Line 303: Finally, an early stopping technique to avoid overfitting has been implemented. Specifically, if the validation performance is greater than 5% for 500 steps, the training process stops.

Additionally, all result tables present between parentheses a new value in the mAP@50 column that shows the performance on the training set. This way, we make sure that the reader can decide whether the differences between train and test performances could suggest the implementation of additional regularization techniques. To this end, Tables 4, 5, 6, 7 and 8 have been updated respectively.

Finally, the discussion section has been extended with some thoughts related to the overfitting chances and the soundness of the dataset.

Line 518: One possible limitation of the presented work is the dataset size (288 images with 640 annotations) since it could be the cause of overfitting. However, according to the results presented, the mAP@50 performances on train and test sets are quite similar. This means that, after 10 runs, the detectors did not overfit training data and they were able to generalize on the unseen test set. Additionally, the use of early stopping based on the difference between training and validation performances reinforced the reduction of overfitting chances. On the other hand, these results could be seen as promising, but the real-world conditions could lead to more complex and diverse images. Therefore, this presented research is an initial experimental run of an open research subject, which is planned to be further extended (using larger data volumes and various learning techniques) to ensure its suitability in production.

Comment 8: I suggest including a more professional analysis of the object detection network structure, so as to deepen the understanding of experimental phenomena and obtain better detection results. This section of the file can be improved and more rigorously completed.

Response 8: Thanks for the comment. As stated in the Results comment, we have enriched the Discussion section trying to provide deeper and more meaningful insights of the research’s findings.

Reviewer 4 Report

The study is based on the automation process, by using state-of-the-art Object Detection architectures, trained on georeferenced orthomosaic derived RGB images captured from low altitude UAV flights, and assess their capacity to effectively detect and classify broccoli heads based on their maturity level. The study is based on an important application and is very interesting. Introduction is well written. I have the following queries that I think that authors would want to improve upon:

The information on the dataset is very limited. The partition with 10% validation, could the authors clarify how this is done in the model process and the reason for taking large segment with training.
There is only one graph on performance of the model, could there be graphs to show the efficiency and error metrics?
More information should be added about COCO, Average Precision and mAP for readers.
2.4 Evaluation Metrics, error with comma 2,4 should be 2.4

Author Response

Comment 1: The information on the dataset is very limited. The partition with 10% validation, could the authors clarify how this is done in the model process and the reason for taking large segment with training.

Response 1: We thank the reviewer for this valuable comment. Regarding the information of the dataset, a dedicated sub-section in the Materials and Methods section provides relevant information in detail (Line 192). We believe that the location of these paragraphs is appropriate, as the methodological pipeline used in this experiment contains many case-specific elements aiming to better address the challenges we encountered (as described in the manuscript), starting from raw-data handling and covering every step of the dataset preparation phase.

In regards to the dataset split, we decided to use a rather standard approach (70-20-10), which allowed the different sets to have enough objects to provide reliable performances. The use of 10% for validation is a heuristic selection that sometimes could not be informative enough, but this did not happen in our experiments. The main reason for having a validation set is (i) to detect overfitting while training and (ii) stop the process since we can assume the detector has reached its maximum ability to generalize on the test set, which is the “important” one. We use a smaller validation set, because this set is not used for learning meaningful features, which is the final goal of a deep learning process, and, on the other hand, it is “biasing” the training so it cannot be considered a “clean” test set, where we can trust that the model is generalizing or not. We have clarified in the current manuscript the use of the validation set for stopping the training process when we detect that overfitting is starting to arise:

Line 303: Finally, an early stopping technique to avoid overfitting has been implemented. Specifically, if the difference between training and validation performances is greater than 5% for 500 steps, the training process stops.

Comment 2: There is only one graph on performance of the model, could there be graphs to show the efficiency and error metrics?

Response 2: Thank you for this comment. We have added a new graph (Line 387) where a deeper analysis of detection performance is presented. Moreover, we have added the following text to describe it:

Line 377: Figure 9 depicts a box plot summarizing the performance of the detectors across all the data augmentations. As it could be inferred from previous tables, Faster RCNN and CenterNet were the most consistent detectors at mAP@50 and mAP@75. However, both architectures presented a different behavior at the same time. On the one hand, Faster R-CNN was able to obtain the highest performances while having a higher variance. On the other hand, CenterNet did not reach the maximum, but showed more consistent performance. The other detectors (SSD, EfficientDet-D1 and RetinaNet) performed worse on average, but in some specific experiments, they were able to reach higher mAPs than CenterNet.

Comment 3: More information should be added about COCO, Average Precision and mAP for readers.

Response 3: We are grateful for this suggestion. We agree that more information should be provided to describe the evaluation part. Therefore, in the present manuscript, we have extended the Evaluation Metrics Section, with additional explanations and a figure as depiction (Line 338).

Line 317: In this paper, the evaluation method for the broccoli maturity detection task was based on the Microsoft COCO: Common Objects in Context (COCO) dataset [42], which is probably the most commonly used dataset for object detection in images. Like COCO, the results of the broccoli maturity detection were reported using the Average Precision (AP). Precision is defined as the number of true positives divided by the sum of true positives (TP) and false positives (FP), while AP is the precision averaged across all unique recall levels. Since the calculation of AP only involves one class and, in object detection, there are usually several classes (3 in this paper), mean Average Precision (mAP) is defined as the mean of AP across all classes. To decide what a TP is, Intersection over Union (IoU) threshold was used. IoU is defined as the area of the intersection divided by the area of the union of a predicted bounding box. For example, Figure 8 shows different IoU in the same image and ground truth (red box) by varying the prediction (yellow box). If an IoU > 0.5 is configured, only the third image will contain a TP, while the other two, will contain an FP. On the other hand, if an IoU > 0.4 is used, the central image will also count as a TP. In case multiple predictions correspond to the same ground-truth, only the one with the highest IoU counts as a TP, while the remaining are considered FPs. Specifically, COCO reports AP for two detection thresholds: IoU > 0.5 (traditional) and IoU > 0.75 (strict), which are the same thresholds used in this work. Additionally, mAP with small objects (area less than 32x32 pixels) was also reported.

Comment 4: 2.4 Evaluation Metrics, error with comma 2,4 should be 2.4

Response 4: Thank you for the valuable observation. It has been fixed in the manuscript.

Round 2

Reviewer 3 Report