Peer-Review Record

Research on Apple Object Detection and Localization Method Based on Improved YOLOX and RGB-D Images

Agronomy 2023, 13(7), 1816; https://doi.org/10.3390/agronomy13071816
by Tiantian Hu, Wenbo Wang *, Jinan Gu, Zilin Xia, Jian Zhang and Bo Wang
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 5 June 2023 / Revised: 4 July 2023 / Accepted: 6 July 2023 / Published: 8 July 2023
(This article belongs to the Section Precision and Digital Agriculture)

Round 1

Reviewer 1 Report

This paper introduces a vision-based fruit recognition and localization system that serves as a foundation for the automatic operation of agricultural harvesting robots. Existing detection models often face limitations in real-time requirements for harvesting robots due to their high complexity and slow inference speed. To address these issues, a method for apple object detection and localization is proposed. Firstly, an improved YOLOX network is designed to detect the target region. It utilizes a multi-branch topology during the training phase and a single-branch structure during the inference phase. The spatial pyramid pooling layer (SPP) with a serial structure is employed to expand the receptive field of the backbone network and ensure consistent output. Secondly, an RGB-D camera is used to capture aligned depth images and calculate the depth value of the desired point. Finally, the three-dimensional coordinates of apple picking points are obtained by combining the two-dimensional coordinates from the RGB image with the depth value. Experimental results demonstrate that the proposed method achieves high accuracy and real-time performance. The F1 score is 93%, the mean average precision (mAP) is 94.09%, the detection speed reaches 167.43 F/s, and the positioning errors in the X, Y, and Z directions are all below 7 mm.
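For readers of this record, the localization step summarized above (combining the 2D detection coordinates with the aligned depth value) corresponds, in the usual pinhole-camera formulation, to a simple back-projection. Below is a minimal sketch under that assumption; the function and intrinsic parameter names (fx, fy, cx, cy) are illustrative, not the authors' code.

```python
def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with known depth to 3D camera coordinates.

    depth is the Z value read from the aligned depth image; the returned
    X, Y, Z coordinates share its unit (e.g. millimetres).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth

# Example with made-up intrinsics for a 640x480 RGB-D stream:
# pixel_to_camera_xyz(350, 240, 820.0, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```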

Future research could prioritize augmenting the apple dataset to cover diverse growth stages. Moreover, given the constraints of using a single camera, multiple depth cameras could be explored to fuse several data sources and thereby improve the accuracy of the positioning information.
I would like to suggest adding more discussion and more detailed results of the tests in an apple orchard.

Figures 10 and 11: even if the number of sub-figures is reduced, their size should be increased for high resolution.

The authors should check the overall English; minor editing of the English language may be required.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper shows an improved YOLOX network for accurate apple recognition with better inference speed. The methodology and results are interesting, showing an F1 score of 93% and an mAP of 94.09%.

The recommendations and suggestions for the improvement of the paper are listed below. Also, a PDF is attached with some marks to help with the suggestions and observations.

Yellow marks indicate sentences or terms to modify or verify; red marks indicate the simple adjustments mentioned below.

In line 26 (Introduction): I suggest rewriting this sentence; it might read better if it does not start with "with". Also, the use of robots in agricultural production is not something new, so I recommend changing the word "now" (suggestion: replace it with "enables") or adding a reference for this claim.

In line 30: The sentence “Apple picking operations require short-term intensive labor” could be moved after the next sentence in line 32. The location is also indicated in the attached pdf.

In line 57: The first sentence of the paragraph is not clear. It discusses a winning classification method (AlexNet) and then mentions applications of deep learning in vision tasks. In deep learning, classification is one of those tasks, but that does not seem to be the intended point of the sentence. I recommend rewriting it.

In line 69: The paragraph starts with "as for"; however, this phrase works better as the second part of a remark than at the start of a new sentence. I suggest rewriting this part or the full sentence.

In line 75: It is written, "meet the real-time requirements of the harvesting robot", but the text does not state what these requirements are.

In line 85: Is there really a need to improve fruit recognition (…) in practical work for current harvesting robots? This claim is strong for the reasons mentioned before; it may be better to change the word "need" (for example, to "important": …Therefore, improvements are important in fruit recognition…) or to reference this statement.

In line 111 (Materials and Methods): I suggest replacing “.” with “:”.

About the methodology: the general structure seems to follow this article: https://www.mdpi.com/2072-4292/14/17/4150. If it follows this article or another one, please add the reference in the text.

In line 115: It is recommended to add a reference for the LabelImg software.

In Figure 1 – first row, just correct “Date augmentation” to “Data augmentation”.

In Figure 2, improve the alignment of the images. If you prefer, you can organize them in 2 rows and 2 columns.

In line 144: just add "was" after the word "adopted" (The labeling format adopted was the PASCAL VOC format.)

In line 147: The first sentence does not seem related to image acquisitions. Please, check if the first sentence or the paragraph can be moved to Section 2.3.

In line 169: The acronym FPN is mentioned only once without its full name. Please spell out the abbreviation.

In lines 224 and 233: Where it is written “MaxPool”, please consider checking if the correct term is “MaxPooling”.

In line 229: The acronyms SiLU and ReLU are mentioned; however, their full names are not given.

In Figure 5, please indicate the meaning of “k”, “s” and “p” that appear in the boxes.

In line 240: is YOLOX here the original algorithm or another one mentioned before? I recommend indicating the name of the network.

In lines 246 and 247, you state that the approach also helps the model reach the convergence state faster. However, faster compared to which model? It would be interesting to compare models trained with the original loss function and with the loss proposed in the paper.

The paragraph at line 253 in Section 2.4 overlaps with Sub-section 2.4.1: how to obtain the coordinates is explained twice, in more detail in 2.4.1. I suggest moving the paragraph into 2.4.1 or reorganizing it so it reads better.

In Figure 7, the legend is on the next page and not with the images.

In line 287: The figure number does not appear.

In line 289: There is a four-word sentence, "The unit is Pixel". I suggest merging this information into the sentence at line 287, "The RGB image shown in figure…".

In line 293: Please describe or indicate what "m" is. In the previous lines you mention the image plane in mm, and now "m", so it could be confused with the unit metres or an axis named "m".

In Equation 9 (between lines 306 and 307), I think the prime in "h′" is a text formatting error. Please check this term.
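For reference, in the standard pinhole model (an assumption on my part; the manuscript's Equation 9 may differ), h′ would denote the image height on the imaging plane of an object of height h at depth Z, with focal length f:

$$\frac{h'}{h} = \frac{f}{Z}$$

If Equation 9 follows this form, the prime is meaningful and only its typesetting needs fixing.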

In line 322 (Experiments and Results): Did you perform any experimental tests to determine whether using the pretrained weights improves the results?

In Table 1 (between lines 323 and 324), in the last row, second column: PyCharm is not a compiler; it is actually an Integrated Development Environment (IDE). Please check this term.

In line 330: I suggest replacing the word “here” with “in this study” or similar.

In Equations 11-15, please check the use of the prime (′) in FP′, FN′, Recall′ and Latency′.
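For reference, the standard detection-metric definitions these equations presumably intend, written without the stray primes (an assumption based on the metric names; TP, FP, and FN are true positives, false positives, and false negatives):

$$\mathrm{Precision} = \frac{TP}{TP+FP},\qquad \mathrm{Recall} = \frac{TP}{TP+FN},\qquad F1 = \frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$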

In line 350: You mention that "The optimal weight obtained from training was used for testing on 479 test sets". According to the information on the previous pages, the paper has one test set containing 10% of the image dataset. Please check this information: do the authors mean 479 test images or 479 test sets?

In line 352: It is written that “the proposed method has a better recognition effect”. Was any analysis conducted to support this statement?

In the paragraph at line 356 and Table 2, the evaluated models are shown. YOLOv8 has already been released and has achieved better results than other YOLO models. Out of curiosity, have the authors performed an experiment comparing the proposed method with YOLOv8?

In line 368: The sentence "(…) computational complexity (…) these factors did not directly determine the model's inference efficiency" seems to contradict the information in the paragraph at line 90. I did not understand whether the authors draw this conclusion from their own evaluation of the issue mentioned there; the statement at line 90 is different, while here it is concluded that these factors do not impact the model's inference efficiency. Please verify this information.

In Table 3, there are 15 sample error test results. The dataset comprises 4785 images and the test set 479 images, so these samples represent about 3% of the test set, which seems a small sample for this error comparison. Please check whether it would be better to add a last row with the mean error and ± standard deviation.
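A minimal sketch of the suggested summary row, computed with NumPy; the error values below are placeholders, not data from the manuscript's Table 3.

```python
import numpy as np

# Placeholder per-sample X-direction errors in mm (substitute Table 3 values).
errors_x = np.array([3.1, 4.2, 2.8, 5.0, 3.7])

mean, std = errors_x.mean(), errors_x.std(ddof=1)  # ddof=1: sample std
print(f"X error: {mean:.2f} ± {std:.2f} mm")
```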

In lines 405 and 406: It is claimed that "the proposed model can detect and localize apples in real time under different light conditions in the morning and evening, and provide better picking points for most cases." However, I did not find details about the number of morning and evening images, or how they were distributed across the training, validation, and test sets, to support this claim. Also, in Figure 12, image C, which represents the evening condition, differs from the examples shown in Figures 2 and 10 for evening or weak light. Did the evening images all have this same quality, or are there different light conditions or specific times of day? Please verify this information.

Please check some expressions that appear multiple times in the manuscript or are repeated very close to each other, such as: "meet the real-time requirements of the harvesting robot"; "original" in lines 190 and 191; "different loss components" in lines 244 and 245; "the imaging plane", repeated 6 times in the paragraph at line 285.

Comments for author File: Comments.pdf

The quality of the writing in English is adequate. However, a spell checker is still recommended to improve some minor errors and the quality of the text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Congratulations on improving the manuscript. The rewritten sentences look good, as do the other corrections and the graph comparing the loss of the original YOLOX and the proposed model. The added comparison with YOLOv8 makes the proposed model an interesting novel contribution, and it keeps good results compared to the latest released YOLO model.

I have one last recommendation:

In lines 140-142, the use of public images to enlarge the dataset is mentioned. It is important and recommended to include a reference to where they were taken from, to avoid copyright issues.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
