Peer-Review Record

YOLO-Chili: An Efficient Lightweight Network Model for Localization of Pepper Picking in Complex Environments

Appl. Sci. 2024, 14(13), 5524; https://doi.org/10.3390/app14135524
by Hailin Chen 1, Ruofan Zhang 1, Jialiang Peng 1, Hao Peng 1, Wenwu Hu 2, Yi Wang 1,* and Ping Jiang 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 13 April 2024 / Revised: 7 June 2024 / Accepted: 22 June 2024 / Published: 25 June 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

The paper addresses an important and interesting research issue concerning the application of up-to-date algorithms to facilitate the intelligent harvesting of crops that are difficult to handle.

 

Here are some comments regarding the paper:

1. Please read the paper carefully, as there are numerous typos, particularly concerning punctuation, capitalization, and lowercase usage, both in the body of the paper and in the captions of figures and tables, as well as in the font of equations.

2. Please review the figures in terms of resolution and attention to detail (e.g., Figure 8 is unreadable).

3. Please provide explanations for acronyms whenever they are used for the first time, even if they may seem obvious.

4. Regarding Line 119, what is meant by "journals"?

5. The effective and actual application is not entirely clear (some insight can be gleaned from the last part of the conclusions). How is this localization of pepper utilized? Is there any mention of stereovision for picking with an existing harvester, or is it primarily important for assessing the potential harvest quantity?

Comments on the Quality of English Language

Dear Authors,

while the English is fluent, the numerous errors in punctuation, inconsistent uppercase/lowercase usage, and some uncentered terms are somewhat distracting.

Author Response

Dear reviewers:

Thank you for taking the time to review this paper and for your comments on how to improve it. In response to your suggestions, I have made further revisions to the paper, including improving the quality of the images, correcting the capitalization, etc. The word "journals" is a mistake that was typed into the body of the paper by accident. We have picked chili peppers in two provinces of China, Hunan and Xinjiang, and we use different chili pepper picking equipment for different chili pepper cultivation situations. For example, in Xinjiang we use large pepper pickers with LiDAR and surveillance cameras to capture pepper images and point cloud data. At the same time, we are preparing to use the model in a pepper picking robotic arm to automate pepper picking on a small scale, but this part of the experiment was not completed, so it was not included in the paper.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript "yolo-chili: an efficient lightweight network model for localization of pepper picking in complex environments" presents an innovative approach aimed at improving pepper detection in complex agricultural settings using a lightweight neural network model. The authors employ adaptations like a three-channel attention mechanism and quantized pruning to enhance model efficiency and performance on mobile devices. Below are comments/suggestions that could enhance the quality of the manuscript:

1) The use of a specific dataset for training and evaluation is mentioned, but there is a need for more detailed analysis on the dataset's diversity and representativeness. The authors should consider expanding the discussion on how well this dataset represents the variety of real-world conditions under which the model is expected to perform.

2) While comparisons are made with other models, including YOLOv5 and Faster-RCNN, the methodology for these comparisons lacks detail. It would be beneficial to include a more thorough evaluation framework, detailing how each model was configured and the exact conditions under which the comparisons were made.

3) The introduction of novel components such as the three-channel attention mechanism and the hierarchical feature fusion network is a highlight of this work. However, a more granular breakdown of how each component specifically contributes to improvements in detection accuracy and model efficiency would provide deeper insights into their effectiveness.

4) The manuscript briefly addresses environmental factors affecting detection performance but could benefit from a quantitative analysis of model robustness across different environmental conditions. This would strengthen the argument for the model's utility in real-world agricultural applications.

5) To aid in the reproducibility of the results, the authors should consider providing access to the codebase and more detailed training procedures. This includes hyperparameters, model architecture specifics, and training epochs, ensuring that other researchers can replicate and build upon the work.

6) The improvements noted in the manuscript are promising, but the discussion on their statistical significance is missing. Incorporating statistical tests to confirm the significance of the observed improvements would substantiate the claims made.

7) While the manuscript outlines several advancements, a discussion on potential limitations or challenges that remain unaddressed would provide a balanced view. Additionally, outlining potential areas for future research based on the current work's findings could guide subsequent efforts in this field.

8) The figures provided in the manuscript, while illustrative, require improvement in quality. The resolution appears low, making it difficult to discern details, which is crucial for fully understanding the model's performance and the data presented. Additionally, the labels and legends within the figures need to be clearer and more consistent.

9) The current formatting of the manuscript does not adhere to the journal's submission guidelines. This includes inconsistencies in citation styles, heading formats, and overall layout that do not meet the prescribed standards of the journal.

Comments on the Quality of English Language

The manuscript would benefit significantly from a thorough grammatical revision. There are several instances where awkward phrasing and grammatical errors detract from the readability and professional quality of the text.

Author Response

Dear Reviewer:

  1. Analyzed the diversity and representativeness of the dataset in detail in Section 2.1.
  2. Added a detailed description of the configuration of the model in Table 2.
  3. Provided a detailed description of how well each model component works using the ablation experiments in Section 3.3.
  4. Added Table 6 to analyze the robustness of the model under different environments.
  5. Uploaded the code base to Kaggle.
  6. I may not be familiar with the concept of statistical tests, so I used Figs. 11 and 12 for statistical tests.
  7. Gave a brief description of potential future research areas in the Conclusions.
  8. Redrew the images to improve their resolution.
  9. Improved the citation format, heading format, and general layout.

 

Kind regards,

Hailin Chen

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This is on the application of state-of-the-art machine vision to an agricultural task – pepper picking. I truly appreciate such works and prefer them to the usual performance race on the standard machine-vision benchmarks. However, there is no robotics in it. That is a downside. Have the authors ever picked peppers themselves? Do they have any idea of the operation mode of a picking robot, and what kind of interface it would require for the vision module? The paper reads as if there were no contact with robotics engineers. Result examples like those presented in Figure 11 still show rectangles oriented in the vertical and horizontal directions as the output format – as dictated by the YOLO architecture. Though my knowledge of agricultural machines is very limited, I would still guess that oriented rectangles may well be better suited for a picking-robot interface, since peppers are elongated objects. Also, the robot would need a distance estimate for proper 3D localization. That would mean the work is not even halfway done. This should be discussed somewhere.
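The reviewer's point about oriented rectangles can be made concrete with a minimal sketch: given the pixel coordinates of a detected object, a PCA fit yields an oriented box (center, extents along the principal axes, and angle) that hugs an elongated pepper much more tightly than an axis-aligned YOLO box. This is a hypothetical illustration under the reviewer's suggestion, not part of the authors' pipeline:

```python
import numpy as np

def oriented_box(points):
    """Fit an oriented rectangle to 2-D object points via PCA.

    Returns (center, size, angle_deg): the box center, its extents along
    the principal axes (minor first, major second), and the major-axis
    angle in degrees. Illustrative only, not the paper's method.
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    centered = pts - center
    # Principal axes from the covariance of the point set (eigh returns
    # eigenvalues in ascending order, so the last column is the major axis).
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    rotated = centered @ vecs            # coordinates in the principal frame
    size = rotated.max(axis=0) - rotated.min(axis=0)
    major = vecs[:, -1]                  # axis of largest variance
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return center, size, angle

# An elongated, diagonally hanging object: the fitted box is tilted ~45 deg,
# where an axis-aligned box would waste most of its area.
pts = [(t, t + 0.05 * (-1) ** i) for i, t in enumerate(range(11))]
center, size, angle = oriented_box(pts)
```

A picking interface would additionally need the depth per box, e.g. from the LiDAR point cloud the authors mention in their response.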

As a machine-vision engineer, I appreciate that the paper aims at reducing the number of parameters to be tuned by machine learning. I learn around line 362 that the number has been reduced from 18.7 million to 9.7 million without compromising the recognition performance too much. With a side view on the Vapnik-Chervonenkis theory of pattern recognition, I still think we should try to further reduce that number in order to construct low-risk, robust seeing machines. But this is a step in the right direction. I emphasize that processing speed is not the main goal of the reduction: we desperately need to lower the dimension of the parameter space in which we optimize our loss for theoretical mathematical reasons. The number of parameters must be in a healthy proportion to the number of labeled training samples. The number of training samples has been raised by the usual tricks: adding noise, rotation, scaling, and brightness changes - why not mirror flipping? - and in my view this is only partially valid. Rotation, for instance, is not justified: peppers will always be hanging, never standing upright or horizontal. Current AI generally operates in too-high-dimensional spaces and most often on too-small datasets. That is dangerous. This should also be discussed.
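The augmentation choices the reviewer questions can be sketched in a few lines. The function below is illustrative only (the paper's actual pipeline is not given in this record): it keeps noise and brightness changes, includes the horizontal mirror flip the reviewer asks about (which, unlike rotation, preserves the hanging orientation of peppers), and deliberately omits rotation:

```python
import numpy as np

def augment(image, rng):
    """Apply simple label-preserving augmentations to an HxWx3 uint8 image.

    Horizontal flipping is physically valid for hanging peppers; rotation
    is omitted, following the reviewer's objection. Illustrative sketch,
    not the authors' actual pipeline.
    """
    out = image.astype(float)
    if rng.random() < 0.5:                       # horizontal mirror flip
        out = out[:, ::-1, :]
    gain = rng.uniform(0.8, 1.2)                 # brightness change
    out = np.clip(out * gain, 0, 255)
    noise = rng.normal(0.0, 5.0, out.shape)      # additive Gaussian noise
    out = np.clip(out + noise, 0, 255)
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
aug = augment(img, rng)
```

Bounding-box labels would have to be mirrored along with the image; that bookkeeping is left out here for brevity.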

The corresponding Table 4 shows a certain lack of diligence: what does "Size/MB" mean? I reckon this is just the number of M-params; and suddenly we have 0.97 where I would expect %, so probably 97. Table 3 has a different wording, dividing the number of parameters by the magical number 10^6 M. You really just mean million, but you write 10^6 Million. There are many such sloppy and awkward things in this manuscript that must be fixed before publishing. The caption of Figure 10 should read "comparison of learning convergence"; it has nothing to do with "contrast". Ablation studies are very popular these days, and they might give an idea of which part of the architecture contributes how much to the performance, but Table 2 is hardly comprehensible in its current format. Formulae (4)-(7) merely repeat the usual performance definitions everybody in the field is familiar with. However, in (5), instead of TP, TN, FP, and FN, suddenly diseases and pests appear. This is very puzzling and complete nonsense. Figure 8 appears quite defocused, as if it had been cropped or scanned from another paper. Formulae (1)-(3) use different fonts and are very ugly. You cannot present such material in a decent journal publication. Formulae should contain symbols – not words or acronyms – and the symbols should be explained in the text.
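For reference, the conventional definitions that Eqs. (4)-(7) restate, and that Eq. (5) should follow using TP, FP, and FN rather than disease/pest terms, are standardly written as (a generic restatement, not copied from the manuscript):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\, dR, \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
```

Here $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives; $AP$ is the area under the precision-recall curve for one class, and $mAP$ averages it over the $N$ classes.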

So – as it is – this cannot be accepted. But it might be reparable.

Author Response

Dear reviewers, thank you for taking the time to review this paper and for your comments on how to improve it.
We have picked chili peppers in the Hunan and Xinjiang provinces of China, but I am not an expert in robotics and apologize for that. We used different pepper-picking equipment for different pepper-growing situations. For example, in Xinjiang we use a large pepper-picking machine with LiDAR and surveillance cameras to capture pepper images and point cloud data. At the same time, we are using the model in a pepper-picking robotic arm to automate the picking of peppers on a small scale.
Meanwhile, to address the problem of the AI model running on too small a dataset, our group is in the process of purchasing a server and expects to be able to conduct experiments on a larger dataset. I tried mirror flipping and GAN-based generative models, but the results did not seem to be very good, so I did not include them in the paper.
The √ in the ablation experiments in Table 2 indicates the use of the corresponding module. In response to your other questions, I have made further revisions to the paper. Thank you again for your willingness to take the time to bring your valuable suggestions to this paper.

Reviewer 4 Report

Comments and Suggestions for Authors

In this paper, the authors propose a YOLO-Chili target detection algorithm for the localization of pepper picking in complex environments.

Further, the experimental results show that YOLO-Chili can be fully adapted to the task of picking peppers in real situations and is lightweight enough to accomplish real-time detection in real production.

My suggestions regarding the improvement of the paper are as follows:

In Section 1, i.e., the Introduction, a short description of the paper's content is missing and needs to be added.

Line 136 – The text in Figure 2 is not readable and needs to be enlarged.

Line 170 – Figure 3 needs to be reduced in size.

Line 198 – The fonts of Eqs. (1), (2), (3), (4), (5), (6), and (7) need to be changed.

Line 198 – The terms of Eqs. (1), (2), (3), (4), (5), (6), and (7) need to be described in the text.

Line 199 – Each block in Figure 5 needs to be described by appropriate text.

Line 243 – The resolution of Figure 8 is low and needs to be improved.

There are some punctuation errors that need to be corrected.

The comparison between the YOLO-Chili model and the currently mainstream object detection models depicted in Figure 3 (Faster-RCNN, SSD, YOLOv7, YOLOv7-tiny, and YOLOv5) shows that the model accuracy reaches 93.66%, only a 0.45% decrease, while the model volume is reduced by half and the FPS is only 65, which demonstrates the effectiveness of YOLO-Chili.

My comments are as follows: In some contexts, an accuracy of 93.66% may be considered quite high and sufficient for practical use, such as pepper detection in complex orchards. However, in scenarios where precision is critical, a higher accuracy threshold may be necessary, i.e., equal to or higher than 95%.

The findings are adequate and contribute to extending target detection algorithms to chili pepper detection by using YOLOv5's backbone network, which facilitates porting to different devices.

The methodology and the comparison of the obtained results are clear.

The article is well written and composed, and it presents an interesting study.

Comments for author File: Comments.pdf

Author Response

Dear reviewers:

Thank you very much for taking the time to review this paper and for your acknowledgement and suggestions. I have revised the introduction and the images according to your suggestions and added explanations of the formulas. Meanwhile, to address the current lack of accuracy, we are also conducting further research, including collecting point cloud data, using 3D images, and trying different model structures.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript "yolo-chili: an efficient lightweight network model for localization of pepper picking in complex environments" has several critical issues that necessitate its rejection for publication. Despite revisions, the manuscript still suffers from numerous grammatical errors and awkward phrasing, significantly detracting from its readability and professional quality. Additionally, the manuscript lacks rigorous statistical analysis to support the claimed improvements. While some visual representations of the model's performance are provided, explicit statistical tests and confidence intervals are missing, leaving the reliability of the results in question. The comparison with other models remains superficial, lacking comprehensive evaluation and discussion of important metrics such as training time, inference speed, and resource consumption. The analysis of environmental variability is inadequate, as it fails to convincingly demonstrate the model's robustness under diverse conditions. Finally, there are persistent inconsistencies in citation styles, headings, and overall layout, indicating that the manuscript does not fully adhere to the journal's formatting guidelines. These significant issues must be addressed before the manuscript can be reconsidered for publication.

Comments on the Quality of English Language

The quality of the English language in the manuscript needs significant improvement. There are several grammatical errors, including lowercase letters at the beginning of citations, and words incorrectly joined without spacing. A thorough review by a native English speaker or a professional editor is recommended to correct these errors and enhance the clarity of the text.

Author Response

Dear reviewer
Thank you very much for your valuable comments on the paper. I have revised some of the points, including adding the test time, confidence levels, etc. Regarding statistical testing, I should explain: in deep-learning code, we usually divide the data into a training set, a test set, etc.; the model is trained for n rounds and yields n results, which generally improve progressively. We use PyTorch code that randomly divides the images in the training set into n batches for training, so these results already carry statistical meaning on their own. In addition, I trained the model five times to ensure the reliability of the results and reported the average of the best results from these five trainings.
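The five-run averaging described above could be reported with the confidence interval the reviewer asks for; a minimal sketch, using a normal approximation and hypothetical mAP values (placeholders, not the paper's numbers), would be:

```python
import statistics

def summarize_runs(scores):
    """Mean of repeated-run scores plus a rough 95% interval,
    using the normal approximation mean ± 1.96 * s / sqrt(n)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    half = 1.96 * statistics.stdev(scores) / n ** 0.5
    return mean, (mean - half, mean + half)

# Hypothetical mAP values from five training runs with different seeds.
runs = [0.9312, 0.9366, 0.9341, 0.9329, 0.9358]
mean, ci = summarize_runs(runs)
```

Reporting "mean ± interval" over seeds is a common lightweight substitute for formal hypothesis tests when comparing detection models.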
Kind regards, 
Hailin Chen

Reviewer 3 Report

Comments and Suggestions for Authors

The paper has improved in the revised version. Some of my points have been addressed, some not. In my view, the paper still lacks the diligence required for a decent journal publication. The formulae remain ugly. The reduction of parameter numbers is motivated in the new text by hardware constraints – mobile hardware on the picking robot. That was not my main concern. High numbers of trainable parameters are risky per se in pattern recognition. The worst-case bounds of VC theory speak for themselves. Reducing the number of parameters – while not compromising the performance too much – yields superior robustness and reduces the risk of catastrophic failure.

There is a whole paragraph from L421 to L427 that is repeated word for word in L428 to L434!

It is up to the editor: How many revisions may be given?

Author Response

Dear Reviewer:
Thank you very much for your valuable comments on the paper, and I apologize for the misunderstanding caused by the formatting issue. I have tried a large number of methods to further reduce the model parameters, but those methods seriously degrade model performance, so I did not include them in the paper. I am also combining point cloud data, depth images with position information, and common RGB data to try to further improve model performance. Thank you again for your valuable comments.

Kind regards, 
Hailin Chen

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for submitting the revised version of your manuscript. I acknowledge the improvements made in terms of grammatical accuracy, although some errors persist.

However, the suggestions regarding technical aspects, such as including rigorous statistical analyses and confidence intervals, have not been adequately addressed.

The conclusions drawn from the results have seen improvements.

In terms of formatting, further improvements are still needed to meet the journal's guidelines.

Comments on the Quality of English Language

The quality of English in the manuscript has improved compared to the previous version. However, there are still numerous grammatical errors that need to be addressed to enhance readability and professionalism. I recommend seeking assistance from a native English-speaking professional to thoroughly review and edit the manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

At least the duplicated paragraph has disappeared now. There are new hyphenation errors at the ends of lines from L100 to L115. These should be fixed in the editing process in case of acceptance.

Comments on the Quality of English Language

There are new hyphenation errors at the ends of lines from L100 to L115, and sometimes the space following a paragraph is missing. Minor typos.
