Article
Peer-Review Record

Beyond Trade-Off: An Optimized Binocular Stereo Vision Based Depth Estimation Algorithm for Designing Harvesting Robot in Orchards

Agriculture 2023, 13(6), 1117; https://doi.org/10.3390/agriculture13061117
by Li Zhang 1, Qun Hao 1,2,3, Yefei Mao 4, Jianbin Su 5 and Jie Cao 1,2,*
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 20 April 2023 / Revised: 12 May 2023 / Accepted: 19 May 2023 / Published: 25 May 2023
(This article belongs to the Special Issue Agricultural Automation in Smart Farming)

Round 1

Reviewer 1 Report

You have made significant improvements on the existing image matching algorithms, but there are still some issues:

(1) In Section 2, the authors hold the view that active ranging sensors have certain drawbacks and limitations, and that passive ranging sensors are superior to active ones. In fact, with structured-light devices, active vision sensors greatly improve the accuracy and efficiency of stereo image matching; 3D scanners, for example, can achieve the measurement accuracy required for machinery manufacturing. In addition, vision devices based on infrared stereo matching (such as the RealSense R200 and RealSense D435) also have good applications. See, for example, the paper titled “Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test”, published in Computers and Electronics in Agriculture in 2022.

(2) In lines 432-435, the manuscript states that the proposed algorithm was compared with the sum of squared differences (SSD), normalized cross-correlation (NCC), and adaptive support window (ASW) algorithms, with the corresponding quantitative results shown in Table 5. However, the results of the other three algorithms are not given in Table 5. Moreover, the data in Table 5 should include the calculation time. In addition, the depth accuracy obtained by the proposed matching algorithm should be compared with the results of an active vision system such as RealSense.

(3) The content illustrated by the multiple sets of images in Figure 12 should be explained in detail.

In general, the manuscript draft is clear and well-written. 

Author Response

Response to Reviewer 1

 

Our point-to-point response to the comments

We sincerely appreciate the editors' and reviewers' careful work and hope that the corrections will meet with approval. Revised parts are marked in yellow in the manuscript. Please feel free to contact us with any questions; we look forward to your consideration. The main corrections in the paper and the responses to the reviewer's comments are as follows:

Responses to the reviewer's comments:

 

Reviewer #1:

(1) In Section 2, the authors hold the view that active ranging sensors have certain drawbacks and limitations, and that passive ranging sensors are superior to active ones. In fact, with structured-light devices, active vision sensors greatly improve the accuracy and efficiency of stereo image matching; 3D scanners, for example, can achieve the measurement accuracy required for machinery manufacturing. In addition, vision devices based on infrared stereo matching (such as the RealSense R200 and RealSense D435) also have good applications. See, for example, the paper titled “Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test”, published in Computers and Electronics in Agriculture in 2022.

Response: Thank you very much for your positive and valuable comments. We are very sorry for our inappropriate statements about active ranging sensors. We carefully read the high-quality paper “Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test”, which gave us great inspiration, and we have added it as a reference.

Jin, Y.; Yu, C.; Yin, J.; Yang, S.X. Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test. Computers and Electronics in Agriculture 2022, 202, 107364.

Also, since our present research focuses on passive ranging sensors, we revised the content of the Introduction and Related Work sections on active ranging sensors. We will try to carry out research on active ranging sensors such as RealSense in future work, to further discuss which approach is more suitable for designing the visual system of a harvesting robot in wild orchards.

“In future work, we will focus on some active vision systems such as RealSense for further discussion on which way is more suitable for the design of the visual system for harvesting robots in wild orchards.”

(2) In lines 432-435, the manuscript states that the proposed algorithm was compared with the sum of squared differences (SSD), normalized cross-correlation (NCC), and adaptive support window (ASW) algorithms, with the corresponding quantitative results shown in Table 5. However, the results of the other three algorithms are not given in Table 5. Moreover, the data in Table 5 should include the calculation time. In addition, the depth accuracy obtained by the proposed matching algorithm should be compared with the results of an active vision system such as RealSense.

Response: Thank you very much for your positive and valuable comments. We have added a detailed description of the multiple sets of images.

“In order to more comprehensively verify the feasibility of the algorithm proposed in this paper, experiments were carried out on image samples of different depth ranges: short distance (S, 0.6~0.75 m), medium distance (M, 1~1.2 m), and far distance (F, 1.6~1.9 m). The corresponding quantitative results are shown in Table 5.”

“Under these three different distance conditions, we compared our proposed algorithm with the sum of squared differences (SSD) [33], normalized cross-correlation (NCC) [34], and adaptive support window (ASW) [35] algorithms. We visualized the results such that, as the depth goes from far to near, the color changes from cold to warm, as shown in Figure 12.”
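As a side note for readers who wish to reproduce this cold-to-warm rendering, the following is a minimal Python/OpenCV sketch, assuming holes are encoded as non-positive disparity values; it is not the authors' actual visualization code.

```python
import cv2
import numpy as np

def colorize_disparity(disp):
    """Render a disparity map with a cold-to-warm colormap.

    Large disparities (near points) map to warm colors and small
    disparities (far points) to cold colors, matching the convention
    described above. Non-positive values are treated as holes.
    """
    d = disp.astype(np.float32)
    valid = d > 0
    lo, hi = d[valid].min(), d[valid].max()
    norm = np.zeros(d.shape, dtype=np.uint8)
    norm[valid] = (255 * (d[valid] - lo) / max(hi - lo, 1e-6)).astype(np.uint8)
    color = cv2.applyColorMap(norm, cv2.COLORMAP_JET)  # blue (cold) -> red (warm)
    color[~valid] = 0  # paint holes black
    return color
```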

Regarding your valuable comment that the depth accuracy obtained by the proposed matching algorithm should be compared with the results of an active vision system such as RealSense:

In our present research, we focus on 3D depth estimation based on stereo cameras, trying to improve both accuracy and time cost. We will focus on active vision systems for depth estimation for harvesting robots in future work, and we have added this to Section 6 (future work) of the revised manuscript: “In future work, we will focus on some active vision systems such as RealSense for further discussion on which way is more suitable for the design of the visual system for harvesting robots in wild orchards.”

(3) Figure 12 should be explained in detail the content illustrated by the multiple sets of images.

Response: We are very grateful for your comments on the manuscript. We have added a detailed description of the multiple sets of images.

“Figure 12. Visualization of the compared disparity maps. From top to bottom are the rectified left image and the experimental results obtained by SSD, NCC, ASW, and our proposed Completion-BiPy-Disp method, respectively.”

 

Thanks to all reviewers for the thoughtful and thorough review. Hopefully we have addressed all of your concerns.

Author Response File: Author Response.docx

Reviewer 2 Report

This paper proposes a disparity completion algorithm based on binocular stereo vision. Bilateral filtering and pyramid fusion are added to the traditional method of obtaining disparity maps. The bilateral filtering algorithm completes the disparity map, which contains many holes, and the pyramid fusion model significantly reduces the time cost of the proposed method. Experiments at three different distances show that the proposed method can effectively complete disparity maps and estimate depth with less time consumption. Here are some concerns and suggestions for further improvement.

Main comments:

1. The explanation in lines 40-41 for selecting the binocular camera is not clear or specific.

2. The introduction and related work are somewhat repetitive in some respects. For example, the sentence in lines 161-164 is similar to the sentence in lines 30-32. I suggest that these two sections be appropriately combined and reduced.

3. The logical structure of Section 3.2.1 is not very well organized, and the principle of the SGM algorithm is not very clearly explained.

4. The description in lines 239-242 looks like a summary, which does not match the title of the section. I suggest that the description be removed.

5. It would be better if the condition in line 262 were introduced earlier.

6. Most of subsection 3.3.3 describes existing problems but does not introduce the proposed improvement in detail. I suggest that the process in lines 296-297 be described in more detail.

7. The title of Figure 3 is the same as the title of Figure 4; please check whether any changes are needed.

8. The title of Section 4.5 is the same as the title of Section 4.4; please check whether any changes are needed.

9. The statement in line 422 is not accurate; “the relative error” should be changed to “the average relative error”.

This paper has some grammatical problems. For example, the sentence in line 39 lacks a subject. Please check it carefully.

Author Response

Response to Reviewer 2

 

Our point-to-point response to the comments

We sincerely appreciate the editors' and reviewers' careful work and hope that the corrections will meet with approval. Revised parts are marked in yellow in the manuscript. Please feel free to contact us with any questions; we look forward to your consideration. The main corrections in the paper and the responses to the reviewer's comments are as follows:

Responses to the reviewer's comments:

 

Reviewer #2:

1. The explanation in lines 40-41 for selecting the binocular camera is not clear or specific.

Response: We are grateful to the reviewer for pointing this out. Following your advice, we have added the detailed information in the revised manuscript: “Therefore, we applied a baseline-variable stereo camera (LenaCV USB3.0), and the baseline was set to 60 mm.”

2. The introduction and related work are somewhat repetitive in some respects. For example, the sentence in lines 161-164 is similar to the sentence in lines 30-32. I suggest that these two sections be appropriately combined and reduced.

Response: Thank you very much for your positive and valuable comments. We have revised this part and removed the duplicated content.

3. The logical structure of Section 3.2.1 is not very well organized, and the principle of the SGM algorithm is not very clearly explained.

Response: Thank you very much for your positive and valuable comments. Following your advice, we have reorganized Section 3.2.1 and explained the SGM algorithm in detail: “……based on the idea of pixelwise matching of mutual information and approximating a global 2D smoothness constraint by combining many 1D constraints. The SGM method calculates the matching cost hierarchically using mutual information. To optimize pathwise from all directions through the image, it uses an approximation of a global energy function known as cost aggregation. Then, disparity computation is performed by winner-takes-all and supported by disparity refinements such as consistency checking and sub-pixel interpolation. So, given……”
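For illustration only, the SGM pipeline quoted above corresponds closely to OpenCV's semi-global block matcher; the sketch below shows one plausible configuration. All parameter values are assumptions for the sketch, not the settings used in the paper.

```python
import cv2

# Illustrative settings only; the authors' actual parameters are not given here.
block_size = 5
sgm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,              # must be divisible by 16
    blockSize=block_size,
    P1=8 * block_size ** 2,          # penalty for disparity changes of +/-1
    P2=32 * block_size ** 2,         # larger penalty enforcing smoothness
    disp12MaxDiff=1,                 # left-right consistency tolerance
    uniquenessRatio=10,
    mode=cv2.STEREO_SGBM_MODE_SGBM,
)

left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)
# compute() returns fixed-point disparities scaled by 16.
disp = sgm.compute(left, right).astype("float32") / 16.0
```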

4. The description in lines 239-242 looks like a summary, which does not match the title of the section. I suggest that the description be removed.

Response: Thank you for your comments. We have removed lines 239-242.

5. It would be better if the condition in line 262 were introduced earlier.

Response: Thank you very much for your kind comments. Following your advice, we have revised it in the manuscript: “where x is the pixel coordinate, σr and σs adjust the intensity and spatial similarity, respectively, and, following common recommendations, the values of σr and σs are set to 10 and 30, respectively.”
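A rough sketch of how bilateral filtering with these sigma values can be used to fill holes in a disparity map is given below; the paper's completion step may use a joint or weighted formulation, so treat this as an approximation under that caveat.

```python
import cv2
import numpy as np

def fill_holes_bilateral(disp, sigma_color=10, sigma_space=30):
    """Fill disparity holes (non-positive pixels) with bilateral-filtered values.

    sigma_color / sigma_space follow the values quoted in the response;
    with a diameter of -1, OpenCV derives the window size from sigma_space.
    """
    d = disp.astype(np.float32)
    smoothed = cv2.bilateralFilter(d, -1, sigma_color, sigma_space)
    holes = d <= 0
    d[holes] = smoothed[holes]  # replace only the hole pixels
    return d
```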

6. Most of subsection 3.3.3 describes existing problems but does not introduce the proposed improvement in detail. I suggest that the process in lines 296-297 be described in more detail.

Response: Thank you very much for your positive and valuable comments. Following your advice, we have described it in more detail in the manuscript: “……Specifically, the disparity maps at different resolutions were first processed by bilateral filtering and up-sampling. Then, the outputs were fed into a multi-scale pyramid model, from which results at six scales (1/32, 1/16, 1/8, 1/4, 1/2, and 1) were obtained. Finally, the multi-resolution results were fused from low to high resolution in a certain proportion, each added to the result at the next scale up. ……”
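The following is a minimal sketch of this coarse-to-fine fusion, assuming a simple fixed blending weight per level; the scales match the six listed above, but the actual fusion proportions in the paper are not reproduced.

```python
import cv2
import numpy as np

def pyramid_fuse(disp, levels=6, alpha=0.5):
    """Fuse disparity maps from 1/32 scale up to full resolution.

    alpha is an assumed blending weight between the up-sampled coarse
    result and the finer-scale map at each level.
    """
    d = disp.astype(np.float32)
    pyr = [d]
    for _ in range(levels - 1):          # build scales 1, 1/2, ..., 1/32
        pyr.append(cv2.pyrDown(pyr[-1]))
    fused = pyr[-1]                      # start at the coarsest (1/32) level
    for finer in reversed(pyr[:-1]):     # fuse from low to high resolution
        up = cv2.pyrUp(fused, dstsize=(finer.shape[1], finer.shape[0]))
        fused = alpha * up + (1 - alpha) * finer
    return fused
```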

7. The title of Figure 3 is the same as the title of Figure 4; please check whether any changes are needed.

Response: We are grateful to the reviewer for pointing this out; we are very sorry for this mistake. We have revised the caption of Figure 4 to “Prototype equipment and experimental environment”.

8. The title of Section 4.5 is the same as the title of Section 4.4; please check whether any changes are needed.

Response: We are very sorry for this mistake, and we have revised it in the manuscript. The title of Section 4.4 is “The qualitative completion results”, and the title of Section 4.5 was revised to “The quantitative results analysis”.

9. The statement in line 422 is not accurate; “the relative error” should be changed to “the average relative error”.

Response: We are very sorry for this mistake, and we have revised it in the manuscript: “The average absolute error of the depth value is 7.2375 mm, and the average relative error is no more than 1.2%.”
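For completeness, the two reported metrics can be computed as in the small sketch below; array names are illustrative, not taken from the paper.

```python
import numpy as np

def depth_errors(estimated_mm, ground_truth_mm):
    """Return (average absolute error in mm, average relative error)."""
    est = np.asarray(estimated_mm, dtype=float)
    gt = np.asarray(ground_truth_mm, dtype=float)
    abs_err = np.abs(est - gt)
    return abs_err.mean(), (abs_err / gt).mean()
```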

Thanks to all reviewers for the thoughtful and thorough review. Hopefully we have addressed all of your concerns.

 

Author Response File: Author Response.docx

Reviewer 3 Report

I would like to congratulate you on the preparation of this article. Overall, I am pleased to report that I found the article satisfactory. The paper is well structured and details all the stages and sub-stages of the study. However, there are a few remarks I would like to make about the paper:

- Regarding the formatting:

o In line 26 replace "ability[2]" by "ability [2]" with space. Do the same for line 47.

o In line 30 replace "[4][5][6][7]" with "[4-7]". According to the authors' guide, a hyphen should be used if the citations are consecutive. Do the same for lines 38, 99, 146, 149.

o In line 30 replace "[3][13][14]" with "[3,13,14]". According to the authors' guide this is the correct format if the citations are not consecutive. Do the same for lines 140, 141, 142, 272.

o In line 101 replace "[27]studied" with "Gongal et al. [27] studied". Do the same for line 125, 128, 130, 131, 132.

o In line 103 replace "Gené-Mola et al published" with "Gené-Mola et al. [28] published".

o In line 198 replace "for a point p" with "for a point p". The "p" should be in italics, to be consistent with the formatting of the rest. Do the same for lines 219, 232, 262, 268.

o In line 199 it says "as equation:", it would be more correct to put "as Equation (1):" or "as equation (1):" as it is written in line 230. Same for the rest (lines 202, 204, 207, 215, 218, 219, 254, 257, 274).

o Rewrite sentences with punctuation errors in line 269, 294, 315.

o Put a space between the digit and its unit (line 310, 419, 422, 431, 465, 466).

o Replace in line 311 "in Figure 5" by "in Figure 5.", adding a full stop at the end of the paragraph. Same for lines 320, 329, 406, 414, 443.

o When in a figure there are subfigures within it, either left or right or top and bottom, etc., it does not follow the guidelines defined by the journal's author guidelines. Revise and adapt the format. In addition, the reference in the text to these subfigures is not appropriate.

o In Table 4, in the first column "ID", put a space between the text and the parenthesis. In addition, start the first letter in capital letters. Same for the heading of Table 5.

o Try to ensure that all units follow the international system of measurement, using metres (m) or millimetres (mm) if the measurement is very small (line 419, 431, etc.).

o When defining a range of a variable, use a hyphen between the minimum and maximum values instead of a space. For example, "1-1.2 m" instead of "1 1.2 m" (line 431, etc.).

- Regarding the content:

o I think you should review the citations in the introduction and related works and check if all citations are necessary.

o There are some ideas about the comparison between different methods such as monocular, binocular and multi-camera system. This idea is stated at the beginning of the introduction and repeated in the related works. It would be good to condense this information and fit it into the section that you think is most appropriate.

o I would have liked to find a more in-depth discussion of the results obtained and their comparison with the work of other authors.

o I would like to know if you have tried to install the perception systems close to the end-effector or used a hybrid between the two, as other authors have done this process to minimise coordination errors between the perception system and the actuator.

Thank you very much for your contribution.
Best regards.

Author Response

Response to Reviewer 3

 

Our point-to-point response to the comments

We sincerely appreciate the editors' and reviewers' careful work and hope that the corrections will meet with approval. Revised parts are marked in yellow in the manuscript. Please feel free to contact us with any questions; we look forward to your consideration. The main corrections in the paper and the responses to the reviewer's comments are as follows:

Responses to the reviewer's comments:

 

Reviewer #3:

  1. In line 26 replace "ability[2]" by "ability [2]" with space. Do the same for line 47.

Response: We are very sorry for this mistake, and we have revised it in the manuscript. “……Exploiting an RGB camera to achieve passive ranging has the advantages of high resolution and strong resistance to lighting interference [3].”

2. In line 30 replace "[4][5][6][7]" with "[4-7]". According to the authors' guide, a hyphen should be used if the citations are consecutive. Do the same for lines 38, 99, 146, 149.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

 “……and also avoid the small field of view and difficult matching issues of stereo matching [5-8].”

“In addition, matching multiple captured images usually incurs a high time cost, which makes it difficult to apply to harvesting robots in practical applications [9-13].”

“The theory of depth calculation using a multi-camera system is similar to that of a binocular camera: images captured from different cameras are generally used to calculate the 3D position of detected objects [9-13].”

“In summary, the use of passive sensors for depth computing has the advantages of strong environmental adaptability, flexible implementation, and low cost, so they are being used in more and more practical fields [25-27].”

3. In line 30 replace "[3][13][14]" with "[3,13,14]". According to the authors' guide, this is the correct format if the citations are not consecutive. Do the same for lines 140, 141, 142, 272.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

“Although binocular stereo vision has many great advantages, there are many challenges that have to be faced when designing an applicable depth estimation algorithm for harvesting robots in orchard environments [4,14,15].”

 “In recent years, more and more research works have adopted binocular stereo vision-based methods to realize 3D position estimation for fruits [15,18].”

“Furthermore, binocular cameras were used as image acquisition equipment for many agricultural robots, such as for strawberries [19,20], tomatoes [21,22], cucumbers [23], and oranges [24].”

“The pyramid model is widely used as a very efficient algorithm [31,32].”

 

4. In line 101 replace "[27]studied" with "Gongal et al. [27] studied". Do the same for lines 125, 128, 130, 131, 132.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

“Roy and Isler [5] first proposed a motion estimation method to realize the registration and reconstruction of apple orchards, extract the number and diameter information of apples collected by a monocular camera, and realize the estimation of apple yield.”

“Liu et al. [6] obtained continuous image data through a monocular camera and used the FCN model to segment fruits from background regions; they then calculated the 3D position and size of the fruit by combining the segmentation results with a motion estimation algorithm.”

“Hani et al. [7] used motion estimation on consecutive images obtained by a monocular camera to calculate the depth information of the fruit.”

“Roy et al. [8] proposed a global feature-constrained solution for predicting depth values for the 3D reconstruction of orchard tree rows.”

5. In line 103 replace "Gené-Mola et al published" with "Gené-Mola et al. [28] published".

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

6. In line 198 replace "for a point p" with "for a point p". The "p" should be in italics, to be consistent with the formatting of the rest. Do the same for lines 219, 232, 262, 268.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

7. In line 199 it says "as equation:", it would be more correct to put "as Equation (1):" or "as equation (1):" as it is written in line 230. Same for the rest (lines 202, 204, 207, 215, 218, 219, 254, 257, 274).

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

 

8. Rewrite the sentences with punctuation errors in lines 269, 294, 315.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

“In fact, highly precise location estimation and low time cost are always required at the same time in practical applications.”

“Therefore, we applied bilateral filtering and up-sampling (US) algorithms to the disparity images of different resolutions, and the obtained multi-scale disparity maps were fused to achieve highly accurate and dense final outputs.”

“Specifically, fx and fy are the focal lengths expressed in pixels, and cx and cy describe the coordinates of the image center.”
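These intrinsics enter the standard rectified-stereo relation Z = fx · B / d. A small sketch follows, with the 60 mm baseline taken from the response to Reviewer 2 and fx as a placeholder for the calibrated value.

```python
import numpy as np

def disparity_to_depth(disp_px, fx_px, baseline_m=0.06):
    """Convert a disparity map (pixels) to depth (metres) for a rectified pair."""
    d = np.asarray(disp_px, dtype=np.float32)
    depth = np.zeros_like(d)
    valid = d > 0                         # zero disparity = invalid / at infinity
    depth[valid] = fx_px * baseline_m / d[valid]
    return depth
```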

9. Put a space between the digit and its unit (lines 310, 419, 422, 431, 465, 466).

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

“Our experiments used a zoom binocular camera as the image acquisition device and a black-and-white checkerboard as the calibration target; the side length of each square is 20 mm.”

“From the results shown in Table 4, in the depth range 0.6~0.9 m, the average vertical of the left and right……”

“The average absolute error of the depth value is 7.2375 mm ……”

“In order to more comprehensively verify the feasibility of the algorithm proposed in this paper, experiments were carried out on image samples of different depth ranges, short distance (S, 0.6~0.75 m), medium distance (M, 1~1.2 m) and far distance (F, 1.6~1.9 m).”

“Finally, the qualitative and quantitative experiments were carried out on three different ranges of distance, such as S (0.6~0.75 m), M (1~1.2 m), and F (1.6~1.9 m). And the experimental results showed the average absolute error of our proposed method is 3.2 mm,……”

10. Replace in line 311 "in Figure 5" with "in Figure 5.", adding a full stop at the end of the paragraph. Same for lines 320, 329, 406, 414, 443.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

11. When in a figure there are subfigures within it, either left or right or top and bottom, etc., it does not follow the guidelines defined by the journal's author guidelines. Revise and adapt the format. In addition, the reference in the text to these subfigures is not appropriate.

Response: We are grateful to the reviewer for reminding us of this point, and we have revised it in the manuscript.

“Figure 7. The acquisition of the initial disparity map. (1) represents the disparity map achieved by SGM. (2) is the disparity map after left-right consistency check processing. (3) presents the result after removing small connected regions. (4) is the final disparity map obtained by median filtering.”

“Figure 10. Experimental results of pyramid-based disparity fusion. (A) and (B) are the Gaussian sampling results at the original resolutions of  and , respectively. (C) shows the fusion results of (A) and (B) at different scales.”
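To make the four steps of Figure 7 concrete, here is a hedged sketch of that post-processing chain (left-right consistency check, small-region removal, median filtering); the tolerance and area thresholds are illustrative assumptions, not the paper's values.

```python
import cv2
import numpy as np

def postprocess_disparity(disp_l, disp_r, lr_tol=1.0, min_region=100):
    h, w = disp_l.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Step (2): left-right consistency check — compare d_L(x) with d_R(x - d_L(x)).
    xr = np.clip((cols - disp_l).astype(int), 0, w - 1)
    lr_diff = np.abs(disp_l - disp_r[rows, xr])
    disp = np.where(lr_diff <= lr_tol, disp_l, 0).astype(np.float32)
    # Step (3): remove small connected regions of valid pixels.
    mask = (disp > 0).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < min_region:
            disp[labels == i] = 0
    # Step (4): median filtering for the final map.
    return cv2.medianBlur(disp, 5)
```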

12. In Table 4, in the first column "ID", put a space between the text and the parenthesis. In addition, start the first letter in capital letters. Same for the heading of Table 5.

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

13. Try to ensure that all units follow the international system of measurement, using metres (m) or millimetres (mm) if the measurement is very small (lines 419, 431, etc.).

Response: We are grateful to the reviewer for reminding us of this point, and we have revised it in the manuscript.

“From the results shown in Table 4, in the depth range 0.6~0.9 m, the average vertical of the left and right cameras is 4.5 pixels, which proves that the calibration parameters of the binocular camera are accurate and satisfy the binocular vision system under ideal conditions.”

“In order to more comprehensively verify the feasibility of the algorithm proposed in this paper, experiments were carried out on image samples of different depth ranges, short distance (S, 0.6~0.75 m), medium distance (M, 1~1.2 m) and far distance (F, 1.6~1.9 m).”

14. When defining a range of a variable, use a hyphen between the minimum and maximum values instead of a space. For example, "1-1.2 m" instead of "1 1.2 m" (line 431, etc.).

Response: We are very sorry for this mistake, and we have revised it in the manuscript.

“In order to more comprehensively verify the feasibility of the algorithm proposed in this paper, experiments were carried out on image samples of different depth ranges, short distance (S, 0.6~0.75 m), medium distance (M, 1~1.2 m) and far distance (F, 1.6~1.9 m).”

15. I think you should review the citations in the introduction and related works and check whether all citations are necessary.

Response: We are very grateful for your comments on the manuscript. We have checked all the references and formatted them strictly according to the Guide for Authors.

16. There are some ideas about the comparison between different methods such as monocular, binocular, and multi-camera systems. This idea is stated at the beginning of the introduction and repeated in the related works. It would be good to condense this information and fit it into the section that you think is most appropriate.

Response: We are grateful to the reviewer for reminding us of this point. We have checked all citations and reorganized the Related Work section.

17. I would have liked to find a more in-depth discussion of the results obtained and their comparison with the work of other authors.

Response: Thank you very much for your positive and valuable comments. According to your advice, we revised the manuscript.

“It can be seen from the results that there are many incorrect points in the disparity results obtained by the SSD, NCC, and ASW matching algorithms, which may be caused by the lack of further post-processing steps, while the method proposed in this paper can obtain smooth, dense, and continuous results, reflecting the feasibility of the proposed algorithm. Overall, the proposed model achieves excellent qualitative and quantitative results at different distances, which verifies the potential of the proposed algorithm for depth estimation in automated picking equipment.”

 

18. I would like to know if you have tried to install the perception systems close to the end-effector or used a hybrid between the two, as other authors have done this process to minimise coordination errors between the perception system and the actuator.

Response: We are grateful to the reviewer for reminding us of this point. In fact, we have tried installing the perception system close to the end-effector. However, considering that the manipulator may be affected by branches and leaves during operation, we installed the perception system far from the end-effector. We fully agree with your suggestion to install the perception system close to the end-effector or to use a hybrid of the two. Although we are limited by the current robot hardware configuration and have not yet carried out further experiments on this part, we will try to improve the configuration of the harvesting robot and test this in future work.

“In future work, we will try to install the perception systems close to the end-effector or use a hybrid between the two, to minimize coordination errors between the perception system and the actuator. Also, we will focus on some active vision systems such as RealSense for further discussion on which way is more suitable for the design of the visual system for harvesting robots in wild orchards.”

 

Thanks to all reviewers for the thoughtful and thorough review. Hopefully we have addressed all of your concerns.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Following the reviewers' feedback, you have made good modifications and improvements to the article, improving the quality of the paper.

Reviewer 2 Report

The article has been revised according to the review comments.

Minor editing of English language is required.
