Mask R-CNN Refitting Strategy for Plant Counting and Sizing in UAV Imagery
Round 1
Reviewer 1 Report
In this paper, the authors developed an algorithm to segment and detect potato and lettuce plants from UAV digital images. Good job with the introduction – clearly stating the importance of remote sensing and artificial intelligence. I appreciate the efforts on contributing a smart approach to agriculture. I feel the paper length has to be shortened and talk more about the research part instead of explaining the method. The current state of paper sounds like the Mask R-CNN was developed by the authors, that’s why they are providing such a detailed explanation. Please remove such explanation portions and discuss more about the research and analysis. I would recommend restructuring the paper and encourage the authors to arrange the sections like a research paper.
Comments:
- I am confused if the paper deals with plant counting or just plant detection? In some places, it is mentioned that plants are counted, while the results were about detecting and comparison with the traditional computer vision approach. Please make this clear.
- If the paper is about counting plants, that keyword can also go in the title. You will get better hits if you have “plant counting”.
- Line 64: “approach, before Section 5 presents the obtained results”. Make section 5 as a separate sentence.
- Line 223: “Firstly, since typically” Any one word is sufficient.
- Line 226: Please change from “remote sensing imaging” to “remotely sensed images”.
- Did you use a stitched image or analyzed individual images from the UAV image?
- Section 4.1: More information on the dataset should be mentioned. Image resolution, UAV flight height, overall stitched image size, camera specifications, etc.
- Lines 188-332: I appreciate the authors in providing a detailed explanation of the Mask R-CNN methodology, but I feel this is not required for a research paper. A brief explanation is sufficient or just referring to some key literature will help readers. This loses focus on the paper.
- Line 367: The topic changes here. Not dealing with dataset anymore. So, please split this into a subsection.
- Line 382: Image resolution is mentioned here – Please move this up. Indicate this just after the image acquisition in line 361.
- Line 410: Split percentage is good. Please also indicate the number of samples here to avoid confusion. Because you are mentioning augmentation as well. Line 361 shows 340 images; line 386 shows 150 images, and augmentation was also performed to increase the sample size. Please make this clear.
- Fig 5 caption – where are the blue disks – This is not clear. Red crosses are seen.
- Fig 5a and 5b – Why is the ground truth board classified as a plant – red border?
- In the methods, the authors have mentioned about collecting images at 6 dates. How are you using it for analysis? Are you comparing the accuracy on different dates? Please explain.
- Which growth stage of potato was best identified with your method?
- How did you deal with the overlapping plants?
- Figure 5, the soil color is different in the top and bottom row. Are those captured on different dates? Locations? Or cameras?
- A lot of abbreviations in the paper makes it very difficult to follow. Use only necessary terms for describing and discussing results.
- Discussions should have more literature cited. Only three literature were cited there, that too for the methods. Please compare this with previous literature on plant detection.
Author Response
- In this paper, the authors developed an algorithm to segment and detect potato and lettuce plants from UAV digital images. Good job with the introduction – clearly stating the importance of remote sensing and artificial intelligence. I appreciate the efforts on contributing a smart approach to agriculture. I feel the paper length has to be shortened and talk more about the research part instead of explaining the method. The current state of paper sounds like the Mask R-CNN was developed by the authors, that’s why they are providing such a detailed explanation. Please remove such explanation portions and discuss more about the research and analysis. I would recommend restructuring the paper and encourage the authors to arrange the sections like a research paper. We would like to thank the reviewer for the positive comments and constructive remarks. Following the suggestions, we have restructured the paper as a research paper by moving some information from Related Works to the Discussion section and from the Results to the Methodology section, and by adding a distinct Conclusion section. Corrections throughout the paper have been added to clarify that we did not create the Mask R-CNN model but explain how to adapt it to a new use case. On this point, we appreciate the reviewer’s comment regarding the length of the section detailing the Mask R-CNN’s parameters and, after careful consideration, we have decided to preserve this section. We believe this section to be key in reproducing the work and adapting Mask R-CNN to new applied problems, which is novel in comparison to the original paper. As part of the effort to shorten the said section, we have moved the table of acronyms to the Annexes.
- I am confused if the paper deals with plant counting or just plant detection? In some places, it is mentioned that plants are counted, while the results were about detecting and comparison with the traditional computer vision approach. Please make this clear. The goal is to produce a fully practical and accurate individual plant sizing algorithm. The suggested method generates location and size information for each plant, from which detection and counting can be derived. To avoid confusion, we changed the title of the manuscript to “Mask R-CNN refitting strategy for plant counting and sizing in UAV imagery”. We added a Section 1.3 Aim of the study which highlights that “the main contributions of this work are (a) the adjustment of Mask R-CNN to individual plant segmentation and detection for plant sizing (b) the thorough analysis and evaluation of transfer learning in this setup, (c) the experimental evaluation of this method on datasets from multiple crops and multiple geographies, which generate statistically significant results and (d) the comparison of the data-driven models derived with a computer vision baseline for plant detection.” (lines 167-172). We also added a figure showing the individual sizing of the plants (see Figure 7).
- If the paper is about counting plants, that keyword can also go in the title. You will get better hits if you have “plant counting”. We changed the title accordingly to “Mask R-CNN refitting strategy for plant counting and sizing in UAV imagery”.
- Line 64: “approach, before Section 5 presents the obtained results”. Make section 5 as a separate sentence. The comment is accepted and a correction has been made (line 180).
- Line 223: “Firstly, since typically” Any one word is sufficient. The comment is accepted and a correction has been made (line 243).
- Line 226: Please change from “remote sensing imaging” to “remotely sensed images”. The comment is accepted and a correction has been made (line 245).
- Did you use a stitched image or analyzed individual images from the UAV image? The collected imagery is initially provided as single images stitched from the UAV photographs, which are then annotated. These images are then split into 256x256 pixel patches to be fed into the Mask R-CNN for training purposes. This information has now been added in Section 2.1.1 Study Area.
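For readers reproducing this preprocessing step, a minimal sketch of such tiling is given below (the non-overlapping stride and the border handling are illustrative assumptions rather than details taken from the manuscript):

    import numpy as np

    def tile_image(image: np.ndarray, tile: int = 256, stride: int = 256):
        """Split a stitched orthomosaic (H x W x C array) into tile x tile patches.
        Patches that would run past the border are simply discarded here;
        padding or an overlapping stride are equally valid choices."""
        h, w = image.shape[:2]
        patches = []
        for y in range(0, h - tile + 1, stride):
            for x in range(0, w - tile + 1, stride):
                patches.append(((y, x), image[y:y + tile, x:x + tile]))
        return patches

    # Example: a dummy 1024 x 2048 RGB mosaic yields 4 x 8 = 32 patches.
    print(len(tile_image(np.zeros((1024, 2048, 3), dtype=np.uint8))))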
- Section 4.1: More information on the dataset should be mentioned. Image resolution, UAV flight height, overall stitched image size, camera specifications, etc. We added a Section 2.1.1 Study area with all the missing information. We did not specify the overall stitched image sizes as only part of these images has been annotated, so it would not be representative. Therefore, the number of extracted 256x256 images is given for each dataset in Table 2.
- Lines 188-332: I appreciate the authors in providing a detailed explanation of the Mask R-CNN methodology, but I feel this is not required for a research paper. A brief explanation is sufficient or just referring to some key literature will help readers. This loses focus on the paper. As addressed in Point 1, we do agree that this section may appear long but, after careful consideration, we have moved the table of acronyms to the Annexes and kept the section. The reason for this lies in the fact that the original Mask R-CNN paper only defines parameters for a specific problem, and we feel that the parameters need to be reviewed to tackle this novel problem with the Mask R-CNN model. Moreover, we estimate that the Mask R-CNN has so many parameters that the absence of this section would make this paper readable only to researchers with significant experience using this specific model or its derivatives, and would therefore drastically reduce the target audience.
- Line 367: The topic changes here. Not dealing with dataset anymore. So, please split this into a subsection. The section has been split into 2.1.1 Study area and 2.1.2 Dataset specifications.
- Line 382: Image resolution is mentioned here – Please move this up. Indicate this just after the image acquisition in line 361. The information was moved up in the section (line 198).
- Line 410: Split percentage is good. Please also indicate the number of samples here to avoid confusion. Because you are mentioning augmentation as well. Line 361 shows 340 images; line 386 shows 150 images, and augmentation was also performed to increase the sample size. Please make this clear. The statement in lines 422-424 highlights that we carried out live data augmentation with two transformations when generating the batches, which means that we virtually tripled the training set size. Live data augmentation is favoured to avoid storing more data. The mention of 340 images at former line 361 (line 209 in the updated manuscript) gives the number of images in the POT_CTr dataset, whereas the mention of 150 images at former line 385 (line 224) relates to the number of images on which Ganesh et al. 2019 trained their in-field orange sizing Mask R-CNN model.
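As an illustration of what such live augmentation can look like inside a batch generator, the sketch below uses horizontal and vertical flips; these two particular transformations are assumptions made for the example, as the response does not name the ones actually applied:

    import numpy as np

    def augmented_samples(images, masks):
        """Yield (image, mask) pairs with on-the-fly augmentation: each original
        sample is also emitted as a horizontally and a vertically flipped copy,
        so the effective training set is tripled without storing extra files."""
        for img, msk in zip(images, masks):
            yield img, msk
            yield np.fliplr(img), np.fliplr(msk)  # transformation 1 (assumed)
            yield np.flipud(img), np.flipud(msk)  # transformation 2 (assumed)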
- Fig 5 caption – where are the blue disks – This is not clear. Red crosses are seen. The blue disks can be observed when zooming in. Indeed, they are not particularly well highlighted but they only show the manually-tuned kernel around the detected centroids (red crosses), which we believe is an acceptable compromise.
- Fig 5a and 5b – Why is the ground truth board classified as a plant – red border? In 6a) and 6b), Figure 5 in the first version of the manuscript, the models used are M1_POT and M2_POT, which are the least advanced (see Section 3.1. Individual plant segmentation with transfer learning); hence they perform more poorly than M3_POT, which correctly does not classify and segment the groundtruth board as a plant.
- In the methods, the authors have mentioned about collecting images at 6 dates. How are you using it for analysis? Are you comparing the accuracy on different dates? Please explain. The imagery collected on different dates allows us to co-register the images and to reuse the annotations made for one date on the other dates, making the data annotation process more efficient. This statement was added to Section 2.1 Datasets (lines 189-191). Since this annotation process is coarse, the comparison of accuracy between dates was not carried out, as this dataset is used for “warming up” the model (see Section 2.1.2 Datasets Specifications and Section 2.2.2 Transfer learning strategy).
- Which growth stage of potato was best identified with your method? Typically, the time window for the drone flights goes from 4 to 7 weeks after emergence. It was not possible to establish an ideal growth stage as potato plants vary in size depending on the variety and tuber rate. This would have required an agronomist to be in the field to create groundtruth following the BBCH-scale, the scale used to estimate the growth stage of plants. The information regarding the flight timing was added to Section 2.1.1 Study area (line 188).
- How did you deal with the overlapping plants? The selected growth stages enabled the human annotator to visually separate plants using bounding box delimitation, allowing for a small overlap in the coarse dataset. Consequently, this semi-automatic labelling displays some imprecision in the groundtruthing, as shown in Fig. 2(d). This information was added to Section 2.1.2 Datasets specifications (lines 214-215).
- Figure 5, the soil color is different in the top and bottom row. Are those captured on different dates? Locations? Or cameras? Yes, the soil is different as the top row of images originates from a drone flight in Australia while the bottom row is from the United Kingdom. The images’ origins are addressed in Section 2.1.1 Study area, Table 1.
- A lot of abbreviations in the paper makes it very difficult to follow. Use only necessary terms for describing and discussing results. We agree with the reviewer that the abbreviations can be tricky to follow, but this is one of the added values of this work, as specified in the answers to Points 1 and 9. Mask R-CNN is a powerful but complex deep learning architecture from the Artificial Intelligence domain, which cannot benefit the Remote Sensing field unless its parametrization is fully understood. As specified in Section 2.2 Mask R-CNN re-fitting strategy, “if the original model (i.e. the one trained on a large number of natural scene images with coarsely-annotated natural scenes objects) is trained or applied for inference without modifications of the default parameters on UAV images of plants, the results are particularly poor. The main reason behind this failure is the large number of free parameters (around 40) in the Mask R-CNN algorithm.” (lines 228-232). To bridge the gap between the theoretical description of this network and its operational implementation on a specific task, we used an open source implementation (Matterport) and linked its terminology to the physical explanation of the parameters’ influence. To lighten the reading, we have moved the table of acronyms, linking the open source implementation variables, to the Annexes (see Table A1).
- Discussions should have more literature cited. Only three literature were cited there, that too for the methods. Please compare this with previous literature on plant detection. In this study, we explore the specific problem of plant sizing for RGB UAV imagery with resolutions similar to the one we used. As specified in Section 1.2 (whose name has been updated from ‘Plant counting’ to ‘Plant counting, detection and sizing’) and to the best of our knowledge, we have not identified other work investigating a direct automatic plant sizing approach for remotely sensed similar crops. The paper presents a traditional computer vision approach and a deep learning approach (Mask R-CNN). Putting raw performance aside, the Mask R-CNN brings unmatched capabilities by offering classification, instantiation and segmentation. Other deep learning models would require a significant number of post-processing steps to provide such outputs, defeating the goal of this paper to present implementable pipelines for plant counting and sizing. Therefore, it was decided to compare two drastically different approaches to highlight the benefits of the Mask R-CNN. The latter allows accurate sizing and the possibility to detect plants even if different growth stages are present in the input image. In future work, it would be valuable to assess the performance of such systems on entire fields of hundreds of hectares, as the growth stage may vary significantly; in this scenario, we would expect the Mask R-CNN to significantly outperform the traditional computer vision solution. Comparing algorithms from existing works with the proposed computer vision baseline is therefore not part of the scope of this study, hence the limited plant detection literature cited in the discussions. We included references to existing studies in Section 4. Discussion, highlighting the methodology adopted in this study to carry out plant sizing of remotely sensed potato plants and lettuce heads (lines 504-532). We also included a comparison with the results obtained by Ganesh et al. 2019 on in-field orange detection (lines 551-554).
Reviewer 2 Report
The authors proposed a framework that uses MASK R-CNN architecture to detect and segment the plants using remotely sensed UAV imagery. The problem addressed in the paper is valuable from a scientific perspective and is also expected to have a tangible impact from an agricultural point of view.
The methodology is well explained and the experimental results confirm the soundness of their proposed architecture. The paper is generally well written, covering all the important aspects of the selected topic; however, I have a few suggestions.
-The title should be appropriate (for example: Mask R-CNN based individual plant segmentation and detection using remotely sensed UAV imagery)
-Line 1-3, 188-189: “This work introduces a method that merges remote sensing and deep learning into a framework that is tailored for accurate, reliable and efficient counting and sizing of plants in aerial images”
“The deep learning model introduced to tackle the plant counting and sizing tasks is based on the Mask R-CNN architecture, which is adjusted for this problem”. Please specify the quantitative results for the counted number of plants and the min-max size of the plants in the results section.
-Line 362-363: “UAV RGB imagery had been acquired over two UK fields on six different dates, covering the full span of growth stages from
emergence (plant size of more than 8cm) to the point where the canopy closes.“ Please add RGB sensor technical details.
- Line 367-368: “The mask was obtained by applying an Otsu threshold [21] to separate soil and vegetation within each bounding box.”. Had the authors evaluated the influence of inter-row soil effects (grass, weed ) on the reliability of mask generation for all the images acquired during the phenological stages of plants?
Author Response
- The authors proposed a framework that uses MASK R-CNN architecture to detect and segment the plants using remotely sensed UAV imagery. The problem addressed in the paper is valuable from a scientific perspective and is also expected to have a tangible impact from an agricultural point of view.
- The methodology is well explained and the experimental results confirm the soundness of their proposed architecture. The paper is generally well written, covering all the important aspects of the selected topic; however, I have a few suggestions. We would like to thank the reviewers for the positive comments and constructive remarks.
- The title should be appropriate (for example: Mask R-CNN based individual plant segmentation and detection using remotely sensed UAV imagery) We changed the title accordingly to “Mask R-CNN refitting strategy for plant counting and sizing in UAV imagery”.
- Line 1-3, 188-189: “This work introduces a method that merges remote sensing and deep learning into a framework that is tailored for accurate, reliable and efficient counting and sizing of plants in aerial images”; “The deep learning model introduced to tackle the plant counting and sizing tasks is based on the Mask R-CNN architecture, which is adjusted for this problem”. Please specify the quantitative results for the counted number of plants and the min-max size of the plants in the results section. The performance for the counted number of plants is represented by the precision, recall and MOTA metrics. A new paragraph about the predicted plant sizes was added at line 494 in the Results section, as well as a Figure 7. We would like to thank the reviewer particularly for this comment as we believe that this additional data greatly contributes to demonstrating the real-world use case tackled by our work.
- Line 362-363: “UAV RGB imagery had been acquired over two UK fields on six different dates, covering the full span of growth stages from emergence (plant size of more than 8cm) to the point where the canopy closes.” Please add RGB sensor technical details. We added Table 1 in Section 2.1.1 Study Area with all the missing information.
- Line 367-368: “The mask was obtained by applying an Otsu threshold [21] to separate soil and vegetation within each bounding box.” Had the authors evaluated the influence of inter-row soil effects (grass, weed) on the reliability of mask generation for all the images acquired during the phenological stages of plants? While it is true that vegetation foreign to the field could bring inaccuracy into our process, we only selected the largest block of vegetation to build the mask. This means that any vegetation that is not in direct contact in the imagery with the largest block of vegetation is discarded from the final mask. Moreover, as can be seen in the images presented across the paper, the selected fields had close to no weed or grass in proximity to the cropped area. We therefore estimate their impact to be negligible.
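A minimal sketch of this masking logic, assuming an OpenCV-based implementation (the actual code used in the paper is not shown in this response), could look as follows:

    import cv2
    import numpy as np

    def vegetation_mask(index_map: np.ndarray) -> np.ndarray:
        """Binarise a single-band vegetation index with Otsu's threshold and keep
        only the largest connected block of vegetation, so that isolated patches
        (e.g. weeds away from the crop row) are discarded. Depending on the index
        used, vegetation may fall on the low side of the threshold, in which case
        the binary image should be inverted first."""
        img8 = cv2.normalize(index_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        _, binary = cv2.threshold(img8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
        if n <= 1:  # nothing but background
            return np.zeros_like(binary)
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))  # skip background label 0
        return np.where(labels == largest, 255, 0).astype(np.uint8)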
Reviewer 3 Report
A brief summary
The aim of the study was to prove that the Mask R-CNN can be used with UAV images for the detection of individual plants. This attempt shows some promise, but the manuscript is structured poorly and lacks a scientific component.
Broad comments
Deep learning offers a lot of potential in such applications and this attempt is noteworthy. Section 2. contains a good overview of the recent studies related to the research objective.
This research would be more comprehensive if it was formatted according to the Research manuscript sections defined in the Instructions for authors. The length of the sentences in the manuscript is generally far too long. The authors should split them into shorter sentences to improve readability. The Introduction lacks relevant references to support claims made by the authors. Some parts of 2.2. section are more suitable as a part of the Discussion section (like 156-157). The method of individual plant detection is not clearly explained and can be much shorter. There are technical components of the study that lack data (especially for UAV flights). The Discussion section needs to be expanded. An English language check is recommended.
Specific comments
Affiliations: These data seem incomplete. Also, is “[email protected]” even a real e-mail?
Lines 3-10: These two sentences are very long. Consider splitting them to shorter sentences to improve readability.
Abstract: You should include key results of your research in the Abstract. Also, which crop type did you observe in your research?
Lines 15-17: I do not consider that as a true statement. Please add a reference to this sentence.
Lines 21, 26: Please avoid using “etc.” when it is not necessary.
Line 27: The statement regarding yield increase and environmental impact definitely requires at least one reference.
Fig. 1: While the figure is interesting, it does not fit in the Introduction. Also, how did you get to these results, using which data and which method?
Line 35: Gigapixels are not commonly used to assess image size. I suggest that you state image size in megabytes or gigabytes.
Line 175: Lower cost of RGB cameras is also an advantage.
Lines 177-179: I see your point, but it would be very interesting (and possibly necessary) to evaluate and prove that using quantitative indicators.
Lines 203-221: This is more suitable for the Introduction. You should focus more on the method that you used in this section.
Lines 361-362: You should add much more information about the acquiring of UAV imagery. Study area, UAV type, field data, orthophoto creation, exact dates of imaging (“six different dates” is not scientific) are missing.
Line 264: What means “before or after solar time”? You should also insert the results of your UAV image acquisition.
Section 4.1.: It would be interesting to see a mosaic of 16 or 25 sample images of potato and lettuce that you used. Figure 1 is not representative enough and is misplaced in the manuscript.
Line 373: How would you justify using a training set containing data from UK and Australia, as these areas offer completely different agroecological parameters for crop growth?
Table 2.: The object of LET_RTr is Lettuce I guess?
Line 404: It can be trained on CPU but it is far more consuming. I suggest removing that part or supporting it with a reference. Also, how long was the process of training in your case?
Sections 5.: A large part of this section belongs to the Methods section.
Table 3.: Please make this explanation clearer, so you tested four variations of training sets, but using the different parameters for each?
Figure 5.: These results should definitely be also interpreted in the form of statistical indicators for each method (aside from the ones in figure description), as this is the core of the research. The results do not justify so high Precision values as you stated after in the text. In two cases, a marker was detected as a plant?
Line 485: Precision and Recall should be described in the Material and Methods section.
Table 4. and Table 5.: So you correctly detected 98.3%, 99.7%, 99.7% and 100% (!?) of the plants? A clear interpretation of these results is necessary.
References: Just a question for the authors. Are these papers from CoRR peer-reviewed?
Author Response
- The aim of the study was to prove that the Mask R-CNN can be used with UAV images for the detection of individual plants. This attempt shows some promise, but the manuscript is structured poorly and lacks a scientific component. We have restructured the paper to follow the typical structure of a research paper. Following this reorganisation, we have migrated some information from Related Works to the Discussion section and from the Results to the Methodology section to provide additional details on the datasets. The Abstract has been rewritten and a distinct Conclusion section has been added to emphasise the scientific findings.
- Deep learning offers a lot of potential in such applications and this attempt is noteworthy. Section 2. contains a good overview of the recent studies related to the research objective. We would like to thank the reviewers for the positive comments and constructive remarks.
- This research would be more comprehensive if it was formatted according to the Research manuscript sections defined in the Instructions for authors. The length of the sentences in the manuscript is generally far too long. The authors should split them into shorter sentences to improve readability. The Introduction lacks relevant references to support claims made by the authors. Some parts of 2.2. section are more suitable as a part of the Discussion section (like 156-157). The method of individual plant detection is not clearly explained and can be much shorter. There are technical components of the study that lack data (especially for UAV flights). The Discussion section needs to be expanded. An English language check is recommended. As mentioned in Point 27, we have restructured the paper as required while following the reviewer’s suggestions. Regarding the requested information about the imagery included in the dataset, we have added Table 1, which details the sensors, drone and locations. We have included references to existing studies in Section 4 - Discussion, highlighting the uniqueness of the methodology adopted in this study to carry out plant sizing of remotely sensed potato plants and lettuces (lines 504-532). We also included the results obtained by Ganesh et al. 2019 on in-field orange detection (lines 551-554). We also improved the English over the entire paper, with particular attention to long sentences. Regarding the aim of our study, we demonstrate the superiority of Mask R-CNN to solve a new problem, plant sizing of low-density crops using UAV RGB imagery. The experimental design has been established to account for practical constraints of the remote sensing field for precision agriculture: commercial farming practices represented in the variability of images of the dataset, scalability for operational use of the model, and scarcity of images annotated at a pixel level. Combining remote sensing and deep learning for individual plant instance segmentation using Mask R-CNN is presented as a direct and automatic cutting-edge approach. Moreover, it outperforms the parametric computer vision baseline, which requires multiple processing steps and manual parametrization, when used for plant detection. The success of this approach is conditioned by transfer learning strategies and by the correct tuning of the numerous parameters of the Mask R-CNN training process, both detailed in this study. This justifies the necessity of understanding the complex parameterization process of Mask R-CNN, and this study is, to the best of our knowledge, the first to disseminate in detail the implications and effects of this complex model’s parameters. It also facilitates reproducibility by using the variable names of the most popular open-source implementation. As part of the effort to shorten the said section, we have moved the table of acronyms to the Annexes.
- Affiliations: These data seem incomplete. Also, is “[email protected]” even a real e-mail? This email address has been removed to avoid any confusion and replaced by a single contact email address for Hummingbird Technologies.
- Lines 3-10: These two sentences are very long. Consider splitting them to shorter sentences to improve readability. The abstract has been reworked and, as suggested, the readability has been improved.
- Abstract: You should include key results of your research in the Abstract. Also, which crop type did you observe in your research? As suggested, the abstract has been significantly edited. The crops observed (lettuce heads and potato plants) are now mentioned in the abstract.
- Lines 15-17: I do not consider that as a true statement. Please add a reference to this sentence. We appreciate the reviewer’s remark and agree that the statement may have mistakenly suggested that the involvement of science and engineering in agriculture was only recent. The sentence was edited to clarify that engineering innovations have actually allowed us to investigate new data science problems in agriculture. References were added lines 21-24.
- Lines 21, 26: Please avoid using “etc.” when it is not necessary. The comment is accepted and modifications were executed accordingly (line 27).
- Line 27: The statement regarding yield increase and environmental impact definitely requires at least one reference. The sentence was reformulated to remove the statement on individual plant management and yield gain. We are aware of ongoing empirical work on this very topic but it is true that no publication exists as of now. New references have been added to mention the benefits of localised management decisions in herbicide spraying, which have similarities with plant-level management decisions (line 35).
- Fig. 1: While the figure is interesting, it does not fit in the Introduction. Also, how did you get to these results, using which data and which method? We moved this image out of the introduction to Section 2.1.2 Datasets specifications. The caption specifies that images come from lettuce and potato datasets with their manual annotations for individual segmentation and we added from which dataset these samples were extracted (see Figure 2).
- Line 35: Gigapixels are not commonly used to assess image size. I suggest that you state image size in megabytes or gigabytes. We agree with this statement and we have added an estimation of gigabytes per hectare, which is about 0.2 GB per hectare for imagery with a 2 cm per pixel resolution (line 42).
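As a back-of-envelope check of that order of magnitude (assuming plain 3-byte RGB pixels; the overlapping raw frames needed for stitching and the intermediate mosaicking products push the stored volume higher, towards the quoted value):

    \[
    \left(\frac{1\,\mathrm{m}}{0.02\,\mathrm{m/px}}\right)^{2} = 2500\ \mathrm{px/m^{2}},
    \qquad
    2500\ \mathrm{px/m^{2}} \times 10^{4}\ \mathrm{m^{2}/ha} = 2.5\times 10^{7}\ \mathrm{px/ha}
    \approx 75\ \mathrm{MB/ha}\ \text{at 3 bytes per pixel}.
    \]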
- Line 175: Lower cost of RGB cameras is also an advantage. The comment is accepted and modifications were executed accordingly (line 515).
- Lines 177-179: I see your point, but it would be very interesting (and possibly necessary) to evaluate and prove that using quantitative indicators. While we do agree that comparing a “patching” method and our presented method would be valuable, we believe that this is currently out of the scope of the paper but could certainly be addressed in future work. Indeed, the aim of the paper is to introduce an all-in-one solution which predicts both plant count and individual sizes in drone imagery. The high number of parameters, metrics, and concepts already used in this paper would make it difficult to introduce additional metrics without producing an extremely complex study. We have added a paragraph in the Discussion section to briefly emphasise the fact that no other existing work provides comparable tasks or metrics.
- Lines 203-221: This is more suitable for the Introduction. You should focus more on the method that you used in this section. We moved this paragraph to the Introduction (lines 89-106).
- Lines 361-362: You should add much more information about the acquiring of UAV imagery. Study area, UAV type, field data, orthophoto creation, exact dates of imaging (“six different dates” is not scientific) are missing. We added a Section 2.1.1 Study area with all the missing information.
- Line 264: What means “before or after solar time”? You should also insert the results of your UAV image acquisition. The comment was taken into account and the text has been replaced with “before and after solar noon” (line 194). We added a UAV image acquisition for one of the fields in Section 2.1.1 Study Area, Figure 1.
- Section 4.1.: It would be interesting to see a mosaic of 16 or 25 sample images of potato and lettuce that you used. Figure 1 is not representative enough and is misplaced in the manuscript. We added a UAV image acquisition from one of the surveys in Figure 1, so that several crop stages and the inter- and intra-row spacing between plants can be appreciated at a glance. Adding a mosaic of 16-25 images could be interesting but would not allow the variability to be seen correctly, as zooming/cropping is necessary for human readers to distinguish the image properties of the plants (as done in Figure 2).
- Line 373: How would you justify using a training set containing data from UK and Australia, as these areas offer completely different agroecological parameters for crop growth? By considering different crop stages and soil types within the training set, the goal is to build a training set representative of the variability of the plants in all possible fields. The purpose of using a data-driven methodology is to obtain a single model per crop type, which can accurately carry out the plant sizing task without pre/post-processing and without discrimination on agroecological parameters, as could be the case with computer vision methods. This study considers two locations on opposite sides of the globe, but the more images added to the training set, the more spatially global the model becomes.
- Table 2.: The object of LET_RTr is Lettuce I guess? Section 2.1.2 Datasets Specifications (lines 221-222) specifies that LET_RTr refers to the carefully annotated lettuce heads training set (R stands for ‘refined’, as explained at line 208).
- Line 404: It can be trained on CPU but it is far more consuming. I suggest removing that part or supporting it with a reference. Also, how long was the process of training in your case? We did not time the training exactly, as it was not part of the scope of this study. We removed the CPU mention due to the absence of exact numbers (line 419).
- Sections 5.: A large part of this section belongs to the Methods section. We moved the Section Hyper Parameters selection to the Material and Methods section (2.2.3).
- Table 3.: Please make this explanation clearer, so you tested four variations of training sets, but using the different parameters for each? We have changed the title of Table 3 to make it easier to understand and added an extra explanation in the text (lines 428-430).
- Figure 5.: These results should definitely be also interpreted in the form of statistical indicators for each method (aside from the ones in figure description), as this is the core of the research. The results do not justify so high Precision values as you stated after in the text. In two cases, a marker was detected as a plant? The images shown in Figure 6 (Figure 5 in the first version of the manuscript) are a “zoom in” on 256x256 patch images on which the numerical evaluation (mAP and MOTA) has been conducted, shown for visualization purposes, as stated in the caption. Since the images and the metric values shown are not directly comparable (due to the “zoom in”), the numerical evaluation in the caption has been removed to avoid confusion. However, the evaluation metrics related to the entire test sets and for all the datasets are still available in the text and in Tables 4 and 5, as well as their interpretation. Regarding the markers, in 6a) and 6b), the models used are M1_POT and M2_POT, which are the least advanced (see Section 3.2); hence they perform more poorly than M3_POT, which correctly does not classify and segment the groundtruth board as a plant.
- Line 485: Precision and Recall should be described in the Material and Methods section. The comment is accepted and modifications were executed accordingly (lines 415-418).
- Table 4. and Table 5.: So you correctly detected 98.3%, 99.7%, 99.7% and 100% (!?) of the plants? A clear interpretation of these results is necessary. The Precision metric measures the number of true positives divided by the sum of true positives and false positives (see the answer to your previous point). The numbers you mention in this comment refer to the Precision metric for two different methods (computer vision baseline and re-fitted Mask R-CNN) evaluated on two different test sets (POT_RTe and LET_RTe). Considering that these datasets exhibit close to no weeds thanks to good farming practice, Precision is very high even for the CV baseline (as stipulated in lines 483-486). A Precision of 100% is even reached for the re-fitted Mask R-CNN (model M2_LET), meaning that no element is wrongly identified as a plant over the entire LET_RTe test set. We added this last consideration in Section 3.2 (lines 488-490).
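For reference, the definitions behind these numbers, with TP, FP and FN denoting true positives, false positives and false negatives respectively:

    \[
    \mathrm{Precision} = \frac{TP}{TP + FP},
    \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}.
    \]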
- References: Just a question for the authors. Are these papers from CoRR peer-reviewed? After verification, it appears that papers published in CoRR are not refereed but they are checked by an advisory committee. All the papers cited are, however, very popular in the Deep Learning world and have been thoroughly scrutinised. Many of the papers listed have actually been republished in peer-reviewed conferences and we have updated these accordingly.
Reviewer 4 Report
This research focuses on investigating Mask R-CNN and its optimal parametrization for counting and sizing plants, together with evaluating techniques to deal with challenges in sample data collection. The authors discussed the model configurations in detail, which is helpful for readers to repeat the experiment. The research results will contribute to the vegetation segmentation and monitoring. However, there are several concerns that need to be addressed in this manuscript before considering publication.
Please see below specific comments.
- Line 15-17, please provide references for these statements.
- L27, need reference.
- Figure 1 d-f show the research results? While this is not the result section.
- L44-45, not clear. The authors mean no remote sensing dataset or segmentation algorithms available?
- 50, please define the R-CNN.
- 109-111, please briefly review these studies used Mask R-CNN and evaluate the performances.
- 175, please define CIVE.
- 307, how was this confidence score calculated?
- 315-319, please provide references for these statements.
- 321, several millions of parameters? 323-332, these previous studies recommended training sample size for specific models. Are there conclusions transferable to the model used in this research?
- 336-337, “Computing vegetation indexes to translate image properties…”, not clear what this “translate” means.
- Figure 4, why the local minima is the center of plant? If two or more plants are very close or overlapped, then how to find this center?
- Table 2, the COCO dataset has substantially more images than the other datasets that were collected specifically for this study area. Not sure if this will cause imbalance on the training process, such as the model is more influenced by the COCO dataset.
- 533, this “poor environmental conditions” is a bit confusing. It refers to the poor environment for vegetation growth, or the “plants shadow and overlap” that affect image recognition?
Author Response
- This research focuses on investigating Mask R-CNN and its optimal parametrization for counting and sizing plants, together with evaluating techniques to deal with challenges in sample data collection. The authors discussed the model configurations in detail, which is helpful for readers to repeat the experiment. The research results will contribute to the vegetation segmentation and monitoring. However, there are several concerns that need to be addressed in this manuscript before considering publication. We would like to thank the reviewers for the positive comments and constructive remarks.
- Line 15-17, please provide references for these statements. We appreciate the reviewer’s remark and, after consideration, we agree that the statement may have mistakenly suggested that the involvement of science and engineering in agriculture was only recent. The sentence was reformulated to clarify that engineering innovations have actually enabled us to investigate new data science problems in agriculture. References were added (lines 21-24).
- L27, need reference. The sentence was reformulated to remove the statement stipulating that individual plant management leads to yield gain. It was mentioned as we are aware of empirical work currently in progress on this very topic but this work has not been published yet. Nonetheless, references have been added to mention the benefits of localised management decisions in herbicide spraying, which have similarities with plant-level management decisions line 35.
- Figure 1 d-f show the research results? While this is not the result section. We moved this image out of the introduction to Section 2.1.2 Datasets Specifications. The caption specifies that images come from lettuce and potato datasets with their manual annotations for individual segmentation and we added from which dataset these sample images were extracted (see Figure 2).
- L44-45, not clear. The authors mean no remote sensing dataset or segmentation algorithms available? Yes, the only existing datasets with segmentation groundtruth collected imagery from a single, small area. The sentence was reformulated to clarify this point (lines 53-54).
- 50, please define the R-CNN. In this line, “R-CNN” is part of the expression “Mask R-CNN”, which is the name of the architecture coined by its authors. The mention “so-called” has been added to avoid confusion (line 164).
- 109-111, please briefly review these studies used Mask R-CNN and evaluate the performances. After careful consideration, the section in question had to be removed in order to make the Introduction shorter. However, a paragraph starting at line 523 was added in the Discussion section to emphasise the fact that no other work has a comparable workflow or metrics.
- 175, please define CIVE. The acronym’s meaning has now been added (lines 384-385).
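For completeness, the Colour Index of Vegetation Extraction is usually computed from the RGB channels with the coefficients proposed by Kataoka et al.; whether the manuscript uses exactly this form is not stated in this response:

    \[
    \mathrm{CIVE} = 0.441\,R - 0.811\,G + 0.385\,B + 18.78745
    \]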
- 307, how was this confidence score calculated? The Detection Layer follows the FPN Classifier; as stated at line 326, the Feature Pyramid Network (FPN) section explains that “the output of these deep layers is composed of a classifier head with logits and probabilities for each item of the collection to be an object and belong to a certain class as well as refined box coordinates which should be as close as possible from the groundtruth boxes used at this step.” To avoid confusion we replaced “confidence” with “probability” (line 326).
- 315-319, please provide references for these statements. We added the missing references lines 335-336.
- 321, several millions of parameters? 323-332, these previous studies recommended training sample size for specific models. Are there conclusions transferable to the model used in this research? We do use the Zlateski et al. 2018 strategy for natural scene images to evaluate whether it is transferable to remotely sensed images. As 10K accurately annotated masks are very costly and time-consuming to digitize, we focus our study on pre-training on very large coarsely annotated datasets and re-fitting on much smaller refined datasets, just as suggested in the Zlateski study. The added value is to consider different dataset types (natural scenes/remotely sensed), annotations (coarse/refined) and sizes [see Section 2.1.2 Datasets Specifications and 3.1 Individual plant segmentation with transfer learning].
- 336-337, “Computing vegetation indexes to translate image properties…”, not clear what this “translate” means. We agree with this comment that the sentence was ambiguous, and “translate” has now been replaced with “highlight” (line 383).
- Figure 4, why the local minima is the center of plant? If two or more plants are very close or overlapped, then how to find this center? Section 2.3 Computer Vision baseline stipulates that “The Laplacian filtered CIVE map with masked soil pixels (LfC) highlights regions of rapid intensity change. Potato plant and lettuce head pixels of this map are meant to have minimal change close to their center due to the homogeneous and isotropic properties of their visual aspect. Therefore, finding the local minima of the LfC map should output the geolocation of these plant centers. This is why all LfC pixels are tested as the geo-center of a fixed size disk window and are only considered as the center of a plant if their value is the minimum of all pixels located within the window area. This framework is summarized in Fig. 5”. The more overlapped two plants are, the harder it is to identify their geo-centers. This is the reason why the computer vision approach is presented as a baseline, and then challenged by the Mask R-CNN model.
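A compact sketch of that local-minima search is given below, using SciPy's minimum filter with a disk-shaped footprint standing in for the fixed-size window; the window radius and the exact preprocessing of the LfC map are assumptions made for illustration only:

    import numpy as np
    from scipy import ndimage

    def plant_centres(lfc: np.ndarray, vegetation_mask: np.ndarray, radius: int = 15):
        """Return (row, col) candidate plant centres from a Laplacian-filtered
        CIVE map (LfC). A pixel is kept if it is the minimum of the LfC values
        inside a disk window of the given radius and lies on vegetation."""
        yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        disk = (xx ** 2 + yy ** 2) <= radius ** 2          # disk-shaped window
        local_min = ndimage.minimum_filter(lfc, footprint=disk)
        is_centre = (lfc == local_min) & (vegetation_mask > 0)
        return list(zip(*np.nonzero(is_centre)))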
- Table 2, the COCO dataset has substantially more images than the other datasets that were collected specifically for this study area. Not sure if this will cause imbalance on the training process, such as the model is more influenced by the COCO dataset. The COCO dataset is used for pre-training the architecture before applying the presented transfer learning strategy. As stated in Section 2.2.2, “Transfer learning can only be successful if the features learnt from the first model are general enough to envelop the targeted domain of the new task.” The lowest levels of the architecture are designed to extract low-level features such as edges and curves, whereas higher-level features ensure task-oriented functionalities. By “warming up” the architecture with the very large COCO dataset and using smaller specific datasets for refitting the highest part of the architecture (backbone layers are frozen, as specified in lines 366-367), the model is ensured to work at its best performance for the desired task (potato plant or lettuce head sizing).
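To make this strategy concrete, a minimal sketch based on the Matterport implementation's public training API is given below; the configuration values, dataset objects, epoch counts and learning rates are placeholders, and only the layers="heads" mechanism, which keeps the COCO-pretrained backbone frozen while the heads are re-fitted, is the point being illustrated:

    from mrcnn.config import Config
    import mrcnn.model as modellib

    class PlantConfig(Config):
        NAME = "plant"
        NUM_CLASSES = 1 + 1   # background + a single plant class (assumed)

    def refit(coarse_train, coarse_val, refined_train, refined_val,
              coco_weights="mask_rcnn_coco.h5", log_dir="logs/"):
        """Two-stage re-fitting: COCO weights -> coarse dataset -> refined dataset.
        The epoch counts and learning-rate scaling are placeholders, not the
        values used in the manuscript."""
        config = PlantConfig()
        model = modellib.MaskRCNN(mode="training", config=config, model_dir=log_dir)
        # Load COCO weights, dropping the heads whose shape depends on the class count.
        model.load_weights(coco_weights, by_name=True,
                           exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                    "mrcnn_bbox", "mrcnn_mask"])
        # Stage 1: "warm up" on the large, coarsely annotated dataset,
        # training only the head layers so the pretrained backbone stays frozen.
        model.train(coarse_train, coarse_val, learning_rate=config.LEARNING_RATE,
                    epochs=20, layers="heads")
        # Stage 2: re-fit on the small, carefully annotated dataset.
        model.train(refined_train, refined_val, learning_rate=config.LEARNING_RATE / 10,
                    epochs=20, layers="heads")
        return model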
- 533, this “poor environmental conditions” is a bit confusing. It refers to the poor environment for vegetation growth, or the “plants shadow and overlap” that affect image recognition? It refers to “shadowing effects, occluded by foliage or some degree of overlap between plants” (line 569). We removed “poor environmental conditions” to avoid confusion (line 568).
Reviewer 5 Report
- There is a clear explanation of the impact of the parameters of the deep learning architectures used.
- The comparison with the computer vision and vegetation index (CIVE) technique used seems to me to be very accurate, because the results obtained give more prominence to the proposed methodology, leaving open the possibility that in future work other vegetation indices will be compared for their respective evaluation.
- Using images in their different stages of growth, I consider that they make the algorithm more robust.
Author Response
- There is a clear explanation of the impact of the parameters of the deep learning architectures used.
- The comparison with the computer vision and vegetation index (CIVE) technique used seems to me to be very accurate, because the results obtained give more prominence to the proposed methodology, leaving open the possibility that in future work other vegetation indices will be compared for their respective evaluation.
- Using images in their different stages of growth, I consider that they make the algorithm more robust.
We would like to thank the reviewer for the positive comments and constructive remarks. We sincerely hope that this manuscript will, thanks to this collaborative process, now match Remote Sensing’s publication standards.
Round 2
Reviewer 3 Report
I believe that the authors made substantial improvements in the manuscript. Their replies to the reviewers are comprehensive and understandable, which deserves praise. However, I still have some suggestions related to the new version of the manuscript:
Table 1. Please change “flight height” to “relative flight height”.
Line 195: Did you create orthophotos using Structure-from-Motion algorithms? Firstly, please stop using the word “drone” and use “UAV” instead. Secondly, since this is the Remote Sensing Journal, this information is very important to many readers and should be clarified.
Figure 1. This figure is not representative of your study area and needs a complete rework. First of all, why do not you show the exact location of these fields in the UK and Australia? Since they are at different locations, this would be very interesting for readers. If you want to also put UAV images, it would be good to place only zoom-ins of all ten locations in a 2x5 grid. That way, you will have a clear and representative image of the study area.
Tables 4 and 5, lines 488-490: You must expand a discussion regarding these values. What is the reason for such high precision values? Maybe it was the distinctiveness of lettuce regarding the potato, also the relative number of input data for potato and lettuce? I suspect that these values would have low repeatability in the field, so you should clarify under which conditions this approach performs like in your research more.
References: I understand that currently there is not much literature on the subject, but I still do not think that this represents a completely reliable source of information. However, these things are up to the editor to decide.
Author Response
I believe that the authors made substantial improvements in the manuscript. Their replies to the reviewers are comprehensive and understandable, which deserves praise. We would like to thank the reviewer for appreciating the improvements made to the paper.
However, I still have some suggestions related to the new version of the manuscript:
Table 1. Please change “flight height” to “relative flight height”. The comment has been taken into account and changes have been made accordingly.
Line 195: Did you create orthophotos using Structure-from-Motion algorithms? Firstly, please stop using the word “drone” and use “UAV” instead. Secondly, since this is the Remote Sensing Journal, this information is very important to many readers and should be clarified. Yes, as presumed by the reviewer, we did use a Structure-from-Motion algorithm to generate our orthomosaic images, and this information has been added in lines 195-196. The vocabulary has also been adjusted across the entire paper and all occurrences of the word “drone” have been replaced with “UAV”.
Figure 1. This figure is not representative of your study area and needs a complete rework. First of all, why do not you show the exact location of these fields in the UK and Australia? Since they are at different locations, this would be very interesting for readers. If you want to also put UAV images, it would be good to place only zoom-ins of all ten locations in a 2x5 grid. That way, you will have a clear and representative image of the study area. We would like to thank the reviewer for this comment as we believe the suggestions make the paper more attractive to the readers. As suggested, we replaced the “Country” column in Table 1 with a column named “Region”. This new column gives a more accurate location for each field while ensuring that the field cannot be identified for data privacy reasons. In addition, Figure 1 was replaced with a 2x5 mosaic of images from all 10 orthomosaic images to give a more representative sample of our dataset.
Tables 4 and 5, lines 488-490: You must expand a discussion regarding these values. What is the reason for such high precision values? Maybe it was the distinctiveness of lettuce regarding the potato, also the relative number of input data for potato and lettuce? I suspect that these values would have low repeatability in the field, so you should clarify under which conditions this approach performs like in your research more. We have expanded the discussion regarding the high precision values as requested (lines 489-495). On top of lettuce displaying visual features that are algorithmically easier to detect, it is also a high-value crop. This implies that the amount of care per hectare of crop is significantly higher, which may translate into more organised fields with a reduced presence of weeds. Our imagery comes from two commercial lettuce fields which have been annotated, and the test images originate from miscellaneous, distributed locations within those fields (of course, distinct from the training locations). By preparing the dataset in this way, we ensure the reproducibility of these results in commercial fields.
References: I understand that currently there is not much literature on the subject, but I still do not think that this represents a completely reliable source of information. However, these things are up to the editor to decide. We believe that this comment refers to the use of papers from the Computing Research Repository (CoRR) as references. We agree that any paper is only fully reliable when peer-reviewed. Nonetheless, we have carefully scrutinised the papers in question and deem them scientifically sound, with some having several applied and public implementations created by independent groups (e.g. [11,20,22]). We have checked other papers previously published in Remote Sensing and it appears to us that referring to preprints is an acceptable practice, as long as they remain a minority of the references, as is the case in the presented manuscript. Therefore, we do hope that our references are satisfactory to both the reviewers and the editor.
Reviewer 4 Report
The authors have responded to the comments properly and improved the quality of this manuscript. No further comments.
Author Response
The authors have responded to the comments properly and improved the quality of this manuscript. No further comments. We would like to thank the reviewer for their previous constructive comments and for appreciating the improvements made to the paper.