#### *3.1. Individual Plant Segmentation with Transfer Learning*

First, we performed transfer learning from the COCO weights (obtained by training the Mask R-CNN with the parameters listed in Table 3) on the POT\_CTr dataset (model M1\_POT). The mean Average Precision (mAP) over POT\_RTe is 0.084, which is low, as initially expected, because the POT\_CTr training set is only coarsely annotated. We then performed transfer learning from the COCO weights on the POT\_RTr dataset (model M2\_POT) and obtained a higher mAP of 0.406 on POT\_RTe, owing to the refined annotations of POT\_RTr. Finally, retraining on the POT\_RTr dataset from the POT\_CTr weights (model M3\_POT) led to the best mAP of 0.418, demonstrating that pre-training a network on a coarsely labeled dataset before fine-tuning it on a refined one can boost the accuracy of a model. Figure 6a–c respectively show the inference of M1\_POT (mAP of 0.011), M2\_POT (mAP of 0.284), and M3\_POT (mAP of 0.316) on an image from the POT\_RTe dataset. Both the segmentation quality and the number of potato plants detected improve markedly as the mAP values increase, and the boundaries of the predicted masks fit the contours of the plants more closely.
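
A minimal sketch of this staged transfer learning with the Matterport implementation [48] is given below; the weight path, epoch counts, learning rates, `layers` choice, and the `PlantConfig`/dataset objects are illustrative assumptions rather than our exact training setup.

```python
from mrcnn.config import Config
import mrcnn.model as modellib

class PlantConfig(Config):
    # Hypothetical minimal config; the actual parameters are in Table 3.
    NAME = "plant"
    NUM_CLASSES = 1 + 1  # background + plant

config = PlantConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# COCO -> POT_CTr (M1_POT): skip the head layers whose shapes depend
# on the number of classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
model.train(pot_ctr_train, pot_ctr_val,  # POT_CTr dataset objects, built elsewhere
            learning_rate=config.LEARNING_RATE, epochs=30, layers="all")

# POT_CTr weights -> POT_RTr (M3_POT): fine-tune on the refined labels.
model.train(pot_rtr_train, pot_rtr_val,  # POT_RTr dataset objects, built elsewhere
            learning_rate=config.LEARNING_RATE / 10, epochs=30, layers="all")
```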

**Figure 6.** Predictions of the Mask R-CNN and Computer Vision (CV) methods. Yellow boundaries depict the groundtruth masks, red boundaries represent the predicted masks of the Mask R-CNN model, and red crosses surrounded by blue disks are the predictions of the computer vision method for plant centers. Samples shown are a subset of the entire image of spatial size 256 × 256 for visualization purposes, but the metrics reported are computed over the whole image. In the first row, one cropped sample image from POT\_RTe is shown: (**a**) M1\_POT; (**b**) M2\_POT; (**c**) M3\_POT; (**d**) CV. In the second row, one cropped sample image from LET\_RTe is shown: (**e**) M3\_POT; (**f**) M1\_LET; (**g**) M2\_LET; (**h**) CV.

In order to assess whether M3\_POT, trained on potato plants, performs equally well on another low-density, green-coloured crop, M3\_POT was used in inference on LET\_RTe, composed of aerial images of lettuce plants in fields. A mAP of 0.042 was obtained; by dissecting the parameters used in Mask R-CNN, this low score was explained by an excessively small *RPNtapi* = 128 limiting the number of lettuce plants which could be found in one image. By raising *RPNtapi* to 300, as well as the number of ROIs kept after NMS in the training and inference phases (parameters *pNMSrtr* and *pNMSrinf*), the likelihood of detecting all lettuce plants is significantly increased. As the number of objects of interest to detect is higher, *trROIpi* was also increased to provide more ROIs for training the FPN. Consequently, the maximum number of groundtruth objects to consider, *mGTi*, and the maximum number of predicted objects, *Dmi*, were also raised. Even after these adjustments (summarised in the sketch below), M3\_POT used in inference on LET\_RTe only reached a mAP of 0.095, which is still low, showing that adjusting these parameters alone cannot compensate for training on a different crop. We therefore conclude that a model trained on potato plants transfers poorly to lettuce plant detection and segmentation. Two models, based respectively on the COCO weights and on the M3\_POT weights, were then trained separately on the LET\_RTr dataset. These models (M1\_LET and M2\_LET) respectively obtained a mAP of 0.641 and 0.660 on the LET\_RTe dataset. A conclusion similar to that for the potato models can be drawn, since pre-training on another crop seems to improve the final model. Figure 6e–g respectively show the inference outputs of M3\_POT (mAP of 0.178), M1\_LET (mAP of 0.309), and M2\_LET (mAP of 0.595) on an image from the LET\_RTe dataset. Both the segmentation quality and the number of lettuce plants detected greatly improve as the mAP values increase, as do the boundaries of the predicted masks. Higher mAP values are observed for lettuce plants than for potato plants. This could imply that the individual plant segmentation task is easier to achieve on the lettuce crop, considering the acquisition conditions of the imagery in both datasets, detailed in Section 2.1.
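
For reference, the sketch below expresses these adjustments as a Mask R-CNN configuration, assuming the paper's abbreviations correspond to the Matterport [48] attribute names indicated in the comments; apart from *RPNtapi* = 300, stated above, the raised values are illustrative rather than our exact settings.

```python
from mrcnn.config import Config

class LettuceConfig(Config):
    """Hypothetical config sketch mapping the paper's abbreviations
    onto Matterport attribute names; values other than
    RPN_TRAIN_ANCHORS_PER_IMAGE = 300 are illustrative."""
    NAME = "lettuce"
    NUM_CLASSES = 1 + 1                  # background + plant
    RPN_TRAIN_ANCHORS_PER_IMAGE = 300    # RPNtapi: raised from 128 to 300
    POST_NMS_ROIS_TRAINING = 2000        # pNMSrtr: ROIs kept after NMS (training)
    POST_NMS_ROIS_INFERENCE = 2000       # pNMSrinf: ROIs kept after NMS (inference)
    TRAIN_ROIS_PER_IMAGE = 512           # trROIpi: more ROIs fed to training
    MAX_GT_INSTANCES = 300               # mGTi: max groundtruth objects per image
    DETECTION_MAX_INSTANCES = 300        # Dmi: max predicted objects per image
```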

#### *3.2. A Comparison of Computer Vision Baseline and Mask R-CNN Model for Plant Detection*

The presented Mask R-CNN model for individual plant segmentation inherently fulfils the detection task. The models with the highest mAP, M3\_POT for potato plants and M2\_LET for lettuce plants, were compared to the computer-vision-based model detailed in Section 2.3. This latter model outputs coordinates corresponding to the centers of the plants, so the masks predicted by the Mask R-CNN based model were postprocessed to compute the plants' centroids for comparison. Table 4 displays the results of both methods applied on the POT\_RTe dataset. MOTA, precision, and recall scores are all higher for the Mask R-CNN based model than for the traditional Computer Vision (CV) method. Figure 6c,d respectively show the predictions of M3\_POT and CV on a sample from POT\_RTe. It should be noted that the CV method wrongly classifies some weeds as potato plants and misses more plants than the Mask R-CNN based model. For the CV method, these false negatives can be explained by the distance between two local minima of the *LfC* map being smaller than the size of the chosen window, which occurs when plants are very close to each other. A small, isolated plant can also be missed due to a possible inaccuracy of the Otsu thresholding or an excessively strong smoothing effect of the Laplacian. However, it shall be noted that the imagery within POT\_RTe reflects good farming practices and well-timed data collection, so that there are almost no weeds in the field and the potato plants' canopy is not closed. M3\_POT could have encountered difficulties if either of these two edge cases had appeared in the data.
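
A minimal sketch of this centroid postprocessing is given below, assuming the boolean (H, W, N) mask array returned by the Matterport `model.detect()` results; it is a sketch of the step described above, not our exact evaluation code.

```python
import numpy as np

def mask_centroids(masks):
    """Centroids (row, col) of predicted instance masks.

    `masks` is assumed to be a boolean array of shape (H, W, N),
    one channel per detected plant."""
    centroids = []
    for i in range(masks.shape[-1]):
        ys, xs = np.nonzero(masks[..., i])
        if ys.size > 0:                      # ignore empty masks
            centroids.append((ys.mean(), xs.mean()))
    return np.asarray(centroids)
```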


**Table 4.** Results of the comparison of the Computer Vision (CV) and Mask R-CNN based algorithms for plant detection on POT\_RTe.

The same experiment is repeated to compare the CV method and M2\_LET on LET\_RTe. Once again, better MOTA values are observed for the Mask R-CNN based model, as seen in Table 5. Lettuce plants have the strong advantage of being round, compact, and well separated from each other. Lettuce is also a high-value crop, meaning that the amount of care per hectare is high, which may result in more organised fields with fewer weeds. These characteristics translate into a more systematically unique local minimum of the *LfC* per plant; consequently, the center of a lettuce plant is easier to detect than that of a potato plant. Indeed, potato plants can present an irregular canopy and often merge with other plants. The CV method reaches high scores of MOTA (0.858), Precision (0.997), and Recall (0.882). The deep learning model M2\_LET also benefits from the advantageous visual features of lettuce plants in comparison with potato plants. A perfect precision score of 100% is reached for the refitted Mask R-CNN model, meaning that no element is wrongly identified as a plant over the entire LET\_RTe test set. The model also outperforms the CV method with a MOTA of 0.918 and a Recall of 0.954. A sample extracted from the LET\_RTe dataset, processed by M2\_LET, is illustrated in Figure 6g. It can be observed that only lettuce plants on the edges of the image are missed, due to the convolutional structure of the network. For the CV method, whose results are illustrated in Figure 6h, it is small lettuces that are missed.

**Table 5.** Results of the comparison of the Computer Vision (CV) and Mask R-CNN based algorithms for plant detection on LET\_RTe.


To quantify the real-world detection limits of our models, we converted every pixel mask predicted by M2\_LET and M3\_POT into an area in square centimetres using the resolution per pixel. The smallest and largest lettuce plants measure 122 and 889 cm<sup>2</sup> respectively, with a mean leaf area of 405 cm<sup>2</sup>. The smallest and largest potato plants measure 40 and 391 cm<sup>2</sup> respectively, with a mean of 126 cm<sup>2</sup>. These values indicate that M3\_POT can detect potato plants as small as 4–5 pixels in diameter in the imagery. Moreover, the smaller plant sizes detected by M3\_POT are explained by the fact that potato fields need to be flown early in the season, before the plants' canopies merge. Figure 7 shows a sample image from the L1 field with the masks predicted by M2\_LET; the corresponding size in square centimetres is superimposed over each plant.
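
The conversion itself is straightforward; a sketch is given below, where `gsd_cm` denotes the ground sampling distance in centimetres per pixel side, a value that depends on the flight parameters of each dataset and is therefore left unspecified here.

```python
def mask_area_cm2(mask, gsd_cm):
    """Leaf area in cm^2 of one boolean plant mask.

    Each pixel covers gsd_cm x gsd_cm centimetres on the ground,
    so the area is the pixel count times the squared ground
    sampling distance."""
    return int(mask.sum()) * gsd_cm ** 2
```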

**Figure 7.** Sizing of each lettuce plant in cm<sup>2</sup> for a sample image patch of the L1 field. Each lettuce is overlaid with its predicted mask. The masks' green colour darkens with plant size.

#### **4. Discussion**

In the existing literature, detailed in Section 1.2, the methodology applied to the plant detection and counting problems often involves a two-step procedure which first segments the vegetation and then detects the plants' centroids, using either computer vision or deep learning methods [25–27,32–37]. By relying on the Mask R-CNN architecture, our work presents an all-in-one model that outputs individual plant masks, targeting the complex instance segmentation task itself and intrinsically leading to strong plant detection and counting performances.

In this paper, the focus is both to accurately detect plants, in order to estimate a count per area, and to delineate every single plant's boundaries at pixel level. In contrast with our work, previous works have adapted regression-based models to predict a count for a given area without any plant location or plant segmentation [25–27]. Moreover, we avoid the use of vegetation indices [27,33,35] because (a) the most renowned ones require a multi-spectral camera, which typically has a coarser spatial resolution and is more expensive than an RGB camera, and (b) other indices, such as CIVE (Color Index of Vegetation Extraction), are linear combinations of the RGB channels, i.e., learnable by the neural network without explicitly estimating them in a pre-processing stage. Finally, we use an all-in-one model which has already demonstrated its potential in agricultural scenarios, albeit on in-field imagery taken from the ground [45–47] rather than UAV imagery. By using this single Mask R-CNN model, we avoid the additional "patching" step seen in a number of techniques explored in the literature, which leads to more complex processing pipelines and hinders performance comparability [25–27,32–37].
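
To make point (b) concrete, the snippet below computes CIVE with its commonly cited coefficients (Kataoka et al.); since the index is a fixed linear combination of the channels plus a constant, a first convolutional layer can reproduce it. It is shown for illustration only and is not part of our pipeline.

```python
import numpy as np

def cive(rgb):
    """Color Index of Vegetation Extraction on an (H, W, 3) RGB image,
    using the commonly cited Kataoka et al. coefficients."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    # A 1x1 convolution with these weights and bias yields the same map.
    return 0.441 * r - 0.811 * g + 0.385 * b + 18.78745
```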

Previous studies have also developed algorithms for individual plant segmentation, such as [38] on palm trees and [39] on sorghum heads. However, the imagery resolutions and plant types studied are not comparable with those of our work. Moreover, [38,39,42] solely assess their solutions on the detection task and could not evaluate them on a sizing task due to the lack of accurately-digitized groundtruth masks. Dijkstra et al. [40] propose a method carrying out bounding-box detection of individual remotely sensed potato plants. This work presents a notable difference from ours, as its objective was not to generate accurate masks but exclusively to predict the plants' centroids within a targeted area associated with each plant. Hence, none of the existing methods enumerated above is directly comparable with our work, the differences in task complexity and metrics being too great.

Bringing together remote sensing and deep learning for plant instance segmentation with the Mask R-CNN embodies a direct, automatic, cutting-edge approach. Moreover, it outperforms a manually-parametrized, multistep, conventional computer vision baseline when used for plant detection. These results were based on the mean Average Precision as the metric for instance segmentation. This metric not only accounts for the matching between the groundtruth masks and their corresponding predictions, but also gradually penalizes the pixel-level correctness of the resulting mask over Intersection over Union thresholds from 0.5 to 0.95. The images collected for our datasets have been manually digitized with high precision, which allows the quality of the models' predictions to be assessed against the annotations of each plant. This specific characteristic of our datasets enabled our study to be the first one, to the best of our knowledge, to use mAP over the masks representing plants in UAV images to quantify a model's performance on a plant sizing task. Ganesh et al. [47] also used Mask R-CNN, for in-field orange fruit detection in trees; however, they do not evaluate their algorithm on a sizing task, as we do with the mean Average Precision metric. The MOTA metric, derived from the Multiple Object Tracking Benchmark [57], has recently been introduced in the remote sensing field and faithfully represents the performance of algorithms developed for object detection. The best models reach a mAP of 0.418 for the potato crop and 0.660 for the lettuce crop. For plant detection only, these same models respectively obtain a MOTA score of 0.781 for potato plants and 0.918 for lettuce plants. In comparison, the traditional computer vision baseline tested in this paper only obtains 0.767 and 0.858 for the same crops, respectively. Ganesh et al. [47], studying in-field orange fruit detection, obtain a Precision of 0.895 and a Recall of 0.867 with their Mask R-CNN model. With our refitted Mask R-CNN models on remotely sensed images, we reach a Precision of 0.997 (potato plants) and 1.0 (lettuce plants) and a Recall of 0.825 (potato plants) and 0.954 (lettuce plants).
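
For clarity, the relationship between the reported detection metrics can be sketched as follows. Since our frames are evaluated independently, the identity-switch term of the original tracking formulation of MOTA [57] vanishes; this is an illustrative sketch, not our evaluation code.

```python
def detection_scores(num_gt, num_pred, num_matched):
    """Precision, recall and a detection-only MOTA from match counts.

    num_gt: groundtruth plants; num_pred: predicted plants;
    num_matched: predictions matched one-to-one to a groundtruth plant."""
    fp = num_pred - num_matched          # false positives
    fn = num_gt - num_matched            # missed plants
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gt if num_gt else 0.0
    mota = 1.0 - (fp + fn) / num_gt if num_gt else 0.0
    return precision, recall, mota
```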

Annotated data is a common limitation in remote sensing projects leveraging the power of deep learning, as supervised techniques are data-greedy. The sequential training phases on a large natural-scene dataset (COCO [6]), then on the coarsely-annotated potato plant dataset and, finally, on the smaller, refined, labeled potato plant dataset allowed us to build the best model and highlight a strategy for tackling data scarcity. The resulting model is crop-specific, despite potato and lettuce being two low-density, green-coloured crops: models yielded poor results when trained on one specific dataset and used in inference on a second one containing other types of objects. Nonetheless, using the same weights as the basis for a new lettuce plant instance segmentation model turns out to be a successful strategy, which demonstrates transfer learning capabilities across datasets containing imagery of different plant types. This justifies the necessity of understanding the complex parameterization process of Mask R-CNN, and this study is the first one, to the best of our knowledge, to disseminate in detail the parameters and effects of this complex model. It also facilitates reproducibility by using the variable-name notations of an open-source implementation [48].

Our Mask R-CNN based model is robust: it is able to accurately detect and segment plants affected by shadowing, occluded by foliage, or overlapping other plants to some degree. Limitations appear once plants reach a greater merging state, but human annotators also encounter difficulties when digitising plant masks from UAV imagery. The computer vision baseline method for detection has proven more sensitive to merged plants. Our Mask R-CNN based model is also capable of correctly ignoring weeds that could be mistaken for potato plants, whereas the computer vision baseline algorithm frequently generates false positives when encountering weeds or human-made objects.

To apply the deep learning model presented in this paper to a real-world problem, it is important to note that it has only been trained on frames of 256 × 256 pixels, and that the individual plant segmentation is output for one frame rather than for an image sometimes covering hundreds of hectares. We suggest that, in order to process an entire potato or lettuce field, a preprocessing gridding step should be added so that each frame is fed into the model individually, followed by a postprocessing mosaicking step to reconstitute the whole image and display all predictions at once. Because the model struggles with plants located right on the image border, the preprocessing module could use a sliding window with overlap between two consecutive frames (sketched below). The postprocessing module should then crop the corresponding overlap and merge plants segmented across two frames into a single instance. Such preprocessing and postprocessing steps would help to validate the model, but also to build systems that automatically count an entire field and possibly assess crop maturity based on plant size. The development of such a tool could lead to novel field management practices, such as variable fertiliser spraying per plant depending on size, or plant growth stage estimation.
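
A minimal sketch of such an overlapping gridding step is given below; the overlap value and the edge handling are illustrative assumptions, not a tested design.

```python
def iter_tiles(ortho, tile=256, overlap=32):
    """Yield (row, col, window) frames covering a large orthomosaic.

    The tile size matches the 256 x 256 training frames; the overlap
    lets border plants appear whole in at least one frame. Edge
    windows may be smaller than `tile` and would need padding before
    being fed to the model."""
    step = tile - overlap
    h, w = ortho.shape[:2]
    for r in range(0, max(h - overlap, 1), step):
        for c in range(0, max(w - overlap, 1), step):
            yield r, c, ortho[r:r + tile, c:c + tile]
```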

#### **5. Conclusions**

Combining remote sensing and deep learning for plant counting and sizing using the Mask R-CNN architecture embodies a direct, automatic, cutting-edge approach. Moreover, it outperforms the manually-parametrized computer vision baseline, which requires multiple processing steps, when used for plant detection. This study is the first one, to the best of our knowledge, to evaluate an individual instance segmentation algorithm on remotely sensed images of plants. The datasets are composed of high-quality digitized masks of potato and lettuce plants. The success of this approach is conditioned by the transfer learning strategies and by the correct tuning of the numerous parameters of the Mask R-CNN training process, both detailed in this study. This justifies the necessity of understanding the complex parameterization process of Mask R-CNN and of disseminating the implications and effects of this complex model's parameters for remote sensing applications. It also facilitates reproducibility by using the variable-name notations of the open-source Matterport implementation [48]. The experimental design accounts for practical constraints of remote sensing for precision agriculture: commercial farming practices represented in the variability of the dataset's images, scalability for operational use of the model, and scarcity of images annotated at pixel level.

**Author Contributions:** Conceptualization, M.M. and F.L.; Data curation, F.L. and V.B.; Investigation, M.M.; Methodology, M.M.; Supervision, F.L.; Validation, M.M.; Visualization, M.M.; Writing—original draft, M.M.; Writing—Review and editing, M.M., F.L., A.H., and P.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**



#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
