#### 2.1.2. Datasets' Specifications

The COCO dataset is a large-scale object instance segmentation dataset with 1.5 million object instances over 200 K images across 80 object categories. The imagery covers natural scenes containing many different objects; as a consequence, not all of the objects are labeled and the groundtruth segmentation can be coarse.

In-house task-related datasets have been created from the imagery acquired in the study area, and Table 2 provides details with the following naming conventions: **C** for "Coarse", **R** for "Refined", **Tr** for "Training", and **Te** for "Test". POT\_CTr is a coarsely annotated dataset of 340 images of potato plants. Annotation was completed by copying bounding boxes of the same size over the most merged surveys; these boxes were then manually, but coarsely, adjusted. The mask was obtained by applying an Otsu threshold [27] to separate soil and vegetation within each bounding box. The growth stages selected enabled the human annotator to visually separate plants with bounding box delimitation, allowing for a small overlap for the coarse dataset. Consequently, this semi-automatic labelling displays imprecision in the groundtruthing, as shown in Figure 2d. POT\_CTr was used as a proof of concept that fine-tuning Mask R-CNN with small and coarse datasets could lead to satisfactory results; however, metric results could not be trusted given the quality of the annotation. In contrast, POT\_RTr is a training set carefully annotated at pixel level, covering sampled patches from the same UK fields and containing an extra field from Australia. It brings variability in the size of plants, variety, and type of soils. POT\_RTe presents the same characteristics and was kept exclusively as a test set to compare all models presented. Finally, LET\_RTr is also a training set accurately annotated with lettuce plants in imagery acquired in the UK, and LET\_RTe is the corresponding test set. Figure 2 displays samples from all these datasets. As specified in Section 1.2, [47] achieved state-of-the-art accuracy training Mask R-CNN on RGB in-field images of oranges with a dataset of 150 images of 256 × 256 × 3 and a total of 60 instances, which is smaller than our datasets.


**Table 1.** Characteristics of flight acquisition of imagery in the study area.

**Table 2.** Summary of the characteristics of the datasets. Naming convention relies on **C** for "Coarse", **R** for "Refined", **Tr** for "Training" and **Te** for "Test".


**Figure 1.** Samples of 10 × 10 m from every single orthomosaic image from our datasets: (**a**–**d**) P1; (**e**,**f**) P2; (**g**) P3; (**h**) P4; (**i**) L2; (**j**) L1.


**Figure 2.** Three examples of remote sensing imagery of low density crops (**a**–**c**), as well as their corresponding counting and sizing annotations (**d**–**f**). Panels (**a**,**d**) and (**b**,**e**) show potato fields, respectively from POT\_CTr and POT\_RTr, while (**c**,**f**) show a lettuce field from LET\_RTr. Please note that the colour coding is random, apart from the fact that each single-colour blob represents a single plant, while the unmarked green area (especially in the top image) corresponds to vegetation irrelevant to the crop (weeds).

#### *2.2. Mask R-CNN Refitting Strategy*

The deep learning model introduced to tackle the plant counting and sizing tasks is based on the Mask R-CNN architecture, adjusted for this problem. Indeed, if the original model (i.e., the one trained on a large number of natural scene images with coarsely-annotated natural scene objects) is trained or applied for inference on UAV images of plants without modifying the default parameters, the results are particularly poor. The main reason behind this failure is the large number of free parameters (around 40) in the Mask R-CNN algorithm. This section reviews the Mask R-CNN architecture thoroughly and evaluates how each parameter may affect performance. Following the terminology and architecture of Matterport's implementation [48], the goal is to unfold the parameter tuning process followed in this paper to ensure a fast and accurate individual plant segmentation and detection output while facilitating reproducibility. This approach builds a bridge of knowledge between the theoretical description of this network and its operational implementation on a specific task. Subsequently, the transfer learning strategies which obtain state-of-the-art results with the Mask R-CNN based architecture on the panel of datasets explored are described. To conclude this section, a computer vision baseline targeting the plant detection task is presented.

#### 2.2.1. Architectural Design and Parameters

The use of Mask R-CNN in the examined setup presents several challenges, related to the special characteristics of high-resolution remote sensing images of agricultural fields. Firstly, since most of the fields contain a single crop, the classification branch of the pipeline reduces to a binary classification algorithm, which affects the employed loss function. Secondly, in remotely sensed imagery the target objects (i.e., plants) do not present the same features, scales, and spatial distribution as the natural scene objects (e.g., humans, cars) included in the COCO dataset used for the original Mask R-CNN model training. Thirdly, the main challenges of this setup are different from many natural scene ones. For example, false positives due to cluttered background (a main source of concern in multiple computer vision detection algorithms) are expected to be rather rare in the plant counting/sizing setup, while object shadows affect the accuracy much more than in several computer vision applications. Due to these differences, Mask R-CNN parameters need to be carefully fine-tuned to achieve optimal performance. In the following sub-sections, we analyse this fine-tuning, which, in some cases, is directly linked to the ad-hoc nature of the setup and in others is the result of a meticulous trial-and-error process. Figure 3 describes the different blocks composing the Mask R-CNN model and Table A1, in Appendix A, associates acronyms representing the examined model free parameters with the variable names used in Matterport's implementation [48] for reproducibility.

**Figure 3.** Mask R-CNN architecture.

#### 2.2.2. Backbone and RPN Frameworks

The input RGB image is fed into the Backbone network, which is in charge of visual feature extraction at different scales. In our study, this Backbone is a pre-trained ResNet-101 [49]. Each block in the ResNet architecture outputs a feature map and the resulting collection serves as an input to different blocks of the architecture: the Region Proposal Network (RPN) and the Feature Pyramid Network (FPN). By setting the backbone strides *Bs<sup>t</sup>*, we can choose the sizes of the feature maps *Bs<sup>s</sup>* which feed into the RPN, as the stride controls downscaling between layers within the backbone. The importance of this parameter lies in the role of the RPN. For example, *Bs<sup>t</sup>* = [4, 8, 16, 32, 64] induces *Bs<sup>s</sup>* = [64, 32, 16, 8, 4] (all units in pixels) if the input image is square with a width of 256 pixels. The RPN targets generated from a collection of anchor boxes form an extra input for the RPN. These predetermined boxes are designed for each feature map with base scales *RPNas<sup>s</sup>* linked to the feature map shapes *Bs<sup>s</sup>*, and a collection of ratios *RPNar* is applied to these *RPNas<sup>s</sup>*. Finally, anchors are generated at each pixel of each feature map with a stride of *RPNas<sup>t</sup>*. Figure 4 explains the generation of anchors for one feature map. In total, with *R<sup>l</sup>* the number of RPN anchor ratios introduced, the total number of anchors generated, *nba*, is defined as

$$nb_a = \sum_i \text{int}\left(\frac{Bs_s[i]}{RPNas_t}\right)^2 \times R_l \tag{1}$$

**Figure 4.** Anchor generation for the *i*-th feature map *FM<sup>i</sup>* feeding the RPN, of shape *Bss*[*i*]. *FM<sup>i</sup>* is related to anchors of corresponding anchor scale *RPNass*[*i*], on which the collection of ratios *RPNar* is applied to generate *R<sup>l</sup>* (number of ratios) anchors at each pixel location obtained from *FM<sup>i</sup>* with stride *RPNast*.
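As a concrete check of Equation (1), the following Python sketch computes *nba* for the example configuration given above (*Bs<sup>s</sup>* = [64, 32, 16, 8, 4] on a 256-pixel tile); the anchor stride of 1 and the three ratios are illustrative values, not the settings reported in Table 3.

```python
def total_anchors(feature_map_sizes, rpn_anchor_stride, n_ratios):
    """Equation (1): nb_a = sum_i int(Bs_s[i] / RPNas_t)^2 * R_l."""
    return sum(int(size / rpn_anchor_stride) ** 2 * n_ratios
               for size in feature_map_sizes)

# Bs_t = [4, 8, 16, 32, 64] on a 256 x 256 image gives Bs_s = [64, 32, 16, 8, 4];
# with an illustrative anchor stride of 1 and 3 anchor ratios:
print(total_anchors([64, 32, 16, 8, 4], rpn_anchor_stride=1, n_ratios=3))  # 16368
```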

All the coordinates are computed in the original input image pixel coordinate system. Not all of the *nb<sup>a</sup>* anchors contain an object of interest, implying that only the anchors matching best with the groundtruth bounding boxes will be selected. This matching process is carried out by computing the Intersection over Union (IoU) between anchor boxes and groundtruth bounding box locations. If *IoU* > 0.7, the anchor is classified as positive, if 0.3 < *IoU* ≤ 0.7 as neutral, and finally if *IoU* ≤ 0.3 as negative. Then, the collection is resampled so that positive anchors do not account for more than half of *RPNtapi*, the share of the total *nb<sup>a</sup>* anchors kept to train the RPN. Eventually, the RPN targets have two components for each image: a vector which states whether each of the *nb<sup>a</sup>* anchors is positive, neutral or negative, and the delta coordinates between groundtruth boxes and positive anchors among the *RPNtapi* selected anchors used to train the RPN. It is essential to note that only *mGTi* groundtruth instances are kept per image to avoid training on images with too many objects to detect. This parameter is important for training on natural scene images composing the COCO dataset as they might contain an overwhelming number of overlapping objects. Dimensions of the targets for one image are [*nba*] and [*RPNtapi*,(*dy*, *dx*, log(*dh*), log(*dw*))], where *dy* and *dx* are the normalised distances between the centres of groundtruth and anchor boxes, whereas log(*dh*) and log(*dw*) are the logarithms of the height and width ratios, respectively. Finally, the RPN is an FCN aiming at predicting these targets.
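For illustration, a minimal numpy sketch of the (*dy*, *dx*, log(*dh*), log(*dw*)) encoding between a positive anchor and its matched groundtruth box is given below; it follows the standard Faster R-CNN box refinement convention and is not copied from the implementation of [48].

```python
import numpy as np

def rpn_bbox_deltas(anchor, gt_box):
    """Delta targets (dy, dx, log(dh), log(dw)) between a positive anchor and its
    matched groundtruth box, both given as (y1, x1, y2, x2) pixel coordinates."""
    a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
    a_cy, a_cx = anchor[0] + 0.5 * a_h, anchor[1] + 0.5 * a_w
    g_h, g_w = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    g_cy, g_cx = gt_box[0] + 0.5 * g_h, gt_box[1] + 0.5 * g_w
    return np.array([(g_cy - a_cy) / a_h,  # dy, normalised by anchor height
                     (g_cx - a_cx) / a_w,  # dx, normalised by anchor width
                     np.log(g_h / a_h),    # log height ratio
                     np.log(g_w / a_w)])   # log width ratio
```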

#### 2.2.3. Proposal Layer

The Proposal Layer is plugged onto the RPN and does not consist of a network but of a filtering block which only keeps relevant suggestions from the RPN. As stated in the previous section, the RPN produces scores for each of the *nb<sup>a</sup>* anchors with the probability of being characterised as positive, neutral or negative, and the Proposal Layer begins by keeping the highest scores to select the best *pNMSl* anchors. Predicted delta coordinates from the RPN are coupled to the selected *pNMSl* anchors. Then, the Non-Maximum Suppression (NMS) algorithm [50] is carried out to prune away predicted RPN boxes overlapping with each other: if two boxes among the *pNMSl* overlap by more than *RPN*\_*NMSt*, the box with the lowest score is discarded. Finally, the top *pNMSrtr* boxes for the training phase and *pNMSrinf* for the inference phase are kept based on their RPN score.
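The pruning step can be summarised by the following greedy NMS sketch (plain numpy, written for clarity rather than taken from [48]); boxes are visited in decreasing score order and any remaining box overlapping a kept one by more than the threshold is discarded.

```python
import numpy as np

def non_max_suppression(boxes, scores, threshold):
    """Greedy NMS on boxes given as (y1, x1, y2, x2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest scores first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        y1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou <= threshold]       # drop boxes overlapping the kept one
    return np.array(keep)
```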

At this stage, the training and inference paths begin to separate, although the inference path relies on blocks previously trained.

#### 2.2.4. Detection Target Layer

As seen in Figure 3, the training path after the Proposal Layer begins with the Detection Target Layer. This layer is not a network but yet another filtering step of the *pNMSrtr* Regions of Interest (ROIs) outputted by the Proposal Layer. It uses the *mGTi* groundtruth boxes to compute the overlap with the ROIs and labels each ROI as positive or negative based on the condition *IoU* > 0.5. Finally, these *pNMSrtr* ROIs are randomly subsampled to a collection of *trROIpi* ROIs while ensuring that a ratio *ROIpr* of the total *trROIpi* are positive ROIs. As the link with groundtruth boxes is established in this block and the notion of anchors is dropped, the output of this Detection Target Layer is composed of *trROIpi* ROIs with corresponding groundtruth masks, instance classes, and deltas with groundtruth boxes for positive ROIs. The groundtruth boxes are padded with 0 values for the elements corresponding to negative ROIs. These generated groundtruth features corresponding to the introduced ROI features will serve as groundtruth to train the FPN.

#### 2.2.5. Feature Pyramid Network

The Feature Pyramid Network (FPN) is composed of the Classifier and the Mask Graph. The input to these layers will be referred to as Areas of Interest (AOIs), as they are essentially a collection of regions with their corresponding pixel coordinates. The nature of these AOIs can vary between the training and inference phases, as shown in Figure 3. Both of the extensions of the FPN (Classifier and Mask) present the same succession of blocks, composed of a sequence of ROIAlign and convolution layers with varying goals. The ROIAlign algorithm, as stated at the beginning of this section, pools all feature maps from the FPN lying within the AOIs by discretising them into a set of fixed-size square pooled bins without generating pixel misalignment, unlike traditional pooling operations. After ROIAlign is applied, the input of the convolution layers is a collection of same-size squared feature maps, and the size of the batch is the number of AOIs. In the case of the Classifier, the output of these deep layers is composed of a classifier head with logits and probabilities for each item of the collection to be an object and belong to a certain class, as well as refined box coordinates which should be as close as possible to the groundtruth boxes used at this step. In the case of the Mask Graph, the output of this step is a collection of masks of fixed squared size which will later on be re-sized to the shape in pixels of the corresponding bounding box extracted in the Classifier.

In the previous Section 2.2.4, it was explained that, for the training phase, the Detection Target Layer outputs *trROIpi* AOIs and computes corresponding groundtruth boxes, class instances, and masks which are used for the training of a sequence of FPN Classifier and Mask Graph as shown in Figure 3. For the inference path, the trained FPN is used in prediction, but the FPN Classifier is firstly applied to the *pNMSrinf* AOIs extracted from the Proposal Layer, followed by a Detection Layer, as detailed in the following Section 2.2.6. This ensures the optimal choice of AOIs by only keeping *Dmi* AOIs. Finally, the masks of these AOIs are extracted by the final FPN Mask Graph in prediction mode.

#### 2.2.6. Detection Layer

This block is dedicated to filtering the *pNMSrinf* proposals coming out of the Proposal Layer based on probability scores per image per class extracted from the FPN Classifier Graph in inference mode. AOIs with probability scores under *Dmc* are discarded, and NMS is applied to remove the lower-scoring of any two AOIs overlapping by more than *DNMSt*. Finally, only the best *Dmi* AOIs are selected to extract their masks with the following block, the FPN Mask Graph.

Each of these blocks is trained with a loss chosen according to the nature of its function. Box coordinate prediction is associated with a smooth L1 loss, binary mask segmentation with a binary cross-entropy loss, and instance classification with a categorical cross-entropy loss. The Adam optimiser [51] is used to minimize these loss functions.
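For reference, the box regression and mask losses mentioned above can be written as the following numpy sketches (the actual implementation of [48] uses the equivalent Keras/TensorFlow operations):

```python
import numpy as np

def smooth_l1_loss(y_true, y_pred):
    """Smooth L1 loss for box coordinate regression: quadratic for small residuals
    (|d| < 1), linear otherwise."""
    d = np.abs(y_true - y_pred)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for the per-instance mask branch."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).mean()
```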

#### 2.2.7. Transfer Learning Strategy

Transfer learning describes the process of using a model pretrained for a specific task and retraining this model for a new task [52]. The benefits of this process are reduced training time and improved performance thanks to the domain knowledge already captured by the chosen model. Transfer learning can only be successful if the features learnt by the first model are general enough to encompass the targeted domain of the new task [53].

Deep supervised learning methods require a large amount of annotated data to produce the best results possible, especially with large architectures with several million parameters to learn such as Mask R-CNN. Therefore, the available dataset size is the main factor for a successful model achieving state-of-the-art instance segmentation. Zlateski et al. [54] advocate for 10 K *accurately labeled* images per class to obtain satisfying results on instance segmentation tasks with natural scene images. At a resolution between 1.5 cm and 2 cm, digitising 150 masks of a low density crop takes around an hour for a trained annotator. As a result, precise pixel labeling of 10 K images implies a significant amount of person-hours. However, Zlateski et al. [54] show that pre-training a CNN on a large coarsely-annotated dataset, such as the COCO dataset (200 K labeled images) [6] for natural scene images, and fine-tuning it on a smaller refined dataset of the same nature leads to better performance than training directly on a larger finely-annotated dataset. Consequently, transferring the learning from the large and coarse COCO dataset to coarse and refined smaller plant datasets makes the complex Mask R-CNN architecture portable to the new plant population task.
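In practice, with Matterport's implementation [48] this transfer amounts to loading the COCO weights while excluding the class-specific heads and then fine-tuning on the plant datasets. The sketch below illustrates the idea; `PlantConfig` (see the configuration sketch in Section 2.2.8), the dataset objects, the weight path and the number of epochs are placeholders, not the exact settings used in this work.

```python
from mrcnn import model as modellib

config = PlantConfig()  # hypothetical config subclass, sketched in Section 2.2.8
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Load COCO-pretrained weights, re-initialising the class-specific heads.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Fine-tune the heads first on the plant dataset (illustrative epoch count).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30, layers="heads")
```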

#### 2.2.8. Hyper Parameters Selection

The parametrization setup of the Mask R-CNN model is key to facilitating the training of the model and maximizing the number of correctly detected plants. It was observed that the default parameters in the original implementation of [48] would lead to poor results due to an excessive amount of false negatives. Therefore, an extensive manual search was conducted, guided by the understanding of the complex parameterization process of Mask R-CNN detailed in Section 2.2.1. Mask R-CNN is originally trained on the COCO dataset, which is composed of objects of highly varying scales that can sometimes fully overlap with each other (leading to the presence of the so-called crowd boxes). In the datasets presented in our study, most of the objects of interest have a smaller range of scales, and two plants cannot have fully overlapping bounding boxes. Regarding the scale of both potato and lettuce crops, an individual plant spans from 4 to 64 pixels at our selected resolutions. Based on these observations, optimising the selection of the regions of interest through the RPN and the Proposal Layer can be achieved by tuning the size of the feature maps *Bs<sup>s</sup>* and the scale of the anchors *RPNas<sup>s</sup>*. *Bs<sup>t</sup>* cannot be modified because the Backbone is pre-trained on COCO weights and the corresponding layers are frozen. Then, following the explanation in Section 2.2.2, the pixel size of 256 × 256 was chosen so that the sizes of the feature maps correspond to the range of scales of the plant "objects". In addition, taking into account the imagery resolution (see Section 2.1.2) and an estimation of the range of the plant drilling distance, *mGTi* can easily be inferred. We estimated that no more than *mGTi* = 128 potatoes and *mGTi* = 300 lettuces could be found in an image of 256 × 256 pixels. This estimation also allows the number of anchors per image *RPNtapi* used to train the RPN and the number of *trROIpi* in the Detection Target Layer to be set to the same value. Starting from this known estimation of the maximum number of expected plants, this bottom-up view of the architecture is key to finding a more accurate number of AOIs to keep at each block and phase for each of the crops investigated. The parameters involved are *pNMSrtr* and *pNMSrinf*. Manual variations of these parameters by steps of 100 have been attempted. Thresholds used for IoU in the NMS (*RPN*\_*NMSt*, *DNMSt*) and confidence scores (*Dmc*) can also be tuned, but the default values were kept as our tuning attempts were inconclusive.
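The parameter choices discussed above map onto Matterport's configuration variables [48] roughly as in the following sketch; the values shown are indicative, derived from the reasoning in this section (256-pixel tiles, plants of 4 to 64 pixels, at most 128 potato plants per tile), and do not reproduce Table 3.

```python
from mrcnn.config import Config

class PlantConfig(Config):
    """Illustrative Mask R-CNN configuration for the potato datasets."""
    NAME = "potato"
    NUM_CLASSES = 1 + 1                      # background + a single crop class
    IMAGE_MIN_DIM = 256                      # tiles of 256 x 256 pixels
    IMAGE_MAX_DIM = 256
    IMAGES_PER_GPU = 8                       # fits a 12 GB GPU (see Section 3)
    RPN_ANCHOR_SCALES = (4, 8, 16, 32, 64)   # RPNas_s: anchor scales spanning plant sizes
    MAX_GT_INSTANCES = 128                   # mGTi: maximum plants expected per tile
    RPN_TRAIN_ANCHORS_PER_IMAGE = 128        # RPNtapi
    TRAIN_ROIS_PER_IMAGE = 128               # trROIpi
    DETECTION_MAX_INSTANCES = 128            # Dmi
```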

#### *2.3. Computer Vision Baseline*

A computer vision baseline for plant detection is considered to compare against the plant centers predicted by the Mask R-CNN model. As a starting point of this method, the RGB imagery acquired by UAV has to be transformed to highlight the location of the center of the plants. Computing vegetation indexes to highlight image properties is a well-established strategy in the field of remote sensing. Each vegetation index presents its own strengths and weaknesses, as detailed in Hamuda et al. [55]. The Colour Index of Vegetation Extraction (CIVE) [56] is presented as a robust vegetation index dedicated to green plant extraction in RGB imagery and defined as

$$CIVE = 0.441R - 0.811G + 0.385B + 18.78745 \tag{2}$$

where R, G and B respectively stand for the red, green and blue channels defined between 0 and 255. In order to have a normalised index, all channels are individually divided by 255. Segregating between vegetation and soil pixels in an automated manner is possible using an adaptive threshold such as the Otsu method [27]. In this classification approach, foreground and background pixels are separated by finding the threshold value which minimizes the intra-class intensity variance. By applying Otsu thresholding to the CIVE vegetation index, a binary soil/vegetation map is obtained, unaltered by highly varying conditions of illumination and shadowing. The Laplacian-filtered CIVE map with masked soil pixels (*L f C*) highlights regions of rapid intensity change. Pixels of potato and lettuce plants in this map are expected to show minimal change close to the plant center, due to the homogeneous and isotropic visual appearance of the plants. Therefore, finding the local minima of the *L f C* map should output the geolocation of these plant centers. This is why all *L f C* pixels are tested as geo-centers of a fixed-size disk window and are only considered as the center of a plant if their value is the minimum of all pixels located within the window area. This framework is summarized in Figure 5.

**Figure 5.** Computer vision based plant detection framework. A CIVE vegetation index is computed from a UAV RGB image before applying Otsu thresholding to discriminate pixels belonging to vegetation and soil. Laplacian filtering is applied to the soil-masked CIVE map and, finally, a disk window is slid over the latter to extract local minima, corresponding to the centers of the plants.
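A possible implementation of this baseline, using OpenCV and SciPy, is sketched below; the window radius is an illustrative value and the disk window is approximated by a square neighbourhood for simplicity.

```python
import cv2
import numpy as np
from scipy.ndimage import minimum_filter

def detect_plant_centres(bgr_image, window_radius=15):
    """CIVE index, Otsu soil/vegetation split, Laplacian filtering and
    local-minima extraction, following the pipeline of Figure 5."""
    b, g, r = [c.astype(np.float32) / 255.0 for c in cv2.split(bgr_image)]
    cive = 0.441 * r - 0.811 * g + 0.385 * b + 18.78745   # Equation (2)

    # Otsu threshold on the rescaled CIVE map; vegetation has the lower CIVE values.
    cive_u8 = cv2.normalize(cive, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, veg_mask = cv2.threshold(cive_u8, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Laplacian of the CIVE map with soil pixels masked out (the LfC map).
    lfc = cv2.Laplacian(cive, cv2.CV_32F)
    lfc[veg_mask == 0] = np.inf

    # A pixel is a plant centre if it is the minimum within its window.
    local_min = minimum_filter(lfc, size=2 * window_radius + 1)
    ys, xs = np.where((lfc == local_min) & np.isfinite(lfc))
    return list(zip(xs, ys))
```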

#### *2.4. Evaluation Metrics*

Metrics dedicated to the evaluation of the performance of algorithms in computer vision are strongly task and data dependent. In this paper, our first interest is to compare different strategies of transfer learning using various datasets adopting a data-hungry Mask R-CNN model for the instance segmentation task. A plant is considered as a true positive if the intersection over union of the true mask **M<sup>t</sup>** and a predicted one **M<sup>p</sup>** is over a certain threshold **T**, i.e., |**M<sup>t</sup>** ∩ **M<sup>p</sup>**| / |**M<sup>t</sup>** ∪ **M<sup>p</sup>**| > **T**. Varying this threshold from 0.5 to 0.95 in steps of 0.05 widens the definition of a detected plant, and averaging the precision over this set of thresholds defines the mean Average Precision (mAP).
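The mask-level IoU criterion can be written compactly as follows (a plain numpy sketch; the matching of predicted to true instances and the precision averaging are omitted):

```python
import numpy as np

# IoU thresholds used for the mAP computation: 0.5, 0.55, ..., 0.95
MAP_THRESHOLDS = np.arange(0.5, 1.0, 0.05)

def mask_iou(mask_true, mask_pred):
    """Intersection over union between two boolean masks M_t and M_p."""
    intersection = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return intersection / union if union else 0.0
```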

Detection evaluation differs from the individual segmentation metric as the output of the algorithm is a collection of points. The Multiple Object Tracking Benchmark [57] comes with a measure which covers three error sources: false positives, missed plants, and identity switches. A distance map between predicted and true locations of the plants is computed to determine the possible pairs and derive **m** the number of misses, **f<sup>p</sup>** the number of false positives, **mm** the number of mismatches, and **g** the number of objects, which allows for computing the multiple object tracking accuracy (MOTA)

$$\text{MOTA} = 1 - \frac{m + f_p + mm}{g} \tag{3}$$

The MOTA metric presents the advantage of synthesizing the monitoring of error sources, while the conventional Precision and Recall metrics allow for more straightforward interpretability. With **t<sup>p</sup>** being the number of true positives, Precision and Recall are defined as follows:

$$\text{Precision} = \frac{t_p}{t_p + f_p} \tag{4}$$

$$\text{Recall} = \frac{t_p}{g} \tag{5}$$
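Given the matched detections, Equations (3)–(5) reduce to simple counts, as in the sketch below:

```python
def detection_metrics(misses, false_positives, mismatches, true_positives, n_objects):
    """MOTA, Precision and Recall as defined in Equations (3)-(5)."""
    mota = 1.0 - (misses + false_positives + mismatches) / n_objects
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / n_objects
    return mota, precision, recall
```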

#### **3. Results**

All models have been trained using an Nvidia TITAN X GPU with 12 GB of memory, which typically allows a maximum batch size of eight images of 256 × 256 × 3 dimensions. We did not use batch normalization due to the performance losses demonstrated on small batches [58].

Live data augmentation was performed on the training images with vertical and horizontal flips to artificially increase the training set size. Training a model without data augmentation can lead to overfitting and trigger an automated early stopping of the training phase.
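The flip-only augmentation can be expressed with the imgaug library supported by Matterport's training loop [48]; the 0.5 probabilities below are an assumption, not a reported setting.

```python
import imgaug.augmenters as iaa

# Vertical and horizontal flips applied on the fly during training.
augmentation = iaa.Sequential([
    iaa.Fliplr(0.5),   # horizontal flip with probability 0.5
    iaa.Flipud(0.5),   # vertical flip with probability 0.5
])
# Passed to training as: model.train(..., augmentation=augmentation)
```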

Datasets used for training phases were split into 80% for training and 20% for validation to evaluate the model at the end of each training epoch.

Hyper parameters for each of the following models have been set in accordance with the method stated in Section 2.2.8. Table 3 states the different settings used for training on each dataset described in Table 2.


**Table 3.** Derived parameters for each Mask R-CNN model thanks to the architecture examination and hyper parameters selection (see Sections 2.2.1 and 2.2.8).
