#### *2.2. UAV Data Collection*

A DJI Matrice 600 Pro unmanned aerial vehicle (UAV) platform (Figure 2) was used with a Zenmuse X5R camera to capture aerial imagery. To collect data covering different crop growth stages and illumination conditions, the images from study site 1 (shown at the top in Figure 1) were collected on 2 July 2018, whereas the images from study site 2 (shown at the bottom in Figure 1) were collected on 12 July 2018. The flight altitude in both cases was 20 m above ground level. The Zenmuse X5R is a 16-megapixel camera with a 4/3" sensor and a 72 degree diagonal field of view. The captured images are 4608 × 3456 pixels in three bands: red, green, and blue. To develop an economical solution, this study uses only RGB imagery. At a 20 m altitude, for the given sensor specifications, the spatial resolution of the output image is 0.5 cm/pixel. DJI Ground Station Pro software was used for flight control. Common weed species at the experimental site were waterhemp (Amaranthus tuberculatus), Palmer amaranth (Amaranthus palmeri), common lambsquarters (Chenopodium album), velvetleaf (Abutilon theophrasti), and foxtail species such as yellow and green foxtails. The weeds were naturally infesting the crop and forming patches. The two data collections were performed 45 to 50 days after soybean planting and 15 to 20 days after post-emergence herbicides were applied in most treatments, except in plots where only pre-emergence herbicides were applied and in non-treated control plots. Soybean was at the V6 (six trifoliate) to R2 (full flowering) growth stage.
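The stated spatial resolution can be cross-checked from the flight altitude, the diagonal field of view, and the image dimensions. The following is a minimal sketch that assumes a nadir-pointing camera over flat ground and no lens distortion; the function name is illustrative and not part of the study's workflow.

```python
import math

def ground_sampling_distance_cm(altitude_m, diagonal_fov_deg, width_px, height_px):
    """Approximate ground size of one pixel (cm) for a nadir-pointing camera."""
    # Ground distance spanned by the image diagonal at this altitude.
    ground_diagonal_m = 2.0 * altitude_m * math.tan(math.radians(diagonal_fov_deg) / 2.0)
    # Number of pixels along the image diagonal.
    diagonal_px = math.hypot(width_px, height_px)
    return 100.0 * ground_diagonal_m / diagonal_px

gsd_cm = ground_sampling_distance_cm(20.0, 72.0, 4608, 3456)
print(f"{gsd_cm:.2f} cm/pixel")
```

Under these assumptions the result is close to the 0.5 cm/pixel reported for the 20 m flights.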


**Figure 1.** Study area at South Central Ag Laboratory in Clay Center, NE.

**Figure 2.** DJI Matrice 600 Pro UAV platform with Zenmuse X5R camera.


#### *2.3. Data Annotation and Processing*

The objective of the study is to develop a weed detection system with on-farm data processing capability. Since mosaicking overlapping aerial images is a time-consuming step in the workflow and is not required in this case, overlapping images were removed, and only the non-overlapping raw images were retained. The raw images are too large to fit in memory for processing, so each raw image of size 4608 × 3456 pixels was sliced into 12 sub-images of size 1152 × 1152 pixels. The weed areas in each sub-image were annotated as rectangular bounding boxes using the Python labeling tool LabelImg [53]. Only one annotator was involved in the labeling process. The annotator was trained to draw rectangular bounding boxes around weed patches. For weed patches with complex shapes, multiple rectangular bounding boxes were drawn to cover the patch. A total of 450 sub-images were annotated manually and were then randomly split into 90% training images and 10% test images.
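The slicing and splitting arithmetic described above can be sketched as follows. This is only an illustration: the helper name `tile_origins` and the random seed are assumptions, and the actual tooling used for slicing is not stated in the text.

```python
import random

def tile_origins(width, height, tile):
    """Top-left (x, y) corners of non-overlapping square tiles covering the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]

origins = tile_origins(4608, 3456, 1152)   # 4 columns x 3 rows of sub-images

# Random 90/10 split of the 450 annotated sub-images (indices only).
indices = list(range(450))
random.Random(0).shuffle(indices)          # the seed 0 is an arbitrary assumption
n_train = len(indices) * 9 // 10
train_ids, test_ids = indices[:n_train], indices[n_train:]
print(len(origins), len(train_ids), len(test_ids))
```

With these numbers, each raw image yields 12 sub-images, and the split gives 405 training and 45 test sub-images.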

#### *2.4. Patch Based CNN*

Convolutional neural networks (CNNs) are feedforward artificial neural networks in which the fully connected layers in the hidden layers are replaced with convolutional filters. This reduces the number of parameters in each layer and enables CNNs to learn spatial patterns in images and other two-dimensional data. The advantage of a CNN is its ability to learn features by itself, which avoids the time-consuming hand engineering of features needed by other computer vision algorithms. CNN architectures, and their use in applications such as document recognition with backpropagation-based training, were proposed and studied much earlier [54]. However, their applications were limited by the need for very large datasets to train the large number of parameters in deep networks, and by the computational cost of training. In the last decade, with advancements in parallel processing capabilities using graphical processing units and the increased availability of large datasets, Krizhevsky et al. [36] showed the potential of CNNs in complex multiclass image classification tasks. However, in most cases, there were not enough data available to train a deep CNN from scratch. Transfer learning helped overcome this limitation. Transfer learning is the technique of taking the weights of networks such as AlexNet or GoogLeNet that were pre-trained on very large datasets and retraining them with small datasets for other applications [55]. This has been found to lead to exceptional classification performance, and one hypothesized explanation is that the features learned in the initial convolutional layers are global features common across various image classification tasks. Several studies have looked at the application of neural networks for weed detection, such as [28,56].

In this study, a pre-trained network called Mobilenet v2 has been used for transfer learning [57]. Mobilenet v2 was developed primarily for use in mobile devices with limited memory. Hence, in order to reduce the number of parameters, each convolutional block of Mobilenet v2 starts with an expansion layer, a convolution with a kernel of window size 1 that increases the number of channels in the input. This is followed by a depthwise convolutional layer, which applies a single convolutional filter per input channel, and then by a projection layer that consists of a convolutional kernel of window size 1. This final 1 × 1 convolutional layer is called the pointwise layer. It reduces the number of channels in the output, thereby reducing the number of parameters in the next convolutional block. Hence, in each block, feature maps are projected to a high-dimensional space, higher-dimensional features are learned in the depthwise convolutional layer, and the result is then encoded by the pointwise convolutional projection layer. The Mobilenet v2 network was trained on the ImageNet dataset containing 1.4 million images belonging to 1000 classes [57]. This network was then fine-tuned using the training patches belonging to both classes in this study. Initially, for the first 10 epochs, only the classifier layer of the network was trained, with the weights of all other layers frozen. This was done to reuse the global features learned on the ImageNet dataset and to fit the classifier to this specific application. After this, fine-tuning was performed in which all the top layers were unfrozen to allow the network to adapt to this specific application. The fine-tuning was performed for 10 epochs and, hence, the model was trained for only 20 epochs in total [58].
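To illustrate why the depthwise and pointwise layers keep the parameter count low, the sketch below counts the weights in one such block and compares the depthwise plus projection pair with a single full 3 × 3 convolution over the same channels. The channel width of 32, the expansion factor of 6, and the 3 × 3 depthwise kernel are assumptions chosen for illustration; biases and normalization parameters are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

def depthwise_params(kernel, channels):
    """Weights in a depthwise convolution: one kernel x kernel filter per channel."""
    return kernel * kernel * channels

c_in, c_out = 32, 32          # hypothetical block width
expansion = 6                 # assumed expansion factor
hidden = c_in * expansion     # channels after the 1 x 1 expansion layer

expand = conv_params(1, c_in, hidden)
depthwise = depthwise_params(3, hidden)
project = conv_params(1, hidden, c_out)
block_total = expand + depthwise + project

# Same spatial filtering and projection done with one full 3 x 3 convolution.
full_conv = conv_params(3, hidden, c_out)
separable = depthwise + project
print(block_total, separable, full_conv)
```

For these assumed sizes, the depthwise and projection layers together need 7872 weights, compared with 55,296 for the full convolution.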

#### *2.5. Object Detection Models*

An object in computer vision refers to a connected, single element present in the image. Object detection is defined as the problem of finding the class of an object and also localizing it in the image [59]. Hence, for every object in the image, the model is expected to regress the coordinates of the bounding box of the object in addition to the class probabilities for classification. Two different models have been investigated, Faster RCNN and SSD, both with Inception v2 as the feature extractor. Faster RCNN and SSD were chosen since Faster RCNN was found to have better performance, whereas SSD was found to have better speed [60]. Several different models trained on the ImageNet dataset, such as Inception v2 [61], Mobilenet v2 [57], Resnet 101 [62], and VGG 16 [63], can be used as feature extractors for transfer learning. Of these, Inception v2 and Mobilenet v2 have been found to be the fastest in terms of inference speed [60]. The objective was to develop a weed detection system with on-farm real-time data processing capabilities. Since Inception v2 performs better than Mobilenet v2 on object detection tasks at a similar inference speed, Inception v2 was chosen as the feature extractor [60].

#### 2.5.1. Faster RCNN

Faster RCNN is a region proposal method-based object detection algorithm. Region-based CNN (R-CNN) was the first region proposal method-based model [64]. However, it was computationally expensive since CNN based feature extraction has to be performed for each proposed region. Fast RCNN was proposed to reduce the computational time by sharing convolutional features across the region proposals [65]. To improve the speed, Faster RCNN was proposed with fully convolutional Region Proposal Networks (RPN) that are trained to propose better object regions [66]. The Faster RCNN model consists of four sections: the feature extractor, the region proposal network, Region of Interest (RoI) pooling, and classification (as shown in Figure 3).

For feature extraction, the convolutional layers from Inception v2 were used. The advantage of the Inception v2 network is its use of wider networks with filters of different kernel sizes in each layer, which makes it translation and scale invariant. The Inception v2 architecture outputs a reduced-dimensional feature map for the region proposal layer. The region proposal network is defined by anchors, or fixed bounding boxes, at each location. At each location, anchors of different scales and aspect ratios are defined, thereby enabling the region proposal network to make scale-invariant proposals. The region proposal layer uses a convolutional filter on the feature map to output a confidence score for two classes: object and background. This is called the objectness score. Furthermore, the convolutional filter outputs regression offsets for the anchor boxes. Hence, assuming there are k anchors at a location, the convolutional filter in the region proposal network outputs 6k values, namely 4k coordinates and 2k scores. Two losses are calculated from this output: the classification loss and the bounding box regression loss. The bounding box coordinates of anchors classified as objects are then combined with the feature map from the feature extractor. In the RoI pooling layer, bounding box regions of different sizes and aspect ratios are resized to fixed-size outputs using max pooling. A pooling layer is a downsampling layer; in max pooling, each output value is the maximum of the pixels in its window [36]. The max-pooled feature map of a fixed size corresponding to each output is then classified, and its bounding box offsets with respect to ground truth boxes are regressed. Hence, as in the region proposal layer, two losses are computed at this output, namely the classification loss and the bounding box regression loss.
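The RoI pooling step described above, in which regions of different sizes are reduced to a fixed-size output by max pooling, can be sketched in plain Python. This is an illustrative simplification on a single-channel feature map with integer box coordinates; the function name and the bin-splitting rule are assumptions and do not reproduce the framework's implementation.

```python
import math

def roi_max_pool(feature_map, box, out_h, out_w):
    """Max-pool the region box = (y0, x0, y1, x1) of a 2-D feature map
    (list of rows) into a fixed out_h x out_w grid. y1 and x1 are exclusive."""
    y0, x0, y1, x1 = box
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil keeps every bin non-empty.
        r0 = y0 + (i * h) // out_h
        r1 = y0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = x0 + (j * w) // out_w
            c1 = x0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
square = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)   # 4 x 4 region
wide = roi_max_pool(feature_map, (1, 0, 3, 4), 2, 2)     # 2 x 4 region
print(square, wide)
```

Both regions, although of different shapes, produce a 2 × 2 output that a fixed-size classifier could consume.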

**Figure 3.** Faster RCNN architecture.

#### 2.5.2. Hyperparameters of the Architecture

In the framework that was used, the input images to the Faster RCNN network were resized to a fixed size of 1024 × 1024 pixels. At each location in the region proposal layer, 4 different scales (0.25, 0.5, 1.0, and 2.0) and 3 different aspect ratios (0.5, 1.0, and 2.0) were used. Hence, in total, there were 12 anchors at each location. The model was trained for 25,000 epochs with a batch size of 1 using the stochastic gradient descent with momentum optimizer. The training dataset was split into training and validation datasets, and the performance of the model on the validation data was continuously monitored during training to check whether the model started to overfit. Random horizontal flip and random crop operations were performed to augment the training data; since the crop rows in the collected data were always parallel to the horizontal axis of the image, these operations preserve that orientation.
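The anchor count and the size of the region proposal output follow directly from the scales and aspect ratios listed above. The sketch below enumerates the anchor shapes; the base anchor size of 256 pixels and the convention that aspect ratio means width divided by height are assumptions, since the text does not state them.

```python
import math

SCALES = (0.25, 0.5, 1.0, 2.0)
ASPECT_RATIOS = (0.5, 1.0, 2.0)
BASE_SIZE = 256  # assumed base anchor side in pixels; not stated in the text

def anchor_shapes(scales, aspect_ratios, base_size):
    """Return (width, height) for every scale / aspect-ratio combination.
    Aspect ratio is taken as width / height, and each anchor keeps the
    area (scale * base_size) ** 2."""
    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:
            root = math.sqrt(ratio)
            shapes.append((scale * base_size * root, scale * base_size / root))
    return shapes

anchors = anchor_shapes(SCALES, ASPECT_RATIOS, BASE_SIZE)
k = len(anchors)              # anchors per location
rpn_outputs = 4 * k + 2 * k   # box offsets plus object/background scores
print(k, rpn_outputs)
```

With 12 anchors per location, the region proposal filter emits 72 values per location, which matches the 6k rule with k = 12.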

#### 2.5.3. Single Shot Detector


The Single Shot Detector (SSD) (Figure 4) model was proposed to improve the inference time of object detection models that use a region proposal network, such as Faster RCNN. The main difference between SSD and Faster RCNN is that SSD generates detection outputs without a separate region proposal layer. Similar to Faster RCNN, SSD uses a feature extractor, which is the Inception v2 architecture in this case. At each location of the feature map output, the model outputs a set of bounding boxes of different scales and aspect ratios. This is very similar to Faster RCNN, the difference being that the convolutional filter on the feature map directly outputs the confidence scores corresponding to the output classes along with the regression box offsets. Hence, the class and bounding box offsets are output in a single shot, as the name suggests. For the model to be scale and translation invariant, rather than outputting bounding boxes from only the feature map, extra feature layers are added to the feature map output, and detection boxes are output at a different scale from each of them. Hence, in total, the SSD model has 6 layers that output detection boxes at different scales [67].


**Figure 4.** Single Shot Detector (SSD) architecture.

#### 2.5.4. Hyperparameters of the Architecture


In the case of SSD, in the framework that has been used, the input images are always reshaped to a fixed dimension of 300 × 300 pixels. After the feature extraction, the 6 layers that output detection boxes used 6 different scales in the range 0.2–0.95. Boxes with five different aspect ratios, namely 1.0, 2.0, 0.5, 3.0, and 0.333, were generated at each location. The model was trained for 25,000 epochs, as in the case of Faster RCNN. A batch size of 24 and the RMSProp optimizer were used in training. Data augmentation was applied with random horizontal flipping and random cropping of images. Validation images were, again, evaluated periodically during training to check whether the model was overfitting.
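A short sketch of how 6 scales in the range 0.2–0.95 and the five aspect ratios translate into box shapes is given below. It assumes the scales are linearly spaced across the detection layers and that a box of scale s and aspect ratio r has width s·√r and height s/√r relative to the input size; the text does not confirm either detail, so this is an illustration rather than the framework's exact behaviour.

```python
import math

def ssd_scales(s_min, s_max, num_layers):
    """Linearly spaced box scales, one per detection layer (assumed spacing)."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_box_shapes(scale, aspect_ratios):
    """(width, height) of each box as a fraction of the input image side."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]

ASPECT_RATIOS = (1.0, 2.0, 0.5, 3.0, 0.333)
scales = ssd_scales(0.2, 0.95, 6)
boxes_per_layer = [default_box_shapes(s, ASPECT_RATIOS) for s in scales]

for s in scales:
    # Side of the square (aspect ratio 1.0) box in pixels on a 300 x 300 input.
    print(f"scale {s:.2f} -> {s * 300:.0f} px")
```

Under the linear-spacing assumption the scales step by 0.15 from one detection layer to the next, so the early layers cover small weed patches and the later layers cover patches that nearly fill the 300 × 300 input.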

#### *2.6. Hardware and Software Used*

The models were trained and evaluated on a computer with an Intel i9 processor with 18 cores, 64 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti graphics card. The Tensorflow object detection API [61] in Python was used to train and evaluate Faster RCNN and SSD. The Tensorflow tutorial on transfer learning [58] was used to train the MobileNet v2 architecture for the patch-based CNN.


#### *2.7. Evaluation Metrics*

Precision, recall, f1 score, and Intersection over Union (IoU) are the evaluation metrics used in this study.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{1}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{2}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

Here TP refers to true positives, FP to false positives, and FN to false negatives. Mean Average Precision (mAP) is another metric that is commonly used in object detection problems [59,68]. It is the mean of the average precision over all recall values, computed at IoU thresholds between prediction and ground truth boxes ranging from 0.5 to 0.95. It should be noted that these metrics were primarily formulated for object detection. Even though object detection models are used in this study, the objective is not to find individual weed objects but rather all of the area covered by weeds for management purposes. In a deep learning-based object detection model, multiple objects with their bounding boxes are predicted. Of these, only the boxes whose IoU with the ground truth is greater than the IoU threshold and whose class score (the probability of that object being in each class) is greater than the confidence threshold are considered positive prediction boxes. Among these, only the box with the highest class score is considered the true positive, and the other positive boxes are considered false positives. In our case, for a weed patch that is marked as one ground truth box, the model might output multiple positive weed boxes corresponding to that ground truth box. However, only one of those would be considered a true positive and the other boxes would be false positives. As can be seen in Figure 5, the output for this image has two prediction boxes covering the weed area on the left, whereas in the ground truth it was marked as one bounding box. Hence, if precision is used as the evaluation metric, the box on the bottom will be regarded as a false positive even though that box adds to the weed area being detected. Therefore, the Intersection over Union (IoU) of the binary output image representing weed and background pixels with the ground truth binary image is used as the primary evaluation metric. The binary images corresponding to the prediction outputs and the ground truth are obtained by setting pixels inside weed boxes to 1 and all other pixels to 0. The intersection and union of the two binary images are then used to find the IoU ratio. Hence, IoU here represents the ratio between the intersection and the union of all positive prediction boxes (true positives and false positives in object detection terms) and all ground truth boxes in an image.

$$\text{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}} \tag{4}$$
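The image-level IoU of Equation (4) can be sketched by rasterizing both sets of boxes into binary masks. The example below is a minimal illustration using pixel sets on a small hypothetical image; the function names, the box coordinates, and the convention for two empty masks are assumptions, and a practical implementation would more likely use arrays.

```python
def rasterize(boxes):
    """Set of (row, col) pixels covered by boxes given as (y0, x0, y1, x1),
    with y1 and x1 exclusive."""
    pixels = set()
    for y0, x0, y1, x1 in boxes:
        pixels.update((r, c) for r in range(y0, y1) for c in range(x0, x1))
    return pixels

def mask_iou(pred_boxes, truth_boxes):
    """IoU of the binary weed masks implied by two sets of boxes."""
    pred, truth = rasterize(pred_boxes), rasterize(truth_boxes)
    union = pred | truth
    if not union:
        return 1.0  # assumed convention when neither mask has weed pixels
    return len(pred & truth) / len(union)

truth = [(0, 0, 40, 40)]                    # one annotated weed patch
preds = [(0, 0, 22, 40), (18, 0, 40, 40)]   # detected as two overlapping boxes
iou = mask_iou(preds, truth)
print(iou)
```

In this hypothetical case the two predicted boxes together cover the annotated patch exactly, so the mask IoU is 1.0 even though box-level matching would count one of them as a false positive.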

**Figure 5.** Example output image showing a weed patch annotated with a single box in the ground truth image but detected as two boxes in the output. This will lead to lower precision as only the bigger box is considered a true positive, and therefore IoU is a better evaluation metric for this problem.

To evaluate the patch-based CNN on the sub-image, an overlap slicing approach is used. The sub-image of size 1152 × 1152 pixels is sliced into patches of size 128 × 128 pixels with a stride of 32 pixels in the horizontal and vertical directions. Therefore, the sliced patches have 75% horizontal and vertical overlap. Hence, each small area of size 32 × 32 pixels is part of several overlapping patches (up to 4 along each axis for this patch size and stride), and the class with the maximum votes from those patches is assigned as the class of the small area. To evaluate this result against the ground truth and to compare it with the results of Faster RCNN and SSD, IoU is used as the evaluation metric.
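The overlap slicing and voting scheme can be sketched as follows. The patch labels here come from a made-up rule standing in for the CNN's predictions, and sending ties to the background class is an assumption of this sketch; neither is taken from the study.

```python
from collections import Counter

IMAGE, PATCH, STRIDE = 1152, 128, 32
CELLS = IMAGE // STRIDE                  # 36 cells of 32 x 32 pixels per axis
SPAN = PATCH // STRIDE                   # a patch covers 4 cells per axis
STARTS = (IMAGE - PATCH) // STRIDE + 1   # 33 patch positions per axis

def vote_cells(patch_labels):
    """patch_labels[(i, j)] is the predicted class (0 background, 1 weed) of the
    patch whose top-left cell is (i, j). Returns a CELLS x CELLS label grid."""
    votes = [[Counter() for _ in range(CELLS)] for _ in range(CELLS)]
    for (i, j), label in patch_labels.items():
        for r in range(i, i + SPAN):
            for c in range(j, j + SPAN):
                votes[r][c][label] += 1
    # Ties go to background here, which is an assumption of this sketch.
    return [[1 if v[1] > v[0] else 0 for v in row] for row in votes]

# Hypothetical classifier output: weed wherever the patch starts in the top rows.
labels = {(i, j): int(i < STARTS // 2) for i in range(STARTS) for j in range(STARTS)}
grid = vote_cells(labels)
print(len(labels), sum(map(sum, grid)))
```

Each sub-image produces 33 × 33 = 1089 patches, and the voted grid of 36 × 36 cells can be compared with the ground truth mask using the same IoU as for the detection models.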


#### **3. Results and Discussion**

#### *3.1. Training of Faster RCNN and SSD*

Figure 6 shows the training graph for Faster RCNN and SSD. The decrease in training loss and the increase in mAP on the validation data with training epochs can be seen. By the end of training, there was very little difference between the validation mAP of Faster RCNN and that of SSD. Faster RCNN converged faster than SSD. The training process of Faster RCNN might appear to oscillate more than that of SSD, which could be due to the different batch sizes and optimizers used by the two models. However, it should be noted that the scales of the two loss plots were different. The different batch size and optimizer could also be the reason for the Faster RCNN model converging to a high validation mAP earlier than SSD, since a batch size of 1 for Faster RCNN leads to 24 times more gradient updates than SSD with a batch size of 24.

*Remote Sens.* **2019**, *11*, x FOR PEER REVIEW 12 of 25

**Figure 6.** Change in training loss and validation Mean Average Precision with number of epochs of (**a**) Faster RCNN and (**b**) SSD.

#### *3.2. Optimal IoU and Confidence Thresholds for Faster RCNN and SSD*

In order to find the IoU threshold between prediction boxes and ground truth boxes that would result in the best performance of the model, precision-recall curves were drawn using confidence thresholds from 0 to 1 at IoU thresholds ranging from 0.5 to 0.95 (Figure 7). It can be seen that the area under the precision-recall curve is almost the same for Faster RCNN and SSD, which explains why the validation mAP during the final epochs, as seen from the training graph, was very similar (0.63 for Faster RCNN and 0.62 for SSD). Furthermore, both Faster RCNN and SSD achieved the maximum area under the precision-recall curve at an IoU threshold of 0.5 between the prediction box and the ground truth box. Hence, for each ground truth box, among all prediction boxes with a confidence score greater than the confidence threshold, the prediction box with the highest IoU with the ground truth box, provided that this IoU is also greater than the IoU threshold, was considered a true positive. All prediction boxes that were not a true positive for any ground truth box are regarded as false positives. The number of false negatives is equal to the number of ground truth boxes that do not have a corresponding true positive. With the optimal IoU threshold found for Faster RCNN and SSD, Figure 8 was plotted to find the confidence threshold for Faster RCNN and SSD that results in the best performance.
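The counting rule described above, together with Equations (1) to (3), can be sketched as follows. The boxes and scores are invented for illustration, and the choice to let each prediction match at most one ground truth box is an assumption of this sketch rather than a detail confirmed by the text.

```python
def box_iou(a, b):
    """IoU of two boxes given as (y0, x0, y1, x1)."""
    ih = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def count_matches(preds, truths, conf_thr, iou_thr):
    """preds: list of (box, score). Returns (TP, FP, FN)."""
    kept = [box for box, score in preds if score > conf_thr]
    used = set()
    tp = 0
    for truth in truths:
        # Best still-unmatched prediction for this ground truth box.
        best, best_iou = None, iou_thr
        for idx, box in enumerate(kept):
            overlap = box_iou(box, truth)
            if idx not in used and overlap > best_iou:
                best, best_iou = idx, overlap
        if best is not None:
            used.add(best)
            tp += 1
    return tp, len(kept) - tp, len(truths) - tp

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truths = [(0, 0, 40, 40), (60, 60, 100, 100)]
preds = [((0, 0, 40, 30), 0.9), ((0, 0, 40, 12), 0.8), ((200, 200, 220, 220), 0.3)]
tp, fp, fn = count_matches(preds, truths, conf_thr=0.6, iou_thr=0.5)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```

Sweeping `conf_thr` from 0 to 1 at a fixed `iou_thr` and recording precision and recall at each step would trace one precision-recall curve of the kind described above.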


**Figure 7.** Precision-recall curve at different thresholds for IoU of the predicted box and ground truth box (**a**) Faster RCNN and (**b**) SSD.

**Figure 8.** Change in IoU of output binary image and ground truth binary image as well as f1 score with change in recall.

Figure 8 shows the change in f1 score and the mean IoU of the output binary image of the model with the ground truth binary image with change in recall. From the figure, the recall value which results in the best IoU and f1 score was found using the peak. The recall at which the best mean IoU and f1 score were observed was around 0.7 and its corresponding confidence threshold for class scores was 0.6 in the case of Faster RCNN, and 0.1 in the case of SSD. It is to be noted that mean IoU here refers to the Intersection over Union of the whole binary model output image with the ground truth binary image whereas the IoU mentioned earlier was the Intersection over Union of individual prediction bounding boxes with individual ground truth bounding boxes.


#### *3.3. Comparison of Performance of Faster RCNN and SSD*

Table 1 shows the precision, recall, f1 score, and mean IoU between the model output binary image and the ground truth binary image, along with the inference time for a 1152 × 1152 image. The precision, recall, f1 score, and mean IoU of the two models were similar, but the SSD model was slightly faster than Faster RCNN. These results were obtained with the Faster RCNN network outputting 300 proposals from the region proposal network. However, Huang et al. [36] found that reducing the number of proposals output by Faster RCNN improves its inference time at a slight cost in precision, recall, and f1 score. Therefore, experiments were conducted to study the change in inference time, precision, recall, f1 score, and mean IoU when the number of proposal boxes from the Faster RCNN network was varied from 50 to 300; the results are plotted in Figure 9.

**Table 1.** Performance of test data in Faster RCNN and Single Shot Detector (SSD).

| Model | Precision | Recall | F1 Score | Mean IoU | Inference Time of 1152 × 1152 Image in Seconds |
|---|---|---|---|---|---|
| Faster RCNN | 0.65 | 0.68 | 0.66 | 0.85 | 0.23 |
| SSD | 0.66 | 0.68 | 0.67 | 0.84 | 0.21 |
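As a quick consistency check, the f1 score can be recomputed as the harmonic mean of precision and recall. The sketch below uses the precision and recall values reported in this paper for the two models; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in this paper.
reported = {"Faster RCNN": (0.65, 0.68), "SSD": (0.66, 0.68)}
f1 = {model: round(f1_score(p, r), 2) for model, (p, r) in reported.items()}
```

The recomputed values, 0.66 for Faster RCNN and 0.67 for SSD, agree with the reported f1 scores.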

**Figure 9.** Change in evaluation metrics and inference time of Faster RCNN model with increase in number of proposals.

The inference time of Faster RCNN increased linearly with the number of proposal boxes output from the region proposal network. Reducing the number of proposals from 300 to 200 did not change the performance of the model, but it decreased the inference time. Hence, 200 was selected as the optimal number of proposals for this dataset. At 200 proposals, the inference time of Faster RCNN was 0.21 seconds, which was the same as SSD. Under constraints in computational power, using 100 proposal boxes would result in significant compute savings with minimal loss in mean IoU. Hence, no difference in performance was found between Faster RCNN with 200 proposals and SSD in terms of the evaluation metrics used in this study. However, it should be noted that, even with the same performance metrics, Faster RCNN output weed objects with higher confidence than SSD, since the confidence threshold used for Faster RCNN was 0.6, whereas it was a very low 0.1 for SSD. Although this low threshold might result in the best performance in the current evaluation, it might affect the generalization performance of the model on a test dataset from a different location or from a field with different management practices. In such cases, the low threshold might lead to reduced precision.
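The effect of the confidence threshold can be sketched as a simple filter over detections. The detection scores and boxes below are hypothetical and only illustrate how a threshold of 0.1 admits low-confidence boxes that a threshold of 0.6 would discard.

```python
def filter_by_confidence(detections, threshold):
    """Keep detections whose class score is at least the threshold.

    detections: list of (score, box) pairs.
    """
    return [d for d in detections if d[0] >= threshold]


# Hypothetical detections for one sub-image: (score, (x_min, y_min, x_max, y_max)).
detections = [
    (0.95, (10, 20, 200, 180)),
    (0.72, (300, 40, 420, 160)),
    (0.41, (500, 500, 560, 590)),
    (0.15, (0, 900, 60, 1152)),
    (0.05, (700, 10, 730, 40)),
]

kept_at_0_6 = filter_by_confidence(detections, 0.6)  # threshold used for Faster RCNN
kept_at_0_1 = filter_by_confidence(detections, 0.1)  # threshold used for SSD
```

With these example scores, two detections survive at 0.6 and four at 0.1. The two extra boxes are the kind of low-confidence predictions that could become false positives on imagery from a different field.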

On visual observation of the outputs for all 44 test images, it was found that in 41 images both networks detected all the weed areas. Hence, in these images, the difference in IoU between the model output and the ground truth arises only from slight displacements of the bounding box boundaries. As mentioned in Section 2.7, the low values of precision, recall, and f1 score are primarily a result of the way these metrics are calculated: only one bounding box is counted as a true positive for each ground truth box, whereas for some weed areas with slight discontinuities the model outputs multiple prediction boxes. Therefore, the mean IoU of the binary output image with the binary ground truth image is the appropriate metric.

In three of the test images (shown in Figure 10), the outputs of Faster RCNN and SSD differed. In output image 1, Faster RCNN failed to detect a small strip of weed between the crop rows, which SSD detected. However, the confidence score of this weed object from SSD shows that SSD detected it only because of the very low confidence threshold set for it. In output image 2, SSD misclassified a row of soybean crops with herbicide drift injury as weeds. Moreover, in output image 3, SSD could not detect the weeds on the left vertical border of the image. Since both failure areas lie at the border of the images, this might indicate a susceptibility of the SSD model at image borders. This could be due to the architecture of SSD, which localizes and classifies objects in a single shot, unlike Faster RCNN. Another possible reason is that, by default, the API used to train both models resized the input images to 600 × 600 for Faster RCNN but to 300 × 300 for SSD. This further loss of detail in the SSD input image might have led to the misclassifications at the border. Hence, further study with the same input image resolution is needed for a fair comparison.
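The penalty that one-to-one matching imposes on fragmented detections can be illustrated with a minimal sketch. This is a greedy matching under stated assumptions, not necessarily the exact procedure used in the study: the IoU threshold of 0.5, the function names, and the example boxes are all hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def precision_recall(preds, truths, iou_threshold):
    """Greedy matching: each ground truth box can yield at most one true positive.

    preds: list of (score, box); truths: list of boxes.
    """
    matched = set()
    true_pos = 0
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue  # this ground truth box already has its true positive
            iou = box_iou(box, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical weed patch labeled with one box but detected as two adjacent boxes.
truths = [(0, 0, 10, 4)]
preds = [(0.9, (0, 0, 6, 4)), (0.8, (6, 0, 10, 4))]
result = precision_recall(preds, truths, iou_threshold=0.5)
```

Here the two predictions together cover the weed patch exactly, yet only the first counts as a true positive and the second becomes a false positive, giving a precision of 0.5 at a recall of 1.0. The image-level mean IoU does not penalize this kind of fragmentation.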

Other than the above-mentioned three images, both Faster RCNN and SSD performed exceptionally well in detecting weed objects of various scales, as seen in Figure 11. As mentioned earlier, although SSD detected all the weed objects that were detected by Faster RCNN, the confidence of many of those predictions was very low, and they ended up as true positives only because of the low confidence threshold. Since Faster RCNN can be as fast as SSD in terms of inference time when the number of proposals is reduced to 200, it can be concluded that Faster RCNN has the better speed-performance tradeoff.

#### *3.4. Comparison of Performance of Faster RCNN and Patch-Based CNN*

The Mobilenet v2 network trained on the training patches showed very high performance in classifying test patches with an f1 score of 0.98. However, in order to evaluate its performance in detecting the weed objects in the sub-image and compare its performance with the Faster RCNN object detection model, the overlapping approach explained earlier was used. Table 2 shows the mean IoU of the output binary image from Faster RCNN and patch-based CNN with the ground truth binary image. Furthermore, the table shows the time taken to evaluate one sub-image by both the models.

**Figure 10.** Output images with discrepancies between Faster RCNN and SSD and their corresponding ground truth.

**Table 2.** Performance of Faster RCNN and patch-based CNN in test sub-images.

| Model | Mean IoU | Inference Time in Seconds for Each Sub-image (1152 × 1152) |
|---|---|---|
| Faster RCNN with 200 proposals | 0.85 | 0.21 |
| Patch-based CNN sliced with overlap | 0.61 | 1.03 |
| Patch-based CNN sliced without overlap | 0.6 | 0.22 |

**Figure 11.** Example output images with good model performance.

Faster RCNN performed better than patch-based CNN with overlap, both in terms of mean IoU and inference time. However, patch-based CNN without overlap had an inference time almost the same as that of Faster RCNN. The low IoU of patch-based CNN without overlap results from the coarse nature of this algorithm. Since each sub-image was split into 81 patches in this approach, small weeds would not be detected. Furthermore, because of the way the patches were sliced, many patches could contain weeds and background in equal proportion, whereas the Mobilenet v2 model had been trained only with patches that contained only weed or only background; hence the model was prone to error in this approach.

To reduce this error, slicing with overlap was tested. Since the class of each small block within a patch was determined by majority vote over eight patches, the problem of mixed patches was solved to some extent. Still, the IoU of slicing with overlap is similar to that without overlap because the ground truth binary image represents weed objects as rectangular boxes, whereas the output binary images from the patch-based overlap approach consist of weed objects that are polygonal because of the majority vote, as can be seen in Figure 12. Therefore, patch-based CNN with overlap performs better than its IoU with the ground truth image suggests. However, the drawback of this approach is its very high inference time compared to Faster RCNN and patch-based CNN without overlap. Further studies could examine different levels of horizontal and vertical overlap and their influence on the inference time of this approach. However, since the inference time of Faster RCNN is the same as that of the patch-based CNN without overlap, any amount of overlap would require more patches to be evaluated than the non-overlap approach and hence a greater inference time. Therefore, among the approaches investigated in this study, Faster RCNN had the best overall performance. It would be interesting to study a modified Fast RCNN architecture with the region proposal part replaced with an image analysis method that selects polygons. This could achieve faster computational speed as well as better performance for a patch-based CNN method.
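The claim that any overlap increases the number of patches to evaluate can be checked with a short calculation. In this sketch the patch size of 128 pixels is an assumption inferred from 81 non-overlapping patches in a 1152 × 1152 sub-image (9 × 9 patches), and the strides are hypothetical; the overlap configuration actually used in the study is not restated here.

```python
def patches_per_side(image_size, patch_size, stride):
    """Number of patch positions along one side of a square image."""
    return (image_size - patch_size) // stride + 1


def patch_count(image_size, patch_size, stride):
    """Total number of patches for a square image and square patches."""
    return patches_per_side(image_size, patch_size, stride) ** 2


IMAGE_SIZE = 1152  # sub-image side length in pixels, as in the study
PATCH_SIZE = 128   # assumed: 1152 / 9, consistent with 81 patches per sub-image

no_overlap = patch_count(IMAGE_SIZE, PATCH_SIZE, stride=128)   # stride equals patch size
half_overlap = patch_count(IMAGE_SIZE, PATCH_SIZE, stride=64)  # hypothetical 50% overlap
```

Under these assumptions, slicing without overlap gives 81 patches, while a 50% overlap in both directions gives 289, so the classifier has to process roughly 3.6 times as many patches per sub-image.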

**Figure 12.** Output images of patch-based CNN and Faster RCNN.

In order to implement this system for on-farm detection, further evaluation of the performance of these approaches at higher altitudes is needed. At the altitude of 20 m at which these data were collected, it is practically impossible to cover large soybean fields given the current limitations on the battery capacity of UAV systems. Therefore, the performance of these models on low-resolution images from high altitude needs to be evaluated for practical adoption of these systems. As with SSD, a higher misclassification rate was observed for patches at the border of the images. For this reason, it is suggested to collect images with some overlap, such as 15%, so that weed objects at the border of one image end up in the interior of the next image. Furthermore, the dataset used to train the models in this study was collected on only two different days. Therefore, the variation in phenological stage of the crop and the weeds and in lighting conditions is limited within the dataset. Further experiments with wide variations in lighting conditions, flight altitudes, and phenological stages are needed to analyze and compare the generalizability of these models under varying field conditions. In addition, since the bounding boxes used in this study were manually labeled by one annotator, it is possible that there is error due to observer bias. Therefore, further studies using multiple annotators to label data with more variations, as mentioned above, are needed to remove bias and study the generalizability of the model. With the manual annotation of images being a time-consuming process, multiresolution segmentation approaches from OBIA could help automate it: OBIA could generate polygon labels, from which rectangular bounding box labels can be derived for object detection tasks.
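Deriving a rectangular bounding box label from a polygon label, as suggested above for OBIA output, can be sketched in a few lines. The polygon format (a list of `(x, y)` pixel vertices) and the example coordinates are assumptions for illustration.

```python
def polygon_to_bbox(polygon):
    """Return the tightest axis-aligned box (x_min, y_min, x_max, y_max)
    enclosing a polygon given as a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical weed patch outline from a segmentation step.
weed_polygon = [(12, 40), (30, 35), (44, 52), (20, 60)]
bbox = polygon_to_bbox(weed_polygon)
```

For this outline the resulting box is `(12, 35, 44, 60)`. The box necessarily includes some background pixels around an irregular polygon, which is one reason rectangular ground truth and polygonal model output can disagree in IoU.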

#### **4. Conclusions**

In this study, Faster RCNN and SSD object detection models were trained and evaluated on UAV imagery for mid- to late-season weed detection in soybean fields. The two object detection models, Faster RCNN and the Single Shot Detector (SSD), were compared with each other and with a patch-based CNN model in terms of weed detection performance, measured by mean IoU, and inference speed.

It was found that the Faster RCNN model with 200 box proposals had a similar weed detection performance to the SSD model in terms of precision, recall, f1 score, and IoU as well as similar inference time. The precision, recall, f1 score and IoU were 0.65, 0.68, 0.66 and 0.85 for Faster RCNN with 200 proposals, and 0.66, 0.68, 0.67 and 0.84 for SSD respectively. However, the optimal confidence threshold of SSD was found to be 0.1, indicating the lower confidence of this model in the case of weed objects detected, whereas the optimal confidence threshold was found to be 0.6 in the case of Faster RCNN, meaning higher confidence in the weed objects detected. In addition, SSD was susceptible to misclassification in the border of some test images. These findings indicate that SSD might have lower generalization performance than Faster RCNN for mid- to late-season weed detection in soybean fields using UAV imagery. Hence, Faster RCNN was determined to be the better performing model among the two in this study. Between Faster RCNN and patch-based CNN, Faster RCNN had better weed detection performance than patch-based CNN with overlap as well as without overlap. The inference time of Faster RCNN was similar to patch-based CNN without overlap, but significantly less than patch-based CNN with overlap. Hence, Faster RCNN was found to be the best model in terms of weed detection performance and inference time among the different models compared in this study.

Future work can evaluate the performance variation of models in different weed species. In addition, the performance of Faster RCNN at different altitudes by resampling high-resolution images to low-resolution images can be studied. Furthermore, the inference time experiments at different altitudes should be performed on low computational power devices such as regular laptops and mini-PCs used for the flight control of UAV systems. Inference time experiments should also be performed on low cost hardware accelerators available for edge computing such as the Intel Neural Compute Stick or Google Coral. This would help understand the potential of using such devices for on-farm, near real-time data processing and actuation. In addition, the effect of model compression techniques and approximation algorithms developed for neural networks can be studied to understand the limit of edge computing for in-field near real-time weed detection. Moreover, further work can be performed on using the RTK GPS data of individual images and their corresponding IMU data to orthorectify the image and find the geolocation of the weed patches detected by the object detection models. In addition, the performance of object detection models for weed detection can be compared between raw individual images as used in this study and stitched mosaic maps. With the manual annotation of images being a laborious part of the process, using techniques such as self-supervised learning [69] and active learning [70] to reduce the amount of manual labeling for this task can be studied. Furthermore, few-shot learning algorithms can be studied to investigate the transfer learning of this algorithm to other crops and weed species by training with a few labeled instances from those crops and weed species.

**Author Contributions:** Conceptualization, Y.S. and E.P.; methodology, A.N.V.S., Y.S. and S.S.; data acquisition, A.N.V.S. and J.L.; software, analysis and evaluation, A.N.V.S.; writing—original draft preparation, A.N.V.S.; writing—review and editing, Y.S., E.P., S.S., A.J.J., J.D.L. and J.L.; project administration, Y.S.; funding acquisition, Y.S., E.P. and A.J.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Nebraska Research Initiative (NRI) Collaboration Initiative Seed Grant 2132250011, the Nebraska Corn Board, and the Nebraska Agricultural Experiment Station through the Hatch Act capacity funding program (Accession Number 1011130) from the USDA National Institute of Food and Agriculture.

**Acknowledgments:** Thanks to Jonathan Forbes for their assistance in data collection.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
