*Article* **Plant and Weed Identifier Robot as an Agroecological Tool Using Artificial Neural Networks for Image Identification**

**Tavseef Mairaj Shah \*,†, Durga Prasad Babu Nasika † and Ralf Otterpohl**

Rural Revival and Restoration Engineering (RUVIVAL), Institute of Wastewater Management and Water Protection, Hamburg University of Technology, Eissendorfer Strasse 42, 21073 Hamburg, Germany; durga.nasika@tuhh.de (D.P.B.N.); ro@tuhh.de (R.O.)

**\*** Correspondence: tavseef.mairaj.shah@tuhh.de

† These authors contributed equally to this work.

**Abstract:** Farming systems form the backbone of the world food system. The food system, in turn, is a critical component in sustainable development, with direct linkages to the social, economic, and ecological systems. Weeds are one of the major factors responsible for the crop yield gap in the different regions of the world. In this work, a plant and weed identifier tool was conceptualized, developed, and trained based on artificial deep neural networks to be used for the purpose of weeding the inter-row space in crop fields. A high-level design of the weeding robot is conceptualized and proposed as a solution to the problem of weed infestation in farming systems. The implementation process includes data collection, data pre-processing, and training and optimizing a neural network model. A selected pre-trained neural network model was considered for implementing the task of plant and weed identification. The Faster R-CNN (Region-based Convolutional Neural Network) method achieved an overall mean Average Precision (mAP) of around 31% with a learning rate hyperparameter of 0.0002. In the plant and weed prediction tests, prediction values in the range of 88–98% were observed in comparison to the ground truth. On a completely unknown dataset of plants and weeds, predictions were observed in the range of 67–95% for plants and 84–99% for weeds. In addition, a simple yet unique stem estimation technique for the identified weeds, based on bounding box localization of the object inside the image frame, is proposed.

**Keywords:** deep learning; artificial neural networks; image identification; agroecology; weeds; yield gap; environment; health

### **1. Introduction**

Growing food through agriculture involves different labor-intensive practices. Most of these practices have traditionally been performed manually. Weeding is one such agricultural practice. However, as farming has become more industrialized, or rather, as industrialized agriculture has become the leitmotif for all to emulate, different practices have evolved over time with the aim of increasing the efficiency of labor and the productivity of the land. This involved efforts to increase the efficacy of the manual practices by using mechanical and chemical aids or, in some cases, to present alternate pathways for these practices without any direct manual intervention [1].

The growth of weeds is one of the largest biotic factors contributing to the yield gap in food crops [2,3]. In South Asia, it is the single largest biotic yield gap factor in rice production systems [4,5]. It has been reported that in sugarcane cultivation, weeds reduced the crop growth at early stages and resulted in a yield loss of 27–35% [6]. In traditional farming systems, weeds have been manually removed from the crop field by hand or with a hoe. Growing intercrops in between the main crop rows is also a potential strategy to control the growth of weeds. However, the manifold rise (up to 300 times) in the use of agrochemicals in the last 50 years, to control the growth of weeds among

**Citation:** Shah, T.M.; Nasika, D.P.B.; Otterpohl, R. Plant and Weed Identifier Robot as an Agroecological Tool Using Artificial Neural Networks for Image Identification. *Agriculture* **2021**, *11*, 222. https://doi.org/10.3390/agriculture11030222

Academic Editors: Sebastian Kujawa, Gniewko Niedbała and Maciej Zaborowicz

Received: 31 December 2020 Accepted: 4 March 2021 Published: 8 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

other things, has shown a lot of negative effects on human and planetary health [7]. The incidence of herbicide resistance among certain weed populations is also a cause of concern in this regard [8,9].

It is against this backdrop that a transition to agroecology-based farming systems is being recommended internationally with an urgency never expressed before. Agroecology is the study of the ecology of food systems and the application of this knowledge to the design of sustainable farming systems. Agroecology-based alternatives include organic farming and sustainable intensification strategies like the System of Rice Intensification [10]. The problem of weeds, however, persists in some of the proposed methodologies too. For example, the proliferation of weeds is an oft-cited critique of an agroecological methodology of growing rice, the System of Rice Intensification, which involves growing rice under alternate wetting and drying conditions, with earlier transplantation and wider spacing between the rice plants [11] (Figure 1). Whereas in agrochemical-based farming the problem of weeds leads to environmental hazards due to the use of pesticides, in agroecological methodologies the practices that are suggested to counter weed proliferation are not harmful to the environment. Such practices are, however, often labor intensive [10,12].

**Figure 1.** Weeds growing in between rice crop rows (Source: Author).

The excessive use of agrochemicals like pesticides, including herbicides, has become a burning topic of discussion in the past few years, although the dangers associated with it have been discussed in the literature for a long time [13–16]. The presence of fertilizer residues in surface and groundwater and that of pesticide residues in food items has been well documented [15–20]. Their effects on human and planetary health have been detailed in different studies; the use of fertilizers and pesticides has increased manifold over the past four decades, particularly in developing countries [9,21,22]. On the other hand, lack of nutrients in the soil and pest proliferation continue to challenge farmers, leading to a decline in productivity [23,24]. For example, increased weed proliferation due to excessive use of fertilizers has resulted in yield losses in farming systems in South Asia [2,9,18,25].

In agrarian societies, secondary practices in farming, associated with plant protection, have traditionally been done with the help of manual labor, much like the primary practices, those associated with sowing, planting and harvesting. In some parts of the world, farming practices like weeding are still done manually, or were until recently. These practices have gradually been phased out to a large extent and have been replaced by the use of chemical pesticides like herbicides and weedicides. As such, the use of agrochemical pesticides has become the norm [26].

So, one of the options to reverse the ecological damage of the pesticides would be to go back to manual weeding. However, agroecology does not simply advocate going back to earlier practices; it involves going back to roots armed with new knowledge and tools [27,28]. This is the motivation behind the AI-based weed identifier robot, the concept and design of which are detailed in the following sections. An AI-trained weeding robot could play a supporting role in agroecology in this regard when designed keeping in view the needs of smallholder farms in particular. As for conventional farming, that is, the currently dominant form of agriculture, such a robot could achieve the twin goals of reducing pesticide use and controlling weed proliferation [11,26].

Different non-conventional yet non-chemical methods for weed identification and management have been proposed, thanks to the widening scope of technological advances [29]. In this regard, different technologies have been used for the task of precision weed management in agriculture, including the following:

Aerial and Satellite Remote Sensing: Aerial remote sensing technologies operate from a certain height. Here the differential spectral reflectance of the plants and weeds and the spectral resolution of the instrument (vision device) are the driving factors of identification [30]. In the case of a developing plant canopy or taller plants, such methods are hindered by the lack of, or improper, visual access to the weeds growing on the ground. In the initial stages of the cropping season as well, random stubbles or crop residues might interfere with weed identification [31]. Inaccuracies due to spectral signal mixing have also been reported in aerial weed identification, which hinder precision weed removal [32]. The major reported challenges in aircraft- and satellite-enabled remote sensing for weed management are the acquisition of imagery with high spatial and temporal resolution from higher altitudes and the acquisition of good imagery under cloudy conditions [31,32].

Unmanned Aerial Vehicles (UAVs): UAVs provide an edge over remote sensing methods as they operate from a height that is closer to the ground and provide high-resolution imagery in real-time. Images can be retrieved more frequently and largely independently of weather conditions like clouds [29]. Although UAVs provide higher resolution imagery, they are beset with limitations such as high battery use during flight time and the high processing time of the imagery [33]. The operation of UAVs like drones is also often regulated by the government, and hence their use and usability might be affected by local government regulations [34]. Huang et al. have proposed the serial application of a UAV-based weed identification system and a Variable Rate Spray (VRS) system for weed identification and management [33]. The integration of both operative functions is limited by the payload carrying capacity of the UAV. However, the two operative functions could easily be integrated into the same machine, with a much higher carrying capacity, for example, in an on-ground robotic precision weed identification and removal system.

Robotics: The increasing scope of robotic technologies has made possible the deployment of robotics in weed identification and management [29]. With robotics, weed identification goes a step closer to the ground as compared to the previously discussed methods. Based on artificial intelligence, using artificial neural networks, weeds can be not just identified in real-time with higher spatial resolution but can also be tackled, physically, thermally, or biologically, in real-time with a robotic arm attached to the robot on the ground. In this regard, the application of machine learning using convolutional neural networks for the identification of plants/fruits at their different stages has also been reported [35].

In this study, a plant and weed identifier robot (precision weeding robot) has been conceptualized and its software designed, based on state-of-the-art deep learning techniques using artificial neural networks (convolutional neural networks). Experiments were conducted on a dataset of over 200 images of three different plant species taken under different conditions and of different sizes at different growth stages. The neural network was trained to identify the plants and classify them as either weed or plant.

The robot is conceptualized for use in both small and big farms. However, the motivation behind rendering it low-cost and low-tech is to enable smallholders to be the primary beneficiaries. The importance of this approach stems from the fact that smallholder farmers are the primary producers of food for the majority of the world population [36]. A low-cost weeding robot that can identify and distinguish weeds from plants could be an addition to the agroecological interventions [28,37]. The robot can, based on the need, either remove the weeds or incorporate them into the soil. The option of fitting the robotic arm with other heads is also there, which can be used to spray trace elements or plant protection substances.

The construction of the autonomous farming robot, mainly focused on performing weeding operations, is broadly divided into six phases for prototyping and carrying out the initial tests:

Phase 1: Conceptualisation of the idea framework for the design of the robot.

Phase 2: Building and testing an artificially intelligent classifier that can distinguish a plant from a weed in real-time.

Phase 3: Design the method for estimation and extraction of the position of the identified weeds using computer vision techniques.

Phase 4: Building a mobile robotic platform prototype and installing all necessary components and the robotic manipulator for developing and testing.

Phase 5: Design and develop control algorithms for moving the robot platform and the manipulator with the end effector towards the weed and perform different removal strategies.

Phase 6: Validation studies and iterative tests in the lab and in the field. Improving on the flaws and developing additional features and testing.

The ideas and results from the first three phases are described in the following sections.

### **2. Literature Review**

### *2.1. Studies on Weed Killing Herbicides and Their Effects*

Application of weedicides is the commonly used method for post-emergent control of weeds [38,39]. A study conducted in 2016 reported that, globally, the use of the single most commonly used herbicide, glyphosate, increased 15-fold in a span of 20 years [40]. An increasing number of studies detail the concerns that arise with the usage of herbicides with respect to adverse effects on human health, soil nutrition, crop health, groundwater, and biodiversity [41]. Many governments are planning to ban the usage of such agrochemicals and are hence looking for alternative solutions in this regard [40]. The World Health Organisation (WHO) has reported sufficient evidence regarding the carcinogenicity of insecticides and herbicides, while their potential effect on human beings at the DNA (deoxyribonucleic acid) and chromosome level has also been reported [42]. In a study, the US FDA (Food and Drug Administration) reported the presence of glyphosate residues in 63.1% of corn and 67% of soy samples [40]. A case study in 2017 reported that a detectable amount of glyphosate was found in the urine specimens of pregnant women, leading them to have shorter pregnancy lengths [40]. Another study from Sri Lanka shows that drinking glyphosate-contaminated water causes chronic kidney diseases [43]. In addition to being a health risk for humans, the use of pesticides has also been reported to cause a decrease in monarch butterfly populations [44], to slow larval growth in honey bees, and to lead to their death when exposed to glyphosate [45]. The use of herbicides generally poses a slew of adverse non-target risks to the different components of the agroecosystems [46]. Hence, exploring a non-chemical solution to the problem of weed proliferation is reasonable.

### *2.2. Deep Machine Learning in Agriculture*

Machine learning (ML) is a subset of the artificial intelligence domain that provides computers the ability to learn, analyze, and make their own decisions/predictions without being explicitly programmed. It is mainly categorized into predictive or supervised learning and unsupervised learning [47].

The goal of the supervised learning approach is to learn a mapping function from inputs *x* to outputs *y*, given a labeled set of *N* input-output pairs

$$D = \{ (\mathbf{x}_i, y_i) \}_{i=1}^{N} \tag{1}$$

Here, *D* is called the training set, and *N* is the number of training examples. In simple terms, we have a few sample inputs and outputs and we use a mathematical algorithm to learn an underlying mapping function that maps input to the output. Hereby, the aim is to estimate the mapping function and predict the output when an entirely new set of input data is provided. Currently, supervised learning is widely used in many applications, such as classification, pattern recognition, and regression problems [47].

On the other hand, in unsupervised learning, we are only given inputs, and the goal is to find 'interesting patterns' in the data [11,47].

$$D = \{ \mathbf{x}_i \}_{i=1}^{N} \tag{2}$$

In simple terms, here, the algorithm is left to learn and analyze the underlying pattern without being provided any labeled input data. The algorithm learns through structuring data patterns and predicts the output. Some of the examples of unsupervised learning are clustering and association problems [47]. Figure 2 shows a general block diagram of the machine learning approach.

**Figure 2.** Machine learning approach.
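The difference between the two learning settings can be illustrated with a minimal sketch in plain Python. It is not part of the tool developed in this work: the function names `fit_line` and `two_means` and the toy data are assumptions chosen only for illustration. The supervised part learns a mapping from labeled pairs (*x<sub>i</sub>*, *y<sub>i</sub>*) by a least-squares line fit; the unsupervised part receives inputs only and groups them into two clusters.

```python
def fit_line(data):
    """Supervised: least-squares fit of y ~ a*x + b from labeled pairs (x_i, y_i)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def two_means(xs, iterations=20):
    """Unsupervised: group unlabeled 1-D inputs into two clusters (k-means, k = 2).

    Assumes xs contains at least two distinct values.
    """
    c0, c1 = min(xs), max(xs)  # initial centroids at the extremes
    for _ in range(iterations):
        group0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        group1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0 = sum(group0) / len(group0)
        c1 = sum(group1) / len(group1)
    return c0, c1


# Hypothetical labeled training set D = {(x_i, y_i)}; here y = 2x + 1.
labeled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
slope, intercept = fit_line(labeled)
print("prediction for a new input x = 4:", slope * 4.0 + intercept)

# Hypothetical unlabeled set D = {x_i}; two groups are visible by eye.
unlabeled = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
print("cluster centres:", two_means(unlabeled))
```

In the supervised case the labels steer the fit directly, whereas in the unsupervised case the structure has to be inferred from the inputs alone.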

Deep Learning is a subset of the Machine Learning approach in artificial intelligence. Artificial deep neural networks are one of the deep learning architectures, which provide a compelling supervised learning framework [48,49]. Machine learning and deep learning algorithms are applied in various agricultural operations, such as flower species recognition, disease prediction and detection in plants, crop yield forecasting, weed classification and detection, and plant species recognition and classification [50]. These are briefly described below.

### 2.2.1. Disease Identification

Crop diseases are a significant threat to the crop yield and the quality of the food produced, with adverse consequences on the livelihood of small-scale farmers and food security [51]. Globally, 80% of the food is grown by small-scale farmers, and among them, there is a reported yield loss of 50% due to crop diseases and pests [51]. Various types of microbial plant pathogens are the typical causative agents of plant diseases [20]. Different bio-control agents have been assessed and used against those pathogens to curb plant diseases [52]. A few decades back, research efforts were initiated at different agricultural institutes for the early identification of plant and crop diseases to help farmers in the prevention of crop diseases [51]. To carry out the prevention measures, early detection of the pathogens and the diagnosis of crop diseases are essential. With technological advancements, today, these disease detection steps are carried out much more efficiently [53].

Artificial Intelligence technology, along with computer vision, image processing, object detection, and machine learning algorithms, is widely used and analyzed and has proven to be effective in plant disease diagnosis and detection [53]. By utilizing popular architectures like AlexNet [23] and GoogLeNet [24], Mohanty et al. reported a disease prediction accuracy of 99.35% upon the analysis of 26 diseases in 14 crop varieties [51]. In addition to that, a real-time disease detector proposed in the experimental study by Alvaro et al. in tomato plants helped to diagnose diseases at an early stage in tomato crops in comparison to various lab analyses [54]. Hence, deep machine learning-based interventions are making significant contributions to agricultural research.

### 2.2.2. Crop Yield Forecasting

For the purpose of planning and designing food supply chains, it is helpful to have an idea about the crop yield that can be expected for a particular cropping system. Accurate yield estimation also helps farmers to choose better crop management methodologies among the different available ones [55]. Conventionally, crop yield estimation is based on previous experience and seasonal weather conditions [55,56]. Such yield estimation approaches, however, are constrained by factors including climate variability and the changing soil and water dynamics and are hence often not well adapted to changing conditions [56]. In modern farming systems, the availability of time-series yield data, combined with many other sources of spatial agricultural farm data, can be utilized in designing machine learning algorithms that can contribute to better yield prediction models [56]. Support Vector Machines (SVM), Artificial Neural Networks (ANNs), Bayesian Networks (BN), Backpropagation Networks (BPN), Least Squares Support Vector Machines (LS-SVM), and Convolutional Neural Networks (CNN) are some of the models that are used for yield prediction [50].

In a study, Support Vector Machine (SVM) algorithms used on coffee plantations to determine whether the seeds are harvestable or not helped farmers to optimize their economic plans and work schedules [50]. In another study, Unmanned Aircraft Systems (UAS) were used to collect spatial and temporal remote sensing data, and an artificial neural network model was used to predict tomato crop yield with a predictive accuracy of R<sup>2</sup> ≈ 0.78–0.89 [57]. R<sup>2</sup> is the coefficient of determination, an evaluation metric that is commonly used in regression tasks. In another study, data on three factors, namely soil conditions, weather conditions, and management practices (sowing dates), from 1980 to 2015 were collected and considered as inputs [58]. With that data, a CNN-RNN (Convolutional Neural Network-Recurrent Neural Network) model was used to predict the yield in soybean and corn fields across 13 states in the United States. The model showed that soil and weather conditions are vital components in yield forecasting in addition to crop management practices [58]. In other recent research, it is reported that a deep learning-based 3D CNN model applied for soybean crop yield prediction outperformed the state-of-the-art machine learning techniques [59].
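For readers unfamiliar with the metric, the coefficient of determination compares the squared prediction errors with the spread of the observations around their mean:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The short Python sketch below computes it for a set of made-up yield values. The numbers are purely hypothetical and are not taken from any of the cited studies; the function name `r_squared` is likewise only illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted yields (arbitrary units).
observed_yield = [3.0, 5.0, 7.0, 9.0]
predicted_yield = [2.5, 5.5, 6.5, 9.5]
print("R^2 =", r_squared(observed_yield, predicted_yield))
```

A value close to 1 means the model explains most of the variance in the observed yields, while a value near 0 means it does little better than always predicting the mean.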

### 2.2.3. Plant Leaf Classification and Identification

Easy recognition of different plant species can be of great help to ecologists, biologists, taxonomists, and researchers in plant-related studies and for medical purposes [60,61]. Machine learning and computer vision algorithms are making considerable contributions in this field [50]. They help reduce the dependency on expert availability and save time in classification tasks [50]. Deep learning models that specifically deal with images are used in plant leaf identification and have outperformed conventional image processing techniques and machine learning algorithms [62].

In one research study, a proposed deep learning model that uses the ResNet26 architecture could achieve recognition levels of 91.78% on the BJFU100 dataset, which consists of 10,000 images of 100 classes [60,62]. In comparison, the same proposed model could achieve 99.65% in classifying 32 kinds of leaf structures of plants utilizing the publicly available Flavia leaf dataset [60,62,63]. Studies report that it is not just the colors and shape of the leaves that are used to classify the plants; plant leaf veins can also be used as input features in determining leaf identity and properties [62]. The increased usage of mobile technology has brought the above techniques to the stage of practical implementation in the form of mobile applications. A few mobile applications, like Flora Incognita and Pl@ntNet, are able to recognize plants, fruits, flowers, and the bark of trees from just a snapshot [64,65]. Currently, Pl@ntNet is able to recognize 27,909 varieties of plants and maintains a database of 1,794,096 images of different plants [64].

### 2.2.4. Weed Classification and Detection

Weed management in crops is a challenging task for farmers, and weeds pose a significant threat to crop yields if not managed properly [50,66]. Weeds compete with crops for nutrients and usually grow faster; hence, early identification and classification are crucial for a better crop yield [50,67,68]. Machine learning algorithms like SVM and ANN have already been used for classifying weeds and have achieved high accuracy levels in different crops [50].

Utilizing the openly available dataset of plant seedlings provided by Aarhus University, Denmark, Ashqar et al. developed a deep learning model that was able to classify 12 species of weeds over 5000 images with a precision of 99.48% [69]. In another study, Smith et al. used CNNs and transfer learning techniques to classify grass, dock, and clover and achieved a 94.9% accuracy in classifying weeds [70]. The transfer learning technique is a powerful tool that can be used on small datasets and can achieve a reasonable level of accuracy [70]. In another study, a fuzzy real-time classifier was developed for weed identification in sugarcane crops, with an accuracy level of 92.9% [6]. However, the latest deep learning architectures can improve the performance of the tools and open up possibilities for exploring new ideas in weed control and management strategies [68]. Real-time identification of weeds can be a potent tool for robots in precise weeding. It can be a valuable addition to sustainable weed management systems [50,68]. Consequently, this could contribute towards offsetting the heavy usage of pesticides [67].

### *2.3. Artificial Neural Networks*

As the name suggests, an artificial neural network (ANN) is a system that is inspired by the connections of neurons in human brains [71]. An artificial neuron is a single block mathematical entity that processes information and is essential in the functioning of a neural network [71]. Haykin stated that a typical neuron has three essential elements: a set of connection links that have their weights, a summation point, and an activation function. The neuron *k* can be mathematically described by the following equations [71].

$$u_k = \sum_{j=1}^{m} w_{kj} x_j$$

$$y_k = \Phi(u_k + b_k)$$

where *u<sub>k</sub>* is the linear combiner output; *w<sub>k1</sub>*, *w<sub>k2</sub>*, *w<sub>k3</sub>*, ..., *w<sub>km</sub>* are the synaptic weights; *x<sub>1</sub>*, *x<sub>2</sub>*, *x<sub>3</sub>*, ..., *x<sub>m</sub>* are the inputs; *b<sub>k</sub>* is the bias, which has the effect of increasing or lowering the net input of the activation function; Φ(·) is the activation function; and *y<sub>k</sub>* is the output of the neuron. A typical mathematical model of the neuron is shown in Figure 3 [71].
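The two neuron equations can be written out directly in a few lines of Python. This is only an illustrative sketch: the function names, the choice of a sigmoid for Φ, and the input and weight values are assumptions made for the example.

```python
import math


def sigmoid(v):
    """One common choice for the activation function Phi."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron_output(inputs, weights, bias, activation):
    """Compute y_k = Phi(u_k + b_k) with u_k = sum_j w_kj * x_j."""
    u_k = sum(w * x for w, x in zip(weights, inputs))  # linear combiner output
    return activation(u_k + bias)


# Hypothetical inputs x_1, x_2 and synaptic weights w_k1, w_k2.
x = [1.0, 2.0]
w_k = [0.5, -0.25]
b_k = 0.0

# u_k = 0.5 * 1.0 - 0.25 * 2.0 = 0.0, so the sigmoid returns 0.5.
print(neuron_output(x, w_k, b_k, sigmoid))
```

Raising or lowering `b_k` shifts the value fed into the activation function without changing the weighted sum of the inputs.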

**Figure 3.** A non-linear mathematical model of an artificial neuron [71].

An artificial neural network is simply a collection of artificial neurons, typically connected and organized in layers. A layer is made up of interconnected neurons that contain an activation function. A neural network consists of an input layer, an output layer, and one or more hidden layers. The input layer takes the inputs from the outside world and passes them through weighted connections to the hidden layers. The hidden layers perform the computations and feature extraction, apply standard nonlinear activation functions such as tanh, ReLU (Rectified Linear Unit), sigmoid, or softmax, and pass the values to the output layer. Networks of this type are typically called feed-forward neural networks or multilayer perceptrons. Figure 4 shows a feed-forward neural network [72].


**Figure 4.** A feed-forward neural network [72].
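As a rough illustration of how values flow from the input layer through a hidden layer to the output layer, the following sketch implements a forward pass in plain Python. The 2-3-1 layout and all weights are arbitrary assumptions chosen for readability, not values from this work.

```python
import math


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense_layer(inputs, weights, biases, activation):
    """Apply one fully connected layer: activation(W x + b), neuron by neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(x, layers):
    """Pass x through each (weights, biases, activation) layer in order."""
    for weights, biases, activation in layers:
        x = dense_layer(x, weights, biases, activation)
    return x


# Hypothetical 2-3-1 network with arbitrary illustrative weights.
network = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.0], relu),  # hidden layer
    ([[1.0, 1.0, 1.0]], [0.0], sigmoid),                               # output layer
]
output = feed_forward([1.0, 2.0], network)
```

Here the hidden layer produces the activations `[0.0, 2.0, 0.0]`, so the single output neuron returns the sigmoid of 2.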

When it comes to training a neural network, the focus is mainly on minimizing the output prediction error by adjusting the weights on each connection in a backward manner. This process is called back-propagation [73]. Back-propagation computes the gradient of the loss with respect to every weight, and a stochastic gradient descent method uses these gradients to search the weight space for a minimum of the loss/cost function. The weights that minimize the loss/cost function are then taken as the solution of the training problem, and the training process ends [73].
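The weight-update idea can be sketched on the smallest possible case: a single linear neuron trained by stochastic gradient descent on a squared-error loss. With only one layer, the backward pass reduces to a single application of the chain rule; a multilayer network would propagate the same error term backwards through each layer. The data, learning rate, and number of epochs are illustrative assumptions.

```python
import random


def sgd_train(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, target in data:
            error = (w * x + b) - target  # forward pass and prediction error
            # Gradients of 0.5 * error**2 with respect to w and b.
            grad_w, grad_b = error * x, error
            w -= learning_rate * grad_w  # step against the gradient
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_train(samples)
```

On this noise-free toy data the learned weight and bias approach 2 and 1, the values used to generate the samples.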


### *2.4. Convolution Neural Networks*

The term convolutional neural network (CNN) denotes one of the deep neural network algorithms that mainly deal with computer vision-related tasks [48]. They are often used in applications like image classification, object detection, and instance segmentation problems. The special feature of CNNs is that they are able to learn and understand the spatial or temporal correlation of the data. These are highly successful in practical applications. Convolutional neural networks use a special kind of mathematical operation, called the convolution operation, in at least one of their layers instead of a generic matrix multiplication [48].

*Agriculture* **2021**, *11*, x FOR PEER REVIEW 9 of 34

A convolutional neural network (ConvNet) typically consists of three types of layers: a convolutional layer, a pooling layer, and a fully connected or dense layer. By aligning these layers in a sequence or stacking them up, CNN architectures can be built. Figure 5 illustrates a convolutional neural network. The convolutional layer is the central building unit of CNNs. It consists of kernels that convolve independently over the input image, resulting in a set of feature maps. Stride, depth, and zero-padding are the three parameters that control the size or volume of the activation map [74]. Here, stride is the number of pixels by which the kernel moves over the input image at a time, and depth is the number of kernels that are used for convolution over the input image [74]. Convolving a kernel over the input image reduces the size of the activation map and loses information at the borders. Zero-padding adds zero values around the border of the input and helps to control the output volume of the activation map. In addition, to give the network the ability to model complex data, every neuron is linked with a nonlinear activation function. ReLU is one of the most frequently used activation functions because it provides the network with the ability to make accurate predictions [74].
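The way stride, depth, and zero-padding determine the activation volume can be made concrete with the usual relation, output = (input − kernel + 2 × padding) / stride + 1, applied along each spatial dimension. The helper names below are hypothetical, and raising an error when the kernel does not tile the input exactly is a choice made for this sketch (many frameworks round down instead).

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("kernel, stride and padding do not tile the input")
    return span // stride + 1


def activation_volume(height, width, kernel_size, num_kernels, stride=1, padding=0):
    """Return (height, width, depth) of the activation map."""
    return (
        conv_output_size(height, kernel_size, stride, padding),
        conv_output_size(width, kernel_size, stride, padding),
        num_kernels,  # depth equals the number of kernels
    )


# A 32 x 32 input convolved with sixteen 5 x 5 kernels at stride 1.
without_padding = activation_volume(32, 32, kernel_size=5, num_kernels=16)
with_padding = activation_volume(32, 32, kernel_size=5, num_kernels=16, padding=2)
```

Without padding the map shrinks from 32 × 32 to 28 × 28, whereas a zero-padding of 2 preserves the 32 × 32 size; in both cases the depth is 16.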

**Figure 5.** A convolutional neural network (CNN) [74].

The pooling layer mainly serves to reduce the spatial size of the representation, which lowers the number of training parameters and the computing cost in the network while retaining the essential information when the images are large. Pooling is also referred to as downsampling or subsampling. Pooling is done independently on each depth dimension of the image. The pooling layer also helps to reduce over-fitting during training. Among the different types of pooling, max pooling with a 2 × 2 filter and stride = 2 is commonly used in practice for better results [74].
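A minimal sketch of max pooling with a 2 × 2 filter and stride 2 on a single depth slice is given below; the feature map values are arbitrary and the function name is hypothetical.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over one depth slice given as a list of rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            # Largest value inside the size x size window anchored at (r, c).
            max(
                feature_map[r + dr][c + dc]
                for dr in range(size)
                for dc in range(size)
            )
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
pooled = max_pool_2d(feature_map)
```

The 4 × 4 slice is reduced to the 2 × 2 slice `[[6, 4], [7, 9]]`, that is, a quarter of the original number of values.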

### *2.5. State-of-the-Art Object Detection Methods*

In the case of image classification problems, object recognition (detection, recognition, or identification) is the challenging part. It involves classifying the various objects in an image and localizing the detected objects by drawing bounding boxes around them and assigning a class label to every bounding box [75]. Instance or semantic segmentation is another problem in computer vision, where, instead of drawing a bounding box around the objects, they are indicated with pixel-level masks [75].

Compared to classical machine learning methods of detecting objects, deep learning methods are highly successful and do not require manual feature extraction. Region-Based Convolutional Neural Network (R-CNN), You Only Look Once (YOLO), and Single Shot MultiBox Detector (SSD) are some of the techniques proposed for object identification and localization tasks that can perform end-to-end training and detection [76–81].

R-CNN was proposed in 2014 and comprises three steps. Initially, a selective search algorithm is used to find the regions that may contain objects (approximately 2000 proposals) in an image [76,77]. Next, a CNN is used for feature extraction, and finally, the features are classified. However, the constraint here is that every ROI (Region of Interest) is warped to a fixed size before being provided as an input to the CNN [77]. This process is computationally heavy and results in slow object detection. To mitigate some of these flaws and speed up detection, the Fast R-CNN method was introduced [77]. Here, a CNN first extracts the features of the whole image; an ROI pooling layer then extracts the features for a specific input region and feeds them to fully connected layers, whose output is split between two branches that perform classification and bounding box regression [77].


Subsequently, the Faster R-CNN method was proposed by Shaoqing Ren and colleagues, and it outperformed both previous models in terms of speed and detection accuracy [76]. They introduced the Region Proposal Network (RPN) and combined it with the detector into a single model [76]: the RPN proposes the regions, and a Fast R-CNN detector uses the proposed regions. Mask R-CNN is another method that extends Faster R-CNN to pixel-level instance segmentation [78]. It adds a third branch to the Faster R-CNN architecture, alongside classification and localization. This branch is a small fully convolutional network that predicts a segmentation mask in a pixel-to-pixel manner. Although it is fast, its design is not optimized for the trade-off between speed and accuracy [78]. Figure 6 summarizes the R-CNN family of methods [82].

**Figure 6.** The Region-Based Convolutional Neural Network (R-CNN) family.

YOLO is another popular object detection method, proposed by Redmon et al., that uses a different approach compared to the R-CNN family described above [80]. A single neural network is used to predict class probabilities and bounding boxes from the images. The base model and the Fast YOLO model can process images in real time at 45 fps and 155 fps, respectively, with Fast YOLO still achieving double the mAP (mean Average Precision) of other real-time detectors [80]. Although YOLO was reported to outperform the state-of-the-art R-CNN family of techniques in terms of speed, it tends to make more localization errors [80].

SSD is another approach, proposed by Wei Liu et al., to detect objects in images by using a single neural network [79]. It performs the generation of region proposals and the identification of the objects in the proposed regions in a single shot. RPN-based approaches, by contrast, need two shots and are hence slower than SSD. SSD was also reported to achieve a higher mAP than Faster R-CNN or YOLO [79].

### *2.6. Transfer Learning Technique*

Transfer learning is a technique that is used in many machine learning and deep learning tasks. It has been defined in different ways. Goodfellow et al. define it as an approach of transferring the knowledge of a previously trained neural network model to the new model [48]. It has also been defined as an optimization that allows rapid progress when the model is learning for another task [83]. Mathematically, this can be defined as follows.

Definition: For a learning task *L<sub>s</sub>* in the source domain *D<sub>s</sub>* and a learning task *L<sub>t</sub>* in the target domain *D<sub>t</sub>*, transfer learning helps to improve the performance of the predictive function *f<sub>t</sub>(.)* in the target domain *D<sub>t</sub>* by utilizing the knowledge acquired from *D<sub>s</sub>* and *L<sub>s</sub>*, where *D<sub>s</sub>* ≠ *D<sub>t</sub>* and *L<sub>s</sub>* ≠ *L<sub>t</sub>*. Figure 7 represents the transfer learning technique.


**Figure 7.** The transfer learning technique.

For instance, a neural network model that has been trained to recognize images of animals or birds can be reused to learn to identify cars, medical X-ray diagnostic images, or any other set of images. This is especially useful when only a small amount of data is available for training on the second task. It also accelerates the training process on the second task compared to training from scratch, which may take weeks to achieve optimal performance. When the model is trained on the first task, the low-level layers of the neural network learn the basic features of the images. For example, contours, edges, and circles are extracted by the low-level layers, which are called feature extractors. These feature extractors emerge in the first stages of neural network training and are the standard building blocks for most image recognition-related tasks. We utilize these feature extractors for the second task and, at the end, train an image classifier for our specific job. In our scenario, since the task is to recognize two classes, i.e., plants and weeds, the transfer learning technique was utilized to perform the experiments that are described in the next section.
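The division of labour between reused feature extractors and a newly trained classifier can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "pre-trained" weights are invented, the classifier head is a simple logistic regression, and the data are synthetic. It is not the model used in this work, which relies on a pre-trained deep network.

```python
import math

# Hypothetical "pre-trained" feature extractor; its weights stay frozen.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen low-level layer (ReLU units); never updated on the new task."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def train_head(samples, learning_rate=0.5, epochs=500):
    """Train only the new classifier head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            features = extract_features(x)
            prediction = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            error = prediction - label  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(x, weights, bias):
    features = extract_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


# Toy target task: label 1 when the first input exceeds the second, else 0.
samples = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
weights, bias = train_head(samples)
```

Only `weights` and `bias` of the head change during training; `FROZEN_WEIGHTS` plays the role of the transferred low-level layers, which is why far fewer parameters have to be learned from the small target dataset.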

The transfer learning technique, as described above, is proposed as the method for the tasks of weed identification and classification, as it has been reported to be suitable for autonomous identification and classification tasks [84]. Despite its widespread application in diverse fields, from training self-driving cars to audio transcription, the transfer learning technique has two major limitations: negative transfer and over-fitting [85]. Negative transfer occurs when the source domain data is dissimilar from the target domain data. In other words, negative transfer can occur when the two tasks are too dissimilar [86]. As a result, the model does not perform well, leading to poor results. In addition, models trained through transfer learning are prone to over-fitting in the absence of careful evaluation and tuning. Over-fitting is, however, a general limitation of all prediction technologies [87]. These limitations can be overcome by carefully tuning the hyperparameters and choosing the right size (number of layers) of the neural network model.

### **3. Materials and Methods**

The field studies regarding the problem of weed infestation were carried out on rice farming systems in the Kashmir region in India. The robot development research is being undertaken at the Hamburg University of Technology (TU Hamburg), Hamburg, Germany, by the research group Rural Revival and Restoration Engineering (RUVIVAL) at the Institute of Wastewater Management and Water Protection, with the support of the Institute of Reliability Engineering.

### *3.1. Conceptualisation and High-Level Design of the Robot*

The conceptual design of the mobile robot platform is shown in Figure 8 as a demonstration of how the robot platform might look once it is built. The design was developed using the Onshape design software [88]. The robot was initially conceptualized to operate between rows of rice plants with a spacing of 25 cm; however, it is planned that the robot shall be modular, so that its operation can be adjusted to the row width and the height of the plants at different stages. The robot is intended to recognize weeds at an early BBCH stage, ideally at the leaf development stage. In this regard, the images of plants taken for training purposes also included plants at the sprouting stage. The conceptualized robot, as shown in the figure, has an electronics storage box that houses the batteries, sensors, and a single-board computer. On top of the electronics box, there is a solar panel mount to provide a renewable source of energy for the robot's movement. Once the robot has successfully identified the weeds, an algorithm converts the position of the weeds in the image frame into real-world coordinates relative to the robotic platform. After these transformations, a robotic manipulator takes the real-world coordinates, performs the inverse kinematics operations, drives the end effector to the desired position, and applies a weed control mechanism such as mechanical or thermal weed control, optionally with mulching.
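The transformation from image coordinates to platform coordinates is not specified at this level of design. The following sketch shows one simple possibility under loudly stated assumptions: a pinhole camera looking straight down at flat ground, a principal point at the image centre, a known camera height and focal length, and the bounding box centre used as a stand-in for the weed position. All names, axis conventions, and numbers are hypothetical.

```python
def box_centre(x_min, y_min, x_max, y_max):
    """Centre of a bounding box, used here as a stand-in for the weed position."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def pixel_to_robot_frame(u, v, image_width, image_height,
                         focal_length_px, camera_height_m,
                         camera_offset_m=(0.0, 0.0)):
    """Map an image point to ground-plane coordinates in the robot frame.

    Assumes a pinhole camera looking straight down at flat ground.
    """
    # Offset from the principal point, assumed to be the image centre.
    du = u - image_width / 2.0
    dv = v - image_height / 2.0
    # Similar triangles: metres on the ground per pixel in the image.
    metres_per_pixel = camera_height_m / focal_length_px
    x = camera_offset_m[0] + du * metres_per_pixel
    y = camera_offset_m[1] + dv * metres_per_pixel
    return x, y


# Hypothetical detection in a 640 x 480 image, camera 0.5 m above the ground.
u, v = box_centre(400, 200, 480, 280)
x, y = pixel_to_robot_frame(u, v, 640, 480, focal_length_px=500.0, camera_height_m=0.5)
```

In this example the box centre lies 120 pixels to the right of the image centre, which corresponds to 0.12 m along the x axis of the assumed robot frame. A real system would additionally need camera calibration and a correction for camera tilt and uneven ground.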

**Figure 8.** A rough representation of the idea of the plant and weed classifier robot.


The choice of robotic manipulators to perform mechanical weeding can vary depending on various factors such as kinematic structure, degrees of freedom, workspace, motion control, accuracy, and repeatability [89,90]. There is a possibility to mount three types of manipulators underneath the robotic platform.

1. Cartesian manipulator
2. Serial manipulator
3. Parallel manipulator

Parallel manipulators have high rigidity, a high payload/weight ratio, high speed and acceleration, and high dynamic characteristics, and it is easier to solve inverse kinematics problems with them compared to serial manipulators [89,90]. On the downside, they have a limited and complex workspace. A parallel manipulator may still be one of the better choices for performing the weeding action. Serial manipulators or articulated arms have a larger workspace, high inertia, low stiffness, and low speeds and accelerations, and the inverse kinematics problem is more difficult to solve for them than for parallel manipulators [89]. Cartesian robots are not considered an ideal choice because they have fewer applications on mobile platforms. At this point, we propose a parallel manipulator as the ideal choice based on its advantages and characteristics. However, which manipulator is best suited for mounting onto the robot to perform weeding remains an open question. Figure 9 presents a parallel delta manipulator with three degrees of freedom (excluding the fourth degree of the end actuator) [90].

**Figure 9.** A schematic of a delta robot manipulator with three degrees of freedom: (**a**) A Delta robot with three degrees of freedom; (**b**) A three-dimensional model of a Delta robot with the different parameters [90].

This robot is intended to be used as an agricultural tool together with other sustainable agricultural practices, which decrease the dependence of farmers on external inputs like mineral fertilizers and pesticides. Therefore, from a purely monetary perspective, the robot can decrease the input costs by decreasing labor requirements and eliminating the cost associated with pesticides, while increasing yield by bridging the yield gap resulting from weed infestation. An important aspect of the use of an autonomous weeding robot, from an agroecological perspective, is to reduce the ecological footprint of food production through the phasing out of chemical pesticides. This will also lead to better quality food and less contamination of soil and water due to agrochemical residues, as already discussed in the introduction. The environmental and societal damages of pesticide use have been estimated to be around \$10 billion [91]. The costs and benefits of this intervention, hence, go much beyond the cost of procurement of the equipment and the benefit of labor savings due to robot deployment for weeding. The proposed weeding robot is conceptualized as a low-cost robotic machine compared to the robots that are available in the market, which are priced in the range of \$20,000 to \$125,000 [92–94]. The prototype is being built with a cost estimation of \$15,000, and the final robot upon industrial production is expected to be available to the farmers for a price under \$10,000. In comparison, the monetary cost of pesticides for a smallholder with 10 hectares of land under cultivation is around \$1750 per year at \$70 per acre (2018, 2019) [95]. Pesticide costs are expected to increase further in the coming years with the increased incidence of pesticide resistance. This means that if the robot is acquired by a cooperative of five farmers who use it on a sharing basis, the monetary cost of procuring the robot will be comparable to the cost they would otherwise incur by using pesticides in one year, with the environmental and human health benefits being a strong additional motivation.
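The cost comparison above can be checked with a short calculation. The figures (\$70 per acre, 10 hectares, five farmers, a robot price of up to \$10,000) are taken from the text; the hectare-to-acre conversion factor and the simple payback measure are assumptions added for this sketch.

```python
HECTARE_TO_ACRE = 2.47105  # acres per hectare


def annual_pesticide_cost(area_ha, cost_per_acre):
    """Yearly pesticide cost in dollars for a farm of the given area."""
    return area_ha * HECTARE_TO_ACRE * cost_per_acre


per_farmer = annual_pesticide_cost(area_ha=10, cost_per_acre=70)  # about 1730 dollars
cooperative = 5 * per_farmer      # yearly pesticide cost of five farmers together
robot_price = 10_000              # upper bound quoted for the production robot
payback_years = robot_price / cooperative
```

The per-farmer cost comes out at roughly \$1730 per year, which the text rounds to \$1750, and a five-farmer cooperative would recover a \$10,000 purchase price in a little over one year of avoided pesticide spending, before counting labor savings or yield gains.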


### *3.2. Hardware Design Approach of the Weeding Robot*

Designing robot hardware that operates in dynamic surroundings is often a challenging task. A high-level, modular hardware design is presented in Figure 10. The robot ideally consists of a single-board computer along with all the required modules, peripherals, sensors, and actuators. Single-board computers have everything built on a single circuit board, including RAM, processor, and peripherals. They have general-purpose input-output pins that are well suited to controlling sensors and actuators. Many open-source single-board computers are available today, and the board should be chosen according to the application. Open-source boards like the Raspberry Pi have the ability to run Linux and distributed systems like the Robot Operating System (ROS) [96,97].

**Figure 10.** High-level hardware design block diagram for the weeding robot.

ROS is a lightweight middleware that is specifically designed for robotic applications. Its publish-subscribe design pattern is one of its key features and enables asynchronous, parallel node-to-node communication. It has built-in packages for inverse kinematics, forward kinematics, path planning, navigation, PID (proportional-integral-derivative) control, and vision-related tasks. It also has graphical tools like Gazebo and RViz that help to visualize the robot model for simulations [96].

Boards like the Jetson Nano, Jetson TX2 series, Jetson Xavier NX, and Jetson AGX Xavier series from NVIDIA [86], and the Coral Dev Board from Google, include hardware accelerators such as a TPU (Tensor Processing Unit) or NPU (Neural Processing Unit) [98], which accelerate AI-specific applications like object detection, image classification, and instance segmentation for training and inference purposes [99]. These boards are comparatively cheap and cost approximately \$100 to \$800. They will be analyzed and utilized for building our robot in future work, keeping a low-cost, reliable design in scope.

### *3.3. Software Design Approach of the Weeding Robot*

Software for the weeding robot can be entirely developed in the ROS framework using high-level languages like C++ or Python. A sensor interface provides all the inputs from the cameras and sensors on the robot. The perception interface deals with the identification of weeds, stem positions, and position estimation of the detected weeds. OpenCV libraries can be used in the perception interface for real-time weed identification. The navigation interface has closed-loop feedback control algorithms that help with path planning between the crop rows. The robot interface takes the outputs from the feedback controllers, drives the robot in the crop field autonomously, and manages the weeds in real time using the delta manipulator. A high-level software block diagram for the weeding robot is presented in Figure 11.

**Figure 11.** High-level software block diagram for the weeding robot.
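The way the interfaces of this software design could exchange data through a publish-subscribe pattern can be illustrated with a minimal sketch. The `MessageBus` class below is a plain-Python stand-in written for illustration; it is not the ROS API, it delivers messages synchronously rather than asynchronously, and the topic names and message layouts are assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class MessageBus:
    """Tiny in-process stand-in for a publish-subscribe middleware (not ROS)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # Deliver the message to every callback registered for this topic.
        for callback in list(self._subscribers[topic]):
            callback(message)


bus = MessageBus()
commands = []  # targets that reach the (hypothetical) robot interface


def perception(frame: dict) -> None:
    # Hypothetical detector output: weed boxes as (xmin, ymin, xmax, ymax).
    for box in frame["weed_boxes"]:
        bus.publish("/weeds/detected", box)


def navigation(box: tuple) -> None:
    # Turn a detected box into a target point for the manipulator.
    xmin, ymin, xmax, ymax = box
    bus.publish("/manipulator/target", ((xmin + xmax) / 2, (ymin + ymax) / 2))


def robot_interface(target: tuple) -> None:
    commands.append(target)


bus.subscribe("/camera/frame", perception)
bus.subscribe("/weeds/detected", navigation)
bus.subscribe("/manipulator/target", robot_interface)

# The sensor interface publishes one frame containing a single weed box.
bus.publish("/camera/frame", {"weed_boxes": [(100, 200, 300, 400)]})
print(commands)
```

Because each interface only knows topic names, a component such as the detector can be replaced without touching the others, which is the main appeal of this pattern for a modular robot design.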

Python is widely used for AI, computer vision, and machine learning applications. It has gained popularity over the last few years because of its simple syntax and versatile features. Open-source community developers actively contribute to many libraries, which makes it easy for application or product developers to build a product without reinventing the wheel.

OpenCV is an open-source software library for computer vision applications. This library can be modified and used for commercial purposes under the BSD license. It comes with many built-in algorithms, for example for face recognition, object identification, and tracking of humans and objects. This library is used broadly across domains, including medicine, research labs, and defense.

### *3.4. Training and Implementation*

### 3.4.1. Plant and Weed Identification Pipeline

The plant and weed identification pipeline comprises three stages, which are represented in Figure 12. In the first stage, data was collected and preprocessed according to the input requirements of the neural network model. In the second stage, two neural network models were trained, evaluated, analyzed, and optimized. Finally, in the third stage, the best performing optimized model was exported for real-time identification of plants and weeds.

**Figure 12.** Proposed plant and weed identification pipeline.

### 3.4.2. Experimental Setup

Deep learning tasks depend heavily on data, which is essential for conducting experiments. The input data was based on three plant species: red radish (*Raphanus raphanistrum* subsp. *sativus* or *Raphanus sativus*), garden cress (*Lepidium sativum*), and common dandelion (*Taraxacum officinale*). The abundant availability of the common dandelion on lawns and the fast growth of red radish and garden cress made us opt for them. The problem of plant and weed classification can be framed in two ways: as binary or as multi-class classification. Grouping the species into two categories gives a binary classification problem; considering them individually gives a multi-class classification problem. Since the end goal was to precisely locate any type of weed in the soil, we treated the task as binary classification. We merged red radish and garden cress into one category (plants), placed common dandelion in another category (weed), and carried out our classification tests.
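The grouping of the three species into two classes can be written down as a simple lookup. This is a minimal sketch; the string keys and the numeric class ids are illustrative assumptions, with ids starting at 1 because id 0 is commonly reserved for the background class in detection label maps.

```python
# Species-to-class grouping described in the text; keys and ids are illustrative.
SPECIES_TO_CLASS = {
    "red_radish": "plant",
    "garden_cress": "plant",
    "common_dandelion": "weed",
}

# Assumed ids: detection label maps commonly start at 1 (0 = background).
CLASS_IDS = {"plant": 1, "weed": 2}


def to_binary_label(species: str) -> int:
    """Map a species name to the id of its binary class (plant or weed)."""
    return CLASS_IDS[SPECIES_TO_CLASS[species]]


for name in SPECIES_TO_CLASS:
    print(name, "->", SPECIES_TO_CLASS[name], to_binary_label(name))
```

Treating the species individually would simply replace `SPECIES_TO_CLASS` with an identity mapping and give three class ids instead of two.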

It is a common phenomenon that weeds grow faster than edible plants and compete with them for soil nutrients. As a result, distinguishing between plant and weed during this crucial time at the beginning of the growth cycle is essential. This can then be followed by weed management techniques. For this reason, a dataset of edible plant seedlings and weeds of different sizes under different surrounding conditions and backgrounds was prepared. The Python programming language and Google's open-source TensorFlow object detection API were utilized to build, train, and analyze the neural network models. The system used for training, testing, and inference is summarized in Table 1.

**Table 1.** System overview.

| Component | Specification |
|---|---|
| GPU | NVIDIA 8 GB RAM |
| OS | Ubuntu 18.04 LTS 64-bit |

### 3.4.3. Data Acquisition and Pre-Processing

Deep learning tasks require a considerable amount of input data as the main source for training the neural network models. For our problem, we built our own dataset based on three plant species for experimental purposes. A greenhouse was maintained in the laboratory, and we planted red radish and garden cress in mini-plots. We compiled the dataset by taking RGB pictures of the growing plants using a mobile camera. The raw pictures had pixel dimensions of 4032 × 3024. Since they were high-resolution images, providing them directly as input to train the network would have been computationally expensive, and the learning process would have been time-consuming. Therefore, the raw images were converted to 800 × 600 pixels and then used for pre-processing.
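The effect of the downscaling step can be quantified directly from the two image sizes given in the text. The following sketch only does the arithmetic; it does not perform the resizing itself, and the variable names are illustrative.

```python
from fractions import Fraction

RAW_SIZE = (4032, 3024)     # width, height of the raw photographs
TARGET_SIZE = (800, 600)    # width, height after conversion

# Both sizes reduce to the same aspect ratio, so no cropping or distortion is needed.
raw_aspect = Fraction(*RAW_SIZE)
target_aspect = Fraction(*TARGET_SIZE)

# Linear scale factor and reduction in the number of pixels per image.
scale = TARGET_SIZE[0] / RAW_SIZE[0]
pixel_reduction = (RAW_SIZE[0] * RAW_SIZE[1]) / (TARGET_SIZE[0] * TARGET_SIZE[1])

print(f"aspect ratios: {raw_aspect} and {target_aspect}")
print(f"linear scale factor: {scale:.4f}")
print(f"pixels per image reduced by a factor of {pixel_reduction:.1f}")
```

Since both sizes share the 4:3 aspect ratio, the conversion is a uniform rescale, and each image carries roughly 25 times fewer pixels into training.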

A complete set of 200 images, consisting of photos of plants and weeds taken from different perspectives and angles, was used for training and evaluation purposes. Figures 13 and 14 show some of the input image samples that were used for training the network.

**Figure 13.** Test weeds: two photographs of common dandelion that were used in the training.

**Figure 14.** Test plants: A few photographs of Radish seedlings (**Left**) and Cress (**Right**) that were used in the training.

In order to train the network, the whole dataset was split into two parts, one for training and one for evaluation. The train/test split was 160/40 images, following the Pareto (80/20) rule. When using the TensorFlow object detection API, we maintained a workspace structure for all the configuration files and datasets. The whole process was divided into five steps based on the TensorFlow custom object detection process: preparing the workspace, annotating the images, generating the input files in the TFRecord format, configuring, training, and optimizing the model, and exporting the inference graph for testing.
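A reproducible 160/40 split of the 200 images can be sketched as follows. The file names, the fixed random seed, and the helper `split_dataset` are assumptions made for illustration; the text does not state how the images were assigned to the two subsets.

```python
import random


def split_dataset(filenames, train_fraction=0.8, seed=42):
    """Shuffle the file names reproducibly and split them into train and test lists."""
    shuffled = sorted(filenames)            # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, independent of global state
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 200 collected images.
images = [f"img_{i:03d}.jpg" for i in range(200)]
train_files, test_files = split_dataset(images)
print(len(train_files), len(test_files))
```

Fixing the seed keeps the evaluation set identical across training runs, which matters when different learning rate configurations are compared on the same 40 images.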

For annotating the images, the open-source labeling tool LabelImg was used to draw the bounding boxes. The annotations were saved in the PASCAL Visual Object Classes (VOC) format as XML files. A representation of the bounding boxes that were drawn around the edible plants and weeds is shown in Figure 15.
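Annotations in the PASCAL VOC XML format can be read with the Python standard library. The sketch below parses a small hand-written sample; the file name, label, and coordinates in `SAMPLE_ANNOTATION` are made up for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in PASCAL VOC layout; all values are illustrative.
SAMPLE_ANNOTATION = """<annotation>
  <filename>weed_001.jpg</filename>
  <size><width>800</width><height>600</height><depth>3</depth></size>
  <object>
    <name>weed</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>420</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return the image file name and a list of labelled boxes (xmin, ymin, xmax, ymax)."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append({"label": obj.findtext("name"), "box": coords})
    return root.findtext("filename"), boxes


filename, boxes = parse_voc(SAMPLE_ANNOTATION)
print(filename, boxes)
```

A parser of this kind is typically the first step when converting the XML annotations into the TFRecord input files mentioned above.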

**Figure 15.** Annotating weed images using labeling software.

### 3.4.4. Training and Analysis of the Neural Network Model

By utilizing the transfer learning technique, two pre-trained models, Faster R-CNN inceptionv2 and SSD inceptionv2, both trained on the Common Objects in Context (COCO) dataset, were chosen from the TensorFlow model zoo. For the weeding robot, some latency between the detection of a weed and the interaction with it is tolerable. Hence, higher detection speed was not a primary requirement in our scenario. A reasonable detection speed with higher mean Average Precision (mAP) and higher confidence scores was preferred. The reported mAP and speed of the two above-mentioned models on the COCO dataset were reasonably good and suitable for our plant and weed detection problem. Hence, these models were adopted for training, optimization, and better generalization.

Generally, to come up with a model architecture, neural network layers are stacked sequentially, which, however, can make a large network computationally expensive. A larger neural network also does not necessarily provide better accuracy. Some of the available backbone architectures include AlexNet, VGG16/19, GoogLeNet, MobileNet, Inceptionv2, Inceptionv3, Inceptionv4, NASNet, ResNet, Xception, and Inception-ResNet. The models that were trained in our experiments and analysis use the Inceptionv2 architecture as the feature extractor. GoogLeNet was proposed by Christian Szegedy and his colleagues. It consists of a 22-layer deep convolutional neural network architecture that is considerably computationally efficient. Instead of stacking up layers sequentially and selecting filters, they proposed a block between the layers named the "Inception" module. The inception module performs different kinds of filter operations in parallel. The outputs of these filter operations are concatenated depth-wise. This makes the network wider rather than deeper. The single output obtained from this operation is then passed on to the next layer. This operation was observed to be computationally less expensive. The inception block is represented in Figure 16.

**Figure 16.** Inception module with dimension reduction.

Before the training process started, the configuration file of the pre-trained Faster R-CNN inceptionv2 model, which had been trained on the COCO dataset, was modified. In the custom configuration (Configuration 2), we set the total number of classes to 2, corresponding to the two classes plant and weed. The maximum detections per class and maximum total detections variables were set to 10. The network was then allowed to start training from the fine-tune checkpoint that comes with the unmodified model. The learning rate is one of the essential hyperparameters that help to optimize the model for better performance. With the unmodified learning rate and number of steps that come with the pre-trained model, the model was over-fitting, with a large deviation and increasing evaluation loss. By heuristically reducing the step size while keeping the learning rate constant, the model achieved a better generalization ability. For further evaluation and fine-tuning purposes, we also considered another, higher learning rate value for the same model, again chosen heuristically. This configuration (Configuration 2) was tried to find out whether the loss converges faster to 0 with better generalization capability.

Evaluation Metrics

Intersection Over Union (IOU): It is an evaluation metric based on the overlap between two bounding boxes. It requires a ground truth bounding box BG and a predicted bounding box BP. With this metric, we can determine whether a detection is valid or invalid. IOU ranges from 0 to 1; the higher the number, the closer the boxes are to each other. IOU is defined mathematically as the area of intersection of the two bounding boxes divided by the area of their union (Figure 17).

**Figure 17.** Graphical representation of Intersection Over Union (IOU) (Source: Adrian Rosebrock/Creative Commons).

Once the IOU scores were available, a threshold (for example, 0.5) was set to transform the scores into classifications. Detections with IOU values above the threshold were considered positive predictions, and those with values below the threshold were considered negative predictions.
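The IOU computation and the thresholding step can be sketched in a few lines. The box format `(xmin, ymin, xmax, ymax)` and the function names are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(ground_truth, prediction, threshold=0.5):
    """A detection counts as positive when its IOU lies above the threshold."""
    return iou(ground_truth, prediction) > threshold


ground_truth = (0, 0, 10, 10)
good_prediction = (1, 1, 10, 10)
shifted_prediction = (5, 0, 15, 10)
print(iou(ground_truth, good_prediction), is_positive(ground_truth, good_prediction))
print(iou(ground_truth, shifted_prediction), is_positive(ground_truth, shifted_prediction))
```

In this example the prediction shifted by half a box width overlaps the ground truth over one third of the union and is therefore rejected at a threshold of 0.5.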

Average Precision (AP): Average precision is another way to evaluate object detectors. It is a numerical metric defined as the precision averaged across recall values between 0 and 1. An 11-point interpolation technique is used to calculate the AP. It can be interpreted as the area under the precision-recall curve.
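The 11-point interpolation can be sketched as follows: for each recall level 0, 0.1, ..., 1.0, the highest precision observed at that recall or beyond is taken, and the eleven values are averaged. The function name and the input format (parallel lists of recall and precision values) are assumptions for this illustration, and the numbers in the example are made up.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated average precision from paired recall/precision values."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Illustrative curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
example_ap = average_precision_11pt([0.5, 1.0], [1.0, 0.5])
print(round(example_ap, 4))
```

Recall levels that the detector never reaches contribute a precision of zero, so missed objects lower the AP even when every reported detection is correct.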

Mean Average Precision (mAP): The mAP is another widely accepted metric to evaluate object detectors. It is simply the mean of the AP values, i.e., the AP is computed for each class and then averaged over the classes. TensorBoard is used to visualize the mAP and AP values at different thresholds. The results are discussed in the next sections.
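Averaging over classes, and additionally over IOU thresholds as in the COCO-style mAP at [0.5:0.95], can be sketched as below. The AP values in the example are placeholders chosen for illustration and are not results from this work; the function names are likewise assumptions.

```python
def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values at a single IOU threshold."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def map_over_thresholds(ap_by_threshold):
    """Average the per-threshold mAP over all IOU thresholds (COCO-style)."""
    per_threshold = [mean_average_precision(aps) for aps in ap_by_threshold.values()]
    return sum(per_threshold) / len(per_threshold)


# The ten IOU thresholds 0.50, 0.55, ..., 0.95 used for mAP at [0.5:0.95].
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Placeholder AP values (NOT measured results): AP shrinks as the threshold rises.
placeholder_aps = {
    t: {"plant": max(0.0, 0.9 - t), "weed": max(0.0, 1.0 - t)} for t in COCO_THRESHOLDS
}

print(mean_average_precision(placeholder_aps[0.5]))
print(map_over_thresholds(placeholder_aps))
```

Because strict thresholds such as 0.95 demand almost perfect localization, the mAP averaged over [0.5:0.95] is generally much lower than the mAP at the single 0.5 threshold.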

### 3.4.5. Stem Position Extraction

Extracting the position of the stem is essential for the robotic manipulator to manage weeds precisely. It can be done using semantic segmentation techniques, as described by Lottes et al. [100]. That approach, however, is computationally expensive and at best predictive. In this work, a simple stem position extraction technique based on bounding box localization was formulated and proposed. Plants usually exhibit radial or bilateral symmetry, and plants that are anchored to a single location exhibit an overall roughly radial symmetry. Based on that fact, the center point of the detected bounding box around the weed is taken as the estimated stem position in the image frame. The accuracy of the estimated stem position depends directly on how well the bounding box regressor localizes the complete weed or plant structure.
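The stem estimate described above reduces to computing the center of the detected bounding box. The sketch below assumes pixel boxes in the form `(xmin, ymin, xmax, ymax)`; the second helper additionally assumes normalized boxes in `(ymin, xmin, ymax, xmax)` order, the layout commonly returned by the TensorFlow object detection API, and both function names are illustrative.

```python
def estimate_stem_position(box):
    """Center of a pixel box (xmin, ymin, xmax, ymax), used as the stem estimate."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


def stem_from_normalized_box(box, image_width, image_height):
    """Stem estimate in pixels from a normalized (ymin, xmin, ymax, xmax) box."""
    ymin, xmin, ymax, xmax = box
    pixel_box = (xmin * image_width, ymin * image_height,
                 xmax * image_width, ymax * image_height)
    return estimate_stem_position(pixel_box)


print(estimate_stem_position((120, 80, 420, 380)))
print(stem_from_normalized_box((0.25, 0.25, 0.75, 0.5), 800, 600))
```

The estimate is only as good as the box: if part of the weed lies outside the predicted box, the center shifts away from the true stem by half of the missing extent.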

### **4. Results and Discussions**

### *4.1. Training*

TensorBoard is a powerful visualization tool for evaluating model performance. It was utilized in this work to obtain the graphs and analyze the results. We consider the COCO mAP at [0.5:0.95] IOU and the mAP at a 0.5 IOU threshold to evaluate the model's performance.

### 4.1.1. Case 1: Configuration 1

In this case, we considered learning rate configuration 1. With that configuration, the Faster R-CNN inceptionv2 COCO model was trained and fine-tuned for up to 200 k steps. The model performed considerably well and reached a maximum overall mAP[0.5:0.95]IOU of 30.94% at the 149.6 k step (Figure 18). The maximum mAP at the 0.5 IOU threshold was 61.5%, also at the 149.6 k step (Figure 19). At the 200 k step, the overall mAP[0.5:0.95]IOU was 30.57% and the mAP at the 0.5 IOU threshold was 61.29%. These values were considered suitable given the comparatively small amount of data that the model was trained with. The graphs corresponding to the model performance are shown in the following figures. Graphs were generated at a smoothing value of 0.6 to show the overall trend of the training and evaluation process.

**Figure 18.** Overall mean Average Precision (mAP) at (0.5:0.95) IOU, X-axis: steps, Y-axis: mean average precision.

**Figure 19.** mAP at 0.5 IOU, X-axis: steps, Y-axis: mean average precision.

Steps on *X*-axis: One gradient update is counted as one training or evaluation step (iteration). During each step, one batch of images is processed. For instance, with 200 images and a batch size of 1 in the training configuration, one image is processed per step and the gradients are updated once. The model therefore takes 200 steps to process the entire dataset; once it has done so, the model has completed one epoch.

*Y*-axis: The *Y*-axis in the following graphs corresponds to the respective losses and mAP of the model.
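The relation between steps, batch size, and epochs described above amounts to simple arithmetic, sketched below. The function names are hypothetical, and the numbers (200 images, batch size 1) are the illustrative figures from the text, not necessarily the size of the actual training set.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of gradient updates needed to see every image once."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    # A final, partially filled batch still costs one step.
    return math.ceil(num_images / batch_size)


def epochs_completed(total_steps, num_images, batch_size):
    """Number of passes over the dataset after total_steps updates."""
    return total_steps / steps_per_epoch(num_images, batch_size)


# Example from the text: 200 images, one image per step.
per_epoch = steps_per_epoch(num_images=200, batch_size=1)
# Under these illustrative numbers only, 200 k steps would be 1000 epochs.
epochs = epochs_completed(total_steps=200_000, num_images=200, batch_size=1)
```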

Figures 20 and 21 show the training and evaluation loss. The model performs well, as both loss curves decrease without large variations. The *X*-axis represents the number of training and evaluation steps, and the *Y*-axis represents the training and evaluation loss recorded at each step, respectively. At approximately the 150 k step, the model's total evaluation loss reached a minimum of 0.61; after that, a very slight increase in the loss values was observed, which indicates that the model was slowly beginning to overfit and that further training may not be worthwhile. Training was stopped at 200 k steps, and the checkpoint recorded nearest to the 200 k step was exported and used for inference. This was cross-checked against the configuration of the pre-trained model, which indicated that 200 k steps were sufficient for the model to perform well. Although the model performed quite well on new, unknown images, there was scope for optimizing it by tuning its hyperparameters.

**Figure 20.** Training loss, X-axis: steps, Y-axis: training loss.
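The stopping reasoning above (a minimum in the evaluation loss followed by a slight rise) can be expressed as a small check over the logged evaluation losses. This is a sketch under stated assumptions: the helper names are hypothetical, and apart from the reported minimum of 0.61 near the 150 k step, the loss values in the example are placeholders, not measurements from the study.

```python
def best_checkpoint(eval_history):
    """Return the (step, eval_loss) pair with the lowest evaluation loss."""
    if not eval_history:
        raise ValueError("eval_history is empty")
    return min(eval_history, key=lambda item: item[1])


def loss_rising_since_best(eval_history, tolerance=0.0):
    """True if every evaluation after the best step has a higher loss.

    A True result is a hint (not proof) that the model has started to overfit.
    """
    best_step, best_loss = best_checkpoint(eval_history)
    later = [loss for step, loss in eval_history if step > best_step]
    return bool(later) and all(loss > best_loss + tolerance for loss in later)


# Placeholder history; only 0.61 near the 150 k step is reported in the text.
history = [(50_000, 0.80), (100_000, 0.66), (150_000, 0.61), (200_000, 0.63)]
step, loss = best_checkpoint(history)
overfitting_hint = loss_rising_since_best(history)
```

Under such a rule, the checkpoint nearest the loss minimum would be the natural candidate for export, whereas the study exported the checkpoint at the 200 k step.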

**Figure 21.** Total evaluation loss, X-axis: steps, Y-axis: evaluation loss.

One observation from the analysis and experiments was that, when very little data is available, data augmentation techniques such as flipping the images can help increase the mAP. Transfer learning was evaluated and found to be helpful and quick when training on a new classification task, compared with training the network from scratch or initializing it with random weights. Hyperparameters such as the learning rate can be tuned further to increase the mAP. With powerful graphics processing units, a grid search or random search can help find optimal hyperparameters, but the process may be computationally expensive and time-consuming.

To further establish that the model was well tuned, the losses of the RPN and the final classifier were also considered. In Figures 22 and 23, the decreasing trend of the box classifier's classification and localization losses indicates that the final classifier is good at classifying and localizing the detected plant and weed objects. In these figures, the *X*-axis represents the number of evaluation steps and the *Y*-axis represents the classification loss and the localization loss, respectively, recorded at each step.
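As a sketch of the image-flipping augmentation mentioned above, the following function mirrors an image horizontally and adjusts its bounding boxes accordingly. It is a plain-Python illustration under assumed conventions (images as nested lists of pixel rows, boxes as `(xmin, ymin, xmax, ymax)` in pixel-edge coordinates); it is not the augmentation code used in the study.

```python
def flip_horizontal(image, boxes):
    """Mirror an image left-to-right and update its bounding boxes.

    image: list of rows, each row a list of pixel values.
    boxes: list of (xmin, ymin, xmax, ymax) tuples in pixel-edge coordinates,
           with 0 <= xmin <= xmax <= image width.
    """
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # After mirroring, the old right edge becomes the new left edge.
    flipped_boxes = [
        (width - xmax, ymin, width - xmin, ymax)
        for (xmin, ymin, xmax, ymax) in boxes
    ]
    return flipped_image, flipped_boxes


# Illustrative 2 x 4 single-channel image with one box on the left side.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]
flipped_image, flipped_boxes = flip_horizontal(image, boxes)
```

Flipping twice returns the original image and boxes, which is a convenient sanity check when annotations are transformed together with the images.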

**Figure 22.** BoxClassifier: classification loss, X-axis: steps, Y-axis: classification loss.

**Figure 23.** BoxClassifier: localisation loss, X-axis: steps, Y-axis: localisation loss.

The final ground truths and detections for weeds and plants of various sizes at the 200 k evaluation step are presented in Figures 24–26, corresponding to common dandelion (weed), garden cress, and radish, respectively. The model produced detections with high scores (88–99%).

**Figure 24.** Left: detection (97%), Right: ground truth (larger object).

**Figure 25.** Left: detection (98%), Right: ground truth (smaller object).

**Figure 26.** Left: detection (88–99%), Right: ground truth (medium-sized object).

### 4.1.2. Case 2: Configuration 2

In this case, we considered learning rate configuration 2. With this configuration, the training process was faster and achieved higher mAP values. With fewer steps, however, the results were not better than those obtained with learning rate configuration 1. Moreover, the model was overfitting and tending to memorize the training data when trained for longer, which was one of the reasons it achieved a higher overall mAP of 34.82% at (0.5:0.95) IOU (Figure 27) and an mAP of 63% at the 0.5 IOU threshold at the 200 k step (Figure 28). The graphs from the training and evaluation process are shown below; they were generated with a smoothing value of 0.6 to show the overall trend of the training and evaluation process.

**Figure 27.** Overall mAP at (0.5:0.95) IOU, X-axis: steps, Y-axis: mean average precision.

**Figure 28.** mAP at 0.5 IOU, X-axis: steps, Y-axis: mean average precision.

As the loss curves in Figures 29 and 30 show, the localization loss increases after the 60 k step. Ideally, all loss curves should show a decreasing trend, and large deviations in any loss are considered unsuitable for generalization. Accordingly, training in this case should be stopped at that point. The chosen learning rate hyperparameter may therefore not be ideal for inference compared with case 1. The case 1 results were thus used for inference, and they are discussed in the following section.

**Figure 29.** Training loss, X-axis: steps, Y-axis: training loss.

**Figure 30.** Total evaluation loss, X-axis: steps, Y-axis: evaluation loss.

### *4.2. Plant and Weed Identification*

After the model was trained, it was used for inference on real-time data for plant and weed identification. To run inference on a new set of images, the model was saved and exported. The frozen graph was exported with the TensorFlow Object Detection API's built-in "export_inference_graph.py" script, which was modified for our task. The same hardware setup used for training and a Logitech stereo camera were used for real-time identification of plants and weeds. A completely new set of images was provided for predictions. The predicted output images are shown in Figures 31–33.

**Figure 31.** Plant and weed identification: detection of weed (Left: 92%; Right: 90%).

**Figure 32.** Plant and weed identification in black soil.

**Figure 33.** Plant and weed identification in brown soil under artificial light.
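For the inference step described in this section, the raw detector outputs usually need a small amount of post-processing before they can be drawn or passed on. The sketch below assumes the output convention of the TensorFlow Object Detection API, in which boxes are normalized `[ymin, xmin, ymax, xmax]` values accompanied by scores and class ids; the function name, the 0.5 score threshold, the class ids, and the example values are assumptions for illustration only.

```python
def to_pixel_detections(boxes, scores, classes, image_width, image_height,
                        min_score=0.5):
    """Keep confident detections and convert their boxes to pixel units.

    boxes:   iterable of normalized (ymin, xmin, ymax, xmax) in [0, 1].
    scores:  iterable of confidence scores in [0, 1].
    classes: iterable of integer class ids (their meaning is model-specific).
    Returns a list of dicts with pixel boxes as (xmin, ymin, xmax, ymax).
    """
    detections = []
    for (ymin, xmin, ymax, xmax), score, class_id in zip(boxes, scores, classes):
        if score < min_score:
            continue  # Drop low-confidence detections (threshold is assumed).
        detections.append({
            "class_id": int(class_id),
            "score": float(score),
            "box_px": (xmin * image_width, ymin * image_height,
                       xmax * image_width, ymax * image_height),
        })
    return detections


# Illustrative outputs for a 640 x 480 frame (values are made up).
example = to_pixel_detections(
    boxes=[(0.25, 0.5, 0.75, 1.0), (0.0, 0.0, 1.0, 1.0)],
    scores=[0.92, 0.30],
    classes=[1, 2],
    image_width=640,
    image_height=480,
)
```

The resulting pixel boxes are in the form needed for the bounding-box-based stem estimation discussed in this paper.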

### *4.3. Extracted Stem Positions*

The previously described stem estimation technique was tested in real time. The estimated stem positions were close (83–97%) to the actual stem positions of the weeds. The extracted stem positions in the image frame are presented in Figure 34.

**Figure 34.** Estimated stem position of the weeds and plants using the trained robot.

### *4.4. Discussion of Results*

The use of convolutional neural network-based models has been reported in different areas of agriculture, including disease identification, classification on the basis of ripeness of fruits, plant recognition using leaf images, and identification of weeds [35,101–104]. The application of convolutional neural networks (CNNs) using the transfer learning technique has also been reported in recent literature in the case of crop/fruit (age) classification. Perez-Perez et al. (2021) reported an accuracy of 99.32% in the identification of different ripening stages of Medjoul dates [35]. That work points to the possibility of tuning the hyperparameters to achieve higher performance with the proposed weeding robot, as mentioned regarding the results of the current study. In recent years, other studies have reported classification of plants through plant and leaf image recognition using convolutional neural networks with accuracies of up to 99% [103,104]. Sladojevic et al. (2016) reported the use of CNNs for disease recognition by leaf image classification with a precision of up to 98% [102].

With respect to the classification of different plant species with the aim of site-specific weed management, Dyrmann et al. (2016) trained a CNN on a set of 10,413 images of 22 different plant species and achieved a classification accuracy of up to 86% [101]. Although the number of species classified in that study was high, the images in the dataset were of plants at the same growth stage, i.e., the seedling stage [101]. This makes the classification easier because the plant and leaf structures are similar, and hence higher accuracies are expected. In weed removal applications, however, multiple weeding procedures might be needed at different times during a crop season; hence, in the current study, the neural network was trained with images of plants and weeds at different growth stages. A similar methodology was reported in a recent study in which a crop field at two different growth stages was used to train the neural network, achieving an accuracy of 99.48% [105]. The classification accuracies achieved in the current study hence fall in the range of accuracies found in various studies reported in recent literature. The current study adds further value to the research by reporting the mean Average Precision (mAP) of the object detection tasks performed by the trained model. The mAP is an important metric for evaluating object detection models, as it covers both classification and localization tasks. Table 2 gives an overview of three other studies on CNNs for plant/weed/fruit classification that have reported comparable results, together with the current study.

**Table 2.** Comparison of studies with reported training of CNNs for plant classification and identification tasks.


### **5. Conclusions**

The weed identifier robot is proposed as a non-chemical solution to the rampant problem of weed infestation in food crop farming systems. A plant and weed identification system using deep learning and state-of-the-art object detection methods was researched and implemented. The transfer learning technique was explored, and the deep learning model was further analysed, evaluated, and justified for better generalization. Deep learning architectures were seen to perform much better than conventional machine learning architectures in terms of image identification and predictive performance. A simple stem estimation technique was proposed that extracts the stem positions in the image frame. The paper also offers a high-level hardware and software design architecture proposal for a cost-effective autonomous weeding robot.

The developed plant and weed identification system was tested on real-world data, and good confidence scores for classification and identification were achieved. It can be concluded that higher mAP values could be achieved with more training steps and the right hyperparameters. Real-time identification was carried out using a Logitech web camera, and the model was observed to identify and distinguish between plants and weeds well. The stem position estimation approach was tested, and its accuracy was found to depend directly on the bounding box localization during identification. Based on our observation, we conclude that this technique also reduces the amount of computation compared with other methods. In addition to building the prototype and conducting validation studies, future work in this direction could include investigating methods to find the right hyperparameters for optimizing the identification function of the robot. Further studies could explore 3D position estimation methods to transform the position of the center of the identified weed from the 2D image frame to the real-world robot frame.
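The dependence of the stem position estimate on bounding box localization can be illustrated with a short sketch. It assumes that the stem position is taken as the geometric center of the predicted bounding box in the image frame; the function names and the coordinate conventions (corner-based boxes, normalized coordinates in the range 0 to 1) are hypothetical and serve only as an illustration.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def to_pixels(normalized_box: Box, image_width: int, image_height: int) -> Box:
    """Scale a box given in normalized [0, 1] coordinates to pixel units."""
    x_min, y_min, x_max, y_max = normalized_box
    return (
        x_min * image_width,
        y_min * image_height,
        x_max * image_width,
        y_max * image_height,
    )


def estimate_stem_position(box: Box) -> Tuple[float, float]:
    """Estimate the stem position as the center of the bounding box.

    Assumption for illustration: the weed grows roughly symmetrically
    around its stem when seen from above, so the box center approximates
    the stem location in the 2D image frame.
    """
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


# Usage: a weed detected in a 640 x 480 frame.
pixel_box = to_pixels((0.25, 0.25, 0.75, 0.75), 640, 480)
stem_x, stem_y = estimate_stem_position(pixel_box)
```

Because the estimate is a pair of averages over the box corners, it adds almost no computation to the detection step, and any shift in the predicted box shifts the estimated stem position by a corresponding amount.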

**Author Contributions:** D.P.B.N. did the main work in the research, through the programming, experiments and implementation parts of the work presented in this paper. T.M.S. did the main work in writing and putting together the contents of this manuscript, in addition to supervising the experiments. R.O. ideated and supervised the research work and gave feedback during the course of the research. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data generated in this study are presented in the article. For any clarifications, please contact the corresponding author.

**Acknowledgments:** We acknowledge support for the Open Access fees by Hamburg University of Technology (TUHH) in the funding programme Open Access Publishing. We acknowledge the support of Hamburg Open Online University (HOOU) for the grant to develop the prototype of this robot. We acknowledge the Institute of Reliability Engineering, TUHH, for their logistical support in this research. We acknowledge the suggestions made by the editors and reviewers that led to vast improvements in the quality of the submitted manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

