*Article* **Simulation-Aided Development of a CNN-Based Vision Module for Plant Detection: Effect of Travel Velocity, Inferencing Speed, and Camera Configurations**

**Paolo Rommel Sanchez 1,2 and Hong Zhang 1,\***


**Abstract:** In recent years, Convolutional Neural Networks (CNNs) have become an attractive method to recognize and localize plant species in unstructured agricultural environments. However, developed systems suffer from unoptimized combinations of the CNN model, computer hardware, camera configuration, and travel velocity to prevent missed detections. Missed detection occurs if the camera does not capture a plant due to slow inferencing speed or fast travel velocity. Furthermore, modularity has received less focus in Machine Vision System (MVS) development, even though a modular MVS can reduce development effort by allowing scalability and reusability. This study proposes a derived parameter, called the overlapping rate (*ro*), defined as the ratio of the product of the camera field of view (*S*) and inferencing speed (*fps*) to the travel velocity (*v*), to theoretically predict the plant detection rate (*rd*) of an MVS and aid in developing a CNN-based vision module. Using performance from existing MVS, the values of *ro* at different combinations of inferencing speeds (2.4 to 22 *fps*) and travel velocities (0.1 to 2.5 m/s) at a 0.5 m field of view were calculated. The results showed that missed detections occurred when *ro* was less than 1. Comparing the theoretical detection rate (*rd*,*th*) to the simulated detection rate (*rd*,*sim*) showed that *rd*,*th* had a 20% margin of error in predicting plant detection rate at very low travel distances (<1 m), but there was no margin of error when travel distance was sufficient to complete a detection pattern cycle (≥10 m). The simulation results also showed that increasing *S* or having multiple vision modules reduced missed detection by increasing the allowable *vmax*. The number of needed vision modules was equal to the inverse of *ro*, rounded up. Finally, a vision module that utilized SSD MobileNetV1 with an average effective inferencing speed of 16 *fps* was simulated, developed, and tested. Results showed that the *rd*,*th* and *rd*,*sim* had no margin of error in predicting *ractual* of the vision module at the tested travel velocities (0.1 to 0.3 m/s). Thus, the results of this study showed that *ro* can be used to predict *rd* and optimize the design of a CNN-based vision-equipped robot for plant detections in agricultural field operations with no margin of error at sufficient travel distance.

**Keywords:** modeling; simulation; precision agriculture; convolutional neural networks; machine vision; computer vision; modular robot

#### **1. Introduction**

The increasing cost and decreasing availability of agricultural labor [1,2] and the need for sustainable farming methods [3–5] led to the development of robots for agricultural field operations. However, despite the success of robots in industrial applications, agricultural robots for field operations remain primarily in the development stage due to the complex characteristics of the farming environment, high cost of development, and high durability, functionality, and reliability requirements [6–8].

**Citation:** Sanchez, P.R.; Zhang, H. Simulation-Aided Development of a CNN-Based Vision Module for Plant Detection: Effect of Travel Velocity, Inferencing Speed, and Camera Configurations. *Appl. Sci.* **2022**, *12*, 1260. https://doi.org/10.3390/app12031260

Academic Editors: Anselme Muzirafuti and Dimitrios S. Paraforos

Received: 15 December 2021 Accepted: 21 January 2022 Published: 25 January 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

As a potential solution to these challenges, field robots with computer vision have been increasing due to the large amount of information that can be extracted from an agricultural scene [9]. Real-time machine vision systems (MVS) are often used to recognize, classify and localize plants accurately for precision spraying [10,11], mechanical weeding [12], solid fertilizer application [13], and harvesting [14,15]. However, using traditional image processing techniques, early machine vision implementations for field operations were difficult due to the vast number of features needed to model and differentiate plant species [13] and work at various farm scenarios [12].

Recently, developments in deep learning allowed Convolutional Neural Networks (CNN) to be used for accurate plant species detection and segmentation [16,17]. However, despite high classification and detection performance, the large computational power requirement of CNN limits its application in real-time operations [18]. As a result, most CNN applications in agriculture were primarily employed in non-real-time scenarios, excluding inferencing speeds in the evaluated parameters among related studies. For example, a study that surveyed CNN-based weed detection and plant species classification reported 86–97% and 48–99% precisions, respectively, but data on inferencing speeds were unreported [19]. Similarly, research on fruit classification and recognition using CNN showed 77–99% precision, but inferencing speeds were also excluded in the measured parameters [19–21].

Few studies have evaluated the real-time performance of CNNs for agricultural applications. For example, in a study by Olsen et al. (2019) [22] on detecting different species of weeds, the real-time performance of ResNet-50 in an NVIDIA Jetson TX2 was only 5.5 *fps* at 95.1% precision. Optimizing their TensorFlow model using TensorRT increased the inferencing speed to 18.7 *fps*. The study of Chechliński et al. (2019) [23] using a custom CNN architecture based on U-Net, MobileNets, DenseNet, and ResNet in a Raspberry Pi 3B+ resulted in only 60.2% precision at 10.0 *fps*. Finally, Partel et al. (2019) [11] developed a mobile robotic sprayer that used YOLOv3 running on an NVIDIA 1070TI at 22 *fps* and NVIDIA Jetson TX2 at 2.4 *fps* for real-time crop and weed detection and spraying. The vision system with 1070TI had 85% precision, while the TX2 vision system had 77% precision. Furthermore, the system with TX2 missed 43% of the plants because of its slow inferencing speed. Conversely, the CNN-based sprayer with the faster 1070TI, resulting in higher inferencing speed, only missed 8% of the plants.

However, the travel velocities of surveyed systems were often unevaluated despite operating in real-time. Additionally, theoretical approaches to quantify and account for the effect of travel velocity on the capability of the vision system to sufficiently capture discrete data and effectively represent a continuous field scenario were often not included in the design. Other studies, on the contrary, evaluated the effect of travel velocity on CNN detection performance, but the impact of inferencing speed remained unevaluated. For instance, in the study of Liu et al. (2021) [24], the deep learning-based variable rate agrochemical spraying system for targeted weeds control in strawberry crops equipped with a 1080TI showed increasing missed detections, as travel velocity increases regardless of CNN architecture (VGG-16, GoogleNet or AlexNet). At 1, 3, and 5 km/h, their system missed 5–9%, 6–10%, and 13–17% of the targeted weeds, respectively.

The brief review of the developed systems showed that inferencing speed (*fps*) and travel velocity (*v*) of a CNN-based MVS impact its detection rate (*rd*). *rd* is the fraction of the number of detected plants to the total number of plants and was often expressed as the recall of the CNN model [11,24,25]. However, a theoretical approach to predict the *rd* of a CNN-based MVS as affected by both *fps* and *v* is yet to be explored.

Current approaches in developing a mechatronic system with CNN-based MVS involve building and testing an actual system to determine *rd* [11,24,25]. However, building and testing several CNN-based MVS to determine the effect of *v* and *fps* on *rd* would be very tedious as the process would involve building the MVS hardware, image dataset preparation, training multiple CNN models, developing multiple software frameworks to integrate the different CNN models into the vision system, and testing the MVS. Hence, this study also proposes to use computer simulation, in addition to theoretical modeling, in predicting *rd* as a function of the mentioned parameters, to reduce the difficulty of selecting and sizing the components of the CNN-based MVS.

Computer simulations were often used to characterize the effect of design and operating parameters on the overall performance of agricultural robots for field operations. For example, Villette et al. (2021) [26] demonstrated that computer simulations could be used to estimate the required sprayer spatial resolution of a vision-equipped boom sprayer, as affected by boom section weeds, nozzle spray patterns, and spatial weed distribution. Wang et al. (2018) [27] used computer modeling and simulation to identify potential problems of a robotic apple picking arm and developed an algorithm to improve the performance by 81%. Finally, Lehnert et al. (2019) [28] used modeling and computer simulation to create a novel multi-perspective visual servoing technique to detect the location of occluded peppers. However, the use of simulation to quantify the effects of *fps*, *v*, and camera configurations on the *rd* of an MVS remains unexplored.

Furthermore, a survey of review articles showed that almost all robotic systems in agriculture employ fixed configurations and are non-scalable, resulting in less adaptability to complex agricultural environments [7,8,29,30]. Thus, aside from modeling and simulation, this study also proposes a module-based design approach to enable reusability and scalability by minimizing production costs and shortening the lead time of machine development [31]. A module can be defined as a repeatable and reusable machine component that performs partial or full functions and interacts with other machine components, resulting in a new machine with new overall functionalities [32].

Therefore, this study proposes theoretical and simulation approaches in predicting combinations of *fps*, *v*, and camera configuration to prevent missed plant detections and aid in developing a modular CNN-based MVS. In particular, based on the brief literature review and identified research gaps, the study specifically aims to make the following contributions:


#### **2. Materials and Methods**

*2.1. Concept*

Cameras for plant detection are typically mounted on a boom of a sprayer or fertilizer spreader [10,11,13], as illustrated in Figure 1. They are oriented so that their optical axis is perpendicular to the field and captures top-view images of plants [33].

**Figure 1.** Camera mounting location and orientation in a boom (not drawn to scale).

Depending on the distance between the camera lens and captured plane, lens properties, and sensor size, the frame width or height is equivalent to an actual linear distance. For simplicity, the linear length of the side of a field of view of a frame parallel to the direction of travel shall be denoted as *S*, in meters per frame. To provide complete visual coverage of the traversed width of the boom, the maximum spacing between adjacent cameras is equal to the length of the side of a field of view of a frame perpendicular to the direction of travel, denoted as *W*, in meters per frame [10].

During motion, the traverse distance between two consecutive frames of the camera (*df*), in meters, is equal to the product of travel velocity (*v*), in m/s, and the time between the frames (1/*fps*), in seconds, as shown in Equation (1).

$$d\_f = \vec{v} \cdot \frac{1}{fps} = \frac{\vec{v}}{fps} \tag{1}$$

The ratio of *S* to *df* is proposed as the overlapping rate (*ro*), a dimensionless parameter, and is represented by Equation (2).

$$r\_o = \frac{S \times fps}{\vec{v}} = \frac{S}{d\_f} \tag{2}$$

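With a single camera, the value of *ro* describes whether there is an overlap or gap between frames. Depending on the value of *ro*, certain regions in the traversed field of the machine, with a single camera configuration, will be uniquely captured, captured in multiple frames, or completely missed, as shown in Figure 2 and illustrated by the following cases:

- Case 1: *ro* = 1 (*df* = *S*). Consecutive frames are adjacent, so each region is captured exactly once.
- Case 2: *ro* > 1 (*df* < *S*). Consecutive frames overlap, so some regions are captured in multiple frames.
- Case 3: *ro* < 1 (*df* > *S*). Gaps form between consecutive frames, so some regions are missed.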


**Figure 2.** Relative positions of two consecutive processed frames of a vision system: (**a**) *ro* = 1 since *df* = *S*, resulting in frames with unique bounded regions; (**b**) *ro* > 1 since *df* < *S*, resulting in an overlap region; and (**c**) *ro* < 1 since *df* > *S*, resulting in a missed region.

#### 2.1.1. Theoretical Detection Rate

The theoretical or maximum detection rate (*rd*,*th*) can be defined as min(1, *ro*), as shown in Equation (4). Like *ro*, *rd*,*th* is a dimensionless parameter.

$$r\_{d,th} = \begin{cases} 1, & r\_o \ge 1 \\ r\_o, & r\_o < 1 \end{cases} \tag{4}$$
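As a minimal sketch of Equations (2) and (4), the overlapping rate and the theoretical detection rate could be computed as follows. The function names are illustrative and are not taken from the published scripts.

```python
def overlapping_rate(S, fps, v):
    """Equation (2): r_o = S * fps / v (dimensionless)."""
    if v <= 0:
        raise ValueError("travel velocity must be positive")
    return S * fps / v


def theoretical_detection_rate(r_o):
    """Equation (4): r_d,th = min(1, r_o)."""
    return min(1.0, r_o)


# Example: S = 0.5 m, 2.4 fps, and 2.5 m/s (the sensitivity-analysis setting)
r_o = overlapping_rate(0.5, 2.4, 2.5)
r_d_th = theoretical_detection_rate(r_o)
print(round(r_o, 2), round(r_d_th, 2))  # 0.48 0.48
```

Because *ro* is below 1 in this example, gaps are expected between frames and at most about 48% of the field can be captured by a single camera.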

#### 2.1.2. Maximizing Travel Velocity

Setting *ro* = 1 in Equation (2) resulted in Equation (5), which is similar to the equation used by Esau et al. (2018) [10] in calculating the maximum travel velocity of a sprayer. However, Equation (5) only describes the maximum forward velocity *vmax* at which a vision-equipped robot can operate to prevent gaps while traversing the field as a function of *S* and *fps*.

$$\vec{v}\_{max} = S \times fps \tag{5}$$

#### 2.1.3. Increasing *vmax*

A consequence of Equation (5) was that increasing the length of the frame *S* at a constant *fps* shall increase *vmax*. Hence, raising the camera mounting height or using multiple adjacent synchronous cameras along a single plant row can increase the effective *S*. This situation, then, shall increase *vmax* without the need for powerful hardware for a faster inferencing speed. When *ro* < 1, the number of vision modules (*nvis*) to prevent missed detection can be calculated using Equation (6). Since *ro* represents the fraction of the field that can be covered by a single camera, the inverse of *ro* represents the number of adjacent cameras that will result in 100% field coverage. The calculated inverse was rounded up to the next whole number, as cameras are discrete elements.

$$n\_{vis} = \left\lceil \frac{1}{r\_o} \right\rceil \tag{6}$$

The effective actual ground distance (*Seff*), in meters, captured side-by-side by identical and synchronous vision modules without gaps and overlaps is equal to the product of *nvis* and *S*, as shown in Equation (7). This configuration will then allow the use of less powerful devices while operating at the required *v* of an agricultural field operation such as spraying, as illustrated in Figure 3.

$$S\_{eff} = S \times n\_{vis} \tag{7}$$

**Figure 3.** Multiple adjacent cameras for plant detection at *nvis* = 2. Thus, *Seff* = 2*S*.
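Equations (5) to (7) can be sketched in a few lines of Python. This is only an illustration under the paper's assumption of identical, synchronous modules placed side by side; the function names are hypothetical.

```python
import math


def max_travel_velocity(S, fps):
    """Equation (5): v_max = S * fps, in m/s."""
    return S * fps


def modules_needed(r_o):
    """Equation (6): n_vis = ceil(1 / r_o)."""
    return int(math.ceil(1.0 / r_o))


def effective_field_of_view(S, n_vis):
    """Equation (7): S_eff = S * n_vis, in meters."""
    return S * n_vis


S, fps, v = 0.5, 2.4, 2.5               # r_o = 0.48, so one camera leaves gaps
n_vis = modules_needed(S * fps / v)
S_eff = effective_field_of_view(S, n_vis)
print(n_vis, S_eff, max_travel_velocity(S_eff, fps))
```

For this setting, 1/*ro* is about 2.08, so three modules are needed, *Seff* becomes 1.5 m, and *vmax* rises from about 1.2 m/s to about 3.6 m/s.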

#### *2.2. Field Map Modeling*

A virtual field was prepared to test the concepts that were presented. A 1000 m field length (*dl*) with crops planted in hills at 0.2 m hill spacings (*dh*) was used. The number of hills (*nh*) and plant hill locations (*Xi*), in meters, in the entire field length were calculated using Equations (8) and (9), respectively. A section of the virtual field is presented in Figure 4.

$$n\_h = \left[\frac{d\_l}{d\_h}\right] \tag{8}$$

$$X\_i = i \times d\_h; \ i \in 1, 2, \dots, n\_h \tag{9}$$

**Figure 4.** Virtual field with field map and motion modeling parameters. The frame at *k* = 0 represents the frame just outside the virtual field. The frame at *k* = 1 represents the first frame that entered the virtual field.

#### *2.3. Motion Modeling*

The robot was assumed to move from right to left of the field during simulation, as shown in Figure 4. Therefore, the right border of the virtual area was the assumed field origin. The total number of frames (*K*) throughout the motion of the vision system then becomes the number of *df*-sized steps to completely traverse *dl*, as shown in Equation (10).

$$K = \frac{d\_l}{d\_f} \tag{10}$$

The elapsed time after several frame steps (*tk*), in seconds, was calculated by dividing the number of elapsed frames (*k*) by the inferencing speed, as shown in Equation (11). *tk* was then used to calculate the distance of the left (*do*,*k*) and right (*ds*,*k*) borders of the virtual camera frame with respect to the field origin, in meters, using kinematic equations as shown in Equations (12) and (13), respectively. In Equation (13), *S* was subtracted from *do*,*k* due to the assumed right to left motion of the camera.

$$k \in \{0, 1, 2, \dots, K\}$$

$$t\_k = k \times \frac{1}{fps} = \frac{k}{fps} \tag{11}$$

$$d\_{o,k} = \vec{v} \times t\_k \tag{12}$$

$$d\_{s,k} = d\_{o,k} - S \tag{13}$$
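The motion model in Equations (10) to (13) could be sketched as a generator of frame borders. This is an illustrative sketch, not the published script: rounding *K* up to a whole number of frames and the small tolerance used for that rounding are assumptions.

```python
import math


def frame_borders(v, fps, S, d_l):
    """Yield (k, t_k, d_o_k, d_s_k) for every processed frame along d_l."""
    d_f = v / fps                          # Equation (1)
    # Equation (10); rounding up to whole frames is an assumption
    K = int(math.ceil(d_l / d_f - 1e-9))
    for k in range(K + 1):
        t_k = k / fps                      # Equation (11)
        d_o = v * t_k                      # Equation (12): left (leading) border
        d_s = d_o - S                      # Equation (13): right (trailing) border
        yield k, t_k, d_o, d_s


frames = list(frame_borders(v=2.5, fps=2.4, S=0.5, d_l=10.0))
print(len(frames), frames[0])  # 11 (0, 0.0, 0.0, -0.5)
```

The frame at *k* = 0 spans −0.5 m to 0 m, which matches the frame just outside the virtual field described for Figure 4.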

#### *2.4. Detection Algorithm*

The simulation was implemented using two Python scripts, which were made publicly available in GitHub. The first script, called "settings.py", was a library that defined the "Settings" object class. This object contained the properties of the virtual field, kinematic motion, and camera parameters for the detection. The second script, "vision-module.py", was a ROS node that published only the horizontal centroid coordinates of the plant hills that would be within the virtual camera frame. The central aspect of ROS was implementing a distributed architecture that allows synchronous or asynchronous communication of nodes [34]. Hence, the ROS software framework was used so that the written simulation scripts for the vision system can be used in simulating the performance and optimizing the code of a precision spot sprayer that was also being developed as part of the future implementation of this study.

When "vision-module.py" was executed, it initially loaded the "Settings" class and fetched the required parameters, including *Xi*, from "settings.py". The following algorithm was then implemented for the detection:

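1. Create an empty vector to store the indices of the detected plant hills.
2. For each frame *k* from 0 to *K*:
	- a. Calculate *tk*, *do*,*k*, and *ds*,*k*.
	- b. For each *i* within the number of hills *nh*:
		- i. Identify all *Xi* between the left border, *do*,*k*, and the right border, *ds*,*k*, as plant hills within the camera frame.
		- ii. Append the detected hill indices to the vector.
3. Remove duplicate indices from the vector and count the remaining elements as *nd*.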

In step 1, an empty vector was needed to store the indices of the detected plant hills. In step 2, each *k* frame represented a camera position as the vision system traversed along the field. Step 2a calculated the elapsed time and the left and right border locations of the frame as described in Section 2.3. The specific detection method was performed in Step 2b, which compared the current distance locations of the left and right bounds of the camera frame to the plant hill locations. The plant hill indices that satisfied Step 2b-i were then appended to the NumPy vector. The duplicates were filtered from the NumPy vector in Step 3, and the remaining elements were counted and stored in the integer variable *nd*. Finally, the simulated detection rate (*rd*,*sim*) of the vision system was then the quotient of *nd* and *nh* as shown in Equation (14).

$$r\_{d,sim} = \frac{n\_d}{n\_h} \tag{14}$$
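The detection steps described above can be sketched in plain Python without ROS or NumPy. This is not the published "vision-module.py"; it is a simplified stand-in under stated assumptions: both frame borders are treated as inclusive with a small tolerance `eps`, *nh* is obtained by rounding, *K* is rounded up, and the hill indices inside a frame are computed directly instead of looping over every hill. Results near frame borders may therefore differ slightly from those reported in the paper.

```python
import math


def simulate_detection_rate(v, fps, S, d_l=1000.0, d_h=0.2, eps=1e-9):
    """Simulated detection rate r_d,sim for a single camera."""
    n_h = int(round(d_l / d_h))            # Equation (8); rounding is an assumption
    d_f = v / fps                          # Equation (1)
    K = int(math.ceil(d_l / d_f - eps))    # Equation (10), rounded up
    detected = set()                       # Step 1: storage for hill indices
    for k in range(K + 1):                 # Step 2: one pass per processed frame
        t_k = k / fps                      # Equation (11)
        d_o = v * t_k                      # Equation (12)
        d_s = d_o - S                      # Equation (13)
        # Step 2b: hills with d_s <= X_i <= d_o, where X_i = i * d_h
        i_min = max(1, int(math.ceil((d_s - eps) / d_h)))
        i_max = min(n_h, int(math.floor((d_o + eps) / d_h)))
        detected.update(range(i_min, i_max + 1))
    n_d = len(detected)                    # Step 3: a set already drops duplicates
    return n_d / n_h                       # Equation (14)


# Sensitivity-style sweep over field length at 2.4 fps, 2.5 m/s, S = 0.5 m
for d_l in (1.0, 10.0, 100.0):
    print(d_l, simulate_detection_rate(2.5, 2.4, 0.5, d_l=d_l))
```

Using a set for the detected indices merges Steps 2b-ii and 3, since duplicates from overlapping frames are discarded on insertion.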

#### *2.5. Experimental Design*

A laptop (Lenovo ThinkPad T15g Gen 1) with Intel Core i7-10750H, 16 GB DDR4 RAM, and NVIDIA RTX 2080 Super was used in the computer simulation. The script was implemented using the Python 2.7 programming language and ROS Melodic Morenia in the Ubuntu 18.04 LTS operating system.

The simulation was performed at *S* = 0.5 m, based on the camera configuration of Chechliński et al. (2019) [23]. Sensitivity analysis was performed at *dl* values of 1, 10, 100, 1000, and 10,000 m. The literature review showed that 20,000 m was used in the study of Villette et al. (2021) [26]. However, the basis for the *dl* used in their study was not explained. Hence, sensitivity analysis was performed in this study to establish the sufficient *dl* that would not affect *rd*,*th* and *rd*,*sim*. The resulting values of *rd*,*sim* were compared to *rd*,*th*. Inferencing speed of 2.4 *fps* and travel velocity of 2.5 m/s were used for the sensitivity analysis to have an *ro* < 1 at *S* = 0.5 m. If a faster inferencing speed or slower travel velocity was used, *ro* could be equal to or greater than 1. Such a result would fall into Case 1 or 2 and could not be used for sensitivity analysis.

The model was then simulated at different values of *v* and *fps*, as shown in Table 1, to estimate the detection performance of combinations of CNN model, hardware, and *v*. Forward walking speeds using a knapsack sprayer typically ranged from 0.1 to 1.78 m/s [35–37]. On the other hand, the travel velocities of boom sprayers ranged from 0.7 to 2.5 m/s [38–41]. Solid fertilizer application using a tractor-mounted spreader operated at 0.89 to 1.68 m/s [13,42]. Finally, a mechanical weeder with rotating mechanisms worked at 0.28–1.67 m/s [43,44]. The literature review showed that 0.1 m/s was the slowest [41] and 2.5 m/s was the highest [40] forward velocities found. The mid-point velocity of 1.3 m/s estimated the typical walking speed using knapsack sprayers [35–37] and forward travel velocities of boom sprayers and fertilizer applicators [38–41].

**Table 1.** Complete factorial design to determine the detection rate of a CNN-based vision system for agricultural field operation using simulation.


In addition, 2.4 and 22 *fps* were the inferencing speeds of YOLOv3 running on an NVIDIA TX2 embedded system and a laptop with NVIDIA 1070TI discrete GPU, respectively, as described in the study of Partel et al. (2019) [11]. Finally, 12.2 *fps* approximated the inferencing speed of a custom CNN architecture or SSD MobileNetV1 CNN model optimized in TensorRT and implemented on an embedded system [22,23].
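As a rough illustration of the factorial design, *ro* can be evaluated for every combination of the three travel velocities and three inferencing speeds named above at *S* = 0.5 m. The variable names are illustrative, and the sketch does not reproduce the treatment numbering of Table 1.

```python
from itertools import product

S = 0.5                          # field of view along the travel direction, m
velocities = [0.1, 1.3, 2.5]     # travel velocities, m/s
speeds = [2.4, 12.2, 22.0]       # inferencing speeds, fps

# r_o for every (v, fps) pair; pairs with r_o < 1 are expected to miss plants
grid = {(v, fps): S * fps / v for v, fps in product(velocities, speeds)}
gaps = sorted(pair for pair, r_o in grid.items() if r_o < 1)
print(gaps)  # [(1.3, 2.4), (2.5, 2.4)]
```

Only the two combinations at 2.4 *fps* with 1.3 and 2.5 m/s give *ro* < 1, so these are the settings where Equation (4) predicts missed detections.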

The effect of increasing *S* using multiple camera modules in preventing missed detection was also performed on treatments falling under Case 3.

#### *2.6. Vision Module Development*

The development of the vision module was divided into three phases: (1) hardware and software development; (2) dataset preparation and training of the CNN model; and (3) simulation and testing.

#### 2.6.1. Hardware and Software Development

Table 2 summarizes the list and function of the hardware components used to develop the vision module. NVIDIA Jetson Nano with 4 GB RAM was used to perform inferencing on 1280 × 720 video at 30 *fps* from a USB webcam (Logitech StreamCam Plus). The whole system was powered by a power adapter that outputs 5 VDC at 4 A.

**Table 2.** Summary of vision module hardware.


Table 3 summarizes the software packages used to develop the software framework of the vision module. The software for the vision module was written in Python 2.7. The detectnet object class of the Jetson Inference Application Programming Interface (API) was used to develop the major components of the software framework. The detectnet object facilitated connecting to the webcam using GStreamer, optimizing the PyTorch-based SSD MobileNetV1 model into TensorRT, loading the model, performing inferences on the video stream from the webcam, image processing for drawing bounding boxes onto the processed frame, and displaying the frame. OpenCV is an open-source computer vision library focused on real-time applications. It was used to display the calculated speed of the vision module and convert the detectnet image format from red–green–blue–alpha (RGBA) to blue–green–red (BGR), which was the format needed by ROS for image transmission.

**Table 3.** Summary of vision module software.


To enable modularity, the software framework, as illustrated in Figure 5, was also implemented using ROS version Melodic Morenia, which was the version that was compatible with Ubuntu 18.04. A node is a virtual representation of a component that can send or receive messages directly from other nodes. The vision module or node required two inputs: (1) RGB video stream from a video capture device and (2) TensorRT-optimized SSD MobileNetV1 object detection model. It calculates and outputs four parameters, namely: (1) weed coordinates, (2) crop coordinates, (3) processed images, and (4) total delay time. Each parameter was published into its respective topics. Table 4 summarizes the datatype and the function of these outputs.

**Figure 5.** Software framework of the vision module.


#### 2.6.2. Dataset Preparation and Training of the CNN Model

Using the Jetson Inference library, a CNN model for plant detection was trained using SSD MobileNetV1 object detection architecture and PyTorch machine learning framework. A total of 2000 sample images of artificial potted plants at 1280 × 720 composed of 50% weeds and 50% plants were prepared. For CNN model training and validation, 80% and 20% of the datasets were used, respectively. A batch size of 4, base learning rate of 0.001, and momentum of 0.90 were used to train the model for 100 epochs (5000 iterations).

#### 2.6.3. Testing and Simulation

The performance requirement for the vision module was to avoid missed detections for spraying operations at walking speeds, which was 0.1 m/s at minimum [35–37]. The Jetson Nano and webcam were mounted at a height where *S* was equal to 0.79 m, as shown in Figure 6. *S* was determined so that the top projections of the potted plants were within the camera frame, and the camera and plants would not collide during motion. A conveyor belt equipped with a variable speed motor was used to reproduce the relative travel velocity of the vision system at 0.1, 0.2, and 0.3 m/s. A maximum of 0.3 m/s was used, since beyond this conveyor speed consistent *dh* at 0.2 m was difficult to achieve, even with three people performing the manual loading and unloading, as the potted artificial plants were traveling too fast.

A total of 60 potted plants were loaded onto the conveyor for each conveyor speed setting. Detection was carried out at a minimum confidence threshold of 0.5. Detected and correctly classified plants were considered true positives (*TP*), while detected and incorrectly classified plants were categorized as false positives (*FP*). Missed detections were classified as false negatives (*FN*). The precision (*pactual*) and recall (*ractual*) of the vision module were then determined using Equations (15) and (16), respectively.

$$p\_{actual} = \frac{TP}{TP + FP} \tag{15}$$

$$r\_{actual} = \frac{TP}{TP + FN} \tag{16}$$
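Equations (15) and (16) translate directly into code. In the sketch below, the counts are hypothetical placeholders for one conveyor run of 60 plants and are not measured results; returning 0.0 for an empty denominator is also an assumption.

```python
def precision(tp, fp):
    """Equation (15): TP / (TP + FP); returns 0.0 if nothing was detected."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (16): TP / (TP + FN); returns 0.0 if there were no plants."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts (NOT measured data): 57 correct, 2 misclassified, 1 missed
tp, fp, fn = 57, 2, 1
print(round(precision(tp, fp), 3), round(recall(tp, fn), 3))
```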

**Figure 6.** Laboratory setup composed of Jetson Nano 4GB, Logitech StreamCam Plus, and variable speed conveyor belt with artificial potted plants.

#### **3. Results and Discussion**

The sensitivity of *rd*,*th* and *rd*,*sim* to the total traversed distance was first determined to establish the *dl* that was used in the experimental design. The influence of *v* and *fps* at specific *S* on *rd*,*th* and *rd*,*sim* were then compared and analyzed. Finally, the results of performance testing the vision module were compared to theoretical and simulation results.

#### *3.1. Sensitivity Analysis*

As illustrated in Figure 7, the sensitivity analysis results showed that *rd*,*th* and *rd*,*sim* converged at 10 m traversed distance. The 20% difference of *rd*,*th* from *rd*,*sim* can be attributed to the different variables considered to determine each parameter. *rd*,*th* used inferencing speed, travel velocity, and capture width to theoretically calculate the gaps between consecutive processed frames related to the detection rate, as illustrated in Section 2.1.

On the other hand, *rd*,*sim* determined the detection rate based on the number of unique plants within the processed frames, as influenced by traversed distance, hill spacing, inferencing speed, travel velocity, and capture width, as described in Sections 2.2–2.4. Results showed that simulation better approximated the detection rate than theoretical approaches at less than 10 m traversed distance, 0.2 m hill spacing, 2.5 m/s travel velocity, 2.4 *fps*, and 0.5 m frame capture width.

These results imply that at very short distances, *rd*,*sim* approximates the detection rate more accurately than *rd*,*th*. However, for long traversed distances, the influence of hill spacing on the detection rate was no longer significant and *rd*,*th* can simply be used to calculate the detection rate.

#### *3.2. Effects of v and fps*

Table 5 summarizes the theoretical and simulation results on the combinations of *v* and *fps* at *S* equal to 0.5 m. Comparing the *rd*,*th* to *rd*,*sim* for any combinations of the tested parameters showed that detection rates were equal. Results also showed that there were no missed detections at any *v* when the inferencing speeds were at 12.2 and 22 *fps* (Case 2), as illustrated in Figures 8 and 9. These results suggest that one-stage object detection models, such as YOLO and SSD, running on a discrete GPU such as 1070TI, have sufficient inferencing speed to avoid detection gaps in typical ranges of travel velocities for agricultural field operations. The result was also comparable to the 92% precision of the CNN-based MVS with 22 *fps* inferencing speed in the study of Partel et al. (2019) [11]. Therefore, using a one-stage CNN model such as YOLOv3 on a laptop with NVIDIA 1070TI GPU or better can provide sufficient inferencing speed to avoid gaps in different field operations. However, as mentioned in Section 1, the study did not report the travel velocity and field of view length of their setup. Thus, only an estimated performance comparison can be made.



**Figure 8.** Simulated plant hill detection rates of the vision system moving at 0.1, 1.3, and 2.5 m/s at different inferencing speeds (*fps*).

**Figure 9.** Simulated plant hill detection rates of the vision system operating at 2.4, 12.2, and 22 *fps* at different travel velocities.

The result of this study also agrees with the results of other studies with known *S*, *v*, and *fps*. In the study of Chechliński et al. (2019) [23], their CNN-based-vision spraying system had *S* = 0.55 m, *v* = 1.11 m/s, and *fps* = 10.0. Applying these values to Equation (2) also yields *ro* > 1 (Case 2), which correctly predicted their results of full-field coverage. In the study of Esau et al. (2018) [10], their vision-based spraying system had *S* = 0.28 m, *v* = 1.77 m/s, and *fps* = 6.67 and also falls under Case 2. Similarly, the vision-based robotic fertilizer application in the study of Chattha et al. (2018) [13] had *S* = 0.31 m, *v* = 0.89 m/s, and *fps* = 4.76. Again, calculating *ro* yielded Case 2, which also agrees with their results.
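The comparison with the cited systems can be reproduced with Equation (2). The sketch below uses only the *S*, *v*, and *fps* values quoted in the paragraph above; the dictionary layout is illustrative.

```python
# (S in m, v in m/s, fps) as quoted for each cited system
studies = {
    "Chechliński et al. (2019)": (0.55, 1.11, 10.0),
    "Esau et al. (2018)": (0.28, 1.77, 6.67),
    "Chattha et al. (2018)": (0.31, 0.89, 4.76),
}

# Equation (2): r_o = S * fps / v
overlap = {name: S * fps / v for name, (S, v, fps) in studies.items()}

for name, r_o in overlap.items():
    print(f"{name}: r_o = {r_o:.2f}, full coverage expected: {r_o >= 1}")
```

The resulting values are roughly 4.95, 1.06, and 1.66, so all three systems fall under Case 2. The system of Esau et al. operates closest to the *ro* = 1 limit.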

At 2.4 *fps*, the simulated MVS failed to detect some plant hills when *v* was 1.3 (Treatment 4) or 2.5 m/s (Treatment 7). In contrast, missed detections were absent at 0.1 m/s (Treatment 1). As mentioned in Section 2.5, treatments 1, 4, and 7 represent typical inferencing speeds of CNN models, such as YOLOv3 running in an embedded system, such as NVIDIA TX2 [11]. From these results, it can be inferred that unless CNN object detection models were optimized, such as illustrated in previous studies [23,45], MVS with embedded systems shall only be applicable for agricultural field operations at walking speeds.

Figure 10 illustrates the detected hills per camera frame along the first 10 m traversed distance of treatments simulated at 2.4 *fps* (Treatments 1, 4, and 7). From Figure 10, three pieces of information can be obtained: (1) absence of vertical gaps between consecutive frames; (2) horizontal overlaps among consecutive frames; and (3) detection pattern. In Figure 10a, the absence of vertical gaps at 0.1 m/s indicates that all the hills were detected as the vision system moved along the field length. The horizontal overlaps among consecutive frames also illustrate that a plant hill was captured by more than one processed frame. Finally, a detection pattern was observed to repeat every 24 consecutive frames or approximately every 1 m length. The length of the pattern was calculated by multiplying the number of frames to complete a cycle and *df*.

In contrast, the vertical gaps between some consecutive frames at 1.3 m/s, shown in Figure 10b, illustrate the missed detections. Horizontal overlaps were also absent, so each detected plant hill appeared in only one frame. The vision module traveled too fast and processed the captured frames too slowly for the set capture width, as demonstrated by the detection pattern of one missed plant hill every seven consecutive frames, or approximately every 3.8 m of traversed distance.

Similar results were observed at a travel velocity of 2.5 m/s, as shown in Figure 10c. However, the vertical gaps were more extensive than in Figure 10b because of the faster travel speed. The detection pattern showed that 14 plant hills went undetected by the vision system every five frames, or approximately every 5.21 m of traversed distance. This 5.21 m pattern further explains the difference between *rd*,*th* and *rd*,*sim* in the sensitivity analysis in Section 3.1 when the traversed distance was only 1 m. A complete detection pattern was already formed when the distance was more than 10 m, resulting in better detection rate estimates.

From these results, two vital insights can be drawn. First, at *ro* < 1, *rd*,*th* has a margin of error when the traversed distance is shorter than the length of the detection pattern. Second, concerning future studies, object tracking algorithms such as Euclidean-distance-based tracking [46], which require an object to be present in at least two frames, would not be applicable when *ro* ≤ 1. Hence, the importance of *ro* > 1 in MVS designs is further emphasized.


**Figure 10.** Range of plant hill indices that were detected per frame along the first 10 m of the simulated field at 2.4 *fps*, 0.5 m capture width, 0.2 m hill spacing at (**a**) 0.1 m/s, (**b**) 1.3 m/s, and (**c**) 2.5 m/s. Blue broken lines enclose a detection pattern, while broken red lines specify the missed plant hills.
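The frame-by-frame reasoning behind Figure 10 can be sketched in Python. This is not the authors' published simulation script. It is a minimal model under stated assumptions: hill *i* sits at position *i* × *dh* starting from 0, frame *k* starts at *k* × *df* with *df* equal to the travel velocity divided by *fps*, each frame covers a closed interval of width *S*, and a hill is detected if it falls inside at least one frame. The function name `missed_hills` is hypothetical. Exact fractions are used so that hills lying exactly on a frame edge are not lost to floating-point rounding.

```python
from fractions import Fraction
from math import ceil, floor


def missed_hills(v, fps, S, d_h, distance):
    """Return the indices of hills that no processed frame covers.

    Assumptions (not taken from the published scripts): hill i is at
    i * d_h, frame k covers [k * d_f, k * d_f + S] with d_f = v / fps,
    and only hills located before `distance` are counted.
    """
    v, fps, S, d_h, distance = (Fraction(str(x)) for x in (v, fps, S, d_h, distance))
    d_f = v / fps                      # distance traveled between processed frames
    n_hills = floor(distance / d_h)    # hills at 0, d_h, ..., all below `distance`
    seen = set()
    k = 0
    while k * d_f < distance:
        start = k * d_f
        first = ceil(start / d_h)          # first hill index inside the frame
        last = floor((start + S) / d_h)    # last hill index inside the frame
        seen.update(range(first, min(last, n_hills - 1) + 1))
        k += 1
    return [i for i in range(n_hills) if i not in seen]


# Treatments simulated at 2.4 fps, S = 0.5 m, d_h = 0.2 m, first 10 m.
for velocity in (0.1, 1.3, 2.5):
    missed = missed_hills(velocity, 2.4, 0.5, 0.2, 10)
    total = 50  # floor(10 / 0.2) hills under the assumptions above
    print(f"v = {velocity} m/s: {len(missed)} of {total} hills missed")
```

Under these assumptions, the 0.1 m/s case misses no hills, while the 1.3 m/s case misses hill indices 8, 27, and 46 within the first 10 m. These are 19 hills apart, or 3.8 m, which is consistent with the pattern described for Figure 10b.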

#### *3.3. Effect of Increasing S or Multiple Cameras*

In cases where *ro* < 1 (Case 3), a practical way to increase *v̄max* is to raise the camera mounting height, which in effect increases *S*. However, if raising the camera is inappropriate because it also reduces object detail, the use of multiple cameras can be a viable solution.

Figure 11 illustrates the effect of increasing the effective *S*, or using multiple cameras, on the calculated values of *v̄max* for the three levels of inferencing speed (2.4, 12.2, and 22 *fps*) simulated at *S* = 0.5 m. The results showed that the treatments with missed detections exceeded the allowable *v̄max*. For Treatments 4 and 7, the allowable travel velocity was only 1.2 m/s using a single camera module, which was less than the simulated *v̄* of 1.3 and 2.5 m/s, respectively.

**Figure 11.** Theoretical maximum travel velocity to prevent missed detections with 1, 2, and 3 cameras at 2.4, 12.2, and 22.0 *fps*.

Calculating *nvis* using Equation (6) for treatments 4 and 7 showed that 2 and 3 vision modules, respectively, were required to prevent missed detections. Thus, using two vision modules for treatment 4 prevented missed detections, as shown in Figure 12. The 6th, 20th, and 34th frames captured by the second camera detected the plants undetected by the first camera.
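The module counts above can be checked with a short Python sketch. It assumes, following the abstract, that the number of vision modules equals the inverse of *ro* rounded up, and that *n* adjacent, non-overlapping cameras multiply the effective *S* by *n*. The function names are illustrative, and the exact forms of Equations (5) and (6) are assumptions here.

```python
import math


def max_travel_velocity(S, fps, n_cameras=1):
    """Largest velocity with r_o >= 1, assuming the effective field of
    view of n adjacent, non-overlapping cameras is n_cameras * S."""
    return n_cameras * S * fps


def required_modules(S, fps, v):
    """Number of vision modules, assumed to be ceil(1 / r_o)."""
    r_o = S * fps / v
    return max(1, math.ceil(1 / r_o))


S, fps = 0.5, 2.4  # capture width [m] and inferencing speed of Treatments 4 and 7
for label, v in (("Treatment 4", 1.3), ("Treatment 7", 2.5)):
    n = required_modules(S, fps, v)
    print(f"{label}: v = {v} m/s needs {n} module(s); "
          f"v_max with {n} = {max_travel_velocity(S, fps, n):.1f} m/s")
```

With these inputs the sketch gives two modules for Treatment 4 and three for Treatment 7. Two modules raise the allowable velocity to only 2.4 m/s, which is still below 2.5 m/s.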

As predicted, a two-vision-module configuration for Treatment 7 was insufficient to prevent missed detections, since the simulated *v̄* of 2.5 m/s was still higher than the increased *v̄max*. As illustrated in Figure 12, without a third camera, the two-camera configuration would still result in an undetected hill on the 16th frame.

Based on these simulated results, the problem of missed detection due to the slow inferencing speed of embedded systems could be potentially solved by using multiple, adjacent, non-overlapping, and colinear cameras along the traversed row when raising the height of the camera was unwanted.

**Figure 12.** The range of plant hill indices detected per frame along the first 10 m of the simulated field at *fps* = 2.4, *S* = 0.5 m, and *dh* = 0.2 m: (**a**) detections at 1.3 m/s using two cameras, where broken red lines represent plant hills undetected by the first camera but detected by the second camera; and (**b**) detections at 2.5 m/s using three cameras, where broken orange lines represent plant hills undetected by the first and second cameras but detected by the third camera.

#### *3.4. Vision Module Simulation and Testing Performance*

Figure 13 shows a sample detection of the vision module. Using a TensorRT-optimized SSD MobileNetV1 to detect plants in 1280 × 720 images on an NVIDIA Jetson Nano 4 GB gave an average inferencing speed of 45 *fps*. This figure represents only the elapsed time to run inference on an already loaded frame. Because of overheads from additional data processing and transmission, the average effective inferencing speed of the vision module was only 16 *fps*, as shown in Figure 13. The effective speed was derived from the average time the vision module took to complete a single loop, including grabbing a frame from the camera, inferencing, calculating detection parameters, image processing, and transmitting data.
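The distinction between raw and effective inferencing speed can be illustrated with a small timing sketch in Python. It is not the vision module's actual code: `time_loop` and `effective_fps` are hypothetical helpers, the `step` callable stands in for one full loop (frame grab, inference, post-processing, transmission), and the effective rate is assumed to be the number of loops divided by the total elapsed time.

```python
import time


def effective_fps(loop_durations_s):
    """Effective frame rate: completed loops divided by total elapsed time."""
    return len(loop_durations_s) / sum(loop_durations_s)


def time_loop(step, n_loops):
    """Run `step` n_loops times and record the wall-clock time of each pass.

    `step` is a placeholder for the whole loop body of a vision module.
    """
    durations = []
    for _ in range(n_loops):
        t0 = time.perf_counter()
        step()
        durations.append(time.perf_counter() - t0)
    return durations


# Rough per-loop budget implied by the reported averages (45 fps for
# inference alone, 16 fps for the whole loop). Treating averaged rates as
# reciprocals of averaged times is itself an approximation.
inference_s = 1 / 45
loop_s = 1 / 16
overhead_s = loop_s - inference_s
print(f"inference = {inference_s * 1e3:.1f} ms, "
      f"loop = {loop_s * 1e3:.1f} ms, overhead = {overhead_s * 1e3:.1f} ms")
```

Under this approximation, roughly 40 ms of each 62.5 ms loop is spent outside the inference call, which is why the effective rate, rather than the raw inference rate, is the value to use when calculating *ro*.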

The results of the theoretical approach and the simulation for the vision module are shown in Table 6. Using Equation (2), the configuration of the laboratory setup falls under Case 2 since *ro* > 1. Then, using Equation (4), *rd*,*th* was calculated to be 1.00. Applying Equation (5) yields *v̄max* = 12.64 m/s, far above the target 0.3 m/s, which implies that multiple vision modules were not required to prevent missed detections. The theoretical prediction therefore indicated that the configuration was sufficient to prevent missed detections. Likewise, the theoretical result was confirmed by the simulation results, which showed no missed detections (*rd*,*sim* = 1.00) for both crops and weeds at all simulated *v̄*.
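As a worked check of these figures, the arithmetic can be written out. It assumes Equation (2) has the form given in the abstract (*ro* as the product of *S* and *fps* divided by the travel velocity) and that Equation (5) gives the velocity at which *ro* = 1, and it uses *S* = 0.79 m and 16 *fps* from Table 6 with the 0.3 m/s target:

$$
r_o = \frac{S \cdot fps}{\bar{v}} = \frac{0.79 \times 16}{0.3} \approx 42.1 > 1,
\qquad
\bar{v}_{max} = S \cdot fps = 0.79 \times 16 = 12.64\ \text{m/s}
$$

The large value of *ro* means that, under these assumptions, each plant would appear in many consecutive processed frames at the target velocity.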

**Figure 13.** Sample real-time inferencing using a trained SSD MobileNetV1 model optimized in TensorRT. The vision system utilized the Jetson Inference API.

**Table 6.** Theoretical and simulation performance of the CNN-based vision module for plant detection at *S* = 0.79 m, 16 *fps*, and three levels of travel velocity (*v̄*).


Table 7 summarizes the precision and recall of the trained CNN model in detecting potted plants at different relative travel velocities of the conveyor. Results showed that an optimized SSD MobileNetV1 in TensorRT running on a Jetson Nano 4 GB had robust detection performance, with no incorrect or missed detections despite increasing travel velocity. Comparing *ractual* to *rd*,*th* and *rd*,*sim* showed that the detection rates were equal. Recall was used for the comparison instead of precision because recall is the ratio of correctly detected plants to the total number of sample plants. This definition of *ractual* in Equation (16) is equivalent to the definition of *rd*,*sim* in Equation (14). Since *ractual*, *rd*,*th*, and *rd*,*sim* were equal, these results validated the theoretical concepts and simulation methods presented in this study. Hence, *rd*,*th* and *rd*,*sim* can be used to determine the detection rate of a vision system in capturing plant images as a function of *v̄* and *fps* with known *S*.
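The two metrics compared above can be computed as follows. This is a minimal Python sketch using the standard definitions of precision and recall. The counts in the example follow from the statement that none of the 60 potted plants was missed or incorrectly detected. The function names are illustrative and not taken from the published scripts.

```python
def precision(true_pos, false_pos):
    """Share of reported detections that are correct."""
    reported = true_pos + false_pos
    return true_pos / reported if reported else 0.0


def recall(true_pos, false_neg):
    """Share of actual plants that were detected (compared to r_actual)."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


# 60 sample plants with no incorrect and no missed detections.
tp, fp, fn = 60, 0, 0
print(f"precision = {precision(tp, fp):.2f}, recall = {recall(tp, fn):.2f}")
```

Recall is the natural counterpart of *rd*,*th* and *rd*,*sim* because its denominator is the number of plants that passed the camera, whereas the denominator of precision is the number of detections the model reported.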


**Table 7.** The detection performance of the CNN-based vision module in detecting 60 potted plants at different conveyor velocities (*v̄*).

#### **4. Conclusions**

This study presented a practical approach to quantify *rd* and aid in the development of a CNN-based vision module through the introduction of the dimensionless parameter *ro*. The reliability of *ro* in predicting the *rd* of an MVS as a function of inferencing speed and travel velocity was successfully demonstrated by having no margin of error compared to simulated and actual MVS at sufficient traversed distance (≥10 m). In addition, a set of scripts for simulating the performance of a vision system for plant detection was also developed and showed no margin of error compared to the *rd* of the actual MVS. This set of scripts was made publicly available to verify the results of this study and to provide a practical tool for developers in optimizing the design configurations of a vision-based plant detection system.

The mechanism of missed detection was also successfully illustrated by evaluating each of the simulated frames in detail. Using the concept of *ro*, simulation, and detailed assessment of each processed frame, the mechanism to prevent missed plant hills by increasing the effective *S* through synchronous multi-camera vision systems in low-frame-processing-rate hardware was also successfully presented.

Furthermore, a vision module was also successfully developed and tested. Performance testing showed that the *rd*,*th* and *rd*,*sim* accurately predicted the *ractual* of the vision module with no margin of error. The script for the vision module was also made available in a public repository where future improvements shall also be uploaded.

However, despite accomplishing the set objectives, the study encountered limitations that shall be addressed in future research. First, the robustness of *ro* in predicting the detection rate was supported mainly by simulation data. The current laboratory tests were only implemented at a maximum travel velocity of 0.3 m/s due to limitations in the manual loading of the test plants. At this time, the study relied on the results of other studies to validate the robustness of *ro* at higher travel velocities and different inferencing speeds. Thus, the concepts presented in this study shall be further tested to determine the robustness of *ro* at higher travel velocities when the developed vision module is applied in actual field scenarios.

Lastly, the methodology to calculate *rd*,*th* and *rd*,*sim* assumed that the CNN has 100% precision. For a CNN with less than 100% precision, it can be theorized that *rd*,*th* and *rd*,*sim* can be multiplied by the precision of the CNN to estimate the effective recall of a CNN-based vision system in evaluating objects moving across the camera frame. However, this concept is yet to be demonstrated and shall also be included in future studies.

**Author Contributions:** Conceptualization, P.R.S.; methodology, P.R.S. and H.Z.; software, P.R.S.; validation, P.R.S.; formal analysis, P.R.S. and H.Z.; investigation, P.R.S. and H.Z.; resources, P.R.S. and H.Z.; data curation, P.R.S.; writing—original draft preparation, P.R.S.; writing—review and editing, H.Z.; visualization, P.R.S.; supervision, H.Z.; project administration, H.Z.; funding acquisition, P.R.S. and H.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The scripts used in the simulation and vision module were made publicly available on GitHub under a BSD license to use, test, and validate the data presented in this study. The scripts for simulation are available at https://github.com/paoap/vision-modulesimulation (accessed on 31 December 2021), while the vision module software framework can be downloaded from https://github.com/paoap/plant-detection-vision-module (accessed on 31 December 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

