Article

Development of Apple Detection System and Reinforcement Learning for Apple Manipulator

by
Nikita Andriyanov
Data Analysis and Machine Learning Department, Financial University under the Government of the Russian Federation, pr-kt Leningradsky, 49/2, 125167 Moscow, Russia
Electronics 2023, 12(3), 727; https://doi.org/10.3390/electronics12030727
Submission received: 31 December 2022 / Revised: 17 January 2023 / Accepted: 21 January 2023 / Published: 1 February 2023
(This article belongs to the Special Issue Neural Networks in Robot-Related Applications)

Abstract

Modern deep learning systems make it possible to develop increasingly intelligent solutions in various fields of science and technology. Single-board computers make it easy to control various robotic solutions, and such control tasks do not require large computing resources. However, deep learning models still demand a high level of computing power. Thus, the effective control of an intelligent robotic manipulator is possible when a computationally complex deep learning model running on a GPU and a mechanics control unit running on a single-board computer work together. In this regard, the study is devoted to the development of a computer vision model for estimating the coordinates of objects of interest, as well as the subsequent recalculation of these coordinates relative to the manipulator to form a control action. In addition, a reinforcement learning model was developed in a simulation environment to determine the optimal path for picking apples from 2D images. The detection efficiency on the test images was 92%, and in the laboratory it was possible to achieve 100% detection of apples. In addition, an algorithm was trained that provides adequate guidance to apples located at a distance of 1 m along the Z axis. Thus, the original neural network used to recognize apples was trained on a large image dataset, algorithms for estimating the coordinates of apples were developed and investigated, and the use of reinforcement learning was suggested to optimize the picking policy.

1. Introduction

Recently, multimodal deep learning models, which combine several areas in the field of artificial intelligence, have become increasingly popular [1,2,3,4]. Of particular interest in the field of robotics is the combination of computer vision algorithms and reinforcement learning [5]. At the same time, agriculture is one of the most popular areas for introducing intelligent robotics technologies [6,7,8,9]. This is due to the observed increase in the global population, which has led to a demand for increased efficiency in all areas related to food production. For example, according to the United Nations Organization [10], in the coming decades it is expected that the population of the Earth will exceed the threshold of 10 billion people. Obviously, providing food for such a huge number of people is impossible without the optimization of modern agriculture. In addition, the economies of many countries are still facing the consequences of the coronavirus infection, which affected, among other things, the digitalization of agri-food systems [11].
Moreover, problems associated with the complexity of efficient harvesting already exist in horticulture. For example, many apples fail to reach the consumer for reasons such as the low payback of manual labor or fruit that is inaccessible for picking in some places. As a result, such apples remain in agricultural holdings and rot.
Given the factors listed above, on the one hand, there is the problem of increasing the amount of food produced in agriculture and, on the other hand, a clear need to ensure fast and efficient harvesting with the help of intelligent robotic solutions. Our article aims to address the second problem. In addition to the harvesting robot itself, solving this issue requires a control system. The main decisions in this case can be made using video information. Clearly, the objects of interest must first be detected in the video and then localized. The way the manipulator moves during harvesting is also of interest. Thus, the following three related problems are considered: detection of objects in the image [12], determination of the coordinates of objects relative to the manipulator [13], and formation of a route for moving toward the objects [14]. In this article, the objects of interest are apples. The first and second tasks can be successfully solved using convolutional neural networks, as well as specialized depth cameras.
The recognition of apples in images was achieved using the YOLO neural network architecture [15,16]. An important feature of this pretrained model is its out-of-the-box ability to detect apples among 80 object classes. To estimate depth, we use methods of computer optics [17] and compare the Intel Real Sense [18] and ZED [19] cameras. The first camera registers optical images and builds a depth map using IR sensors, while the second provides access to a point cloud obtained from scanning. Since movement toward the apples is actually carried out in a simulator in two-dimensional space, a reinforcement learning model based on Q-learning is trained [20].
The purpose of this work is to study the efficiency of detecting apples and estimating their coordinates using various stereo cameras. The contribution that this article makes to the scientific community is the development of fruit coordinate estimation algorithms for robot guidance using various pre-trained YOLOv5 network configurations and transfer learning, as well as a comparative analysis in terms of coordinate estimation error for various stereo cameras. Moreover, the article proposes a hardware–software solution for the implementation of such a robot with a GPU computer and control using Raspberry Pi. Finally, for the first time, an attempt was made to generate apple picking trajectories, specifically in two-dimensional space, but with the possibility of generalization in the future to three-dimensional space. Transfer learning made it possible to achieve complete and accurate recognition with improved performance compared to earlier works on this topic.
For the first time, we describe the solution of a complex problem that combines computer vision, stereo vision, and robotic-arm control algorithms based on our own developments.
The structure of the article is as follows. The second section will describe stereo vision technologies, as well as computer vision systems in object detection tasks and examples of the application of reinforcement learning in robotics. The third section is devoted to the development of a software and hardware solution for the control system of a robotic arm for picking apples, and also describes an approach to the formation of a harvesting policy. The fourth section describes the evaluation of the quality and performance of the developed algorithms. The main results and conclusions are reflected in the Conclusion, followed by a list of references.

2. Related Works

Today, as noted earlier, it is difficult to imagine intelligent robotic systems without machine vision. Machine or computer vision includes the tasks of recognizing faces, gender, age, and general human characteristics [21,22]. Animal recognition is also an important task in the field of machine vision [23,24]. A number of datasets have been collected for such tasks, including those used in machine learning tasks associated with the recognition of flowers, vegetables and fruits [25,26].
Among the tasks of machine vision, one can single out the acquisition, processing and pre-processing of images, and sometimes the prediction of future frames in a sequence of images, associated, among other things, with the simulation of new scenes [27,28]. At the same time, computer vision systems can be considered systems that describe the real world using images or their sequences. However, an accurate description of the world is impossible without stereo vision systems, whose task is to reconstruct the three-dimensional coordinates of points for depth estimation. A stereo vision system usually consists of two cameras installed in an appropriate way, as shown in Figure 1 [29].
Similarly to a person's two eyes, the left and right cameras provide two shifted images of the same object. Based on these shifts (disparities), the distance to the object can be determined: stereo matching estimates the differences between the two frames to build a depth map. The goal in developing such systems is to increase both the accuracy of the depth maps and the speed of their construction.
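For reference, under the standard rectified pinhole stereo model (a textbook relation assumed here rather than taken from the cited works), the depth $Z$ of a point follows from the disparity $d = x_{left} - x_{right}$ between the two frames as

$$Z = \frac{f\,B}{d},$$

where $f$ is the focal length in pixels and $B$ is the baseline between the two cameras; a larger disparity therefore corresponds to a closer object.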
The study in [29] proposes a taxonomy of computer vision algorithms, including methods for evaluating algorithms for the construction of depth maps. In [30], Brown and co-authors review approaches to the construction of stereo depth maps, the inconsistencies between the systems used to obtain them, and methods of handling occlusion; the authors also paid attention to the real-time implementation of such algorithms. A review paper [31], presented in 2008, compares stereo vision methods in terms of computational cost, as well as the accuracy of matching predicted and actual distances. The authors of [32] consider not only global algorithms but also more efficient local depth determination algorithms that can be implemented on various software and hardware platforms. Tombari et al. [33] made a great contribution to the development and evaluation of object recognition algorithms in three-dimensional space. In [34], the suitability and applicability of stereo vision algorithms are assessed; the authors collected data on the errors of various algorithms, as well as on the performance of devices that generate depth maps, with an emphasis on real-time systems. Although systems that match and combine several images achieve high accuracy in determining range and can handle occlusions, stereo matching systems remain preferable in a number of tasks. Such tasks, as a rule, involve a fixed geometry, a limited number of images, and a requirement for fast decision making, which also plays an important role [35]. Robotic solutions undoubtedly satisfy all of the above conditions; therefore, the use of stereo vision in robotics is appropriate.
It is noteworthy that, when stereo cameras are used, the problem of estimating object coordinates can be divided into estimating local coordinates on the color image frame and recalculating the coordinates relative to the camera using the depth map. Modern models consider the problem of detecting deterministic objects in close connection with pattern recognition [36]. The networks of the Region-Based CNN (R-CNN) family [37] can be attributed to classical neural network detectors of previously known object classes. In fact, such a network cuts a set of smaller rectangular images, the so-called regions, out of the original image and processes them with a classification neural network such as AlexNet [38]. This approach was suboptimal, firstly, because of its slow speed and, secondly, because of the inefficient choice of regions. The performance problem was solved to some extent by a modification called Fast R-CNN [39]. The acceleration came from first processing the entire image with the convolutional network, forming regions for verification on the resulting feature map, and then passing these regions through the subsequent layers of the neural network for classification. In addition, to reduce the coordinate estimation error, bounding-box regression was introduced in the Fast R-CNN network; this was also required to transfer a box from the feature map, on which the regions were formed, back to the original image. However, the performance of the Fast R-CNN model was still quite poor: the selective search method led the neural network to perform many "idle" recognitions in regions without objects of interest. The modified Faster R-CNN method [40] used an additional specialized neural network that suggested regions for testing based on the feature map obtained in the first step.
Despite the optimization of the selection of test regions, real-time operation required either very powerful computing devices or processing optimization tools [41,42]; however, such works consider hardware acceleration options. At the same time, higher-performance detection algorithms were implemented in single-pass (single-shot) detectors, in which localization occurs in the same pass as detection. Examples of convolutional single-pass detector architectures include the neural networks YOLO [43], SSD [44], and RetinaNet [45].
Let us take a closer look at the You Only Look Once (YOLO) architecture. At present, version 7 of this architecture has already appeared; however, the third and fifth versions were no less important. YOLOv3 provided acceptable detection quality in many tasks, and the YOLOv5 network, implemented in the PyTorch framework with several model sizes, made it possible to obtain fairly fast object detection solutions. The basic recognition network in YOLO is DarkNet [46]. The selective search algorithm is replaced by dividing the original image into square areas of different sizes, for which objects are recognized in a single pass. Each square is checked using three bounding boxes, for which the probability of the presence of objects of certain classes is determined; these probabilities are used to make the final decision about the presence or absence of an object in an area. This approach made it possible to use YOLO neural networks for processing video sequences [47,48].
In recent years, there has been a trend towards the use of transformer architectures for object detection in images [49,50]. However, their main disadvantage compared to convolutional models is their low throughput, despite their often high accuracy; therefore, convolutional detection architectures are more often used for video analysis. For example, YOLO has already proven itself in the problem of recognizing fruits on trees [51]. However, [51] does not use data from video sequences, and the processing concerns individual frames. Moreover, many works do not estimate the real distance to apples in any way, since there are practically no datasets with such information. An increase in processing quality is achieved through a variety of image preprocessing algorithms, as shown in [52]. The authors of [53] studied the impact of customized datasets on detection quality. In addition, augmentations can improve model metrics [54]. An F1-score of 93% in the apple recognition problem was obtained in [55]; at the same time, the convolutional networks used lag behind in accuracy when detecting green apples. The computer vision system proposed in [56] recognizes apples and pears and also tracks objects swaying in the wind using the Kalman filter [57].
Finally, the problems of applying depth maps are considered in [58,59,60]. In particular, the work in [58] is devoted to studying the efficiency of estimating the distances between the robot and landmarks based on data from the Intel Real Sense Depth Camera D435. These distances are used to determine the trajectory of movement. A comparative analysis of the line of Intel Real Sense cameras is provided by the authors in [59]. However, the disadvantages of using only depth cameras when constructing motion trajectories are shown in [60]. Problems can arise when there is noise in the image or when exposed to visual attacks [61].
However, the optimal trajectories for robot movement are not always obvious, and learning to form them is difficult to frame as a supervised learning task. In this case, reinforcement learning algorithms are used. In [62], indoor navigation of robots is achieved using reinforcement learning. Combining algorithms for computer vision, simultaneous localization and mapping (SLAM), and reinforcement learning for robotics and positioning is proposed by the authors of [63]. The work in [64] is devoted to the application of reinforcement learning methods in a simulation environment to control a robotic arm.
Thus, a review of the related works shows that in order to solve the problem of picking apples using a robotic arm, it is necessary to organize a computer vision system based on convolutional neural networks, develop algorithms for converting coordinates relative to the camera into coordinates relative to the manipulator, and to also consider options for applying reinforcement learning for picking apples.
The scientific novelty of this article is as follows. Firstly, we performed transfer learning on a large volume of images; secondly, we carried out a comparative analysis of two types of cameras; and thirdly, we explored, in simulation, a method for constructing an apple-picking path based on reinforcement learning.

3. Materials and Methods

To develop an algorithm for detecting apples in images, we used our own dataset consisting of 7218 images, in which about 75,000 apples were labeled, and 3400 images without apples. The data were collected in specialized partner orchards. Red and green apples were considered. It is important to note that all images were taken in good, clear weather with sunlight. The apples were not labeled by variety, in order to simplify the task of recognizing apples against the background.
An example of a training sample image is shown in Figure 2.
The test sample included 1000 images with 9620 apples and 215 images without apples. YOLOv5 architectures with various weights were considered. Since such models are able to detect apples even without additional training, models with and without fine-tuning of the weights are compared.
YOLOv5 was chosen after testing different architectures on an NVIDIA GeForce GTX 1070. Table 1 shows the results of a comparative analysis for apple detection on 100 images with 854 apples and 20 images without apples.
It can be seen that YOLOv5 shows the best precision and performance.
The Computer Vision Annotation Tool (CVAT) was used for markup; it allows annotations to be exported in various formats. The YOLOv5 network itself was trained using RoboFlow. Figure 3 shows an example of data markup in CVAT [65].
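To illustrate how such a fine-tuned detector can be queried, the following minimal Python sketch loads custom weights through the public torch.hub interface of the ultralytics/yolov5 repository [16]; the weight file name, image name, and confidence threshold are illustrative assumptions rather than values used in this work.

import torch

# Load fine-tuned YOLOv5 weights (file name is hypothetical) via torch.hub.
# The stock COCO model could be used instead, since it already contains an "apple" class.
model = torch.hub.load("ultralytics/yolov5", "custom", path="apple_yolov5x6.pt")
model.conf = 0.5  # assumed confidence threshold

results = model("orchard_frame.jpg")   # a single color frame (hypothetical file)
detections = results.pandas().xyxy[0]  # columns: xmin, ymin, xmax, ymax, confidence, class, name

for _, det in detections.iterrows():
    # Center of the bounding box in pixels, later converted to 3D coordinates.
    x0 = (det["xmin"] + det["xmax"]) / 2
    y0 = (det["ymin"] + det["ymax"]) / 2
    print(f"{det['name']}: conf={det['confidence']:.2f}, center=({x0:.1f}, {y0:.1f}) px")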
Thus, the color image processing model was trained first. Next, the use of specialized depth cameras, namely the Intel Real Sense Depth Camera D415 and the Stereo Labs ZED 2, was considered. The first supplies separate color and depth images for processing and requires no additional alignment, while the second provides researchers with a three-dimensional point cloud.
In the case of Real Sense, the pixel coordinates of detected objects need to be recalculated using depth map data. Such a recalculation is carried out using the following expressions [66]:
$$X[\mathrm{mm}] = \frac{d_{x_0,y_0}\,(C_x - x_0[\mathrm{px}])}{f_x}, \qquad Y[\mathrm{mm}] = \frac{d_{x_0,y_0}\,(C_y - y_0[\mathrm{px}])}{f_y} \qquad (1)$$
where $X[\mathrm{mm}]$ is the projection of the distance relative to the image center onto the X axis, calculated in millimeters; $Y[\mathrm{mm}]$ is the projection of the distance relative to the image center onto the Y axis, calculated in millimeters; $d_{x_0,y_0}$ is the depth map value at the point with coordinates $(x_0, y_0)$, which the camera reports in millimeters; $C_x$ and $C_y$ are the X- and Y-axis coordinates of the image center, defined in pixels; $f_x$ and $f_y$ are internal parameters of the camera's optical system, i.e., the focal lengths along the X and Y axes; and $(x_0, y_0)$ are the pixel coordinates of the center of the detected object in the image, calculated by the neural network algorithm.
When using the D415 model, it is necessary to correct the result of Expression (1) for the shift of the camera center:
$$X'[\mathrm{mm}] = X[\mathrm{mm}] - 35 \qquad (2)$$
where $X'[\mathrm{mm}]$ is the projection of the distance from the camera center to the object along the X axis with the shift removed (in mm), and 35 mm is the standard offset of the D415 model.
Having calculated the projection of the distance onto the Y axis in accordance with Expression (1) and the projection onto the X axis in accordance with Expression (2), and knowing the total distance from the depth map, it is easy to determine the projection of the distance along the Z axis as follows:
$$Z[\mathrm{mm}] = \sqrt{d^2 - X'[\mathrm{mm}]^2 - Y[\mathrm{mm}]^2} \qquad (3)$$
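To make the recalculation concrete, the following Python sketch implements Expressions (1)-(3) for the D415; the intrinsic parameters and the example pixel values are placeholders that would normally come from the camera calibration, not values used in this study.

import math

def pixel_to_camera_mm(x0, y0, depth_mm, cx, cy, fx, fy, x_offset_mm=35.0):
    """Convert a detected object's pixel center and depth-map value into
    X, Y, Z projections (in mm) relative to the camera center, Eqs. (1)-(3).
    x_offset_mm is the D415 sensor shift from Eq. (2)."""
    X = depth_mm * (cx - x0) / fx          # Eq. (1), X projection
    Y = depth_mm * (cy - y0) / fy          # Eq. (1), Y projection
    X -= x_offset_mm                       # Eq. (2), correct for the camera center shift
    Z = math.sqrt(max(depth_mm**2 - X**2 - Y**2, 0.0))  # Eq. (3)
    return X, Y, Z

# Hypothetical example: object center at (512, 300) px, 900 mm away,
# with placeholder intrinsics (cx, cy, fx, fy).
print(pixel_to_camera_mm(512, 300, 900.0, cx=640, cy=360, fx=900.0, fy=900.0))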
Expressions (1)–(3) provide coordinates relative to the camera center. However, the position of the camera on the manipulator is known in the manipulator's local coordinate system, so the distance from the manipulator to the object can be recalculated by introducing appropriate corrections along each coordinate axis. Since stepper motors drive the manipulator along the X, Y, and Z axes, the number of step pulses that the control system must issue to move by a given number of millimeters can be easily calculated using regression models along the axes (a sketch of this conversion is given after the control procedure below). Figure 4 shows the manipulator with a camera and an apple-recognizing computing device.
Figure 4 shows the prototype of the manipulator. At the front is a laptop with a GPU, which is used for object detection. We can also see the motor drivers and the connection to the depth camera; they are controlled by a Raspberry Pi 3 connected to the laptop. The experiment also includes two apples for picking. The robotic arm is aimed at the first (left) apple; the next steps are moving to the apple and grabbing it.
Control is carried out using a Raspberry Pi 4 single-board microcomputer, which receives over the local network the required number of steps from the computer vision system in order to aim the manipulator at the object. A block diagram of this type of intelligent manipulator is shown in Figure 5.
According to the above diagram, the manipulator control procedure is as follows. First, the stereo vision system estimates the coordinates of the object relative to the camera, recalculates them relative to the manipulator, and transmits the results to the control device. The control device determines the required number of step pulses to apply to the drives and creates a control action to move to the desired point. When the manipulator reaches the required position, it picks the apple and returns to the starting point.
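The step computation and handshake can be sketched as follows; the millimetres-per-step factors, the host address, and the message format are purely illustrative assumptions, since the actual regression models and protocol are not listed in this article.

import json
import socket

# Assumed conversion factors (mm of travel per motor step) for each axis;
# in the real system these would come from the per-axis regression models.
MM_PER_STEP = {"x": 0.05, "y": 0.05, "z": 0.04}

def mm_to_steps(dx_mm, dy_mm, dz_mm):
    """Convert offsets already recalculated into the manipulator frame
    into signed step counts for the three stepper motors."""
    return {
        "x": round(dx_mm / MM_PER_STEP["x"]),
        "y": round(dy_mm / MM_PER_STEP["y"]),
        "z": round(dz_mm / MM_PER_STEP["z"]),
    }

def send_steps(steps, host="192.168.0.10", port=5005):
    """Send the step command to the Raspberry Pi over the local network
    (hypothetical JSON-over-TCP message)."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(json.dumps(steps).encode("utf-8"))

# Example: move 120 mm along X, 40 mm along Y, 600 mm along Z, then grab.
steps = mm_to_steps(120.0, 40.0, 600.0)
print(steps)
# send_steps({**steps, "grab": True})  # would transmit the command if the Pi is listening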
The system works in a similar way using the ZED 2 stereo camera, which is shown in Figure 6.
As shown in Figure 6 [19], the ZED 2 stereo camera has left and right optical cameras, and the disparity between their images allows a three-dimensional point cloud to be built. Therefore, it can also be used for coordinate estimation.
Let the coordinates of the rectangle bounding the apple be determined using YOLO. Then, we determine the coordinates relative to the center of the ZED 2 camera. Note that, in our case, processing was performed on the frame from the left camera of the device. Each object detected by YOLO is described by the following parameters: the coordinates of the upper-left corner of the bounding rectangle along the X and Y axes, labeled $(x_l, y_l)$, its width $w$ and its height $h$.
Accordingly, the point corresponding to the center of the object has coordinates $(x_l + w/2,\ y_l + h/2)$. The ZED 2 camera interface provides access to the point cloud by the coordinates of the color image frame. We estimate the distance to the object from the point cloud using the Euclidean formula:
$$r = \sqrt{p_{cx}^2(x_l + w/2,\ y_l + h/2) + p_{cy}^2(x_l + w/2,\ y_l + h/2) + p_{cz}^2(x_l + w/2,\ y_l + h/2)} \qquad (4)$$
where $p_{cx}$ ($p_{cy}$, $p_{cz}$) is the value of the distance projection along the X (Y, Z) axis at the given pixel.
Knowing all projections of distances relative to the left camera, we recalculate them for the manipulator and determine the required number of steps for movement.
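Assuming that the point cloud has already been retrieved from the ZED SDK as an H x W x 3 array of per-pixel X, Y, Z projections (in mm) aligned with the left-camera color frame, Equation (4) reduces to the following sketch; the array layout and variable names are assumptions for illustration.

import numpy as np

def distance_from_point_cloud(point_cloud, x_l, y_l, w, h):
    """Euclidean distance (Eq. 4) to the detected object, taken from the
    point cloud value at the bounding-box center.

    point_cloud: H x W x 3 array of per-pixel (X, Y, Z) projections in mm,
                 assumed to be aligned with the left-camera color frame.
    """
    u = int(round(x_l + w / 2))   # column index (image X axis)
    v = int(round(y_l + h / 2))   # row index (image Y axis)
    p = point_cloud[v, u]         # (p_cx, p_cy, p_cz) at the object center
    return float(np.sqrt(np.sum(p ** 2)))

# Hypothetical usage with a synthetic cloud and a YOLO box (x_l, y_l, w, h):
cloud = np.zeros((720, 1280, 3))
cloud[360, 640] = [10.0, -40.0, 900.0]
print(distance_from_point_cloud(cloud, 600, 330, 80, 60))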
All image processing in the laboratory was performed using an ASUS FX 504 computer with an NVIDIA GeForce 1060 video card and CUDA version 11.0.
Route formation was solved in Python using Q-learning methods. It was assumed that all apples were at an equal distance. A 10 × 10 map was modeled, on which 3 apples could be randomly located. Figure 7 shows one of the states of the simulation environment.
In this case, it is assumed that the manipulator agent always starts from the lower right corner. An optimal route needs to be built, with shifts occurring only along the X and Y axes; moreover, an unpicked apple must not be knocked down while moving towards the apple that the manipulator agent intends to pick.
The action space was described by movements along the X and Y axes and the grabbing of an apple. A discount factor of 0.9 and a learning rate of 0.005 were used. The Q-values were updated according to the following formula [67,68]:
$$Q_{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \qquad (5)$$
where $(s_t, a_t)$ are the state and action at time $t$, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_t$ is the reward at time $t$.
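A minimal tabular sketch of the update in Equation (5), using the discount factor of 0.9 and the learning rate of 0.005 mentioned above, is shown below; the state encoding, action set, and reward values are illustrative assumptions rather than the exact simulator implementation.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.005, 0.9, 0.1   # learning rate, discount factor, exploration rate
ACTIONS = ["up", "down", "left", "right", "grab"]

Q = defaultdict(float)                    # Q[(state, action)] -> value, zero-initialized

def choose_action(state):
    """Epsilon-greedy policy over the discrete action space."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One application of Eq. (5)."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Illustrative step: state = (agent position, frozenset of remaining apple cells).
s = ((9, 9), frozenset({(2, 3), (5, 7), (0, 1)}))
a = choose_action(s)
# ... the simulator would return the reward and the next state here ...
q_update(s, a, reward=-1.0, next_state=((9, 8), s[1]))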

4. Results

The results of the various detection algorithms on image frames, the accuracy of the coordinate estimates obtained with the different cameras, and the effectiveness of the reinforcement learning algorithm are presented in the tables below.
Figure 8 shows an example of detecting apples on a test image.
Table 2 shows the comparative mAP metrics for different neural network models.
The analysis shows that the largest model provides the best quality. At the same time, transfer learning increases mAP by up to 5%. Inference speed is inversely proportional to model size.
Although there were red and green apples in the training data, we did not treat them separately as two classes in training. However, interesting results were observed during testing. It was found that the neural network recognizes red apples somewhat better than green ones. This is probably due to the presence of green leaves in the background. Table 3 presents the results of the detection of red and green apples.
Let us review how the system works in the laboratory. We used red apples on the stand so that the coordinate estimation task could then be solved.
Figure 9 shows an example of detecting apples and coordinates in the laboratory.
Figure 9 shows the direct processing of data from the video camera. However, since we do not use specialized video-processing algorithms such as optical flow or motion vector estimation [42], the neural network detector processes individual frames of the video stream. This approach is applicable because apples on trees are fairly static, a situation that our laboratory stand imitates. No field studies have yet been performed to evaluate effectiveness; thus, the results are given for the laboratory stand.
Table 4 presents the comparative characteristics of the RMSE metrics for the error of estimating coordinates based on different cameras.
The analysis shows that the accuracy of position estimation using the ZED 2 camera is 10–15% higher than that of the Real Sense camera.
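For completeness, per-axis RMSE values of the kind reported in Table 4 can be computed with a short routine of the following form; the paired arrays of estimated and ground-truth coordinates are assumed inputs, since the raw measurement data are not published here.

import numpy as np

def per_axis_rmse(estimated_mm, true_mm):
    """Root-mean-square error per coordinate axis.

    estimated_mm, true_mm: N x 3 arrays of (X, Y, Z) positions in mm.
    Returns an array [RMSE_X, RMSE_Y, RMSE_Z]."""
    err = np.asarray(estimated_mm, dtype=float) - np.asarray(true_mm, dtype=float)
    return np.sqrt(np.mean(err ** 2, axis=0))

# Hypothetical check with synthetic measurements:
est = np.array([[102.0, -48.0, 905.0], [99.0, -52.0, 893.0]])
gt  = np.array([[100.0, -50.0, 900.0], [100.0, -50.0, 900.0]])
print(per_axis_rmse(est, gt))   # per-axis errors in mm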
It should also be noted that the acceptable error depends on the sizes of the gripper and the objects being manipulated. In the laboratory study, a three-fingered gripper with a girth of about 12 cm in the closed state and apples with a diameter of about 10 cm were used. In this regard, it is important that the pointing errors at the center of the apple do not exceed 1 cm in height and width, which allows the gripper to capture the apple. Figure 10 shows the manipulator grabbing the apple.
Table 5 shows the results of reinforcement learning for picking apples. Note that the time to collect all of the apples and the number of errors during movement are both reported as averages.
Clearly, the metrics for the trained model are much better, because the naive algorithm performs all of its actions without regard to the apple positions, whereas our model learned the shortest paths for the different states in the simulator.
Undoubtedly, the obtained developments can be used in the future to find rotten and infected fruits [69,70].

5. Conclusions

Thus, several algorithms were implemented in this article. First, transfer learning was performed for the YOLOv5 neural network, which provides 97% apple detection accuracy on real orchard images. Moreover, the Real Sense and ZED 2 cameras were compared for estimating the coordinates of apples. The analysis showed that the coordinate estimation errors are smaller for the ZED 2 camera: along the X and Y axes this error does not exceed 6.5 mm according to the RMSE metric, and along the Z axis it is slightly more than 1 cm. However, such errors are not critical for picking apples. Finally, a reinforcement learning model was explored in simulation, which is several times faster and more accurate than a naive apple-picking policy. In the future, we plan to optimize the processing of video data and to teach the model to distinguish between red, green and yellow apples. Further investigation will be devoted to detection in different weather conditions and at different times of day. Another important area of research will be the adaptation of the coordinate estimation algorithms to sunlight; therefore, it will be important to consider solar light interference.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Ran, Y.; Tang, H.; Li, B.; Wang, G. Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization. Appl. Sci. 2022, 12, 12622. [Google Scholar] [CrossRef]
  2. Qu, Z.; Tongqiang, H.; Tuming, Y. MFFAMM: A Small Object Detection with Multi-Scale Feature Fusion and Attention Mechanism Module. Appl. Sci. 2022, 12, 8940. [Google Scholar] [CrossRef]
  3. Andriyanov, N.A. Combining Text and Image Analysis Methods for Solving Multimodal Classification Problems. Pattern Recognit. Image Anal. 2022, 32, 489–494. [Google Scholar] [CrossRef]
  4. Tsourounis, D.; Kastaniotis, D.; Theoharatos, C.; Kazantzidis, A.; Economou, G. SIFT-CNN: When Convolutional Neural Networks Meet Dense SIFT Descriptors for Image and Sequence Classification. J. Imaging 2022, 8, 256. [Google Scholar] [CrossRef] [PubMed]
  5. Bernstein, A.V.; Burnaev, E.V.; Kachan, O.N. Reinforcement Learning for Computer Vision and Robot Navigation. In Proceedings of the Machine Learning and Data Mining in Pattern Recognition: 14th International Conference, MLDM 2018, New York, NY, USA, 15–19 July 2018; Volume 10935, pp. 258–272. [Google Scholar]
  6. Andriyanov, N.; Khasanshin, I.; Utkin, D.; Gataullin, T.; Ignar, S.; Shumaev, V.; Soloviev, V. Intelligent System for Estimation of the Spatial Position of Apples Based on YOLOv3 and Real Sense Depth Camera D415. Symmetry 2022, 14, 148. [Google Scholar] [CrossRef]
  7. Rolandi, S.; Brunori, G.; Bacco, M.; Scotti, I. The Digitalization of Agriculture and Rural Areas: Towards a Taxonomy of the Impacts. Sustainability 2021, 13, 5172. [Google Scholar] [CrossRef]
  8. López-Morales, J.A.; Martínez, J.A.; Skarmeta, A.F. Digital Transformation of Agriculture through the Use of an Interoperable Platform. Sensors 2020, 20, 1153. [Google Scholar] [CrossRef]
  9. Cho, W.; Kim, S.; Na, M.; Na, I. Forecasting of Tomato Yields Using Attention-Based LSTM Network and ARMA Model. Electronics 2021, 10, 1576. [Google Scholar] [CrossRef]
  10. United Nations: Population. Available online: https://www.un.org/en/global-issues/population (accessed on 15 January 2023).
  11. Bahn, R.A.; Yehya, A.K.; Zurayk, R. Digitalization for Sustainable Agri-Food Systems: Potential, Status, and Risks for the MENA Region. Sustainability 2021, 13, 3223. [Google Scholar] [CrossRef]
  12. Andriyanov, N.A.; Dementiev, V.E.; Tashlinskii, A.G. Detection of objects in the images: From likelihood relationships towards scalable and efficient neural networks. Comput. Opt. 2022, 46, 139–159. [Google Scholar] [CrossRef]
  13. Andriyanov, N. Estimating Object Coordinates Using Convolutional Neural Networks and Intel Real Sense D415/D455 Depth Maps. In Proceedings of the 2022 VIII International Conference on Information Technology and Nanotechnology (ITNT), Samara, Russia, 23–27 May 2022; pp. 1–4. [Google Scholar] [CrossRef]
  14. Nasiri, M.; Liebchen, B. Reinforcement learning of optimal active particle navigation. arXiv 2022, arXiv:2202.00812. [Google Scholar] [CrossRef]
  15. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. YOLOv5 Release. Available online: https://github.com/ultralytics/yolov5 (accessed on 31 December 2022).
  17. Titov, V.S.; Spevakov, A.G.; Primenko, D.V. Multispectral optoelectronic device for controlling an autonomous mobile platform. Comput. Opt. 2021, 45, 399–404. [Google Scholar] [CrossRef]
  18. Info. D415 Camera. Available online: https://www.intelrealsense.com/depth-camera-d415/ (accessed on 31 December 2022).
  19. Info. ZED-2 Camera. Available online: https://www.stereolabs.com/zed-2/ (accessed on 31 December 2022).
  20. Sumanas, M.; Petronis, A.; Bucinskas, V.; Dzedzickis, A.; Virzonis, D.; Morkvenaite-Vilkonciene, I. Deep Q-Learning in Robotics: Improvement of Accuracy and Repeatability. Sensors 2022, 22, 3911. [Google Scholar] [CrossRef]
  21. Păvăloaia, V.-D.; Husac, G. Tracking Unauthorized Access Using Machine Learning and PCA for Face Recognition Developments. Information 2023, 14, 25. [Google Scholar] [CrossRef]
  22. Darabant, A.S.; Borza, D.; Danescu, R. Recognizing Human Races through Machine Learning—A Multi-Network, Multi-Features Study. Mathematics 2021, 9, 195. [Google Scholar] [CrossRef]
  23. Tan, M.; Chao, W.; Cheng, J.-K.; Zhou, M.; Ma, Y.; Jiang, X.; Ge, J.; Yu, L.; Feng, L. Animal Detection and Classification from Camera Trap Images Using Different Mainstream Object Detection Architectures. Animals 2022, 12, 1976. [Google Scholar] [CrossRef]
  24. Villa, A.; Salazar, A.; Vargas, F. Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol. Inform. 2017, 41, 24–32. [Google Scholar] [CrossRef]
  25. Rupinder, K.; Jain, A.; Saini, P.; Kumar, S. A Review Analysis Techniques of Flower Classification Based on Machine Learning Algorithms. ECS Trans. 2022, 107, 9609. [Google Scholar]
  26. Zhenzhen, S.; Longsheng, F.; Jingzhu, W.; Zhihao, L.; Rui, L.; Yongjie, C. Kiwifruit detection in field images using Faster R-CNN with VGG16. IFAC-Pap. 2019, 52, 76–81. [Google Scholar]
  27. Andriyanov, N.A.; Gavrilina, Y.N. Image Models and Segmentation Algorithms Based on Discrete Doubly Stochastic Autoregressions with Multiple Roots of Characteristic Equations. CEUR Workshop Proc. 2018, 2076, 1–10. [Google Scholar]
  28. Vasilev, K.K.; Dementev, V.E.; Andriyanov, N.A. Application of mixed models for solving the problem on restoring and estimating image parameters. Pattern Recognit. Image Anal. 2016, 26, 240–247. [Google Scholar] [CrossRef]
  29. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame Stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  30. Brown, M.Z.; Burschka, D.; Hager, G.D. Advances in computational stereo. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 993–1008. [Google Scholar] [CrossRef]
  31. Tombari, F.; Mattoccia, S.; Stefano, L.D.; Addimanda, E. Classification and evaluation of cost aggregation methods for stereo correspondence. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, Alaska, 23–28 June 2008; pp. 1–8. [Google Scholar]
  32. Lazaros, N.; Sirakoulis, G.C.; Gasteratos, A. Review of stereo vision algorithms: From software to hardware. Int. J. Optomechatronics 2008, 2, 435–462. [Google Scholar] [CrossRef]
  33. Tombari, F.; Gori, F.; Di Stefano, L. Evaluation of stereo algorithms for 3D object recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV’11), Barcelona, Spain, 6–13 November 2011; pp. 990–997. [Google Scholar]
  34. Tippetts, B.; Lee, D.J.; Lillywhite, K.; Archibald, J. Review of stereo vision algorithms and their suitability for resource-limited systems. J. Real-Time Image Process. 2013, 8, 1–21. [Google Scholar]
  35. Stentoumis, C.; Grammatikopoulos, L.; Kalisperakis, I.; Karras, G.; Petsa, E. Stereo matching based on Census transformation of image gradients. In Proceedings Volume 9528, Videometrics, Range Imaging, and Applications XIII; SPIE: Bellingham, WA, USA, 2015; p. 12210. [Google Scholar]
  36. Andriyanov, N.A.; Vasiliev, K.K.; Dementiev, V.E. Investigation of Filtering and Objects Detection Algorithms for a Multizone Image Sequence. ISPRS Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2019, XLII-2/W12, 7–10. [Google Scholar] [CrossRef]
  37. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  38. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst. (NeurIPS) 2012, 2012, 1106–1114. [Google Scholar] [CrossRef]
  39. Girshick, R. Fast R-CNN. Available online: https://arxiv.org/abs/1504.08083 (accessed on 13 January 2023).
  40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Available online: https://arxiv.org/abs/1506.01497 (accessed on 14 January 2023).
  41. Andriyanov, N.; Papakostas, G. Optimization and Benchmarking of Convolutional Networks with Quantization and OpenVINO in Baggage Image Recognition. In Proceedings of the 2022 VIII International Conference on Information Technology and Nanotechnology (ITNT), Samara, Russia, 23–27 May 2022; pp. 1–4. [Google Scholar] [CrossRef]
  42. Wu, R.; Guo, X.; Du, J.; Li, J. Accelerating Neural Network Inference on FPGA-Based Platforms—A Survey. Electronics 2021, 10, 1025. [Google Scholar] [CrossRef]
  43. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Available online: https://arxiv.org/abs/1506.02640 (accessed on 31 December 2022).
  44. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. Available online: https://arxiv.org/abs/1512.02325 (accessed on 31 December 2022).
  45. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. Available online: https://arxiv.org/abs/1708.02002 (accessed on 31 December 2022).
  46. DarkNet-53. Available online: https://github.com/pjreddie/darknet (accessed on 31 December 2022).
  47. Zhang, Y.; Guo, Z.; Wu, J.; Tian, Y.; Tang, H.; Guo, X. Real-Time Vehicle Detection Based on Improved YOLO v5. Sustainability 2022, 14, 12274. [Google Scholar] [CrossRef]
  48. Andriyanov, N.A.; Dementiev, V.E.; Tashlinskiy, A.G. Development of a Productive Transport Detection System Using Convolutional Neural Networks. Pattern Recognit. Image Anal. 2022, 32, 495–500. [Google Scholar] [CrossRef]
  49. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  50. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. Int. Conf. Learn. Represent. 2021, 2021, 1–22. [Google Scholar]
  51. Kuznetsova, A.; Maleva, T.; Soloviev, V. Using YOLOv3 Algorithm with Pre- and Post-Processing for Apple Detection in Fruit-Harvesting Robot. Agronomy 2020, 10, 1016. [Google Scholar] [CrossRef]
  52. Yan, B.; Fan, P.; Lei, X.; Liu, Z.; Yang, F. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens. 2021, 13, 1619. [Google Scholar] [CrossRef]
  53. Huang, Z.; Zhang, P.; Liu, R.; Li, D. Immature Apple Detection Method Based on Improved Yolov3. ASP Trans. Internet Things 2021, 1, 9–13. [Google Scholar] [CrossRef]
  54. Andriyanov, N.A.; Andriyanov, D.A. The using of data augmentation in machine learning in image processing tasks in the face of data scarcity. J. Phys. Conf. Ser. 2020, 1661, 012018. [Google Scholar] [CrossRef]
  55. Xuan, G. Apple Detection in Natural Environment Using Deep Learning Algorithms. IEEE Access 2020, 8, 216772–216780. [Google Scholar] [CrossRef]
  56. Itakura, K.; Narita, Y.; Noaki, S.; Hosoi, F. Automatic pear and apple detection by videos using deep learning and a Kalman filter. OSA Contin. 2021, 4, 1688. [Google Scholar] [CrossRef]
  57. Wang, D.; Zhang, H.; Ge, B. Adaptive Unscented Kalman Filter for Target Tracking with Time-Varying Noise Covariance Based on Multi-Sensor Information Fusion. Sensors 2021, 21, 5808. [Google Scholar] [CrossRef]
  58. Gómez-Espinosa, A.; Rodríguez-Suárez, J.B.; Cuan-Urquizo, E.; Cabello, J.A.E.; Swenson, R.L. Colored 3D Path Extraction Based on Depth-RGB Sensor for Welding Robot Trajectory Generation. Automation 2021, 2, 252–265. [Google Scholar] [CrossRef]
  59. Servi, M.; Mussi, E.; Profili, A.; Furferi, R.; Volpe, Y.; Governi, L.; Buonamici, F. Metrological Characterization and Comparison of D415, D455, L515 RealSense Devices in the Close Range. Sensors 2021, 21, 7770. [Google Scholar] [CrossRef] [PubMed]
  60. Maru, M.B.; Lee, D.; Tola, K.D.; Park, S. Comparison of Depth Camera and Terrestrial Laser Scanner in Monitoring Structural Deflections. Sensors 2021, 21, 201. [Google Scholar] [CrossRef] [PubMed]
  61. Andriyanov, N. Methods for Preventing Visual Attacks in Convolutional Neural Networks Based on Data Discard and Dimensionality Reduction. Appl. Sci. 2021, 11, 5235. [Google Scholar] [CrossRef]
  62. Surmann, H.; Jestel, C.; Marchel, R.; Musberg, F.; Elhadji, H.; Ardani, M. Deep Reinforcement learning for real autonomous mobile robot navigation in indoor environments. arXiv 2020, arXiv:2005.13857. [Google Scholar]
  63. Dalal, M.; Pathak, D.; Salakhutdinov, R. Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives. arXiv 2021, arXiv:2110.15360. [Google Scholar]
  64. Vacaro, J.; Marques, G.; Oliveira, B.; Paz, G.; Paula, T.; Staehler, W.; Murphy, D. Sim-to-Real in Reinforcement Learning for Everyone. In Proceedings of the 2019 Latin American Robotics Symposium (LARS), 2019 Brazilian Symposium on Robotics (SBR) and 2019 Workshop on Robotics in Education (WRE), Rio Grande, Brazil, 23–25 October 2019; pp. 305–310. [Google Scholar]
  65. Computer Vision Annotation Tool. Available online: https://cvat.org/ (accessed on 16 January 2023).
  66. Laganiere, R.; Gilbert, S.; Roth, G. Robust object pose estimation from feature-based stereo. IEEE Trans. Instrum. Meas. 2006, 55, 1270–1280. [Google Scholar] [CrossRef]
  67. Lin, C.-J.; Jhang, J.-Y.; Lin, H.-Y.; Lee, C.-L.; Young, K.-Y. Using a Reinforcement Q-Learning-Based Deep Neural Network for Playing Video Games. Electronics 2019, 8, 1128. [Google Scholar] [CrossRef]
  68. Q-learning. Available online: https://en.wikipedia.org/wiki/Q-learning (accessed on 15 January 2023).
  69. Shaohua, W.; Sotirios, G. Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput. Netw. 2020, 168, 107036. [Google Scholar]
  70. Bhargava, A.; Bansal, A. Fruits and vegetables quality evaluation using computer vision: A review. J. King Saud Univ.-Comput. Inf. Sci. 2021, 33, 243–257. [Google Scholar] [CrossRef]
Figure 1. Stereo vision system model.
Figure 2. Image of apples for training.
Figure 3. Image markup in CVAT.
Figure 4. Manipulator with computing device and camera.
Figure 5. Scheme of operation of the manipulator.
Figure 6. ZED 2 stereo camera.
Figure 7. Simulation environment state with 3 apples.
Figure 8. Apple detection in pictures.
Figure 9. Positioning apples in 3D (Apple Confidence: 0.87, X: 1.26 mm, Y: 40.67 mm, Z: 447.0 mm).
Figure 10. Apple grabbing.
Table 1. Comparative analysis of apple detection algorithms.

Model | Precision | Performance (fps)
YOLOv5 | 86.32% | 6.54
YOLOv3 | 74.85% | 4.75
R-CNN | 69.45% | 1.06
Fast R-CNN | 66.78% | 2.78
Table 2. Comparative analysis of apple detection algorithms.

Model | mAP | mAP after Fine-Tuning
Yolov5n (640 px) | 46.42% | 46.47%
Yolov5s (640 px) | 53.22% | 53.44%
Yolov5m (640 px) | 84.75% | 86.22%
Yolov5l (640 px) | 88.92% | 90.13%
Yolov5x (640 px) | 90.31% | 94.27%
Yolov5x6 (1280 px) | 91.14% | 96.22%
Yolov5x6+ (1280 px) | 93.72% | 97.08%
Table 3. Comparative analysis of red and green apple detection.

Model | mAP Red | mAP Green
Yolov5n (640 px) | 47.54% | 46.22%
Yolov5s (640 px) | 54.86% | 52.91%
Yolov5m (640 px) | 86.78% | 85.42%
Yolov5l (640 px) | 91.07% | 88.95%
Yolov5x (640 px) | 95.64% | 93.12%
Yolov5x6 (1280 px) | 96.92% | 95.05%
Yolov5x6+ (1280 px) | 97.96% | 95.38%
Table 4. Comparative characteristics of coordinate estimation (RMSE values are measured in mm).

Camera | RMSE on X (mm) | RMSE on Y (mm) | RMSE on Z (mm)
Intel Real Sense D415 | 7.22 | 6.54 | 16.59
ZED 2 | 6.24 | 4.68 | 13.22
Table 5. Movement policy comparison.

Policy | Time (ms) | Penalties
Q-learning | 36.85 | 0
Naive (snake) | 124.75 | 5.6
