Article

Research on Apple Object Detection and Localization Method Based on Improved YOLOX and RGB-D Images

School of Mechanical Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agronomy 2023, 13(7), 1816; https://doi.org/10.3390/agronomy13071816
Submission received: 5 June 2023 / Revised: 4 July 2023 / Accepted: 6 July 2023 / Published: 8 July 2023
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

The vision-based fruit recognition and localization system is the basis for the automatic operation of agricultural harvesting robots. Existing detection models are often constrained by high complexity and slow inference speed, which do not meet the real-time requirements of harvesting robots. Here, a method for apple object detection and localization is proposed to address the above problems. First, an improved YOLOX network is designed to detect the target region, with a multi-branch topology in the training phase and a single-branch structure in the inference phase. The spatial pyramid pooling layer (SPP) with serial structure is used to expand the receptive field of the backbone network and ensure a fixed output. Second, the RGB-D camera is used to obtain the aligned depth image and to calculate the depth value of the desired point. Finally, the three-dimensional coordinates of apple-picking points are obtained by combining two-dimensional coordinates in the RGB image and depth value. Experimental results show that the proposed method has high accuracy and real-time performance: F1 is 93%, mean average precision (mAP) is 94.09%, detection speed can reach 167.43 F/s, and the positioning errors in X, Y, and Z directions are less than 7 mm, 7 mm, and 5 mm, respectively.

1. Introduction

In recent years, the artificial intelligence boom has brought new opportunities and challenges to many fields. More and more intelligent agricultural robots are being enabled by vision technology, driving innovation and development in precision agriculture. Agricultural robots are gradually replacing humans in time-consuming and labor-intensive tasks [1], such as autonomous irrigation, fertilization, and harvesting. As the largest apple producer in the world [2], China accounts for over 50% of the global apple planting area and production. Apple picking requires short bursts of intensive labor, and manual picking is costly, inefficient, and cannot guarantee timely harvesting. Therefore, developing technologies such as apple object detection and localization methods will play a crucial role in the future of fruit harvesting automation. Such innovations can ensure the timely and efficient harvesting of fruits, ultimately increasing productivity and reducing costs [3].
Machine vision plays an essential role in the development of agricultural automation, including crop yield estimation [4], autonomous crop harvesting [5], and crop growth period monitoring [6,7]. Vision-based accurate target recognition and three-dimensional (3D) positioning of objects are critical tasks for autonomous robotic harvesting [8]. Early researchers mostly used traditional digital image processing techniques to pre-process fruit images and locate the picking point to guide the robot in performing picking tasks. In recent years, the rapid development of deep learning has provided new ideas for the visual recognition and localization of harvesting robots [9]. Deep learning can autonomously learn the features of target objects, compensating for the limitations of traditional hand-crafted feature extraction [10,11]. It is now widely used in areas such as image classification [12,13], image segmentation [14,15], and object detection [16,17].
Traditional image processing techniques mainly fulfill object detection based on fruit color, texture, and shape [18,19]. Wei et al. [20] proposed an automatic fruit feature extraction method for the vision system of harvesting robots, using the improved Otsu algorithm to segment the image according to different thresholds to distinguish the fruit from the background. Lv et al. [21] segmented the apples in the image based on different color features. Then, they identified the apples by edge detection and Hough transform, which solved the recognition problem of overlapping and occluded apples. Liu et al. [22] proposed a color-feature-based detection method using a simple linear iterative clustering method to segment apple images into superpixel blocks and extract color features from them to identify candidate regions.
Since the AlexNet model [23] won the ImageNet image classification challenge, it has sparked the interest of many researchers in the feature representation capabilities of convolutional neural networks (CNN), leading to the emergence of various deep learning algorithms, such as VGGNet [24], Faster R-CNN [25,26], SSD [27], and YOLOv3 [28,29]. Deep learning for direct end-to-end recognition and automatic feature extraction has vastly improved detection accuracy [30]. Sun et al. [31] proposed a balanced feature pyramid network, which improved the accuracy of small apple detection and optimized the performance of small target detection. Wang and He [32] proposed an apple fruit detection method based on channel pruning YOLOv5’s deep learning algorithm to lighten the network by pruning redundant channels and weight parameters. Wang et al. [33] proposed a fruit segmentation method based on a lightweight backbone, which effectively solves the problem of low accuracy and over-complexity of fruit segmentation models with the same color as the background.
In recent years, scholars have proposed numerous efficient methods for 3D target localization, including monocular color camera-based [34], binocular stereo vision matching [35], and depth camera-based localization [36,37]. Tang et al. [38] completed the localization of Camellia oleifera fruit by building a binocular stereo-vision system. They used the bounding box information to reduce the computational effort of stereo matching. The principle of triangulation was used to determine the optimal picking position for Camellia oleifera fruit, which enabled the algorithm to show high stability of localization in the natural environment. Zhang et al. [39] proposed an apple localization method based on an RGB-D camera. The apple fruit region is detected based on Mask-RCNN first, and then it is combined with an RGB-D camera to obtain the depth information of the corresponding region to fulfill the apple localization. To obtain the 3D position of apples directly, Gené-Mola et al. [40] used LiDAR scanning to build a point cloud of sites and to segment the point cloud of apples based on the reflectance of the measured elements.
Despite significant progress in the development of harvesting robots, detection methods are still susceptible to natural conditions, such as seasons and weather, due to the many uncertainties of the natural environment. Therefore, improving fruit recognition and localization in practical operation remains important for current harvesting robots. These issues can be summarized as follows:
  • Many uncertainties exist in the fruit-growing environment, so most harvesting robots identify fruit inaccurately and are prone to misidentification and missed identification, especially for apples obscured by branches and leaves.
  • Due to the increased network depth, the existing object detection model has high complexity. In the actual picking task, the network inference speed is slow, and the operation efficiency is low, which does not meet the real-time requirements of the picking robot.
  • The two-dimensional (2D) image cannot provide depth information for acquiring apple-picking points. Moreover, the localization based on the 3D data of the scene has too much redundant data, which are complicated and inefficient to process. How to fuse multiple data to achieve efficient localization of target apples is another research problem.
To solve the above problems, this paper proposes a method for accurate apple recognition and fast localization based on improved YOLOX and RGB-D images. The detection of target apples is accomplished using an improved YOLOX network to speed up the network inference speed while maintaining the detection accuracy. At the same time, a depth camera is used to acquire depth images aligned with RGB images. The 3D coordinates of apple-picking points are obtained by combining RGB images and depth information to meet the real-time recognition and localization requirements of harvesting robots.

2. Materials and Methods

2.1. Overall Framework

The purpose of this study is to propose a fast and high-precision method for detecting and localizing apples in 3D space, which can guide robots for real-time fruit picking and improve picking efficiency. The overall framework of the proposed method is shown in Figure 1, which mainly consists of three parts:
  • Construction of the apple dataset. The Intel RealSense L515 camera is used to collect original apple images. Data augmentation approaches such as flipping, scaling, mosaic, mixup, and HSV transformation [41] are applied to expand the sample size and improve model generalization. The software LabelImg [42] is used to create the dataset labels in PASCAL VOC format, which provide target category and location information.
  • Apple object detection based on the improved YOLOX network. The improved YOLOX network adopts a multi-branch stacking structure during training and a single-branch structure during inference, significantly improving inference speed without sacrificing accuracy.
  • Apple localization based on RGB and depth image. The depth camera is used to capture a depth image aligned with the RGB image. By combining the 2D coordinates in the RGB image with the depth information, the 3D coordinates of the apple-picking points are obtained to complete the target apple localization.

2.2. Images Acquisition

The original images of the apples were acquired at Green-hang Orchard, located in Liu-he District, Nanjing City, Jiangsu Province, China. The image acquisition equipment was the Intel RealSense L515 camera, which supports distance detection from 0.25 m to 9 m. Images were acquired from 10:00 a.m. to 5:00 p.m. Apple images were collected under different lighting conditions and from different angles at a distance of 0.4 m to 1.2 m from the fruit trees. The sample images were in JPG format with a resolution of 640 × 480. In total, 861 images were collected for this study. As shown in Figure 2, the sample images included whole and partial fruit trees, branch and leaf shading, and different lighting conditions, so that the dataset could reflect the natural environment. Considering the limited diversity of samples collected in a single orchard, an apple dataset with a total of 4785 images was constructed by combining them with public apple images [43] from the Internet. The dataset was divided into a training set, validation set, and test set in the ratio of 8:1:1.
The software LabelImg was used to create the dataset labels. After labeling, the corresponding XML files were generated. The labeling format adopted was the PASCAL VOC format. The target category and location information can be obtained by reading tag files for the subsequent network training.
When the dataset is small and homogeneous, the model may mistake noise or unimportant sample features for real features. Therefore, large and diverse data samples are required when using deep networks. This paper uses online data augmentation methods, including flipping, scaling, mosaic, mixup, and HSV transformation, to generate new data that are similar but not identical during each training iteration. This helps the model learn different features and variations of apple fruits and improves the generalization ability and accuracy of the model.
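As a concrete illustration, the sketch below shows two of the online augmentations named above, HSV jitter and horizontal flipping, implemented with OpenCV and NumPy. The function names and gain ranges are illustrative assumptions (the gains follow common YOLO-style pipelines), not the exact settings used in this study.

```python
import cv2
import numpy as np

def hsv_jitter(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly shift hue, saturation, and value via lookup tables (YOLO-style)."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hue, sat, val = cv2.split(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV))
    dtype = img_bgr.dtype                      # uint8
    x = np.arange(0, 256, dtype=r.dtype)
    lut_h = ((x * r[0]) % 180).astype(dtype)   # OpenCV hue range is 0-179
    lut_s = np.clip(x * r[1], 0, 255).astype(dtype)
    lut_v = np.clip(x * r[2], 0, 255).astype(dtype)
    img_hsv = cv2.merge((cv2.LUT(hue, lut_h), cv2.LUT(sat, lut_s), cv2.LUT(val, lut_v)))
    return cv2.cvtColor(img_hsv, cv2.COLOR_HSV2BGR)

def random_flip(img, boxes, p=0.5):
    """Horizontally flip the image and its (x_min, y_min, x_max, y_max) boxes."""
    if np.random.rand() < p:
        img = img[:, ::-1].copy()
        w = img.shape[1]
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # mirror box x-coordinates
    return img, boxes
```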

2.3. Apple Object Detection Based on an Improved YOLOX Network

The YOLO family of detection networks is currently widely used in agricultural detection. Considering the orchard environment and the real-time requirements of harvesting equipment, the one-stage object detection network YOLOX was selected and improved for apple recognition. Compared with other detection networks, YOLOX handles complex scenes, such as occlusion, blurring, and lighting variations, better and is more robust [44]. Meanwhile, YOLOX effectively improves detection speed and accuracy by using adaptive convolutional neural networks. It is a detection algorithm that combines speed, accuracy, versatility, and scalability, making it suitable for the real-time detection of apple fruits.
The structure of the improved network is shown in Figure 3, which can be roughly divided into three modules: the backbone, neck, and head. The backbone is the critical part of feature extraction, which generates feature maps by extracting input image information for subsequent network use. The neck adopts the Feature Pyramid Network and Path Aggregation Network (FPN+PAN) pyramid structure, mainly used for the fusion processing of multi-scale feature layers on the backbone. Moreover, it can enhance the expressiveness of the network by fusing the semantic and localization features of high and low layers. Finally, the head predicts the target class and location using all the extracted features.
The improvements in this study for the YOLOX-based object detection network specifically include the following. (1) Re-parameterizing the backbone and neck network parts by replacing the original standard convolution and residual blocks with DBB and RepBlock modules, respectively. This measure can strengthen the feature extraction capability of the network and speed up the inference speed. (2) Replacing the spatial pyramid pooling (SPP) layer of the backbone network with a spatial pyramid pooling-faster (SPPF) layer, which uses a serial structure and ReLU activation function to accelerate computation. (3) A method of adaptive dynamic loss weight coefficients is adopted. The coefficients are adjusted dynamically by monitoring the changes in the loss function, which makes the model reach the convergence state faster. The following section describes the critical parts of the network’s improvement.

2.3.1. Structural Reparameterization

This study mainly uses structural reparameterization to improve the YOLOX-network-based object detection method further. The improved YOLOX network replaces the original standard convolution of the backbone network with DBB modules [45] and the original residual blocks of the network with RepBlock modules. This approach aims to enhance the network’s performance and accuracy by optimizing its structural components.
As shown in Figure 4a, the DBB module adopts a multi-branch topology, and each branch mainly includes 1 × 1 convolution, 3 × 3 convolution, average pooling (AVG), and batch normalization (BN) layers. The structure of the DBB module is different during the training and inference stages. During training, the multi-branch model improves the model’s characterization ability and enables the deep network to extract more robust features, mitigating the problem of deep network gradient disappearance. During inference, the model transforms into a single-branch structure that considers each branch’s varying computational speed and maximizes hardware computing power. This makes the network focus on accuracy in the training stage and more on speed in the inference stage. The RepBlock modules consist of multiple DBB modules, as shown in Figure 4b. The first DBB module adjusts the size of the feature map, and the subsequent DBB modules modify the number of channels.
Then, the multi-branch model from the training phase is converted into a single-branch model for the inference phase. The convolutional layers of each branch are converted into 3 × 3 convolutional layers and then fused with their BN layers [46]. After this conversion and fusion, each branch reduces to a single 3 × 3 convolution. Finally, the 3 × 3 convolutional layers of all branches are fused into a single 3 × 3 convolutional layer for inference. This approach significantly reduces the number of parameters in the inference stage and accelerates the inference speed without compromising the network's accuracy.
For the i-th channel of the convolution layer, the formula for the convolution and BN layer fusion transformation is shown in Equation (1), and the new weight and bias calculation formula is shown in Equation (2):
$\mathrm{BN}(M \times W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M \times W')_{:,i,:,:} + b'_i$ (1)
$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \quad b'_i = \beta_i - \frac{\mu_i \gamma_i}{\sigma_i}$ (2)
where $M$ represents the feature map input to the BN layer; $\mu$ (accumulated mean), $\sigma$ (standard deviation), $\gamma$ (learned scaling factor), and $\beta$ (bias) are the four parameters of the BN layer; $W$ is the initial weight of the convolutional layer; and $W'$ and $b'$ are the new weight and bias of the convolutional layer after the conversion.
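A minimal PyTorch sketch of the transformation in Equations (1) and (2) is given below: it folds a BN layer into the preceding convolution so that a conv + BN branch collapses into a single convolution for inference. The helper name is ours; public DBB/RepVGG implementations follow the same idea.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    sigma = torch.sqrt(bn.running_var + bn.eps)        # sigma_i
    scale = bn.weight / sigma                          # gamma_i / sigma_i
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    # W'_i = (gamma_i / sigma_i) * W_i, Equation (2)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # b'_i = beta_i - mu_i * gamma_i / sigma_i (plus any original conv bias)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```

After this fusion, the 1 × 1 branch kernels can be zero-padded to 3 × 3 (e.g., with torch.nn.functional.pad) and the kernels and biases of all branches summed, yielding the single 3 × 3 inference-time convolution described above.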

2.3.2. Spatial Pyramid Pooling-Faster

The SPP layer is designed primarily to solve the problem of non-uniform input image sizes. Multiple MaxPooling layers are used to ensure that the input passed to the next layer has a fixed size. Compared with the original SPP layer [47], this study adopts an SPPF layer, which not only improves the accuracy of the convolutional network architecture but also adopts a serial structure that makes computation faster. Since the SPPF layer is located in the last layer of the backbone network, the original Sigmoid Linear Unit (SiLU) activation function is replaced by the Rectified Linear Unit (ReLU) to achieve faster convergence. Additionally, the ReLU activation function helps to increase the sparsity of the network by discarding certain inputs in the deep layers of the network, thus mitigating the vanishing gradient problem. As shown in Figure 5, the SPPF structure consists mainly of three serial MaxPool2d layers with a size of 5 × 5, where the output of each pooling layer is the input of the next. Finally, the pooled features are combined, and a fixed-length output is generated through the SPPF layer.
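A sketch of such an SPPF block is shown below: three serial 5 × 5 max-pooling layers whose outputs are concatenated with the block input and projected back with 1 × 1 convolutions using ReLU. The channel halving and the concatenation follow the common SPPF design and are assumptions here, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    def __init__(self, c_in, c_out, k=1):
        super().__init__(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True))

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNReLU(c_in, c_hidden, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = ConvBNReLU(c_hidden * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)      # one 5x5 pool
        y2 = self.pool(y1)     # two serial 5x5 pools, receptive field comparable to 9x9
        y3 = self.pool(y2)     # three serial 5x5 pools, receptive field comparable to 13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))   # fixed-length output
```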

2.3.3. Dynamic Loss Function

During network training, the loss contains three components: regression loss, confidence loss, and category loss. The original YOLOX algorithm does not explicitly define the weight coefficients of the three losses. For this reason, this study adopts an adaptive dynamic loss weighting method, which dynamically adjusts the coefficients by monitoring the changes in the loss function. Specifically, the proportion of each loss component in the total loss is used as its weight, so that the convergence of the different loss components is accelerated in a targeted way according to the loss distribution. This approach not only improves training efficiency but also helps the model reach convergence faster. In the prediction task, the regression loss carries an additional weight factor because locating the target is more important than predicting its class and objectness. The relevant calculation formulas are shown below:
$L_{sum} = L_{bbox} + L_{obj} + L_{class}$ (3)
$L_{all} = \lambda \frac{L_{bbox}}{L_{sum}} L_{bbox} + \frac{L_{obj}}{L_{sum}} L_{obj} + \frac{L_{class}}{L_{sum}} L_{class}$ (4)
where $L_{bbox}$ is the regression loss, $L_{obj}$ is the confidence loss, and $L_{class}$ is the category loss. $\lambda$ is the additional weight factor for the regression loss.
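A minimal sketch of Equations (3) and (4) is given below, assuming the three loss components are scalar tensors computed per batch. Detaching the weights so that they act as coefficients rather than extra gradient paths, and the value of λ, are our assumptions; the paper does not state them.

```python
import torch

def dynamic_weighted_loss(l_bbox, l_obj, l_class, lam=5.0):
    l_sum = l_bbox + l_obj + l_class                           # Equation (3)
    # each component's share of the total loss becomes its weight
    w_bbox = (l_bbox / l_sum).detach()
    w_obj = (l_obj / l_sum).detach()
    w_class = (l_class / l_sum).detach()
    # extra factor lam on the regression term, Equation (4)
    return lam * w_bbox * l_bbox + w_obj * l_obj + w_class * l_class
```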

2.4. 3D Localization of Apple-Picking Point

In this study, the 3D location of apple-picking points is obtained based on detecting target fruit regions. The improved network detects the target fruit region on the captured RGB image, and the RGB-D camera obtains the aligned depth image and determines the coordinates for apple picking.

2.4.1. Obtain Coordinate

The bounding box predicted by the object detection network contains the category, confidence, and coordinate information (in pixels). As shown in Figure 6, the coordinate ( x 0 ,   y 0 ) of the central point is calculated by extracting the coordinates of the upper left corner ( x l ,   y l ) and the lower right corner ( x r ,   y r ) of the bounding box. Its calculation formula is shown in Equation (5), and the width and height of the bounding boxes are calculated in Equation (6).
$x_0 = x_l + \frac{x_r - x_l}{2}, \quad y_0 = y_l + \frac{y_r - y_l}{2}$ (5)
$w = x_r - x_l, \quad h = y_r - y_l$ (6)
In the practical picking task, the RGB image is a projection of the 3D scene on a 2D plane and cannot provide depth information. Therefore, to determine the target apples’ location in 3D space, this study uses the Intel RealSense L515 depth camera to obtain the aligned depth image, as shown in Figure 7.
Considering the actual conditions in practical detection, the depth image may suffer from missing values and precision errors. Therefore, in addition to the central pixel point of the target bounding box, the adjacent top, bottom, left, and right pixel points are also extracted. The depth values of these five pixel points are read from the aligned depth image, and their average is used as the depth value $z_0$ of the apple-picking point, as in Equation (7), where $f_t$ is an internal camera parameter for unit conversion and $d(x_i, y_i)$ is the depth value of the corresponding point acquired by the camera:
$z_0 = \mathrm{mean}\left(\frac{d(x_i, y_i)}{f_t}\right)$ (7)
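The procedure of Equations (5)–(7) can be sketched as follows, assuming the aligned depth image is available as a NumPy array (e.g., obtained via pyrealsense2 after alignment). The variable names and the example value of f_t are illustrative assumptions.

```python
import numpy as np

def picking_point_depth(box, depth_image, f_t=1000.0):
    """box = (x_l, y_l, x_r, y_r) in pixels; depth_image holds raw depth values d(x, y);
    f_t is the unit-conversion factor of Equation (7) (e.g., 1000 if depth is in mm)."""
    x_l, y_l, x_r, y_r = box
    x0 = int(x_l + (x_r - x_l) / 2)                 # Equation (5)
    y0 = int(y_l + (y_r - y_l) / 2)
    w, h = x_r - x_l, y_r - y_l                     # Equation (6)
    # centre pixel plus its four neighbours, to tolerate missing or noisy depth values
    points = [(x0, y0), (x0 - 1, y0), (x0 + 1, y0), (x0, y0 - 1), (x0, y0 + 1)]
    z0 = float(np.mean([depth_image[y, x] / f_t for x, y in points]))   # Equation (7)
    return (x0, y0), (w, h), z0
```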

2.4.2. Coordinate Conversion

The corresponding points in the RGB image are in pixels and need to be converted from pixel coordinates to image coordinates and then to camera coordinates. The RGB image shown in Figure 8 is the imaging plane. The coordinate system o1-u-v represents the pixel coordinate system, with its origin located at the top-left corner of the imaging plane and the unit of measurement in pixels. The coordinate system o2-x-y represents the image coordinate system, with its origin located at the intersection of the optical axis and the imaging plane, and the unit of measurement in millimeters. Lastly, the coordinate system O-X-Y represents the camera coordinate system, with its origin located at the optical center of the lens and the unit of measurement in meters.
Then, to convert pixel coordinates to be used under the camera coordinate system, coordinate transformation can be carried out with the help of the internal parameter matrix of the camera, as shown in Equation (8):
$M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ (8)
where f x represents the length of the focal length in the x-axis direction using pixels, f y represents the length of the focal length in the y-axis direction using pixels, and c x , c y represent the offsets of the central point of the camera sensor chip in the x and y directions of the pixel coordinate system, respectively.
The width and height of the target bounding box are further used to estimate the radius $r$ of the apple. To provide more inclusive coordinates for the picking robot, half of the larger of the bounding box width and height, converted to metric units, is taken as the radius $r$. The calculation formula is shown in Equation (9).
$r = \begin{cases} \frac{z_0 \cdot w}{2 f_x}, & w \ge h \\ \frac{z_0 \cdot h}{2 f_y}, & w < h \end{cases}$ (9)
To ensure that the picking robotic arm can pick apples efficiently and accurately, the depth values need to be limited to avoid misdetecting fruit on neighboring fruit trees at certain angles. Considering the working range of the picking robot arm and the spacing between fruit trees, coordinates with depth values exceeding two meters are filtered out. This improves picking efficiency for nearby fruit and facilitates path planning of the robotic arm.
Finally, the 3D coordinates ( X , Y , Z ) of the apple-picking point in the camera coordinate system are calculated as Equation (10).
$\begin{cases} X = \frac{z_0 \cdot (x_0 - c_x)}{f_x} \\ Y = \frac{z_0 \cdot (y_0 - c_y)}{f_y} \\ Z = z_0 + r \end{cases}$ (10)
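A compact sketch of Equations (8)–(10), including the radius estimate of Equation (9) and the 2 m depth filter described above, might look as follows; the intrinsics (f_x, f_y, c_x, c_y) are supplied by the camera and appear here only as parameters.

```python
def pixel_to_camera(x0, y0, z0, w, h, fx, fy, cx, cy, max_depth=2.0):
    """Back-project the picking-point pixel (x0, y0) with depth z0 into camera coordinates."""
    if z0 > max_depth:                                        # likely a neighbouring tree
        return None
    r = z0 * w / (2 * fx) if w >= h else z0 * h / (2 * fy)    # Equation (9)
    X = z0 * (x0 - cx) / fx                                   # Equation (10)
    Y = z0 * (y0 - cy) / fy
    Z = z0 + r                                                # surface point pushed to fruit centre
    return X, Y, Z
```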

3. Experiments and Results

3.1. Experimental Platform

The experimental environment of this study is shown in Table 1, and the constructed apple dataset was used to train the model. During model training, a batch size of four images was used, and the model was trained for 100 epochs. The best weights were saved every 10 epochs, and BN layers were used for regularization during weight updates. The initial learning rate was set to 0.01. Since the original network architecture was modified, pretrained weights were not loaded, and training started from scratch.

3.2. Evaluation Indicators

Three kinds of evaluation indicators are mainly used, covering detection accuracy, model complexity, and detection speed.
  • Detection accuracy measurement. mAP and F1 score were used as objective criteria to evaluate detection accuracy. The mAP is the average of the AP over all categories, where AP is the area under the Precision-Recall (P-R) curve. Since the dataset in this paper contains a single category, mAP is equal to AP in this study. F1 is an index that balances Precision and Recall. The relevant calculation formulas are shown below, and a minimal code sketch of these metrics follows this list.
    $\mathrm{Precision} = \frac{TP}{TP + FP}$ (11)
    $\mathrm{Recall} = \frac{TP}{TP + FN}$ (12)
    $AP = \int_0^1 p(r)\,dr$ (13)
    $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (14)
  • Model complexity evaluation. The complexity of the model was measured by floating-point operations (FLOPs) and Params. FLOPs count the number of multiplication and addition operations, and Params is the total number of weight parameters over all layers with parameters.
  • Detection speed evaluation. The FPS measures the network detection speed, i.e., the number of images that can be processed per second. The calculation formula is shown in Equation (15), where Latency is the time used by the model to predict an image.
    $FPS = \frac{1}{\mathrm{Latency}}$ (15)
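A minimal sketch of Equations (11)–(15) is given below. AP is computed here as the raw area under the P-R curve by numerical integration; the VOC-style interpolation used by common mAP tools differs slightly, so this is illustrative rather than the exact evaluation code.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                                   # Equation (11)
    recall = tp / (tp + fn)                                      # Equation (12)
    f1 = 2 * precision * recall / (precision + recall)           # Equation (14)
    return precision, recall, f1

def average_precision(scores, is_tp, n_gt):
    """scores: confidence of each detection; is_tp: bool array, True if the detection
    matched a ground-truth apple; n_gt: number of ground-truth apples."""
    order = np.argsort(-scores)                                  # sweep confidence threshold
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    return float(np.trapz(precision, recall))                    # Equation (13)

def fps(latency_seconds):
    return 1.0 / latency_seconds                                 # Equation (15)
```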

3.3. Analysis of Experimental Results

3.3.1. Results of the Improved YOLOX Network

The proposed model obtains the optimal weights by training under the above-mentioned experimental platform and training parameter settings. During training, the loss trend over the iterations and the mAP curve are shown in Figure 9. It can be seen that the loss clearly stabilizes between the 80th and 100th epoch, and the mAP reaches its best value on the validation set. Finally, the precision of the proposed model is 93.72%, the recall is 92.24%, the F1 is 93%, and the mAP is 94.09%.
By comparing the loss variation of the original YOLOX algorithm and the proposed model in the training phase, as shown in Figure 10, the loss curve of the proposed model decreases rapidly and stabilizes for the same number of iterations. This result demonstrates that the proposed dynamic loss function plays an active role in the training process of the optimized model, accelerating the convergence speed and reducing the loss value.
The optimal weight obtained from training was used for testing on 479 test images. The partial object detection effect of the proposed method is shown in Figure 11.
To further verify the superiority of the proposed method, it was compared with Faster R-CNN, YOLOv4, Mobilenetv3_YOLOv4, YOLOv5, YOLOv7, YOLOv8, and the original YOLOX for object detection. As shown in Figure 12, the parts circled in yellow indicate unrecognized targets. It can be clearly seen that the comparison algorithms tend to miss some occluded targets. In contrast, the proposed method demonstrates better recognition performance for overlapping and occluded apples in the natural environment, identifying almost all fruits within the target region. Additionally, the performance of the models was compared based on the evaluation indicators, and the experimental results of the different networks are shown in Table 2.
As shown in Table 2, the proposed method achieves an mAP of 94.09% for apple object detection. Compared to Faster R-CNN, YOLOv4, Mobilenetv3_YOLOv4, YOLOv5, YOLOv7, and YOLOv8, the proposed method improved the mAP by 5.86%, 6.54%, 21.78%, 3.4%, 11.88%, and 0.86%, respectively. Meanwhile, the proposed method shows a 1.18% mAP improvement over the original YOLOX network. Additionally, the proposed method has the highest F1 score of 93% relative to the other networks. Although the parameters and FLOPs of the proposed method increased relative to the original YOLOX network, they are still lower than those of the other comparison algorithms. Compared with the original network, the proposed method improved the detection speed by 41.43%.
Furthermore, the P-R comparison curves of each network are shown in Figure 13 to visualize the performance of the above networks on the test set. The proposed model in this paper has the largest area under the P-R curve. Since AP refers to the area under the P-R curve, the AP value of the proposed method is the best compared to other networks, i.e., the highest accuracy. Considering the detection accuracy and operation speed, compared with other object detection networks, the proposed model in this study is more efficient and meets the accuracy and real-time requirements of the picking robot.

3.3.2. Results of Apple Localization

To quantitatively evaluate the accuracy of the positioning method proposed in this study, 15 target apples were selected for coordinate positioning error analysis. After fixing the camera view, a vernier caliper was used to measure the horizontal and vertical diameters of each target apple, and half of the measured values were taken to locate the central point, which was marked on the fruit surface with a marker pen. Next, the coordinates (X0, Y0, Z0) of the marked point were read in the Intel RealSense L515 3D view and used as the actual coordinates of the apple. The true radius r of each apple was taken as half of the larger of the measured horizontal and vertical diameters. Since the marked point lies on the fruit surface, the radius r was added to the recorded depth value to obtain the Z0 value of the final positioning point. The detected coordinate values (X, Y, Z) were obtained according to the method described in Section 2.4. The coordinate errors were calculated by comparing the actual coordinate values with the detected coordinate values over the 15 sets of experimental data. The experimental results are shown in Table 3.
It can be seen that the coordinate errors in both the X-axis and Y-axis directions are less than 7 mm, and the coordinate errors in the Z-axis direction are less than 5 mm. The average error and average standard error over the X-axis, Y-axis, and Z-axis directions were further calculated to be 2.31 mm and 1.37 mm, respectively. These errors include human measurement errors and internal parameter errors of the camera calibration, which demonstrates that the spatial coordinates of fruits obtained by this method meet the picking requirements of apple harvesting robots.
To further verify the accuracy and stability of the proposed apple object detection and localization method, the Intel RealSense L515 camera and a six-degree-of-freedom robot arm were used to carry out a picking test in an apple orchard. As shown in Figure 14, the proposed model can detect and localize apples in real time under different light conditions in the morning and evening, and provide better picking points for most cases.

4. Conclusions

With the rapid development of intelligent agriculture, research on vision-based object detection and localization methods helps accelerate the development of next-generation agricultural systems. Many uncertainties exist in the fruit-growing environment, which increases picking difficulty. Meanwhile, existing networks have high complexity and slow inference speed, making them unsuitable for the real-time needs of harvesting robots.
For this reason, an improved YOLOX network with multi-branch feature extraction and single-branch inference is proposed, which improves inference speed while maintaining object detection accuracy. The improved network detects the target fruit region in the captured RGB image, while the RGB-D camera provides the aligned depth image used to determine the coordinates for apple picking. The experimental performance of the proposed model was measured against other networks using the evaluation indicators. Experimental results show that the improved network has high accuracy and real-time performance: F1 is 93%, mAP is 94.09%, and the detection speed reaches 167.43 F/s. Fifteen target points were selected for coordinate error testing; the positioning errors in the X and Y directions were less than 7 mm, and the error in the Z direction was less than 5 mm.
Future research may focus on expanding the apple dataset for different growth periods. In addition, considering the limitations of a single camera, multiple depth cameras can be further utilized to fuse multiple information and to provide more accurate positioning information.

Author Contributions

Conceptualization, T.H. and W.W.; methodology, T.H. and W.W.; software, T.H. and Z.X.; validation, T.H., J.Z., and B.W.; supervision, J.G.; data curation, J.Z. and B.W.; writing—original draft preparation, T.H.; writing—review and editing, T.H., W.W., J.G., and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Project of Jiangsu Province Key Research and Development Program (No. BE2021016-3), the National Natural Science Foundation of China (No. 52105516), and the 21st batch of scientific research projects for university students in Jiangsu University (No. 21A152).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the privacy policy of the organization.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bechar, A.; Vigneault, C. Agricultural Robots for Field Operations: Concepts and Components. Biosyst. Eng. 2016, 149, 94–111.
  2. Zhu, Z.; Jia, Z.; Peng, L.; Chen, Q.; He, L.; Jiang, Y.; Ge, S. Life Cycle Assessment of Conventional and Organic Apple Production Systems in China. J. Clean. Prod. 2018, 201, 156–168.
  3. Behera, S.K.; Rath, A.K.; Mahapatra, A.; Sethy, P.K. Identification, Classification & Grading of Fruits Using Machine Learning & Computer Intelligence: A Review. J. Ambient. Intell. Humaniz. Comput. 2020, 4, 1–11.
  4. Maheswari, P.; Raja, P.; Apolo-Apolo, O.E.; Pérez-Ruiz, M. Intelligent Fruit Yield Estimation for Orchards Using Deep Learning Based Semantic Segmentation Techniques—A Review. Front. Plant Sci. 2021, 12, 2603.
  5. Kang, H.; Zhou, H.; Wang, X.; Chen, C. Real-Time Fruit Recognition and Grasping Estimation for Robotic Apple Harvesting. Sensors 2020, 20, 5670.
  6. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple Detection during Different Growth Stages in Orchards Using the Improved YOLO-V3 Model. Comput. Electron. Agric. 2019, 157, 417–426.
  7. Wang, W.; Hu, T.; Gu, J. Edge-Cloud Cooperation Driven Self-Adaptive Exception Control Method for the Smart Factory. Adv. Eng. Inform. 2022, 51, 101493.
  8. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A Review of Key Techniques of Vision-Based Control for Harvesting Robot. Comput. Electron. Agric. 2016, 127, 311–323.
  9. Li, Y.; Feng, Q.; Li, T.; Xie, F.; Liu, C.; Xiong, Z. Advance of Target Visual Information Acquisition Technology for Fresh Fruit Robotic Harvesting: A Review. Agronomy 2022, 12, 1336.
  10. Li, P.; Jing, R.; Shi, X. Apple Disease Recognition Based on Convolutional Neural Networks with Modified Softmax. Front. Plant Sci. 2022, 13, 820146.
  11. Sharma, V.; Mir, R.N. A Comprehensive and Systematic Look up into Deep Learning Based Object Detection Techniques: A Review. Comput. Sci. Rev. 2020, 38, 100301.
  12. Wang, Y.H.; Su, W.H. Convolutional Neural Networks in Computer Vision for Grain Crop Phenotyping: A Review. Agronomy 2022, 12, 2659.
  13. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. GRS 2019, 57, 6690–6709.
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 386–397.
  15. Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321.
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  17. Zhang, C.; Kang, F.; Wang, Y. An Improved Apple Object Detection Method Based on Lightweight YOLOv4 in Complex Backgrounds. Remote Sens. 2022, 14, 4150.
  18. Kuang, H.; Liu, C.; Chan, L.L.H.; Yan, H. Multi-Class Fruit Detection Based on Image Region Selection and Improved Object Proposals. Neurocomputing 2018, 283, 241–255.
  19. Zhou, M.; Fakayode, O.A.; Ahmed Yagoub, A.E.G.; Ji, Q.; Zhou, C. Lignin Fractionation from Lignocellulosic Biomass Using Deep Eutectic Solvents and Its Valorization. Renew. Sustain. Energy Rev. 2022, 156, 111986.
  20. Wei, X.; Jia, K.; Lan, J.; Li, Y.; Zeng, Y.; Wang, C. Automatic Method of Fruit Object Extraction under Complex Agricultural Background for Vision System of Fruit Picking Robot. Optik 2014, 125, 5684–5689.
  21. Jidong, L.; De-An, Z.; Wei, J.; Shihong, D. Recognition of Apple Fruit in Natural Environment. Optik 2016, 127, 1354–1362.
  22. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Sun, Y. A Detection Method for Apple Fruits Based on Color and Shape Features. IEEE Access 2019, 7, 67923–67933.
  23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
  24. Prasetyo, E.; Suciati, N.; Fatichah, C. Multi-Level Residual Network VGGNet for Fish Species Classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 5286–5295.
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  26. Wan, S.; Goudos, S. Faster R-CNN for Multi-Class Fruit Detection Using a Robotic Vision System. Comput. Netw. 2020, 168, 107036.
  27. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
  28. Liu, T.H.; Nie, X.N.; Wu, J.M.; Zhang, D.; Liu, W.; Cheng, Y.F.; Zheng, Y.; Qiu, J.; Qi, L. Pineapple (Ananas Comosus) Fruit Detection and Localization in Natural Environment Based on Binocular Stereo Vision and Improved YOLOv3 Model. Precis. Agric. 2023, 24, 139–160.
  29. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection. Expert. Syst. Appl. 2021, 172, 114602.
  30. Wang, W.; Zhang, Y.; Gu, J.; Wang, J. A Proactive Manufacturing Resources Assignment Method Based on Production Performance Prediction for the Smart Factory. IEEE Trans. Industr. Inform. 2022, 18, 46–55.
  31. Sun, M.; Xu, L.; Chen, X.; Ji, Z.; Zheng, Y.; Jia, W. BFP Net: Balanced Feature Pyramid Network for Small Apple Detection in Complex Orchard Environment. Plant Phenomics 2022, 2022, 9892464.
  32. Wang, D.; He, D. Channel Pruned YOLO V5s-Based Deep Learning Approach for Rapid and Accurate Apple Fruitlet Detection before Fruit Thinning. Biosyst. Eng. 2021, 210, 271–281.
  33. Wang, Z.; Zhang, Z.; Lu, Y.; Luo, R.; Niu, Y.; Yang, X.; Jing, S.; Ruan, C.; Zheng, Y.; Jia, W. SE-COTR: A Novel Fruit Segmentation Model for Green Apples Application in Complex Orchard. Plant Phenomics 2022, 2022, 0005.
  34. Mehta, S.S.; Burks, T.F. Vision-Based Control of Robotic Manipulator for Citrus Harvesting. Comput. Electron. Agric. 2014, 102, 146–158.
  35. Ji, W.; Meng, X.; Qian, Z.; Xu, B.; Zhao, D. Branch Localization Method Based on the Skeleton Feature Extraction and Stereo Matching for Apple Harvesting Robot. Int. J. Adv. Robot Syst. 2017, 14, 1729881417705276.
  36. Yuan, Z.; Li, Y.; Tang, S.; Li, M.; Guo, R.; Wang, W. A Survey on Indoor 3D Modeling and Applications via RGB-D Devices. Front. Inf. Technol. Electron. Eng. 2021, 22, 815–826.
  37. Fu, L.; Gao, F.; Wu, J.; Li, R.; Karkee, M.; Zhang, Q. Application of Consumer RGB-D Cameras for Fruit Detection and Localization in Field: A Critical Review. Comput. Electron. Agric. 2020, 177, 105687.
  38. Tang, Y.; Zhou, H.; Wang, H.; Zhang, Y. Fruit Detection and Positioning Technology for a Camellia Oleifera C. Abel Orchard Based on Improved YOLOv4-Tiny Model and Binocular Stereo Vision. Expert. Syst. Appl. 2023, 211, 118573.
  39. Zhang, K.; Lammers, K.; Chu, P.; Li, Z.; Lu, R. System Design and Control of an Apple Harvesting Robot. Mechatronics 2021, 79, 102644.
  40. Gené-Mola, J.; Gregorio, E.; Guevara, J.; Auat, F.; Sanz-Cortiella, R.; Escolà, A.; Llorens, J.; Morros, J.R.; Ruiz-Hidalgo, J.; Vilaplana, V.; et al. Fruit Detection in an Apple Orchard Using a Mobile Terrestrial Laser Scanner. Biosyst. Eng. 2019, 187, 171–184.
  41. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
  42. Tzutalin. LabelImg. Git Code. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 31 March 2015).
  43. Gené-Mola, J.; Ferrer-Ferrer, M.; Gregorio, E.; Rosell-Polo, J.R.; Vilaplana, V.; Ruiz-Hidalgo, J.; Morros, J.R. PApple_RGB-D-Size dataset [Data set]. Zenodo. 2022. Available online: https://github.com/GRAP-UdL-AT/Amodal_Fruit_Sizing (accessed on 4 October 2022).
  44. Zhang, Y.; Zhang, W.; Yu, J.; He, L.; Chen, J.; He, Y. Complete and Accurate Holly Fruits Counting Using YOLOX Object Detection. Comput. Electron. Agric. 2022, 198, 107062.
  45. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-like Unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10881–10890.
  46. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13728–13742.
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; Volume 8691, pp. 346–361.
Figure 1. The overall framework of apple object detection and localization.
Figure 2. Original apple image acquisition examples: (a) Whole fruit; (b) Partial fruit; (c) Front light; (d) Back light.
Figure 3. The structure of the improved YOLOX network.
Figure 4. Structural reparameterization: (a) DBB module; (b) RepBlock module.
Figure 5. SPPF structure (k represents the size of the convolution kernel; s represents the stride, which refers to the distance that the convolutional kernel moves on the input data at each step; p represents the padding, which means filling in fixed data around the input data).
Figure 6. Obtaining the central pixel coordinates of the target bounding box (a represents the label name of the target).
Figure 7. RGB image and depth image in Intel RealSense Viewer: (a) RGB image; (b) Depth image.
Figure 8. Coordinate conversion.
Figure 9. Loss and mAP curves: (a) Train loss and Val loss curve; (b) mAP curve.
Figure 10. Comparison of loss values during training.
Figure 11. The results of object detection on the apple test set.
Figure 12. The results of apple object recognition: (a) Proposed model; (b) Faster R-CNN; (c) YOLOv4; (d) Mobilenetv3_YOLOv4; (e) YOLOv5; (f) YOLOX; (g) YOLOv7; (h) YOLOv8 (the yellow circles indicate unrecognized targets).
Figure 13. The P-R comparison curves of each network.
Figure 14. Real-time fruit recognition and location results at different times of the day in the apple orchard: (a) Experimental equipment; (b) Test results in the morning; (c) Test results in the evening.
Table 1. The experimental environment.
| Experimental Environment | Components | Version |
|---|---|---|
| Hardware environment | Processor | Core i7-10700k |
| Hardware environment | Graphics | NVIDIA GeForce RTX 2080Ti |
| Software environment | System | Ubuntu 18.04 |
| Software environment | Development Framework | Pytorch 1.7 |
| Software environment | Programming Language | Python 3.6 |
| Software environment | Integrated Development Environment (IDE) | Pycharm 2021.3 |
Table 2. Experimental performance of different networks.
| Model | mAP (%) | F1 | Parameters (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|
| Faster R-CNN | 88.23 | 0.69 | 137.10 | 370.41 | 19.43 |
| YOLOv4 | 87.55 | 0.84 | 64.36 | 60.33 | 60.42 |
| Mobilenetv3_YOLOv4 | 72.31 | 0.78 | 11.30 | 7.13 | 85.45 |
| YOLOv5 | 90.69 | 0.89 | 21.38 | 51.43 | 73.03 |
| YOLOX | 92.91 | 0.92 | 8.97 | 26.81 | 118.38 |
| YOLOv7 | 82.21 | 0.85 | 37.20 | 104.76 | 52.99 |
| YOLOv8 | 93.23 | 0.93 | 25.86 | 79.07 | 75.25 |
| Proposed model | 94.09 | 0.93 | 11.71 | 38.62 | 167.43 |
Table 3. Coordinate error test results.
(X0, Y0, Z0) are the actual coordinate values, (X, Y, Z) the detected coordinate values, and Error X/Y/Z the absolute differences |X − X0|, |Y − Y0|, |Z − Z0|; all values are in metres.
| Number | X0 | Y0 | Z0 | X | Y | Z | Error X | Error Y | Error Z |
|---|---|---|---|---|---|---|---|---|---|
| 1 | −0.152 | −0.219 | 0.692 | −0.153 | −0.220 | 0.690 | 0.001 | 0.001 | 0.002 |
| 2 | −0.073 | −0.025 | 0.585 | −0.075 | −0.022 | 0.585 | 0.002 | 0.003 | 0 |
| 3 | −0.040 | −0.068 | 0.594 | −0.044 | −0.071 | 0.591 | 0.004 | 0.003 | 0.003 |
| 4 | 0.097 | 0.028 | 0.584 | 0.095 | 0.022 | 0.583 | 0.002 | 0.006 | 0.001 |
| 5 | −0.161 | −0.011 | 0.630 | −0.164 | −0.013 | 0.627 | 0.003 | 0.002 | 0.003 |
| 6 | −0.128 | −0.063 | 0.632 | −0.134 | −0.064 | 0.628 | 0.006 | 0.001 | 0.004 |
| 7 | 0.225 | −0.020 | 0.673 | 0.222 | −0.020 | 0.674 | 0.003 | 0 | 0.001 |
| 8 | 0.253 | −0.069 | 0.670 | 0.250 | −0.071 | 0.667 | 0.003 | 0.002 | 0.003 |
| 9 | 0.097 | −0.222 | 0.722 | 0.095 | −0.226 | 0.719 | 0.002 | 0.004 | 0.003 |
| 10 | −0.189 | −0.172 | 0.692 | −0.187 | −0.170 | 0.690 | 0.002 | 0.002 | 0.002 |
| 11 | 0.051 | −0.144 | 0.612 | 0.047 | −0.145 | 0.610 | 0.004 | 0.001 | 0.002 |
| 12 | −0.304 | −0.042 | 0.589 | −0.299 | −0.041 | 0.590 | 0.005 | 0.001 | 0.001 |
| 13 | −0.269 | −0.095 | 0.595 | −0.270 | −0.094 | 0.592 | 0.001 | 0.001 | 0.003 |
| 14 | −0.160 | −0.023 | 0.541 | −0.157 | −0.022 | 0.542 | 0.003 | 0.001 | 0.001 |
| 15 | −0.082 | −0.010 | 0.589 | −0.077 | −0.009 | 0.589 | 0.005 | 0.001 | 0 |
