1. Introduction
The Korean pine is the most important plantation tree species in northeast China. Pine cones are the reproductive structures of the Korean pine. The pine cone is not only a traditional element of arts and crafts, it is also an important source of modern composite materials. The seeds in pine cones have high edible and medicinal value. Traditional pine cone harvesting is divided into two steps. In the first step, pickers climb to the tops of pine trees and knock the pine cones off the branches with a long pole. In the second step, the pickers collect the pine cones from the ground. This harvesting method carries high risk and low efficiency, and manual collection easily misses cones, reducing the yield. Automatic harvesting can reduce costs, lower risk, and improve production efficiency [1]. Therefore, the automatic harvesting of pine cones is a problem that urgently needs to be solved.
The detection of pine cones can provide location information for automatic picking and data support for the analysis of pine cone yields. Thus, the key to realizing the automatic picking of pine cones is detecting them. With the development of object detection technology, detecting pine cones from images has become possible. Traditional fruit detection methods rely on the extraction of shallow features such as color, shape, and reflectivity. They struggle with problems such as overlap and occlusion, are easily affected by lighting conditions, and perform poorly against complex backgrounds, resulting in relatively high cost and poor applicability [2,3,4,5].
Over recent years, the application of deep-learning technology in computer vision has made great progress [6]. Deep learning is used to address key computer vision tasks such as object detection, face recognition, action recognition, and human-pose estimation [7]. Using deep learning for pine cone detection not only overcomes the shortcomings of traditional methods, but also integrates feature extraction and classification to achieve real-time, accurate detection. Deep-learning technology can exploit deep abstract information in the data and has strong generalization ability, so it has been widely used in crop detection.
Pine cone detection using deep learning is a small-target, multi-object detection problem. Gathering adequate data into a dataset is critical for any segmentation system based on deep-learning techniques [8]. At present, there is no public pine cone dataset on the Internet and it is difficult to find sufficient data, so data augmentation is necessary. Data augmentation is a common technique that has been demonstrated to benefit the training of machine learning models in general and deep architectures in particular [8]. Traditional data augmentation methods are based on the camera model structure and imaging principles, and include rotation, mirroring, translation, random cropping, and affine transformation. These methods can improve deep-learning performance to a certain extent [9,10]. As computer vision develops, scholars have continued to explore new ways to augment image data. The Generative Adversarial Network (GAN) [11] designed by Goodfellow et al. provides new solutions. John Atanbori used a GAN to generate cassava images to supply missing-class training data and developed an efficient cassava counting system [12]. Y. C. Chou proposed a deep-learning technique combined with a GAN to detect defective coffee beans; this method can improve the efficiency of removing inferior coffee beans [13]. In response to the poor diversity of data generated by the original GAN, D. Berthelot et al. designed BEGAN based on the idea of equilibrium, which raised the data-augmentation ability of GANs to a new level [14]. The images generated by BEGAN are of high quality and effectively improve the prediction performance of the model [15,16].
On the other hand, the network structure used by deep-learning technology has a great influence on the accuracy of real-time detection. Among current mainstream object detection networks, the YOLO network directly performs regression to detect objects in the image, so it detects faster than other networks [17,18]. YOLO has gone through three iterations. The initial YOLOv1 [19] network has poor detection accuracy. YOLOv2 [20] builds on YOLOv1 by adding batch normalization layers, which increases the model convergence speed and reduces overfitting; at the same time, its detection accuracy is significantly improved by replacing fully connected layers with convolutional layers and using a high-resolution classifier, direct location prediction, and multiscale training. YOLOv3 [21] replaces the backbone network of YOLOv2 and extends single-label classification to multilabel classification. By using multiscale fusion prediction, YOLOv3 is well suited to small-target detection. Although YOLOv3 performs very well on small targets and multiple objects, there is room for further improvement. First, like other detection networks with complex structures, YOLOv3 suffers from slow detection speed and gradient vanishing. DenseNet [22], proposed by Gao Huang, strengthens feature propagation by reusing the features of the convolutional neural network, which can solve the gradient-vanishing problem while reducing the complexity of the network model. Second, Feature Pyramid Networks (FPN) [23] were introduced into YOLOv3, and features from different layers are merged through the pyramid model. However, YOLOv3 has only three detection scales, so some lower-level features are omitted; by expanding the detection scales of YOLOv3, better results can be achieved on small targets. Finally, the loss function of YOLOv3 uses Intersection over Union (IoU) to calculate the gradient for regression. IoU cannot optimize the case where the targets do not intersect, nor can it reflect how the targets intersect. By extending the concept of IoU to non-overlapping situations, Hamid Rezatofighi et al. proposed a new metric, Generalized Intersection over Union (GIoU) [24], which effectively remedies the shortcomings of IoU. Zhaohui Zheng et al. put forward the concept of DIoU [25] based on GIoU; by modeling the normalized distance between the anchor box and the ground-truth bounding box, the convergence speed is further improved.
In our research, BEGAN and an improved YOLOv3 deep-learning model are employed to detect pine cones. The contributions are as follows: First, because of the specificity of pine cones, it is very difficult to acquire a large number of pine cone images. To overcome this deficiency, we captured 800 images of pine cones and adopted the BEGAN deep-learning method to expand the dataset. BEGAN automatically balances the tradeoff between image diversity and generation quality, effectively expanding the size of the training dataset. Second, we improved the YOLOv3 model. A densely connected network structure is introduced into the YOLOv3 backbone to improve detection speed and accuracy, and detection accuracy is further improved by adding a new detection scale and using DIoU to optimize the loss function.
The rest of this article is organized as follows:
Section 2 introduces the BEGAN and improved YOLOv3 algorithm;
Section 3 introduces the construction of image data sets, including image acquisition and image data augmentation;
Section 4 introduces the relevant content of the comparative experiment and discusses the experimental results;
Section 5 introduces the conclusions and future prospects of this article.
2. Methods
2.1. BEGAN
The BEGAN network structure is shown in Figure 1: BEGAN consists of a generator ($G$) and a discriminator ($D$) network. The generator $G$ generates an image $G(z)$ by receiving a random noise $z$. The function of the discriminator $D$ is to determine whether an image is real: its input is an image $x$ and its output $D(x)$ represents the probability that $x$ is a real image. Through adversarial training, the generator $G$ and the discriminator $D$ play a minimax game and finally reach a Nash equilibrium. In the most ideal state, when $D(G(z))$ is equal to 0.5, $G$ can generate an approximately real image $G(z)$. The $D$ in BEGAN is an autoencoder structure; its reconstruction loss $\mathcal{L}(x)$ is as follows:

$$\mathcal{L}(x) = |x - D(x)|^{\eta}, \quad \eta \in \{1, 2\}$$
The original GAN aims for the data distribution produced by the generator to be as close as possible to the distribution of the real data; when the two distributions are equal, the generator's performance is ideal. Hence, from this point of view, researchers have designed various loss functions to make the generated distribution as close as possible to the real one. BEGAN replaces this estimated probability-distribution approach: it does not directly calculate the distance between the generated data distribution and the real data distribution, but calculates the distance between their reconstruction-error distributions. If the error distributions of the generated and real data are similar, the data distributions themselves are similar.
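To make the equilibrium mechanism concrete, below is a minimal PyTorch sketch of one BEGAN training step; the networks `G` and `D`, the noise dimensionality `z_dim`, and the hyperparameter values (gamma, lambda_k) are illustrative assumptions rather than the exact configuration used in this work:

```python
import torch

def recon_loss(v, v_hat):
    # BEGAN autoencoder reconstruction loss L(v) = |v - D(v)| with eta = 1
    return (v - v_hat).abs().mean()

def began_step(G, D, opt_g, opt_d, real, k, gamma=0.5, lambda_k=0.001, z_dim=64):
    """One BEGAN update; k is the equilibrium control variable k_t."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)

    # Discriminator (autoencoder): L_D = L(x) - k_t * L(G(z))
    loss_real = recon_loss(real, D(real))
    loss_fake = recon_loss(fake.detach(), D(fake.detach()))
    opt_d.zero_grad()
    (loss_real - k * loss_fake).backward()
    opt_d.step()

    # Generator: L_G = L(G(z))
    loss_g = recon_loss(fake, D(fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # Equilibrium update: k_{t+1} = k_t + lambda_k * (gamma * L(x) - L(G(z)))
    k = min(max(k + lambda_k * (gamma * loss_real.item() - loss_g.item()), 0.0), 1.0)
    return k
```

The control variable $k_t$ is what allows BEGAN to balance image diversity against generation quality, which is the property we rely on for expanding the dataset.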
2.2. YOLOv3 Model
YOLO is a one-stage detection algorithm. With no need to generate region proposals, it produces bounding box coordinates and the probability of each class directly through regression. YOLO divides the input picture into $S \times S$ cells, and the subsequent output is produced in units of cells. If the center of an object falls in a cell, that cell is responsible for predicting the object. Each cell predicts the information of $B$ bounding boxes, and each bounding box contains five data values: $(x, y, w, h, \text{confidence})$.

$(x, y)$ is the offset of the center point of the bounding box relative to the cell, and the final predicted $(x, y)$ is normalized. Assuming that the width of the picture is $w_i$, the height is $h_i$, the center coordinates of the bounding box are $(x_c, y_c)$, and the cell coordinates are $(x_{col}, y_{row})$, the formulas for $(x, y)$ are as follows:

$$x = \frac{x_c}{w_i} S - x_{col}, \qquad y = \frac{y_c}{h_i} S - y_{row}$$

$(w, h)$ represents the ratio of the bounding box to the entire picture. Assuming that the predicted width and height of the bounding box are $w_b$ and $h_b$, the formulas for $(w, h)$ are as follows:

$$w = \frac{w_b}{w_i}, \qquad h = \frac{h_b}{h_i}$$

The confidence is composed of two parts: whether there is a target in the cell, and the accuracy of the bounding box. The confidence is calculated as follows:

$$\text{Confidence} = \Pr(\text{Object}) \times \text{IoU}_{pred}^{truth}$$

If the bounding box contains an object, $\Pr(\text{Object})$ is equal to 1; otherwise it is equal to 0. $\text{IoU}_{pred}^{truth}$ is the intersection over union between the predicted bounding box and the real area of the object, and its value lies in [0, 1].

In addition to the confidence, each cell also outputs the probability that the object belongs to each of $C$ categories, so the final output dimension of the network is $S \times S \times (B \times 5 + C)$.
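As a concrete illustration of this encoding, the plain-Python sketch below maps one ground-truth box to its responsible cell and normalized targets; the grid size S = 7 and the image dimensions in the example are illustrative assumptions:

```python
def encode_box(xc, yc, wb, hb, img_w, img_h, S=7):
    """Map a ground-truth box (center xc, yc and size wb, hb in pixels)
    to cell-relative (x, y) offsets and image-relative (w, h)."""
    x_col = int(xc / img_w * S)      # index of the responsible cell
    y_row = int(yc / img_h * S)
    x = xc / img_w * S - x_col       # center offset within the cell, in [0, 1)
    y = yc / img_h * S - y_row
    w = wb / img_w                   # width and height as fractions of the image
    h = hb / img_h
    return x_col, y_row, (x, y, w, h)

# Example: a 100x80 box centered at (320, 240) in a 640x480 image
print(encode_box(320, 240, 100, 80, 640, 480))  # cell (3, 3), offsets (0.5, 0.5)
```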
The original YOLOv1 detection network structure is shown in Figure 2. It consists of 24 convolutional layers and 2 fully connected layers; the convolutional layers extract image features, and the fully connected layers predict positions and category probabilities. Due to the use of multiple downsampling layers, the object features learned by the network are not fine, which affects the detection results.
YOLOv3 is an improved version of YOLOv1. The backbone network of YOLOv3 is Darknet-53 (Figure 3), which contains 23 residual modules in total. Each residual module consists of two convolutional layers and a shortcut connection. These residual modules are divided into 5 groups containing 1, 2, 8, 8, and 4 residual modules, respectively.
For most convolutional neural networks, shallow features are needed to distinguish small targets and deep features to distinguish large targets. As shown in Figure 4, YOLOv3 draws on the multiscale feature-fusion idea of FPN: it detects at three feature map sizes (13 × 13, 26 × 26, and 52 × 52), and through 2× upsampling, feature maps are passed between adjacent scales.
The loss function can be used to evaluate a model. The loss function of YOLOv3 uses binary cross-entropy and comprises a coordinate error, a confidence error, and a classification error:

$$Loss = Error_{coord} + Error_{conf} + Error_{class}$$

The coordinate error is composed of two parts, the bounding box center-point error and the bounding box width and height error:

$$Error_{coord} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$

The confidence error is composed of two parts: the confidence error when there is an object in the predicted bounding box and the confidence error when there is no object in the predicted bounding box:

$$Error_{conf} = -\lambda_{obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right]$$

The classification error is expressed as follows:

$$Error_{class} = -\sum_{i=0}^{S^2} I_{i}^{obj} \sum_{c \in classes} \left[ \hat{p}_i(c) \log p_i(c) + (1 - \hat{p}_i(c)) \log (1 - p_i(c)) \right]$$

In the above loss function, $\lambda$ represents a weight. The coordinate error accounts for a larger proportion of the total loss, so $\lambda_{coord}$ is set to 5. In the confidence error, $\lambda_{obj}$ is 1 when there is an object in the predicted bounding box and $\lambda_{noobj}$ is 0.5 when there is no object. The coefficient of the classification error term is fixed at 1.
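A compact PyTorch-style sketch of how these weighted terms combine is given below; the tensor layouts and the `obj_mask`/`noobj_mask` indicator tensors are illustrative assumptions, and the width/height square roots are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def yolo_loss(pred_xywh, true_xywh, pred_conf, true_conf, pred_cls, true_cls,
              obj_mask, noobj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    # Coordinate error: squared error, counted only for responsible (object) cells
    coord = lambda_coord * (obj_mask.unsqueeze(-1) * (pred_xywh - true_xywh) ** 2).sum()
    # Confidence error: binary cross-entropy, down-weighted where no object exists
    bce = F.binary_cross_entropy(pred_conf, true_conf, reduction="none")
    conf = (obj_mask * bce).sum() + lambda_noobj * (noobj_mask * bce).sum()
    # Classification error: binary cross-entropy over classes on object cells
    cls = (obj_mask.unsqueeze(-1) *
           F.binary_cross_entropy(pred_cls, true_cls, reduction="none")).sum()
    return coord + conf + cls
```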
2.3. YOLOv3 Improved Methods
This article proposes three improvements to the YOLOv3 network for pine cone detection.
(1) Introducing a dense connection module
A densely connected network can improve the information and gradient flow of the entire network. Its principle is as follows: assuming the input is $x_0$, each layer of the network implements a nonlinear transformation $H_l(\cdot)$, where $l$ represents the $l$-th layer. Denoting the output of the $l$-th layer as $x_l$, then:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps output by layers $0$ to $l-1$.
Densely connected networks usually contain multiple dense modules, and a dense module consists of n dense layers. The specific structure of the basic dense layer is shown in Figure 5. Unlike the common post-activation mechanism, the dense layer uses pre-activation: its batch normalization layer and activation function (ReLU) come before the convolutional layer, so the activation operation is performed first and then a 3 × 3 convolution outputs the feature maps.
Assume that the input $x_0$ of a dense module has $k_0$ feature maps and that each dense layer outputs $k$ feature maps. According to the principle of dense networks, the input of the $l$-th dense layer is $k_0 + k \times (l - 1)$ feature maps, so performing the 3 × 3 convolution directly would bring a huge amount of calculation. A bottleneck structure (Figure 6) can be used to reduce the computation; the main method is to add a 1 × 1 convolutional layer to the original dense module to reduce the number of features. In the bottleneck dense layer we constructed, we first obtain $2k$ feature maps through the 1 × 1 convolutional layer and then output $k$ feature maps through the 3 × 3 convolutional layer.
Figure 7 shows the original YOLOv3 structure and YOLOv3 with the dense network structure. To balance detection speed and accuracy, we retain the residual modules of the original network whose outputs are 208 × 208 and 104 × 104; the three groups of residual modules with outputs of 52 × 52, 26 × 26, and 13 × 13 are replaced with dense modules. Each dense module is composed of 4 bottleneck dense layers. Finally, the network output dimensions are consistent with the original network.
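A minimal PyTorch sketch of such a pre-activation bottleneck dense layer, and of a 4-layer dense module like those substituted into the backbone, might look as follows (the class and function names are our own illustrative choices):

```python
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """Pre-activation dense layer: BN -> ReLU -> 1x1 conv (2k maps) -> BN -> ReLU -> 3x3 conv (k maps)."""
    def __init__(self, in_channels, k):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 2 * k, kernel_size=1, bias=False),   # bottleneck: reduce to 2k maps
            nn.BatchNorm2d(2 * k), nn.ReLU(inplace=True),
            nn.Conv2d(2 * k, k, kernel_size=3, padding=1, bias=False),  # output k new feature maps
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new features with all previous ones
        return torch.cat([x, self.layer(x)], dim=1)

def dense_module(k0, k=32, n_layers=4):
    # A dense module of 4 bottleneck layers; input channels grow by k per layer
    return nn.Sequential(*[BottleneckDenseLayer(k0 + i * k, k) for i in range(n_layers)])
```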
(2) Extending the scale detection module
Given that most of the pine cones to be detected are small targets, we improved the scale detection module in YOLOv3. The improved scale detection module is shown in Figure 8. In the original YOLOv3, a total of three detections are performed, on output feature maps of 13 × 13, 26 × 26, and 52 × 52. The 13 × 13 feature maps are detected once; they are then fused with the 26 × 26 feature maps by one upsampling, and the second detection is carried out. The feature maps of the second detection are upsampled again and fused with the 52 × 52 feature maps for the third detection. The 104 × 104 feature maps in the network contain more fine-grained features and position information of small targets; fusing these feature maps with high-level feature maps for detection can improve the accuracy of detecting small targets [26].
Inside the dashed box, the feature maps of the third detection are upsampled and fused with the 104 × 104 feature maps to carry out a fourth detection. In this way, a feature-fusion detection layer downsampled by a factor of 4 is established, and the three detection scales of the original YOLOv3 are expanded to four.
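In implementation terms, the added branch amounts to one more upsample-and-concatenate step before a new detection head; the sketch below uses illustrative channel counts:

```python
import torch
import torch.nn as nn

# Fourth-scale branch: upsample the 52x52 detection features and fuse them
# with the 104x104 fine-grained backbone features before the new detection head.
up = nn.Upsample(scale_factor=2, mode="nearest")

feat_52 = torch.randn(1, 128, 52, 52)     # features feeding the third detection
feat_104 = torch.randn(1, 128, 104, 104)  # fine-grained backbone features

fused = torch.cat([up(feat_52), feat_104], dim=1)  # -> (1, 256, 104, 104)
print(fused.shape)
```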
(3) Optimizing the loss function
The confidence error in the YOLOv3 loss function is calculated based on IoU, which represents the intersection ratio of the predicted bounding box and the target bounding box. When the predicted bounding box is A and the target bounding box is B:

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
Although IoU is widely used as an evaluation metric in target detection tasks, it has some shortcomings. If the predicted bounding box and the target bounding box do not intersect, then by definition IoU is equal to 0, which cannot reflect the distance between the two bounding boxes; at the same time, the position error and confidence error in the loss function cannot propagate a gradient, which affects the learning and training of the network. Moreover, when the intersection areas of the target bounding box and the predicted bounding box are equal but their relative positions differ, the calculated IoU values are equal, which cannot accurately reflect how the two boxes coincide and also degrades network performance. To solve this problem, Hamid Rezatofighi et al. proposed the improved GIoU. The calculation of GIoU is very simple: it is based on the minimum convex set of the predicted bounding box A and the target bounding box B. Assuming that the minimum convex set of A and B is C, then:

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$
When A and B do not coincide, the greater the distance between them, the closer GIoU is to −1, so the loss function can be expressed as 1 − GIoU, which better reflects the degree of coincidence of A and B. However, when A is inside B, GIoU degrades completely to IoU. Zhaohui Zheng et al. therefore proposed the improved DIoU:

$$DIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2}$$
In the loss function, $b$ and $b^{gt}$ represent the center points of A and B, $\rho$ represents the Euclidean distance between $b$ and $b^{gt}$, and $c$ represents the diagonal length of the smallest rectangle that can cover both A and B. DIoU can directly minimize the distance between A and B, so it converges much faster than GIoU. DIoU inherits the good properties of IoU while avoiding its disadvantages; in 2D/3D computer vision tasks that use IoU as an indicator, DIoU is a good choice. We introduce DIoU into the loss function of YOLOv3 to improve the detection accuracy.
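For reference, here is a self-contained sketch of the three metrics for axis-aligned boxes in (x1, y1, x2, y2) form; it is a plain-Python illustration rather than the training implementation:

```python
def iou_giou_diou(a, b):
    """Compute IoU, GIoU and DIoU for boxes a, b = (x1, y1, x2, y2)."""
    # Intersection and union areas
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C of A and B
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area

    # Normalized center distance: rho^2(b, b_gt) / c^2
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
           ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - rho2 / c2
    return iou, giou, diou

print(iou_giou_diou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
```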
In summary, the experimental scheme of our work is shown in Figure 9. We augment the collected raw data with the BEGAN network and then train the improved YOLOv3 model to convergence. The visual detection results and evaluation indices of the model are then tested.
5. Conclusions
This paper proposes a pine cone detection method for complex backgrounds. First, we manually collected and labeled 800 images containing pine cones. To solve the problem of insufficient data, two methods, traditional image augmentation and the adversarial network BEGAN, were used to augment the data and enrich the diversity of the dataset. We then proposed an improved YOLOv3 model for detecting pine cones in complex backgrounds. To balance detection accuracy and speed, we introduced a densely connected network structure into the backbone of YOLOv3 and expanded its three detection scales to four. Finally, we optimized the loss function of YOLOv3 with the DIoU algorithm.
We conducted detailed comparative experiments to demonstrate the effectiveness of our proposed method. Experimental results show that the improved model achieves significant gains in detection speed and accuracy over the original YOLOv3, meeting the requirements for real-time detection of pine cones. The use of the BEGAN network effectively achieves data augmentation and further improves the performance of the model.
Our proposed method can effectively detect pine cones. Since collecting pine cone image data is difficult, our dataset is not sufficiently balanced. Although our research still has certain limitations, it is helpful for realizing the automatic harvesting of pine cones. The focus of future work is to deploy the model on embedded devices for better portability in use. In addition, we will collect more pine cone image data for more comprehensive research.