1. Introduction
The plum, a woody plant of the genus Prunus in the family Rosaceae, is native to southeastern China and is one of the oldest cultivated fruits in the world; it is now cultivated in most areas around and south of the Qinling Mountains in China. China is the largest grower and producer of plums, and its plum industry holds an important position in both domestic and international markets [1]. From 2010 to 2022, China’s plum production increased from 5,521,700 tons to 6,626,300 tons, accounting for 51.6% to 55.2% of global plum production, and the harvested area increased from 1665.2 thousand hectares to 1946.5 thousand hectares; the share of the global plum harvested area increased from 69.1% to 74.8% (Food and Agriculture Organization of the United Nations, 2022). Although the domestic plum supply is abundant, the overall import volume still exceeds the export volume because of the huge consumer market [2]. Plum fruits are about 4 cm in diameter and weigh about 50 g to 210 g each. The fruit is plump, flavorful, and juicy; its flesh contains basic nutrients, such as protein, carbohydrates, and fat, as well as antioxidant compounds, such as flavonoids, carotenoids, and anthocyanins, which give it high nutritional value and make the plum one of the main fresh fruits of summer [3,4]. In addition, the fruit withstands storage and transportation well, which makes it popular with sellers and consumers [5].
During the growth of plum trees, an excessive fruit load consumes a large amount of nutrients, weakens the tree’s vigor, and reduces the quality and size of the fruit, so the number of fruits must be reduced by manual thinning. Conversely, if plum trees bear too little fruit, economic performance suffers, and a change of variety may need to be considered [6,7]. At the same time, changes in climate, soil, and the surrounding environment affect the growth and development of fruit trees, with pests and diseases causing greater economic losses [8]. Continuous detection and analysis are therefore needed throughout plum growth to avoid problems such as yield reduction caused by unsuitable fruit loads or by pests and diseases. However, plum fruits are small and real orchard environments are complex, so leaf occlusion and fruit overlap make plum fruits difficult to detect. Most of the traditional plum farming industry relies on manual visual estimation to count fruits and to identify pest and disease problems [9]. In recent years, there has been an unprecedented increase in labor costs due to the continuous loss of agricultural labor, and this problem has become more prominent since the COVID-19 pandemic [10]. In summary, an efficient and accurate plum fruit detection method is needed to estimate the number of fruits on plum trees and to facilitate research on pest and disease identification.
With the continuous development of computer science and technology, deep-learning-based computer vision has permeated production and daily life, and its importance is becoming increasingly prominent [11]. A growing number of researchers have applied computer vision to agricultural production, where it has a wide range of practical applications [12,13], such as intelligent greenhouses [14] and agricultural robots [15]. Target detection is an important task in computer vision; its main job is to identify and localize the targets of interest in an input image, that is, to determine what the objects in the image are and where they are [16]. Target detection techniques have achieved good results in agricultural production, for example in pest and disease monitoring [17,18], crop yield prediction [19], and crop growth monitoring [20]. Applications of target detection models to crop fruit detection [21] can be divided into two categories: detection using one-stage models and detection using two-stage models [22]. Two-stage models first separate the proposed regions from the background and then classify and localize the targets [23]; the representative family is the region-based CNN (RCNN) series. Sa et al. [24] used an improved Faster-RCNN [25] based on multi-modal (RGB and NIR) information fusion to detect a variety of fruits, including apples, bell peppers, and melons, with an average detection accuracy of 0.838. Yu et al. [26] detected strawberry fruits based on Mask-RCNN with a detection accuracy of 0.898. Fu et al. [27] built an algorithm consisting of ZFNet and a VGG16-based Faster R-CNN architecture to detect apples in vertical fruit wall trees, and the fruit detection accuracy improved by 2.5% after background trees were removed with a depth filter. The RCNN series achieves high accuracy in crop fruit detection, but its detection speed is relatively slow because of the large number of parameters and the complexity of the models, which poses challenges for computing hardware in practical applications. One-stage models use initial anchor boxes to locate the target regions and directly predict the target class [28]. The representative one-stage family is the You Only Look Once (YOLO) series, and YOLO-based deep learning models are widely used for fruit detection and recognition [29]. Fu et al. [30] pruned the less important network layers after analyzing the characteristics of bananas and the network structure, improving the YOLOv4 model to form the YOLO-Banana model; the average precision (AP) for banana bunches and banana stems was 98.4% and 85.98%, respectively. Li et al. [31] improved the YOLOX model by modifying the loss function and adding an attention mechanism to detect sweet cherries, achieving a detection accuracy of 84.96%, an improvement of 2.34% over the original YOLOX model. Wang and He [32] detected young apple fruits with a channel-pruned YOLOv5 model, reaching a detection accuracy of 95.80% with an average detection time of 8 milliseconds per image. Compared with the RCNN series, the YOLO series achieves similar performance with higher computational efficiency.
Based on the above studies, the YOLO model has great potential for fruit detection under natural conditions. The YOLOv7 model additionally employs efficient layer aggregation networks, reparameterized convolution, positive and negative sample matching strategies, auxiliary head training, and model scaling [33,34], which significantly improves the model’s ability to extract target features. In the range of 5 FPS to 160 FPS, the YOLOv7 model is reported to be faster and more accurate than previously known target detectors [35]. This paper selected the YOLOv7 network as the initial detection model after taking into account the object to be detected, the detection accuracy of the network, and the need for a lightweight model.
The Spatial Pyramid Pooling (SPP) module [36] was proposed by Kaiming He et al. in 2015; its main role is to handle input images of varying sizes. The SPP structure used in YOLOv7 is Cross Stage Partial Spatial Pyramid Pooling (CSPSPP). Although it performs better than SPP and the Spatial Pyramid Pooling-Fast (SPPF) structure used in YOLOv5, its larger number of parameters and greater amount of computation slow down feature extraction. Upsampling refers to inserting new elements between the pixels of the original image by using a suitable interpolation algorithm, i.e., enlarging the original image to a higher resolution [37]. YOLOv7 uses nearest-neighbor interpolation for upsampling: each inserted pixel takes the value of the original pixel closest to its location, which is simple and computationally cheap [38]. However, because it uses only the gray value of the single nearest pixel and ignores the influence of the other neighboring pixels, the gray values are noticeably discontinuous after resampling and the loss in image quality is large, producing visible mosaic and jagged artifacts. How to choose a more suitable interpolation algorithm for upsampling is therefore worth studying. In computer vision, the method of focusing attention on important regions of an image and disregarding irrelevant ones is called the attention mechanism [39]. Attention mechanisms are a common data processing method that transfers human perception and attention behavior to the machine so that the machine learns to distinguish the important parts of the data from the unimportant ones [40]. Introducing an attention mechanism can make the network focus on the target area, strengthen its extraction of effective features, reduce the interference of invalid targets, and obtain more detailed information [41,42]. The Convolutional Block Attention Module (CBAM) is a simple and effective attention module for feed-forward convolutional neural networks proposed by Woo et al. [43]. It combines spatial attention and channel attention and can be applied in YOLOv7 networks to improve the network’s perception of target features in images [44,45,46].
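The contrast between nearest-neighbor and bilinear interpolation described above can be illustrated with a minimal pure-Python sketch. It is not the YOLOv7 implementation: the function names are hypothetical, the image is a plain nested list of gray values, and the bilinear version assumes a half-pixel-center sampling convention with edge clamping, which is one common choice among several.

```python
def upsample_nearest(img, scale):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor.

    Each output pixel copies the gray value of the closest source pixel.
    """
    h, w = len(img), len(img[0])
    return [[img[y // scale][x // scale] for x in range(w * scale)]
            for y in range(h * scale)]


def upsample_bilinear(img, scale):
    """Bilinear upsampling of a 2-D list by an integer factor.

    Assumption of this sketch: output pixel centers are mapped back to
    source coordinates with the half-pixel convention and clamped to the
    image, then the four surrounding source pixels are blended.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * scale):
        sy = min(max((y + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0  # vertical blend weight toward row y1
        row = []
        for x in range(w * scale):
            sx = min(max((x + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0  # horizontal blend weight toward column x1
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A one-row image with a sharp dark-to-bright edge.
edge = [[0.0, 100.0]]
print(upsample_nearest(edge, 2)[0])   # [0.0, 0.0, 100.0, 100.0]
print(upsample_bilinear(edge, 2)[0])  # [0.0, 25.0, 75.0, 100.0]
```

In this toy case, nearest-neighbor interpolation repeats each gray value and keeps the abrupt step, which is the source of the mosaic and jagged appearance, whereas bilinear interpolation produces intermediate values across the edge at the cost of a few extra multiplications per pixel.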
In view of the above limitations, this study aims to achieve the fast and accurate detection of plum fruits in natural environments and proposes a YOLOv7-based target detection algorithm, YOLOv7-plum. The CSPSPP structure used in YOLOv7 was improved to a Cross Stage Partial Spatial Pyramid Pooling–Fast (CSPSPPF) structure by modifying the structure of the convolutional layers, reducing the number of parameters in the feature extraction process, and changing the parallel structure to a more lightweight serial structure. To further improve the upsampling operation in YOLOv7, the nearest-neighbor interpolation in the original model was replaced with bilinear interpolation [47]. In addition, two CBAMs were embedded in the network structure of YOLOv7 to improve the model’s perception across spatial locations and channels so that the model could focus more on the key information in the image. Finally, the feasibility and reliability of the method were verified with ablation experiments and a statistical analysis. The experimental results showed that the proposed YOLOv7-plum algorithm detected plum fruits better in complex backgrounds, which helps to advance fruit number estimation and yield prediction for plum trees and lays the foundation for research on a plum pest detection model.
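The change from a parallel to a serial pooling structure can be motivated by a well-known property of stride-1 max pooling: applying a 5 × 5 window two or three times in sequence gives the same result as a single 9 × 9 or 13 × 13 window. The sketch below checks this in pure Python. The kernel sizes 5, 9, and 13 are an assumption (this section does not state them), the function names are hypothetical, and the convolution layers of CSPSPP and CSPSPPF are omitted, so the sketch illustrates only the pooling branch and not the authors' full module.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1 on a 2-D list.

    The window is clipped at the borders, which is equivalent to padding
    with -inf, so the output has the same size as the input.
    """
    h, w, r = len(fmap), len(fmap[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            best = float("-inf")
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    best = max(best, fmap[yy][xx])
            row.append(best)
        out.append(row)
    return out


def spp_parallel(fmap):
    """Parallel branches: three independent pools with growing kernels."""
    return [max_pool_same(fmap, k) for k in (5, 9, 13)]


def sppf_serial(fmap):
    """Serial chain: one small pool applied repeatedly, reusing results."""
    p1 = max_pool_same(fmap, 5)
    p2 = max_pool_same(p1, 5)   # same receptive field as one 9 x 9 pool
    p3 = max_pool_same(p2, 5)   # same receptive field as one 13 x 13 pool
    return [p1, p2, p3]


# Deterministic toy feature map (values are arbitrary).
fmap = [[(7 * y * y + 13 * x + 3 * x * y) % 31 for x in range(16)]
        for y in range(16)]
print(spp_parallel(fmap) == sppf_serial(fmap))  # True
```

Away from the borders, the serial chain inspects 3 × 25 = 75 window elements per output pixel, compared with 25 + 81 + 169 = 275 for the three parallel pools, which is why a serial arrangement can be lighter while producing the same pooled features.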
The remainder of this paper is organized as follows. Section 2 introduces the details of the dataset used in this study and the plum detection algorithm model YOLOv7-plum. Section 3 evaluates the performance of the YOLOv7-plum algorithm through experiments. Section 4 discusses the results. Finally, Section 5 summarizes the work of this study.
5. Conclusions
In this study, deep learning technology was applied to the detection of plum fruits, and the YOLOv7 model was improved to obtain a YOLOv7-plum model with better detection performance on plum fruits. First, the collected plum image data were augmented and labeled to form the plum fruit detection dataset. Second, the CSPSPP structure used by YOLOv7 was modified to the CSPSPPF structure to improve the computational efficiency of feature extraction. Third, bilinear interpolation was used to replace the nearest-neighbor interpolation of the original model to improve the image quality in the upsampling process. Finally, two CBAMs were embedded between the head and backbone of the model structure so that the model pays more attention to the important information about plum fruits in the image. The main conclusions were as follows:
(1) The AP of the YOLOv7-plum model proposed in this study reached 94.91%, 2.03% higher than that of the YOLOv7 model. The YOLOv7-plum model was found to alleviate the missed detections of the YOLOv7 model to some extent. Therefore, the improved method is believed to be effective, and the YOLOv7-plum model can effectively support continuous detection and analysis during plum growth;
(2) Comparative tests showed that the detection accuracy of the YOLOv7-plum model was better than that of the current mainstream target detection network models. Its model size of 71.4 M was acceptable, and only 0.0193 s was needed to process a single image. The results showed that YOLOv7-plum could accurately complete the plum fruit detection task and could be run on lightweight devices;
(3) The method proposed in this study could effectively detect plum fruits against natural backgrounds and reduced the problems of traditional manual detection, such as low efficiency, high cost, and subjectivity, which is conducive to the development of intelligent planting in the plum industry, including yield prediction and pest detection.
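As a quick sanity check on conclusion (2), the reported per-image processing time can be converted into a throughput figure. The small sketch below does this under the assumption that images are processed one at a time and that the reported time covers the whole per-image pipeline, neither of which is stated in this section; the function name is hypothetical.

```python
def throughput_fps(seconds_per_image: float) -> float:
    """Images per second, assuming sequential single-image inference."""
    return 1.0 / seconds_per_image


reported_time = 0.0193  # seconds per image, from conclusion (2)
fps = throughput_fps(reported_time)
print(f"{fps:.1f} FPS")  # prints "51.8 FPS"
```

Under these assumptions the model processes roughly 52 images per second, which lies inside the 5 FPS to 160 FPS range quoted for YOLOv7 in the Introduction, although the hardware behind the two figures may differ.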