1. Introduction
Blueberries are rich in a variety of nutrients. They are high in antioxidants, which can help reduce the risk of chronic diseases such as cardiovascular disease and cancer [1]. They are also rich in vitamin C and vitamin E, which help protect the eyes from free radical damage and prevent eye diseases. In orchard cultivation, blueberries are one of the fruit crops with high economic returns, and monitoring blueberry fruit during cultivation is economically important for estimating yields and predicting fruit maturity. However, due to disturbing factors in real outdoor orchard scenes, such as variations in outdoor light, the color similarity between blueberry fruits and the fruit tree canopy, imaging distance, and occlusion, developing reliable visual detection methods to identify fruits in blueberry canopies remains extremely challenging. In recent years, with the rapid development of deep learning, combining visual detection systems with deep learning to detect blueberry fruit has become a new research direction.
It is well known that blueberry fruits are relatively small, which makes them challenging for general-purpose target detection algorithms. Yang F proposed a clustered detection network (ClusterNet), in which a scale estimation sub-network (ScaleNet) estimates object scales in the input image and a dedicated detection network (DetecNet) detects objects within the normalized cluster regions, effectively improving small-object detection [2]. Xiaolin F proposed a small-target detection method for remote sensing images based on an improved Faster R-CNN, which improves small-target detection by introducing an improved anchor generation strategy and a multi-scale feature fusion mechanism [3]. Gong Y proposed the concept of a 'fusion factor' to control the amount of information passed from the deeper to the shallower layers of the feature pyramid network (FPN), so that the FPN can adapt to small-target detection; when the FPN is configured with an appropriate fusion factor, the network obtains a significant performance improvement on small-target detection datasets [4].
Small-target detection in UAV remote sensing images is also an important branch of target detection research. Zhu X explored the prediction potential of the self-attention mechanism based on the YOLOv5 algorithm by adding more prediction heads to detect objects at different scales and by replacing the original prediction heads with Transformer-based ones; experiments on VisDrone2021 showed a 7% accuracy improvement over the baseline YOLOv5 model [5]. Li C proposed a density-map-guided target detection network (DMNet) that determines whether a region contains objects based on changes in the pixel intensity distribution of an image, and achieved excellent performance [6]. Lu W constructed a hybrid backbone combining a Convolutional Neural Network (CNN) and a Transformer (ViT) for UAV remote sensing image detection, and applied the Masked Image Modeling (MIM) method to a UAV remote sensing dataset smaller than mainstream natural scene datasets to generate a pre-trained model, further improving the target detection accuracy on UAV imagery [7]. Liao J et al. proposed an unsupervised cluster-guided detection framework (UCGNet) that focuses on densely distributed object regions, in order to avoid having to compress the resolution of high-resolution images prior to training [8].
Buters T used low-altitude UAV imagery and automated object-based image analysis software to detect and count target seeds and seedlings against non-target grass substrates on a variety of reflective localized restoration substrates [9]. Melnychenko O proposed a new approach combining Artificial Intelligence (AI), Deep Learning (DL), and Unmanned Aerial Vehicles (UAVs), demonstrating superior real-time capabilities in fruit detection and counting with a multi-UAV system [10]. Shiu Y S used high-resolution UAV images with a ground sampling distance (GSD) of 3 cm for pineapple plant detection, applying Faster R-CNN to input image tiles of 256 × 256 pixels to locate and count ripe fruits [11]. Khan S et al. implemented and evaluated a deep learning system for identifying weeds and crops on high-resolution UAV images captured over two different target fields [12]. Gallo I combined the YOLOv7 target detection algorithm with an unmanned aircraft system [13]. Wittstruck L developed an image processing method for high-resolution UAV RGB data and applied it to Hokkaido pumpkin fields in northwestern Germany [14].
Gai R presented TL-YOLOv8, a blueberry recognition algorithm based on YOLOv8. Feature extraction during training is enhanced by introducing an improved MPCA (Multiplexed Coordinated Attention) module in the last layer of the backbone network, and replacing the C2f module with an OREPA (Online Convolutional Reparameterization) module not only accelerates training but also enhances representation, effectively addressing the fruit occlusion problem [15]. Xiao F proposed a lightweight detection method based on an improved YOLOv5 algorithm, implementing a lightweight deep convolutional neural network with the ShuffleNet module; evaluated on a blueberry fruit dataset, it effectively detected blueberry fruits in an orchard environment and identified their ripening stages [16].
Currently, the growth monitoring of blueberries is still mainly carried out by on-site manual inspection, which requires not only professional knowledge and experience but also substantial human resources. Therefore, advanced optical sensors mounted on UAVs for capturing remote sensing images have gradually been applied to monitoring fruit growth on trees, detecting crop diseases, predicting fruit yields, and detecting fruit ripeness. Lu C et al. proposed a practical workflow based on YOLOv5 and UAV images to detect maize seedlings [17]. Li W designed a pipeline combining an improved U-Net and a VGG-19 network to address the difficulty of identifying grape leaves in UAV images of complex environments [18]. Bao W proposed a UAV remote sensing method based on DDMA-YOLO for the effective detection of tea leaf blight [19]. Wang C constructed a YOLO-BLBE model combined with an innovative I-MSRCR method to accurately identify blueberry fruits of different ripeness [20]. Yang W designed a blueberry recognition model based on an improved YOLOv5, with a new attention module, NCBAM, to improve the backbone network's ability to extract blueberry features [21]. Huang H et al. improved the accuracy of detecting citrus fruits in UAV-captured images by extending a target detection algorithm [22]. Yong Z X proposed a method for detecting individual grapefruit trees by combining deep learning with UAV remote sensing [23].
The You Only Look Once (YOLO) series is a widely used and excellent single-stage target detection method (notable recent versions include YOLOv5, YOLOv7, and YOLOv8 [24,25]), which obtains the category and position of objects in a single stage. Compared with two-stage target detection algorithms such as R-CNN [26,27,28], single-stage algorithms save a large amount of computational time and cost and are more suitable for UAVs to quickly detect and identify blueberry fruits in the canopy; accordingly, the YOLO series has been widely used in remote sensing and agriculture.
The detection and counting of fruits in tree canopies is important for orchard management, yield estimation, and phenotyping; however, so far there have been relatively few studies using UAV remote sensing imagery and deep learning methods to monitor blueberry canopy fruits in orchards. In this study, UAV remote sensing images are used to detect blueberry fruits in orchard canopies, which raises three problems. First, UAV images cover a wide area at high resolution, and most of the pixel space is occupied by useless background information, with very little allocated to the small blueberry fruits, making it difficult for a neural network to detect these small targets accurately. Second, the blueberry fruits in UAV images are densely distributed and similar in shape, which easily causes the neural network to produce false detections. Third, when the YOLO algorithm is trained on UAV remote sensing images, many candidate prior boxes are generated in advance; filtering the optimal boxes with non-maximum suppression (NMS) [29] is slow at inference, and while Cluster-NMS [30] improves inference speed, it is still difficult to select the optimal box.
To address the above problems, this paper proposes a PF-YOLO-based UAV remote sensing monitoring method for blueberry canopy fruits, which can monitor blueberry canopy fruits in a timely and accurate manner. This research makes the following contributions:
1. High-resolution remote sensing images for detecting blueberry canopy fruits were collected by UAV photography in real blueberry orchards, and a high-resolution remote sensing blueberry canopy fruit target detection dataset was constructed.
2. The PAC3 structure is proposed by fusing a positional information encoding structure into the feature extraction part of the target detector, so that more attention is paid to the spatial position features of the detected targets, in order to solve the problem of missed detection of small targets in UAV remote sensing images.
3. Fast convolution is used instead of conventional convolution to reduce the number of parameters and the computational complexity of the model; on the basis of the YOLO framework, the PAC3 structure and fast convolution are fused into the proposed UAV blueberry canopy fruit remote sensing target detection model, PF-YOLO.
4. The Cluster-NMF algorithm is proposed to speed up the inference stage of blueberry canopy fruit detection and improve detection efficiency by optimizing the screening of redundant detection boxes.
3. Experiments and Methods
3.1. Location Information Encoding
The SE structure compresses global spatial information into a channel descriptor, using global average pooling to transform the feature map into a one-dimensional feature vector and thereby model inter-channel dependencies. The CBAM structure effectively integrates the spatial and channel dimensions of the feature map, learning adaptive channel weights to focus on specific channel and location information.
SE only considers internal channel information and ignores the importance of location information, yet in target detection the spatial location of the detected object is extremely important. Although CBAM tries to introduce location information through global pooling over channels, this only captures local information and cannot model long-distance dependencies: after the convolutional layers, each location of the feature map contains information about only a local region of the input image, and CBAM derives its weighting coefficients by taking the maximum and average over channels for each such region, so the weights reflect only local-region information.
Standard two-dimensional convolution cannot associate channel relationships with input image features, and constructing links between channels can improve the sensitivity of the model's classification decisions to informative channels. Global pooling is usually used to encode global spatial information into channel attention, which compensates for the shortcomings of the convolution operation; however, it directly compresses the global spatial information into a channel descriptor, losing position information, which greatly affects the capture of spatial structure in visual tasks. In the positional information encoding mechanism, the channel relationships and long-range dependencies of the input features are encoded in two steps: positional information is first embedded, and positional attention is then generated, producing a feature map with accurate positional information, as shown in Figure 3.
The location information encoding uses a pair of 1D feature encoding operations so that the attention block can capture long-range spatial interactions with precise location information. Let the input be $X \in \mathbb{R}^{C \times H \times W}$. Using a pooling kernel of spatial extent $(H, 1)$ along the horizontal coordinate and $(1, W)$ along the vertical coordinate, each channel of $X$ is encoded along the horizontal and vertical directions, respectively. The output of the $c$-th channel at height $h$ is represented as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

The output of the $c$-th channel at width $w$ is denoted as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
These two transformations aggregate features in two spatial directions to obtain a pair of direction-aware feature maps, which allows the attention block to capture long-range dependencies in one spatial direction and retains precise location information in the other, helping the neural network to more accurately localize the location of the feature region of the object of interest in the input image.
After the aggregated feature maps are generated by the above two transformations, we concatenate them and feed them into a shared 1 × 1 convolution $F_1$ for feature integration to obtain $f$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

Here, $\delta$ denotes a nonlinear activation, $f \in \mathbb{R}^{(C/r) \times (H+W)}$ is an intermediate feature map that encodes spatial information in the horizontal and vertical directions, and $r$ is a hyperparameter controlling the block size. Then, $f$ is split along the spatial dimension into two independent tensors, $f^h \in \mathbb{R}^{(C/r) \times H}$ and $f^w \in \mathbb{R}^{(C/r) \times W}$, which are then transformed by two 1 × 1 convolutions, $F_h$ and $F_w$, for feature integration:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \quad g^w = \sigma\left(F_w\left(f^w\right)\right)$$
Attention is weighted in both directions, and the final output of the positional attention, $y$, is denoted as follows:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
Positional information coding focuses on the spatial information of the input image, aggregating feature extraction in both horizontal and vertical directions. This allows for the precise localization of targets located in different spatial directions. During training, the neural network model can more accurately concentrate on the features of the target’s location, enabling it to learn more about the target’s characteristics in that area.
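To make the two-step encoding concrete, the following is a minimal PyTorch sketch of a positional information encoding block implementing the equations above. The module and variable names are our own illustrative choices rather than code released with this paper, and the reduction ratio and Hardswish activation are assumptions:

```python
import torch
import torch.nn as nn

class PositionalEncodingAttention(nn.Module):
    """Positional information encoding: pools along H and W separately,
    mixes the two direction-aware descriptors with a shared 1x1 conv,
    then produces per-direction attention weights."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B,C,H,1): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B,C,1,W): average over H
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # shared F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()  # the nonlinear activation delta (assumed)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                           # (B,C,H,1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B,C,W,1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)       # split along spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B,C,H,1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B,C,1,W)
        return x * g_h * g_w  # y_c(i,j) = x_c(i,j) * g^h_c(i) * g^w_c(j)
```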
3.2. PAC3 Structure
In UAV remote sensing images, blueberry canopy fruits are numerous, close together, and similar in shape and color, resulting in a dense distribution of fruits. Facing this dense distribution, a neural network very easily produces missed and false detections. The C3 module in YOLOv5 is the main module for learning residual features; it extracts features from the input image to capture feature information at different scales along with semantic features such as edges, textures, and shapes. Its structure contains three standard convolutional layers and is divided into two branches: one passes through a convolutional layer followed by multiple stacked bottlenecks, the other passes through only one basic convolutional module, and finally the outputs of the two branches are concatenated.
However, the standard convolution and pooling operations used in the C3 module do not capture the positional features of the targets to be detected and incorporate a large amount of useless background information; these redundant features greatly increase the difficulty of detection. Therefore, we integrate the positional attention module into C3: after channel feature concatenation, the feature map carrying edge, texture, shape, and other feature information is fed into the positional information encoding, which attends more closely to the exact location of the blueberry fruit in the canopy by integrating horizontal and vertical spatial information features and reduces the influence of redundant background information, as shown in Figure 4.
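As a rough sketch of this integration (our own reconstruction from the description above, reusing the YOLOv5-style C3 layout and the PositionalEncodingAttention module sketched in Section 3.1; all layer names are illustrative):

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Basic YOLOv5-style convolution block: Conv -> BN -> SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck used inside C3: 1x1 conv, 3x3 conv, shortcut add."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, 1)
        self.cv2 = ConvBNAct(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class PAC3(nn.Module):
    """C3 with positional information encoding applied after the concat,
    so the fused features keep precise spatial location information."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, c_hidden, 1)   # bottleneck-branch entry
        self.cv2 = ConvBNAct(c_in, c_hidden, 1)   # shortcut branch
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = ConvBNAct(2 * c_hidden, c_out, 1)
        # PositionalEncodingAttention is the class sketched in Section 3.1
        self.pos_att = PositionalEncodingAttention(c_out)

    def forward(self, x):
        y = torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1)
        return self.pos_att(self.cv3(y))
```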
3.3. Fast Convolutional Structure
FLOPs represent the total number of floating-point operations needed during model execution, while MAC indicates the memory access cost. FLOPs provide an overall estimate of the model's computational load, whereas MAC describes the model's memory access requirements; together, the two metrics can be used to evaluate and compare the efficiency and performance of different models. During forward propagation in a neural network, operations such as convolution, pooling, BatchNorm, ReLU, and upsampling are performed, and all of them consume computing power, with convolution accounting for the largest share. Therefore, we introduce a fast convolution that extracts image spatial features more efficiently by simultaneously reducing redundant computation and memory accesses, as shown in Figure 5.
Instead of performing the convolution operation over all channels of the input, fast convolution applies convolution to only a subset of channels to extract the spatial features of the image, while the remaining channels are left unchanged. The first $c_p = c/r$ channels are taken as representative of the whole feature map and are the only ones involved in computation and memory access. The FLOPs and MAC (memory access cost) of fast convolution are as follows:

$$\mathrm{FLOPs}_{fast} = h \times w \times k^2 \times c_p^2$$

$$\mathrm{MAC}_{fast} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$

The FLOPs and MAC of regular convolution are as follows:

$$\mathrm{FLOPs}_{conv} = h \times w \times k^2 \times c^2$$

$$\mathrm{MAC}_{conv} = h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c$$

where $h$ and $w$ are the height and width of the feature map, $k$ is the kernel size, and $c$ is the number of channels. When $c_p/c = 1/4$, the FLOPs of fast convolution are only 1/16 of those of regular convolution, and the memory accesses of fast convolution are only 1/4 of those of regular convolution.
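A minimal PyTorch sketch of such a fast (partial) convolution is shown below, under the assumption that the first c/r channels are the ones convolved; the class name is our own illustrative choice:

```python
import torch
import torch.nn as nn

class FastConv(nn.Module):
    """Fast (partial) convolution: apply a k x k convolution to the first
    c/r channels only; the remaining channels pass through unchanged."""
    def __init__(self, channels: int, k: int = 3, r: int = 4):
        super().__init__()
        self.c_p = channels // r  # channels actually convolved
        self.conv = nn.Conv2d(self.c_p, self.c_p, k, 1, k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# FLOPs check for h = w = 40, k = 3, c = 256, r = 4 (so c_p = 64):
# regular: 40*40*9*256^2 ≈ 9.44e8;  fast: 40*40*9*64^2 ≈ 5.90e7, i.e., 1/16
```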
3.4. Cluster-NMF (Non-Maximum Fusion)
In the target detection task, single-stage neural network models based on the YOLO framework generate a large number of prediction boxes during the prediction stage; the vast majority of them are redundant boxes that do not correctly contain a target, so an algorithm is needed to filter them. NMS (non-maximum suppression) is the algorithm used in target detection to remove highly redundant prediction boxes at the inference stage. Box generation produces many predictions at the same target location, and these candidate boxes may overlap with each other; non-maximum suppression is then needed to find the best prediction box for each target and eliminate the redundant bounding boxes.
The steps of NMS are shown in Figure 6: (1) first, set a confidence threshold and remove the bounding boxes whose confidence is below it; (2) sort the remaining prediction boxes in descending order of confidence score; (3) take the box with the highest confidence score in the current category and save it; (4) delete all boxes whose IoU with the box retained in step (3) is higher than the IoU threshold; and (5) repeat steps (3) and (4) until all boxes are processed.
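For reference, a minimal sketch of these steps in PyTorch is shown below (torchvision also ships an optimized equivalent, torchvision.ops.nms); the helper names are our own:

```python
import torch

def iou_matrix(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU for boxes given as (x1, y1, x2, y2), shape (n, 4)."""
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])  # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter)

def nms(boxes: torch.Tensor, scores: torch.Tensor,
        score_thr: float = 0.25, iou_thr: float = 0.5) -> torch.Tensor:
    """Sequential NMS following steps (1)-(5); returns indices of kept boxes."""
    idx = torch.nonzero(scores > score_thr).flatten()  # (1) confidence pre-filter
    idx = idx[scores[idx].argsort(descending=True)]    # (2) sort by confidence
    kept = []
    while idx.numel() > 0:
        best = idx[0]                                  # (3) keep the top-scoring box
        kept.append(best.item())
        if idx.numel() == 1:
            break
        ious = iou_matrix(boxes[idx])[0, 1:]           # IoU of best vs. the rest
        idx = idx[1:][ious <= iou_thr]                 # (4) drop heavy overlaps
    return torch.tensor(kept, dtype=torch.long)        # (5) loop until done
```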
However, the NMS algorithm has some drawbacks: (1) the threshold must be set manually; if it is set too high, false detections result, and if too low, the desired effect is not achieved (recognition accuracy decreases); (2) any box whose IoU exceeds the NMS threshold has its score set to 0 and is removed outright, which is too hard a form of suppression; (3) the sequential loop is difficult to parallelize on a GPU, so operational efficiency is low and model inference is slow; (4) suppression is judged by the IoU of the prediction boxes alone, and IoU behaves differently across target box scales and distances.
To address these problems, Bodla N proposed the Soft-NMS algorithm, which suppresses the confidence score of each overlapping box with a Gaussian penalty function instead of removing it outright. For boxes whose confidence is not high, the penalized score falls below the threshold and the box is removed; boxes with a high confidence score remain above the threshold even after the penalty and are retained. This reduces the number of boxes forcibly removed when multiple detected targets overlap, i.e., it reduces the missed detection rate [32]. Ning C proposed Weighted-NMS, in which the prediction boxes are weighted and normalized according to their confidence scores to obtain a new bounding box, which is used as the final predicted bounding box while the other boxes are eliminated [33]. Zheng Z proposed DIoU-NMS, which filters the predicted bounding boxes by combining the area of the overlap region with the distance between the centroids of the two boxes through the DIoU calculation [34].
The above NMS variants share a common bottleneck: the computation of IoU and the sequential, iterative suppression. If there are $n$ detection boxes in an image, then under sequential processing a given box $M$ must have its IoU computed against the other boxes at least once and at most $n-1$ times; combined with the sequential iterative suppression, these NMS algorithms compute the IoU at least $n-1$ times and up to $\frac{n(n-1)}{2}$ times. Hence, to speed up NMS, the IoU computation should be parallelized. To this end, Zheng Z et al. proposed Cluster-NMS, which performs NMS with GPU matrix operations in PyTorch while guaranteeing the same performance as traditional NMS [30].
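A minimal sketch of this matrix-parallel iteration, reusing iou_matrix from the NMS sketch above (following the published Cluster-NMS idea, not this paper's released code):

```python
import torch

def cluster_nms(boxes: torch.Tensor, scores: torch.Tensor,
                iou_thr: float = 0.5, max_iter: int = 200) -> torch.Tensor:
    """Matrix-parallel NMS. Returns indices of retained boxes."""
    order = scores.argsort(descending=True)          # sort by descending score
    boxes = boxes[order]
    x = iou_matrix(boxes).triu(diagonal=1)           # upper-triangular IoU matrix
    b = torch.ones(boxes.size(0))
    for _ in range(max_iter):
        e = torch.diag(b)                            # expand b into diagonal matrix E
        m = (e @ x).max(dim=0).values                # column-wise max of E x X
        b_new = (m <= iou_thr).float()               # binarize by the NMS threshold
        if torch.equal(b_new, b):                    # converged: b unchanged
            break
        b = b_new
    return order[b.bool()]                           # map back to original indices
```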
However, Cluster-NMS only speeds up inference while guaranteeing the same accuracy as traditional NMS; it does not further improve the accuracy of inference. To address this, we propose Cluster-NMF (Cluster Non-Maximum Fusion). First, the IoU computation is parallelized in matrix form: we only need to compute the IoU of the detection box set $B = \{B_1, B_2, \ldots, B_n\}$ with itself, where $B$ is sorted in descending order of confidence score, i.e., the score of $B_1$ is the highest and the score of $B_n$ is the lowest. The following matrix is obtained:

$$X = \mathrm{IoU}(B, B), \quad x_{ij} = \mathrm{IoU}(B_i, B_j)$$

where $x_{ij}$ represents the IoU of detection box $B_i$ and detection box $B_j$.

Because the IoU matrix is symmetric, $x_{ij} = x_{ji}$, and because the IoU of a detection box with itself is meaningless, $X$ is upper-triangularized to obtain an upper-triangular matrix whose elements on and below the diagonal are 0.

Then, the maximum value of each column of $X$ is taken and binarized according to the NMS threshold to obtain a one-dimensional tensor $b = (b_1, b_2, \ldots, b_n)$, where $b_j$ is derived from the maximum element of the $j$-th column. In the binarization, elements greater than the threshold are assigned a value of 0 and elements smaller than the threshold a value of 1; an element of $b$ with a value of 0 denotes a suppressed prediction box, and an element with a value of 1 indicates a retained box. The tensor $b$ is then expanded into a diagonal matrix $E = \mathrm{diag}(b)$, and $E$ is left-multiplied by the IoU matrix to give $E \times X$. The new tensor $b$ is obtained by again taking the column-wise maximum of $E \times X$ and binarizing it against the NMS threshold.

These steps are repeated until the tensor $b$ no longer changes between two consecutive iterations, at which point the final IoU matrix $X^{final} = \mathrm{diag}(b) \times X$ determines which detection boxes are retained and which are suppressed. Then, for each column of $X^{final}$, a new fusion IoU threshold determines which detection boxes participate in fusion: the coordinates of each retained prediction box are updated as the weighted average of the border coordinates of all boxes belonging to the same object, with weights derived from the classification confidence values and the IoUs.
The pseudo-code of the Cluster-NMF algorithm is shown in Algorithm 1. In the notation, $X^{final}$ denotes the IoU matrix of the retained and suppressed detection boxes screened by Cluster-NMF; $\varepsilon_f$ is the new fusion IoU threshold, above which boxes are fused and below which they are not; $S'$ is the new list of scores for the detection boxes obtained after the function $f$ determines whether each box should be fused; $D$ denotes the coordinate matrix of the retained detection boxes; $W$ is the weight matrix; $w_{ij}$ denotes the score weight of the $i$-th box with respect to the $j$-th box; $x_{ij}$ denotes the IoU of the $i$-th box and the $j$-th box; $s_i$ is the score of the $i$-th box; $B$ denotes the prediction boxes retained before fusion; $B'$ indicates the prediction boxes retained after fusion; and $B^{final}$ indicates the prediction boxes ultimately retained.
Algorithm 1: Cluster-NMF
Input: $N$ detected boxes $B$, sorted in descending order of classification score; NMS threshold $\varepsilon$; fusion threshold $\varepsilon_f$.
Output: $b \in \{0, 1\}^N$, which encodes the final detection result, where 1 denotes reservation and 0 denotes suppression.
1: Initialize $b^0 = (1, 1, \ldots, 1)$
2: Compute IoU matrix $X = \mathrm{IoU}(B, B)$
3: $X \leftarrow \mathrm{triu}(X)$ (upper triangular matrix with zero diagonal)
4: While $t \le T$ do
5:   $E \leftarrow \mathrm{diag}(b^{t-1})$
6:   $b^t \leftarrow$ binarize(column-wise max of $E \times X$, $\varepsilon$)
7:   if $b^t = b^{t-1}$ then break
8:   end if
9: end While
10: $X^{final} \leftarrow \mathrm{diag}(b^t) \times X$; fuse each retained box with the boxes overlapping it above $\varepsilon_f$ by confidence- and IoU-weighted averaging of coordinates
11: Return Box, $b$
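Below is an illustrative sketch of Algorithm 1, again reusing iou_matrix from the NMS sketch in Section 3.4. The exact weighting scheme (score × IoU) and the default fusion threshold are our assumptions based on the description above, not a verbatim transcription of the paper's implementation:

```python
import torch

def cluster_nmf(boxes: torch.Tensor, scores: torch.Tensor,
                iou_thr: float = 0.5, fuse_thr: float = 0.5, max_iter: int = 200):
    """Cluster-NMF: run the Cluster-NMS iteration, then update each retained
    box as the score- and IoU-weighted average of the boxes it overlaps."""
    order = scores.argsort(descending=True)
    boxes, scores = boxes[order], scores[order]
    x = iou_matrix(boxes).triu(diagonal=1)
    b = torch.ones(boxes.size(0))
    for _ in range(max_iter):                    # Cluster-NMS iteration
        b_new = ((torch.diag(b) @ x).max(dim=0).values <= iou_thr).float()
        if torch.equal(b_new, b):
            break
        b = b_new
    x_final = torch.diag(b) @ x                  # rows of suppressed boxes zeroed
    keep = b.bool()
    # Fusion: each retained box k is averaged with the boxes j it overlaps
    # (x_final[k, j] > fuse_thr), weighted by classification score * IoU.
    w = (x_final > fuse_thr).float() * x_final * scores[None, :]
    w = w + torch.diag(scores)                   # each box contributes itself
    fused = (w @ boxes) / w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return fused[keep], scores[keep]             # fused coordinates and scores
```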
3.5. PF-YOLO Network Model
The YOLO framework is a popular deep-learning-based single-stage target detection method with high detection accuracy and fast inference. To deploy the neural network model on UAV embedded devices for real-time monitoring, the weight file must be small; therefore, considering both the accuracy and efficiency of the detection model, this study combines the improved modules and algorithms above with the YOLO framework and proposes the PF-YOLO neural network model for detecting blueberry canopy fruits in UAV remote sensing imagery.
As shown in Figure 7, the PF-YOLO network model is divided into three parts: Backbone, Neck, and Head. In the Backbone, the PAC3 module improves the extraction of precise positional features of blueberry canopy fruit targets from the input image, reducing missed detections of canopy fruits. In the Neck, fast convolution reduces the number of parameters and speeds up memory access for the overall model while maintaining the original feature extraction capability. In the Head, detection heads at four scales are applied to the feature maps of blueberry canopy fruit clusters of different sizes to generate category, coordinate, and confidence information.