1. Introduction
Blueberries are rich in a variety of nutrients. They are high in antioxidants, which can help reduce the risk of chronic diseases such as cardiovascular disease and cancer [1]. They are also rich in vitamin C and vitamin E, which help protect the eyes from free radical damage and prevent eye diseases. In orchard cultivation, blueberries are one of the fruit crops with high economic returns, and monitoring blueberry fruit during cultivation is economically important for estimating yields and predicting fruit maturity. However, due to disturbing factors in real outdoor orchard scenes, such as variations in outdoor light, the color similarity between blueberry fruits and the fruit tree canopy, imaging distance, and occlusion, developing reliable visual detection methods to identify fruits in blueberry canopies remains extremely challenging. In recent years, with the rapid development of deep learning, combining visual detection systems with deep learning to detect blueberry fruit has become a new research direction.
It is well known that blueberry fruits are relatively small, which makes them challenging for general-purpose target detection algorithms. Yang F proposed a clustered detection network (ClusterNet), in which a scale estimation sub-network (ScaleNet) estimates object scales in the input image and a dedicated detection network (DetecNet) detects objects within the normalized cluster regions, effectively improving small-object detection [2]. Xiaolin F proposed a small-target detection method for remote sensing images based on an improved Faster R-CNN, which improves small-target detection by introducing an improved anchor generation strategy and a multi-scale feature fusion mechanism [3]. Gong Y proposed the concept of a 'fusion factor' to control the amount of information passed from the deeper to the shallower layers of the feature pyramid network (FPN), so that the FPN can adapt to small-target detection; when the FPN is configured with an appropriate fusion factor, the network obtains a significant performance improvement on small-target detection datasets [4].
Small-target detection in UAV remote sensing images is also an important branch of target detection research. Zhu X explored the prediction potential of the self-attention mechanism based on the YOLOv5 algorithm by adding more prediction heads to detect objects at different scales and by replacing the original prediction heads with Transformer-based ones; experiments on VisDrone2021 showed a 7% accuracy improvement over the baseline YOLOv5 model [5]. Li C proposed a density-map-guided target detection network (DMNet) that determines whether a region contains objects based on changes in the pixel intensity distribution of an image, and achieved excellent performance [6]. Lu W constructed a hybrid backbone combining a Convolutional Neural Network (CNN) and a Transformer (ViT) for UAV remote sensing image detection, and applied the Masked Image Modeling (MIM) method to a UAV remote sensing dataset smaller than mainstream natural scene datasets to generate a pre-trained model, further improving the target detection accuracy on UAV imagery [7]. Liao J et al. proposed an unsupervised cluster-guided detection framework (UCGNet) that focuses on densely distributed object regions, in order to avoid having to compress the resolution of high-resolution images prior to training [8].
Buters T used low-altitude UAV imagery and automated object-based image analysis software to detect and count target seeds and seedlings against non-target grass substrates on a variety of reflective localized restoration substrates [9]. Melnychenko O proposed a new approach combining Artificial Intelligence (AI), Deep Learning (DL), and Unmanned Aerial Vehicles (UAVs), demonstrating superior real-time capabilities in fruit detection and counting with a multi-UAV system [10]. Shiu Y S used high-resolution UAV images with a ground sampling distance (GSD) of 3 cm for pineapple plant detection, applying Faster R-CNN to input image tiles of 256 × 256 pixels to locate and count ripe fruits [11]. Khan S et al. implemented and evaluated a deep learning system for identifying weeds and crops on high-resolution UAV images captured over two different target fields [12]. Gallo I combined the YOLOv7 target detection algorithm with an unmanned aircraft system [13]. Wittstruck L developed an image processing method for high-resolution UAV RGB data and applied it to Hokkaido pumpkin fields in northwestern Germany [14].
Gai R presented TL-YOLOv8, a blueberry recognition algorithm based on YOLOv8. Feature extraction during training is enhanced by introducing an improved MPCA (Multiplexed Coordinated Attention) module in the last layer of the backbone network, and replacing the C2f module with an OREPA (Online Convolutional Reparameterization) module not only accelerates training but also enhances representation, effectively addressing the fruit occlusion problem [15]. Xiao F proposed a lightweight detection method based on an improved YOLOv5 algorithm, implementing a lightweight deep convolutional neural network with the ShuffleNet module; evaluated on a blueberry fruit dataset, it effectively detected blueberry fruits in an orchard environment and identified their ripening stages [16].
Currently, the growth monitoring of blueberries is still mainly carried out by on-site manual inspection, which requires not only professional knowledge and experience but also substantial human resources. Therefore, advanced optical sensors mounted on UAVs for capturing remote sensing images have gradually been applied to monitoring fruit growth on trees, detecting crop diseases, predicting fruit yields, and detecting fruit ripeness. Lu C et al. proposed a practical workflow based on YOLOv5 and UAV images to detect maize seedlings [17]. Li W designed a pipeline combining an improved U-Net and a VGG-19 network to address the difficulty of identifying grape leaves in UAV images of complex environments [18]. Bao W proposed a UAV remote sensing method based on DDMA-YOLO for the effective detection of tea leaf blight [19]. Wang C constructed a YOLO-BLBE model combined with an innovative I-MSRCR method to accurately identify blueberry fruits of different ripeness [20]. Yang W designed a blueberry recognition model based on an improved YOLOv5, with a new attention module, NCBAM, to improve the backbone network's ability to extract blueberry features [21]. Huang H et al. improved the accuracy of detecting citrus fruits in UAV-captured images by extending a target detection algorithm [22]. Yong Z X proposed a method for detecting individual grapefruit trees by combining deep learning with UAV remote sensing [23].
The You Only Look Once (YOLO) series is a widely used and excellent single-stage target detection method (notable recent versions include YOLOv5, YOLOv7, and YOLOv8 [24,25]), which obtains the category and position of objects in a single stage. Compared with two-stage target detection algorithms such as R-CNN [26,27,28], single-stage algorithms save a large amount of computational time and cost and are more suitable for UAVs to quickly detect and identify blueberry fruits in the canopy; accordingly, the YOLO series has been widely used in remote sensing and agriculture.
The detection and counting of fruits in tree canopies is important for orchard management, yield estimation, and phenotyping; however, so far there have been relatively few studies using UAV remote sensing imagery and deep learning methods to monitor blueberry canopy fruits in orchards. In this study, UAV remote sensing images are used to detect blueberry fruits in orchard canopies, which raises three problems. First, UAV images cover a wide area at high resolution, and most of the pixel space is occupied by useless background information, with very little allocated to the small blueberry fruits, making it difficult for a neural network to detect these small targets accurately. Second, the blueberry fruits in UAV images are densely distributed and similar in shape, which easily causes the neural network to produce false detections. Third, when the YOLO algorithm is trained on UAV remote sensing images, many candidate prior boxes are generated in advance; filtering the optimal boxes with non-maximum suppression (NMS) [29] is slow at inference, and while Cluster-NMS [30] improves inference speed, it is still difficult to select the optimal box.
To address the above problems, this paper proposes a PF-YOLO-based UAV remote sensing monitoring method for blueberry canopy fruits, which can monitor blueberry canopy fruits in a timely and accurate manner. This research makes the following contributions:
1. High-resolution remote sensing images for detecting blueberry canopy fruits were collected by UAV photography in real blueberry orchards, and a high-resolution remote sensing blueberry canopy fruit target detection dataset was constructed.
2. The PAC3 structure is proposed by fusing a positional information encoding structure into the feature extraction part of the target detector, so that more attention is paid to the spatial position features of the detected targets, in order to solve the problem of missed detection of small targets in UAV remote sensing images.
3. Fast convolution is used instead of conventional convolution to reduce the number of parameters and the computational complexity of the model; on the basis of the YOLO framework, the PAC3 structure and fast convolution are fused into the proposed UAV blueberry canopy fruit remote sensing target detection model, PF-YOLO.
4. The Cluster-NMF algorithm is proposed to speed up the inference stage of blueberry canopy fruit detection and improve detection efficiency by optimizing the screening of redundant detection boxes.
3. Experiments and Methods
3.1. Location Information Encoding
The SE structure compresses global spatial information into a channel descriptor, using global average pooling to transform the feature map into a one-dimensional feature vector and thereby model inter-channel dependencies. The CBAM structure effectively integrates the spatial and channel dimensions of the feature map, learning adaptive channel weights to focus on specific channel and location information.
SE only considers internal channel information and ignores the importance of location information, yet in target detection the spatial location of the detected object is extremely important. Although CBAM tries to introduce location information through global pooling over channels, this only captures local information and cannot model long-distance dependencies: after the convolutional layers, each location of the feature map contains information about only a local region of the input image, and CBAM derives its weighting coefficients by taking the maximum and average over channels for each such region, so the weights reflect only local-region information.
Standard two-dimensional convolution cannot associate channel relationships with input image features, and constructing links between channels can improve the sensitivity of the model's classification decisions to informative channels. Global pooling is usually used to encode global spatial information into channel attention, which compensates for the shortcomings of the convolution operation; however, it directly compresses the global spatial information into a channel descriptor, losing position information, which greatly affects the capture of spatial structure in visual tasks. In the positional information encoding mechanism, the channel relationships and long-range dependencies of the input features are encoded in two steps: positional information is first embedded, and positional attention is then generated, producing a feature map with accurate positional information, as shown in Figure 3.
The location information encoding uses a pair of 1D feature encoding operations so that the attention block can capture long-range spatial interactions with precise location information. Let the input be $X \in \mathbb{R}^{C \times H \times W}$. Using a pooling kernel of spatial extent $(H, 1)$ along the horizontal coordinate and $(1, W)$ along the vertical coordinate, each channel of $X$ is encoded along the horizontal and vertical directions, respectively. The output of the $c$-th channel at height $h$ is represented as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

The output of the $c$-th channel at width $w$ is denoted as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
These two transformations aggregate features in two spatial directions to obtain a pair of direction-aware feature maps, which allows the attention block to capture long-range dependencies in one spatial direction and retains precise location information in the other, helping the neural network to more accurately localize the location of the feature region of the object of interest in the input image.
After the aggregated feature maps are generated by the above two transformations, we concatenate them and feed them into a shared 1 × 1 convolution $F_1$ for feature integration to obtain $f$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

Here, $\delta$ denotes a nonlinear activation, $f \in \mathbb{R}^{(C/r) \times (H+W)}$ is an intermediate feature map that encodes spatial information in the horizontal and vertical directions, and $r$ is a hyperparameter controlling the block size. Then, $f$ is split along the spatial dimension into two independent tensors, $f^h \in \mathbb{R}^{(C/r) \times H}$ and $f^w \in \mathbb{R}^{(C/r) \times W}$, which are then transformed by two 1 × 1 convolutions, $F_h$ and $F_w$, for feature integration:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \quad g^w = \sigma\left(F_w\left(f^w\right)\right)$$
Attention is weighted in both directions, and the final output of the positional attention, $y$, is denoted as follows:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
Positional information coding focuses on the spatial information of the input image, aggregating feature extraction in both horizontal and vertical directions. This allows for the precise localization of targets located in different spatial directions. During training, the neural network model can more accurately concentrate on the features of the target’s location, enabling it to learn more about the target’s characteristics in that area.
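To make the two-step encoding concrete, the following is a minimal PyTorch sketch of a positional information encoding block implementing the equations above. The module and variable names are our own illustrative choices rather than code released with this paper, and the reduction ratio and Hardswish activation are assumptions:

```python
import torch
import torch.nn as nn

class PositionalEncodingAttention(nn.Module):
    """Positional information encoding: pools along H and W separately,
    mixes the two direction-aware descriptors with a shared 1x1 conv,
    then produces per-direction attention weights."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B,C,H,1): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B,C,1,W): average over H
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # shared F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()  # the nonlinear activation delta (assumed)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                           # (B,C,H,1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B,C,W,1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)       # split along spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (B,C,H,1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (B,C,1,W)
        return x * g_h * g_w  # y_c(i,j) = x_c(i,j) * g^h_c(i) * g^w_c(j)
```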
3.2. PAC3 Structure
In UAV remote sensing images, blueberry canopy fruits are numerous, close together, and similar in shape and color, resulting in a dense distribution of fruits. Facing this dense distribution, a neural network very easily produces missed and false detections. The C3 module in YOLOv5 is the main module for learning residual features; it extracts features from the input image to capture feature information at different scales along with semantic features such as edges, textures, and shapes. Its structure contains three standard convolutional layers and is divided into two branches: one passes through a convolutional layer followed by multiple stacked bottlenecks, the other passes through only one basic convolutional module, and finally the outputs of the two branches are concatenated.
However, the standard convolution and pooling operations used in the C3 module do not capture the positional features of the targets to be detected and incorporate a large amount of useless background information; these redundant features greatly increase the difficulty of detection. Therefore, we integrate the positional attention module into C3: after channel feature concatenation, the feature map carrying edge, texture, shape, and other feature information is fed into the positional information encoding, which attends more closely to the exact location of the blueberry fruit in the canopy by integrating horizontal and vertical spatial information features and reduces the influence of redundant background information, as shown in Figure 4.
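As a rough sketch of this integration (our own reconstruction from the description above, reusing the YOLOv5-style C3 layout and the PositionalEncodingAttention module sketched in Section 3.1; all layer names are illustrative):

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Basic YOLOv5-style convolution block: Conv -> BN -> SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck used inside C3: 1x1 conv, 3x3 conv, shortcut add."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, 1)
        self.cv2 = ConvBNAct(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class PAC3(nn.Module):
    """C3 with positional information encoding applied after the concat,
    so the fused features keep precise spatial location information."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, c_hidden, 1)   # bottleneck-branch entry
        self.cv2 = ConvBNAct(c_in, c_hidden, 1)   # shortcut branch
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = ConvBNAct(2 * c_hidden, c_out, 1)
        # PositionalEncodingAttention is the class sketched in Section 3.1
        self.pos_att = PositionalEncodingAttention(c_out)

    def forward(self, x):
        y = torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1)
        return self.pos_att(self.cv3(y))
```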
3.3. Fast Convolutional Structure
FLOPs represent the total number of floating-point operations needed during model execution, while MAC indicates the memory access cost. FLOPs provide an overall estimate of the model's computational load, whereas MAC describes the model's memory access requirements; together, the two metrics can be used to evaluate and compare the efficiency and performance of different models. During forward propagation in a neural network, operations such as convolution, pooling, BatchNorm, ReLU, and upsampling are performed, and all of them consume computing power, with convolution accounting for the largest share. Therefore, we introduce a fast convolution that extracts image spatial features more efficiently by simultaneously reducing redundant computation and memory accesses, as shown in Figure 5.
Instead of performing the convolution operation over all channels of the input, fast convolution applies convolution to only a subset of channels to extract the spatial features of the image, while the remaining channels are left unchanged. The first $c_p = c/r$ channels are taken as representative of the whole feature map and are the only ones involved in computation and memory access. The FLOPs and MAC (memory access cost) of fast convolution are as follows:

$$\mathrm{FLOPs}_{fast} = h \times w \times k^2 \times c_p^2$$

$$\mathrm{MAC}_{fast} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$

The FLOPs and MAC of regular convolution are as follows:

$$\mathrm{FLOPs}_{conv} = h \times w \times k^2 \times c^2$$

$$\mathrm{MAC}_{conv} = h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c$$

where $h$ and $w$ are the height and width of the feature map, $k$ is the kernel size, and $c$ is the number of channels. When $c_p/c = 1/4$, the FLOPs of fast convolution are only 1/16 of those of regular convolution, and the memory accesses of fast convolution are only 1/4 of those of regular convolution.
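A minimal PyTorch sketch of such a fast (partial) convolution is shown below, under the assumption that the first c/r channels are the ones convolved; the class name is our own illustrative choice:

```python
import torch
import torch.nn as nn

class FastConv(nn.Module):
    """Fast (partial) convolution: apply a k x k convolution to the first
    c/r channels only; the remaining channels pass through unchanged."""
    def __init__(self, channels: int, k: int = 3, r: int = 4):
        super().__init__()
        self.c_p = channels // r  # channels actually convolved
        self.conv = nn.Conv2d(self.c_p, self.c_p, k, 1, k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# FLOPs check for h = w = 40, k = 3, c = 256, r = 4 (so c_p = 64):
# regular: 40*40*9*256^2 ≈ 9.44e8;  fast: 40*40*9*64^2 ≈ 5.90e7, i.e., 1/16
```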
3.4. Cluster-NMF (Non-Maximum Fusion)
In the target detection task, single-stage neural network models based on the YOLO framework generate a large number of prediction boxes during the prediction stage; the vast majority of them are redundant boxes that do not correctly contain a target, so an algorithm is needed to filter them. NMS (non-maximum suppression) is the algorithm used in target detection to remove highly redundant prediction boxes at the inference stage. Box generation produces many predictions at the same target location, and these candidate boxes may overlap with each other; non-maximum suppression is then needed to find the best prediction box for each target and eliminate the redundant bounding boxes.
The steps of NMS are shown in Figure 6: (1) first, set a confidence threshold and remove the bounding boxes whose confidence is below it; (2) sort the remaining prediction boxes in descending order of confidence score; (3) take the box with the highest confidence score in the current category and save it; (4) delete all boxes whose IoU with the box retained in step (3) is higher than the IoU threshold; and (5) repeat steps (3) and (4) until all boxes are processed.
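For reference, a minimal sketch of these steps in PyTorch is shown below (torchvision also ships an optimized equivalent, torchvision.ops.nms); the helper names are our own:

```python
import torch

def iou_matrix(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU for boxes given as (x1, y1, x2, y2), shape (n, 4)."""
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])  # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter)

def nms(boxes: torch.Tensor, scores: torch.Tensor,
        score_thr: float = 0.25, iou_thr: float = 0.5) -> torch.Tensor:
    """Sequential NMS following steps (1)-(5); returns indices of kept boxes."""
    idx = torch.nonzero(scores > score_thr).flatten()  # (1) confidence pre-filter
    idx = idx[scores[idx].argsort(descending=True)]    # (2) sort by confidence
    kept = []
    while idx.numel() > 0:
        best = idx[0]                                  # (3) keep the top-scoring box
        kept.append(best.item())
        if idx.numel() == 1:
            break
        ious = iou_matrix(boxes[idx])[0, 1:]           # IoU of best vs. the rest
        idx = idx[1:][ious <= iou_thr]                 # (4) drop heavy overlaps
    return torch.tensor(kept, dtype=torch.long)        # (5) loop until done
```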
However, the NMS algorithm has some drawbacks: (1) the threshold must be set manually; if it is set too high, false detections result, and if too low, the desired effect is not achieved (recognition accuracy decreases); (2) any box whose IoU exceeds the NMS threshold has its score set to 0 and is removed outright, which is too hard a form of suppression; (3) the sequential loop is difficult to parallelize on a GPU, so operational efficiency is low and model inference is slow; (4) suppression is judged by the IoU of the prediction boxes alone, and IoU behaves differently across target box scales and distances.
To address these problems, Bodla N proposed the Soft-NMS algorithm, which suppresses the confidence score of each overlapping box with a Gaussian penalty function instead of removing it outright. For boxes whose confidence is not high, the penalized score falls below the threshold and the box is removed; boxes with a high confidence score remain above the threshold even after the penalty and are retained. This reduces the number of boxes forcibly removed when multiple detected targets overlap, i.e., it reduces the missed detection rate [32]. Ning C proposed Weighted-NMS, in which the prediction boxes are weighted and normalized according to their confidence scores to obtain a new bounding box, which is used as the final predicted bounding box while the other boxes are eliminated [33]. Zheng Z proposed DIoU-NMS, which filters the predicted bounding boxes by combining the area of the overlap region with the distance between the centroids of the two boxes through the DIoU calculation [34].
The above NMS variants share a common bottleneck: the computation of IoU and the sequential, iterative suppression. If there are $n$ detection boxes in an image, then under sequential processing a given box $M$ must have its IoU computed against the other boxes at least once and at most $n-1$ times; combined with the sequential iterative suppression, these NMS algorithms compute the IoU at least $n-1$ times and up to $\frac{n(n-1)}{2}$ times. Hence, to speed up NMS, the IoU computation should be parallelized. To this end, Zheng Z et al. proposed Cluster-NMS, which performs NMS with GPU matrix operations in PyTorch while guaranteeing the same performance as traditional NMS [30].
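A minimal sketch of this matrix-parallel iteration, reusing iou_matrix from the NMS sketch above (following the published Cluster-NMS idea, not this paper's released code):

```python
import torch

def cluster_nms(boxes: torch.Tensor, scores: torch.Tensor,
                iou_thr: float = 0.5, max_iter: int = 200) -> torch.Tensor:
    """Matrix-parallel NMS. Returns indices of retained boxes."""
    order = scores.argsort(descending=True)          # sort by descending score
    boxes = boxes[order]
    x = iou_matrix(boxes).triu(diagonal=1)           # upper-triangular IoU matrix
    b = torch.ones(boxes.size(0))
    for _ in range(max_iter):
        e = torch.diag(b)                            # expand b into diagonal matrix E
        m = (e @ x).max(dim=0).values                # column-wise max of E x X
        b_new = (m <= iou_thr).float()               # binarize by the NMS threshold
        if torch.equal(b_new, b):                    # converged: b unchanged
            break
        b = b_new
    return order[b.bool()]                           # map back to original indices
```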
However, Cluster-NMS only speeds up inference while guaranteeing the same accuracy as traditional NMS; it does not further improve the accuracy of inference. To address this, we propose Cluster-NMF (Cluster Non-Maximum Fusion). First, the IoU computation is parallelized in matrix form: we only need to compute the IoU of the detection box set $B = \{B_1, B_2, \ldots, B_n\}$ with itself, where $B$ is sorted in descending order of confidence score, i.e., the score of $B_1$ is the highest and the score of $B_n$ is the lowest. The following matrix is obtained:

$$X = \mathrm{IoU}(B, B), \quad x_{ij} = \mathrm{IoU}(B_i, B_j)$$

where $x_{ij}$ represents the IoU of detection box $B_i$ and detection box $B_j$.

Because the IoU matrix is symmetric, $x_{ij} = x_{ji}$, and because the IoU of a detection box with itself is meaningless, $X$ is upper-triangularized to obtain an upper-triangular matrix whose elements on and below the diagonal are 0.

Then, the maximum value of each column of $X$ is taken and binarized according to the NMS threshold to obtain a one-dimensional tensor $b = (b_1, b_2, \ldots, b_n)$, where $b_j$ is derived from the maximum element of the $j$-th column. In the binarization, elements greater than the threshold are assigned a value of 0 and elements smaller than the threshold a value of 1; an element of $b$ with a value of 0 denotes a suppressed prediction box, and an element with a value of 1 indicates a retained box. The tensor $b$ is then expanded into a diagonal matrix $E = \mathrm{diag}(b)$, and $E$ is left-multiplied by the IoU matrix to give $E \times X$. The new tensor $b$ is obtained by again taking the column-wise maximum of $E \times X$ and binarizing it against the NMS threshold.

These steps are repeated until the tensor $b$ no longer changes between two consecutive iterations, at which point the final IoU matrix $X^{final} = \mathrm{diag}(b) \times X$ determines which detection boxes are retained and which are suppressed. Then, for each column of $X^{final}$, a new fusion IoU threshold determines which detection boxes participate in fusion: the coordinates of each retained prediction box are updated as the weighted average of the border coordinates of all boxes belonging to the same object, with weights derived from the classification confidence values and the IoUs.
The pseudo-code of the Cluster-NMF algorithm is shown in Algorithm 1. In the notation, $X^{final}$ denotes the IoU matrix of the retained and suppressed detection boxes screened by Cluster-NMF; $\varepsilon_f$ is the new fusion IoU threshold, above which boxes are fused and below which they are not; $S'$ is the new list of scores for the detection boxes obtained after the function $f$ determines whether each box should be fused; $D$ denotes the coordinate matrix of the retained detection boxes; $W$ is the weight matrix; $w_{ij}$ denotes the score weight of the $i$-th box with respect to the $j$-th box; $x_{ij}$ denotes the IoU of the $i$-th box and the $j$-th box; $s_i$ is the score of the $i$-th box; $B$ denotes the prediction boxes retained before fusion; $B'$ indicates the prediction boxes retained after fusion; and $B^{final}$ indicates the prediction boxes ultimately retained.
Algorithm 1: Cluster-NMF
Input: $N$ detected boxes $B$, sorted in descending order of classification score; NMS threshold $\varepsilon$; fusion threshold $\varepsilon_f$.
Output: $b \in \{0, 1\}^N$, which encodes the final detection result, where 1 denotes reservation and 0 denotes suppression.
1: Initialize $b^0 = (1, 1, \ldots, 1)$
2: Compute IoU matrix $X = \mathrm{IoU}(B, B)$
3: $X \leftarrow \mathrm{triu}(X)$ (upper triangular matrix with zero diagonal)
4: While $t \le T$ do
5:   $E \leftarrow \mathrm{diag}(b^{t-1})$
6:   $b^t \leftarrow$ binarize(column-wise max of $E \times X$, $\varepsilon$)
7:   if $b^t = b^{t-1}$ then break
8:   end if
9: end While
10: $X^{final} \leftarrow \mathrm{diag}(b^t) \times X$; fuse each retained box with the boxes overlapping it above $\varepsilon_f$ by confidence- and IoU-weighted averaging of coordinates
11: Return Box, $b$
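Below is an illustrative sketch of Algorithm 1, again reusing iou_matrix from the NMS sketch in Section 3.4. The exact weighting scheme (score × IoU) and the default fusion threshold are our assumptions based on the description above, not a verbatim transcription of the paper's implementation:

```python
import torch

def cluster_nmf(boxes: torch.Tensor, scores: torch.Tensor,
                iou_thr: float = 0.5, fuse_thr: float = 0.5, max_iter: int = 200):
    """Cluster-NMF: run the Cluster-NMS iteration, then update each retained
    box as the score- and IoU-weighted average of the boxes it overlaps."""
    order = scores.argsort(descending=True)
    boxes, scores = boxes[order], scores[order]
    x = iou_matrix(boxes).triu(diagonal=1)
    b = torch.ones(boxes.size(0))
    for _ in range(max_iter):                    # Cluster-NMS iteration
        b_new = ((torch.diag(b) @ x).max(dim=0).values <= iou_thr).float()
        if torch.equal(b_new, b):
            break
        b = b_new
    x_final = torch.diag(b) @ x                  # rows of suppressed boxes zeroed
    keep = b.bool()
    # Fusion: each retained box k is averaged with the boxes j it overlaps
    # (x_final[k, j] > fuse_thr), weighted by classification score * IoU.
    w = (x_final > fuse_thr).float() * x_final * scores[None, :]
    w = w + torch.diag(scores)                   # each box contributes itself
    fused = (w @ boxes) / w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return fused[keep], scores[keep]             # fused coordinates and scores
```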
3.5. PF-YOLO Network Model
The YOLO framework is a popular deep-learning-based single-stage target detection method with high detection accuracy and fast inference. To deploy the neural network model on UAV embedded devices for real-time monitoring, the weight file must be small; therefore, considering both the accuracy and efficiency of the detection model, this study combines the improved modules and algorithms above with the YOLO framework and proposes the PF-YOLO neural network model for detecting blueberry canopy fruits in UAV remote sensing imagery.
As shown in Figure 7, the PF-YOLO network model is divided into three parts: Backbone, Neck, and Head. In the Backbone, the PAC3 module improves the extraction of precise positional features of blueberry canopy fruit targets from the input image, reducing missed detections of canopy fruits. In the Neck, fast convolution reduces the number of parameters and speeds up memory access for the overall model while maintaining the original feature extraction capability. In the Head, detection heads at four scales are applied to the feature maps of blueberry canopy fruit clusters of different sizes to generate category, coordinate, and confidence information.