2.2.1. Structure of the YOLOv8-Seg Network
The YOLO (You Only Look Once) series of algorithmic frameworks stands out among various detection methods due to its rapid detection capability and high precision [25]. With continuous updates and iterations of the model framework, the YOLO series has become a popular real-time object detection model, extensively used in precision and automated agriculture for the detection and segmentation of crops, pests, and weeds.
In 2023, the Ultralytics team introduced the YOLOv8 (https://github.com/ultralytics/ultralytics) object detection algorithm, which evolved from YOLOv5 as a single-stage, anchor-free detection framework. The overall network structure is divided into four main components: Input, Backbone, Neck, and Head. The Input section, as the interface, is responsible for scaling input images to the dimensions required for training. It features modules such as mosaic data augmentation, adaptive anchor calculation, adaptive image scaling, and Mixup data enhancement. The Backbone is an enhancement over the YOLOv5 model, adopting ELAN's design principles by replacing the C3 structure with a more gradient-rich C2f structure, enhancing feature extraction through additional skip connections and split operations, and varying channel numbers across the different model scales to remain lightweight while capturing richer gradient flow. The Neck intensifies feature integration across dimensions, following the Feature Pyramid Network (FPN) [
26] and Path Aggregation Network (PAN) [
27] architectures, with convolution operations in the upsampling phases removed in layers 4–9 and 10–15 compared to YOLOv5. The Head section employs the current mainstream decoupled structure (Decoupled-Head), separating the classification and detection heads. It replaces the traditional anchor-based approach with an anchor-free method, as shown in
Figure 2. The original Objectness branch is removed, leaving only the decoupled classification and regression branches. Moreover, the regression branch utilizes the integral form representation proposed in Distribution Focal Loss, allowing each independent branch to focus more on its respective feature information. YOLOv8 instance segmentation (YOLOv8-seg), an extension of the YOLOv8 model for instance segmentation, enhances the base target detection model by incorporating the YOLACT [
28] network to achieve pixel-level instance segmentation. The model outputs masks, class labels, and confidence scores for each object located in the image.
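The integral-form regression mentioned above can be illustrated with a short sketch. This is a minimal pure-Python illustration of the Distribution Focal Loss decoding idea, not YOLOv8's actual implementation: each box distance is predicted as a discrete distribution over candidate offsets, and the decoded value is that distribution's expectation (the five-bin distribution below is a toy choice).

```python
import math

def dfl_expected_distance(logits):
    """Integral form of DFL: softmax the per-bin logits into a discrete
    distribution over offsets {0, 1, ..., n}, then take its expectation."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]                  # softmax over bins
    return sum(i * p for i, p in enumerate(probs))     # E[d] = sum_i i * P(i)

# A distribution peaked between bins 2 and 3 decodes to a sub-bin distance:
print(dfl_expected_distance([0.0, 0.0, 4.0, 4.0, 0.0]))  # ~2.48
```

Because the branch learns a distribution rather than a single scalar, it can express ambiguity at blurred or occluded box edges while still decoding to one continuous distance.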
2.2.2. Structure of the BFFDC-YOLOv8-Seg Network
To enhance the accuracy of weed segmentation in complex, cluttered agricultural fields with overlapping plants, while ensuring model simplicity and real-time segmentation, this paper introduces the new BFFDC-YOLOv8-seg instance segmentation network. As shown in
Figure 3, the network reconstructs the Concat module in the original Neck structure, replacing the existing FPN and PAN with BiFPN for innovative multiscale feature fusion, effectively enhancing the network's ability to detect small targets. DSConv is also introduced to replace some of the convolutions in the original Backbone; integrating the features extracted by conventional convolutions with those of DSConv increases the flexibility of the convolution kernels, thus improving segmentation accuracy on the irregular edges of plant stems and leaves. The BFFDC-YOLOv8-seg network achieves precise segmentation of highly similar, cluttered, and overlapping weeds against complex field backgrounds while remaining lightweight and real-time, making it feasible to run on standalone devices.
Ultralytics officially provides five different scales of networks (N/S/M/L/X) and corresponding initial weight files on GitHub to cater to various application scenarios. These pretrained weight files contain model parameters such as the weights and biases of each layer, and using them improves both the accuracy and the speed of training. However, the official weight files, trained on the COCO2017 dataset, lack the capability to perceive vegetable field environments. To better acclimate the model to vegetable fields and achieve improved training outcomes, this paper utilizes a public plant instance segmentation dataset with over 5000 images. By iterating 200 times on the original YOLOv8(N/S/M/L/X)-seg networks, this study obtains training weights adapted to vegetable field environments, which serve as optimized initial weights for the model.
Table 2 clearly shows that under the same training batches, the S model has significantly lower inference speed compared to the N model, with only a 1.3% improvement in accuracy and a substantial increase in model size. The M/L/X models, compared to the N model, show a notable increase in size and a significant decrease in inference speed, with a maximum of only 1.9% improvement in accuracy. With no significant gains in detection accuracy, the larger models require substantial storage resources and higher processing power, making them unsuitable for resource-limited laser weeding devices. Therefore, this paper chooses to optimize the N-scale model.
The Concat module in the Neck section, which includes both FPN and PAN, plays a critical role in the fusion of image information. As depicted in
Figure 4a and described by Equation (1), a traditional FPN takes input features $P^{in} = (P_3^{in}, P_4^{in}, \dots, P_7^{in})$ at levels 3–7, where $P_l^{in}$ denotes a feature map whose resolution is $1/2^l$ that of the input image. Features are aggregated from top to bottom:

$$P_7^{out} = Conv\big(P_7^{in}\big), \qquad P_l^{out} = Conv\big(P_l^{in} + Resize(P_{l+1}^{out})\big), \quad l = 6, 5, 4, 3 \tag{1}$$

Here, $Resize$ usually denotes an upsampling or downsampling operation for resolution matching, while $Conv$ is typically used for feature processing. This restricts feature fusion to a unidirectional flow of information, which is ineffective in extracting features of small weed targets in agricultural settings with high similarity and indistinct color features. To enhance the detection of small targets in agricultural environments, this paper introduces a Bidirectional Feature Pyramid Network (BiFPN).
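The unidirectional top-down fusion of Equation (1) can be sketched in a few lines of pure Python on 1D toy features; `resize` (nearest-neighbour) and `conv` (a fixed scaling) are simple stand-ins for the real learned operations:

```python
def resize(feat, new_len):
    """Stand-in for Resize in Equation (1): nearest-neighbour resampling."""
    return [feat[i * len(feat) // new_len] for i in range(new_len)]

def conv(feat):
    """Stand-in for Conv in Equation (1); a real network learns this."""
    return [0.5 * v for v in feat]

def fpn_top_down(p_in):
    """Equation (1): fuse from the coarsest level downward. Information
    flows one way only, which is the limitation BiFPN addresses."""
    levels = sorted(p_in)                            # e.g. [5, 6, 7]
    out = {levels[-1]: conv(p_in[levels[-1]])}       # P7_out = Conv(P7_in)
    for l in reversed(levels[:-1]):                  # l = 6, 5, ...
        up = resize(out[l + 1], len(p_in[l]))
        out[l] = conv([a + b for a, b in zip(p_in[l], up)])
    return out

p_in = {5: [1.0, 2.0, 3.0, 4.0], 6: [1.0, 2.0], 7: [4.0]}
print(fpn_top_down(p_in)[5])  # [1.25, 1.75, 2.5, 3.0]
```

Note that nothing computed at a fine level ever reaches a coarser one; a small-object signal present only in $P_3^{in}$ cannot influence the higher-level outputs.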
BiFPN represents an efficient bidirectional framework for cross-scale connections and fast normalization of feature fusion. From the network topology (
Figure 4b,c), it can be seen that BiFPN modifies the multiscale connections within the PAN architecture. Initially, it removes network nodes that have only a single input feature edge and perform no fusion, creating a simplified bidirectional network. Subsequently, when the original input and output nodes are at the same level, an additional pathway is added between them to enable more feature fusion without significantly increasing computational cost. Lastly, unlike PAN, which features only one top-down and one bottom-up pathway, each bidirectional (top-down and bottom-up) pathway is treated as a feature network layer and repeated multiple times to achieve higher-level feature fusion.
Different input features have varying resolutions, and compared to the high resolution of crops, the smaller resolution of weed inputs leads to a significant imbalance in the network’s output contributions. Traditional methods treat all input features equally without distinction, which is not ideal in practical applications. Tan et al. [
24] assigned varying weights to input features, significantly enhancing the network's performance in detecting small objects. Therefore, we propose adding an additional weight to each input feature, enabling the network to learn the significance of each feature and preventing it from overlooking the small-scale features of weeds. Based on this concept, BiFPN uses fast normalized fusion, an efficient and stable weighted fusion mechanism described by Equation (2):

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \tag{2}$$

A ReLU activation function is applied after each weight $w_i$ to ensure that $w_i \ge 0$, and $\epsilon = 0.0001$ is a small value that avoids numerical instability, keeping each normalized weight between 0 and 1. Fast normalized fusion is similar in learning behavior and accuracy to the Softmax-based fusion of Equation (3),

$$O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i \tag{3}$$

but omits the Softmax operation, allowing BiFPN to run up to 30% faster on GPUs. To further enhance efficiency, we use depthwise separable convolutions for feature fusion and add batch normalization and an activation after each convolution.
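The two fusion rules of Equations (2) and (3) can be contrasted in a short pure-Python sketch (scalar toy inputs; in the network each $I_i$ is a feature map and the weights are learned parameters):

```python
import math

EPS = 1e-4  # the epsilon of Equation (2)

def fast_normalized_fusion(weights, inputs):
    """Equation (2): O = sum_i w_i / (eps + sum_j w_j) * I_i.
    A ReLU keeps every w_i >= 0, so each normalized weight lies in [0, 1]."""
    w = [max(0.0, v) for v in weights]         # ReLU on each raw weight
    s = EPS + sum(w)
    return sum(wi / s * xi for wi, xi in zip(w, inputs))

def softmax_fusion(weights, inputs):
    """Equation (3): Softmax-based fusion; similar behaviour, but the
    exp() calls make it slower on GPU."""
    e = [math.exp(v) for v in weights]
    s = sum(e)
    return sum(ei / s * xi for ei, xi in zip(e, inputs))

# Both mechanisms favour the input carrying the larger learned weight:
print(fast_normalized_fusion([1.0, 3.0], [10.0, 20.0]))  # ~17.5
print(softmax_fusion([1.0, 3.0], [10.0, 20.0]))          # ~18.8
```

The normalization in Equation (2) needs only one division per input instead of an exponential, which is where the reported GPU speedup comes from.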
The original Concat module is restructured, using BiFPN to replace the traditional FPN and PAN for a novel feature fusion approach, thereby assigning higher weights to small object features, enhancing the focus on small targets.
In laser weeding operations, targeting the critical tissue parts of weeds with laser beams is essential for effective weed eradication; imprecise targeting can increase the accidental injury rate to crop seedlings. The original YOLOv8-seg network relies on the detection accuracy of bounding boxes within the Backbone, but square bounding boxes are not sensitive to the local information of irregular targets. Therefore, to enhance the network’s perception of the irregular edges of weeds, deformable convolutions (DCNs) [
29] are considered for integration into the Backbone, allowing some
convolutional kernels to adjust their shapes to fit the irregular structures of weeds, while maintaining the stability of the convolutional structure and reducing deviation. Given that Dynamic Snake Convolution (DSConv) [
30] performs well in segmenting tubular structures, adapting to slender and twisted local structural features to enhance geometric structure perception, this paper introduces DSConv, constructing convolutional kernels with strong perception of irregular curves.
This section elucidates the application of DSConv in extracting the irregular local features of weed stems and leaf edges. Assume standard 2D convolutional coordinates $K$, with the center coordinate $K_i = (x_i, y_i)$. The original $3 \times 3$ convolutional kernel $K$ is then represented by

$$K = \{(x_i - 1, y_i - 1), (x_i - 1, y_i), \dots, (x_i + 1, y_i + 1)\} \tag{4}$$

By introducing deformation offsets $\Delta$, the convolutional kernel becomes more flexible, focusing on the irregular edges of tubular weed stems and leaves.
Figure 5 linearizes the standard kernel along both axial directions, expanding it into a kernel of size 9. Taking the $x$-axis direction as an example, each grid position in $K$ is denoted as $K_{i \pm c} = (x_{i \pm c}, y_{i \pm c})$, where $c = \{0, 1, 2, 3, 4\}$ represents the horizontal distance from the center grid. The selection of each grid position $K_{i \pm c}$ in kernel $K$ is a cumulative process. Starting from the central position $K_i$, the position away from the center grid depends on the position of the previous grid: $K_{i+1}$ increases by an offset $\Delta = \{\delta \mid \delta \in [-1, 1]\}$ relative to $K_i$ [30]. Therefore, the offsets need to be accumulated to ensure that the kernel conforms to a linear structural form.
The change in the $x$-axis direction is

$$K_{i \pm c} = \begin{cases} (x_{i+c}, y_{i+c}) = \left(x_i + c, \; y_i + \sum_{i}^{i+c} \Delta y\right), \\ (x_{i-c}, y_{i-c}) = \left(x_i - c, \; y_i + \sum_{i-c}^{i} \Delta y\right) \end{cases} \tag{5}$$
The change in the $y$-axis direction is

$$K_{j \pm c} = \begin{cases} (x_{j+c}, y_{j+c}) = \left(x_j + \sum_{j}^{j+c} \Delta x, \; y_j + c\right), \\ (x_{j-c}, y_{j-c}) = \left(x_j + \sum_{j-c}^{j} \Delta x, \; y_j - c\right) \end{cases} \tag{6}$$
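Under the stated assumptions (a size-9 kernel, one learned offset per grid, each offset in [−1, 1]), the cumulative construction along the $x$-axis can be sketched in pure Python; the offset values below are hypothetical:

```python
def snake_positions_x(x_i, y_i, right_dy, left_dy):
    """Grid positions of an x-axis DSConv kernel, following the cumulative
    rule of Equation (5). right_dy / left_dy hold the delta-y offsets for
    the grids to the right and left of the centre; offsets accumulate so
    every grid stays attached to its neighbour and the kernel keeps a
    connected, snake-like linear form."""
    pts = [(x_i, y_i)]                            # centre grid K_i
    acc = 0.0
    for c, dy in enumerate(right_dy, start=1):    # positions x_i + c
        acc += dy                                 # y_i + accumulated delta-y
        pts.append((x_i + c, y_i + acc))
    acc = 0.0
    for c, dy in enumerate(left_dy, start=1):     # positions x_i - c
        acc += dy
        pts.insert(0, (x_i - c, y_i + acc))
    return pts

pts = snake_positions_x(0, 0, [1.0, 1.0, -0.5, 0.0], [0.5, 0.5, 0.5, 0.5])
print(pts)
```

Because each grid moves at most one unit relative to its predecessor, the kernel can bend along a curved stem but can never tear into disconnected samples, which is the stability advantage over free-form deformable offsets.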
Since the offset $\Delta$ is typically fractional while coordinates are usually integers, bilinear interpolation is employed, expressed as

$$K = \sum_{K'} B(K', K) \cdot K' \tag{7}$$

Here, $K$ represents the fractional positions produced by Equations (5) and (6), $K'$ enumerates all integer spatial positions, and $B$ is a bilinear interpolation kernel, which can be decomposed into two one-dimensional kernels:

$$B(K, K') = b(K_x, K'_x) \cdot b(K_y, K'_y) \tag{8}$$
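A minimal pure-Python rendering of Equations (7) and (8) follows (a tiny 2×2 feature map; real DSConv implementations vectorize this over tensors):

```python
def b1(k, kp):
    """One-dimensional kernel b of Equation (8): the usual triangular
    bilinear weight, non-zero only for the two nearest integers."""
    return max(0.0, 1.0 - abs(k - kp))

def bilinear_sample(grid, x, y):
    """Equation (7): the value at a fractional position K = (x, y) is a
    weighted sum over all integer positions K', with the separable weight
    B(K, K') = b(x, x') * b(y, y')."""
    value = 0.0
    for yp in range(len(grid)):          # enumerate integer positions K'
        for xp in range(len(grid[0])):
            value += b1(x, xp) * b1(y, yp) * grid[yp][xp]
    return value

grid = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 1.5
```

Because $b$ vanishes beyond a distance of 1, only the four surrounding pixels contribute, and sampling at an exact integer position returns that pixel unchanged.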
As shown in
Figure 6, deformation along both the $x$- and $y$-axes enables the dynamic snake convolutional kernels described in this paper to cover a $9 \times 9$ receptive field during their deformation, better adapting to elongated tubular structures and enhancing the perception of critical features. Using size-9 deformable kernels to perform the function of larger static kernels allows for greater flexibility in the model's kernels while keeping the increase in scale minimal.