The Rep-ViG-Apple network architecture is illustrated in Figure 3. The feature extraction network is Rep-Vision-GCN (backbone), and the feature fusion network is Rep-FPN-PAN (neck). The prediction head adopts the decoupled-head design of YOLOX [15], consisting of a classification head and a regression head. The loss function of the regression head comprises two parts: the CIoU loss and the distribution focal loss (DFL). For sample matching, the TaskAlignedAssigner [16] strategy is used for positive and negative sample assignment, together with the anchor-free [17] strategy. The classification head uses the binary cross-entropy (BCE) loss function.
2.2.1. Rep-Vision-GCN Feature Extraction Network
In apple detection tasks within orchard environments, complex weather conditions and distracting orchard backgrounds pose significant challenges. Occlusion by branches and leaves, as well as overlapping fruits, makes it difficult to accurately identify apple targets. YOLOv8n, with its single-scale feature extraction, often fails to capture sufficient feature information in these complex environments and multi-scale detection scenarios. To address these challenges, this paper proposes an inverted residual re-parameterized multi-scale feature extraction module (RepIRD Block), designed to enhance feature extraction for occluded and overlapping targets. In addition, the YOLOv8n model employs the traditional convolutional neural network (CNN) architecture, which treats images as grid structures: convolution operations are performed within a fixed-size local receptive field, making it difficult to extract global, long-range feature information and to suppress interference from complex environments. A sparse graph attention mechanism is therefore employed to increase focus on apples and suppress interference from complex environmental features. Together, these components form a robust feature extraction network tailored for apple detection in complex environments.
Our proposed inverted residual multi-scale re-parameterization feature extraction module (RepIRD Block) is a lightweight, multi-branch architecture that leverages depthwise convolutions to learn more expressive features across high-dimensional feature maps. The RepIRD Block reduces computational complexity by decoupling the spatial and channel dimensions: multi-scale feature information is extracted in the spatial dimension using RepDwConv, followed by feature extraction in the channel dimension using 1 × 1 standard convolutions. The features output by RepDwConv then undergo cross-channel information exchange to enhance the model's non-linear representational capacity, and residual connections are incorporated to mitigate the risk of gradient vanishing. During the training phase, the RepIRD Block leverages the multi-branch structure of RepDwConv to capture rich apple feature information. In the inference phase, RepDwConv is re-parameterized into a single-branch structure, thereby reducing the computational and parameter burden of the model. This yields a compressed model that is computationally efficient without sacrificing accuracy on apple detection tasks. The structure of the RepIRD Block is illustrated in Figure 4.
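As a concrete illustration of this design, the following PyTorch sketch shows how an inverted-residual block with a re-parameterizable multi-branch depthwise convolution can be organized; the branch kernel sizes, expansion ratio, and activation choice are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepDwConv(nn.Module):
    """Multi-branch depthwise convolution for training; the parallel branches can be
    fused into a single depthwise kernel for inference. Branch kernel sizes are
    illustrative assumptions."""
    def __init__(self, channels, kernel_sizes=(5, 3, 1)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False)
            for k in kernel_sizes)
        self.bn = nn.BatchNorm2d(channels)
        self.fused = None                      # single-branch conv after re-parameterization

    def forward(self, x):
        if self.fused is not None:             # inference: one depthwise convolution
            return self.bn(self.fused(x))
        return self.bn(sum(b(x) for b in self.branches))   # training: multi-branch sum

    @torch.no_grad()
    def reparameterize(self):
        # Because convolution is linear, summing branch outputs equals convolving with
        # the sum of the (zero-padded) branch kernels, so the branches collapse into one.
        k_max = max(b.kernel_size[0] for b in self.branches)
        fused = nn.Conv2d(self.branches[0].in_channels, self.branches[0].out_channels,
                          k_max, padding=k_max // 2, groups=self.branches[0].groups,
                          bias=False)
        weight = torch.zeros_like(fused.weight)
        for b in self.branches:
            pad = (k_max - b.kernel_size[0]) // 2
            weight += F.pad(b.weight, [pad, pad, pad, pad])
        fused.weight.copy_(weight)
        self.fused = fused


class RepIRDBlock(nn.Module):
    """Inverted residual layout: 1x1 expansion -> RepDwConv (spatial, multi-scale) ->
    1x1 projection for cross-channel information exchange, with a residual connection."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                    nn.BatchNorm2d(hidden), nn.GELU())
        self.dw = RepDwConv(hidden)
        self.project = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.project(self.dw(self.expand(x)))   # residual mitigates vanishing gradients
```

Calling reparameterize() after training replaces the parallel depthwise branches with their summed kernel, so inference runs with a single-branch structure as described above.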
In 2016, Thomas Kipf et al. [18] first proposed graph convolutional neural networks, applying convolutional methods to graph structures with excellent performance. Today, graph convolutional neural networks are applied in many fields, such as computer vision tasks [19,20] and natural language processing tasks [21,22,23]. The sparse vision graph attention (SVGA) [24] module in the graph convolutional network (GCN) architecture addresses the difficulty of capturing global, long-range feature information by constructing a sparse graph structure for feature extraction. The sparse graph attention mechanism captures global features, enhancing focus on apples and mitigating interference from complex environments. This effectively improves the ability to recognize apple targets under challenging conditions such as rainy, foggy, and snowy weather and low-light environments at night. SVGA consists of two main processes: sparse graph construction and feature information aggregation and update.
(1) Sparse Graph Construction Process in SVGA
In constructing the sparse graph structure within the sparse vision graph attention (SVGA) module, the input feature map is segmented, with each image pixel considered as a vertex. Aggregation operations are performed in parallel along the width and height dimensions of the feature map. This process is illustrated in
Figure 5.
In Figure 5, k denotes the moving step length of the aggregation operation (k = 2), Height represents the height of the feature map, Width indicates the width of the image, Down stands for the downward direction, and Right signifies the rightward direction.
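The following Python sketch illustrates one way to enumerate such sparse connections, linking each vertex to every k-th vertex along its row (Right) and its column (Down); the exact connectivity pattern used by SVGA may differ in detail.

```python
def svga_neighbors(height: int, width: int, k: int = 2):
    """For each vertex (pixel) of an H x W feature map, list its sparse neighbors:
    every k-th pixel along the same row (Right) and the same column (Down).
    Vertices are indexed row-major; the pattern is an assumption based on Figure 5."""
    neighbors = {}
    for i in range(height):
        for j in range(width):
            v = i * width + j
            row = [i * width + jj for jj in range(j % k, width, k) if jj != j]
            col = [ii * width + j for ii in range(i % k, height, k) if ii != i]
            neighbors[v] = row + col
    return neighbors


# Example: on an 8 x 8 map with k = 2, vertex (0, 0) connects to columns 2, 4, 6
# of row 0 and rows 2, 4, 6 of column 0.
print(svga_neighbors(8, 8)[0])
```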
(2) Process of Aggregating and Updating Feature Information in SVGA
The SVGA module employs max-neighbor graph convolution (MRConv) for the aggregation and update operations. The max aggregator consolidates feature information from the current vertex and its neighboring vertices; in the update operation, a fully connected layer integrates the feature information of the current vertex and its neighbors. The specific methodology is given by Equations (1)–(4):

$$x_i' = h\left(x_i,\ g\left(x_i, \mathcal{N}(v_i), W_{\mathrm{agg}}\right),\ W_{\mathrm{update}}\right) \quad (1)$$

$$g(\cdot) = \max\left(\left\{\,x_j - x_i \mid v_j \in \mathcal{N}(v_i)\,\right\}\right) W_{\mathrm{agg}} \quad (2)$$

$$h(\cdot) = \mathrm{mlp}\left(\left[\,x_i,\ g(\cdot)\,\right] W_{\mathrm{update}}\right) \quad (3)$$

$$x_i' = \mathrm{GeLU}\left(h(\cdot)\right) \quad (4)$$

In Equations (1)–(4), $v_i$ denotes a vertex, $g(\cdot)$ represents the vertex feature aggregation function, and $h(\cdot)$ signifies the vertex feature update function. $x_i$ indicates the feature information of vertex $v_i$ before aggregation and update, $x_j$ denotes the feature information of the sparse neighboring vertices of vertex $v_i$, and $\mathcal{N}(v_i)$ represents the set of sparse neighboring vertices of vertex $v_i$. $v_j$ refers to a sparse neighboring vertex of $v_i$, and $W_{\mathrm{agg}}$ and $W_{\mathrm{update}}$ stand for the weights of the aggregation and update functions. GeLU indicates the activation function, and mlp represents the multi-layer perceptron operation.
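A minimal PyTorch sketch of the aggregation and update steps in Equations (1)–(4) is given below, assuming vertex features are flattened to an (N, C) matrix and the sparse neighbor indices are precomputed (for example, with the enumeration sketched above); realizing the update weights as a single linear layer is an assumption.

```python
import torch
import torch.nn as nn

class MaxRelativeGraphConv(nn.Module):
    """Max-neighbor (max-relative) graph convolution sketch following Equations (1)-(4):
    g(.) aggregates the element-wise maximum of (x_j - x_i) over the sparse neighbors,
    and h(.) updates the vertex by feeding [x_i, g(.)] through an mlp with GeLU."""
    def __init__(self, channels):
        super().__init__()
        # W_update realized as a small mlp on the concatenated features (assumption)
        self.update = nn.Sequential(nn.Linear(2 * channels, channels), nn.GELU())

    def forward(self, x, neighbor_idx):
        # x: (N, C) vertex features; neighbor_idx: (N, M) long tensor of sparse neighbors
        xj = x[neighbor_idx]                          # (N, M, C) neighbor features x_j
        relative = xj - x.unsqueeze(1)                # x_j - x_i
        aggregated = relative.max(dim=1).values       # g(.): max over neighbors, (N, C)
        return self.update(torch.cat([x, aggregated], dim=-1))   # h(.): GeLU(mlp(...))
```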
Finally, by integrating the mlp module, SVGA effectively mitigates the over-smoothing phenomenon caused by stacking multiple graph convolution layers in deep networks, while also enhancing the model's non-linear expressive capability. The specific procedure of the SVGA module is formalized in Equations (5) and (6):

$$Y = \mathrm{act}\left(\mathrm{MRConv}(X)\right) + X \quad (5)$$

$$Z = \mathrm{mlp}(Y) + Y \quad (6)$$

where $X$ denotes the input feature map information, $Y$ denotes the output of the graph convolution stage, and $Z$ denotes the output of the SVGA module. The mlp refers to the multi-layer perceptron operations, which are combined with the residual connections to prevent network performance degradation in deeper layers, and act represents the GeLU activation function. The SVGA module is illustrated in Figure 6.
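The two residual stages of Equations (5) and (6) can be sketched as follows; the graph convolution is passed in as a shape-preserving callable (for example, the max-relative operation above applied to the flattened pixels), and the mlp expansion ratio is an assumption.

```python
import torch.nn as nn

class SVGABlock(nn.Module):
    """SVGA module structure following Equations (5) and (6): graph convolution with a
    residual connection, then an mlp (implemented here with 1x1 convolutions) with a
    second residual connection. The expansion ratio of 4 is an illustrative assumption."""
    def __init__(self, channels, graph_conv):
        super().__init__()
        self.graph_conv = graph_conv          # any shape-preserving graph convolution
        self.act = nn.GELU()                  # 'act' in Equation (5)
        self.mlp = nn.Sequential(nn.Conv2d(channels, 4 * channels, 1),
                                 nn.GELU(),
                                 nn.Conv2d(4 * channels, channels, 1))

    def forward(self, x):                     # x: (B, C, H, W) input feature map
        y = self.act(self.graph_conv(x)) + x  # Equation (5): Y = act(MRConv(X)) + X
        return self.mlp(y) + y                # Equation (6): Z = mlp(Y) + Y
```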
Combining the strengths of CNN in local multi-scale feature extraction with the advantages of GCN for global modeling, we designed a CNN-GCN architecture feature extraction network (Rep-Vision-GCN) using the RepIRD Block module and the SVGA block module. The combination of CNN and GCN allows for the effective capture of both local and global image information, thereby enhancing the model’s feature extraction ability and improving its overall performance and effectiveness.
The Rep-Vision-GCN network is composed of 9 layers, utilizing the RepIRD Block module to progressively extract multi-scale visual feature information of apples in complex environments from shallow to deep layers. In the deeper layers, semantic information is extracted using the SVGA block and the SPPF module. The SVGA block consists of two SVGA modules connected in series.
The Rep-Vision-GCN network uses the feature maps output from the 4th layer with a size of (84, 80, 80), from the 6th layer with a size of (168, 40, 40), and from the 9th layer with a size of (256, 20, 20) as inputs of three different scales for the feature fusion network. Detailed information on the Rep-Vision-GCN apple feature extraction network is provided in
Table 2.
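A skeleton of how the backbone exposes these three scales to the neck is sketched below; the grouping of the nine layers into stages is an assumption made for illustration, while the output shapes follow Table 2 for a 640 × 640 input.

```python
import torch.nn as nn

class RepVisionGCN(nn.Module):
    """Skeleton of the backbone's multi-scale outputs. The stage modules stand in for
    the stacked RepIRD Blocks and the SVGA block + SPPF described in the text; the
    grouping of the 9 layers into these stages is an assumption."""
    def __init__(self, stem, stage3, stage4, stage5):
        super().__init__()
        self.stem = stem          # shallow RepIRD layers
        self.stage3 = stage3      # up to layer 4
        self.stage4 = stage4      # up to layer 6
        self.stage5 = stage5      # up to layer 9 (SVGA block + SPPF)

    def forward(self, x):         # x: (B, 3, 640, 640)
        x = self.stem(x)
        p3 = self.stage3(x)       # (B,  84, 80, 80)  layer 4 output
        p4 = self.stage4(p3)      # (B, 168, 40, 40)  layer 6 output
        p5 = self.stage5(p4)      # (B, 256, 20, 20)  layer 9 output
        return p3, p4, p5         # three scales passed to the Rep-FPN-PAN neck
```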
2.2.2. Rep-FPN-PAN Feature Fusion Network
In the task of apple detection within complex backgrounds, apples in the same scene vary in size due to differences in distance, which makes small, distant objects difficult to recognize. The feature fusion module (C2f) in the YOLOv8n model operates at a single scale and relies on a computationally complex split operation. To enhance the multi-scale representational ability of the feature fusion module, a RepConvsBlock multi-scale feature fusion module is proposed, and a Rep-FPN-PAN multi-scale apple feature fusion network is designed to address the recognition difficulties caused by varying apple sizes in near and far views. During training, a multi-branch structure is used to fuse features; during inference, it is consolidated into a single-branch structure. This reduces the model's parameter count and computational load, thereby improving inference speed and avoiding wasted computational resources.
The ReLU activation function in the re-parameterization module of the RepConv [25] structure may lead to the vanishing-gradient problem during training, since its gradient is zero for all negative inputs. The GeLU activation function, by contrast, is smooth, continuous, and differentiable at zero, thereby mitigating the issue of local gradient vanishing. The computation of the GeLU activation function is given in Equation (7):

$$\mathrm{GeLU}(x) = x \cdot \Phi(x) = \frac{x}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right) \quad (7)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. The computation of the ReLU activation function is given in Equation (8):

$$\mathrm{ReLU}(x) = \max(0, x) \quad (8)$$

The RepConvs module is constructed by replacing the ReLU activation function with the GeLU activation function for output processing. The RepConvsBlock and RepConvsBlockPlus modules are composed of serially connected RepConvs modules: the RepConvsBlock consists of two serially connected RepConvs, while the RepConvsBlockPlus consists of three.
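A PyTorch sketch of this composition is given below; the training-time branch layout (3 × 3, 1 × 1, and identity branches) follows the common RepConv design, the replacement activation is assumed to be GeLU as described above, and the fusion into a single branch at inference is omitted for brevity.

```python
import torch.nn as nn

class RepConvs(nn.Module):
    """RepConv-style training structure with the activation swapped as described above
    (GeLU assumed here). Parallel 3x3, 1x1, and identity-BN branches are summed; fusing
    them into a single 3x3 convolution for inference is omitted."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv3x3 = nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
                                     nn.BatchNorm2d(c_out))
        self.conv1x1 = nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride, 0, bias=False),
                                     nn.BatchNorm2d(c_out))
        self.identity = nn.BatchNorm2d(c_out) if c_in == c_out and stride == 1 else None
        self.act = nn.GELU()   # replaces ReLU of the original RepConv (assumption)

    def forward(self, x):
        out = self.conv3x3(x) + self.conv1x1(x)
        if self.identity is not None:
            out = out + self.identity(x)
        return self.act(out)


def rep_convs_block(channels, n=2):
    """RepConvsBlock (n = 2) and RepConvsBlockPlus (n = 3): n serially connected RepConvs."""
    return nn.Sequential(*[RepConvs(channels, channels) for _ in range(n)])
```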
The Rep-FPN-PAN feature fusion network consists of the Rep-FPN (rep feature pyramid network) and the Rep-PAN (rep path aggregation network), as illustrated in Figure 7. The Rep-FPN is a top-down network that fuses deep multi-scale semantic information with shallow multi-scale visual information. Conversely, the Rep-PAN is a bottom-up network that further integrates the feature information output by the Rep-FPN with the deep semantic feature information.
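The top-down and bottom-up flow can be sketched as follows; the channel widths match the backbone outputs, while the use of nearest-neighbor upsampling, strided convolutions for downsampling, and the simplified fusion blocks are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def fusion_block(c_in, c_out):
    # Stand-in for a RepConvsBlock that also adjusts the channel width (assumption).
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
                         nn.GELU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out), nn.GELU())

class RepFPNPAN(nn.Module):
    """Top-down (Rep-FPN) and bottom-up (Rep-PAN) fusion over the three backbone scales."""
    def __init__(self, c3=84, c4=168, c5=256):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse_p4 = fusion_block(c5 + c4, c4)          # top-down fusion at stride 16
        self.fuse_p3 = fusion_block(c4 + c3, c3)          # top-down fusion at stride 8
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.fuse_n4 = fusion_block(c3 + c4, c4)          # bottom-up fusion at stride 16
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.fuse_n5 = fusion_block(c4 + c5, c5)          # bottom-up fusion at stride 32

    def forward(self, p3, p4, p5):
        # Rep-FPN: deep semantic information flows down into shallow visual features
        t4 = self.fuse_p4(torch.cat([self.up(p5), p4], dim=1))
        t3 = self.fuse_p3(torch.cat([self.up(t4), p3], dim=1))
        # Rep-PAN: fused shallow features flow back up and merge with deep features
        n4 = self.fuse_n4(torch.cat([self.down3(t3), t4], dim=1))
        n5 = self.fuse_n5(torch.cat([self.down4(n4), p5], dim=1))
        return t3, n4, n5
```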
2.2.3. Model Pruning
The apple detection algorithm for complex backgrounds is a deep convolutional neural network model, which requires substantial computational power and memory for inference. By applying neural network pruning algorithms to eliminate channels that carry less feature information, the model's parameter count and computational load can be effectively reduced. Layer-adaptive magnitude-based pruning (LAMP) [26] is an adaptive sparse pruning method for fine-grained, unstructured pruning of model weights. However, the weight matrix after unstructured pruning is stored in a sparse, indexed format, which is not conducive to parallel computation. Channel pruning, by contrast, is a structured sparse pruning method at the channel level that preserves the original convolutional structure of the pruned model. We therefore employ channel pruning based on LAMP scores to compress the Rep-ViG-Apple model. The trained Rep-ViG-Apple model is first pruned, and then retrained and fine-tuned with a lower learning rate to recover the performance lost due to pruning.
The channel pruning method based on LAMP scores involves computing the sum of squares of the weights of each channel, sorting the channels by this value, and establishing an index mapping according to the sorted order. That is, for indices $u$ and $v$, if index $u$ is less than index $v$, then the sum of squared channel weights $W[u]^2$ corresponding to index $u$ is less than or equal to the sum of squared channel weights $W[v]^2$ corresponding to index $v$. The channels with the smaller sums of squared weights are cut off to compress the model. The LAMP score is defined in Equation (9):

$$\mathrm{score}(u; W) = \frac{W[u]^2}{\sum_{v \geq u} W[v]^2} \quad (9)$$
Here, $W[u]^2$ represents the sum of squares of the weights of the target channel, and $\sum_{v \geq u} W[v]^2$ represents the sum of squares of the weights of all remaining channels that have not been pruned and whose indices are greater than or equal to the target channel index. $\mathrm{score}(u; W)$ indirectly reflects the relative importance of the target channel compared with the unpruned channels; specifically, the larger the sum of squares of a channel's weights, the more important the channel is, as expressed in Equation (10):

$$W[u]^2 \leq W[v]^2 \;\Rightarrow\; \mathrm{score}(u; W) \leq \mathrm{score}(v; W) \quad (10)$$
Channel pruning based on LAMP scores can be classified into two categories: global pruning and local pruning. Global pruning assigns adaptive sparsity to the different layers of the model based on the LAMP scores of all channels across the model. However, this approach may over-prune a specific layer, leading to a more pronounced degradation in the model's accuracy after pruning. In contrast, local pruning prunes channels by calculating the LAMP scores of the channels within each layer separately. This method applies the same channel pruning ratio to every layer, so that each layer retains a proportionate number of channels.
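A minimal sketch of per-layer (local) channel scoring and selection based on this formulation is shown below; grouping the weights by output channel and the example pruning ratio are assumptions made for illustration.

```python
import torch

def lamp_channel_scores(weight: torch.Tensor) -> torch.Tensor:
    """Per-channel LAMP scores for one convolutional layer, following Equation (9).
    weight: (C_out, C_in, kH, kW); each channel is scored by its sum of squared weights
    divided by the cumulative sum over all channels of greater or equal magnitude."""
    sq = weight.flatten(1).pow(2).sum(dim=1)          # W[u]^2: squared-weight sum per channel
    order = torch.argsort(sq)                         # ascending: smallest channels first
    sq_sorted = sq[order]
    # Suffix sums give the denominator of Equation (9): the sum over v >= u after sorting.
    denom = torch.flip(torch.cumsum(torch.flip(sq_sorted, dims=[0]), dim=0), dims=[0])
    scores = torch.empty_like(sq)
    scores[order] = sq_sorted / denom
    return scores


def channels_to_prune(weight: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Local pruning: remove the same fraction of lowest-scoring channels in this layer."""
    scores = lamp_channel_scores(weight)
    n_prune = int(ratio * weight.shape[0])
    return torch.argsort(scores)[:n_prune]            # indices of channels to cut off
```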