1. Introduction
Caterpillar fungus, formed by entomogenous fungi parasitic on the nymphs or larvae of insects, is an important constituent of animal-derived traditional Chinese medicines. In Asia, these caterpillar fungi are used as tonics and medicinal foods, playing significant roles in everyday dietotherapy [1]. The caterpillar fungus is celebrated for its extensive medicinal use, extraordinary life history, and high market value [2]. The fungus is believed to strengthen the lungs and kidneys, increase energy and vitality, stop bleeding, and decrease phlegm [3]. Caterpillar fungus and its products are available as over-the-counter remedies or tonics in Western countries, where they are promoted for their anti-aging, anti-cancer, and immune-boosting effects [4].
Relevant studies report that the current market price of the highest quality caterpillar fungus products is USD 140,000 per kilogram, making it one of the most expensive biological resources in the world. Current annual production levels suggest that the global trade in caterpillar fungus is worth between USD 500 million and USD 11 billion [5]. Research indicates that the international market price rose by 900% from 1997 to 2008 [6]. Given its substantial market value, caterpillar fungus is widely believed to play a significant role in local and national economies, and its collection has become a significant short-term seasonal activity for economic gain [7]. Local residents primarily collect caterpillar fungus through conventional manual digging, which remains the main collection method in most areas [8] and has yielded accumulated experience, such as the identification of the emergence characteristics of caterpillar fungus. However, manual collection is characterized by low efficiency and high labor costs [9]. Consequently, there is an urgent need for an intelligent caterpillar fungus identification method that enhances the efficiency and accuracy of recognition and replaces manual work. Such a method can ensure precise operation without compromising work efficiency, facilitate the expansion of the caterpillar fungus sales market, reduce labor costs, and increase sales.
Object detection is a pivotal area of research in agricultural production and life. As machine learning becomes an important tool for driving innovation and productivity across industries, many researchers have used traditional object detection algorithms to obtain sufficient information with the aim of improving detection accuracy. Some studies use region-of-interest (ROI) clustering and employ the Canny edge detector to achieve edge detection of crops [10]. Furthermore, classification systems for identifying crop diseases have been developed utilizing color transformation, color histograms, and comparison techniques [10]. Despite the evident potential of these systems in detecting various crop diseases, their detection accuracy remains constrained.
Although the above methods can extract sufficient target information, traditional target detection algorithms often need a long extraction time for small target detection, and their overall efficiency is low. These algorithms are therefore difficult to apply to the growth environment of caterpillar fungus. Given these limitations, deep learning and CNNs are increasingly demonstrating their crucial role in agriculture. At present, deep learning-based solutions have become mainstream for small target detection.
Tian et al. proposed a novel approach called MD-YOLO for detecting three typical small target lepidopteran pests on sticky insect boards [
11]. Liu et al. proposed an MAE-YOLOv8 model with YOLOv8s-p2 as the baseline for accurately detecting green crisp plums in real complex orchard environments. The prediction feature map preserves more feature information from small targets, thus improving the detection accuracy of small targets [
12].
In summary, deep learning research is advancing, particularly in the identification and detection of agricultural products and crop diseases, and a large number of studies have been conducted. Existing studies have focused on targets such as pests [11] and fruits [12], yielding good results. A number of relevant studies [13] have demonstrated that, although some proposed models enhance the recognition accuracy of small targets, their size and number of parameters are substantial, their inference speed is slow, and they are not lightweight [14]. With the development of deep learning technology, studies have addressed these lightweighting issues. However, feature extraction remains challenging when the targets are small and of low resolution [15].
The use of deep learning for the detection of caterpillar fungus has not been addressed in existing studies; at present, the field still relies on manual recognition. Integrating deep learning algorithms into this domain has the potential to mitigate many of the limitations and drawbacks of manual recognition, and the intelligent recognition of caterpillar fungus holds immense potential for commercial applications.
Deep learning-based small target detection has gradually replaced traditional methods. These models are primarily categorized into three types: multi-layer feature interaction, attention mechanisms, and data augmentation [
16]. The methods that involve multi-layer feature interaction include FPN [
17], PAN [
18], Bi-FPN [
19], FPPNet [
20] and PVDet [
21]. Attention mechanism methods include the spatial location attention module (SLA) [
22], the convolutional block attention module (MCAB) [
23], the triple attention module [
24] and position-aware (PA) attention [
25]. Data augmentation methods include techniques such as IDADA (Impartial Differentiable Automatic Data Augmentation) [
26] and oversampling operations [
27]. Although various methods have been proposed, the following challenges remain in the detection of caterpillar fungus:
Challenge 1: In its natural environment, caterpillar fungus exhibits a high degree of color similarity to its background. This similarity can impede target detection algorithms, which may erroneously identify the background as caterpillar fungus, compromising recognition accuracy and undermining the overall detection process.
Challenge 2: Detecting small targets in the caterpillar fungus dataset is difficult, particularly in natural environments where small targets are less obvious and therefore easily ignored during training. Although many current feature fusion modules can focus on the target region, they remain limited in capturing the finer-grained features and location information of small targets, which further restricts their feature extraction ability.
Challenge 3: The excessive size of existing models hinders their deployment on embedded devices. Consequently, a novel lightweight network architecture that preserves the characteristics of small targets is needed.
In order to address the challenges posed by background interference and the inconspicuous characteristics of caterpillar fungus, a multi-scale fusion self-attention (MRAA) model is proposed.
The present paper puts forward a feature fusion network (hereafter referred to as MRFPN) that integrates multi-scale context operations and branch diffusion mechanisms. The efficacy of these mechanisms in enhancing the distinctiveness of features for small targets, and in improving detection accuracy across various scales, is demonstrated in this study. The MRFPN architecture places particular emphasis on critical features of small caterpillar fungus during the training process, thus enhancing overall performance.
The Da-Conv module is integrated into the CSPDarknet53 network, effectively addressing the interference from surrounding weeds and background elements. The module is able to distinguish the target from distracting elements.
The experimental results highlighted substantial advantages of this model in terms of both detection accuracy and computational efficiency. Compared to traditional manual identification methods, this approach not only significantly improves recognition efficiency and accuracy but also accurately identifies caterpillar fungi of various sizes. This ensures precise operations while maintaining high work efficiency.
Deep learning has demonstrated autonomous learning capabilities, independently extracting low-level features from data and forming more sophisticated representations by integrating them, thereby uncovering the distributed characteristics of data. This approach is effective in detecting and locating targets of various sizes, excels in feature extraction, and addresses issues of low extraction efficiency. A number of researchers have successfully implemented small target detection using deep learning-based object detection methods. The primary categories of deep learning approaches include multi-layer feature fusion, attention mechanisms, and data augmentation [
16].
The integration of diverse feature information enhances the recognition of multi-scale targets and has been widely adopted in small target detection. This method computes features and performs recognition based on a single input scale, which achieves rapid processing speed while ensuring low memory consumption [16]. The Feature Pyramid Network (FPN) [17] effectively aggregates feature information across multiple scales to improve detection performance for objects of varying sizes; importantly, it achieves substantial improvements in detection accuracy for smaller objects. The Path Aggregation Network (PAN) [18] adopts a bidirectional fusion strategy that integrates both top-down and bottom-up pathways and incorporates two critical modules: adaptive feature pooling and fully connected fusion. This architecture enables effective multi-scale information aggregation, thereby enhancing the integration of feature information across various scales. As a result, PANet significantly improves detection performance, especially for small and occluded objects. Although FPN [17] and PAN [18] improved small target detection through bidirectional feature fusion, the weak features of small caterpillar fungus targets (such as feature maps susceptible to background interference) were not optimized.
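To make the top-down fusion concrete, the following is a minimal PyTorch sketch of an FPN-style merge over three backbone levels. The channel widths, layer names, and nearest-neighbor upsampling are illustrative assumptions, not the exact configurations of [17,18].

```python
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Minimal top-down FPN fusion over three backbone levels (C3, C4, C5)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone level to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        # upsample the coarser map and add the lateral connection
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```

The lateral 1 × 1 convolutions align channel widths so that the upsampled coarse map can be added element-wise to the finer map, which is what allows deep semantics to reach the high-resolution levels.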
In research on the YOLO series [28], the classic FPN + PAN network was adopted, achieving bidirectional propagation of features among the three layers of the backbone network and enhancing semantic information for positioning and recognition [16]. This mechanism not only effectively assists the network in detecting small objects but also accurately identifies partially occluded targets. Hu et al. [29] introduced an adaptive multi-layer feature fusion network, termed LAMFFNet. This network integrates the Adaptive Weighted Feature Fusion Module (AWMFF), which enhances feature refinement by adaptively weighting the outputs of different encoder layers, and incorporates a Local–Global Self-Attention Layer (LGSE). This approach effectively optimizes feature fusion, but the problem of information loss during feature diffusion is not solved.
Conventional channel attention mechanisms primarily focus on long-range dependencies; CoordAttention additionally preserves precise location information, thereby resulting in a more robust and comprehensive representation of features. Li et al. [
30] proposed a novel deep learning-based infrared small target detection method, termed YOLOSR-IST. The incorporation of a coordinated attention mechanism within the backbone network, in conjunction with the integration of low-level positional information into the shallowest feature map during the process of feature fusion, has been demonstrated to effectively mitigate the occurrence of missed detections and false alarms.
CBAM significantly enhances feature representation in convolutional neural networks (CNNs) [31]. By incorporating both channel and spatial attention mechanisms, CBAM strengthens critical features, leading to improved overall performance. In a related development, Liu et al. [32] introduced CBAM into the YOLOv5 model for the detection of small targets in underwater environments, with a particular focus on small, densely distributed, and occluded organisms. The experimental results demonstrate a good balance between detection accuracy and efficiency. However, CBAM [31] is unable to effectively process background noise in shallow features.
Liu et al. [33] introduced an innovative attention mechanism, termed BiFormer. A bidirectional attention mechanism was achieved by combining ResNet-18 with BiFormer and Reverse Attention. This combination concentrated network attention more effectively, enhancing detection in camouflage settings and improving model performance. In a similar vein, Zhang et al. [34] introduced the Triplet Attention Module, which enhanced the interaction between spatial and channel dimensions, thereby more effectively preserving critical feature information.
The above methods improve APS only to a limited extent. BiFormer [33] reduced computation through sparse attention but sacrificed the ability to capture the local features of tiny targets. The Triplet Attention Module [34] does not explicitly model color differences; when a target is close to the background color (such as caterpillar fungus among dead leaves), the attention weight tends to favor high-contrast regions and ignore low-contrast small targets. Because of interference from the field environment and the inherently weak characteristics of the target, no significant progress has been made in improving detection accuracy for caterpillar fungus.
Figure 1 depicts a common strategy to deepen the network architecture to provide additional features for small targets. In contrast, this paper proposes an innovative method, which introduces multi-scale convolution kernels into feature fusion through multi-scale fusion operation. By fusing features of different scales, the model’s adaptability to scale changes, object size differences, occlusion, and other situations is enhanced, compensating for the shortcomings of FPN [
17] and PAN [
18] in complex scenes. Through the branch diffusion operation, the positional information of small target caterpillar fungus is strengthened through multi-level feature diffusion, which is more flexible than the single bidirectional path of Bi-FPN [
19]. Multi-scale fusion + branch diffusion not only increases network depth but also solves the problem of the unclear characteristics of small target caterpillar fungus. This approach enhances key layer features of small targets while reducing confusion with other scale targets.
Furthermore, attention mechanisms have attracted considerable interest in recent years. To adapt to the MRFPN network, an attention mechanism was integrated into the multi-scale features. The Da-Conv module is embedded within the N-CSPDarknet53 backbone network with the objective of suppressing weed background interference through adaptive adjustment of the receptive field (illustrated in Figure 2 as the transition from the orange to the red box). The Da-Conv module is integrated into the low-level feature maps (e.g., the C3 stage), suppressing the propagation of background noise earlier than in Hu et al. [29], where attention is applied only in high-level feature fusion. The objective is to mitigate confusion between caterpillar fungus and the background by employing an attention-based adaptive weighted fusion mechanism, multi-scale fusion, and branch diffusion operations, thereby strengthening the features of small targets and improving feature extraction capability.
2. Method and Experimental Design
This study proposes an MRAA network based on YOLOv8s to address the problem of detecting caterpillar fungus in agricultural production and life. As illustrated in Figure 2, the MRAA network comprises two distinct components: MRFPN and N-CSPDarknet53. MRFPN consists of two operations: the multi-scale fusion operation and the branch diffusion operation. Firstly, the multi-scale fusion operation accepts multiple inputs of different scales; its interior contains an Inception-style module. Subsequently, the branch diffusion operation diffuses the features, rich in contextual information, to the various detection scales. Following feature enhancement, the individual feature maps are output to the subsequent detection heads. This section also introduces a new backbone network, designated N-CSPDarknet53, built on the CSPDarknet53 architecture. The integration of multiple Da-Conv modules significantly enhances the network's perceptual capability for small target caterpillar fungus.
As illustrated in
Figure 2, the red dashed line box signifies the network’s capacity for identifying caterpillar fungus characteristics, while the orange dashed line box denotes erroneous identification of the background as caterpillar fungus. With the incorporation of branch diffusion operation and Da-Conv, the network progressively focuses on the vicinity of caterpillar fungus, the features become increasingly discernible, and the interference from the surrounding background is diminished, thereby enhancing the accuracy of caterpillar fungus identification. Subsequently, as the depth of the network increases, MRAA focuses on the caterpillar fungus itself. This refinement leads to an enhancement in the precision of the characteristic and location information of the fungus, facilitating a gradual refinement of its distinguishing features, thereby effectively differentiating it from the background. The results section in
Figure 2 provides a visual representation of the identified caterpillar fungus.
2.1. Feature Fusion Network
The MRAA is composed of two components: the feature fusion network and N-CSPDarknet53. The extracted features are then used for target detection, which comprises the following stages:
FocusFeature Extraction: Initially, the features from the P4 layer are processed through FocusFeature Extraction. This module focuses on the key information of the input image and prepares it for subsequent network layers.
Next, a convolution operation is employed, and the features from disparate layers (e.g., the P4 and P5 layers) are integrated through a concat operation, enabling the network to leverage information at varied levels.
Subsequently, the C2f module conducts further in-depth processing on these features and outputs the feature map. The MRAA model architecture is illustrated in Figure 3.
As shown in Figure 4, FocusFeature is defined in MRFPN to improve the feature representation. [P3, P4, P5] are the feature maps extracted by the backbone network. Downsampling the P3 layer yields P3', which helps the model capture higher-level features and reduces computational cost. The P4 layer undergoes an ordinary convolution to obtain P4', and the P5 layer is upsampled and then convolved to obtain P5'. Finally, P5' is concatenated and fused with P3' and P4' to obtain the output feature map. These feature maps of different levels are concatenated and passed through a 1 × 1 convolution kernel before being combined with the aforementioned output feature map to form the FocusFeature module.
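As a rough illustration of the FocusFeature flow just described, a minimal PyTorch sketch follows. The module interface, channel widths, and the residual combination with the aligned P4 map are assumptions based on the description above, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusFeature(nn.Module):
    """Sketch of FocusFeature: align P3/P4/P5 to the P4 resolution and fuse."""
    def __init__(self, c3, c4, c5, out_ch=256):
        super().__init__()
        self.down3 = nn.Conv2d(c3, out_ch, 3, stride=2, padding=1)  # P3 -> P3' (downsample)
        self.conv4 = nn.Conv2d(c4, out_ch, 3, stride=1, padding=1)  # P4 -> P4' (plain conv)
        self.conv5 = nn.Conv2d(c5, out_ch, 3, stride=1, padding=1)  # upsampled P5 -> P5'
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)                # 1x1 fusion conv

    def forward(self, p3, p4, p5):
        p3_ = self.down3(p3)
        p4_ = self.conv4(p4)
        p5_ = self.conv5(F.interpolate(p5, scale_factor=2, mode="nearest"))
        fused = torch.cat([p3_, p4_, p5_], dim=1)   # concat the three aligned maps
        return self.fuse(fused) + p4_               # assumed residual combination
```

All three levels are brought to a common resolution before the concat, so the 1 × 1 fusion convolution mixes detail from P3 with semantics from P5 at every position.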
2.1.1. Multi-Scale Fusion Method
The Feature Pyramid Network (FPN) [17] is a widely utilized feature fusion technique in object recognition. To enhance modeling performance for objects of varying sizes and aspect ratios, it combines and generates multi-scale feature maps. As illustrated in Figure 5a, the initial step involves generating five layers of feature maps, denoted as [C1, C2, C3, C4, C5], based on the definition of FPN. Secondly, the generation path of C3 is modified: C3 is generated directly from P3 rather than being upsampled from C4, and it subsequently fuses with P2. Finally, we perform a sequence of downsampling and concatenation operations to generate a new feature map [N1, N2, N3, N4, N5], as proposed by the enhanced PAN [18] approach. N1 is generated directly from C1 without any intermediate operations. N2 is generated from N1 through the Conv3×3 module, and N3, N4, and N5 are generated by a series of fusion modules. The fusion module employs the C2f and Conv modules to adapt the feature map to a higher resolution and concatenate it with another. Taking N2 as an example: N2 is connected to C2, and the C2f module produces the output. Finally, the spatial dimensions of N2 are halved after passing through Conv3×3.
In Figure 5a, the model contains five detection heads connected at different network depths, originally designed to detect objects of different sizes [35]. Because the characteristics of some caterpillar fungus are not obvious and the target caterpillar fungus is small, finding a solution is particularly challenging.
For targets like caterpillar fungus, low-level features offer enhanced spatial and detailed information, which is crucial for boosting the precision of small target detection. Consequently, to preserve a greater amount of low-level feature data during the fusion process, we incorporate MRFPN in place of PAFPN. In
Figure 5c, MRFPN extracts from each feature layer of the backbone network, denoted {
C2,
C3,
C4,
C5}, a set of features of different scales. Subsequently,
C4 is integrated into the fusion process. Finally, the four-layer features are fused and outputted to obtain {F2, F3, F4, F5}. This method retains important low-level feature information at the prediction stage, while allowing for direct integration and acquisition of multi-scale features.
The original FPN (Feature Pyramid Network) fuses deep semantic features (C3–C5) with shallow detailed features (C1–C2) through a top-down path. As demonstrated in Figure 5c, the multi-scale fusion operation is reconstructed through the C3 generation path in the network structure. C3 is generated directly from P3, as opposed to being upsampled from C4 as in the traditional FPN. This approach retains more original detail and addresses the loss of information on small target caterpillar fungus in the deep features. A novel feature map [N1–N5] is then generated by means of downsampling and concatenation operations. For instance, N2 is derived from N1 via 3 × 3 convolution, and N3 is downsampled from N2 and spliced with P2 to create a cross-level feature interaction. Finally, through the FocusFeature horizontal connection, P3' (downsampled P3), P4' (P4 after ordinary convolution), and P5' (P5 after upsampling and convolution) are fused through a concat operation to form a mixed feature map containing multi-scale information. The multi-scale fusion operation integrates P3–P5 level features: the high-resolution P3 carries the morphological details of the caterpillar fungus, while the deep P5 features contain category semantics. Following fusion, detail–semantic complementarity is realized and small target recognition is enhanced.
UP(·), Conv3×3(·), Conv1×1(·), and C2f(·) denote the upsampling, Conv3×3, Conv1×1, and C2f modules, respectively. The multi-scale fusion process can be described by Equations (1) and (2).
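Since Equations (1) and (2) are not reproduced here, one plausible form consistent with the FocusFeature description above is the following (the exact operator composition is an assumption):

$$P_3' = \mathrm{Conv}_{3\times3}^{\,s=2}(P_3),\qquad P_4' = \mathrm{Conv}_{3\times3}(P_4),\qquad P_5' = \mathrm{Conv}_{3\times3}\big(\mathrm{UP}(P_5)\big) \quad (1)$$

$$F = \mathrm{C2f}\Big(\mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(P_3',\ P_4',\ P_5')\big)\Big) \quad (2)$$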
In the process of multi-scale feature fusion, we adopt Adaptive Spatial Fusion (ASF) [36] to assign different spatial weights to the features of different layers. The key features of small targets are highlighted by adaptive weight allocation, which dynamically adjusts the contribution of each scale during feature fusion and helps detect small targets. The structure is shown in Figure 5b.
In Figure 5b, ASF adaptively learns the fusion spatial weights of each scale feature map through rescaling and adaptive fusion. Following the notation of [36], let $x^{n\to l}$ denote the feature map of layer $n$ rescaled to layer $l$. In ASF1, 3 × 3 max pooling and 3 × 3 convolution are applied to the layer-3 feature map to obtain $x^{3\to1}$, and a 3 × 3 convolution is applied to the layer-2 feature map to obtain $x^{2\to1}$. In ASF2, a 3 × 3 convolution is applied to the layer-3 feature map to obtain $x^{3\to2}$, and a 1 × 1 convolution is applied to the layer-2 feature map, after which its size is adjusted to twice the resolution of the original image to obtain $x^{2\to2}$. In ASF3, a 1 × 1 convolution is applied to the layer-2 feature map, and then its size is adjusted to twice the resolution of the original image to obtain $x^{2\to3}$. Additionally, a 1 × 1 convolution is applied to the layer-1 feature map, and then its size is adjusted to four times the resolution of the original image to obtain $x^{1\to3}$. Adaptive feature fusion refers to multiplying the obtained $x^{1\to l}$, $x^{2\to l}$, and $x^{3\to l}$ by the weight parameters $\alpha^{l}$, $\beta^{l}$, and $\gamma^{l}$ to obtain the fused feature. The equation is as follows:
$$y_{ij}^{l} = \alpha_{ij}^{l}\,x_{ij}^{1\to l} + \beta_{ij}^{l}\,x_{ij}^{2\to l} + \gamma_{ij}^{l}\,x_{ij}^{3\to l} + \cdots \quad (3)$$

In the equation, the fused eigenvector at position $(i, j)$ is expressed as $y_{ij}^{l}$, where the feature mapping of the $n$-th layer is rescaled and fused to become the feature mappings $x_{ij}^{1\to l}$, $x_{ij}^{2\to l}$, and $x_{ij}^{3\to l}$ of the $l$-th layer; $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are learnable parameters that satisfy Equation (4), $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$. $y_{ij}^{l}$ is the generated feature vector. In the above equation, the ellipsis ($\cdots$) indicates that the network can have any depth.
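For concreteness, the following is a minimal PyTorch sketch of ASFF-style adaptive spatial fusion, in which per-pixel weights are produced by 1 × 1 convolutions and a softmax so that the weights sum to 1 at every position. The weight-generation details are assumptions following [36] rather than the exact MRAA implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    """ASFF-style fusion: learn per-pixel weights for three same-size feature maps."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per input produces a single-channel weight logit map
        self.weight_logits = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3):
        # x1, x2, x3: rescaled feature maps of identical shape (B, C, H, W)
        logits = torch.cat([w(x) for w, x in zip(self.weight_logits, (x1, x2, x3))], dim=1)
        # softmax over the three maps enforces alpha + beta + gamma = 1 per pixel
        alpha, beta, gamma = torch.softmax(logits, dim=1).chunk(3, dim=1)
        return alpha * x1 + beta * x2 + gamma * x3
```

Because the weights are computed per spatial position, a pixel dominated by small target detail in one scale can draw almost entirely from that scale while its neighbors draw from another.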
The integration of the ASF module has three principal benefits. Firstly, it enhances the accuracy of detecting objects at different scales: by intelligently combining features from various scales, the ASF module effectively addresses the challenges associated with detecting objects of different sizes, improving detection accuracy for objects of all sizes, particularly in complex environments. Additionally, the ASF module minimizes feature conflicts through dynamic adjustment of scale-specific feature contributions, which helps reduce spatial misalignment and enhances detection reliability. Furthermore, it optimizes feature fusion efficiency, allowing the model to maintain high accuracy with fewer detection iterations or at lower resolutions, thus improving overall detection performance.
2.1.2. Branch Diffusion Method
As shown in Figure 6a, this paper employs a self-developed branch diffusion method. The feature map C3 outputted to N2 is replaced by C3'. C3' is obtained by upsampling C4 and fusing the result with C3 before being passed to the next layer (C2). The C2 branch diffuses and outputs a single feature map, N2, which is rich in semantic information from both paths, including fine-grained features and positional information from the lower-layer features, and finally diffuses to the second detection scale. Path 1 generates a feature map that exclusively contains deep semantic information, thereby establishing C3' as the primary candidate for fusion with C3. A similar process occurs with the feature map C4 outputted to N3, which is replaced by C4'. C4' is obtained by upsampling C5, and through the fusion of C4' and C3 it is passed to the next layer (N3). The C3 branch diffuses and outputs a single feature map, N3, which is rich in semantic information from both paths and finally diffuses to the third detection scale. Likewise, Path 1 generates a feature map that exclusively contains deep semantic information, rendering C4' the optimal choice for fusion with C3 in Figure 6b. The same operation is adopted in Figure 6c. The diffusion output ultimately generates a single feature map, designated N1, which subsequently diffuses to the first detection scale.
As shown in Figure 6c, C5 comprises features associated with larger objects, which are amplified through the FPN network and subsequently input into C1, and from there into N1 and Head1. Consequently, Head1, which was originally designed for small target detection, has difficulty obtaining the necessary small target features. The direct transfer of C5 to C1 can easily lead to confusion between the scales of large and small targets. To mitigate the degradation of the key layer information of small targets caused by the branch diffusion of the two paths, and to prevent scale confusion between large and small targets, we propose directly outputting the feature maps C4 and C5 to obtain N4 and N5, as illustrated in Figure 6d. The branch diffusion operation can be expressed in equation form, as shown in Equation (5).
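Using the UP(·), Concat(·), and C2f(·) operators defined in Section 2.1.1, Equation (5) can plausibly be written as follows (the exact operand composition is an assumption based on the description above):

$$C_3' = \mathrm{UP}(C_4) \oplus C_3,\qquad N_2 = \mathrm{C2f}\big(\mathrm{Concat}(C_2,\ C_3')\big),$$

$$C_4' = \mathrm{UP}(C_5),\qquad N_3 = \mathrm{C2f}\big(\mathrm{Concat}(C_3,\ C_4')\big),\qquad N_4 = C_4,\qquad N_5 = C_5 \quad (5)$$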
The branch diffusion mechanism disseminates features that are rich in contextual information across a range of detection scales, ensuring strong performance. Consequently, the network incorporating multi-scale fusion and branch diffusion was selected as the final MRFPN.
In comparison with the baseline, MRFPN accentuates the features associated with small targets to a greater extent. The presence of multi-scale fusion and branch diffusion operations ensures that the features of large targets persist in the shallow feature maps, though they no longer dominate. Instead, the features of small targets become prominent. This contributes to the ability of the shallow detection head to focus more acutely on small targets.
2.2. The Main Backbone: N-CSPDarknet53
In the task of object detection, the attention mechanism is usually adopted to focus the attention on the area of the object and to suppress the interference from non-object areas, such as the background. For caterpillar fungus and weeds with highly similar surface colors, the interference from color and background is particularly significant.
In order to enhance the feature extraction ability of the backbone network, an improved variant has been proposed. This variant is named N-CSPDarknet53 and is based on the CSPDarknet53 network used in YOLOv8s. This architecture integrates the
C2f and SPPF modules from the original YOLOv8s architecture. The N-CSPDarknet53 network incorporates an attention-based adaptive weighted fusion mechanism in the form of the Da-Conv module. The Da-Conv module is illustrated in
Figure 7b.
In the improved Da-Conv module, we propose an adaptive weighted fusion mechanism (AWF). AWF has a two-branch structure. In the upper branch, $X$ represents the feature map at any network depth. $X$ is processed by the CBR operation to obtain $X_1$. Afterwards, inspired by ECA [37], we process $X_1$ through spatial attention (SA) to further explore the representative regions of the target. Specifically, as shown in the blue dashed rectangular box in Figure 7, the SA block is composed of CBR operations, channel-wise operations, the Sigmoid function, and 1 × 1 convolution operations. Based on these operations, the network obtains the importance $W_{SA}$ of pixels in the spatial domain. To distinguish the target from the background, the input feature map $X_1$ of the SA block is multiplied by $W_{SA}$. This operation helps the network focus its attention on the target area, as shown in the following equation:

$$F_{up} = X_1 \otimes W_{SA},\qquad W_{SA} = \mathrm{Sigmoid}\Big(\mathrm{Conv}_{1\times1}\big([\mathrm{Ce}(X_1);\ \mathrm{Ca}(X_1)]\big)\Big) \quad (6)$$

where $F_{up}$ is the SA block output and Ce(−) and Ca(−) are the channel average and channel maximum algorithms, respectively.
The architecture of the lower branch of the AWF is analogous to that of the upper branch, with two distinct variations. Firstly, the input differs: in the lower branch, $(1 - W_{SA})$ is used to process the feature map. The reverse operation in the lower branch, as opposed to the conventional view in the upper branch, enables the network to distinguish the target from the background from an alternative perspective. By processing $X$ from two perspectives simultaneously, the network can better understand the target area and more easily identify important features of the target, thereby suppressing interference from non-target areas such as the background. A further distinction lies in the use of a channel attention (CA) block in place of the spatial attention (SA) block. As illustrated in the yellow dashed rectangle in Figure 7, the CA block's configuration resembles that of the SA block, the primary difference being the replacement of channel-wise operations with spatial operations. The output $F_{low}$ of the lower branch can be obtained through the following equation:

$$F_{low} = X_2 \otimes W_{CA},\qquad X_2 = \mathrm{CBR}\big((1 - W_{SA}) \otimes X\big) \quad (7)$$

where $W_{CA}$ represents the weight obtained by the CA block in the channel domain. After processing $X$ with two branches from different perspectives, fusing the two tensors through a summation operation improves the effectiveness of the feature representation and addresses issues related to color and background interference. The predicted target $F_{out}$ can be obtained as follows:

$$F_{out} = F_{up} \oplus F_{low} \quad (8)$$

where ⊕ represents the summation operation.
As shown in Figure 7b, the Da-Conv module enhances the capacity to discern small objects from the background by means of a double-branch attention mechanism and reverse feature processing. The Da-Conv module employs a two-branch structure, comprising spatial attention (SA) and channel attention (CA), to enhance target features from the spatial and channel dimensions, respectively. Spatial attention (SA): the spatial weights are extracted by global average pooling (GAP) and global maximum pooling (GMP), and the target region is dynamically focused (as shown in Equation (6)). In shallow features, the edge and texture information of small targets is easily submerged by background noise; SA enhances the spatial saliency of targets by inhibiting the activation of irrelevant areas (such as weeds). Channel attention (CA): the reverse feature $(1 - W_{SA}) \otimes X$ is recalibrated at the channel level to screen channels that are sensitive to classification. In scenes with similar colors, CA improves the channel response of the target by enhancing channels that differ significantly from the target color, such as brightness or texture. Reverse feature processing, $(1 - W_{SA})$, applied to the input features by the lower branch, is essentially an adversarial feature enhancement strategy. The reverse operation forces the network to learn the difference between the caterpillar fungus and the background from a complementary perspective, especially when the colors are similar (such as a green object against a green background); the reverse branch reduces false detections by comparing the response difference between the original and reverse features to capture subtle gradient changes such as edges or shadows (the orange box in Figure 2 illustrates the reduced interference).

Finally, the double-branch outputs in the Da-Conv module achieve cross-dimensional interaction through adaptive weighted fusion. This process combines the spatial positioning accuracy of the SA branch with the channel discrimination of the CA branch to form a more robust feature representation. Compared with a single attention mechanism (such as SE or CBAM), the two-branch design avoids the feature bias caused by fixed weights through dynamic weight allocation (α, β, and γ), improving the model's adaptability to complex backgrounds. Compared with SE, CBAM, and other modules, Da-Conv increases APS by 0.017–0.037, which verifies the effectiveness of the double-branch structure.
Through weighted fusion, these operations promote the establishment of cross-dimensional interactions, combining channel information with spatial information and overcoming the fragmentation problem of channel–spatial information. The Da-Conv module is a highly portable module that can be inserted into any model. This facilitates the allocation of attention, significantly improving the ability to distinguish targets from the background and effectively reducing false detections. This is achieved by extracting features from multiple dimensions and capturing the relationships between these features.
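To summarize the two-branch structure, the following is a minimal PyTorch sketch of a Da-Conv-style block under the notation above ($W_{SA}$, $W_{CA}$, reverse features $(1 - W_{SA}) \otimes X$). The pooling choices, convolution sizes, and the plain summation fusion are assumptions consistent with the description, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conv + BatchNorm + ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.block(x)

class DaConv(nn.Module):
    """Two-branch attention: spatial attention (SA) on the features and channel
    attention (CA) on the reversed features, fused by summation."""
    def __init__(self, ch):
        super().__init__()
        self.cbr_up, self.cbr_low = CBR(ch), CBR(ch)
        self.sa_conv = nn.Conv2d(2, 1, 1)           # SA: 2 pooled maps -> 1 weight map
        self.ca_conv = nn.Conv2d(2 * ch, ch, 1)     # CA: pooled vectors -> channel weights

    def forward(self, x):
        # upper branch: spatial attention (Equation (6))
        xu = self.cbr_up(x)
        ce = xu.mean(dim=1, keepdim=True)           # Ce(-): channel average
        ca = xu.amax(dim=1, keepdim=True)           # Ca(-): channel maximum
        w_sa = torch.sigmoid(self.sa_conv(torch.cat([ce, ca], dim=1)))
        f_up = xu * w_sa
        # lower branch: channel attention on the reversed features (Equation (7))
        xl = self.cbr_low((1.0 - w_sa) * x)
        avg = xl.mean(dim=(2, 3), keepdim=True)     # spatial average pooling
        mx = xl.amax(dim=(2, 3), keepdim=True)      # spatial maximum pooling
        w_ca = torch.sigmoid(self.ca_conv(torch.cat([avg, mx], dim=1)))
        f_low = xl * w_ca
        return f_up + f_low                         # summation fusion (Equation (8))
```

The key design choice is that the lower branch never sees the regions the upper branch has already claimed: it operates on $(1 - W_{SA}) \otimes X$, so the two branches are pushed toward complementary evidence about what is target and what is background.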
2.3. Dataset and Experimental Setup
2.3.1. The Construction of the Dataset
In order to collect images of caterpillar fungus in its natural state, this study used a professional Canon R8 camera to capture images from June to July 2024. These images depict caterpillar fungus in its natural environment, encompassing various weather conditions, overlaps, and complex lighting conditions to thoroughly simulate the actual scene in the natural environment, as illustrated in Figure 8. Caterpillar fungus was photographed under clear and snowy conditions to simulate different weather in the natural environment; daytime natural light and night light were chosen to cover different lighting conditions; and caterpillar fungi in independent and overlapping situations were selected to recreate the real positions of caterpillar fungus in the natural environment. It is important to note that different degrees of shielding change the shape characteristics of the caterpillar fungus, resulting in non-obvious features of small targets. The processed images and their annotations were integrated into a structured dataset with complex backgrounds and diverse target scales. The validation set contains a total of 664 targets, of which 373 (56%) are classified as small targets. The training set comprises a total of 5100 targets, of which 2267 (44%) are classified as small targets. This paper has collected a substantial number of caterpillar fungus images under various conditions to capture the diversity of caterpillar fungi in different environments, thereby meeting the data requirements for algorithm training.
2.3.2. Evaluation Indicators
To confirm the validity and credibility of the MRAA network, the standard COCO accuracy metrics used in this study include AP, AP50, APS, APM, and APL [38]. AP50 represents the average precision at an IOU of 0.50. APS, APM, and APL represent the accuracies for small, medium-sized, and large targets, respectively, while Params indicates the number of parameters used. Higher values of each accuracy indicator suggest better performance of the detection model. The specific evaluation indicators are the following:
Accuracy (precision) is the proportion of true positives among all positive predictions. Recall measures the ability of a classification model to recognize positive (class of concern) samples. TP represents the quantity of correct identifications and FN represents the quantity of missed identifications. AP fully considers both the precision and recall of the model and is considered one of the most challenging metrics. The equation for AP is shown as follows: by accumulating and summing precision at different recall rates, the value of the AP metric is obtained.
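For reference, the standard forms of these metrics (assuming the usual COCO-style definitions, since the original equations are not reproduced here) are:

$$\mathrm{Precision} = \frac{TP}{TP + FP},\qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \;\approx\; \sum_{n}\big(R_{n} - R_{n-1}\big)\,P_{n}$$

where $FP$ denotes false positives, and $P_n$ and $R_n$ are the precision and recall at the $n$-th confidence threshold.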
The mAP metric is the average value of AP across all categories, where AP is calculated for each category. In this article, mAP is evaluated at an IOU threshold of 0.5.
2.3.3. Experimental Setup
As a version of the YOLO series, YOLOv8 combines the advantages of previous generations of YOLO and further improves the performance of object detection by introducing a deeper network structure, optimized loss function, and training strategy. YOLOv8 adopts a more efficient network architecture, and further optimizes the feature extraction module, so that it can better capture the target features of different scales and complex backgrounds [
39].
In this study, the small model of the YOLOv8 series was selected as the benchmark model for the experiments. On the one hand, the accuracy of YOLOv8s approaches or even exceeds that of some large models, while YOLOv8s retains advantages in speed and resource consumption. This balance makes it an ideal benchmark for verifying the effectiveness of the new method (the MRAA network). On the other hand, YOLOv8s performs well among lightweight models in terms of parameter count and computational efficiency, which meets actual deployment requirements. Moreover, the small model has a low parameter count and computational overhead, making it suitable for resource-limited agricultural scenarios (handheld devices) [40]. Therefore, this study chooses the small model of the v8 series as the benchmark model.
The parameters of the software and hardware environments used in the experiment are presented in
Table 1, and the experimental parameters configuration is in
Table 2. In this study, 300 training epochs were used. This value was chosen because experimental verification shows that the APS of the MRAA network increases by 12% (0.180→0.202) at 300 epochs, indicating that the model has fully converged by that point. Initial tests determined that 300 epochs are sufficient to reach the minimum loss on the validation dataset; increasing the epoch count may not improve accuracy and may even result in overfitting. The relevant literature shows that, for small targets in a dataset that is small but highly complex, 300 epochs is consistent with experience [11,12]. In addition, the MRAA model needs to be deployed on embedded devices, so efficiency should be taken into account during training; 300 epochs avoids over-training while ensuring accuracy, meeting the requirements of lightweight deployment. In this study, the dataset was divided into a training set, a test set, and a validation set in proportions of 70%, 20%, and 10%, respectively. Two different network architectures (YOLOv8s and MRAA) were used for the experiments.
Finally, in the experimental part of this paper, we use the Strawberry-DS [41] dataset for testing. Strawberries in the growing period have become a typically challenging scene for agricultural small target detection due to their small fruit size, color highly similar to the background weeds, and complex growth environment (leaf occlusion rate > 40%). As a typical small target detection case, growing strawberries are challenging and can verify the validity of the model, providing an ideal test bed for verifying the fine-grained feature extraction capability of MRAA networks. According to statistics, the manual false detection rate in strawberry picking is as high as 23%, so this verification also offers practical value for automatic strawberry picking.
2.4. Ablation Experiment
2.4.1. Component Ablation Experiment of MRAA
The Da-Conv and MRFPN modules are each added to the YOLOv8s model and four sets of comparative ablation experiments are performed. These experiments are trained and tested using consistent datasets and parameters to ensure comparability. The MRFPN and Da-Conv in
Table 3 represent the two new modules.
As shown in Figure 9, the MRAA network achieves significant improvements in the accuracy of targets at different scales: an increase of 0.061 in AP, 0.022 in APS, 0.047 in APM, and 0.133 in APL. Compared with the baseline (YOLOv8s) network, the accuracy of small targets (APS) increased by 12%, and the accuracy of medium targets (APM) increased by 11%. These experiments prove the effectiveness of this network in enhancing the detection of small and medium targets. When the Da-Conv module is used alone on the baseline model, AP increases by 0.035, APS by 0.015, APM by 0.04, and APL by 0.109.
When MRFPN is used alone, accuracy improves over the baseline by 0.018 in AP, 0.011 in APS, 0.017 in APM, and 0.012 in APL. The accuracy improvements are particularly significant for small and medium-sized targets. This study adopts shallow feature mapping to emphasize the important features of small targets, thereby increasing the chance of accurately detecting them as positive instances. This adjustment changes the training dynamics of the model, resulting in a more pronounced bias towards small targets.
2.4.2. Ablation Experiment with MRFPN
For ease of description, the multiscale fusion method and the branch diffusion method will be referred to as ‘Method1’ and ‘Method2’, respectively. The ablation experiments were performed on YOLOv8s, with the environment and parameter settings remaining unchanged throughout the experiment.
Table 4 shows the comparative experiments integrating both Method1 and Method2 into the model.
In
Table 4, after adding the multi-scale fusion method alone, the
AP of the model increased by 0.002,
APS increased by 0.006,
APM increased by 0.022, and
APL decreased by 0.011. After adding the branch diffusion method alone, the
AP of the YOLOv8s + Method2 model increased by 0.036,
APS increased by 0.013,
APM increased by 0.041, and
APL decreased by 0.018.
In the experimental results shown in
Figure 10, both
APS and
APM have increased by varying degrees, while the accuracy of
APL has decreased when using a single method alone. In the YOLOv8s + Method1 + Method2 experiment, the combination of the two methods not only improved the
APL but also further enhanced the
APS.
This discovery confirmed the improved accuracy of the multi-scale fusion and branch diffusion operations while avoiding the problem of confusion between feature maps of different scales. It is worth noting that, for all targets except large ones, adding Method1 or Method2 separately still improves detection accuracy.
2.4.3. The Ablation Experiment with the Da-Conv Module
This paper selects four mainstream attention mechanisms for comparison: squeeze-and-excitation networks (SE) [
42], GAM [
43], CBAM [
44] and EMA [
45]. Da-Conv is a dual attention mechanism module that adopts weighted fusion; here, “Conv” indicates the use of regular convolution. In this part, a total of six experiments were conducted.
Table 5 presents the comparative experiments with different attention mechanisms combined with YOLOv8s.
Firstly, the weighted fusion Da-Conv module is added to the YOLOv8s network for comparison. It can be seen that the Da-Conv module with learnable weights demonstrates stronger capability and can better enhance the accuracy of the model. Compared with the baseline, Da-Conv increases AP by 0.105, APS by 0.017, APM by 0.04, and APL by 0.178. The reason is that the module can extract features from multiple dimensions and capture the relationships between these features.
Secondly, in
Figure 11, four additional experiments were conducted using the YOLOv8s network. The purpose was to compare the changes in detection accuracy under the effects of Da-Conv and mainstream attention mechanisms. SE, GAM, EMA, and CBAM were compared, and these attention modules were all located at the same position. SE, EMA, and CBAM improved the detection accuracy for the target, but the improvement was not as significant as that of Da-Conv. This fully demonstrates that Da-Conv is more effective than SE, EMA, and CBAM.
2.4.4. Comparison with Advanced Algorithms
The MRAA network was compared with some other advanced algorithms. In the comparative experiments, the experiments evaluated Faster R-CNN [
46], Dynamic R-CNN [
47], Mask R-CNN [
48], RT-detr [
49], VitDet [
50], Baseline, YOLOv5s [
51], YOLOv9s [
52], TS-DETR [
53], NanoDet-Plus [
54] and YOLOv7tiny [
55] from various aspects. The experimental parameters and the dataset used were kept the same across all experiments.
In
Table 6, the APS of MRAA reaches 0.202. In terms of medium-scale target detection, the MRAA network used in this paper outperforms Faster R-CNN (0.396), Mask R-CNN (0.412), Dynamic R-CNN (0.405), RT-detr (0.418), YOLOv5s (0.420), VitDet (0.415), YOLOv7tiny (0.427), YOLOv9s (0.463), NanoDet-Plus (0.455), TS-DETR (0.469), and YOLOv8s (0.428). Moreover, in terms of large-scale target detection, the MRAA network achieves the highest accuracy of 0.659.
In
Table 6, MRAA has FLOPs similar to YOLOv8s: it reduces the use of computing resources while significantly improving accuracy. Compared with other advanced models, MRAA is still classified as a lightweight model. As shown in
Figure 12, models closer to the origin represent more lightweight models, and the MRAA network we proposed falls into the category of lightweight models. This is because the MRAA network employs multi-scale fusion and branch diffusion operations, which enable features at each scale to possess detailed contextual information. These features with rich contextual information are then diffused to each detection scale via the diffusion mechanism. Finally, the background and target are distinguished and there is an improvement in the ability to capture key features. This improves the accuracy of small target detection.
As demonstrated by the experimental findings, the proposed MRAA model is superior to YOLOv9s (0.194) and NanoDet-Plus (0.187) in terms of the APS index, while exhibiting an 18% parameter reduction compared to YOLOv9s. In terms of model efficiency, the FLOPs of MRAA are 21 G, significantly lower than TS-DETR (132 G), and the parameter count of MRAA is 9.88 M, better than YOLOv9s (12.1 M) and the original base model YOLOv8s (11 M), while maintaining performance. In terms of processing targets at other scales, MRAA achieves an APM index of 0.475, which is superior to all comparison models. A comparison of the transformer architecture with MRAA's branch diffusion mechanism reveals the former to be less suited to the target continuity characteristics of agricultural scenarios. Extensive experimentation demonstrates that MRAA attains cutting-edge performance, achieving a 4.1% improvement over the highly advanced YOLOv9s in terms of the AP metric. The efficiency of the model is particularly advantageous in agricultural scenarios, and the multi-scale fusion mechanism is more suitable for continuous small target detection than a pure transformer architecture.
As demonstrated in
Figure 13, a comparison is made between the performance of the MRAA model and several advanced models. The position of a model in the figure indicates its performance: the further to the right, the longer the inference time and the more complex or less efficient the model. The higher the value on the Y-axis, the better the model performs in the target detection task. The size of the bubbles indicates the number of parameters, with larger bubbles corresponding to more complex models and higher computing resource requirements. The upper-left models exhibit low inference time and high AP, favoring both accuracy and efficiency, rendering them suitable for real-time applications; additionally, their parameter counts are minimal and within the range of lightweight models. In comparison with alternative models, our enhanced model offers rapid inference speed, high precision, and a reduced number of parameters, thereby achieving a balance between precision and efficiency. In conclusion, the MRAA network proposed in this paper is more universal and interpretable. As a single-stage algorithm, the MRAA network has the potential to be further optimized.
2.4.5. Robustness Testing in Different Scenarios
The dataset of this study was collected in the natural environment. In order to test the performance of the model in the natural environment, we simulated the real environment under different weather conditions and different overlapping and complex lighting conditions, and further tested and optimized the performance of the model. The experimental results for the MRAA model tested in different scenarios are shown in
Table 7. The experimental results of the basic model tested in different scenarios are shown in
Table 8.
As demonstrated in
Table 9, under the same training set, the
AP index of MRAA is 24% superior to that of Faster R-CNN and 12% better than YOLOv5s, particularly in the domain of small target detection (
APS), where it exhibits a performance more than 20% ahead of the competition. A comparison of MRAA with RetinaNet, which has been optimized for small targets, reveals an enhancement in
APL of 3.7% (0.635→0.659) through the utilization of MRFPN’s multi-scale retention mechanism.
As demonstrated by the experimental findings, the APS of the basic model is 0.180 in sunny conditions but declines to 0.153 in rain and snow environments, a 15% reduction in recognition accuracy for small targets. In contrast, the MRAA model significantly enhances feature expression capability through the multi-scale fusion and branch diffusion operations of MRFPN. The experimental results demonstrate that the AP stability of MRAA in complex scenarios is enhanced by approximately 8%, the AP decline is reduced to 3.5% in rain and snow weather, and the recognition accuracy for small targets (APS) is improved by 14% compared with the basic model. The APL of the basic model is 0.519 under natural light, and the MRAA model increases it to 0.553 (+6.5%) through the multi-scale feature retention mechanism of MRFPN. In the night scene, the AP50 of the basic model is 0.625, and the MRAA model increases it to 0.685 (+9.6%) through multi-scale context fusion; the recognition accuracy for small targets (APS) is increased by 12% compared with the basic model. In the overlapping target scenario, the APM of the basic model is 0.418, and the MRAA model enhances it to 0.452 (+8.1%) via the branch diffusion operation; the recognition accuracy for small targets (APS) is improved by 10% compared with the basic model. In summary, the MRAA network enhances detection robustness in complex scenarios through innovation in its feature fusion architecture. Notably, the network demonstrates particular proficiency in small target detection, as evidenced by a 12% increase in APS, and in environmental interference suppression, marked by a 59% decrease in the false detection rate. These findings offer a novel technical framework for the intelligent detection of caterpillar fungus.
2.5. Ablation Experiments Using Other Datasets
We conducted a comparative analysis with the baseline (YOLOv8s) model on the Strawberry-DS [
41] dataset. As shown in Table 10, when using Strawberry-DS as the dataset, the APS value of the MRAA model increased by 0.036 compared with the baseline. The MRAA model also shows significant improvements in the APM and APL metrics, increasing by 0.035 and 0.028, respectively, which proves the effectiveness of the branch diffusion operation. This experiment shows that MRAA significantly improves small target accuracy on other datasets.
The ablation experiment involved selecting images of strawberry fruits and flowers under different shooting angles and varying degrees of occlusion. On the one hand, compared with the basic model, MRAA effectively reduces the probability of missed and false detections and distinguishes the target from the background using the Da-Conv module. On the other hand, MRAA improves the detection accuracy of strawberry fruits and flowers to a certain extent, effectively addressing issues caused by obscure features and low accuracy. The effectiveness and robustness of the proposed MRAA model are thus further demonstrated.
To further validate the effectiveness of the improved model in a wider range of agricultural scenarios, a comparative analysis was performed with the baseline (YOLOv8s) model on the APHID-4K [
56] dataset. As shown in Table 11, both MRFPN and Da-Conv improve the detection accuracy of the model. When APHID-4K is used as the dataset, the MRAA model improves the APS value by 0.036 compared to the baseline. The MRAA model also shows significant improvement in the APM and APL indexes, which proves the effectiveness of the branch diffusion operation. The experiment shows that MRAA achieves enhanced detection accuracy for small targets across different datasets, improving its generalization ability in different agricultural scenarios.