1. Introduction
Caterpillar fungus, formed by entomogenous fungi parasitic on the nymphs or larvae of insects, is an important constituent of animal-derived traditional Chinese medicines. In Asia, these caterpillar fungi are used as tonics and medicinal foods, playing significant roles in everyday dietotherapy [1]. The caterpillar fungus is celebrated for its extensive medicinal use, extraordinary life history, and high market value [2]. The fungus is believed to strengthen the lungs and kidneys, increase energy and vitality, stop bleeding, and decrease phlegm [3]. Caterpillar fungus and its products are available as over-the-counter remedies or tonics in Western countries, where they are promoted for their anti-aging, anti-cancer, and immune-boosting effects [4].
Relevant studies report that the current market price of the highest quality caterpillar fungus products is USD 140,000 per kilogram, making it one of the most expensive biological resources in the world. Current annual production levels suggest that the global trade in caterpillar fungus is worth between USD 500 million and USD 11 billion [5]. Research indicates that the international market price rose by 900% from 1997 to 2008 [6]. Given its substantial market value, caterpillar fungus is widely believed to play a significant role in local and national economies, and its collection has become a significant short-term seasonal activity for economic gain [7]. Local residents primarily collect caterpillar fungus through conventional manual digging, which remains the main collection method in most areas [8] and has yielded accumulated experience, such as the identification of the emergence characteristics of caterpillar fungus. However, manual collection is characterized by low efficiency and high labor costs [9]. Consequently, there is an urgent need for an intelligent caterpillar fungus identification method that enhances the efficiency and accuracy of recognition and replaces manual work. Such a method can ensure precise operation without compromising work efficiency, facilitate the expansion of the caterpillar fungus sales market, reduce labor costs, and increase sales.
Object detection is a pivotal area of research in agricultural production and life. As machine learning becomes an important tool for driving innovation and productivity across industries, many researchers have used traditional object detection algorithms to obtain sufficient information with the aim of improving detection accuracy. Some studies use region-of-interest (ROI) clustering and employ the Canny edge detector to achieve edge detection of crops [10]. Furthermore, classification systems for identifying crop diseases have been developed utilizing color transformation, color histograms, and comparison techniques [10]. Despite the evident potential of these systems in detecting various crop diseases, their detection accuracy remains constrained.
Although the above methods can extract sufficient target information, traditional target detection algorithms often need a long extraction time for small target detection, and their overall efficiency is low. These algorithms are therefore difficult to apply to the growth environment of caterpillar fungus. Given these limitations, deep learning and CNNs are increasingly demonstrating their crucial role in agriculture. At present, deep learning-based solutions have become mainstream for small target detection.
Tian et al. proposed a novel approach called MD-YOLO for detecting three typical small target lepidopteran pests on sticky insect boards [
11]. Liu et al. proposed an MAE-YOLOv8 model with YOLOv8s-p2 as the baseline for accurately detecting green crisp plums in real complex orchard environments. The prediction feature map preserves more feature information from small targets, thus improving the detection accuracy of small targets [
12].
In summary, deep learning research is advancing, particularly in the identification and detection of agricultural products and crop diseases, and a large number of studies have been conducted. Existing studies have focused on targets such as pests [11] and fruits [12], yielding good results. A number of relevant studies [13] have demonstrated that, although some proposed models enhance the recognition accuracy of small targets, their size and number of parameters are substantial, their inference speed is slow, and they are not lightweight [14]. With the development of deep learning technology, studies have addressed these lightweighting issues. However, feature extraction remains challenging when the targets are small and of low resolution [15].
The use of deep learning for the detection of caterpillar fungus has not been addressed in existing studies; at present, the field still relies on manual recognition. Integrating deep learning algorithms into this domain has the potential to mitigate many of the limitations and drawbacks of manual recognition, and the intelligent recognition of caterpillar fungus holds immense potential for commercial applications.
Deep learning-based small target detection has gradually replaced traditional methods. These models are primarily categorized into three types: multi-layer feature interaction, attention mechanisms, and data augmentation [
16]. The methods that involve multi-layer feature interaction include FPN [
17], PAN [
18], Bi-FPN [
19], FPPNet [
20] and PVDet [
21]. Attention mechanism methods include the spatial location attention module (SLA) [
22], the convolutional block attention module (MCAB) [
23], the triple attention module [
24] and position-aware (PA) attention [
25]. Data augmentation methods include techniques such as IDADA (Impartial Differentiable Automatic Data Augmentation) [
26] and oversampling operations [
27]. Although various methods have been proposed, the following challenges remain in the detection of caterpillar fungus:
Challenge 1: In its natural environment, caterpillar fungus exhibits a high degree of color similarity to its background. This similarity can impede target detection algorithms, which may erroneously identify the background as caterpillar fungus, compromising recognition accuracy and undermining the overall detection process.
Challenge 2: Detecting small targets in the caterpillar fungus dataset is difficult, particularly in natural environments where small targets are less obvious and therefore easily ignored during training. Although many current feature fusion modules can focus on the target region, they remain limited in capturing the finer-grained features and location information of small targets, which further restricts their feature extraction ability.
Challenge 3: The excessive size of existing models hinders their deployment on embedded devices. Consequently, a novel lightweight network architecture that preserves the characteristics of small targets is needed.
In order to address the challenges posed by background interference and the inconspicuous characteristics of caterpillar fungus, a multi-scale fusion self-attention (MRAA) model is proposed.
The present paper puts forward a feature fusion network (hereafter referred to as MRFPN) that integrates multi-scale context operations and branch diffusion mechanisms. The efficacy of these mechanisms in enhancing the distinctiveness of features for small targets, and in improving detection accuracy across various scales, is demonstrated in this study. The MRFPN architecture places particular emphasis on critical features of small caterpillar fungus during the training process, thus enhancing overall performance.
The Da-Conv module is integrated into the CSPDarknet53 network, effectively addressing the interference from surrounding weeds and background elements. The module is able to distinguish the target from distracting elements.
The experimental results highlighted substantial advantages of this model in terms of both detection accuracy and computational efficiency. Compared to traditional manual identification methods, this approach not only significantly improves recognition efficiency and accuracy but also accurately identifies caterpillar fungi of various sizes. This ensures precise operations while maintaining high work efficiency.
Deep learning has demonstrated autonomous learning capabilities, independently extracting low-level features from data and forming more sophisticated representations by integrating them, thereby uncovering the distributed characteristics of data. This approach is effective in detecting and locating targets of various sizes, excels in feature extraction, and addresses issues of low extraction efficiency. A number of researchers have successfully implemented small target detection using deep learning-based object detection methods. The primary categories of deep learning approaches include multi-layer feature fusion, attention mechanisms, and data augmentation [
16].
The integration of diverse feature information enhances the recognition of multi-scale targets and has been widely adopted in small target detection. This method computes features and performs recognition based on a single input scale, which achieves rapid processing speed while ensuring low memory consumption [16]. The Feature Pyramid Network (FPN) [17] effectively aggregates feature information across multiple scales to improve detection performance for objects of varying sizes; importantly, it achieves substantial improvements in detection accuracy for smaller objects. The Path Aggregation Network (PAN) [18] adopts a bidirectional fusion strategy that integrates both top-down and bottom-up pathways and incorporates two critical modules: adaptive feature pooling and fully connected fusion. This architecture enables effective multi-scale information aggregation, thereby enhancing the integration of feature information across various scales. As a result, PANet significantly improves detection performance, especially for small and occluded objects. Although FPN [17] and PAN [18] improved small target detection through bidirectional feature fusion, the weak features of small caterpillar fungus targets (such as feature maps susceptible to background interference) were not optimized.
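To make the top-down fusion concrete, the following is a minimal PyTorch sketch of an FPN-style merge over three backbone levels. The channel widths, layer names, and nearest-neighbor upsampling are illustrative assumptions, not the exact configurations of [17,18].

```python
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Minimal top-down FPN fusion over three backbone levels (C3, C4, C5)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone level to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        # upsample the coarser map and add the lateral connection
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```

The lateral 1 × 1 convolutions align channel widths so that the upsampled coarse map can be added element-wise to the finer map, which is what allows deep semantics to reach the high-resolution levels.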
In research on the YOLO series [28], the classic FPN + PAN network was adopted, achieving bidirectional propagation of features among the three layers of the backbone network and enhancing semantic information for positioning and recognition [16]. This mechanism not only effectively assists the network in detecting small objects but also accurately identifies partially occluded targets. Hu et al. [29] introduced an adaptive multi-layer feature fusion network, termed LAMFFNet. This network integrates the Adaptive Weighted Feature Fusion Module (AWMFF), which enhances feature refinement by adaptively weighting the outputs of different encoder layers, and incorporates a Local–Global Self-Attention Layer (LGSE). This approach effectively optimizes feature fusion, but the problem of information loss during feature diffusion is not solved.
Conventional channel attention mechanisms primarily focus on long-range dependencies; CoordAttention additionally preserves precise location information, thereby resulting in a more robust and comprehensive representation of features. Li et al. [
30] proposed a novel deep learning-based infrared small target detection method, termed YOLOSR-IST. The incorporation of a coordinated attention mechanism within the backbone network, in conjunction with the integration of low-level positional information into the shallowest feature map during the process of feature fusion, has been demonstrated to effectively mitigate the occurrence of missed detections and false alarms.
CBAM significantly enhances feature representation in convolutional neural networks (CNNs) [31]. By incorporating both channel and spatial attention mechanisms, CBAM strengthens critical features, leading to improved overall performance. In a related development, Liu et al. [32] introduced CBAM into the YOLOv5 model for the detection of small targets in underwater environments, with a particular focus on small, densely distributed, and occluded organisms. The experimental results demonstrate a good balance between detection accuracy and efficiency. However, CBAM [31] is unable to effectively process background noise in shallow features.
Liu et al. [33] introduced an innovative attention mechanism, termed BiFormer. A bidirectional attention mechanism was achieved by combining ResNet-18 with BiFormer and Reverse Attention. This combination concentrated network attention more effectively, enhancing detection in camouflage settings and improving model performance. In a similar vein, Zhang et al. [34] introduced the Triplet Attention Module, which enhanced the interaction between spatial and channel dimensions, thereby more effectively preserving critical feature information.
The above methods improve APS only to a limited extent. BiFormer [33] reduced computation through sparse attention but sacrificed the ability to capture the local features of tiny targets. The Triplet Attention Module [34] does not explicitly model color differences; when a target is close to the background color (such as caterpillar fungus among dead leaves), the attention weight tends to favor high-contrast regions and ignore low-contrast small targets. Because of interference from the field environment and the inherently weak characteristics of the target, no significant progress has been made in improving detection accuracy for caterpillar fungus.
Figure 1 depicts a common strategy to deepen the network architecture to provide additional features for small targets. In contrast, this paper proposes an innovative method, which introduces multi-scale convolution kernels into feature fusion through multi-scale fusion operation. By fusing features of different scales, the model’s adaptability to scale changes, object size differences, occlusion, and other situations is enhanced, compensating for the shortcomings of FPN [
17] and PAN [
18] in complex scenes. Through the branch diffusion operation, the positional information of small target caterpillar fungus is strengthened through multi-level feature diffusion, which is more flexible than the single bidirectional path of Bi-FPN [
19]. Multi-scale fusion + branch diffusion not only increases network depth but also solves the problem of the unclear characteristics of small target caterpillar fungus. This approach enhances key layer features of small targets while reducing confusion with other scale targets.
Furthermore, attention mechanisms have attracted considerable interest in recent years. To adapt to the MRFPN network, an attention mechanism was integrated into the multi-scale features. The Da-Conv module is embedded within the N-CSPDarknet53 backbone network with the objective of suppressing weed background interference through adaptive adjustment of the receptive field (illustrated in Figure 2 as the transition from the orange to the red box). The Da-Conv module is integrated into the low-level feature maps (e.g., the C3 stage), suppressing the propagation of background noise earlier than in Hu et al. [29], where attention is applied only in high-level feature fusion. The objective is to mitigate confusion between caterpillar fungus and the background by employing an attention-based adaptive weighted fusion mechanism, multi-scale fusion, and branch diffusion operations, thereby strengthening the features of small targets and improving feature extraction capability.
2. Method and Experimental Design
This study proposes an MRAA network based on YOLOv8s to address the problem of detecting caterpillar fungus in agricultural production and life. As illustrated in Figure 2, the MRAA network comprises two distinct components: MRFPN and N-CSPDarknet53. MRFPN consists of two operations: the multi-scale fusion operation and the branch diffusion operation. Firstly, the multi-scale fusion operation accepts multiple inputs of different scales; its interior contains an Inception-style module. Subsequently, the branch diffusion operation diffuses the features, rich in contextual information, to the various detection scales. Following feature enhancement, the individual feature maps are output to the subsequent detection heads. This section also introduces a new backbone network, designated N-CSPDarknet53, built on the CSPDarknet53 architecture. The integration of multiple Da-Conv modules significantly enhances the network's perceptual capability for small target caterpillar fungus.
As illustrated in
Figure 2, the red dashed line box signifies the network’s capacity for identifying caterpillar fungus characteristics, while the orange dashed line box denotes erroneous identification of the background as caterpillar fungus. With the incorporation of branch diffusion operation and Da-Conv, the network progressively focuses on the vicinity of caterpillar fungus, the features become increasingly discernible, and the interference from the surrounding background is diminished, thereby enhancing the accuracy of caterpillar fungus identification. Subsequently, as the depth of the network increases, MRAA focuses on the caterpillar fungus itself. This refinement leads to an enhancement in the precision of the characteristic and location information of the fungus, facilitating a gradual refinement of its distinguishing features, thereby effectively differentiating it from the background. The results section in
Figure 2 provides a visual representation of the identified caterpillar fungus.
2.1. Feature Fusion Network
The MRAA is composed of two components: the feature fusion network and N-CSPDarknet53. The extracted features are then used for target detection, which comprises the following stages:
FocusFeature Extraction: Initially, the features from the P4 layer are processed through FocusFeature Extraction. This module focuses on the key information of the input image and prepares it for subsequent network layers.
Next, a convolution operation is employed, and the features from disparate layers (e.g., the P4 and P5 layers) are integrated through a concat operation, enabling the network to leverage information at varied levels.
Subsequently, the C2f module conducts further in-depth processing on these features and outputs the feature map. The MRAA model architecture is illustrated in Figure 3.
As shown in Figure 4, FocusFeature is defined in MRFPN to improve the feature representation. [P3, P4, P5] are the feature maps extracted by the backbone network. Downsampling the P3 layer yields P3', which helps the model capture higher-level features and reduces computational cost. The P4 layer undergoes an ordinary convolution to obtain P4', and the P5 layer is upsampled and then convolved to obtain P5'. Finally, P5' is concatenated and fused with P3' and P4' to obtain the output feature map. These feature maps of different levels are concatenated and passed through a 1 × 1 convolution kernel before being combined with the aforementioned output feature map to form the FocusFeature module.
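As a rough illustration of the FocusFeature flow just described, a minimal PyTorch sketch follows. The module interface, channel widths, and the residual combination with the aligned P4 map are assumptions based on the description above, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusFeature(nn.Module):
    """Sketch of FocusFeature: align P3/P4/P5 to the P4 resolution and fuse."""
    def __init__(self, c3, c4, c5, out_ch=256):
        super().__init__()
        self.down3 = nn.Conv2d(c3, out_ch, 3, stride=2, padding=1)  # P3 -> P3' (downsample)
        self.conv4 = nn.Conv2d(c4, out_ch, 3, stride=1, padding=1)  # P4 -> P4' (plain conv)
        self.conv5 = nn.Conv2d(c5, out_ch, 3, stride=1, padding=1)  # upsampled P5 -> P5'
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)                # 1x1 fusion conv

    def forward(self, p3, p4, p5):
        p3_ = self.down3(p3)
        p4_ = self.conv4(p4)
        p5_ = self.conv5(F.interpolate(p5, scale_factor=2, mode="nearest"))
        fused = torch.cat([p3_, p4_, p5_], dim=1)   # concat the three aligned maps
        return self.fuse(fused) + p4_               # assumed residual combination
```

All three levels are brought to a common resolution before the concat, so the 1 × 1 fusion convolution mixes detail from P3 with semantics from P5 at every position.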
2.1.1. Multi-Scale Fusion Method
The Feature Pyramid Network (FPN) [17] is a widely utilized feature fusion technique in object recognition. To enhance modeling performance for objects of varying sizes and aspect ratios, it combines and generates multi-scale feature maps. As illustrated in Figure 5a, the initial step involves generating five layers of feature maps, denoted as [C1, C2, C3, C4, C5], based on the definition of FPN. Secondly, the generation path of C3 is modified: C3 is generated directly from P3 rather than being upsampled from C4, and it subsequently fuses with P2. Finally, we perform a sequence of downsampling and concatenation operations to generate a new feature map [N1, N2, N3, N4, N5], as proposed by the enhanced PAN [18] approach. N1 is generated directly from C1 without any intermediate operations. N2 is generated from N1 through the Conv3×3 module, and N3, N4, and N5 are generated by a series of fusion modules. The fusion module employs the C2f and Conv modules to adapt the feature map to a higher resolution and concatenate it with another. Taking N2 as an example: N2 is connected to C2, and the C2f module produces the output. Finally, the spatial dimensions of N2 are halved after passing through Conv3×3.
In Figure 5a, the model contains five detection heads connected at different network depths, originally designed to detect objects of different sizes [35]. Because the characteristics of some caterpillar fungus are not obvious and the target caterpillar fungus is small, finding a solution is particularly challenging.
For targets like caterpillar fungus, low-level features offer enhanced spatial and detailed information, which is crucial for boosting the precision of small target detection. Consequently, to preserve a greater amount of low-level feature data during the fusion process, we incorporate MRFPN in place of PAFPN. In
Figure 5c, MRFPN extracts from each feature layer of the backbone network, denoted {
C2,
C3,
C4,
C5}, a set of features of different scales. Subsequently,
C4 is integrated into the fusion process. Finally, the four-layer features are fused and outputted to obtain {F2, F3, F4, F5}. This method retains important low-level feature information at the prediction stage, while allowing for direct integration and acquisition of multi-scale features.
The original FPN (Feature Pyramid Network) fuses deep semantic features (C3–C5) with shallow detailed features (C1–C2) through a top-down path. As demonstrated in Figure 5c, the multi-scale fusion operation is reconstructed through the C3 generation path in the network structure. C3 is generated directly from P3, as opposed to being upsampled from C4 as in the traditional FPN. This approach retains more original detail and addresses the loss of information on small target caterpillar fungus in the deep features. A novel feature map [N1–N5] is then generated by means of downsampling and concatenation operations. For instance, N2 is derived from N1 via 3 × 3 convolution, and N3 is downsampled from N2 and spliced with P2 to create a cross-level feature interaction. Finally, through the FocusFeature horizontal connection, P3' (downsampled P3), P4' (P4 after ordinary convolution), and P5' (P5 after upsampling and convolution) are fused through a concat operation to form a mixed feature map containing multi-scale information. The multi-scale fusion operation integrates P3–P5 level features: the high-resolution P3 carries the morphological details of the caterpillar fungus, while the deep P5 features contain category semantics. Following fusion, detail–semantic complementarity is realized and small target recognition is enhanced.
UP(·), Conv3×3(·), Conv1×1(·), and C2f(·) denote the upsampling, Conv3×3, Conv1×1, and C2f modules, respectively. The multi-scale fusion process can be described by Equations (1) and (2).
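Since Equations (1) and (2) are not reproduced here, one plausible form consistent with the FocusFeature description above is the following (the exact operator composition is an assumption):

$$P_3' = \mathrm{Conv}_{3\times3}^{\,s=2}(P_3),\qquad P_4' = \mathrm{Conv}_{3\times3}(P_4),\qquad P_5' = \mathrm{Conv}_{3\times3}\big(\mathrm{UP}(P_5)\big) \quad (1)$$

$$F = \mathrm{C2f}\Big(\mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(P_3',\ P_4',\ P_5')\big)\Big) \quad (2)$$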
In the process of multi-scale feature fusion, we adopt Adaptive Spatial Fusion (ASF) [36] to assign different spatial weights to the features of different layers. The key features of small targets are highlighted by adaptive weight allocation, which dynamically adjusts the contribution of each scale during feature fusion and helps detect small targets. The structure is shown in Figure 5b.
In Figure 5b, ASF adaptively learns the fusion spatial weights of each scale feature map through rescaling and adaptive fusion. Following the notation of [36], let $x^{n\to l}$ denote the feature map of layer $n$ rescaled to layer $l$. In ASF1, 3 × 3 max pooling and 3 × 3 convolution are applied to the layer-3 feature map to obtain $x^{3\to1}$, and a 3 × 3 convolution is applied to the layer-2 feature map to obtain $x^{2\to1}$. In ASF2, a 3 × 3 convolution is applied to the layer-3 feature map to obtain $x^{3\to2}$, and a 1 × 1 convolution is applied to the layer-2 feature map, after which its size is adjusted to twice the resolution of the original image to obtain $x^{2\to2}$. In ASF3, a 1 × 1 convolution is applied to the layer-2 feature map, and then its size is adjusted to twice the resolution of the original image to obtain $x^{2\to3}$. Additionally, a 1 × 1 convolution is applied to the layer-1 feature map, and then its size is adjusted to four times the resolution of the original image to obtain $x^{1\to3}$. Adaptive feature fusion refers to multiplying the obtained $x^{1\to l}$, $x^{2\to l}$, and $x^{3\to l}$ by the weight parameters $\alpha^{l}$, $\beta^{l}$, and $\gamma^{l}$ to obtain the fused feature. The equation is as follows:
$$y_{ij}^{l} = \alpha_{ij}^{l}\,x_{ij}^{1\to l} + \beta_{ij}^{l}\,x_{ij}^{2\to l} + \gamma_{ij}^{l}\,x_{ij}^{3\to l} + \cdots \quad (3)$$

In the equation, the fused eigenvector at position $(i, j)$ is expressed as $y_{ij}^{l}$, where the feature mapping of the $n$-th layer is rescaled and fused to become the feature mappings $x_{ij}^{1\to l}$, $x_{ij}^{2\to l}$, and $x_{ij}^{3\to l}$ of the $l$-th layer; $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are learnable parameters that satisfy Equation (4), $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$. $y_{ij}^{l}$ is the generated feature vector. In the above equation, the ellipsis ($\cdots$) indicates that the network can have any depth.
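For concreteness, the following is a minimal PyTorch sketch of ASFF-style adaptive spatial fusion, in which per-pixel weights are produced by 1 × 1 convolutions and a softmax so that the weights sum to 1 at every position. The weight-generation details are assumptions following [36] rather than the exact MRAA implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    """ASFF-style fusion: learn per-pixel weights for three same-size feature maps."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per input produces a single-channel weight logit map
        self.weight_logits = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3):
        # x1, x2, x3: rescaled feature maps of identical shape (B, C, H, W)
        logits = torch.cat([w(x) for w, x in zip(self.weight_logits, (x1, x2, x3))], dim=1)
        # softmax over the three maps enforces alpha + beta + gamma = 1 per pixel
        alpha, beta, gamma = torch.softmax(logits, dim=1).chunk(3, dim=1)
        return alpha * x1 + beta * x2 + gamma * x3
```

Because the weights are computed per spatial position, a pixel dominated by small target detail in one scale can draw almost entirely from that scale while its neighbors draw from another.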
The integration of the ASF module has three principal benefits. Firstly, it enhances the accuracy of detecting objects at different scales: by intelligently combining features from various scales, the ASF module effectively addresses the challenges associated with detecting objects of different sizes, improving detection accuracy for objects of all sizes, particularly in complex environments. Additionally, the ASF module minimizes feature conflicts through dynamic adjustment of scale-specific feature contributions, which helps reduce spatial misalignment and enhances detection reliability. Furthermore, it optimizes feature fusion efficiency, allowing the model to maintain high accuracy with fewer detection iterations or at lower resolutions, thus improving overall detection performance.
2.1.2. Branch Diffusion Method
As shown in Figure 6a, this paper employs a self-developed branch diffusion method. The feature map C3 outputted to N2 is replaced by C3'. C3' is obtained by upsampling C4 and fusing the result with C3 before being passed to the next layer (C2). The C2 branch diffuses and outputs a single feature map, N2, which is rich in semantic information from both paths, including fine-grained features and positional information from the lower-layer features, and finally diffuses to the second detection scale. Path 1 generates a feature map that exclusively contains deep semantic information, thereby establishing C3' as the primary candidate for fusion with C3. A similar process occurs with the feature map C4 outputted to N3, which is replaced by C4'. C4' is obtained by upsampling C5, and through the fusion of C4' and C3 it is passed to the next layer (N3). The C3 branch diffuses and outputs a single feature map, N3, which is rich in semantic information from both paths and finally diffuses to the third detection scale. Likewise, Path 1 generates a feature map that exclusively contains deep semantic information, rendering C4' the optimal choice for fusion with C3 in Figure 6b. The same operation is adopted in Figure 6c. The diffusion output ultimately generates a single feature map, designated N1, which subsequently diffuses to the first detection scale.
As shown in Figure 6c, C5 comprises features associated with larger objects, which are amplified through the FPN network and subsequently input into C1, and from there into N1 and Head1. Consequently, Head1, which was originally designed for small target detection, has difficulty obtaining the necessary small target features. The direct transfer of C5 to C1 can easily lead to confusion between the scales of large and small targets. To mitigate the degradation of the key layer information of small targets caused by the branch diffusion of the two paths, and to prevent scale confusion between large and small targets, we propose directly outputting the feature maps C4 and C5 to obtain N4 and N5, as illustrated in Figure 6d. The branch diffusion operation can be expressed in equation form, as shown in Equation (5).
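Using the UP(·), Concat(·), and C2f(·) operators defined in Section 2.1.1, Equation (5) can plausibly be written as follows (the exact operand composition is an assumption based on the description above):

$$C_3' = \mathrm{UP}(C_4) \oplus C_3,\qquad N_2 = \mathrm{C2f}\big(\mathrm{Concat}(C_2,\ C_3')\big),$$

$$C_4' = \mathrm{UP}(C_5),\qquad N_3 = \mathrm{C2f}\big(\mathrm{Concat}(C_3,\ C_4')\big),\qquad N_4 = C_4,\qquad N_5 = C_5 \quad (5)$$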
The branch diffusion mechanism disseminates features that are rich in contextual information across a range of detection scales, ensuring strong performance. Consequently, the network incorporating multi-scale fusion and branch diffusion was selected as the final MRFPN.
In comparison with the baseline, MRFPN accentuates the features associated with small targets to a greater extent. The presence of multi-scale fusion and branch diffusion operations ensures that the features of large targets persist in the shallow feature maps, though they no longer dominate. Instead, the features of small targets become prominent. This contributes to the ability of the shallow detection head to focus more acutely on small targets.
2.2. The Main Backbone: N-CSPDarknet53
In the task of object detection, the attention mechanism is usually adopted to focus the attention on the area of the object and to suppress the interference from non-object areas, such as the background. For caterpillar fungus and weeds with highly similar surface colors, the interference from color and background is particularly significant.
In order to enhance the feature extraction ability of the backbone network, an improved variant has been proposed. This variant is named N-CSPDarknet53 and is based on the CSPDarknet53 network used in YOLOv8s. This architecture integrates the
C2f and SPPF modules from the original YOLOv8s architecture. The N-CSPDarknet53 network incorporates an attention-based adaptive weighted fusion mechanism in the form of the Da-Conv module. The Da-Conv module is illustrated in
Figure 7b.
In the improved Da-Conv module, we propose an adaptive weighted fusion mechanism (AWF). AWF has a two-branch structure. In the upper branch, $X$ represents the feature map at any network depth. $X$ is processed by the CBR operation to obtain $X_1$. Afterwards, inspired by ECA [37], we process $X_1$ through spatial attention (SA) to further explore the representative regions of the target. Specifically, as shown in the blue dashed rectangular box in Figure 7, the SA block is composed of CBR operations, channel-wise operations, the Sigmoid function, and 1 × 1 convolution operations. Based on these operations, the network obtains the importance $W_{SA}$ of pixels in the spatial domain. To distinguish the target from the background, the input feature map $X_1$ of the SA block is multiplied by $W_{SA}$. This operation helps the network focus its attention on the target area, as shown in the following equation:

$$F_{up} = X_1 \otimes W_{SA},\qquad W_{SA} = \mathrm{Sigmoid}\Big(\mathrm{Conv}_{1\times1}\big([\mathrm{Ce}(X_1);\ \mathrm{Ca}(X_1)]\big)\Big) \quad (6)$$

where $F_{up}$ is the SA block output and Ce(−) and Ca(−) are the channel average and channel maximum algorithms, respectively.
The architecture of the lower branch of the AWF is analogous to that of the upper branch, with two distinct variations. Firstly, the input differs: in the lower branch, $(1 - W_{SA})$ is used to process the feature map. The reverse operation in the lower branch, as opposed to the conventional view in the upper branch, enables the network to distinguish the target from the background from an alternative perspective. By processing $X$ from two perspectives simultaneously, the network can better understand the target area and more easily identify important features of the target, thereby suppressing interference from non-target areas such as the background. A further distinction lies in the use of a channel attention (CA) block in place of the spatial attention (SA) block. As illustrated in the yellow dashed rectangle in Figure 7, the CA block's configuration resembles that of the SA block, the primary difference being the replacement of channel-wise operations with spatial operations. The output $F_{low}$ of the lower branch can be obtained through the following equation:

$$F_{low} = X_2 \otimes W_{CA},\qquad X_2 = \mathrm{CBR}\big((1 - W_{SA}) \otimes X\big) \quad (7)$$

where $W_{CA}$ represents the weight obtained by the CA block in the channel domain. After processing $X$ with two branches from different perspectives, fusing the two tensors through a summation operation improves the effectiveness of the feature representation and addresses issues related to color and background interference. The predicted target $F_{out}$ can be obtained as follows:

$$F_{out} = F_{up} \oplus F_{low} \quad (8)$$

where ⊕ represents the summation operation.
As shown in Figure 7b, the Da-Conv module enhances the capacity to discern small objects from the background by means of a double-branch attention mechanism and reverse feature processing. The Da-Conv module employs a two-branch structure, comprising spatial attention (SA) and channel attention (CA), to enhance target features from the spatial and channel dimensions, respectively. Spatial attention (SA): the spatial weights are extracted by global average pooling (GAP) and global maximum pooling (GMP), and the target region is dynamically focused (as shown in Equation (6)). In shallow features, the edge and texture information of small targets is easily submerged by background noise; SA enhances the spatial saliency of targets by inhibiting the activation of irrelevant areas (such as weeds). Channel attention (CA): the reverse feature $(1 - W_{SA}) \otimes X$ is recalibrated at the channel level to screen channels that are sensitive to classification. In scenes with similar colors, CA improves the channel response of the target by enhancing channels that differ significantly from the target color, such as brightness or texture. Reverse feature processing, $(1 - W_{SA})$, applied to the input features by the lower branch, is essentially an adversarial feature enhancement strategy. The reverse operation forces the network to learn the difference between the caterpillar fungus and the background from a complementary perspective, especially when the colors are similar (such as a green object against a green background); the reverse branch reduces false detections by comparing the response difference between the original and reverse features to capture subtle gradient changes such as edges or shadows (the orange box in Figure 2 illustrates the reduced interference).

Finally, the double-branch outputs in the Da-Conv module achieve cross-dimensional interaction through adaptive weighted fusion. This process combines the spatial positioning accuracy of the SA branch with the channel discrimination of the CA branch to form a more robust feature representation. Compared with a single attention mechanism (such as SE or CBAM), the two-branch design avoids the feature bias caused by fixed weights through dynamic weight allocation (α, β, and γ), improving the model's adaptability to complex backgrounds. Compared with SE, CBAM, and other modules, Da-Conv increases APS by 0.017–0.037, which verifies the effectiveness of the double-branch structure.
Through weighted fusion, these operations promote the establishment of cross-dimensional interactions, combining channel information with spatial information and overcoming the fragmentation problem of channel–spatial information. The Da-Conv module is a highly portable module that can be inserted into any model. This facilitates the allocation of attention, significantly improving the ability to distinguish targets from the background and effectively reducing false detections. This is achieved by extracting features from multiple dimensions and capturing the relationships between these features.
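To summarize the two-branch structure, the following is a minimal PyTorch sketch of a Da-Conv-style block under the notation above ($W_{SA}$, $W_{CA}$, reverse features $(1 - W_{SA}) \otimes X$). The pooling choices, convolution sizes, and the plain summation fusion are assumptions consistent with the description, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conv + BatchNorm + ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.block(x)

class DaConv(nn.Module):
    """Two-branch attention: spatial attention (SA) on the features and channel
    attention (CA) on the reversed features, fused by summation."""
    def __init__(self, ch):
        super().__init__()
        self.cbr_up, self.cbr_low = CBR(ch), CBR(ch)
        self.sa_conv = nn.Conv2d(2, 1, 1)           # SA: 2 pooled maps -> 1 weight map
        self.ca_conv = nn.Conv2d(2 * ch, ch, 1)     # CA: pooled vectors -> channel weights

    def forward(self, x):
        # upper branch: spatial attention (Equation (6))
        xu = self.cbr_up(x)
        ce = xu.mean(dim=1, keepdim=True)           # Ce(-): channel average
        ca = xu.amax(dim=1, keepdim=True)           # Ca(-): channel maximum
        w_sa = torch.sigmoid(self.sa_conv(torch.cat([ce, ca], dim=1)))
        f_up = xu * w_sa
        # lower branch: channel attention on the reversed features (Equation (7))
        xl = self.cbr_low((1.0 - w_sa) * x)
        avg = xl.mean(dim=(2, 3), keepdim=True)     # spatial average pooling
        mx = xl.amax(dim=(2, 3), keepdim=True)      # spatial maximum pooling
        w_ca = torch.sigmoid(self.ca_conv(torch.cat([avg, mx], dim=1)))
        f_low = xl * w_ca
        return f_up + f_low                         # summation fusion (Equation (8))
```

The key design choice is that the lower branch never sees the regions the upper branch has already claimed: it operates on $(1 - W_{SA}) \otimes X$, so the two branches are pushed toward complementary evidence about what is target and what is background.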
2.3. Dataset and Experimental Setup
2.3.1. The Construction of the Dataset
In order to collect images of caterpillar fungus in its natural state, this study used a professional Canon R8 camera to capture images from June to July 2024. These images depict caterpillar fungus in its natural environment, encompassing various weather conditions, overlaps, and complex lighting conditions to thoroughly simulate the actual scene in the natural environment, as illustrated in Figure 8. Caterpillar fungus was photographed under clear and snowy conditions to simulate different weather in the natural environment; daytime natural light and night light were chosen to cover different lighting conditions; and caterpillar fungi in independent and overlapping situations were selected to recreate the real positions of caterpillar fungus in the natural environment. It is important to note that different degrees of shielding change the shape characteristics of the caterpillar fungus, resulting in non-obvious features of small targets. The processed images and their annotations were integrated into a structured dataset with complex backgrounds and diverse target scales. The validation set contains a total of 664 targets, of which 373 (56%) are classified as small targets. The training set comprises a total of 5100 targets, of which 2267 (44%) are classified as small targets. This paper has collected a substantial number of caterpillar fungus images under various conditions to capture the diversity of caterpillar fungi in different environments, thereby meeting the data requirements for algorithm training.
2.3.2. Evaluation Indicators
To confirm the validity and credibility of the MRAA network, the standard COCO accuracy metrics used in this study include AP, AP50, APS, APM, and APL [38]. AP50 represents the average precision at an IOU of 0.50. APS, APM, and APL represent the accuracies for small, medium-sized, and large targets, respectively, while Params indicates the number of parameters used. Higher values of each accuracy indicator suggest better performance of the detection model. The specific evaluation indicators are the following:
Accuracy (precision) is the proportion of true positives among all positive predictions. Recall measures the ability of a classification model to recognize positive (class of concern) samples. TP represents the quantity of correct identifications and FN represents the quantity of missed identifications. AP fully considers both the precision and recall of the model and is considered one of the most challenging metrics. The equation for AP is shown as follows: by accumulating and summing precision at different recall rates, the value of the AP metric is obtained.
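For reference, the standard forms of these metrics (assuming the usual COCO-style definitions, since the original equations are not reproduced here) are:

$$\mathrm{Precision} = \frac{TP}{TP + FP},\qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \;\approx\; \sum_{n}\big(R_{n} - R_{n-1}\big)\,P_{n}$$

where $FP$ denotes false positives, and $P_n$ and $R_n$ are the precision and recall at the $n$-th confidence threshold.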
The mAP metric is the average value of AP across all categories, where AP is calculated for each category. In this article, mAP is evaluated at an IOU threshold of 0.5.
2.3.3. Experimental Setup
As a version of the YOLO series, YOLOv8 combines the advantages of previous generations of YOLO and further improves the performance of object detection by introducing a deeper network structure, optimized loss function, and training strategy. YOLOv8 adopts a more efficient network architecture, and further optimizes the feature extraction module, so that it can better capture the target features of different scales and complex backgrounds [
39].
In this study, the small model of the YOLOv8 series was selected as the benchmark model for the experiments. On the one hand, the accuracy of YOLOv8s approaches or even exceeds that of some large models, while YOLOv8s retains advantages in speed and resource consumption. This balance makes it an ideal benchmark for verifying the effectiveness of the new method (the MRAA network). On the other hand, YOLOv8s performs well among lightweight models in terms of parameter count and computational efficiency, which meets actual deployment requirements. Moreover, the small model has a low parameter count and computational overhead, making it suitable for resource-limited agricultural scenarios (handheld devices) [40]. Therefore, this study chooses the small model of the v8 series as the benchmark model.
The parameters of the software and hardware environments used in the experiment are presented in
Table 1, and the experimental parameters configuration is in
Table 2. In this study, 300 training epochs were used. This value was chosen because experimental verification shows that the APS of the MRAA network increases by 12% (0.180→0.202) at 300 epochs, indicating that the model has fully converged by that point. Initial tests determined that 300 epochs are sufficient to reach the minimum loss on the validation dataset; increasing the epoch count may not improve accuracy and may even result in overfitting. The relevant literature shows that, for small targets in a dataset that is small but highly complex, 300 epochs is consistent with experience [11,12]. In addition, the MRAA model needs to be deployed on embedded devices, so efficiency should be taken into account during training; 300 epochs avoids over-training while ensuring accuracy, meeting the requirements of lightweight deployment. In this study, the dataset was divided into a training set, a test set, and a validation set in proportions of 70%, 20%, and 10%, respectively. Two different network architectures (YOLOv8s and MRAA) were used for the experiments.
Finally, in the experimental part of this paper, we use the Strawberry-DS [41] dataset for testing. Strawberries in the growing period have become a typically challenging scene for agricultural small target detection due to their small fruit size, color highly similar to the background weeds, and complex growth environment (leaf occlusion rate > 40%). As a typical small target detection case, growing strawberries are challenging and can verify the validity of the model, providing an ideal test bed for verifying the fine-grained feature extraction capability of MRAA networks. According to statistics, the manual false detection rate in strawberry picking is as high as 23%, so this verification also offers practical value for automatic strawberry picking.
2.4. Ablation Experiment
2.4.1. Component Ablation Experiment of MRAA
The Da-Conv and MRFPN modules are each added to the YOLOv8s model and four sets of comparative ablation experiments are performed. These experiments are trained and tested using consistent datasets and parameters to ensure comparability. The MRFPN and Da-Conv in
Table 3 represent the two new modules.
As shown in Figure 9, the MRAA network achieves significant improvements in the accuracy of targets at different scales: an increase of 0.061 in AP, 0.022 in APS, 0.047 in APM, and 0.133 in APL. Compared with the baseline (YOLOv8s) network, the accuracy of small targets (APS) increased by 12%, and the accuracy of medium targets (APM) increased by 11%. These experiments prove the effectiveness of this network in enhancing the detection of small and medium targets. When the Da-Conv module is used alone on the baseline model, AP increases by 0.035, APS by 0.015, APM by 0.04, and APL by 0.109.
When MRFPN is used alone, accuracy improves over the baseline by 0.018 in AP, 0.011 in APS, 0.017 in APM, and 0.012 in APL. The accuracy improvements are particularly significant for small and medium-sized targets. This study adopts shallow feature mapping to emphasize the important features of small targets, thereby increasing the chance of accurately detecting them as positive instances. This adjustment changes the training dynamics of the model, resulting in a more pronounced bias towards small targets.
2.4.2. Ablation Experiment with MRFPN
For ease of description, the multiscale fusion method and the branch diffusion method will be referred to as ‘Method1’ and ‘Method2’, respectively. The ablation experiments were performed on YOLOv8s, with the environment and parameter settings remaining unchanged throughout the experiment.
Table 4 shows the comparative experiments integrating both Method1 and Method2 into the model.
In
Table 4, after adding the multi-scale fusion method alone, the
AP of the model increased by 0.002,
APS increased by 0.006,
APM increased by 0.022, and
APL decreased by 0.011. After adding the branch diffusion method alone, the
AP of the YOLOv8s + Method2 model increased by 0.036,
APS increased by 0.013,
APM increased by 0.041, and
APL decreased by 0.018.
In the experimental results shown in
Figure 10, both
APS and
APM have increased by varying degrees, while the accuracy of
APL has decreased when using a single method alone. In the YOLOv8s + Method1 + Method2 experiment, the combination of the two methods not only improved the
APL but also further enhanced the
APS.
This discovery confirmed the improved accuracy of the multi-scale fusion and branch diffusion operations while avoiding the problem of confusion between feature maps of different scales. It is worth noting that, for all targets except large ones, adding Method1 or Method2 separately still improves detection accuracy.
2.4.3. The Ablation Experiment with the Da-Conv Module
This paper selects four mainstream attention mechanisms for comparison: squeeze-and-excitation networks (SE) [
42], GAM [
43], CBAM [
44] and EMA [
45]. Da-Conv is a dual attention mechanism module that adopts weighted fusion; here, “Conv” indicates the use of regular convolution. In this part, a total of six experiments were conducted.
Table 5 presents the comparative experiments with different attention mechanisms combined with YOLOv8s.
Firstly, the weighted fusion Da-Conv module is added to the YOLOv8s network for comparison. It can be seen that the Da-Conv module with learnable weights demonstrates stronger capability and can better enhance the accuracy of the model. Compared with the baseline, Da-Conv increases AP by 0.105, APS by 0.017, APM by 0.04, and APL by 0.178. The reason is that the module can extract features from multiple dimensions and capture the relationships between these features.
Secondly, in
Figure 11, four additional experiments were conducted using the YOLOv8s network. The purpose was to compare the changes in detection accuracy under the effects of Da-Conv and mainstream attention mechanisms. SE, GAM, EMA, and CBAM were compared, and these attention modules were all located at the same position. SE, EMA, and CBAM improved the detection accuracy for the target, but the improvement was not as significant as that of Da-Conv. This fully demonstrates that Da-Conv is more effective than SE, EMA, and CBAM.
2.4.4. Comparison with Advanced Algorithms
The MRAA network was compared with some other advanced algorithms. In the comparative experiments, the experiments evaluated Faster R-CNN [
46], Dynamic R-CNN [
47], Mask R-CNN [
48], RT-detr [
49], VitDet [
50], Baseline, YOLOv5s [
51], YOLOv9s [
52], TS-DETR [
53], NanoDet-Plus [
54] and YOLOv7tiny [
55] from various aspects. The experimental parameters and the dataset used were kept the same across all experiments.
In
Table 6, the APS of MRAA reaches 0.202. In terms of medium-scale target detection, the MRAA network used in this paper outperforms Faster R-CNN (0.396), Mask R-CNN (0.412), Dynamic R-CNN (0.405), RT-detr (0.418), YOLOv5s (0.420), VitDet (0.415), YOLOv7tiny (0.427), YOLOv9s (0.463), NanoDet-Plus (0.455), TS-DETR (0.469), and YOLOv8s (0.428). Moreover, in terms of large-scale target detection, the MRAA network achieves the highest accuracy of 0.659.
In
Table 6, MRAA has FLOPs similar to YOLOv8s: it reduces the use of computing resources while significantly improving accuracy. Compared with other advanced models, MRAA is still classified as a lightweight model. As shown in
Figure 12, models closer to the origin represent more lightweight models, and the MRAA network we proposed falls into the category of lightweight models. This is because the MRAA network employs multi-scale fusion and branch diffusion operations, which enable features at each scale to possess detailed contextual information. These features with rich contextual information are then diffused to each detection scale via the diffusion mechanism. Finally, the background and target are distinguished and there is an improvement in the ability to capture key features. This improves the accuracy of small target detection.
As demonstrated by the experimental findings, the proposed MRAA model is superior to YOLOv9s (0.194) and NanoDet-Plus (0.187) in terms of the APS index, while exhibiting an 18% parameter reduction compared to YOLOv9s. In terms of model efficiency, the FLOPs of MRAA are 21 G, significantly lower than TS-DETR (132 G), and the parameter count of MRAA is 9.88 M, better than YOLOv9s (12.1 M) and the original base model YOLOv8s (11 M), while maintaining performance. In terms of processing targets at other scales, MRAA achieves an APM index of 0.475, which is superior to all comparison models. A comparison of the transformer architecture with MRAA's branch diffusion mechanism reveals the former to be less suited to the target continuity characteristics of agricultural scenarios. Extensive experimentation demonstrates that MRAA attains cutting-edge performance, achieving a 4.1% improvement over the highly advanced YOLOv9s in terms of the AP metric. The efficiency of the model is particularly advantageous in agricultural scenarios, and the multi-scale fusion mechanism is more suitable for continuous small target detection than a pure transformer architecture.
As demonstrated in
Figure 13, a comparison is made between the performance of the MRAA model and several advanced models. The position of a model in the figure indicates its performance: the further to the right, the longer the inference time and the more complex or less efficient the model. The higher the value on the Y-axis, the better the model performs in the target detection task. The size of the bubbles indicates the number of parameters, with larger bubbles corresponding to more complex models and higher computing resource requirements. The upper-left models exhibit low inference time and high AP, favoring both accuracy and efficiency, rendering them suitable for real-time applications; additionally, their parameter counts are minimal and within the range of lightweight models. In comparison with alternative models, our enhanced model offers rapid inference speed, high precision, and a reduced number of parameters, thereby achieving a balance between precision and efficiency. In conclusion, the MRAA network proposed in this paper is more universal and interpretable. As a single-stage algorithm, the MRAA network has the potential to be further optimized.
2.4.5. Robustness Testing in Different Scenarios
The dataset of this study was collected in the natural environment. In order to test the performance of the model in the natural environment, we simulated the real environment under different weather conditions and different overlapping and complex lighting conditions, and further tested and optimized the performance of the model. The experimental results for the MRAA model tested in different scenarios are shown in
Table 7. The experimental results of the basic model tested in different scenarios are shown in
Table 8.
As demonstrated in
Table 9, under the same training set, the
AP index of MRAA is 24% superior to that of Faster R-CNN and 12% better than YOLOv5s, particularly in the domain of small target detection (
APS), where it exhibits a performance more than 20% ahead of the competition. A comparison of MRAA with RetinaNet, which has been optimized for small targets, reveals an enhancement in
APL of 3.7% (0.635→0.659) through the utilization of MRFPN’s multi-scale retention mechanism.
As demonstrated by the experimental findings, the APS of the basic model is 0.180 in sunny conditions but declines to 0.153 in rain and snow environments, a 15% reduction in recognition accuracy for small targets. In contrast, the MRAA model significantly enhances feature expression capability through the multi-scale fusion and branch diffusion operations of MRFPN. The experimental results demonstrate that the AP stability of MRAA in complex scenarios is enhanced by approximately 8%, the AP decline is reduced to 3.5% in rain and snow weather, and the recognition accuracy for small targets (APS) is improved by 14% compared with the basic model. The APL of the basic model is 0.519 under natural light, and the MRAA model increases it to 0.553 (+6.5%) through the multi-scale feature retention mechanism of MRFPN. In the night scene, the AP50 of the basic model is 0.625, and the MRAA model increases it to 0.685 (+9.6%) through multi-scale context fusion; the recognition accuracy for small targets (APS) is increased by 12% compared with the basic model. In the overlapping target scenario, the APM of the basic model is 0.418, and the MRAA model enhances it to 0.452 (+8.1%) via the branch diffusion operation; the recognition accuracy for small targets (APS) is improved by 10% compared with the basic model. In summary, the MRAA network enhances detection robustness in complex scenarios through innovation in its feature fusion architecture. Notably, the network demonstrates particular proficiency in small target detection, as evidenced by a 12% increase in APS, and in environmental interference suppression, marked by a 59% decrease in the false detection rate. These findings offer a novel technical framework for the intelligent detection of caterpillar fungus.
2.5. Ablation Experiments Using Other Datasets
We conducted a comparative analysis with the baseline (YOLOv8s) model on the Strawberry-DS [
41] dataset. As shown in Table 10, when using Strawberry-DS as the dataset, the APS value of the MRAA model increased by 0.036 compared with the baseline. The MRAA model also shows significant improvements in the APM and APL metrics, increasing by 0.035 and 0.028, respectively, which proves the effectiveness of the branch diffusion operation. This experiment shows that MRAA significantly improves small target accuracy on other datasets.
The ablation experiment involved selecting images of strawberry fruits and flowers under different shooting angles and varying degrees of occlusion. On the one hand, compared with the basic model, MRAA effectively reduces the probability of missed and false detections and distinguishes the target from the background using the Da-Conv module. On the other hand, MRAA improves the detection accuracy of strawberry fruits and flowers to a certain extent, effectively addressing issues caused by obscure features and low accuracy. The effectiveness and robustness of the proposed MRAA model are thus further demonstrated.
To further validate the effectiveness of the improved model in a wider range of agricultural scenarios, a comparative analysis was performed with the baseline (YOLOv8s) model on the APHID-4K [
56] dataset. As shown in Table 11, both MRFPN and Da-Conv improve the detection accuracy of the model. When APHID-4K is used as the dataset, the MRAA model improves the APS value by 0.036 compared to the baseline. The MRAA model also shows significant improvement in the APM and APL indexes, which proves the effectiveness of the branch diffusion operation. The experiment shows that MRAA achieves enhanced detection accuracy for small targets across different datasets, improving its generalization ability in different agricultural scenarios.