Article

CGT-YOLOv5n: A Precision Model for Detecting Mouse Holes Amid Complex Grassland Terrains

College of Computer and Information, Inner Mongolia Agricultural University, Hohhot 010018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 291; https://doi.org/10.3390/app14010291
Submission received: 16 November 2023 / Revised: 21 December 2023 / Accepted: 27 December 2023 / Published: 28 December 2023

Abstract

This study employs unmanned aerial vehicles (UAVs) to detect mouse holes in grasslands, offering an effective tool for grassland ecological conservation. We introduce the specially designed CGT-YOLOv5n model, addressing long-standing challenges UAVs face, particularly the decreased detection accuracy in complex grassland environments due to shadows and obstructions. The model incorporates a Context Augmentation Module (CAM) focused on improving the detection of small mouse holes and mitigating the interference of shadows. Additionally, to enhance the model’s ability to recognize mouse holes of varied morphologies, we have integrated an omni-dimensional dynamic convolution (ODConv), thereby increasing the model’s adaptability to diverse image features. Furthermore, the model includes a Task-Specific Context Decoupling (TSCODE) module, independently refining the contextual semantics and spatial details for classification and regression tasks and significantly improving the detection accuracy. The empirical results show that when the intersection over union (IoU) threshold is set at 0.5, the model’s mean average precision (mAP_0.5) for detection accuracy reaches 92.8%. The mean average precision (mAP_0.5:0.95), calculated over different IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, is 46.2%. These represent improvements of 3.3% and 4.3%, respectively, compared to the original model. Thus, this model contributes significantly to grassland ecological conservation and provides an effective tool for grassland management and mouse pest control in pastoral areas.

1. Introduction

Grassland ecosystems play a vital role in the earth’s wellbeing because they notably influence soil conservation, the water cycle, climate regulation, biodiversity, and human livelihood [1].
In recent years, the Inner Mongolia Autonomous Region grasslands have suffered severe mouse infestations, posing significant challenges to local sustainable development and the livestock industry [2]. Inner Mongolia’s natural grasslands, covering approximately 27% of China’s total grassland area, span about 788,000 square kilometers [3]. Of this area, around 42,000 square kilometers are affected by mouse damage, with 17,300 square kilometers undergoing severe degradation. This situation not only impedes the development of the livestock industry but also intensifies grassland degradation and the risk of disease outbreaks. Therefore, effectively monitoring and controlling mouse infestations are essential for protecting the ecological environment and promoting sustainable livestock farming in the region. Traditional methods of monitoring rodents in grasslands, such as counting mouse holes, are simple but limited by high costs, time consumption, and restricted coverage [4]. The advancement of UAV remote sensing technology offers new opportunities for grassland rodent damage monitoring, enabling rapid and accurate assessment of rodent infestations and aiding in the identification and quantification of mouse holes. Additionally, despite the uneven distribution and diversity of grassland resources globally, the application of UAV technology in detecting mouse holes in Inner Mongolia and other grassland regions holds universal significance, providing valuable insights and methods for global grassland management [5]. In prior studies, numerous scholars have utilized drones in conjunction with machine learning techniques to detect mouse holes. Amin et al. utilized low-altitude UAV remote sensing to monitor giant gerbil holes in the Junggar Basin by distinguishing them from other features [6]. Tao et al. investigated the distribution patterns of giant gerbil holes in desert forests through UAV remote-sensing imagery [7]. Zhou et al. employed low-altitude UAV remote sensing with object-oriented template matching and support vector machine methods to detect and identify mouse holes in the Sanjiangyuan region [8]. Di et al. employed the maximum likelihood method and object-oriented classification to identify mouse holes [9].
Although the studies above succeeded by combining low-altitude drone remote sensing with traditional machine learning techniques, enhancing the detection accuracy and speed remains a challenge, especially against complex grassland backgrounds. Traditional machine learning methods typically require manually designed features and substantial prior knowledge, which may struggle to adapt to different environments and scenarios. In recent years, the advent of deep learning, particularly convolutional neural networks (CNNs), has brought revolutionary transformations to image processing and object detection. Unlike traditional machine learning methods, deep learning can automatically learn and extract features from images and offers a stronger generalization ability, opening new possibilities for improving the accuracy and real-time performance of mouse hole detection. Research indicates that models based on deep learning outperform those based on traditional machine learning in terms of accuracy [10].
By integrating low-altitude drone remote sensing into deep learning, the detection and identification accuracy can be effectively enhanced. Wang Shubo’s team applied convolutional neural networks to classify three types of weeds and crops, achieving a high recognition rate of 95.6%. They also developed a method for detecting the density of individual weeds, significantly contributing to precision agriculture in terms of targeted application and monitoring [11]. Sun Yu and colleagues proposed a real-time monitoring method using unmanned aerial vehicles (UAVs) based on deep learning, employing the SSD300 framework for the rapid identification of pest-infested pine forests, achieving a test accuracy rate of 97.22% [12]. Zhou Su and his team integrated UAV remote sensing with the Mask R-CNN model to develop a novel method for identifying and segmenting grassland features, specifically for monitoring rodent damage in the Zoige Grassland, providing critical data support for the protection of grassland productivity and sustainable development [13]. Cui Bochao’s team combined UAV remote-sensing and machine-vision technology, utilizing YOLOv3 and its lightweight version to achieve the automatic recognition and localization of rodent burrows in desert forests, with an average precision rate of 92.37%, thereby enhancing the efficiency of rodent damage monitoring [14].
With the advancement of cutting-edge technologies, the widely used YOLOv5, the more recently released YOLOv8, and the popular DETR family of models have demonstrated commendable performance in object detection applications.
Yan Yuncai and his team combined ground data with UAV remote-sensing images, employing the YOLOv5s algorithm to detect diseases and pests on kiwi fruit leaves, achieving a high precision rate of 99.54% and a recall rate of 99.24%, providing technical support for the management of kiwi fruit orchards [15]. Leng Ruixuan developed an improved YOLOv8-based algorithm for detecting foreign objects on power transmission lines, enhancing the detection accuracy for everyday objects like bird nests and balloons, with the average detection precision increased to 94.91% [16]. Wang Z and colleagues proposed an enhanced YOLOv8m model for detecting foreign objects on high-voltage power lines, integrating new modules and loss functions and significantly improving the detection accuracy [17]. Du Yufeng and his team utilized a combination of CNN and a transformer, adopting the DETR network to automatically detect landslides, achieving an average accuracy rate of 0.997 and demonstrating its efficiency and precision [18].
Although the latest mouse hole detection model based on YOLOv3 has achieved incremental progress, some difficulties still exist in detecting mouse holes on complex grassland backgrounds. For instance, the shadows of weeds and rocks could be misidentified as mouse holes, and detecting mouse holes obscured by weeds also presents a challenge. Additionally, the YOLOv3 model has a slower detection speed and larger model size, which are not conducive to deployment in drone equipment. To address this problem and ensure suitability for deployment in resource-limited mobile devices such as drones, You Only Look Once version 5 nano (YOLOv5n for short) was selected for this study. YOLOv5n [19] is a lightweight target-detection model with the smallest size among the existing YOLO series models, making it suitable for resource-limited devices and well suited to detecting small targets such as mouse holes. Despite its smaller model size and higher detection precision than YOLOv3, YOLOv5n still struggles on complex grassland backgrounds, frequently misidentifying the shadows of rocks, weeds, and livestock feces as mouse holes. Consequently, in this study, a CGT-YOLOv5n model was constructed for mouse hole detection with a satisfactory performance on complex grassland backgrounds.
The main contributions of this study are as follows:
1.
A Context Augmentation Module (CAM) was introduced to enhance feature extraction and fusion by integrating contextual information with an adaptive fusion method (AFM). This approach uses dilated convolutions with different dilation rates to capture contextual information from various receptive fields. The adaptive fusion component then filters conflicting information and reduces semantic differences, thereby enhancing the model’s ability to recognize mouse holes in images.
2.
The Task-Specific Context Decoupling (TSCODE) header was utilized to separate the classification and localization tasks in the model-detection process. Feature maps with weak spatial but vital semantic information are employed for classification, while high-resolution feature maps containing detailed edge information are utilized for localization. This approach facilitates a better regression of object boundaries, resulting in more accurate localization and classification by the model.
3.
The omni-dimensional dynamic convolution (ODConv) technique was adopted. Unlike traditional dynamic convolution, ODConv enables the model to adapt to different input images by applying attention along all dimensions of the convolution kernels. With these modifications, the model can allocate four-dimensional weights to the convolutional layers and generate convolution kernels suited to the input images.
The remainder of this paper is organized as follows: Section 2 introduces the enhanced CGT-YOLOv5n model and provides a detailed exposition of three improvement methods. Section 3 discusses data acquisition, the experimental environment, and the experimental results, along with an analysis and discussion. Section 4 summarizes the research, outlines the limitations of the new model, and presents a perspective on future directions.

2. Models and Methods

2.1. CGT-YOLOv5n Model

The CGT-YOLOv5n model is designed to detect mouse holes effectively amidst complex grassland backgrounds. This model is an improvement on YOLOv5n, aimed at adapting to resource-limited mobile devices such as drones while addressing the insufficient recognition accuracy of traditional models in complex environments. CGT-YOLOv5n retains the lightweight properties of YOLOv5n, rendering it exceptionally suitable for devices with limited computational capabilities. This adaptation ensures that the model operates efficiently on such devices without considerably affecting the performance. The model is meticulously optimized for detecting mouse holes in intricate grassland terrains. Despite YOLOv5n’s already high detection precision, CGT-YOLOv5n further augments the capability to recognize features against complex backgrounds.
As shown in Figure 1, the CGT-YOLOv5n model first integrates a Context Augmentation Module (CAM), enhancing feature fusion by amalgamating contextual information. This elevates the model’s ability to discern mouse holes amidst grassland backdrops while reducing misidentifications of non-target objects like weeds and rocks. Subsequently, within the C3 module of the YOLOv5n model, traditional convolutions are replaced by omni-dimensional dynamic convolution (ODConv). ODConv permits dynamic adjustment of the convolution kernel weights across different dimensions, thereby boosting the model’s adaptability and recognition capacity for mouse holes of varying sizes and shapes. Finally, the CGT-YOLOv5n model employs a Task-Specific Context Decoupling (TSCODE) head in place of the coupled head in the original model. This design allows the detection sub-tasks to extract more specific information from the feature maps, reducing the interplay between classification and regression tasks and thus enhancing the overall detection accuracy.

2.2. Improving YOLOv5n with a Context Augmentation Module and Adaptive Fusion Mechanism

The Context Augmentation Module (CAM) [20] enhances the neural network’s ability to capture contextual information in images by employing dilated convolutions [21,22] to expand the receptive field and adaptively fusing multi-scale feature maps, thereby improving the perceptual capabilities without incurring additional computational overheads [23].
The adaptive fusion mechanism in the Context Augmentation Module is designed to dynamically integrate features at various levels, which is crucial for identifying objects of different sizes and scales. These enhancements aim to address object detection challenges in diverse and complex scenarios by enriching the feature representation and improving the model’s adaptability to different object scales and environmental conditions. In Figure 2, F1 represents the contextual information above feature map F2, while F3 represents the contextual information below feature map F2. The adaptive fusion method generates adaptive weights P1, P2, and P3 for each spatial position by performing convolution, concatenation, and Softmax operations. These weights guide the fusion process, ensuring the importance of each feature map F1, F2, and F3 is considered during fusion. The weights are multiplied by the feature maps and subsequently added together to obtain the final feature map F4. The weight parameters can be learned through the backpropagation algorithm, enabling the model to adaptively adjust the contributions of feature maps at different scales, making them suitable for diverse tasks and scenarios. This adaptability enhances the model’s capability to effectively accommodate various requirements.
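To make the fusion step concrete, the following minimal PyTorch sketch illustrates the idea: three dilated convolutions with different dilation rates stand in for the context branches F1, F2, and F3, and a 1 × 1 convolution followed by Softmax produces the per-position weights P1, P2, and P3 used for the weighted sum F4. The layer sizes and class name are illustrative assumptions, not the exact CGT-YOLOv5n implementation.

```python
# Minimal sketch of a CAM-style adaptive fusion block (assumed layout).
import torch
import torch.nn as nn

class ContextAugmentation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Dilated convolutions with different rates capture different receptive fields.
        self.branch1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        # 1x1 convolution predicts three fusion weights per spatial position.
        self.weight = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, f2, f3 = self.branch1(x), self.branch2(x), self.branch3(x)
        # Adaptive fusion: softmax over the three branches at every pixel.
        p = torch.softmax(self.weight(torch.cat([f1, f2, f3], dim=1)), dim=1)
        f4 = p[:, 0:1] * f1 + p[:, 1:2] * f2 + p[:, 2:3] * f3
        return f4

# Example: fuse a 64-channel, 80x80 feature map.
out = ContextAugmentation(64)(torch.randn(1, 64, 80, 80))
print(out.shape)  # torch.Size([1, 64, 80, 80])
```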

2.3. Improving YOLOv5n with a TSCODE-Based Decoupled Head

To enhance YOLOv5n, a decoupled head based on TSCODE was introduced. Object detection tasks comprise two fundamental components: classification and localization. Classification primarily considers the texture information and corresponding categories of the image, determining whether a target exists in a specific region of the feature map. On the other hand, localization aims to accurately adjust the bounding box parameters [24] by leveraging the edge information within the image, precisely identifying the target’s position on the feature map. When classification and localization tasks share the same feature mapping, inconsistencies in task predictions at different locations may arise, leading to localization errors and a decreased performance.
In YOLOv5n, both tasks are executed on the same feature map. However, classification and localization have different requirements for position and spatial information on the feature map, potentially causing spatial misalignment issues. Therefore, the proposed decoupled head based on TSCODE aims to disentangle the classification and localization tasks, mitigating spatial misalignment and enhancing performance.
In the TSCODE decoupled head, the classification task learns from feature layers that carry high-level semantic information. Conversely, the localization task requires high-resolution feature maps that capture more edge information, enabling the accurate delineation of object boundaries [25]. Figure 3 illustrates this concept.
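The simplified sketch below illustrates this decoupling: the classification branch consumes a feature map fused with the coarser, semantically richer level, while the localization branch consumes a map fused with the finer, detail-rich level. The channel counts, fusion operations, and names are assumptions for illustration and do not reproduce the exact TSCODE design in [25].

```python
# Simplified decoupled head: different feature sources for classification and regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHead(nn.Module):
    def __init__(self, c_fine: int, c_mid: int, c_coarse: int, num_classes: int = 1):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_mid + c_coarse, c_mid, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_mid, num_classes, 1))   # class scores
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_mid + c_fine, c_mid, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_mid, 4, 1))             # box offsets (x, y, w, h)

    def forward(self, p_fine, p_mid, p_coarse):
        # Classification: fuse the mid level with the upsampled coarse (semantic) level.
        coarse_up = F.interpolate(p_coarse, size=p_mid.shape[-2:], mode="nearest")
        cls = self.cls_branch(torch.cat([p_mid, coarse_up], dim=1))
        # Localization: fuse the mid level with the downsampled fine (detail) level.
        fine_down = F.max_pool2d(p_fine, kernel_size=2)
        reg = self.reg_branch(torch.cat([p_mid, fine_down], dim=1))
        return cls, reg

head = DecoupledHead(c_fine=64, c_mid=128, c_coarse=256)
cls, reg = head(torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20))
print(cls.shape, reg.shape)  # (1, 1, 40, 40) (1, 4, 40, 40)
```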

2.4. Improving YOLOv5n Based on ODConv

The improvement of YOLOv5n based on ODConv introduces dynamic convolution (DC), which generates adaptive convolution kernels during each forward pass to align better with the input data features [26].
In traditional dynamic convolution, although convolution kernels encompass four dimensions (the spatial, input channel, output channel, and kernel dimensions), typically only a single attention value is assigned to each kernel, neglecting the variation across these dimensions. This limits the flexibility of the weight distribution, impacting dynamic optimization for specific tasks. To address this, Chao Li et al. proposed ODConv [27], a more comprehensive dynamic approach that covers not only the kernel dimension but also the spatial, input channel, and output channel dimensions. ODConv allocates distinct attention values to the different dimensions of each convolution kernel, enhancing the network’s adaptability and generalization capabilities. As shown in Figure 4, the s parameter assigns attention values to convolution parameters at different spatial locations, the c and o parameters allocate distinct attention values to the input and output channels, and the π parameter assigns a single attention scalar to the entire convolution kernel. This differentiation of the convolution operation across all dimensions significantly augments the model’s feature-extraction capacity for various input images.
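A condensed sketch of this idea follows: four attention factors (spatial, input channel, output channel, and kernel-wise) reweight a small bank of candidate kernels before a grouped convolution applies the aggregated kernel to each sample. This is a simplified assumption of how such a layer can be written, not the authors' OD_C3 implementation.

```python
# Simplified omni-dimensional dynamic convolution (assumed, condensed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, num_kernels=4):
        super().__init__()
        self.k, self.num_kernels = k, num_kernels
        self.weight = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k) * 0.02)
        hidden = max(c_in // 4, 4)
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, hidden, 1), nn.ReLU())
        # Four attention heads, one per dimension of the kernel bank.
        self.att_spatial = nn.Conv2d(hidden, k * k, 1)
        self.att_in = nn.Conv2d(hidden, c_in, 1)
        self.att_out = nn.Conv2d(hidden, c_out, 1)
        self.att_kernel = nn.Conv2d(hidden, num_kernels, 1)

    def forward(self, x):
        b, c_in, h, w = x.shape
        ctx = self.fc(x)                                                  # (b, hidden, 1, 1)
        a_s = torch.sigmoid(self.att_spatial(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_i = torch.sigmoid(self.att_in(ctx)).view(b, 1, 1, c_in, 1, 1)
        a_o = torch.sigmoid(self.att_out(ctx)).view(b, 1, -1, 1, 1, 1)
        a_w = torch.softmax(self.att_kernel(ctx).view(b, -1), dim=1).view(b, self.num_kernels, 1, 1, 1, 1)
        # Combine the kernel bank with all four attentions, then sum over the kernel bank.
        weight = (self.weight.unsqueeze(0) * a_s * a_i * a_o * a_w).sum(dim=1)
        # Grouped convolution applies a different aggregated kernel to each sample.
        out = F.conv2d(x.reshape(1, b * c_in, h, w), weight.reshape(-1, c_in, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)

y = SimpleODConv(32, 64)(torch.randn(2, 32, 40, 40))
print(y.shape)  # torch.Size([2, 64, 40, 40])
```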

3. Experimental Results and Discussion

3.1. Data Acquisition and Production

This study selected three experimental areas in the Inner Mongolia Autonomous Region: the grassland south of S104 Provincial Road in Xilamuren Town, Wuchuan County, Hohhot (40°47′ N to 41°23′ N, 110°31′ E to 111°53′ E), the Chilechuan Grassland (40°55′08″ N, 111°52′12″ E), and Tongdege Village in Daerhan Maomingan Joint Banner, Baotou City (41°22′32″ N, 110°57′6″ E). These areas were chosen due to their high density of mouse holes, with a particular focus on Tongdege Village, where plague outbreaks have exacerbated the challenges of mouse control. The Xilamuren Grassland and Chilechuan Grassland were selected for their high burrow density, facilitating sampling, and the necessity of extensive annual manual mouse hole inspections.
Data collection for this study was conducted in March, July, and November to accommodate the seasonal changes in grassland background color and enhance the model’s applicability across different seasons, using a DJI Mavic 2 Zoom drone at a flight altitude of 2 m. March and November represent the periods of mouse hibernation and vegetation withering, respectively, while July signifies the grassland growth phase [28]. For demonstration purposes, forty-eight random mouse hole images from these months were selected (as shown in Figure 5). Using the PASCAL VOC format, image annotation was performed with LabelImg [29], a free, open-source tool.
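For readers reproducing the annotation pipeline, the snippet below shows how one LabelImg PASCAL VOC file can be converted to the normalized label format used for YOLOv5 training; the file name and single-class list are hypothetical placeholders.

```python
# Convert one PASCAL VOC (LabelImg) annotation to YOLO-format label lines.
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path: str, class_names=("mouse_hole",)):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1].
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    return lines

print(voc_to_yolo("mouse_hole_0001.xml"))  # hypothetical annotation file
```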

3.2. Experimental Environment

The specifications of the experimental environment are listed in Table 1, and all experiments were conducted under these settings. A total of 2000 images were utilized, each with a resolution of 640 × 640 pixels, and were split into training and validation sets at a ratio of 8:2. The training set comprised 1600 images, while the validation set included 400 images. The training batch size was set to 32, totaling 300 training epochs. The initial seed was set to 0, the learning rate was set to 0.01, the weight decay was set to 0.005, and the momentum was set to 0.937 to ensure the reproducibility of the experiments.
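A minimal sketch of the 8:2 train/validation split with a fixed seed is shown below; the directory layout is a hypothetical example, not the authors' actual dataset path.

```python
# Reproducible 8:2 split of the image set (seed fixed to 0, as in the experiments).
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("datasets/mouse_holes/images").glob("*.jpg"))  # 2000 images assumed
random.shuffle(images)

split = int(0.8 * len(images))
train_set, val_set = images[:split], images[split:]
print(len(train_set), len(val_set))  # e.g., 1600 400
```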
In the post-processing stage of the model, the confidence threshold was set to 0.001: only predictions with a confidence score of 0.001 or higher were considered, while those below this threshold were discarded. The intersection over union (IoU) threshold was set to 0.6, meaning that if the IoU of two bounding boxes exceeded 0.6, they were deemed overlapping and the bounding box with the lower confidence score was discarded.
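This post-processing rule corresponds to confidence filtering followed by non-maximum suppression; a brief sketch using torchvision's NMS operator is given below as an assumed illustration.

```python
# Confidence filtering followed by IoU-based non-maximum suppression.
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                conf_thres: float = 0.001, iou_thres: float = 0.6):
    # boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,)
    keep_conf = scores >= conf_thres            # confidence filtering
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    keep = nms(boxes, scores, iou_thres)        # suppress overlaps above the IoU threshold
    return boxes[keep], scores[keep]

b = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
s = torch.tensor([0.9, 0.8, 0.0005])
print(postprocess(b, s))  # keeps the first box; the overlapping and low-confidence boxes are removed
```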

3.3. Evaluation Criteria

Four evaluation metrics were employed in this study, namely, the mean average precision (mAP), model size, inference latency, and frame rate in frames per second (FPS), to assess the effectiveness of the original model compared with the improved model for mouse hole detection.
mAP is a crucial performance metric in target detection. The mAP (mean average precision) value reflects the detection performance of the model across the entire dataset: a high mAP value indicates good detection performance, while a low mAP value may suggest issues with the model’s detection capabilities. Calculated using Equations (1)–(4), it quantifies the detector’s performance as the area under the precision (P)–recall (R) curve at various thresholds [30]. In Equations (1)–(4), P represents the precision rate, i.e., the proportion of samples predicted as positive that are truly positive. R represents the recall rate, i.e., the proportion of positive samples correctly detected by the model out of all positive samples (true positives plus false negatives) [31].
$$P = \frac{TP}{TP + FP} \times 100\% \tag{1}$$

$$R = \frac{TP}{TP + FN} \times 100\% \tag{2}$$

$$AP = \int_{0}^{1} P(R)\,dR \tag{3}$$

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{4}$$
TP represents the number of samples correctly predicted as positive, FP represents the number of negative samples incorrectly predicted as positive, and FN represents the number of positive samples incorrectly predicted as negative [32]. Additionally, n denotes the number of categories, AP denotes the average precision, and mAP represents the mean average precision.
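As a small numeric illustration of Equations (1)–(4), the snippet below computes precision and recall from made-up TP/FP/FN counts and approximates AP as the area under a sampled precision–recall curve; the numbers are illustrative only.

```python
# Toy example of Equations (1)-(4) with made-up counts and P-R samples.
import numpy as np

tp, fp, fn = 80, 10, 20
precision = tp / (tp + fp) * 100      # Equation (1): 88.9%
recall = tp / (tp + fn) * 100         # Equation (2): 80.0%

# AP (Equation (3)): trapezoidal integration of precision over recall.
recall_pts    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision_pts = np.array([1.0, 0.95, 0.9, 0.85, 0.7, 0.5])
ap = np.sum((recall_pts[1:] - recall_pts[:-1]) * (precision_pts[1:] + precision_pts[:-1]) / 2)

# mAP (Equation (4)): average AP over the n categories (a single class here).
mAP = np.mean([ap])
print(f"P={precision:.1f}%  R={recall:.1f}%  AP={ap:.3f}  mAP={mAP:.3f}")
```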
The intersection over union (IoU) is a crucial metric for assessing the localization accuracy of object detection models, and this study uses two threshold settings, IoU = 0.5 and IoU = 0.5:0.95, to evaluate the model performance. The model’s size affects its deployment flexibility and efficiency, with larger models requiring more resources. The inference speed, measured in frames per second (FPS) [33], is an essential indicator of the model’s image-processing speed and suitability for real-time applications.
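For completeness, a minimal IoU computation for two axis-aligned boxes is sketched below, matching the definition used by both threshold settings.

```python
# IoU of two boxes given in (x1, y1, x2, y2) form.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```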
These four detection metrics are recognized and widely used measurement standards in object detection, boasting strong generality and comparability.

3.4. Comparison Results for the Original Model and the Improved Model

As depicted in Table 2, the improved model, CGT-YOLOv5n, exhibits a 3.3 percentage point improvement in mAP_0.5 and a 4.3 percentage point improvement in mAP_0.5:0.95 compared with YOLOv5n. Model enhancements were applied stepwise to evaluate the effect of each module on the original model. Integrating the CAM module into the YOLOv5n model enhances the model’s ability to distinguish targets from shadow interference by incorporating contextual information related to mouse holes. This enhancement is achieved using an adaptive fusion method to minimize semantic disparities, thereby improving the model’s recognition of smaller mouse holes in images. As a result, the mean average precision (mAP) at the 0.5 threshold increased by 2.3 percentage points, and the mAP over the 0.5 to 0.95 threshold range increased by 1.2 percentage points. Additionally, the model’s accuracy in object boundary regression was improved by segregating the classification and localization tasks and utilizing semantically rich feature maps for classification paired with high-resolution maps for localization: replacing the coupled head of CAM-YOLOv5n with a TSCODE decoupled head led to increases of 0.4 and 0.7 percentage points in mAP_0.5 and mAP_0.5:0.95, respectively. Finally, the model’s ability to recognize differently shaped mouse holes was enhanced by assigning four-dimensional weights to the convolutional layers based on the input images. Replacing the C3 module of CAM-TSCODE-YOLOv5n with the OD_C3 module resulted in increases of 0.6 and 2.4 percentage points in mAP_0.5 and mAP_0.5:0.95, respectively. Because of the increased model complexity, the model size increased by 11.7 M, reaching 15.4 M; the inference latency increased by 1.9 ms; and the frame rate decreased by 56.1 FPS, to 161.3 FPS. Although there is a slight increase in processing time, detecting mouse holes demands a high accuracy rate, and given the significant improvement in the model’s accuracy, this minor extension of the inference time is acceptable.
The stability of the original YOLOv5n model notably differed from that of the improved model CGT-YOLOv5n during the training process, as observed from the loss values of the validation set (Figure 6a). The blue curve represents the loss values (val/box_loss, val/obj_loss) of the original YOLOv5n model, indicating significant fluctuations and a lack of convergence. By contrast, the red curves represent the loss values (val/box_loss, val/obj_loss) of the improved model, which appear smooth and consistently lower than those of the original model. Eventually, the loss values of the improved model stabilize at approximately 0.04, indicating a superior convergence.
Figure 6b presents the comparison curves for precision, recall, mAP_0.5, and mAP_0.5:0.95. The improved model, CGT-YOLOv5n, exhibits a gradual and converging increase in all four indices. The curves representing the improved model indicate improvements over the original model, demonstrating that the performance of the improved model surpasses that of the original model.
Even under non-complex grassland background conditions, the detection process can be affected by various factors. In Figure 7, the top row displays the detection results of the original model, while the bottom row exhibits the outcomes from the improved model. As depicted in Figure 7a,c, due to the black area at the center of mouse hole targets, the original model mistakenly detects the shadows of a cola bottle and a stone as mouse holes. In contrast, the improved model avoids such misclassifications, as shown in Figure 7b,d. Furthermore, as seen in Figure 7e,f, the improved model successfully detects an irregularly shaped mouse hole that the original model fails to identify. Under complex conditions, with various elements such as rocks, animal droppings (from cattle, sheep, and horses), and overgrown weeds present on the grass, the detection of mouse holes is notably affected, as illustrated in Figure 8. The comparison between Figure 8a,b shows that (a) mistakenly identifies animal droppings and rocks as mouse holes. Similarly, in the comparison between Figure 8c,d, the grass obscures the mouse holes considerably, leading to the failure of detection in Figure 8c, where two obscured mouse holes were not detected. Figure 8e,f also depict heavily obscured mouse holes that were not detected by the original model.
Figure 9 presents data captured by a drone at varying heights, illustrating the detection results of the original and the improved models. The first row displays the detection outcomes of the original model, while the second row showcases those of the improved model. In the figure, (a) and (b) represent a shooting height of 2 m, (c) and (d) represent 3 m, (e) and (f) represent 4 m, (g) and (h) represent 5 m, and (i) and (j) represent 6 m. The original and improved models can accurately detect mouse holes at shooting heights of two and three meters. At a height of 4 m, the original model erroneously identifies the shadow of weeds as mouse holes, as shown by the red box outside the blue circle in Figure 9e, whereas the improved model avoids this error. At the shooting heights of 5 and 6 m, the original model fails to detect mouse holes and mistakenly identifies the shadow of weeds as mouse holes, as indicated by the red boxes outside the blue circles in Figure 9g,i. Conversely, the improved model in Figure 9h,j neither detects mouse holes nor exhibits false detections (no red boxes appear). Overall, the original model presents one false detection in Figure 9e, and two false detections and two misses in Figure 9g,i, while the improved model only exhibits two misses in Figure 9h,j. Evidently, at higher shooting heights, the performance of the improved model still surpasses that of the original model.

3.5. Comparison and Discussion of Experimental Results from Various Models

To evaluate the effectiveness of the proposed method, we compared it with other mainstream object detection algorithms, namely Faster R-CNN [34], YOLOv3 [35], and SSD [36], along with the more recent YOLOv8n [37] and RT-DETR [38]. The experimental results are shown in Table 3. Faster R-CNN is a region proposal-based method, initially employing a Region Proposal Network (RPN) to extract potential target regions, followed by a classifier for categorization. It exhibits good accuracy but is characterized by a very large model size and slower speed. The SSD model, on the other hand, forgoes region proposals and directly predicts object classes and bounding boxes on feature maps of multiple scales, striking a balance between speed and accuracy; it operates faster than Faster R-CNN, albeit with a slightly lower accuracy. YOLOv3, a classic version in the YOLO series, performs predictions directly on the entire image, resembling SSD, but employs a different anchor box strategy and loss function. It achieves high accuracy, although it falls short of SSD in model size and speed. YOLOv8 is the latest model in the YOLO series, with YOLOv8n offering a variety of backbones and custom structures. While maintaining high accuracy, it achieves faster inference speeds than the other object detection models; however, it has a higher model complexity, necessitating more computational resources. RT-DETR is a real-time object detector based on the Vision Transformer (ViT), capable of efficiently handling multi-scale features by decoupling intra-scale interaction and cross-scale fusion. Nonetheless, its accuracy on the mouse hole data is low, and it has a larger model size.
To fairly assess the overall performance of the different models, this study employed the sum of ranking differences (SRD) method [39,40]. As shown in Figure 10, this approach is based on a 5 × 7 matrix, with the rows representing the evaluation metrics (mAP_0.5, mAP_0.5:0.95, model size, latency, and FPS) and the columns representing the seven models. The absolute differences between each model’s rankings and the reference rankings are summed, and models with smaller SRD values are considered superior [41]. The results indicate that SSD, YOLOv5n, YOLOv8n, RT-DETR, and CGT-YOLOv5n performed the best, while Faster R-CNN exhibited the poorest overall performance. Because mouse hole detection in grassland areas is ultimately aimed at mouse pest control, the following analysis focuses on the detection accuracy, specifically mAP_0.5 and mAP_0.5:0.95, of the models exhibiting the best overall performance.
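The SRD comparison can be sketched as follows: each model's per-metric ranking is compared against a reference ranking (here the row-wise best value), and the absolute rank differences are summed; smaller totals indicate closer agreement with the reference. The toy matrix below is illustrative and does not reproduce the paper's Figure 10 computation.

```python
# Toy sum of ranking differences (SRD) computation for three hypothetical models.
import numpy as np
from scipy.stats import rankdata

# Rows: metrics, oriented so that larger is always better; columns: models A, B, C.
data = np.array([
    [92.8, 89.5, 84.1],    # mAP_0.5
    [46.3, 42.0, 36.0],    # mAP_0.5:0.95
    [-15.4, -3.7, -315.0], # negative model size (smaller size is better)
    [161.3, 217.4, 15.4],  # FPS
])

reference = data.max(axis=1)                     # ideal "best value per metric" reference column
ranks = np.apply_along_axis(rankdata, 0, np.column_stack([data, reference]))
model_ranks, ref_ranks = ranks[:, :-1], ranks[:, -1:]
srd = np.abs(model_ranks - ref_ranks).sum(axis=0)
print(srd)  # one SRD value per model; smaller is better
```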
Among the five models with superior performance, the RT-DETR model has the lowest mAP_0.5 of 83%. The SSD model has the lowest mAP_0.5:0.95 of 33.7%. Moreover, the two models only have a slight difference of 0.3% and 0.1% in mAP_0.5 and mAP_0.5:0.95, respectively, and their larger model size is unfavorable for deployment on uncrewed aerial vehicles (UAVs). On the other hand, the other three models, with their compact size and faster detection speed, are more suitable for UAV deployment. Among the remaining three models, CGT-YOLOv5n has the highest mAP_0.5 of 92.8%, while YOLOv5n has the lowest mAP_0.5 of 89.5%. YOLOv8n, with a mAP_0.5 of 90.1%, sits in between. The mAP_0.5 of CGT-YOLOv5n is 3.3% and 2.7% higher than those of YOLOv5n and YOLOv8n, respectively. Also, CGT-YOLOv5n has the highest mAP_0.5:0.95, which is 4.3% and 3.8% higher than YOLOv5n and YOLOv8n, respectively. Regarding model size and detection speed, CGT-YOLOv5n slightly lags behind YOLOv5n and YOLOv8n. However, considering the characteristics of mouse hole detection in grasslands, where high accuracy is crucial and detection time is not overly stringent, a minor sacrifice in detection time is within an acceptable range. Therefore, with its superior overall performance and the highest detection accuracy, CGT-YOLOv5n is more apt for mouse hole detection in complex grassland terrains.
Figure 11 displays the detection results of the seven models mentioned in the text on complex grassland backgrounds. The presence of large areas of weeds, severe shadows on the grassland, and traces of rainwater erosion significantly affect the models’ detection of mouse holes. Figure 11a depicts the detection result by the Faster R-CNN model, where it misidentifies the shadow of weeds as mouse holes in five instances. Figure 11b portrays the detection results for SSD, which misses two mouse holes and incorrectly identifies a shadow of grass as a mouse hole in one instance. Figure 11c presents the detection results for YOLOv3, which misidentifies the shadow of grass as mouse holes in four instances. Figure 11d shows the detection results for YOLOv5n, with one missed mouse hole and two instances of misidentifying the shadow of grass as mouse holes. Figure 11e displays the detection results for YOLOv8n, which was relatively good yet still misidentified the shadow of grass as a mouse hole in one instance. Figure 11f presents the detection results for the RT-DETR model, detecting all three mouse holes but misidentifying shadows as mouse holes in five instances. Figure 11g illustrates the detection results for the improved model, detecting all three mouse holes without misidentifying shadows as mouse holes.
Based on the results, it is evident that missed detections and shadow interference are challenging issues in mouse hole detection. The Faster R-CNN, SSD, YOLOv3, YOLOv5n, YOLOv8n, and RT-DETR models tend to misidentify the shadows of interfering objects as mouse holes. In contrast, the improved model CGT-YOLOv5n, which integrates CAM, TSCODE, and ODConv, can differentiate between shadows and mouse holes, enabling better detection performance by drones over complex grassland backgrounds.
From the experiment, it is found that the CGT-YOLOv5n model also has certain limitations. Using drones to detect mouse holes from heights above 4 m presents challenges, primarily due to the decreased image resolution and loss of detail caused by increased altitude. At higher observation points, the small features of mouse holes may become less distinct and difficult to differentiate from surrounding natural elements. This poses a significant challenge for models that rely on image recognition and feature detection, as their performance largely depends on the quality of the images and the recognizability of the target objects. Future research may need to explore how to enhance the recognition ability of small targets in high-altitude areas through feature enhancement or super-resolution techniques to overcome this limitation.

4. Conclusions

This study developed the CGT-YOLOv5n model, an enhancement of the YOLOv5n architecture aimed at improving the accuracy of mouse hole detection in complex grassland environments. By integrating CAM, TSCODE, and ODConv modules, the model successfully reduced the misidentification of non-target objects (such as stones, weeds, and animal feces) as mouse holes, achieving a 3.3% increase in mAP_0.5 and a 4.3% increase in mAP_0.5:0.95 compared to the original YOLOv5n model. However, challenges in detecting mouse holes arose when using drones at altitudes above 4 m. Future research may focus on feature enhancement or super-resolution technologies to improve the recognition of small targets in high-altitude aerial photography.

Author Contributions

Conceptualization, C.L. and X.L.; methodology, C.L. and X.L.; validation, X.L.; formal analysis, C.L.; investigation, C.L. and X.L.; resources, C.L. and X.L.; data curation, C.L. and X.L.; writing—original draft preparation, C.L.; writing—review and editing, C.L., X.L. and X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (6196070091), the Scientific Research Project of Higher Education Institutions in the Inner Mongolia Autonomous Region (NJZZ22502), and the Inner Mongolia Natural Science Foundation joint fund project (2023LHMS06020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yun, J.F.; Liu, D.F. Characteristics, status and construction of grassland resources in Inner Mongolia desert area. Chinese Forestry Society. In Proceedings of the 2005 CCSA Annual Academic Conference 26 Session Proceedings (1), Xinjiang, China, 20–23 August 2005; Editorial Department of Forestry Science and Technology Management: Urumqi, China, 2005; pp. 29–31. [Google Scholar]
  2. Chen, W.; Liu, W.; Zhao, Y.; Lu, J.; Lv, S.; Muyassar, S. Monitoring and Control Methods of Spatial Distribution of Pests and Rodents in Yili Grassland. J. Grassl. Forage Sci. 2023, 68–73. [Google Scholar] [CrossRef]
  3. He, D.; Huang, X.; Tian, Q.; Zhang, Z. Changes in vegetation growth dynamics and relations with climate in inner Mongolia under more strict multiple pre-processing. (2000–2018). Sustainability 2020, 12, 2534. [Google Scholar] [CrossRef]
  4. Liu, W.; Zhong, W.Q.; Wang, D.H. Seasonal pattern and dynamic mechanism of population survival of long-clawed gerbils in agro-pastoral ecotone in inner monglia. Acta Therioloica Sin. 2020, 40, 571–584. [Google Scholar]
  5. Sun, H.L. Encyclopedia of China’s Resource Sciences; China Encyclopedia Publishing House: Beijing, China, 2000. [Google Scholar] [CrossRef]
  6. Wen, A.M.; Zheng, J.H.; Chen, M.; Mu, C.; Ma, T. Monitoring Mouse-Hole Density by Rhombomys opimus in Desert Forests with UAV Remote Sensing Technology. Sci. Silvae Sin. 2018, 54, 186–192. [Google Scholar]
  7. Wen, A.M.; Zheng, J.H.; Chen, M.; Mu, C.; Ma, T. Group coverage of burrowentrances and distribution characteristics of desert forest-dwelling Rhombomys opimus based on unmanned aerial vehicle (UAV) low-altitude remote sensing: A case study at the southern margin of the Gurbantunggut Desert in Xinjiang. Acta Ecol. Sin. 2018, 38, 953–963. [Google Scholar]
  8. Zhou, X.L.; An, R.; Chen, Y.H.; Al, Z.T.; Huang, L.J. Identification of rat holes in the typical area of “Three-River Headwaters” region By UAV remote sensing. J. Subtrop. Resour. Environ. 2018, 13, 85–92. [Google Scholar]
  9. Sun, D.; Ni, Y.; Chen, J.; Abuduwali Zheng, J. Application of UAV low-altitude image on rathole monitoring of Eolagurus luteus. China Plant Prot. 2019, 39, 35–43. [Google Scholar]
  10. Liu, H.C. Comparative Analysis of Image Classification Algorithms Based on Traditional Machine Learning and Deep Learning. Comput. Inf. Technol. 2019, 27, 12–15. [Google Scholar]
  11. Cui, B.; Zheng, J.; Liu, Z.; Ma, T.; Shen, J.; Zhao, X. Weed classification of remote sensing by UAV in ecological irrigation areas based on deep learning. J. Drain. Irrig. Mach. Eng. (JDIME) 2018, 36, 1137–1141. [Google Scholar]
  12. Sun, Y.; Zhou, Y.; Yuan, M.; Liu, W.; Luo, Y.; Zong, S. UAV real-time monitoring for forest pest based on deep learning. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2018, 34, 74–81. [Google Scholar]
  13. Zhou, S.; Han, L.L.; Yang, S.; Wang, Y.; Genxia, Y.; Niu, P. A Study of Rodent Monitoring in Ruoergai Grassland Based on Convolutional Neural Network. J. Grassl. Forage Sci. 2021, 2, 15–25. [Google Scholar]
  14. Cui, B.; Zheng, J.; Liu, Z.; Ma, T.; Shen, J.; Zhao, X. YOLOv3 rat hole recognition technology for UAV remote sensing images. For. Sci. 2020, 56, 199–208. [Google Scholar]
  15. Yan, Y.; Hao, S.; Gao, Y.; Xin, D.; Niu, Z. Design of a Kiwifruit Orchard Pest and Disease Detection System Based on Aerial and Ground Multisource Information. Trans. Chin. Soc. Agric. Mach. 2023, 54, 294–300. [Google Scholar]
  16. Leng, R.X. Application of Transmission Line Foreign Object Recognition Algorithm Based on YOLOv8; Northeast Agricultural University: Harbin, China, 2023. [Google Scholar]
  17. Wang, Z.; Yuan, G.; Zhou, H.; Ma, Y.; Ma, Y. Foreign-Object Detection in High-Voltage Transmission Line Based on Improved YOLOv8m. Appl. Sci. 2023, 13, 12775. [Google Scholar] [CrossRef]
  18. Du, Y.F.; Huang, L.; Zhao, Z.; Li, G. Landslide Identification and Detection in High-Resolution Remote Sensing Images Based on DETR. Bull. Surv. Mapp. 2023, 16–20. [Google Scholar] [CrossRef]
  19. Jocher, G. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 6 March 2023).
  20. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Syst. Appl. 2023, 211, 118665. [Google Scholar] [CrossRef]
  21. Lin, L.K. Full-Field High-Resolution Cell Morphology Analysis System and Its Application Research. Master’s Thesis, Nanjing University, Nanjing, China, 2021. [Google Scholar]
  22. Li, G.B. Research on Pedestrian Detection Technology Based on Deep Learning. Master’s Thesis, Guizhou University, Guiyang, China, 2021. [Google Scholar]
  23. Wang, H.M. Research on Railroad Traffic Safety Image Recognition Technology Based on Deep Learning. Ph.D. Thesis, Lanzhou Jiaotong University, Lanzhou, China, 2022. [Google Scholar]
  24. Wu, N.; Mu, C.G.; He, Y.; Liu, T.H. Multi-scale infrared and visible image fusion based on nested connections. J. Beijing Univ. Aeronaut. Astronaut. 2023, 1–11. [Google Scholar] [CrossRef]
  25. Zhuang, J.; Qin, Z.; Yu, H.; Chen, X. Task-Specific Context Decoupling for Object Detection. arXiv 2023, arXiv:2303.01047. [Google Scholar]
  26. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  27. Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar]
  28. Zhang, Y.S.; Cao, L.; Wang, S.F. Diagnosis analysis and management of bird damage fault on transmission line of 330 kV Yushu networking project. Qinghai Electr. Power 2017, 36, 58–62. [Google Scholar]
  29. Tzutalin. LabelImg. Git Code, 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 15 October 2022).
  30. Brown, C.D.; Davis, H.T. Receiver operating characteristics curves and related decision measures: A tutorial. Chemom. Intell. Lab. Syst. 2006, 80, 24–38. [Google Scholar] [CrossRef]
  31. Shaikh, S.A. Measures derived from a 2 × 2 table for an accuracy of a diagnostic test. J. Biom. Biostat. 2011, 2, 1–4. [Google Scholar] [CrossRef]
  32. Shi, H.C.; Jin, Z.Y.; Tang, W.J.; Wang, J.; Jiang, K.; Xia, W. Research on high-precision wafer defect detection method based on deep learning. J. Electron. Meas. Instrum. 2022, 36, 79–90. [Google Scholar]
  33. Li, H. Research on Object Detection Algorithm Based on Lightweight Network; University of Chinese Academy of Sciences (Institute of Optoelectronic Technology, Chinese Academy of Sciences): Beijing, China, 2022. [Google Scholar]
  34. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  35. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  37. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
  38. Lv, W.; Zhao, Y.; Xu, S.; Wei, J.; Wang, G.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. Detrs beat yolos on real-time object detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  39. Héberger, K. Sum of ranking differences compares methods or models fairly. TrAC Trends Anal. Chem. 2010, 29, 101–109. [Google Scholar] [CrossRef]
  40. Kollár-Hunek, K.; Héberger, K. Method and model comparison by sum of ranking differences in cases of repeated observations (ties). Chemom. Intell. Lab. Syst. 2013, 127, 139–146. [Google Scholar] [CrossRef]
  41. Sziklai, B.R.; Héberger, K. Apportionment and districting by Sum of Ranking Differences. PLoS ONE 2020, 15, e0229209. [Google Scholar] [CrossRef]
Figure 1. The CGT-YOLOv5n model. CBS refers to combining the Convolution, Batch Normalization, and SiLU (Sigmoid Linear Unit) activation functions. Concat refers to concatenating feature maps coming from multiple layers together. Upsample refers to upsampling, increasing the spatial resolution of the feature maps while retaining the learned feature information. OD_Conv replaces standard convolutions with full-dimensional dynamic convolutions, as detailed in Section 2.4. CAM refers to the Context Augmentation Module, described in Section 2.2. TS_head is a context-based decoupled head, discussed in Section 2.3.
Figure 2. Context Augmentation Module with Adaptive Fusion Mechanism.
Figure 3. TSCODE decoupling header.
Figure 4. Omni-dimensional Dynamic Convolution.
Figure 5. Images of mouse holes captured in different periods. (a) March, (b) July, (c) November, (d) under poor lighting conditions.
Figure 6. (a) Loss diagram of the original model and the improved model; (b) graph of the evaluation indexes of the original model and the improved model.
Figure 7. (a,c,e) represent the mouse hole detection results of YOLOv5n under simple background conditions; (b,d,f) represent the mouse hole detection results of CGT-YOLOv5n under simple background conditions.
Figure 8. (a,c,e) represent the mouse hole detection results of YOLOv5n under complex background conditions; (b,d,f) represent the mouse hole detection results of CGT-YOLOv5n under complex background conditions.
Figure 9. Comparison of detection results between the original and improved models using data captured at different heights by a drone: (a,b) at 2 m, (c,d) at 3 m, (e,f) at 4 m, (g,h) at 5 m, and (i,j) at 6 m.
Figure 10. Sum of ranking differences (SRD) graph, where Min represents the minimum reference standard; Rnk, Rnk1, Rnk2, Rnk3, Rnk4, Rnk5, Rnk6, and Rnk7 represent the rankings of the reference standard, Faster R-CNN, SSD, YOLOv3, YOLOv5n, YOLOv8n, RT-DETR, and CGT-YOLOv5n, respectively, from the least to the most significant. Diff1, Diff2, Diff3, Diff4, Diff5, Diff6, and Diff7, respectively, denote the differences between Faster R-CNN, SSD, YOLOv3, YOLOv5n, YOLOv8n, RT-DETR, and CGT-YOLOv5n and the reference standard.
Figure 11. Detection results for mouse holes on complex backgrounds using different models: (a) Faster R-CNN; (b) SSD; (c) YOLOv3; (d) YOLOv5n; (e) YOLOv8n; (f) RT-DETR; and (g) CGT-YOLOv5n.
Table 1. Experimental environment.
| Software and Hardware Platform | Model Parameters |
| --- | --- |
| Operating System | Windows 10 |
| CPU | Intel Core i5 |
| GPU | NVIDIA GeForce RTX 3080 |
| Operating Memory | 10 GB |
| CUDA | 11.7 |
| Framework | PyTorch 1.13.1 |
| Programming Environment | Python 3.7 |
Table 2. YOLOv5n and improved experimental results.
| Models | mAP (%) IoU = 0.5 | mAP (%) IoU = 0.5:0.95 | Model Size (M) | Latency (ms) | FPS (f/s) |
| --- | --- | --- | --- | --- | --- |
| YOLOv5n | 89.5 | 42.0 | 3.7 | 2.4 | 217.4 |
| CAM-YOLOv5n | 91.8 | 43.2 | 4.6 | 2.8 | 217.4 |
| CAM-TSCODE-YOLOv5n | 92.2 | 43.9 | 14.9 | 3.8 | 178.6 |
| CAM-TSCODE-ODConv-YOLOv5n | 92.8 | 46.3 | 15.4 | 4.3 | 161.3 |
Table 3. Experimental results for different models.
| Models | mAP (%) IoU = 0.5 | mAP (%) IoU = 0.5:0.95 | Model Size (M) | Latency (ms) | FPS (f/s) |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 84.1 | 36.0 | 315.0 | 64.9 | 15.4 |
| SSD | 83.3 | 33.7 | 100.2 | 8 | 125 |
| YOLOv3 | 89.2 | 45.2 | 117.8 | 11.3 | 80.6 |
| YOLOv5n | 89.5 | 42 | 3.7 | 2.4 | 217.4 |
| YOLOv8n | 90.1 | 42.5 | 6 | 2.4 | 277.7 |
| RT-DETR | 83.0 | 33.8 | 63.1 | 8.0 | 119.0 |
| CGT-YOLOv5n | 92.8 | 46.3 | 15.4 | 4.3 | 161.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
