Article

Enhancing Sustainable Traffic Monitoring: Leveraging NanoSight–YOLO for Precision Detection of Micro-Vehicle Targets in Satellite Imagery

1 School of Transportation Engineering, Xinjiang University, Urumqi 830017, China
2 Xinjiang Key Laboratory of Green Construction and Smart Traffic Control of Transportation Infrastructure, Xinjiang University, Urumqi 830017, China
3 School of Intelligent Manufacturing Modern Industry, Xinjiang University, Urumqi 830017, China
4 Xinjiang Hualing Logistics Co., Ltd., Urumqi 830017, China
5 Xinjiang Xinte Energy Logistics Co., Ltd., Urumqi 830017, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(17), 7539; https://doi.org/10.3390/su16177539
Submission received: 29 July 2024 / Revised: 28 August 2024 / Accepted: 29 August 2024 / Published: 30 August 2024
(This article belongs to the Section Sustainable Transportation)

Abstract

Satellite remote sensing technology significantly aids road traffic monitoring through its broad observational scope and data richness. However, accurately detecting micro-vehicle targets in satellite imagery is challenging because complex backgrounds and limited semantic information hinder traditional object detection models. To overcome these issues, this paper presents the NanoSight–YOLO model, a specialized adaptation of YOLOv8, to boost micro-vehicle detection. This model features an advanced feature extraction network, incorporates a transformer-based attention mechanism to emphasize critical features, and improves the loss function and bounding box regression for enhanced accuracy. A micro-target detection layer tailored to the granularity of satellite imagery is also introduced. Empirical evaluations show improvements of 12.4% in precision and 11.5% in both recall and mean average precision (mAP) in standard tests. Further validation on the DOTA dataset highlights the model’s adaptability and generalization across various satellite scenarios, with increases of 3.6% in precision, 6.5% in recall, and 4.3% in mAP. These enhancements confirm NanoSight–YOLO’s efficacy in complex satellite imaging environments, representing a significant leap in satellite-based traffic monitoring.

1. Introduction

With rapid economic development, the number of automobiles has significantly increased in recent years, placing tremendous pressure on road traffic systems and leading to complex traffic issues. These issues include severe traffic congestion, parking difficulties, worsening environmental pollution, and rising traffic accidents. These problems affect the efficiency of residents’ daily travel and the sustainable development of road traffic systems. To effectively monitor the operational status of road traffic, vehicles are typically detected, tracked, and counted in specific areas to assess traffic flow conditions, thereby alleviating the pressure on road traffic systems.
Currently, road traffic operations are mainly monitored using fixed traffic detection devices to determine their operational status. These devices offer flexible deployment and scalability, allowing for the addition of cameras or adjustments to monitoring strategies at critical locations as needed, thereby gathering richer traffic flow information [1]. Based on this, many scholars [2,3,4,5] have introduced advanced computational models and technologies to improve the accuracy and efficiency of traffic monitoring and management. Regarding traffic data processing, scholars [6,7,8] have combined deep learning technologies and data analysis techniques to address the imbalance in traffic flow prediction data, effectively analyzing vehicle speed, density, and flow to determine the operational status of road traffic. Regarding network traffic and traffic parameter estimation, scholars [9,10,11] have utilized machine vision technology and multidimensional data analysis techniques to extract and analyze data features, optimizing network traffic management, traffic parameter estimation, and network security performance. Although significant progress has been made in this field, fixed detection devices have inherent limitations in installation and maintenance, coverage range, and the interconnection and interoperability of data collected by various sensors. These devices are costly to install and challenging to maintain and manage, making it difficult to acquire road traffic data and monitor traffic operational status.
In recent years, advancements in satellite remote sensing technology have been substantial, especially with the introduction of video satellites. These innovations offer fresh prospects for continuous, real-time observation of the Earth, upgrading traditional static single-scene imagery to multi-frame dynamic video imagery and providing new methods for monitoring road traffic conditions. Gao et al. [12] improved CycleGAN, enhancing the representation of vehicle features in satellite remote sensing imagery, especially in environments with poor imagery quality or indistinct vehicle features, significantly improving vehicle detection accuracy. Kong et al. [13] effectively utilized temporal semantic information through cross-frame feature updates and cross-frame training processes, enhancing the spatiotemporal recognition of targets, which is outstanding in dynamic traffic monitoring applications. Other scholars [14,15,16] have improved feature extraction efficiency through various advanced technologies, enhancing vehicle detection performance in complex environments. In summary, vehicle target detection methods based on satellite remote sensing imagery, with comprehensive coverage and continuous observation advantages, have become essential for monitoring road traffic conditions. These technologies obtain vital parameters such as traffic flow and adapt to complex environmental changes, providing accurate traffic analysis data to support urban traffic management and planning decisions, significantly improving the efficiency and accuracy of monitoring road traffic conditions [17].
Compared to traditional fixed traffic monitoring equipment, which is highly dependent on its deployment, expensive to install, limited in coverage, and easily affected by ground environmental changes [18], traffic monitoring based on satellite remote sensing imagery offers broader geographical coverage without being restricted by installation locations. It provides comprehensive, large-scale data. Despite significant progress in road traffic monitoring through satellite remote sensing imagery, several challenges remain, such as low resolution, background blurring, and high noise levels. Vehicle features in satellite imagery are sparse, small, and strongly influenced by background interference. Additionally, weather conditions and lighting changes significantly affect the quality of satellite imagery, often reducing the accuracy and reliability of road traffic condition monitoring [19]. Therefore, while satellite remote sensing imagery offers new perspectives for traffic management, technical optimization and standardization are still needed to overcome existing limitations and challenges.
Traditional methods for vehicle target detection in satellite remote sensing imagery depend on manual feature extraction and machine learning techniques. These methods generally suffer from low robustness, limited generalization capabilities, and subpar real-time performance. On the other hand, methods based on deep learning dramatically enhance accuracy, generalization, and real-time capabilities through automated complex feature learning, end-to-end training, and hardware acceleration. Such approaches are more apt for handling the intricate environments and urgent processing demands of satellite remote sensing imagery. Given the distinct attributes of this imagery, conventional deep learning techniques for target detection often fall short. Consequently, this paper introduces an advanced YOLOv8 model for target detection, dubbed NanoSight–YOLO, derived from an extensive analysis of vehicle target data characteristics specific to satellite remote sensing. This model aims to extract traffic parameters better and is more suitable for monitoring road traffic conditions. The specific improvements are as follows:
  • Optimized feature extraction and target detection capabilities: In the NanoSight–YOLO model, the conventional CSPDarknet backbone is replaced with FasterNet, specifically designed to reduce computational complexity and minimize information loss when dealing with micro-vehicle targets in satellite imagery. This strategic substitution significantly enhances the model’s operational efficiency and reliability in complex environmental conditions, providing a robust framework for high-precision target detection. Furthermore, integrating a specialized micro-target detection layer that employs advanced feature fusion and upsampling technologies marks a crucial innovation. This layer significantly amplifies the model’s ability to discern intricate details within micro targets, drastically improving the detection accuracy and reducing the incidence of false positives, thereby advancing the state of the art in satellite imagery analysis.
  • Advanced localization and accuracy enhancement techniques: The NanoSight–YOLO model introduces innovative loss functions—Wise-intersection over union (Wise-IoU) and Focaler-IoU—which significantly outperform the traditional CIoU and bounding box loss functions used in prior models. These new techniques are tailored to enhance detection precision, particularly in scenarios involving micro targets and low-overlap situations typical in satellite imagery. Wise-IoU utilizes a dynamic non-monotonic focusing mechanism that adeptly handles the intricacies of small target detection, ensuring high consistency and precision. Moreover, the replacement of YOLOv8’s decoupled head with a dynamic head that incorporates comprehensive attention mechanisms for scale and spatial awareness further boosts the model’s ability to accurately localize micro-vehicle targets against complex backgrounds, highlighting the innovative approach to improving accuracy and efficiency in object detection.
  • Enhanced focus and precision in target recognition: The incorporation of the bi-level routing attention (BiLRA) mechanism within the NanoSight–YOLO model represents a significant advancement in the field of target detection in remote sensing. This innovative attention mechanism enhances the model’s focus on probable target-rich regions, effectively highlighting essential features while suppressing irrelevant background noise. By dynamically adjusting the focus to the most informative parts of the satellite imagery, BiLRA significantly refines the accuracy of target recognition and localization. This attention to detail in the design of attention mechanisms ensures that the model maintains a high accuracy and improves overall performance in complex detection environments, setting a new benchmark in precision-focused satellite imagery analysis.
This paper is organized as follows: Section 2 discusses research related to data processing methods in satellite remote sensing imagery and methods for detecting targets using deep learning; Section 3 elaborates on the enhancements made to the target detection model using deep learning techniques; Section 4 examines and interprets the experimental outcomes; Section 5 highlights the advantages of the detection and tracking model by presenting ablation studies and benchmarking against state-of-the-art models; and Section 6 concludes with a summary of the findings from this study and outlines prospective avenues for further research.

2. Related Works

2.1. Satellite Remote Sensing Imagery Data Processing Methods

Raw satellite remote sensing imagery often encounters atmospheric effects, sensor noise, geometric distortions, variable lighting conditions, and shadow issues during acquisition. Without proper handling, these factors can affect the color fidelity of the data and reduce target distinguishability and overall imagery quality, thus increasing the risks of false positives (misidentifying non-targets as targets) and false negatives (failing to detect actual targets). These issues impact the accuracy and reliability of vehicle tracking. Consequently, numerous researchers have conducted in-depth studies to address these challenges.
Several scholars [20,21,22,23,24,25] have employed contrast limited adaptive histogram equalization and noise removal to improve the quality of satellite remote sensing imagery, enhancing both imagery quality and the effectiveness of subsequent processing. Other researchers [26,27,28] have focused on data optimization, fusion, and analysis of satellite remote sensing imagery, improving the efficiency and accuracy of multi-source data processing. Additionally, some scholars [29,30,31,32] have applied multi-scale imagery processing and data fusion techniques in satellite remote sensing imagery to enhance the information extraction capability of imagery with different resolutions.
In summary, the current research in satellite remote sensing imagery data processing spans from imagery enhancement in preprocessing to feature extraction, complex data fusion, and target detection. These advancements optimize the visual quality of imagery and enhance feature recognition and classification accuracy through advanced models. Remarkably, by integrating deep learning-based super-resolution reconstruction techniques, researchers have significantly improved the spatial resolution of satellite remote sensing imagery, enhancing the recognition capabilities for micro-scale features. This is critically important for applications in environmental monitoring and road traffic planning.

2.2. Deep Learning Object Detection Methods

Deep learning object detection networks have significantly advanced object detection technology using convolutional neural networks (CNNs) to automatically learn complex features from images [33]. These networks are generally classified into two main categories: two-stage detectors and one-stage detectors. Two-stage object detection networks generate candidate regions via a region proposal network (RPN) and perform more detailed classification and bounding box regression on each candidate region. This method is highly accurate but slower [34]. In contrast, one-stage object detection networks directly predict classes and bounding boxes in the imagery, skipping the candidate region generation step, which enhances speed at the expense of some accuracy [35,36,37]. The core advantage of these networks lies in their ability to automatically learn effective features for distinguishing various objects from training data without manual design, making object detection faster, more accurate, and adaptable to varying environmental conditions.
Significant progress has been made in applying object detection technology in various scenarios in recent years. Zou et al. [38] reviewed the evolution of object detection technology over the past two decades, covering the transition from early knowledge-based methods to modern deep learning methods. Some scholars [39,40,41] have reviewed targeted object detection tasks in specific scenarios and the adaptability and improvement of current object detection models in these scenarios. Other researchers [42,43] explored combining the advantages of CNNs and vision transformers to optimize object detection tasks by leveraging the complementary strengths of these technologies. Additional studies [44,45,46,47,48] have focused on improving deep learning object detection models by integrating other models or techniques to enhance model adaptability and detection performance.
In summary, the continuous development of deep learning object detection technology has significantly improved detection efficiency and accuracy in complex visual environments [49]. By integrating advanced network architectures and vision transformers, incorporating attention mechanisms, and utilizing feature pyramid networks, models have greatly enhanced their ability to recognize objects under multi-scale and occlusion conditions. Furthermore, by combining innovative learning strategies such as transfer learning, meta-learning, and federated learning, these technologies have increased the generalization ability of models and enabled the effective prediction of unseen categories, vastly expanding the application scope and adaptability of these models. This has profound practical application value in various fields.

3. Materials and Methods

3.1. YOLOv8

In 2023, the Ultralytics team developed the next-generation, state-of-the-art model YOLOv8, building on the early YOLO series with multiple innovations to improve its performance and adaptability. YOLOv8’s versatility allows it to handle various computational tasks, ranging from object detection to instance segmentation, while being compatible with multiple hardware platforms, from CPUs to GPUs. YOLOv8 has been divided into several editions (n, s, m, l, and x) depending on the network architecture’s depth and breadth to accommodate varying application requirements. Notably, YOLOv8n is crafted to be compact, making it ideal for use in environments with limited resources.
The YOLOv8 architecture mainly comprises four elements: the input, the backbone for feature extraction, the neck for feature enhancement, and the head dedicated to detection tasks. The configuration of the model is depicted in Figure 1.
In the input stage of YOLOv8, advanced data augmentation techniques, including Mosaic data augmentation, are employed. Mosaic augmentation boosts the model’s adaptability and generalization across diverse scales and backgrounds by randomly amalgamating various segments from multiple training images. Additionally, the model optimizes anchor box sizes and ratios through adaptive anchor technology, better matching the shapes of actual detection targets and improving detection accuracy. The anchor-free strategy simplifies the prediction process and increases processing speed by directly predicting objects’ center points and scales.
The backbone section focuses on feature extraction, with YOLOv8 incorporating several structural optimizations. The Conv module employs advanced convolutional networks to sequentially extract features from images while incorporating BatchNorm and activation functions at each level to augment the model’s capability for capturing non-linear characteristics. The C2f module integrates residual connections with extended convolutional operations, enhancing the transmission of features and addressing the issue of gradient disappearance often encountered in deep neural architectures. Simultaneously, the SPPF module utilizes multi-scale spatial pooling to amalgamate features across various scales, thereby boosting the model’s flexibility in adapting to varying object dimensions.
The neck section effectively fuses features extracted by the backbone. It uses FPN and PAN structures to optimize information flow between feature layers. FPN builds a top-down feature pyramid, combining high-level semantic information with low-level details. PAN enhances feature association through a bottom-up path, improving detail capture. These designs enable the neck to output high-quality feature maps rich in semantics and details.
The head section functions as the decision-making nucleus of the model and is tasked with the final classification and localization of objects. Its decoupled head configuration distinguishes classification and localization duties, tailoring the network architecture and parameters to enhance each function separately. This distinct separation results in more targeted and effective classification and localization. In terms of loss functions, YOLOv8 adopts a variety of methods for loss computation, such as VFL loss, along with DFL loss and CIOU loss. These methods refine the accuracy of regression and classification, thereby elevating the model’s proficiency in detecting targets of varying sizes and degrees of overlap.

3.2. NanoSight–YOLO

Detecting micro-vehicle targets in satellite remote sensing imagery poses unique challenges. Vehicle targets in high-resolution satellites typically occupy micro-pixel areas, significantly limiting the extraction of practical features and making conventional object detection methods challenging. Additionally, the complex and variable backgrounds of satellite images, including natural landscapes and artificial structures, can visually confuse vehicle targets. Furthermore, since satellite images often cover extensive areas, targets within the same image may appear at different angles and scales, requiring detection models to possess strong multi-scale and multi-view adaptability. These factors collectively constitute the main challenges in detecting micro-vehicle targets in satellite remote sensing imagery, necessitating advanced image processing technologies and algorithmic innovations to address them effectively. Figure 2 shows the structure of the improved NanoSight–YOLO model proposed in this paper.
YOLOv8, renowned for its robust data processing and feature extraction capabilities, has shown considerable promise in detecting micro-vehicle targets within satellite remote sensing imagery. Several factors complicate vehicle detection, adversely affecting the model’s accuracy. In response, this paper introduces a detection model termed NanoSight–YOLO, an enhancement based on YOLOv8. Tailored for the specific traits of micro-vehicle targets in such imagery, NanoSight–YOLO seeks to boost the model’s efficiency and flexibility in recognizing these micro-scale vehicles.

3.2.1. FasterNet Module

This study selects the lighter YOLOv8n as the base model to reduce the computational burden caused by using complex models for simple tasks and improve detection speed. This ensures a balance between detection accuracy and speed. Although the original CSPDarknet backbone network performs excellently in feature extraction, it has limitations in detecting micro targets due to potential information loss during the convolution and pooling processes.
FasterNet, an efficient backbone network designed by Chen et al. in 2023 for object detection tasks, emphasizes the optimization balance between speed and accuracy [50].
It achieves this goal through three core modules: the basic network module, the fast feature fusion module, and the efficient upsampling module. The basic network module extracts and processes features through convolutional layers, batch normalization layers, and activation function layers; the fast feature fusion module optimizes the integration of multi-scale features, enhancing the model’s ability to recognize targets of different scales; and the efficient upsampling module ensures computational efficiency while improving resolution, accurately restoring target positions. These designs enable FasterNet to quickly and accurately perform complex visual tasks in resource-constrained environments while maintaining a lightweight structure, making it suitable for the high-precision detection of micro targets in complex environments. The working principle and model structure are shown in Figure 3.
FasterNet typically consists of four hierarchical stages, each composed of multiple FasterNet blocks. Each block contains one PConv layer and two PWConv layers. Additionally, before each stage begins, the network includes either an embedding layer (used for dimension reduction and channel expansion) or a merging layer (used for further spatial downsampling).
PConv is the core of FasterNet. It applies regular convolution on a subset of input channels while keeping the remaining channels unchanged, thereby reducing unnecessary computations and memory accesses. The computation and memory access details of PConv are as follows.
$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^{2} \times c_{p}^{2} \quad (1)$

$\mathrm{MemoryAccess}_{\mathrm{PConv}} = h \times w \times 2c_{p} + k^{2} \times c_{p}^{2} \approx h \times w \times 2c_{p} \quad (2)$

In the equations, $h$ and $w$ represent the height and width of the feature map, $k$ is the size of the convolution kernel, and $c_{p}$ is the subset of input channels on which the convolution is actually computed.
In FasterNet, PWConv serves as the bridge between PConv and advanced network structures. It integrates features extracted by PConv, adjusts the number of channels to meet different network requirements, and enhances the network’s nonlinearity. The effectiveness of PWConv allows FasterNet to improve information integration and expressive capacity while reducing computational expenses. This makes it especially appropriate for use in environments with limited resources. Details on the computational processes and memory access associated with PWConv are outlined below.
$\mathrm{FLOPs}_{\mathrm{PWConv}} = h \times w \times c_{in} \times c_{out} \quad (3)$

$\mathrm{MemoryAccess}_{\mathrm{PWConv}} = h \times w \times (c_{in} + c_{out}) \quad (4)$
Through the innovative PConv and the carefully designed hierarchical architecture, FasterNet significantly enhances the network’s running speed and efficiency while ensuring accuracy. In the micro-vehicle target detection task in satellite remote sensing imagery, it effectively processes and retains critical spatial information and operates swiftly in resource-constrained environments, making it an ideal choice for micro-target detection in high-resolution satellite data.
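To make Equations (1)–(4) concrete, the following is a minimal PyTorch sketch of a FasterNet-style block built from one PConv and two PWConv layers. The partial ratio, channel counts, and block layout are illustrative assumptions for exposition, not the exact configuration used in NanoSight–YOLO.

```python
# Minimal sketch of a FasterNet-style block (assumed layout: PConv -> two PWConv).
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a k x k conv is applied to the first c_p channels only,
    leaving the remaining channels untouched to cut FLOPs and memory access."""
    def __init__(self, channels: int, partial_ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.cp = int(channels * partial_ratio)          # channels actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]          # split along the channel dimension
        return torch.cat((self.conv(x1), x2), dim=1)     # untouched channels pass through

class FasterNetBlock(nn.Module):
    """PConv -> PWConv (expand) -> PWConv (project), with a residual connection."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # pointwise conv, channel expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # pointwise conv, projection back
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))                # residual keeps gradient flow

if __name__ == "__main__":
    feat = torch.randn(1, 64, 80, 80)
    print(FasterNetBlock(64)(feat).shape)                # torch.Size([1, 64, 80, 80])
```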

3.2.2. Dynamic Head Module

The decoupled head in the YOLOv8 model has limitations in detecting micro-vehicle targets in satellite remote sensing imagery. While it performs well for larger targets, its high downsampling rate reduces its effectiveness in capturing the details of micro targets. This is especially problematic in satellite remote sensing, where features are not prominent, leading to missed and false detections. The primary reason is that the model’s feature extraction layers fail to retain sufficient detail when processing micro targets in highly complex backgrounds, and its integration of classification and localization is also limited for micro targets.
Dai et al. introduced a cutting-edge framework for object detection known as a dynamic head. This system incorporates multi-dimensional attention mechanisms within the model’s head structure for detecting objects, enabling comprehensive processing awareness of scale, space, and task-specific dimensions [51]. This method addresses the limitations of traditional object detection head designs by dynamically adjusting the distribution of attention to optimize performance without adding extra computational burden.
The dynamic head enhances the head structure by combining three distinct attention strategies: scale-aware, spatial-aware, and task-aware attention mechanisms. Scale-aware attention modifies layer weights within the feature pyramid to suit objects of varying sizes. Spatial-aware attention employs deformable convolution networks to sharpen the model’s emphasis on critical spatial points within images. Task-aware attention tailors the activation intensity of feature channels to boost performance across different detection tasks, such as classification and localization. These mechanisms enable the dynamic head to elevate detection precision and adaptability while preserving processing efficiency. The detailed implementation steps are outlined below.
$W(F) = \pi(F) \cdot F \quad (5)$

$W(F) = \pi_{C}\big(\pi_{S}\big(\pi_{L}(F) \cdot F\big) \cdot F\big) \cdot F \quad (6)$

$\pi_{L}(F) \cdot F = \sigma\!\left(f\!\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F \quad (7)$

$\pi_{S}(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot F\big(l;\, p_{k} + \Delta p_{k};\, c\big) \cdot \Delta m_{k} \quad (8)$

$\pi_{C}(F) \cdot F = \max\big(\alpha^{1}(F) \cdot F_{c} + \beta^{1}(F),\ \alpha^{2}(F) \cdot F_{c} + \beta^{2}(F)\big) \quad (9)$
In this equation, F denotes the feature tensor, typically output from the backbone of a neural network, used for subsequent object detection tasks; L indicates the number of levels in the feature pyramid, reflecting different scales of feature layers; K is the number of sparse sampling locations, used for applying deformable convolution and executing attention weight focusing; S represents the number of spatial locations in the feature tensor, including the dimensions of width and height; C denotes the number of channels in the feature tensor, with each channel carrying a type of feature information; πL, πS, and πC, respectively, represent attention functions for different dimensions: levels, spatial, and channels; σ represents the hard sigmoid function, used to constrain inputs within the range of 0 to 1, commonly for normalization; f denotes the linear function, often implemented through a 1 × 1 convolution layer, used to process feature data; W(F) represents the weighted feature tensor, which is the feature tensor processed by the attention mechanism; pk and Δpk denote the original and offset spatial locations, used in deformable convolution to adjust spatial positions; Δmk signifies the importance weights of spatial locations, used to emphasize specific features; and α1, α2, β1, and β2 represent learned parameters for task-aware attention, used to control channel activations under different tasks.
Equation (5) demonstrates how fundamental self-attention weights enhance significant features while diminishing those of lesser importance. Equation (6) describes the decomposed attention operation, fine-tuning the feature tensor. Equation (7) represents scale-aware attention, dynamically adjusting the weights of features from different levels to accommodate objects of varying scales. Equation (8) denotes spatial-aware attention, achieved through deformable convolution to enable spatial awareness, allowing the model to focus more on critical regions within the imagery. Equation (9) illustrates task-aware attention, which optimally adjusts feature channels by comparing and selecting the results of different parameter combinations.
As seen in Figure 4, the workflow of the dynamic head begins with feature inputs extracted by the backbone network. These features first undergo adjustments in the scale-aware module to adapt to targets of different sizes within the imagery. The adjusted features then enter the spatial-aware module, which further refines the spatial positioning of the features, emphasizing important spatial locations. Finally, the task-aware module optimizes channel features for specific detection tasks, providing more precise feature representations for subsequent tasks such as classification or regression. The dynamic head effectively addresses different scale, location, and task requirements through this hierarchical application of attention mechanisms, thereby enhancing performance across various object detection frameworks.
Therefore, the comprehensive application of the dynamic head enables exceptional performance in detecting micro-vehicle targets in satellite remote sensing imagery. This is particularly significant in scenarios requiring accurately identifying and localizing micro-scale targets within extensive, complex backgrounds. By dynamically adjusting attention distribution, the dynamic head enhances the model’s ability to perceive micro-vehicle targets in satellite imagery. It optimizes detection, making it especially effective and practical for satellite remote sensing imagery analysis.
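The composition of attention steps in Equations (5)–(9) can be sketched in PyTorch as follows. For brevity, only the scale-aware and task-aware steps are shown; the spatial-aware step, which uses deformable convolution in the original dynamic head, is omitted here, and all module sizes and pooling choices are simplified, illustrative assumptions rather than the exact implementation.

```python
# Simplified sketch of scale-aware (Eq. 7) and task-aware (Eq. 9) attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """pi_L: one weight per pyramid level, computed from globally pooled features."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, 1, kernel_size=1)    # the 1x1 linear function f

    def forward(self, feats):                              # feats: list of L tensors (B, C, H, W)
        out = []
        for level in feats:
            pooled = level.mean(dim=(2, 3), keepdim=True)  # average over spatial positions
            w = F.hardsigmoid(self.fc(pooled))             # sigma(f(.)) constrained to [0, 1]
            out.append(w * level)
        return out

class TaskAwareAttention(nn.Module):
    """pi_C: channel-wise max of two learned affine activations (dynamic-ReLU style)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, 4 * channels)        # predicts alpha1, beta1, alpha2, beta2

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        theta = self.fc(x.mean(dim=(2, 3))).view(b, 4, c, 1, 1)
        a1, b1, a2, b2 = theta[:, 0], theta[:, 1], theta[:, 2], theta[:, 3]
        return torch.maximum(a1 * x + b1, a2 * x + b2)     # Equation (9)

if __name__ == "__main__":
    pyramid = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
    scaled = ScaleAwareAttention(64)(pyramid)              # scale-aware re-weighting per level
    task = TaskAwareAttention(64)
    print([task(x).shape for x in scaled])                 # task-aware channel activation
```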

3.2.3. Bi-Level Routing Attention Module

In satellite remote sensing imagery analysis, vehicle targets typically occupy only a tiny portion of the imagery and often face challenges such as low resolution, insufficient detail, and target clustering. During feature extraction, many negative samples are usually generated, leading to a highly uneven sample distribution, which affects the efficiency of model training and feature learning. The model incorporates an attention mechanism to tackle these challenges and concentrate on essential regions within the remote sensing imagery. This approach successfully consolidates data across various scales and levels, boosting the analysis of feature channels with vital target information. Consequently, this enhances the precision and speed of identifying micro-vehicle targets.
However, traditional attention mechanisms are often limited by their fixed receptive fields, which lack flexibility and adaptability in handling diverse target distributions. This limitation can lead to unstable performance in different application scenarios, making it difficult to effectively capture global information while facing high computational complexity. Bi-level routing attention (BiLRA) is an attention mechanism designed for visual transformers, aimed at addressing the computational burden traditional transformers face when processing large-scale or high-resolution imagery [52]. The specific implementation details are shown in Figure 5.
The BiLRA model enhances performance via three principal elements: partitioning regions and projecting inputs, routing from one area to another using a directed graph, and computing attention from token to token.
First, by segmenting the input feature map into multiple micro-regions and extracting regional features, BiLRA reduces the amount of data processed and localizes the computation. This not only speeds up data processing but also improves efficiency. Next, by constructing an inter-regional association graph and selecting key regions, BiLRA can dynamically focus on the most informative areas, avoiding ineffective and redundant computations. Finally, by performing fine-grained token-level attention calculations within these critical regions, BiLRA accurately captures essential visual features, enhancing the model’s accuracy and performance across various visual tasks. This series of steps enables BiLRA to improve computational efficiency while ensuring high-quality output, making it a powerful tool for handling complex visual tasks. The computational flow is as follows.
$Q = X^{r}W^{q},\quad K = X^{r}W^{k},\quad V = X^{r}W^{v} \quad (10)$

$A^{r} = Q^{r}\big(K^{r}\big)^{T} \quad (11)$

$K^{g} = \mathrm{gather}(K, I^{r}),\quad V^{g} = \mathrm{gather}(V, I^{r}) \quad (12)$

$O = \mathrm{Attention}\big(Q, K^{g}, V^{g}\big) + \mathrm{LCE}(V) \quad (13)$
In the equations, $X^{r}$ represents the re-partitioned feature map; $W^{q}$, $W^{k}$, and $W^{v}$ are the projection matrices for query, key, and value, used to transform the input feature $X$ into query, key, and value; $Q^{r}$ and $K^{r}$ are the region-level query and key, used to calculate the affinity between regions; $A^{r}$ is the region-to-region affinity matrix, used to determine the attention routing; $I^{r}$ is the routing index matrix, selecting which other regions each region should focus on based on the highest values in the affinity matrix $A^{r}$; $K^{g}$ and $V^{g}$ are the gathered key and value tensors, collected from $K$ and $V$ based on the routing index $I^{r}$; $O$ represents the output feature map, which is the value after being weighted by the attention mechanism; and $\mathrm{LCE}(V)$ stands for local context enhancement, which typically uses depthwise convolution to enhance local information in the feature map.
Equation (10) forms the basis for implementing the attention mechanism. Equation (11) calculates the inter-regional affinity to determine the strength of associations between regions. Equation (12) focuses the attention mechanism’s computation on the most relevant areas selected through the region-to-region affinity graph, optimizing computational resources and processing efficiency. Equation (13) performs the final attention calculation, combining it with local context enhancement to output the adjusted feature map.
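The routing procedure in Equations (10)–(13) can be illustrated with the minimal single-head PyTorch sketch below. The LCE(V) depthwise-convolution term and multi-head handling are omitted, and the region size and top-k values are illustrative assumptions, not the settings used in the paper.

```python
# Minimal single-head sketch of bi-level routing attention (Equations (10)-(13)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRoutingAttention(nn.Module):
    def __init__(self, dim: int, region: int = 7, topk: int = 4):
        super().__init__()
        self.region, self.topk = region, topk
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)      # W_q, W_k, W_v in one projection

    def forward(self, x):                                   # x: (B, H, W, C), H and W divisible by region
        B, H, W, C = x.shape
        r = self.region
        n = (H // r) * (W // r)                             # number of regions
        # reshape into (B, regions, tokens_per_region, C)
        x = x.view(B, H // r, r, W // r, r, C).permute(0, 1, 3, 2, 4, 5).reshape(B, n, r * r, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)              # token-level Q, K, V per region
        qr, kr = q.mean(dim=2), k.mean(dim=2)               # region-level query/key (B, n, C)
        Ar = qr @ kr.transpose(-1, -2)                      # region-to-region affinity (Eq. 11)
        idx = Ar.topk(self.topk, dim=-1).indices            # routing index I_r (B, n, topk)

        def gather(t):                                      # gather routed regions (Eq. 12)
            return torch.stack([t[b, idx[b]] for b in range(B)])  # (B, n, topk, r*r, C)

        kg, vg = gather(k).flatten(2, 3), gather(v).flatten(2, 3)  # (B, n, topk*r*r, C)
        attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)  # token attention (Eq. 13)
        out = attn @ vg                                     # (B, n, r*r, C)
        return out.view(B, H // r, W // r, r, r, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

if __name__ == "__main__":
    feat = torch.randn(1, 28, 28, 64)                       # a 4 x 4 grid of 7 x 7 regions
    print(BiLevelRoutingAttention(64)(feat).shape)          # torch.Size([1, 28, 28, 64])
```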

3.2.4. Wise-Intersection over Union Module

In object detection, the loss function for bounding box regression is crucial. It aims to accurately predict the bounding box location to approximate the ground truth, achieving precise localization and recognition of the target area. The CIoU loss function, employed in YOLOv8, extends the DIoU loss function by incorporating the bounding box aspect ratio, thus enhancing regression precision. CIoU refines the prediction of bounding boxes by assessing three key factors: the area of overlap, the distance between center points, and the aspect ratio. Nonetheless, while it offers a thorough assessment of the bounding box’s shape and position, the CIoU encounters challenges in addressing minor variations among micro targets, which limits its utility in detecting micro-sized objects in densely populated scenes. The implementation of CIoU proceeds as described below.
$\rho = \sqrt{\big(C_{pred}^{x} - C_{gt}^{x}\big)^{2} + \big(C_{pred}^{y} - C_{gt}^{y}\big)^{2}} \quad (14)$

$\alpha = \frac{\nu}{1 - IoU + \nu} \quad (15)$

$\nu = \frac{4}{\pi^{2}}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{pred}}{h_{pred}}\right)^{2} \quad (16)$

$CIoU = 1 - IoU + \frac{\rho^{2}}{c^{2}} + \alpha\nu \quad (17)$
In the equations, $\rho$ represents the distance between the center points, $C_{pred}$ is the center of the predicted bounding box, $C_{gt}$ is the center of the ground truth bounding box, $\nu$ denotes the aspect ratio loss, and $w_{pred}$ and $h_{pred}$ are the width and height of the predicted bounding box, respectively. At the same time, $w_{gt}$ and $h_{gt}$ are the width and height of the ground truth bounding box, respectively. Variable $c$ is the diagonal length of the smallest enclosing box that contains both the predicted and ground truth bounding boxes. The parameter $\alpha$ is used to balance the aspect ratio loss, typically a coefficient that is dynamically adjusted based on the IoU. The computation method for $\alpha$ is shown in Equation (15), where the IoU is the intersection over union between the predicted bounding box $B_{pred}$ and the ground truth bounding box $B_{gt}$.
Equation (14) calculates the distance between the centers of the predicted and ground truth bounding boxes. Equation (16) considers the consistency of the aspect ratios of the predicted and ground truth bounding boxes. Equation (17) represents the final CIoU loss.
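For reference, the following is a short PyTorch sketch of the standard CIoU loss as given in Equations (14)–(17). Boxes are assumed to be in (x1, y1, x2, y2) format; this follows the textbook definition rather than any specific repository implementation.

```python
# Sketch of the CIoU loss (Equations (14)-(17)) for boxes in (x1, y1, x2, y2) format.
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # intersection and union -> IoU
    xi1, yi1 = torch.max(pred[..., 0], gt[..., 0]), torch.max(pred[..., 1], gt[..., 1])
    xi2, yi2 = torch.min(pred[..., 2], gt[..., 2]), torch.min(pred[..., 3], gt[..., 3])
    inter = (xi2 - xi1).clamp(0) * (yi2 - yi1).clamp(0)
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    iou = inter / (wp * hp + wg * hg - inter + eps)

    # squared centre distance rho^2 and diagonal c^2 of the smallest enclosing box
    rho2 = ((pred[..., 0] + pred[..., 2] - gt[..., 0] - gt[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - gt[..., 1] - gt[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], gt[..., 2]) - torch.min(pred[..., 0], gt[..., 0])
    ch = torch.max(pred[..., 3], gt[..., 3]) - torch.min(pred[..., 1], gt[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v               # Equation (17)
```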
Wise-intersection over union (Wise-IoU) introduces a dynamic non-monotonic focusing mechanism that adjusts the gradient gain by evaluating the outlier degree of anchor boxes. This mechanism optimizes traditional loss functions’ performance when handling varied-quality anchor boxes [53]. This technique markedly enhances the precision of object localization in detection models using medium- and low-quality anchor boxes and diminishes the excessive emphasis on high-quality anchor boxes. It effectively balances the learning focus of the model, thereby enhancing overall detection performance in complex environments. The implementation process is as follows, and Figure 6 shows how the IoU works.
$L_{IoU} = 1 - IoU \quad (18)$

$\beta = \frac{L_{IoU}}{\overline{L_{IoU}}} \quad (19)$

$r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}} \quad (20)$

$L_{WIoU} = r \cdot L_{IoU} \quad (21)$
From Equations (18)–(21), $L_{IoU}$ calculates the loss based on the overlap between the predicted and ground truth bounding boxes. $\beta$ represents the outlier degree, which assesses the current performance of the anchor box relative to its historical performance. $\overline{L_{IoU}}$ denotes the exponential moving average of the IoU loss, used to smooth out the historical values of the IoU loss. $r$ represents the gradient weight, which controls the attention given to anchor boxes of different quality in the loss function. $\alpha$ and $\delta$ are hyperparameters used to adjust the gradient weight, directly influencing the strength and manner in which the outlier degree regulates the gradient weight.
Equation (18) calculates the IoU loss, minimizing the overlapping area between the predicted and ground truth bounding boxes. Equation (19) measures the deviation of the current IoU loss of each anchor box from its historical average IoU loss to assess the quality of the anchor box. Equation (20) computes the gradient weight, dynamically adjusting the gradient allocation based on the outlier degree of the anchor box to optimize the model’s learning focus on anchor boxes of different quality. Equation (21) represents the final form of the Wise-IoU loss function, combining the basic IoU loss with the gradient weight adjusted by the outlier degree, used to train the object detection model more effectively.
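A minimal sketch of the Wise-IoU weighting described by Equations (18)–(21) is shown below: the outlier degree compares each anchor’s IoU loss with a running mean, and the non-monotonic gain r down-weights both very easy and very poor anchors. The hyperparameter values and the momentum of the running mean are illustrative assumptions.

```python
# Sketch of Wise-IoU weighting (Equations (18)-(21)); hyperparameters are illustrative.
import torch

class WiseIoULoss:
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_liou = 1.0                               # running (EMA) mean of the IoU loss

    def __call__(self, iou: torch.Tensor) -> torch.Tensor:
        l_iou = 1.0 - iou                                  # Equation (18)
        # update the exponential moving average with detached values
        self.mean_liou = ((1 - self.momentum) * self.mean_liou
                          + self.momentum * l_iou.detach().mean().item())
        beta = l_iou.detach() / self.mean_liou             # outlier degree, Equation (19)
        r = beta / (self.delta * self.alpha ** (beta - self.delta))  # gradient gain, Equation (20)
        return (r * l_iou).mean()                          # Equation (21)

# usage: combine with any IoU computation, e.g. the IoU values from a CIoU-style routine
# loss = WiseIoULoss()(iou_values)
```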

3.2.5. Focaler-IoU Bounding Box Module

The conventional IoU loss function measures the overlap between predicted and ground truth bounding boxes. Although it is generally effective, it encounters issues with vanishing gradients in cases of zero overlap, which obstructs the model’s learning in instances with notable positional differences. Furthermore, the IoU loss does not differentiate between simple and challenging samples; it applies the same criteria uniformly to all. This may lead the model to favor optimizing easily detectable samples while neglecting those that are more difficult to detect but crucial for improving overall performance.
Focaler-IoU is an improved loss function for bounding box regression in object detection, designed to enhance the model’s focus on complex samples by introducing a focal mechanism [54]. This mechanism is inspired by the application of focal loss in classification problems. By adjusting the form of the loss function and using linear interval mapping to improve the traditional IoU loss function, Focaler-IoU effectively enhances the model’s learning efficiency and localization accuracy when dealing with difficult-to-locate objects, such as micro or partially occluded targets. The design of Focaler-IoU is based on a thorough reflection and improvement of the shortcomings of existing IoU-based loss functions in handling complex samples, enabling superior performance in micro-object detection tasks. The specific implementation process is as follows.
$IoU^{focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases} \quad (22)$

$L_{Focaler\text{-}IoU} = 1 - IoU^{focaler} \quad (23)$

$L_{Focaler\text{-}GIoU} = L_{GIoU} + IoU - IoU^{focaler} \quad (24)$

$L_{Focaler\text{-}DIoU} = L_{DIoU} + IoU - IoU^{focaler} \quad (25)$

$L_{Focaler\text{-}CIoU} = L_{CIoU} + IoU - IoU^{focaler} \quad (26)$

$L_{Focaler\text{-}EIoU} = L_{EIoU} + IoU - IoU^{focaler} \quad (27)$

$L_{Focaler\text{-}SIoU} = L_{SIoU} + IoU - IoU^{focaler} \quad (28)$
Equation (22) is the core of Focaler-IoU. Setting two thresholds, d and u, it adjusts the impact of the IoU values, causing the model to pay more attention to IoU values that fall between d and u during training. This modification helps sharpen the focus on challenging samples. Equation (23) details the loss function derived from Focaler-IoU, which targets reducing discrepancies between predicted and actual bounding boxes. Equations (24)–(28) integrate Focaler-IoU into the traditional IoU-based loss function. This integration retains the advantages of the original loss function while further enhancing the model’s ability to focus on and handle specific samples, such as hard-to-classify samples.
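The linear interval mapping of Equations (22)–(23) and its combination with an existing IoU-based loss (Equations (24)–(28)) can be sketched as follows. The threshold values d and u are illustrative; the method leaves them as tunable hyperparameters.

```python
# Sketch of the Focaler-IoU mapping and its combination with a base IoU loss.
import torch

def focaler_iou(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    """Map raw IoU linearly between d and u, clamping outside (Equation (22))."""
    return ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)

def focaler_loss(iou: torch.Tensor, base_loss: torch.Tensor = None) -> torch.Tensor:
    """L_Focaler-IoU, optionally added on top of a GIoU/DIoU/CIoU-style base loss."""
    if base_loss is None:
        return 1.0 - focaler_iou(iou)                     # Equation (23)
    return base_loss + (iou - focaler_iou(iou))           # Equations (24)-(28)
```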

3.2.6. Micro-Object Detection Module

The head structure of the YOLOv8 model is a crucial component of the object detection architecture. It directly processes features extracted by the backbone network to make final predictions. A sophisticated design of upsampling and multi-level feature fusion layers effectively integrates features from different levels, improving the model’s accuracy in handling multi-scale information. This design includes multiple upsampling steps to restore high-resolution features and uses concatenation layers to fuse outputs from different depths, significantly enhancing the model’s capability to detect micro-objects. Adding C2f layers further optimizes feature representation, improving the efficiency of capturing local details. The final detection layer uses these enriched features for efficient object localization and classification. Overall, the head configuration of YOLOv8 ensures quick response times while adapting to complex and variable application environments, particularly excelling in detecting objects of different sizes and complex backgrounds.
While the original head of the YOLOv8 model has capabilities for multi-scale detection and feature fusion, it has certain limitations in micro-object detection due to constraints in feature capture, feature layer fusion methods, and sensitivity to micro-object details.
The micro-object detection layer enhances the original YOLOv8 model by introducing advanced upsampling and multi-layer feature fusion strategies, significantly improving its ability to detect micro-objects. Multiple upsampling operations ensure the quality of high-resolution feature maps, allowing the model to maintain and capture critical detail information even when the object size is extremely small. The newly added C2f layers and other feature fusion mechanisms strengthen the information flow between different levels of the network, optimizing the capture of micro-scale variations. These integrated technologies enhance the accuracy of micro-object recognition and localization, enabling the model to effectively utilize detailed information from various network layers, ensuring that even the most minor object features are not overlooked. This significantly improves micro-objects’ recognition precision and localization accuracy in complex backgrounds.
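The extra high-resolution detection path can be pictured with the schematic PyTorch sketch below: the highest-resolution neck feature is upsampled once more, concatenated with an earlier, higher-resolution backbone feature, and fused before an additional detection head. The channel counts and the simple convolutional stand-in for a C2f block are illustrative assumptions, not the exact NanoSight–YOLO layer configuration.

```python
# Schematic sketch of an added micro-object detection branch (illustrative sizes).
import torch
import torch.nn as nn

class MicroDetectionBranch(nn.Module):
    def __init__(self, neck_ch: int = 128, backbone_ch: int = 64,
                 out_ch: int = 64, num_outputs: int = 6):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Sequential(                         # stand-in for a C2f-style fusion block
            nn.Conv2d(neck_ch + backbone_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )
        self.head = nn.Conv2d(out_ch, num_outputs, 1)      # per-cell box + class predictions

    def forward(self, neck_feat, shallow_feat):            # neck: (B,128,H,W), shallow: (B,64,2H,2W)
        x = torch.cat((self.upsample(neck_feat), shallow_feat), dim=1)
        return self.head(self.fuse(x))                     # high-resolution micro-target map

if __name__ == "__main__":
    neck, shallow = torch.randn(1, 128, 40, 40), torch.randn(1, 64, 80, 80)
    print(MicroDetectionBranch()(neck, shallow).shape)     # torch.Size([1, 6, 80, 80])
```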

4. Experimental Results and Analysis

4.1. Experimental Platform and Dataset

4.1.1. Experimental Platform

Table 1 shows the experimental environment configuration for the improved model in this study. All experiments were conducted under the same configuration.
A transfer learning strategy was implemented to minimize the model’s training duration. Furthermore, the efficacy of this model was corroborated by conducting vehicle detection tasks in satellite remote sensing images, where it was compared against state-of-the-art object detection algorithms. Distributed training based on PyTorch was employed by setting environment variables and launching four processes on each node to handle data simultaneously. Unified parameters were used during the training phase, as shown in Table 2.

4.1.2. Experimental Dataset

(1) VISO Dataset
The VISO dataset [55] has been meticulously crafted to detect and track moving targets in satellite videos. This expansive dataset features 40 high-resolution videos recorded using the Jilin-1 satellite platform. It encompasses 853,911 manually labelled instances of diverse target types, including airplanes, automobiles, ships, and trains. This makes the VISO dataset ideal for researching and developing efficient vehicle detection models, particularly when dealing with high-complexity backgrounds and targets of varying scales.
The primary goal of this paper is to monitor road traffic conditions, with a specific focus on detecting micro-vehicle targets in satellite remote sensing datasets. To address the challenges of detecting micro-vehicle targets due to annotation errors, omissions, low resolution, and complex backgrounds in the original data, several vital optimizations have been made:
  • The original data were re-annotated to correct existing errors and fill in omissions, ensuring accuracy and completeness.
  • Super-resolution techniques were applied to reconstruct the satellite remote sensing imagery data, enhancing the imagery clarity and detail and making vehicle targets more discernible.
  • An object segmentation model was used to precisely extract road areas, effectively reducing the impact of background complexity on detection results and improving the accuracy and reliability of vehicle target detection.
These improvements significantly enhance the accuracy and robustness of vehicle target detection, providing more accurate and reliable technical support for monitoring traffic conditions.
(2) DOTA Dataset
To verify the model’s adaptability and generalization ability, this study employed the DOTA dataset, which was used in previous work. It utilized the same training parameters as earlier studies to evaluate the model’s performance under different data characteristics [56,57].
The DOTA dataset provides an excellent foundation for vehicle detection through its detailed category division and rotated bounding box annotations. Notably, in versions 1.5 and 2.0, the dataset further refines vehicle targets into small and large vehicles, allowing algorithms to more accurately identify and handle vehicles in diverse environments. This detailed annotation accommodates vehicles appearing at any angle in imagery and enhances the detection capability in high-density scenes, targets with significant size variations, and complex backgrounds. Consequently, this facilitates the development and practical deployment of technologies in applications such as traffic monitoring and urban planning.

4.2. Evaluation Metrics

In this study, we utilized precision (P), recall (R), and mean average precision (mAP) as the critical metrics for evaluating the model’s overall effectiveness. The formulas for these calculations are presented below.
Precision (P) reflects the ratio of accurately predicted positive samples out of all predicted positives. Positive predictions are categorized into true positives (TP) and false positives (FP). Usually, samples possessing an intersection over union (IoU) that meets or exceeds the set confidence threshold are deemed positive, whereas samples where the IoU falls below this threshold are considered negative. Therefore, P can be expressed as
$Precision = \frac{TP}{TP + FP} \quad (29)$
Recall (R) is derived from actual samples, indicating the fraction of positive samples correctly identified among all actual positives. These actual positive samples can be divided into TP and FN. Therefore, R can be expressed as
$Recall = \frac{TP}{TP + FN} \quad (30)$
Mean average precision at 0.5 IoU (mAP@0.5) calculates the average precision for a given category at an IoU cutoff of 0.5. This indicator assesses the model’s capacity to sustain precision at different recall levels. A higher mAP@0.5 indicates that the model can sustain a high precision at higher recall rates, which is crucial for rapid detection scenarios. The calculation method is as follows:
$AP@0.5 = \frac{P_{1} + P_{2} + \cdots + P_{n}}{n} = \frac{1}{n}\sum_{i=1}^{n} P_{i} \quad (31)$

$mAP@0.5 = \frac{1}{C}\sum_{k=1}^{C} AP@0.5_{k} \quad (32)$
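The metric definitions in Equations (29)–(32) translate directly into code. The sketch below assumes detections have already been matched to ground truth at the 0.5 IoU threshold, and it mirrors the averaging form given above rather than an area-under-curve AP implementation.

```python
# Sketch of the evaluation metrics in Equations (29)-(32), assuming matching at IoU 0.5 is done.

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    p = tp / (tp + fp) if tp + fp else 0.0                 # Precision, Equation (29)
    r = tp / (tp + fn) if tp + fn else 0.0                 # Recall, Equation (30)
    return p, r

def average_precision(precisions: list) -> float:
    """AP@0.5 as the mean precision over n sampled recall points (Equation (31))."""
    return sum(precisions) / len(precisions)

def mean_average_precision(ap_per_class: list) -> float:
    """mAP@0.5 averages AP@0.5 over all C classes (Equation (32))."""
    return sum(ap_per_class) / len(ap_per_class)

if __name__ == "__main__":
    p, r = precision_recall(tp=80, fp=20, fn=15)
    print(p, r, mean_average_precision([0.71, 0.66]))
```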

4.3. Experimental Results and Analysis

To verify that each improvement in this section of the model outperforms the baseline, we conducted an ablation study to analyze their effectiveness.

4.3.1. Effectiveness of FasterNet Module

To validate the effectiveness of the FasterNet model as the backbone feature extraction network and to demonstrate that FasterNet performs best among several advanced backbone feature extraction networks, we compared it with currently advanced backbone network modules. The experimental results are shown in Table 3.
Table 3 shows that the performance of different models as backbone feature extraction networks varies in the task of detecting micro-vehicle targets in satellite remote sensing imagery. Among them, CSPDarknet, the baseline model, exhibits balanced performance and is suitable for a wide range of object detection tasks. MobileNetV4, with a P of 62.0% and R of 65.3%, performs slightly worse, and its lightweight design sacrifices some detection details and coverage, resulting in a mAP of only 65.4%. LSKNet and StarNet perform poorly in this study’s data environment, potentially requiring further optimization of their receptive fields to adapt to different imaging environments. FasterNet, on the other hand, shows the best performance across all three metrics, with a P of 67.1%, R of 68.0%, and a mAP of 70.7%, demonstrating its advantages in high-precision and high-coverage target detection tasks in satellite remote sensing imagery.

4.3.2. Effectiveness of Dynamic Head Module

In this section, we compared the dynamic head model with current advanced detection head modules to verify its effectiveness as a detection head in object detection models and demonstrate that it performs best among many advanced detection heads. The experimental results are shown in Table 4.
In Table 4, the AUX detection head is designed mainly for an auxiliary role, and when used independently, it cannot fully leverage its advantages. Therefore, its performance in various aspects is slightly inferior compared to the decoupled head used in the original model. The TBD detection head has shortcomings in feature mapping and spatial relationship encoding, leading to poor detection performance for micro-vehicle targets in satellite remote sensing imagery with sparse features, resulting in a significant decline in all indicators. In contrast, the dynamic head shows improvements across all metrics, particularly in R and the mAP, indicating its strong adaptive adjustment capability to the characteristics of the input data and its effectiveness in handling object detection tasks in complex backgrounds.

4.3.3. Effectiveness of Micro-Target Detection Layer

In this section, to verify the effectiveness of the micro-object detection layer, we compared the model with the added micro-object detection layer to the original model. The experimental results are shown in Table 5.
Table 5 clearly shows significant alterations in the model’s performance metrics following the integration of the micro-object detection layer. Specifically, the P slightly decreased from 64.7% to 63.3%. This decrease is mainly due to the model identifying more false positive targets as it strives to capture more micro-objects, leading to an increase in false alarms, which affects precision. However, the R significantly improved from 67.2% to 73.6%, indicating that adding the micro-object detection layer markedly enhanced the model’s ability to detect actual micro-objects. This improvement is significant in complex satellite remote sensing imagery, enabling a more comprehensive identification and coverage of micro-vehicle targets. The mAP also increased from 68.6% to 73.4%, demonstrating a better balance between precision and recall, thus improving the model’s overall performance in accurately capturing micro targets.

4.3.4. Effectiveness of Bi-Level Routing Attention Module

We validated the effectiveness of integrating attention mechanisms into the object detection model by comparing it with prevailing state-of-the-art attention mechanisms. Our results demonstrate that the BiLRA mechanism surpasses other sophisticated attention mechanisms in performance. The experimental results are presented in Table 6.
Table 6 shows that introducing different attention mechanisms significantly impacts the model’s performance in detecting micro-vehicle targets in satellite remote sensing imagery. Specifically, CPCA (channel prior convolutional attention) enhances important channel features, improving precision, recall, and mAP. Although MLCA (mixed local channel attention) improves recall, it does so at the expense of precision and mAP. SimAM (simple parameter-free attention module) significantly enhances model performance by directly computing 3D attention weights on feature maps within the convolutional network. BiLRA employs a complex hierarchical attention mechanism, achieving the most significant improvements across all metrics and demonstrating solid capabilities in handling and analyzing complex imagery.

4.3.5. Effectiveness of Wise-Intersection over Union Module

To validate the effectiveness of Wise-IoU for object detection models and to demonstrate its superior performance among various advanced loss functions, we compared it with currently advanced loss functions. The experimental results are shown in Table 7.
Table 7 shows that EIoU lacks sensitivity in handling micro targets, leading to a slight decline in performance. Although SIoU theoretically optimizes by considering angle errors, the complexity of angle and shape adjustments may prevent it from fully achieving the desired effect in practical applications. GIoU shows a decline in various metrics in actual experiments, and its theoretical advantages are not fully realized in the practical detection of micro-scale targets. In contrast, Wise-IoU employs a dynamic non-monotonic attention mechanism, making it more effective in handling medium-quality anchor boxes and improving generalization capabilities. This is particularly evident in the accurate localization and identification of micro targets in remote sensing satellite imagery with complex backgrounds and diverse target scales.

4.3.6. Effectiveness of Focaler-IoU Bounding Box Module

To validate the effectiveness of Focaler-IoU for object detection models and to demonstrate its superior performance among advanced bounding box IoUs, we compared it with currently advanced bounding box IoUs. The experimental results are shown in Table 8.
Table 8 shows that different modifications of a bounding box IoU have varying effects on the model’s performance in detecting micro-vehicle targets in satellite remote sensing imagery. Specifically, MPD-IoU and Inner-IoU did not significantly improve recall and precision. Inner-IoU, in particular, showed a slight decrease in both metrics compared to the baseline model, indicating that strategies for scale adjustment and bounding box localization need further optimization. In contrast, Focaler-IoU, by focusing on the regression of hard-to-classify samples and adjusting the weights of positive and negative samples to reduce false detections and improve the localization accuracy of bounding boxes, significantly improved the model’s recall and mAP.

5. Discussion

To thoroughly verify the updated model’s performance improvements, a series of ablation experiments were conducted on the enhanced object detection model. Additionally, a detailed comparative analysis was performed against current state-of-the-art object detection models to highlight our model’s performance advantages.

5.1. Ablation Experiments

Ablation experiments are critical in evaluating the contributions and necessity of each improvement to the model’s performance. This study meticulously tested the model’s performance on the same dataset after improving different components, delving deeply into the specific impact of each improvement on the model’s efficacy. The experimental results are documented in Table 9.
Analyzing the ablation experiment results in Table 9, model ⑧ integrates the feature extraction advantages of FasterNet and the scale adaptability of the dynamic head. This integration enhances feature-level performance and improves the effectiveness of handling targets of different scales, resulting in improvements across all metrics. Model ⑨ builds on model ⑧ by adding a micro-target detection layer, further enhancing the model’s ability to detect micro-sized targets, significantly boosting recall and mAP, thereby strengthening the model’s capability to capture fine features. Model ⑩, based on model ⑨, introduces BiLRA, optimizing feature routing and attention allocation, which enhances the model’s accuracy and flexibility in handling complex scenarios. This maintains the sensitivity to micro targets and improves overall target detection accuracy and recognition range, leading to advancements in all major performance indicators. Model ⑪, building on model ⑩, employs Wise-IoU, which optimizes target localization accuracy by improving bounding box precision and the regression strategy. Although this slightly sacrifices recall, it significantly enhances precision and mAP, markedly improving the model’s ability to locate targets accurately. Model ⑫, based on model ⑪, utilizes the Focaler-IoU bounding box and integrates the advantages of all previous individual and combined improvements. This comprehensive strategy, through the synergy of different techniques, not only balances recall and precision but also achieves the highest mAP.

5.2. Contrast Experiment

To enhance the validation of the proposed NanoSight–YOLO model’s advancements and robustness, we performed a comparative analysis against various leading object detection models, the outcomes of which are detailed in Table 10.
The ablation experiments above verify the effectiveness of each improvement and demonstrate that integrating multiple techniques yields complementary gains that are difficult to attain with any single technique, providing an efficient and reliable solution for detecting micro targets in complex satellite remote sensing scenes.
In the comparative experiments, this study examined several advanced object detection models. YOLOv3 performs well in recall and mAP and provides comprehensive target coverage, but its lower precision and high computational cost make it less suitable for high-efficiency processing. In contrast, YOLOv5n and YOLOv8n keep computational complexity low while offering balanced precision and recall, making them well suited to resource-constrained environments. YOLOv6 and YOLOv7-tiny show moderate performance: YOLOv6 offers a reliable balance between performance and resource use, whereas YOLOv7-tiny achieves high recall but lower precision and mAP, which can lead to more false detections. YOLOv8s demonstrates strong overall detection capability with a high mAP, but at a higher computational cost. Classical detectors such as Faster-RCNN [70] and FCOS [71] were also included in the comparison; they performed poorly and are not discussed further.
In contrast, the proposed NanoSight–YOLO model exhibited superior precision, recall, and mAP performance. Despite its relatively higher computational cost, its exceptional detection performance makes it highly suitable for detecting micro-vehicle targets in satellite remote sensing imagery, particularly in applications demanding high accuracy and reliability.
Visual comparative experiments were conducted against the aforementioned advanced models to more intuitively demonstrate the NanoSight–YOLO model’s practical application effectiveness. The visualization results are shown in Figure 7.
Images depicting real-world environments were selected from the dataset to comprehensively evaluate the model’s performance in practical application scenarios. The detection results generated by the model were then superimposed onto these scenes as heatmaps, which visualize the spatial distribution of traffic density within specific regions and support a more detailed analysis of the model’s applicability across various traffic conditions. Comparing the heatmap predictions with the actual scenes allows a more precise assessment of the model’s accuracy and reliability in capturing dynamic traffic patterns, offering further insight into its effectiveness and robustness in complex traffic environments. The corresponding visualizations are displayed in Figure 8.
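A minimal sketch of how such a density heatmap can be produced from the model’s detections is shown below: vehicle box centres are accumulated on a coarse grid and smoothed into a continuous density surface that can be blended with the original image. The cell size, smoothing width, and helper name are illustrative choices, not the visualization pipeline actually used for Figure 8.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def traffic_density_heatmap(boxes_xyxy: np.ndarray, img_h: int, img_w: int,
                            cell: int = 16, sigma: float = 2.0) -> np.ndarray:
    """Accumulate detected vehicle centres on a coarse grid and smooth the counts
    into a normalised density surface suitable for overlaying on the image."""
    grid = np.zeros((img_h // cell, img_w // cell), dtype=np.float32)
    cx = ((boxes_xyxy[:, 0] + boxes_xyxy[:, 2]) / 2 / cell).astype(int).clip(0, grid.shape[1] - 1)
    cy = ((boxes_xyxy[:, 1] + boxes_xyxy[:, 3]) / 2 / cell).astype(int).clip(0, grid.shape[0] - 1)
    np.add.at(grid, (cy, cx), 1.0)                 # one count per detected vehicle
    heat = gaussian_filter(grid, sigma=sigma)      # smooth counts into a density map
    return heat / (heat.max() + 1e-7)              # normalise to [0, 1] for blending

# The normalised map can then be resized to the image size and blended with it,
# e.g. via matplotlib's imshow(..., alpha=0.5) or cv2.addWeighted.
```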

5.3. Adaptability and Generalization Verification Experiments

The DOTA dataset possesses characteristics significantly different from those of the VISO dataset. Although both are satellite remote sensing datasets, they differ considerably in acquisition methods, background complexity, and target sizes. This distinction provides an ideal platform to evaluate whether a model has been over-optimized for specific data or can adapt and perform well in diverse data environments. Testing on the DOTA dataset allows for a comprehensive assessment of the model’s robustness, ensuring that improvements are practical and broadly applicable. The specific verification experiment results are shown in Table 11.
Table 11 shows that compared to previous work, the proposed model in this paper has a slightly lower P than the FCOS model. However, it significantly outperforms existing advanced models in R and mAP. This validates the superior adaptability and generalization capability of the proposed NanoSight–YOLO model, making it suitable for detecting micro-vehicle targets in satellite remote sensing imagery.

6. Conclusions

Detecting micro-vehicle targets in satellite remote sensing imagery holds significant practical importance for intelligent transportation, aiding in the monitoring of road traffic status. This study addresses the low accuracy of micro-vehicle detection against the complex backgrounds of satellite remote sensing imagery. By analyzing the characteristics of such imagery and the challenges of detecting micro-vehicle targets, this paper proposes a method for monitoring road traffic operational status based on the YOLOv8 object detection model. The resulting NanoSight–YOLO model improves detection performance for micro-vehicle targets and strengthens adaptability and generalization in complex remote sensing environments, as demonstrated in the following aspects:
  • The adoption of FasterNet as the new backbone feature extraction network marks a substantial improvement over the traditional CSPDarknet53 network. FasterNet reduces the computational load while maintaining high sensitivity to crucial spatial features, enabling more effective feature extraction in complex satellite imagery; a minimal sketch of its core partial convolution (PConv) is given after this list. These enhancements allow rapid and precise localization and identification of micro-vehicle targets, elevating the model’s overall detection performance and adaptability. Additionally, tailored architectural adjustments focus on micro targets, with enhanced utilization of low-level feature layers to capture and fuse detailed information more effectively.
  • The introduction of the dynamic head and the Wise-IoU and Focaler-IoU loss functions represents a pivotal development in our approach. The dynamic head enhances feature representation while maintaining computational efficiency, addressing the parameter expansion of the original model’s detection head. Simultaneously, the Wise-IoU loss function alleviates the gradient vanishing issue, and the Focaler-IoU loss function refines the IoU calculation to prioritize difficult-to-detect micro targets. These improvements substantially increase the model’s accuracy, reliability, and generalization capability in complex remote sensing environments.
  • Integrating the bi-level routing attention (BiLRA) mechanism has significantly bolstered the model’s capability to recognize and precisely locate micro-vehicle targets, effectively reducing the occurrences of missed and false detections. This content-aware dynamic sparse attention mechanism enhances the model’s adaptability and performance, optimizing the analysis and recognition processes in diverse and intricate satellite imagery backgrounds.
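As referenced in the first bullet above, the snippet below gives a minimal PConv sketch: a standard 3×3 convolution is applied to only a fraction of the channels while the remaining channels pass through untouched, which is the source of FasterNet’s reduced computation and memory access. The 1/4 channel ratio is the commonly quoted default and is assumed here, not taken from the paper’s configuration.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Sketch of FasterNet's partial convolution (PConv)."""

    def __init__(self, dim: int, ratio: float = 0.25):
        super().__init__()
        self.dim_conv = int(dim * ratio)               # channels that are convolved
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # convolve a slice of the channels, pass the rest through unchanged
        x1, x2 = torch.split(x, [self.dim_conv, x.size(1) - self.dim_conv], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# In a FasterNet block, PConv is typically followed by two pointwise (1x1)
# convolutions that mix information across all channels.
```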
Compared to the YOLOv8 model, the NanoSight–YOLO model significantly improves satellite remote sensing data detection performance, with precision increasing by 12.4%, recall by 11.5%, and mAP by 11.5%. Detailed ablation experiments have demonstrated the significance of each improvement, and comparative experiments have further validated its cutting-edge performance. Comparative experiments on the pre-processed DOTA dataset showed a 3.6% increase in precision, a 6.5% increase in recall, and a 4.3% increase in mAP, indicating good model robustness. Despite its strong performance, the unique nature of satellite remote sensing imagery and the demand for high-performance target detection technology present challenges in processing this type of data.
The road traffic monitoring method proposed in this study has been implemented for traffic monitoring during emergencies, such as natural disasters in remote mountainous areas. With the further advancement of satellite technology, the revisit cycle has been reduced to the minute level, making this method effectively applicable to routine traffic emergency management, accident response, and monitoring road traffic conditions during special periods such as holidays and large-scale events, as well as for long-term traffic analysis and planning. As commercial satellites continue to develop and become more widely used, data acquisition costs are expected to decrease significantly. Consequently, the practicality of this method will increase, rendering it more economically viable and demonstrating broad potential for application and academic value.
Future research will focus on further developing and refining the NanoSight–YOLO model and integrating it with advanced deep learning-based object-tracking technology to enhance the performance and efficiency of road traffic operation monitoring. This integration will establish a road traffic operation monitoring system, enabling the more accurate acquisition of traffic flow parameters and more effective monitoring of road traffic conditions, which is crucial for traffic management and planning.

Author Contributions

D.G. led the conceptual design and methodology development of the study; C.Z. was engaged in the development and validation of software, as well as the systematic collection and analysis of data; H.S. participated in data collation and visualization; C.Z. wrote the first draft and reviewed and edited the paper with D.G. and J.Z.; X.Z. was responsible for project management and supervision, as well as the acquisition of funds. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Autonomous Region Key Research and Development Program Project (Grant No.2022B01015), the Technology research and development project of the Xinjiang Communications Investment (Group) Co., Ltd. (Grant No.XJJTZKX-FWCG-202401-0045), and the Ganquanbao Economic Development Zone Science and Technology Program Project (Grant No.GKJ2023XTWL04).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

Author Jinquan Zhang was employed by Xinjiang Hualing Logistics Co. and author Xiaojiang Zhang was employed by Xinjiang Xinte Energy Logistics Co. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Boukerche, A.; Tao, Y.; Sun, P. Artificial Intelligence-Based Vehicular Traffic Flow Prediction Methods for Supporting Intelligent Transportation Systems. Comput. Netw. 2020, 182, 107484. [Google Scholar] [CrossRef]
  2. Song, Y.; Hong, S.; Hu, C.; He, P.; Tao, L.; Tie, Z.; Ding, C. MEB-YOLO: An Efficient Vehicle Detection Method in Complex Traffic Road Scenes. Comput. Mater. Contin. 2023, 75, 5761–5784. [Google Scholar] [CrossRef]
  3. Ghahremannezhad, H.; Shi, H.; Liu, C. Object Detection in Traffic Videos: A Survey. IEEE Trans. Intell. Transport. Syst. 2023, 24, 6780–6799. [Google Scholar] [CrossRef]
  4. Ma, Z.; Wu, X.; Chu, A.; Huang, L.; Wei, Z. SwinFG: A Fine-Grained Recognition Scheme Based on Swin Transformer. Expert Syst. Appl. 2024, 244, 123021. [Google Scholar] [CrossRef]
  5. Ardianto, S.; Hang, H.-M.; Cheng, W.-H. Fast Vehicle Detection and Tracking on Fisheye Traffic Monitoring Video Using Motion Trail. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21 May 2023; pp. 1–5. [Google Scholar]
  6. Mehrannia, P.; Bagi, S.S.G.; Moshiri, B.; Al-Basir, O.A. Deep Representation of Imbalanced Spatio-temporal Traffic Flow Data for Traffic Accident Detection. IET Intell. Trans. Sys. 2023, 17, 606–619. [Google Scholar] [CrossRef]
  7. Peixoto, M.L.M.; Mota, E.; Maia, A.H.O.; Lobato, W.; Salahuddin, M.A.; Boutaba, R.; Villas, L.A. FogJam: A Fog Service for Detecting Traffic Congestion in a Continuous Data Stream VANET. Ad. Hoc. Netw. 2023, 140, 103046. [Google Scholar] [CrossRef]
  8. Madhavi, G.B.; Bhavani, A.D.; Reddy, Y.S.; Kiran, A.; Chitra, N.T.; Reddy, P.C.S. Traffic Congestion Detection from Surveillance Videos Using Deep Learning. In Proceedings of the 2023 International Conference on Computer, Electronics & Electrical Engineering & their Applications (IC2E3), Srinagar Garhwal, India, 8 June 2023; pp. 1–5. [Google Scholar]
  9. Dai, Z.; Song, H.; Liang, H.; Wu, F.; Wang, X.; Jia, J.; Fang, Y. Traffic Parameter Estimation and Control System Based on Machine Vision. J. Ambient. Intell. Hum. Comput. 2023, 14, 15287–15299. [Google Scholar] [CrossRef]
  10. Liu, H.; Wang, H. Real-Time Anomaly Detection of Network Traffic Based on CNN. Symmetry 2023, 15, 1205. [Google Scholar] [CrossRef]
  11. Duan, X.; Fu, Y.; Wang, K. Network Traffic Anomaly Detection Method Based on Multi-Scale Residual Classifier. Comput. Commun. 2023, 198, 206–216. [Google Scholar] [CrossRef]
  12. Gao, P.; Tian, T.; Li, L.; Ma, J.; Tian, J. DE-CycleGAN: An Object Enhancement Network for Weak Vehicle Detection in Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3403–3414. [Google Scholar] [CrossRef]
  13. Kong, L.; Yan, Z.; Zhang, Y.; Diao, W.; Zhu, Z.; Wang, L. CFTracker: Multi-Object Tracking With Cross-Frame Connections in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  14. Feng, J.; Zeng, D.; Jia, X.; Zhang, X.; Li, J.; Liang, Y.; Jiao, L. Cross-Frame Keypoint-Based and Spatial Motion Information-Guided Networks for Moving Vehicle Detection and Tracking in Satellite Videos. ISPRS J. Photogramm. Remote Sens. 2021, 177, 116–130. [Google Scholar] [CrossRef]
  15. Yin, Z.; Tang, Y.; Zou, B.; Feng, H. Dynamic Vehicle Detection in Satellite Video With Multiframe Brightness Gradient. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  16. Shu, M.; Zhong, Y.; Lv, P. Small Moving Vehicle Detection via Local Enhancement Fusion for Satellite Video. Int. J. Remote Sens. 2021, 42, 7189–7214. [Google Scholar] [CrossRef]
  17. Gagliardi, V.; Tosti, F.; Bianchini Ciampoli, L.; Battagliere, M.L.; D’Amato, L.; Alani, A.M.; Benedetto, A. Satellite Remote Sensing and Non-Destructive Testing Methods for Transport Infrastructure Monitoring: Advances, Challenges and Perspectives. Remote Sens. 2023, 15, 418. [Google Scholar] [CrossRef]
  18. Yin, X.; Wu, G.; Wei, J.; Shen, Y.; Qi, H.; Yin, B. Deep Learning on Traffic Prediction: Methods, Analysis, and Future Directions. IEEE Trans. Intell. Transport. Syst. 2022, 23, 4927–4943. [Google Scholar] [CrossRef]
  19. Sheehan, A.; Beddows, A.; Green, D.C.; Beevers, S. City Scale Traffic Monitoring Using WorldView Satellite Imagery and Deep Learning: A Case Study of Barcelona. Remote Sens. 2023, 15, 5709. [Google Scholar] [CrossRef]
  20. Asokan, A.; Anitha, J. Machine Learning Based Image Processing Techniques for Satellite Image Analysis—A Survey. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp. 119–124. [Google Scholar]
  21. Singh, S.; Kundra, H.; Kundra, S.; Pratima, P.V.; Devi, M.V.A.; Kumar, S.; Hassan, M. Optimal Trained Ensemble of Classification Model for Satellite Image Classification. Multimed. Tools Appl. 2024, 2014, 1–22. [Google Scholar] [CrossRef]
  22. Kaur, A.; Singla, G.; Singh, M.; Mittal, A.; Mittal, R.; Malik, V. Cotton Crop Classification Using Satellite Images with Score Level Fusion Based Hybrid Model. Pattern Anal. Applic 2024, 27, 43. [Google Scholar] [CrossRef]
  23. Kuchkorov, T.; Urmanov, S.; Kuvvatova, M.; Anvarov, I. Satellite Image Formation and Preprocessing Methods. In Proceedings of the 2020 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan, 4 November 2020; pp. 1–4. [Google Scholar]
  24. Amiri, M.; Soleimani, S. A Hybrid Atmospheric Satellite Image-Processing Method for Dust and Horizontal Visibility Detection through Feature Extraction and Machine Learning Techniques. J. Indian. Soc. Remote Sens. 2022, 50, 523–532. [Google Scholar] [CrossRef]
  25. Pech-May, F.; Aquino-Santos, R.; Álvarez-Cárdenas, O.; Arandia, J.L.; Rios-Toledo, G. Segmentation and Visualization of Flooded Areas Through Sentinel-1 Images and U-Net. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8996–9008. [Google Scholar] [CrossRef]
  26. Jin, W.; Cai, Z.; Pan, Y.; Fu, R. ICIHRN: An Interpretable Multilabel Hash Retrieval Method for Satellite Cloud Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8662–8672. [Google Scholar] [CrossRef]
  27. Mahdipour, H.; Sharifi, A.; Sookhak, M.; Medrano, C.R. Ultrafusion: Optimal Fuzzy Fusion in Land-Cover Segmentation Using Multiple Panchromatic Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5721–5733. [Google Scholar] [CrossRef]
  28. Lim, S.L.; Sreevalsan-Nair, J.; Daya Sagar, B.S. Multispectral Data Mining: A Focus on Remote Sensing Satellite Images. WIREs Data Min. Knowl. 2024, 14, e1522. [Google Scholar] [CrossRef]
  29. Asokan, A.; Anitha, J.; Ciobanu, M.; Gabor, A.; Naaji, A.; Hemanth, D.J. Image Processing Techniques for Analysis of Satellite Images for Historical Maps Classification—An Overview. Appl. Sci. 2020, 10, 4207. [Google Scholar] [CrossRef]
  30. Kim, T.; Han, Y. Integrated Preprocessing of Multitemporal Very-High-Resolution Satellite Images via Conjugate Points-Based Pseudo-Invariant Feature Extraction. Remote Sens. 2021, 13, 3990. [Google Scholar] [CrossRef]
  31. Vinuja, G.; Devi, N.B. Multitemporal Hyperspectral Satellite Image Analysis and Classification Using Fast Scale Invariant Feature Transform and Deep Learning Neural Network Classifier. Earth Sci Inf. 2023, 16, 877–886. [Google Scholar] [CrossRef]
  32. Zhu, H.; Lv, Y.; Meng, J.; Liu, Y.; Hu, L.; Yao, J.; Lu, X. Vehicle Detection in Multisource Remote Sensing Images Based on Edge-Preserving Super-Resolution Reconstruction. Remote Sens. 2023, 15, 4281. [Google Scholar] [CrossRef]
  33. Kaur, R.; Singh, S. A Comprehensive Review of Object Detection with Deep Learning. Digit. Signal Process. 2023, 132, 103812. [Google Scholar] [CrossRef]
  34. Zhou, X.; Lin, G. Review of Target Detection Algorithms. FCIS 2023, 4, 17–19. [Google Scholar] [CrossRef]
  35. Hussain, M. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  36. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  37. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  38. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  39. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 1–20. [Google Scholar] [CrossRef] [PubMed]
  40. Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A Comprehensive Survey of Oriented Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 224, 119960. [Google Scholar] [CrossRef]
  41. Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A Systematic Review and Analysis of Deep Learning-Based Underwater Object Detection. Neurocomputing 2023, 527, 204–232. [Google Scholar] [CrossRef]
  42. Amjoud, A.B.; Amrouch, M. Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  43. He, Z.; Zhang, L.; Gao, X.; Zhang, D. Multi-Adversarial Faster-RCNN with Paradigm Teacher for Unrestricted Object Detection. Int. J. Comput. Vis. 2023, 131, 680–700. [Google Scholar] [CrossRef]
  44. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  45. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  46. Liu, Y.; Ren, H.; Zhang, Z.; Men, F.; Zhang, P.; Wu, D.; Feng, R. Research on Multi-Cluster Green Persimmon Detection Method Based on Improved Faster RCNN. Front. Plant Sci. 2023, 14, 1177114. [Google Scholar] [CrossRef]
  47. Zhao, H.; Zhang, H.; Zhao, Y. YOLOv7-Sea: Object Detection of Maritime UAV Images Based on Improved YOLOv7. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 2–7 January 2023; pp. 233–238. [Google Scholar]
  48. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv 2024, arXiv:2401.17270. [Google Scholar]
  49. Cheng, X.; Huo, Y.; Lin, S.; Dong, Y.; Zhao, S.; Zhang, M.; Wang, H. Deep Feature Aggregation Network for Hyperspectral Anomaly Detection. IEEE Trans. Instrum. Meas. 2024. [Google Scholar] [CrossRef]
  50. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  51. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 7369–7378. [Google Scholar]
  52. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  53. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  54. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  55. Yin, Q.; Hu, Q.; Liu, H.; Zhang, F.; Wang, Y.; Lin, Z.; An, W.; Guo, Y. Detecting and Tracking Small and Dense Moving Objects in Satellite Videos: A Benchmark. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  56. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  57. Zhao, C.; Guo, D.; Shao, C.; Zhao, K.; Sun, M.; Shuai, H. SatDetX-YOLO: A More Accurate Method for Vehicle Target Detection in Satellite Remote Sensing Imagery. IEEE Access 2024, 12, 46024–46041. [Google Scholar] [CrossRef]
  58. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4—Universal Models for the Mobile Ecosystem. arXiv 2024. [Google Scholar]
  59. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 16748–16759. [Google Scholar]
  60. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703. [Google Scholar]
  61. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  62. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C. Channel Prior Convolutional Attention for Medical Image Segmentation. Comput. Biol. Med. 2024, 178, 108784. [Google Scholar] [CrossRef] [PubMed]
  63. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed Local Channel Attention for Object Detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  64. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International conference on machine learning, Virtual, 8–24 July 2021; pp. 11863–11874. [Google Scholar]
  65. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  66. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  67. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  68. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  69. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  70. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  71. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
Figure 1. Structure of YOLOv8 model.
Figure 2. Structure of NanoSight–YOLO model.
Figure 3. Working principle of PConv and structure of FasterNet.
Figure 4. Dynamic head information flow structure.
Figure 5. Essential components and calculation process of bi-level routing attention.
Figure 6. The minimal bounding box (depicted in red) and the center connection line (depicted in blue). This diagram visually explains the computational elements and concepts involved in the Wise-IoU loss function, showing how to evaluate the intersection and union between the anchor box and the target box and calculate the IoU value using this geometric information.
Figure 7. Comparative detection effectiveness of advanced models.
Figure 8. Comparison results of some actual scenarios combined with their heatmaps.
Table 1. Experimental conditions.
Experimental Environment | Details
Operating system | Ubuntu 20.04
CPU | Intel Xeon Platinum 8468 Processor, 64 GB × 32 RAM
GPU | NVIDIA L20 (48 GB) × 4
PyTorch | Torch-2.3.1
GPU acceleration CUDA | CUDA 12.1
Programming language | Python 3.10
Table 2. Training parameters.
Training Parameters | Details
Optimizer | SGD
Imagery size | 1280 × 1280
Epochs | 300
Batch size | 32
Workers | 8
Momentum | 0.937
Decay | 5 × 10⁻⁴
Close mosaic | 10
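As an illustration of how the parameters in Table 2 map onto a training run, the snippet below launches training through the Ultralytics API; the model and dataset YAML file names are placeholders, not the authors’ actual configuration files.

```python
from ultralytics import YOLO

# Illustrative training launch mirroring Table 2; file names are hypothetical.
model = YOLO("nanosight-yolo.yaml")   # placeholder model definition
model.train(
    data="viso-vehicle.yaml",         # placeholder dataset config
    epochs=300,
    imgsz=1280,
    batch=32,
    workers=8,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=5e-4,
    close_mosaic=10,
)
```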
Table 3. Comparative experimental results of backbone feature extraction networks.
Base Model | Backbone | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | CSPDarknet | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | MobileNetV4 [58] | Vehicle | 62.0 | 65.3 | 65.4
YOLOv8n | LSKNet [59] | Vehicle | 61.1 | 64.6 | 64.3
YOLOv8n | StarNet [60] | Vehicle | 60.9 | 65.9 | 65.9
YOLOv8n | FasterNet | Vehicle | 67.1 | 68.0 | 70.7
Table 4. Comparative experimental results of detection heads.
Base Model | Decoupling Head | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | – | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | AUX [61] | Vehicle | 62.6 | 66.7 | 67.3
YOLOv8n | TBD [57] | Vehicle | 63.2 | 61.7 | 62.8
YOLOv8n | Dynamic | Vehicle | 65.5 | 68.8 | 69.7
Table 5. Comparative experimental results of adding micro-object detection layers.
Base Model | Layer | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | – | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | Micro-target detection layer | Vehicle | 63.3 | 73.6 | 73.4
Table 6. Comparative experimental results of attention mechanisms.
Base Model | Attention Mechanism | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | – | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | CPCA [62] | Vehicle | 65.5 | 67.6 | 70.9
YOLOv8n | MLCA [63] | Vehicle | 60.5 | 68.9 | 66.0
YOLOv8n | SimAM [64] | Vehicle | 66.6 | 68.3 | 70.0
YOLOv8n | BiLRA | Vehicle | 65.6 | 68.6 | 71.7
Table 7. Comparative experimental results of loss functions.
Base Model | Loss | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | CIoU | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | EIoU [65] | Vehicle | 61.3 | 67.3 | 66.4
YOLOv8n | SIoU [66] | Vehicle | 63.7 | 66.8 | 67.7
YOLOv8n | GIoU [67] | Vehicle | 62.0 | 67.1 | 66.6
YOLOv8n | Wise-IoU | Vehicle | 65.8 | 66.7 | 69.1
Table 8. Comparative experimental results of bounding box IoUs.
Base Model | BBIoU | Class | P/% | R/% | mAP@0.5/%
YOLOv8n | – | Vehicle | 64.7 | 67.2 | 68.6
YOLOv8n | MPD-IoU [68] | Vehicle | 61.3 | 67.5 | 66.5
YOLOv8n | Inner-IoU [69] | Vehicle | 60.9 | 67.1 | 66.3
YOLOv8n | Focaler-IoU | Vehicle | 64.4 | 68.1 | 69.5
Table 9. Results of ablation experiment.
Group | FasterNet | Dynamic | Micro | BiLRA | WIoU | Focaler | Class | P/% | R/% | mAP@0.5/%
① | – | – | – | – | – | – | Vehicle | 64.7 | 67.2 | 68.6
② | ✓ | – | – | – | – | – | Vehicle | 67.1 | 68.0 | 70.7
③ | – | ✓ | – | – | – | – | Vehicle | 65.5 | 68.8 | 69.7
④ | – | – | ✓ | – | – | – | Vehicle | 63.3 | 73.6 | 73.4
⑤ | – | – | – | ✓ | – | – | Vehicle | 65.6 | 68.6 | 71.7
⑥ | – | – | – | – | ✓ | – | Vehicle | 65.8 | 66.7 | 69.1
⑦ | – | – | – | – | – | ✓ | Vehicle | 64.4 | 68.1 | 69.5
⑧ | ✓ | ✓ | – | – | – | – | Vehicle | 68.8 | 69.0 | 73.0
⑨ | ✓ | ✓ | ✓ | – | – | – | Vehicle | 71.1 | 72.2 | 76.2
⑩ | ✓ | ✓ | ✓ | ✓ | – | – | Vehicle | 73.7 | 73.8 | 77.8
⑪ | ✓ | ✓ | ✓ | ✓ | ✓ | – | Vehicle | 78.9 | 73.3 | 79.2
⑫ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Vehicle | 77.1 | 78.7 | 80.1
In Table 9, a “✓” indicates that the corresponding improvement is added to the baseline model, while “–” indicates its absence.
Table 10. Results of comparison experiments.
Method | Class | P/% | R/% | mAP@0.5/% | GFLOPs/G
YOLOv3 | Vehicle | 63.6 | 71.4 | 70.5 | 282.2
YOLOv5n | Vehicle | 64.6 | 66.9 | 68.2 | 7.1
YOLOv6 | Vehicle | 62.5 | 67.7 | 66.7 | 11.8
YOLOv7-tiny | Vehicle | 62.9 | 70.4 | 68.4 | 13.0
YOLOv8n | Vehicle | 64.7 | 67.2 | 68.6 | 8.1
YOLOv8s | Vehicle | 66.1 | 69.1 | 70.2 | 28.4
NanoSight–YOLO | Vehicle | 77.1 | 78.7 | 80.1 | 48.6
Table 11. Verification experimental results.
Method | Class | P/% | R/% | mAP@0.5/% | GFLOPs/G
YOLOv3-tiny | Vehicle | 78.7 | 68.4 | 75.0 | 18.9
YOLOv5n | Vehicle | 80.7 | 76.1 | 82.5 | 7.1
YOLOv6 | Vehicle | 80.5 | 75.3 | 81.3 | 11.8
YOLOv7-tiny | Vehicle | 81.5 | 74.0 | 80.8 | 13.0
YOLOv8n | Vehicle | 80.9 | 76.1 | 82.8 | 8.1
Faster-RCNN | Vehicle | 51.8 | 55.2 | 49.8 | 939.56
FCOS | Vehicle | 84.7 | 68.0 | 80.6 | 161.56
YOLOv8s | Vehicle | 82.2 | 80.1 | 84.8 | 28.4
NanoSight–YOLO | Vehicle | 84.5 | 82.6 | 87.1 | 48.6
