Article

An Asymmetric Selective Kernel Network for Drone-Based Vehicle Detection to Build a High-Accuracy Vehicle Trajectory Dataset

School of Automotive Studies, Tongji University, Shanghai 201804, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 407; https://doi.org/10.3390/rs17030407
Submission received: 1 December 2024 / Revised: 20 January 2025 / Accepted: 21 January 2025 / Published: 24 January 2025
(This article belongs to the Section AI Remote Sensing)

Abstract

To improve the detection accuracy of drone-based oriented vehicle object detection networks and establish high-accuracy vehicle trajectory datasets, we present the freeway on-ramp vehicle (FRVehicle) detection dataset, which provides oriented bounding box annotations for vehicles in drone videos of freeway on-ramp scenes. Based on this dataset, we analyzed the dimension and angle distribution patterns of the oriented bounding boxes of road vehicle objects and designed an Asymmetric Selective Kernel Network. This algorithm dynamically adjusts the receptive field of the backbone network's feature extraction to accommodate the detection requirements of vehicles of different sizes. Additionally, we estimate vehicle heights from the high-precision object detection results, further enhancing the accuracy of the vehicle trajectories. Comparative experiments demonstrate that, in most scenarios, the proposed Asymmetric Selective Kernel Network achieves varying degrees of improvement in detection accuracy over the symmetric selective kernel network on both the FRVehicle and DroneVehicle datasets, validating the effectiveness of the method.

1. Introduction

Drones have been increasingly deployed across various sectors, including agricultural production, environmental monitoring, security patrol, and traffic monitoring, owing to their deployment flexibility, wide aerial field of view, and low operational costs. Concurrently, current autonomous driving systems urgently require large volumes of high-quality trajectory data from real traffic scenarios for training and validating trajectory prediction, decision-making, and simulation algorithms. Therefore, utilizing drones to collect traffic data to support autonomous driving technology has emerged as a prominent research focus [1,2]. While onboard and roadside sensors can also collect real-world traffic information, drones offer superior advantages with their unobstructed aerial perspective and covert observation capabilities, facilitating the extraction of more complete and naturalistic trajectory information from drone videos [3,4]. Numerous natural traffic scenario datasets based on drone videos have been established using these advantages, including the series of traffic participant trajectory datasets from various German traffic scenarios developed by RWTH Aachen University, including highD, exiD, inD, rounD, and uniD [5,6,7]; the CitySim dataset [8] created by the University of Central Florida; the INTERACTION dataset [9] established by the University of California, Berkeley; the SIND dataset [10] for intersection scenario trajectories built by Tsinghua University; etc.
However, with the advancement of deep learning technology, current trajectory descriptions for vulnerable road users such as pedestrians, cyclists, and motorcyclists typically require additional information, including body orientation and head movement [11]. These crucial features are challenging to extract from the drone aerial view. In contrast, comparatively complete trajectory information of vehicular traffic participants, due to their rigid vehicle bodies with larger footprints, is easier to obtain from the drone’s top-down view, with only vehicle signal light information being lost. Therefore, this research primarily focuses on designing a high-precision vehicle object detection network to extract vehicle trajectory data from drone videos.
In the field of drone-based vehicle detection, vehicles typically appear as rectangular projections in drone aerial views, particularly in freeway scenarios, where large trucks making sharp turns (commonly seen at intersections) are rare, as shown in Figure 1. The oriented bounding box (OBB) annotation is widely adopted to capture vehicle heading information, reduce redundant regions, and prevent overlapping detection boxes in dense scenes. The fundamental architectures of OBB object detection algorithms and horizontal bounding box (HBB) object detection algorithms share similarities in their backbone and neck designs, with improvements in feature extraction and fusion from HBB detection algorithms generally enhancing OBB detection accuracy. Noteworthy implementations of convolutional neural networks (CNNs) include RTMDet's use of large-kernel depthwise convolutions in both the backbone and neck modules to expand receptive fields [12]; LSKNet's implementation of the Large Selective Kernel Network for dynamic receptive field adjustment, addressing varying contextual information requirements [13,14]; PKINet's use of multiple parallel kernels of varying sizes without dilation to extract dense, multi-scale features, effectively capturing local context [15]; and YOLO11's introduction of C3K2 blocks in the backbone along with SPPF and C2PSA in the neck to enhance feature extraction while improving computational efficiency [16]. Specialized backbone structures for rotated feature extraction have also emerged, such as adaptive rotated convolution with adaptive rotational convolution kernels [17], ReDet's rotation-equivariant ResNet [18], and ML-Det's mixed separable convolutions with 5 × 1, 5 × 5, and 1 × 5 arrangements to capture parallel and vertical image features while reducing parameters [19]. In addition, Transformer-based architectures like PVT, HiViT, and ViTAE have also been successfully applied to rotated object detection [20,21,22,23]. A representative example is the rotated varied-size window attention algorithm, which is specifically optimized for rotated object detection based on the plain Vision Transformer architecture [24]; it enhances object representation through diversified window contexts while reducing computational costs and memory consumption. Finally, hybrid architectures combining Transformers and CNNs, exemplified by DETR [25], have evolved into specialized networks such as AO2-DETR [26] and DETR-ORD [27], which are specifically optimized for oriented remote sensing object detection. Transformers traditionally possess larger effective receptive fields than CNNs, benefiting large object detection and contour regression. However, with the emergence of separable large-kernel network structures represented by RepLKNet [28], CNNs now achieve comparably large effective receptive fields, resulting in similar detection precision between Transformer-based and CNN-based models on major remote sensing datasets.
The distinctive characteristic of OBB detection networks compared to HBB detection networks lies in their detection heads, which incorporate an additional angle dimension in the prediction output. Most innovations in rotated object detection algorithms focus on designing detection heads and associated loss functions. The periodic nature of angles introduces a numerical discontinuity at periodic transition points. This discontinuity causes significant fluctuations in loss values when angle value differences are used directly as loss metrics. To address abrupt changes in angle-related loss values, various approaches have been developed to improve regression loss functions based on this angular characteristic. These methods include transforming rotated box boundaries into 2D Gaussian distributions and then utilizing metrics such as Kullback–Leibler Divergence [29] or the Gaussian Wasserstein Distance [30,31,32] to measure the bounding box regression loss; implementing Phase-Shifting Coder to map rotation periods into different frequency phases for continuous angle prediction [33]; employing a COBB algorithm that uses continuous functions of HBB and OBB areas to indirectly express changes in target angles and aspect ratios [34]; and adopting a projecting-points-to-axes approach that converts the OBB to a point–axis representation with specifically designed loss functions [35]. Additionally, numerous works have enhanced regression accuracy through improved detection head designs for bounding box angle regression. Notable examples include Oriented RCNN’s two-stage detection framework [36] that generates high-quality oriented proposals before fine regression; rotated RoI Transformer’s implementation of a rotated-position-sensitive RoI alignment module for improved HRoI feature extraction [37]; R3Det’s progressive regression using horizontal anchor boxes, followed by refined rotated boxes for dense scenarios, incorporating a feature refinement module for feature alignment [38]; S2A-Net’s introduction of AlignConv for feature adaptive alignment and active rotating filters for directional information encoding [39]; and oriented RepPoints’ utilization of adaptive point sets to capture the geometric and spatial information of arbitrarily oriented objects [40]. Furthermore, integration with Vision Transformers has led to innovations like the spatial transform decoupling module, which implements a multi-branch network design for decoupled parameter prediction in oriented object detection, along with cascaded activation masks for enhanced feature representation within regions of interest [41].
While numerous detection methods exist today, they primarily focus on mainstream multi-object classification in remote sensing datasets such as DOTA-v1.0 [42,43], FAIR1M [44], and SODA-A [45]. These methods typically achieve relatively low detection box regression accuracy, using an intersection over union (IoU) threshold of 0.5 as the criterion for correct detection. There is a notable lack of specialized remote sensing datasets for extracting high-accuracy vehicle trajectory data information. Additionally, optimized high-precision vehicle object detection networks are scarce. To address these two challenges, this paper presents the following innovative contributions:
  • We present an open-source vehicle detection dataset of imagery from a drone’s top-down view aimed at extracting high-accuracy vehicle trajectory data, addressing the current gap in such publicly available datasets.
  • We designed an Asymmetric Selective Kernel Network that enhances feature extraction along the vehicle’s longitudinal edges based on the distribution patterns of OBBs. Additionally, we modified the current vehicle detection dataset’s annotation method to single-label annotation, thereby improving the regression precision of vehicle detection boxes.
  • We devised a method for vehicle height estimation based on high-precision vehicle detection results, further enhancing the accuracy of vehicle trajectory data.

2. Materials and Methods

2.1. FRVehicle Dataset

In this study, drone videos were collected at a freeway on-ramp in Shanghai, China. Data collection occurred between 2:00 p.m. and 4:00 p.m. during October and November under cloudy weather conditions. The drone maintained a fixed altitude of 120 m throughout the recording. Vehicles in the scene were primarily categorized into three classes: car, bus, and truck. The dataset was created by extracting frames at fixed intervals from the video footage, resulting in 7534 images covering the two scenes shown in Figure 2 and Figure 3, with image sizes ranging from 3540 × 460 to 3550 × 950 pixels. Since the dataset’s primary objective was to extract high-accuracy vehicle trajectory information, only complete vehicle objects were annotated, resulting in a total of 145,252 annotation boxes. To ensure data independence between different sets in the train–validation–test split, instead of random sampling, we chronologically ordered the frames from each video segment and allocated approximately the first 50% of temporal frames to the training set, the middle 20% to the validation set, and the final 30% to the test set. The specific split intervals for each video segment were fine-tuned based on the number of annotation boxes in different sets. The final distribution resulted in 3703 images with 72,689 annotation boxes (approximately 50.0% of total annotations) for the training set, 1462 images with 28,818 annotation boxes (approximately 19.8%) for the validation set, and 2369 images with 43,745 annotation boxes (approximately 30.1%) for the testing set.
Since the FRVehicle dataset focuses on vehicle object detection on freeways, it exhibits the typical characteristics of vehicle objects, where the width distribution is relatively concentrated, predominantly ranging between 40 and 85 pixels, while the length distribution is more dispersed, mainly spanning from 100 to 500 pixels, with length-to-width ratios primarily distributed between 2 and 7. Using the DOTA-v1.0 dataset as a comparison, the distributions of bounding box length, width, length-to-width ratio, and area are shown in Figure 4 and Figure 5. These comparative visualizations demonstrate that the FRVehicle dataset’s bounding boxes generally possess greater lengths, widths, aspect ratios, and overall areas compared to those in the DOTA-v1.0 dataset. While the dataset largely avoids the small-object detection challenges present in the DOTA-v1.0 dataset, it encounters similar high-aspect-ratio-object detection challenges to those observed in the GLH-Bridge dataset [46].
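For readers who want to reproduce statistics of the kind shown in Figures 4 and 5, the following minimal Python sketch computes OBB length, width, and aspect-ratio percentiles from DOTA-style annotation files (eight corner coordinates per line); the directory layout and field order are illustrative assumptions rather than the exact FRVehicle release format.

```python
# Minimal sketch: compute OBB length, width, and aspect-ratio statistics from
# DOTA-style annotation lines ("x1 y1 x2 y2 x3 y3 x4 y4 class difficulty").
# The "annotations/*.txt" path and field order are assumptions for illustration.
import glob
import numpy as np

def obb_dimensions(corners):
    """Return (length, width) of an oriented box given its 4 corners (4x2 array)."""
    e1 = np.linalg.norm(corners[1] - corners[0])  # first edge
    e2 = np.linalg.norm(corners[2] - corners[1])  # adjacent edge
    return max(e1, e2), min(e1, e2)

lengths, widths = [], []
for path in glob.glob("annotations/*.txt"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue
            corners = np.array(fields[:8], dtype=float).reshape(4, 2)
            box_l, box_w = obb_dimensions(corners)
            lengths.append(box_l)
            widths.append(box_w)

lengths, widths = np.array(lengths), np.array(widths)
ratios = lengths / widths
print(f"length p5/p95: {np.percentile(lengths, [5, 95])}")
print(f"width  p5/p95: {np.percentile(widths, [5, 95])}")
print(f"aspect p5/p95: {np.percentile(ratios, [5, 95])}")
```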

2.2. Asymmetric Selective Kernel Network

Analysis of the FRVehicle dataset revealed a consistent pattern: the drone-captured images have a 16:9 aspect ratio, which is close to the minimum 2:1 length-to-width ratio of vehicle objects. To capture as many vehicles as possible in a single frame, roads are predominantly filmed parallel to the image width, so vehicle longitudinal axes typically align with the horizontal direction. Except for specific scenarios like intersections and roundabouts, vehicle heading angles rarely exceed 45° relative to the image width direction. Statistical analysis of the FRVehicle dataset’s annotation box angles confirms this directional bias, as shown in Figure 6. Vehicle features in traffic scenes predominantly manifest along the horizontal rather than vertical direction. This observation inspired an analysis of current rotated object detection algorithms, which primarily employ symmetric convolution kernels—an approach well suited for remote sensing datasets like the DOTA-v1.0 dataset, where object angles are uniformly distributed across the angular period, as shown in Figure 6. While symmetric kernels are justified when object features are evenly distributed along horizontal and vertical axes, detecting vehicles on regular roads presents a different scenario. The distinct characteristics of road-based vehicle detection, particularly on freeways, create an asymmetric feature distribution that is more concentrated along the horizontal axis than the vertical axis. Conventional symmetric convolution kernels may therefore incorporate irrelevant environmental noise through their uniform receptive fields, potentially compromising the quality of vehicle feature extraction. This limitation becomes particularly pronounced when using the large convolution kernels typical in modern CNN architectures. Based on these findings, we developed an asymmetric selective convolution module, incorporating principles from deformable convolution networks [47,48] and selective kernel networks [49] to better address the directional nature of vehicle features.
Considering the distribution of vehicle object bounding box dimensions and angles and aiming to leverage the advantages of large convolution kernels, we designed 3 × 5 asymmetric convolution kernels and 3 × 7 asymmetric dilated convolution kernels in the initial feature extraction stage. The 3 × 5 convolution kernel effectively extracts the differences between local vehicle texture and road background texture features, focusing on identifying small vehicle objects. Meanwhile, the 3 × 7 asymmetric dilated convolution kernel expands only in the horizontal direction without dilation in the vertical direction, significantly broadening the horizontal receptive field to better extract the overall contour features of large vehicle targets. In the attention weight generation phase, we similarly designed a 3 × 7 asymmetric convolution kernel to convert spatial pooling feature maps into spatial attention feature maps. These designed kernel shapes closely approximate the distribution patterns of vehicle objects in images, and compared to symmetric convolution kernels, they can compress the network’s receptive field to focus more on horizontal feature extraction while avoiding interference from adjacent irrelevant vertical features. To implement the asymmetric kernel selection mechanism, this research incorporates the large-kernel selection mechanism from LSKNet. By combining our designed series of asymmetric convolution kernels with this mechanism, we obtained our complete Asymmetric Selective Kernel Network (ASKNet). The specific network architecture is shown in Figure 7.
Taking input data X(D, H, W) with D channels, height H, and width W as an example, the specific data flow is as follows:
  • The input feature map X(D, H, W) undergoes sequential processing through a 3 × 5 initial depthwise separable convolution followed by a 3 × 7 dilated depthwise separable convolution to obtain two spatial feature maps with dimensions (D, H, W).
  • Two 1 × 1 convolutions, $F_1^{1 \times 1}$ and $F_2^{1 \times 1}$, are applied to the two spatial feature maps to halve their channel dimensions, resulting in reduced spatial feature maps $\tilde{U}_1$(D/2, H, W) and $\hat{U}_2$(D/2, H, W), which are then concatenated along the channel dimension to obtain a spatial concatenated feature map $\check{U}$(D, H, W):
    $\check{U} = [\tilde{U}_1; \hat{U}_2]$.
  • Maximum pooling and average pooling operations (denoted by $P_{max}(\cdot)$ and $P_{avg}(\cdot)$) are performed along the channel dimension on the concatenated feature map to obtain two single-channel spatial pooled feature maps, $SA_{max}$ and $SA_{avg}$, of size (1, H, W):
    $SA_{max} = P_{max}(\check{U}), \quad SA_{avg} = P_{avg}(\check{U})$,
    which are then concatenated along the channel dimension to create a spatial pooled concatenated feature map $[SA_{max}; SA_{avg}]$(2, H, W).
  • The spatial pooled concatenated feature map (2, H, W) is processed with a 3 × 7 convolution kernel $F_a^{3 \times 7}$ for attention feature extraction, followed by sigmoid activation $\sigma(\cdot)$, to generate an attention weight feature map $\widetilde{SA}$(2, H, W):
    $\widetilde{SA} = \sigma(F_a^{3 \times 7}([SA_{max}; SA_{avg}]))$.
  • The channel-reduced spatial feature maps $\tilde{U}_1$(D/2, H, W) and $\hat{U}_2$(D/2, H, W) are each multiplied by their corresponding single-channel attention weight map and summed to obtain a channel-reduced weighted attention feature map (D/2, H, W).
  • A 1 × 1 convolution $F_3^{1 \times 1}$ is applied for channel restoration to obtain an attention feature map S(D, H, W) with the same dimensions as the input feature map X(D, H, W):
    $S = F_3^{1 \times 1}(\widetilde{SA}_1 \cdot \tilde{U}_1 + \widetilde{SA}_2 \cdot \hat{U}_2)$,
    enabling element-wise multiplication with X to generate the modulated feature map Y(D, H, W) as the final output:
    $Y = S \cdot X$.
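As a concrete reference for this data flow, the following PyTorch sketch implements the module with the kernel sizes and dilations listed in Table 1. It is an illustrative reimplementation based on the description above (the layer names are ours), not the released code.

```python
# A minimal PyTorch sketch of the ASK module following the data flow above.
# Kernel sizes and dilations match Table 1; layer names are illustrative and
# this is not the released implementation.
import torch
import torch.nn as nn

class ASKModule(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # 3 x 5 depthwise conv: local texture of small vehicle objects.
        self.conv0 = nn.Conv2d(dim, dim, (3, 5), padding=(1, 2), groups=dim)
        # 3 x 7 depthwise conv dilated only in width (dilation 3): wide
        # horizontal receptive field for large vehicle contours.
        self.conv_spatial = nn.Conv2d(dim, dim, (3, 7), padding=(1, 9),
                                      dilation=(1, 3), groups=dim)
        # F1, F2: 1 x 1 convs that halve the channel dimension of each branch.
        self.conv1 = nn.Conv2d(dim, dim // 2, 1)
        self.conv2 = nn.Conv2d(dim, dim // 2, 1)
        # Fa: 3 x 7 conv mapping the 2-channel pooled map to 2 attention maps.
        self.conv_squeeze = nn.Conv2d(2, 2, (3, 7), padding=(1, 3))
        # F3: 1 x 1 conv restoring the original channel count.
        self.conv3 = nn.Conv2d(dim // 2, dim, 1)

    def forward(self, x):                        # x: (B, D, H, W)
        a1 = self.conv0(x)                       # 3 x 5 branch
        a2 = self.conv_spatial(a1)               # cascaded 3 x 7 dilated branch
        u1, u2 = self.conv1(a1), self.conv2(a2)  # (B, D/2, H, W) each
        u = torch.cat([u1, u2], dim=1)           # (B, D, H, W)
        sa_max, _ = torch.max(u, dim=1, keepdim=True)  # channel max pooling
        sa_avg = torch.mean(u, dim=1, keepdim=True)    # channel avg pooling
        sa = torch.sigmoid(self.conv_squeeze(torch.cat([sa_max, sa_avg], dim=1)))
        s = self.conv3(u1 * sa[:, 0:1] + u2 * sa[:, 1:2])  # (B, D, H, W)
        return x * s                             # modulated output Y

# Quick shape check
y = ASKModule(64)(torch.randn(2, 64, 128, 128))
assert y.shape == (2, 64, 128, 128)
```

In this sketch the two branches are cascaded, following the "sequential processing" wording in the first step: the 3 × 7 dilated depthwise convolution operates on the output of the 3 × 5 depthwise convolution, so its horizontal receptive field extends that of the first branch.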

2.3. Vehicle Height Estimation

After obtaining high-precision object detection results, we can further enhance vehicle trajectory accuracy through vehicle projection size correction. Current open-source vehicle trajectory datasets based on drone videos generally lack the vehicle height dimension information. While vehicle height is often ignored when converting trajectory datasets into 2D top-down view representations, its impact on trajectory accuracy cannot be overlooked when building high-accuracy vehicle trajectory datasets. Particularly when processing tall vehicles like trucks and buses, due to projection relationships, directly using the centroid of the detection box as the vehicle’s centroid introduces errors that increase as vehicles approach image edges, as shown in Figure 8. The correction process requires estimating the vehicle height by combining the vehicle’s projected position in the image and drone flight altitude to adjust the vehicle’s centroid position. Beyond centroid trajectories, when describing the motion trajectories of large trucks and buses, vehicle length and width dimensions are crucial attribute information that cannot be ignored. After obtaining height information, these two values can also be corrected to help the dataset acquire more accurate vehicle dimension information. Additionally, this research utilized a relatively low drone flight altitude of 120 m, resulting in larger pixel regions occupied by vehicles, which facilitates vehicle height estimation. The reduced field of view caused by the lower flight altitude can be addressed by implementing synchronized filming using multiple drones in parallel formation.
In the specific estimation process, we designate the image center point as the drone camera’s focal point and utilize it to divide the image into four quadrants, as illustrated in Figure 9. Taking the first quadrant located in the upper-left corner of the image as an example, the bottom-right coordinate of the red vehicle represents the real coordinate without errors, while the remaining three points’ coordinates experience partial or complete distortion due to projection relationships. We define the real coordinates as $(x_r, y_r)$ and the projected coordinates as $(x_p, y_p)$, with the detailed coordinate schematic shown in Figure 10.
We examine the real and projected coordinates on the x-axis for discussion, and the specific projection relationship is shown schematically in Figure 11. The projection relationship in the direction of the y-axis is the same as that of the x-axis.
We approximate the image center $O_p$ as the drone camera focal point $O_d$ and use a red rectangle to approximate the vehicle’s rectangular cuboid region. The x-coordinate closer to the image center denotes a true coordinate $x_{ri}$, while the x-coordinate farther from the image center is a projected coordinate $x_{pm}$. Between $x_{pm}$ and $x_{ri}$, besides the vehicle length $l_v$, there is an additional distance $h_{pm}$ caused by the height projection due to the vehicle height $h_v$. After obtaining the drone flight height $h_d$ and the vehicle’s true coordinates $x_{r1}$, $x_{r5}$ at two different positions in the image, along with their corresponding projected coordinates $x_{p2}$, $x_{p6}$, we can calculate the vehicle height $h_v$:
$h_v = h_d \left( 1 + \dfrac{x_{r5} - x_{r1}}{x_{p2} - x_{p6}} \right)$.
The coordinates in the remaining three quadrants can be extrapolated similarly. Multiple coordinate pairs yield several vehicle height measurements, which can be statistically analyzed to help offset some of the observation and detection errors. It is noteworthy that the actual vehicle shape rarely conforms to a regular cuboid, with potential height variations between the vehicle’s front and body sections. Since this calculation method fundamentally derives height measurements from projections of specific vehicle height edges, height estimations from x-coordinates and y-coordinates should be computed independently, and coordinate heights within different quadrants should also be calculated separately. Based on the estimated height data, we can further deduce vehicle dimensional information and the center of mass, ultimately obtaining higher-accuracy vehicle trajectory information.
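The height formula above can be wrapped in a short helper for one coordinate axis; the sketch below uses synthetic pixel coordinates purely for illustration.

```python
# A small sketch of the height-estimation formula for one coordinate axis.
# h_d is the drone flight altitude (m); the x values are pixel coordinates of
# the near (true) and far (projected) vehicle corners observed at two frames
# in the same image quadrant. The sample values below are synthetic.
def estimate_vehicle_height(h_d, x_r1, x_p2, x_r5, x_p6):
    return h_d * (1.0 + (x_r5 - x_r1) / (x_p2 - x_p6))

# A 3 m tall vehicle filmed from 120 m should come out near 3.0:
print(estimate_vehicle_height(120.0, 1270.0, 1052.05, 970.0, 744.36))  # ~3.0
```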

3. Results

3.1. Vehicle Labels

In this study, we employed a single “vehicle” label to annotate all vehicle objects. While other similar datasets typically utilize multiple classification labels for vehicle object categorization—such as the DroneVehicle dataset [50], which implements five distinct labels (car, van, bus, truck, freight car), and the DOTA-v1.0 and SODA-A datasets, which employ two labels (small-vehicle and large-vehicle)—our experimental findings revealed an unexpected phenomenon: when the number of vehicle classification labels increases, even with regression and classification performed independently, the network exhibits decreased regression accuracy. The adoption of a single “vehicle” label effectively prevents such issues. Our experimental results shown in Figure 12, comparing the DOTA-v1.0 dataset’s annotation format with our single-label format, demonstrate that using multiple classification labels can lead to false detections, particularly in cases involving small vehicles and cargo loaded on large trucks. Furthermore, the significant variation in truck dimensions presents a challenge: some small trucks, despite sharing morphological similarities with “large-vehicle”-labeled trucks, are dimensionally closer to cars and are thus labeled as “small-vehicle”. This inconsistency can result in the misclassification of larger trucks as small vehicles when their dimensions approximate those of small trucks, thereby affecting the regression accuracy of both the “small-vehicle” and “large-vehicle” categories. These errors are not adequately reflected in AP and mAP metrics due to the reduced number of objects in each category after classification. The implementation of a single vehicle label eliminates these false detections, enhances the overall regression accuracy of vehicle detection boxes, and better aligns with our objective of establishing a high-accuracy vehicle trajectory dataset.
While vehicle trajectory datasets require category information (e.g., the highD dataset’s car, bus, and truck classifications), this can be achieved through image classification of cropped vehicle regions, supplemented by vehicle size information, trajectory characteristics, and vehicle operation area information.

3.2. Comparative Study

This paper presents a comparative experimental analysis between the proposed Asymmetric Selective Kernel Network and the Symmetric Selective Kernel Network. We selected LSKNet as our baseline network, which, as of 25 November 2024, maintains the highest mAP accuracy among CNN-backbone-based object detection models on the Object Detection In Aerial Images on DOTA-v1.0 Dataset leaderboard. It is noteworthy that modifying the convolution kernel sizes in the backbone architecture would disrupt the pre-trained weights of the backbone network. Therefore, to ensure experimental fairness, all experiments in this study were conducted using models without pre-trained weights. The structural differences between our proposed ASKNet and LSKNet are illustrated in Table 1, where $(k_1, d_1)$ represents the smaller convolution kernel, $(k_2, d_2)$ represents the larger dilated convolution kernel, and $(k_a, d_a)$ represents the attention convolution kernel. Experimental validation was performed on the MMRotate platform using a single RTX 4090 GPU.

3.2.1. Results on FRVehicle Dataset

We initially conducted comparative experiments on our self-constructed FRVehicle dataset. While the primary objective of this study is to establish a high-accuracy vehicle trajectory dataset, the impact of manual annotation errors also requires consideration. After balancing these two factors, the IoU threshold for positive matches in training, validation, and testing was set to 0.9. We performed a series of tests with different detection head architectures. All models were trained for 12 epochs with a batch size of 6, and the learning rates were fine-tuned according to the specific convergence characteristics of different models, resulting in certain variations across models, though models with identical detection heads maintained consistent learning rates. Taking into account the distribution range of vehicle lengths, we segmented the original images into 1024 × 1024 pixel patches with an overlap of 600 pixels before feeding them into the network for training and testing. The test results are presented in Table 2.
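As a reference for the tiling step described above, the following sketch splits a full-resolution frame into 1024 × 1024 patches with a 600-pixel overlap (stride 424); annotation clipping and remapping for each patch are omitted, and the function is an illustration rather than the exact preprocessing pipeline used in the experiments.

```python
# Minimal sketch of 1024 x 1024 tiling with 600 px overlap (stride 424).
# Annotation clipping/remapping for the patches is omitted for brevity.
import numpy as np

def tile_image(img: np.ndarray, tile: int = 1024, overlap: int = 600):
    stride = tile - overlap
    h, w = img.shape[:2]
    patches = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            # Clamp so border patches keep the full tile size when possible.
            y0, x0 = min(y, max(h - tile, 0)), min(x, max(w - tile, 0))
            patches.append(((x0, y0), img[y0:y0 + tile, x0:x0 + tile]))
    return patches

patches = tile_image(np.zeros((950, 3550, 3), dtype=np.uint8))
print(len(patches), patches[0][1].shape)  # 7 patches, each (950, 1024, 3)
```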
Based on the test set results, single-stage frameworks (R3Det and S²A-Net) generally achieve lower detection accuracy compared to two-stage frameworks (O-RCNN and RoI Transformer). Although the vehicle objects in this study have detection boxes that approximate HBBs, R3Det's progressive regression strategy, which generates horizontal anchor boxes before rotational refinement, does not improve detection performance, and its regression method is better suited to features extracted with symmetric kernels. Apart from this exception, in most cases, the proposed ASKNet demonstrates varying degrees of improvement compared to LSKNet under different detection head configurations. Notably, when implementing the S²A-Net method, we observe minimal accuracy disparity between the validation and test sets, suggesting robust resistance to overfitting. These results validate the method's effectiveness on the FRVehicle dataset.
We conducted ablation experiments using Oriented RCNN + ASKNet as a case study to evaluate detection accuracy without the selection mechanism and with hybrid usage of symmetric and asymmetric kernels, thereby further validating the rationality of both the selection mechanism and the asymmetric kernels; detailed results are presented in Table 3. Additionally, we performed comparative experiments against several representative state-of-the-art (SOTA) networks, with the results presented in Table 4.

3.2.2. Results on DroneVehicle Dataset

To evaluate the generalization capability of our proposed algorithm, we conducted generalization tests on the DroneVehicle dataset. DroneVehicle, established by Tianjin University, is a dual-modal vehicle detection dataset containing RGB and infrared images captured at various times throughout the day, with vehicle objects annotated using OBBs. Because some OBB annotations in this dataset have slightly lower precision, we reduced the IoU threshold to 0.8 for generalization testing and unified the multi-class vehicle labels into a single “vehicle” label. When experimenting with the complete DroneVehicle dataset, the Oriented RCNN + ASKNet method achieved a test set accuracy of 0.403, indicating relatively low precision. To maintain consistency with our research data, we exclusively utilized the RGB images from DroneVehicle. The RGB images in the DroneVehicle dataset are categorized into Day, Night, and Dark Night according to lighting conditions. We eliminated all Night and Dark Night images and removed some foggy images from the Day category, as shown in Figure 13. After filtering, approximately half of the original 28,439 RGB images, all under visible light conditions, were retained for the generalization experiment, maintaining the original train–validation–test split configuration.
To enhance detection complexity, a substantial portion of images in this dataset were deliberately captured with road orientations perpendicular to the image’s long edge, as illustrated in Figure 14. To better align the angle distribution with the FRVehicle dataset, we specifically applied a 90-degree rotation to these images while preserving the original orientation of images where roads ran parallel to either the longitudinal axis or diagonal direction. Additionally, we removed annotations where bounding boxes intersected image boundaries, as shown in Figure 14. The angular distribution of all original OBBs and those after 90-degree rotation are presented in Figure 15. To quantify the effects of this modification, we conducted comparative experiments by training and testing models on both the original and rotated datasets.
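A minimal sketch of this 90° rotation and the corresponding OBB corner remapping is given below; it assumes corner coordinates stored as an (N, 4, 2) array of (x, y) pixels and is illustrative rather than the exact preprocessing script.

```python
# Illustrative sketch: rotate an image 90 degrees counter-clockwise and remap
# DOTA-style OBB corner coordinates accordingly. Removal of boxes that cross
# the image boundary (see Figure 14c) is not shown here.
import numpy as np

def rotate90_ccw(img: np.ndarray, corners: np.ndarray):
    """img: (H, W, C); corners: (N, 4, 2) array of (x, y) pixel coordinates."""
    h, w = img.shape[:2]
    rotated = np.rot90(img)  # counter-clockwise by 90 degrees
    # Under this rotation a pixel (x, y) maps to (y, w - 1 - x).
    new_corners = np.empty_like(corners)
    new_corners[..., 0] = corners[..., 1]
    new_corners[..., 1] = w - 1 - corners[..., 0]
    return rotated, new_corners

# Quick check: the top-right corner of a 100 x 200 image moves to the top-left.
img = np.zeros((100, 200, 3), dtype=np.uint8)
_, c = rotate90_ccw(img, np.array([[[199, 0], [199, 0], [199, 0], [199, 0]]]))
assert (c[0, 0] == [0, 0]).all()
```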
Although the OBBs of vehicle objects in the DroneVehicle dataset are relatively small compared to those in the FRVehicle dataset, we maintained the kernel size structure of the convolutional layers for experimental consistency. The detailed experimental results are presented in Table 5.
The experimental results demonstrate that the detection accuracy of single-stage frameworks (R3Det and S²A-Net) is consistently lower than that of two-stage frameworks (O-RCNN and RoI Transformer). R3Det makes better use of features extracted by symmetric convolution kernels, making it more suitable for processing data with irregular angle distributions. In addition, given the small objects and the moderate regression precision required on this dataset, R3Det's detection accuracy shows no significant difference compared to S²A-Net. In the data comparison experiments, while the rotation-induced change in image length and width disrupts size consistency and introduces additional detection challenges, the angular regularity introduced by the data transformation helps reduce detection difficulty. In most cases, the detection accuracy on the test set using the rotated data matches or slightly surpasses that of the original data. When integrated with the proposed ASKNet, the final detection accuracy still maintains an advantage over the baseline LSKNet.

4. Discussion

This study presents the FRVehicle dataset for vehicle object detection at a freeway on-ramp using drone-based top-down imagery, addressing the current gap in open-source datasets specifically designed for extracting high-accuracy vehicle trajectory data. Leveraging the FRVehicle dataset, we systematically characterized the distinctive data attributes and designed an asymmetric selective kernel module to achieve high-precision vehicle object detection. Furthermore, we implemented vehicle height estimation based on the high-precision detection results, providing theoretical foundations for constructing high-accuracy vehicle trajectory datasets.
The comparative experimental results reveal several important insights regarding the relationship between data complexity and network architecture. Although researchers typically increase data detection complexity and network structural complexity to enhance feature extraction capabilities and model generalization, the natural patterns in real-world data may be more straightforward than anticipated. Our drone-based freeway vehicle detection data exemplify this phenomenon, and similar patterns emerge in pedestrian detection data from surveillance or vehicle-mounted cameras, particularly in terms of aspect ratio and directional consistency. Rather than artificially complicating these natural patterns to increase network detection difficulty, superior detection accuracy can be achieved by identifying and leveraging these inherent data regularities and adaptively modifying network architectures. Particularly when pursuing high-precision object detection results, the optimal approach lies not in solely relying on network capabilities but in establishing synergy between data formats and network structures so that they complement rather than constrain each other.
In our generalization experiments, we identified performance degradation in the model’s detection accuracy under low-illumination conditions. Future work will focus on establishing a high-precision vehicle detection dataset from drone-based top-down imagery in low-illumination environments. We plan to enhance our algorithm to address the challenges of increased image noise and reduced contrast in low-light conditions, ultimately extending the temporal range for acquiring high-accuracy vehicle trajectory data from drone videos.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W.; software, Z.W.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W. and L.X.; resources, L.X.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W., L.X. and Z.Y.; visualization, Z.W.; supervision, Z.Y.; project administration, L.X.; funding acquisition, L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shanghai Municipal Science and Shanghai Automotive Industry Science and Technology Development Foundation (No. 2407) and Perspective Study Funding of Nanchang Automotive Institute of Intelligence & New Energy, Tongji University (17002380058).

Data Availability Statement

The data and code in this research are available at https://github.com/kaevolx/FRVehicle-ASKNet (accessed on 28 November 2024) and https://github.com/VisDrone/DroneVehicle (accessed on 1 November 2024). The code for the comparison network is available at https://github.com/zcablii/LSKNet (accessed on 1 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OBB     Oriented bounding box
HBB     Horizontal bounding box
IoU     Intersection over union
ASKNet  Asymmetric Selective Kernel Network
LSKNet  Large Selective Kernel Network

References

  1. Berghaus, M.; Lamberty, S.; Ehlers, J.; Kalló, E.; Oeser, M. Vehicle trajectory dataset from drone videos including off-ramp and congested traffic—Analysis of data quality, traffic flow, and accident risk. Commun. Transp. Res. 2024, 4, 100133. [Google Scholar] [CrossRef]
  2. Lu, D.; Eaton, E.T.; Van Der Weg, M.; Wang, W.; Como, S.G.; Wishart, J.D.; Yu, H.; Yang, Y. CAROM Air—Vehicle Localization and Traffic Scene Reconstruction from Aerial Videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10666–10673. [Google Scholar]
  3. Wang, Z.; Yu, Z.; Tian, W.; Xiong, L.; Tang, C. A Method for Building Vehicle Trajectory Data Sets Based on Drone Videos; SAE Technical Paper 2023-01-0714; SAE International: Warrendale, PA, USA, 2023. [Google Scholar]
  4. Krajewski, R.; Bock, J.; Kloeker, L.; Eckstein, L. The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2118–2125. [Google Scholar]
  5. Bock, J.; Krajewski, R.; Moers, T.; Runde, S.; Vater, L.; Eckstein, L. The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1929–1934. [Google Scholar]
  6. Krajewski, R.; Moers, T.; Bock, J.; Vater, L.; Eckstein, L. The rounD Dataset: A Drone Dataset of Road User Trajectories at Roundabouts in Germany. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
  7. Moers, T.; Vater, L.; Krajewski, R.; Bock, J.; Zlocki, A.; Eckstein, L. The exiD Dataset: A Real-World Trajectory Dataset of Highly Interactive Highway Scenarios in Germany. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 958–964. [Google Scholar]
  8. Zheng, O.; Abdel-Aty, M.A.; Yue, L.; Abdelraouf, A.; Wang, Z.; Mahmoud, N. CitySim: A Drone-Based Vehicle Trajectory Dataset for Safety-Oriented Research and Digital Twins. Transp. Res. Rec. 2022, 2678, 606–621. [Google Scholar] [CrossRef]
  9. Zhan, W.; Sun, L.; Wang, D.; Shi, H.; Clausse, A.; Naumann, M.; Kümmerle, J.; Königshof, H.; Stiller, C.; de La Fortelle, A.; et al. Interaction Dataset: An International, Adversarial and Cooperative Motion Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv 2019, arXiv:1910.03088. [Google Scholar]
  10. Xu, Y.; Shao, W.; Li, J.; Yang, K.-B.; Wang, W.; Huang, H.; Lv, C.; Wang, H. SIND: A Drone Dataset at Signalized Intersection in China. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 2471–2478. [Google Scholar]
  11. Lou, Z.; Cui, Q.; Wang, H.; Tang, X.; Zhou, H. Multimodal Sense-Informed Forecasting of 3D Human Motions. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2144–2154. [Google Scholar]
  12. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  13. Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.-M.; Yang, J. LSKNet: A Foundation Lightweight Backbone for Remote Sensing. Int. J. Comput. Vis. 2024. [Google Scholar] [CrossRef]
  14. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 16794–16805. [Google Scholar]
  15. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. arXiv 2024, arXiv:2403.06258. [Google Scholar] [CrossRef]
  16. Jegham, N.; Koh, C.Y.; Abdelatti, M.; Hendawi, A.M. Evaluating the Evolution of YOLO (You Only Look Once) Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors. arXiv 2024, arXiv:2411.00201. [Google Scholar]
  17. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive Rotated Convolution for Rotated Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6566–6577. [Google Scholar]
  18. Han, J.; Ding, J.; Xue, N.; Xia, G. ReDet: A Rotation-equivariant Detector for Aerial Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2785–2794. [Google Scholar]
  19. Yu, C.; Jiang, X.; Wu, F.; Fu, Y.; Pei, J.; Zhang, Y.; Li, X.; Fu, T. A Multi-Scale Feature Fusion Based Lightweight Vehicle Target Detection Network on Aerial Optical Images. Remote Sens. 2024, 16, 3637. [Google Scholar] [CrossRef]
  20. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar]
  21. Zhang, X.; Tian, Y.; Huang, W.; Ye, Q.; Dai, Q.; Xie, L.; Tian, Q. HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling. arXiv 2022, arXiv:2205.14949. [Google Scholar]
  22. Xu, Y.; Zhang, Q.; Zhang, J.; Tao, D. ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias. arXiv 2021, arXiv:2106.03348. [Google Scholar]
  23. Zhang, Q.; Xu, Y.; Zhang, J.; Tao, D. ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond. Int. J. Comput. Vis. 2022, 131, 1141–1162. [Google Scholar] [CrossRef]
  24. Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607315. [Google Scholar] [CrossRef]
  25. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
  26. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 2342–2356. [Google Scholar] [CrossRef]
  27. He, X.; Liang, K.; Zhang, W.; Li, F.; Jiang, Z.; Zuo, Z.; Tan, X. DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query. Remote Sens. 2024, 16, 3516. [Google Scholar] [CrossRef]
  28. Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11965. [Google Scholar]
  29. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 6–14 December 2021. [Google Scholar]
  30. Yang, X.; Yan, J.; Qi, M.; Wang, W.; Zhang, X.; Tian, Q. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021. [Google Scholar]
  31. Hou, L.; Lu, K.; Yang, X.; Li, Y.; Xue, J. G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. Remote Sens. 2023, 15, 757. [Google Scholar] [CrossRef]
  32. Yang, X.; Zhang, G.; Yang, X.; Zhou, Y.; Wang, W.; Tang, J.; He, T.; Yan, J. Detecting Rotated Objects as Gaussian Distributions and its 3-D Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4335–4354. [Google Scholar] [CrossRef] [PubMed]
  33. Yu, Y.; Da, F. Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13354–13363. [Google Scholar]
  34. Xiao, Z.; Yang, G.-Y.; Yang, X.; Mu, T.-J.; Yan, J.; Hu, S.-M. Theoretically Achieving Continuous Representation of Oriented Bounding Boxes. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16912–16922. [Google Scholar]
  35. Zhao, Z.; Xue, Q.; He, Y.; Bai, Y.; Wei, X.; Gong, Y. Projecting points to axes: Oriented object detection via point-axis representation. In Proceedings of the Computer Vision—ECCV 2024, 18th European Conference, Milan, Italy, 29 September–4 October 2024; pp. 161–179. [Google Scholar]
  36. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  37. Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  38. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  39. Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  40. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for Aerial Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1–10. [Google Scholar]
  41. Yu, H.; Tian, Y.; Ye, Q.; Liu, Y. Spatial Transform Decoupling for Oriented Object Detection. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6782–6790. [Google Scholar]
  42. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  43. Ding, J.; Xue, N.; Xia, G.-S.; Bai, X.; Yang, W.; Yang, M.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef]
  44. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  45. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  46. Li, Y.; Luo, J.; Zhang, Y.; Tan, Y.; Yu, J.-G.; Bai, S. Learning to Holistically Detect Bridges From Large-Size VHR Remote Sensing Imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 11507–11523. [Google Scholar] [CrossRef] [PubMed]
  47. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  48. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear deformable convolution for improving convolutional neural networks. Image Vis. Comput. 2024, 149, 105190. [Google Scholar] [CrossRef]
  49. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  50. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-Infrared Cross-Modality Vehicle Detection via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
Figure 1. Vehicle projections for different scenes. (a) A freeway scene. (b) An intersection scene.
Figure 2. FRVehicle scene 1.
Figure 3. FRVehicle scene 2.
Figure 4. A comparison of DOTA-v1.0 and FRVehicle dataset OBB parameters. The DOTA-v1.0 dataset is shown in green, FRVehicle is shown in blue, the x-axis is the value of the parameter for different labeling boxes, and the y-axis is the distribution density. (a) Length comparison. (b) Width comparison. (c) Length-to-width ratio comparison.
Figure 5. Area comparison. The DOTA-v1.0 dataset is shown in green, FRVehicle is shown in blue, the x-axis is the length of the labeled box, the y-axis is the width of the labeled box, and the z-axis is the distribution density under different areas.
Figure 6. Distribution comparison of the DOTA-v1.0 and FRVehicle datasets’ OBB angle parameters. (a) DOTA-v1.0 OBB angle distribution. (b) FRVehicle OBB angle distribution.
Figure 7. A conceptual illustration of the ASK module.
Figure 8. Changes in the projection of the same vehicle at different locations in the image.
Figure 9. The quadrant division of an image.
Figure 10. Vehicle projection in the first quadrant.
Figure 11. A schematic of the first-quadrant projection.
Figure 12. A comparison of detection results under double-label and single-label annotation. The images in (a,b) show false detections caused by cargo loaded on a vehicle. The images in (c,d) show inaccurate detections caused by the double-label annotation.
Figure 13. Removed RGB images.
Figure 14. A schematic of image OBBs. (a) An image in which the road runs parallel to the image diagonal is left unprocessed. (b) An image in which the road runs perpendicular to the long side is rotated by 90°. (c) OBBs that extend beyond the image boundaries are removed.
Figure 15. Distribution comparison of the original and corrected data angle parameters. (a) Original data. (b) Corrected data.
Table 1. Comparison of ASKNet and LSKNet convolutional kernel structures.

(k, d) Sequence    | ASKNet | LSKNet
Height (k_1, d_1)  | (3, 1) | (5, 1)
Width (k_1, d_1)   | (5, 1) | (5, 1)
Height (k_2, d_2)  | (3, 1) | (7, 3)
Width (k_2, d_2)   | (7, 3) | (7, 3)
Height (k_a, d_a)  | (3, 1) | (7, 1)
Width (k_a, d_a)   | (7, 1) | (7, 1)
Table 2. The AP results on the test set for the FRVehicle dataset.

Methods          | ASKNet AP90 | LSKNet AP90
R3Det            | 0.494       | 0.506 ↑
S²A-Net          | 0.706 ↑     | 0.637 ¹
Oriented RCNN    | 0.739 ↑     | 0.733
RoI Transformer  | 0.759 ↑     | 0.757
¹ Test set AP is much lower than validation set AP.
Table 3. Ablation study on the design of our proposed ASKNet module.

Types        | 3 × 5 | 5 × 5 | 3 × 7 | 7 × 7 | AP90    | FPS
Unselective  |       |       |       |       | 0.732   | 21.2
Unselective  |       |       |       |       | 0.738   | 21
Hybrid       |       |       |       |       | 0.737   | 20.4
Hybrid       |       |       |       |       | 0.737   | 20.5
Symmetric    |       |       |       |       | 0.733   | 20.2
Asymmetric   |       |       |       |       | 0.739 ↑ | 20.8
Table 4. Comparisons with the SOTA methods on the FRVehicle dataset.

Methods                   | AP90
RTMDet                    | 0.729
ARC                       | 0.742
PKINet                    | 0.755
RoI Transformer + ASKNet  | 0.759 ↑
Table 5. The AP results on the test set for the DroneVehicle dataset.

Methods          | ASKNet AP80 (Original Data) | LSKNet AP80 (Original Data) | ASKNet AP80 (Rotated Data) | LSKNet AP80 (Rotated Data)
R3Det            | 0.580                       | 0.585 ↑                     | 0.576                      | 0.580
S²A-Net          | 0.573                       | 0.570                       | 0.574 ↑                    | 0.570
Oriented RCNN    | 0.666 ↑                     | 0.665                       | 0.666 ↑                    | 0.661
RoI Transformer  | 0.659                       | 0.654                       | 0.669 ↑                    | 0.664
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
