A Review of DEtection TRansformer: From Basic Architecture to Advanced Developments and Visual Perception Applications
Abstract
1. Introduction
2. Fundamental Theory and Architecture of DETR
2.1. Backbone: Extracting Image Features
2.2. Positional Encoding: Injecting Spatial Information
2.3. Transformer Encoder: Enhancing Global Context
2.4. Transformer Decoder: Querying and Decoding Objects
- Self-Attention over Queries: MSA is applied to the object queries, allowing the queries to interact with one another and become aware of the objects that other queries may be attending to; this helps the model avoid generating duplicate detection boxes and serves as an implicit deduplication mechanism.
- Cross-Attention: This is crucial for decoding object information. It uses the object queries from the previous layer as the query and the image features output by the encoder as the key and value. Through the cross-attention mechanism (also called encoder–decoder attention), each object query can effectively query the encoded image feature map, attend to image regions related to its potential object, and aggregate the corresponding feature information to update its own representation.
- FFN: Its structure and function are similar to the FFN in the encoder, performing further feature transformation on the output of the cross-attention. The object queries are passed layer by layer through the L decoder layers and are progressively refined. The final output is the representation learned by the decoder for these potential objects, containing key information used for the subsequent prediction of their categories and locations.
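To make this layer structure concrete, the following is a minimal PyTorch sketch of a single decoder layer as described above (post-norm, dropout and auxiliary details omitted); the class and argument names are illustrative rather than DETR's exact implementation.

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """One DETR-style decoder layer: query self-attention, cross-attention to the
    encoder output (memory), and an FFN, each wrapped with a residual + LayerNorm."""
    def __init__(self, d_model=256, n_heads=8, d_ffn=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory, query_pos, memory_pos):
        # Self-attention over object queries: the implicit deduplication step.
        q = k = queries + query_pos
        queries = self.norm1(queries + self.self_attn(q, k, queries)[0])
        # Cross-attention: queries (as Q) probe the encoder features (as K, V).
        queries = self.norm2(queries + self.cross_attn(
            queries + query_pos, memory + memory_pos, memory)[0])
        # FFN transforms each refined query representation independently.
        return self.norm3(queries + self.ffn(queries))
```

In the full model, L such layers are stacked, and the output of each layer can additionally be supervised with auxiliary prediction heads during training.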
2.5. Prediction Heads: Generating Final Predictions
- Classification Head: Responsible for predicting the category corresponding to each query. It outputs (C + 1)-dimensional logits, where C is the number of object categories. After passing through a Softmax function, it yields the probability distribution of each query belonging to one of the C object categories or the “no object” background class.
- Bounding Box Head: Responsible for predicting the bounding box corresponding to each query. It outputs four real values, directly regressing the normalized bounding box center coordinates (cx, cy) and the box height h and width w. These coordinates are relative to the entire image’s size.
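A minimal sketch of the two heads follows, assuming a COCO-style setting (num_classes = 91) and the per-query application described above; the module names and layer sizes are illustrative.

```python
import torch.nn as nn

class PredictionHeadsSketch(nn.Module):
    """Classification head: a linear layer producing (num_classes + 1) logits, where the
    extra class is 'no object'. Bounding box head: a 3-layer MLP regressing normalized
    (cx, cy, w, h), squashed into [0, 1] with a sigmoid."""
    def __init__(self, d_model=256, num_classes=91):
        super().__init__()
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),
        )

    def forward(self, decoder_output):                    # (batch, num_queries, d_model)
        logits = self.class_head(decoder_output)          # (batch, num_queries, num_classes + 1)
        boxes = self.box_head(decoder_output).sigmoid()   # normalized (cx, cy, w, h)
        return logits, boxes
```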
2.6. Set Prediction Loss and Bipartite Matching: The Key to End-to-End Training
- Class Prediction Cost (L_class): This term scores the classification. Unlike the final classification loss, which uses the negative log-likelihood, the matching cost uses the predicted probability of the ground-truth class directly, which keeps it commensurate with the box regression costs.
- L1 Bounding Box Cost (L_L1): This is the L1 distance between the predicted box b̂ and the ground-truth box b, penalizing differences in center coordinates, height, and width.
- Generalized IoU Cost (L_GIoU): This cost is based on the Generalized Intersection over Union (GIoU). It is a scale-invariant metric that complements the L1 cost, since it accounts for the overlap and relative layout of the two boxes rather than only their coordinate differences.
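The three costs are combined into one cost matrix over all (query, ground truth) pairs, and the Hungarian algorithm finds the optimal one-to-one assignment. Below is a minimal sketch, assuming normalized (cx, cy, w, h) boxes and the (1, 5, 2) weights discussed later; the helper giou_matrix is a simplified pairwise GIoU, not DETR's exact implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def giou_matrix(boxes1, boxes2):
    """Pairwise generalized IoU for boxes in normalized (cx, cy, w, h) format."""
    def to_xyxy(b):
        return torch.cat([b[:, :2] - b[:, 2:] / 2, b[:, :2] + b[:, 2:] / 2], dim=-1)
    b1, b2 = to_xyxy(boxes1), to_xyxy(boxes2)
    area1 = (b1[:, 2] - b1[:, 0]) * (b1[:, 3] - b1[:, 1])
    area2 = (b2[:, 2] - b2[:, 0]) * (b2[:, 3] - b2[:, 1])
    lt = torch.max(b1[:, None, :2], b2[None, :, :2])      # pairwise intersection corners
    rb = torch.min(b1[:, None, 2:], b2[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    union = area1[:, None] + area2[None, :] - inter
    iou = inter / union
    # Smallest enclosing box provides the GIoU penalty for non-overlapping pairs.
    lt_c = torch.min(b1[:, None, :2], b2[None, :, :2])
    rb_c = torch.max(b1[:, None, 2:], b2[None, :, 2:])
    enclose = (rb_c - lt_c).clamp(min=0).prod(dim=-1)
    return iou - (enclose - union) / enclose

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes,
                    w_class=1.0, w_l1=5.0, w_giou=2.0):
    """Combine the three cost terms and solve the one-to-one assignment."""
    cost_class = -pred_probs[:, gt_labels]               # negative probability of the GT class
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)     # L1 distance on (cx, cy, w, h)
    cost_giou = -giou_matrix(pred_boxes, gt_boxes)
    cost = w_class * cost_class + w_l1 * cost_l1 + w_giou * cost_giou
    query_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return query_idx, gt_idx
```

Only the matched (query, ground truth) pairs contribute localization losses; unmatched queries are supervised toward the “no object” class.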
3. Key Challenges and Technical Evolution
3.1. Enhancing Feature Representation and Efficiency: From Dense to Sparse Attention
- Sparse attention reduces complexity by only calculating a portion of important attention weights. Sparse DETR [27] is a representative of this, proposing to only update feature map tokens explicitly referenced by object queries in the decoder, thereby reducing computational costs during backpropagation. Similarly, PnP-DETR [52] optimizes the attention module in a plug-and-play manner, improving flexibility and efficiency. The challenge of these methods lies in how to effectively determine which attention weights are important and worth calculating and how to achieve sparsification without significantly sacrificing performance.
- Lightweight attention is typically combined with efficient network structure design. For example, Lite DETR [53] pairs a lightweight network with a sparse attention mechanism to achieve significant efficiency gains and latency reduction, which are particularly important for resource-constrained devices, though possibly at some cost in accuracy. IA-DETR [54] introduces an indirect attention mechanism that flexibly establishes relationships between object queries, target-image features, and query-image features, simplifying the traditional cross-attention mechanism and significantly improving the model’s performance in one-shot object detection.
- Furthermore, although the Swin Transformer [55] is primarily a general-purpose visual backbone and was not designed specifically to optimize DETR’s attention module, its window-based shifted attention mechanism, which computes attention within local windows and uses window shifting for cross-window information exchange, provides important insight into balancing computational efficiency and receptive field in Transformer architectures. These diverse attempts collectively promote the development of attention mechanisms within the DETR framework [56].
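As a concrete illustration of the window-based idea, the sketch below restricts self-attention to non-overlapping local windows, dropping the cost from O((HW)²) to O(HW·w²); the shifted windows and attention masks of Swin are omitted, and the tensor layout is a simplifying assumption.

```python
import torch
import torch.nn as nn

def window_self_attention(tokens, attn, window_size):
    """Partition an (H, W, d) feature map into non-overlapping windows and apply
    self-attention only within each window (local attention)."""
    H, W, d = tokens.shape
    ws = window_size
    windows = tokens.reshape(H // ws, ws, W // ws, ws, d).permute(0, 2, 1, 3, 4)
    windows = windows.reshape(-1, ws * ws, d)             # (num_windows, ws*ws, d)
    out, _ = attn(windows, windows, windows)              # attention within each window
    out = out.reshape(H // ws, W // ws, ws, ws, d).permute(0, 2, 1, 3, 4)
    return out.reshape(H, W, d)

# Usage with an illustrative 32x32 feature map of dimension 256:
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out = window_self_attention(torch.randn(32, 32, 256), attn, window_size=8)
```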
3.2. Stabilizing Training and Accelerating Convergence: Innovations in Query and Supervision
Algorithm: Core Algorithm for Contrastive Denoising and Hybrid Query Selection

Algorithm 1. Contrastive Denoising (CDN) Training Method
Input: image, gt_boxes (ground-truth bounding boxes), gt_labels (ground-truth labels)
Output: total training loss
function Contrastive_Denoising_Training(image, gt_boxes, gt_labels):

Algorithm 2. Mixed Query Selection
Input: Encoder_features (output features from the Transformer encoder)
Output: Initial_Decoder_Queries (initial queries for the decoder: {positional_queries, content_queries})
function Mixed_Query_Selection(Encoder_features):

Helper Function Stubs (Conceptual):
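To make the two listed algorithms concrete, the following is a highly simplified PyTorch sketch in the spirit of DINO's CDN and mixed query selection; the noise scales, helper names, and head interfaces (enc_class_head, enc_box_head, content_embed) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def add_box_noise(boxes, scale):
    """Jitter normalized (cx, cy, w, h) boxes by a fraction `scale` of their size."""
    noise = (torch.rand_like(boxes) * 2 - 1) * scale       # uniform in [-scale, scale]
    jittered = boxes.clone()
    jittered[:, :2] += noise[:, :2] * boxes[:, 2:]          # shift centers
    jittered[:, 2:] *= (1 + noise[:, 2:])                   # rescale width/height
    return jittered.clamp(0.0, 1.0)

def build_cdn_queries(gt_boxes, gt_labels, num_classes, small_noise=0.4, large_noise=1.0):
    """Contrastive denoising: positive queries get small noise and keep their labels;
    negative queries get larger noise and are supervised as 'no object'."""
    pos_boxes = add_box_noise(gt_boxes, small_noise)
    neg_boxes = add_box_noise(gt_boxes, large_noise)
    neg_labels = torch.full_like(gt_labels, num_classes)    # index C == background class
    dn_boxes = torch.cat([pos_boxes, neg_boxes], dim=0)
    dn_labels = torch.cat([gt_labels, neg_labels], dim=0)
    return dn_boxes, dn_labels

def mixed_query_selection(enc_features, enc_class_head, enc_box_head, content_embed, k=300):
    """Mixed query selection: positional queries (anchors) come from the top-K scored
    encoder tokens, while content queries remain learnable embeddings."""
    scores = enc_class_head(enc_features).max(dim=-1).values   # (num_tokens,)
    topk = scores.topk(k).indices
    positional_queries = enc_box_head(enc_features[topk]).sigmoid()  # (K, 4) anchor boxes
    content_queries = content_embed.weight[:k]                       # (K, d) learnable content
    return positional_queries, content_queries
```

The denoising queries are decoded alongside the regular object queries (with an attention mask preventing information leakage) and only add a reconstruction loss, leaving inference unchanged.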
- Loss Weight Configuration: The total loss in DETR is a weighted sum of the classification loss, the L1 loss, and the GIoU loss. The configuration of the corresponding weight hyperparameters (λ_class, λ_L1, λ_GIoU) is critical [43]. In the original DETR, they were set to (1, 5, 2) to balance the gradient scales of the different tasks. In practice, if the model localizes poorly, one can try increasing λ_L1 and λ_GIoU; conversely, if classification errors are frequent, λ_class can be increased (a configuration sketch follows this list).
- Learning Rate Scheduling: DETR training is very sensitive to the learning rate schedule. A common effective practice is to set different learning rates for the CNN backbone and the Transformer module. The backbone, typically using pre-trained weights, requires a smaller learning rate (e.g., 10× smaller than the Transformer part) for fine-tuning. Additionally, a short “warmup” phase at the beginning of training is crucial for stability [70].
- Data Augmentation Strategies: Standard data augmentation methods like random flipping, scaling, and cropping are effective for DETR. It is important to adjust the bounding box and positional encoding coordinates accordingly when resizing images. For models aiming to improve convergence speed or handle dense scenes (e.g., DEIM), advanced techniques like Mosaic and Mixup can be considered.
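The following is a minimal configuration sketch reflecting the practices above: loss weights of (1, 5, 2), a 10× smaller backbone learning rate, and a short linear warmup. The stand-in model structure, learning rates, and step counts are illustrative assumptions.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in model exposing the two parameter groups of interest.
model = nn.ModuleDict({"backbone": nn.Linear(64, 64), "transformer": nn.Linear(64, 64)})

# Loss weights (lambda_class, lambda_L1, lambda_GIoU) as in the original DETR.
loss_weights = {"class": 1.0, "l1": 5.0, "giou": 2.0}

# Differential learning rates: the pre-trained backbone gets a 10x smaller LR.
optimizer = AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 1e-5},
        {"params": model["transformer"].parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-4,
)

# Linear warmup over the first 1000 steps, then a constant multiplier of 1.0.
warmup_steps = 1000
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```

During training, the per-term losses would be combined as a weighted sum with loss_weights, and scheduler.step() would be called after each optimizer step.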
3.3. Achieving Real-Time Detection: Architectural Optimization and Specialization
- AIFI Module: As illustrated, the AIFI module specifically processes the highest-level feature “S5”. It employs a single Transformer encoder layer, which internally consists of multi-head self-attention followed by feed-forward operations. This allows “S5” features to undergo intra-scale interaction, effectively capturing global contextual dependencies within that scale and producing enhanced features denoted as “F5”.
- CCFF Module: The CCFF module is responsible for integrating features across different scales. It receives the AIFI-enhanced “F5” features along with the original “S3” and “S4” features from the backbone. Within the CCFF block shown in Figure 15, these multi-scale features (F5, S4, and S3) are channeled into distinct parallel “Fusion” paths (labeled as “Fusion (F5 Path)”, “Fusion (S4 Path)”, and “Fusion (S3 Path)”). These paths facilitate bi-directional information flow (indicated by dashed arrows in the diagram, representing top-down and bottom-up interactions similar to a Path Aggregation Network structure), allowing features from different levels to be effectively fused. The outputs resulting from these interactive fusion paths are then aggregated, typically via concatenation (symbolized by “C” in the original RT-DETR paper’s conceptual diagram and implied by the merging arrows in Figure 15), to form the final enhanced multi-scale feature sequence. This enhanced multi-scale feature sequence serves a dual purpose: it is flattened and then passed to the uncertainty-minimal query selection module, and simultaneously, it is directly provided as encoder features to the subsequent decoder and prediction heads.
- Uncertainty-Minimal Query Selection: This module takes the flattened enhanced multi-scale feature sequence from the CCFF. Internally, it processes these encoder features, calculates an uncertainty metric (e.g., based on the discrepancy between the predicted localization P and classification C distributions, U = ∣∣P − C∣∣), and then selects the Top-K (e.g., K = 300) features. These selected features, exhibiting minimal uncertainty and thus representing high-quality candidates with strong joint localization and classification confidence, are used to form the “Initial Object Queries” (a minimal sketch follows this list). This selective mechanism significantly reduces the number of queries that proceed to the computationally intensive decoder stage.
- Decoder and Prediction Heads: The decoder block receives two primary inputs: the “Initial Object Queries” (K = 300) and the complete “Enhanced Multi-scale Feature Sequence” (as “Encoder Features”). The object queries are first combined with “Positional Embedding” to incorporate spatial information. They are then iteratively refined through multiple “Transformer Decoder Layers”, attending to the encoder features. Finally, the refined queries from the decoder are fed into a separate “Class Prediction Head” and “Box Prediction Head” to generate the “Class Predictions” and “Box Predictions”, respectively.
- Detection Outputs: The outputs from the prediction heads constitute the “Decoder Outputs”. Notably, RT-DETR produces the “Final Detection Results” without requiring Non-Maximum Suppression (NMS), which is a significant advantage for real-time performance.
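As referenced above, the following is a minimal sketch of the uncertainty-minimal selection step; the scalar form of U and the auxiliary localization-confidence head (loc_head) are simplifying assumptions for illustration, not RT-DETR's exact formulation.

```python
import torch

def uncertainty_minimal_selection(enc_features, class_head, loc_head, box_head, k=300):
    """Score each flattened encoder token by the discrepancy U = |P - C| between its
    localization confidence P and classification confidence C, then keep the K tokens
    with minimal uncertainty as initial object queries (content + reference boxes)."""
    cls_conf = class_head(enc_features).sigmoid().max(dim=-1).values  # C per token
    loc_conf = loc_head(enc_features).sigmoid().squeeze(-1)           # P per token (assumed head)
    uncertainty = (loc_conf - cls_conf).abs()                         # U = ||P - C||
    topk = uncertainty.topk(k, largest=False).indices                 # minimal-uncertainty tokens
    init_content = enc_features[topk]                                 # initial query content
    init_anchors = box_head(enc_features[topk]).sigmoid()             # initial reference boxes
    return init_content, init_anchors
```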
- Performance Metrics: The evaluation of performance focuses not only on whether the model can detect objects but also on the accuracy of its localization and its robustness across different scenarios.
- AP (Average Precision): This is the core evaluation metric for the object detection task. It integrates the model’s precision and recall, calculated as the area under the precision–recall (P-R) curve across different confidence thresholds. In the context of the COCO dataset, AP typically refers to the average of AP values calculated at 10 different IoU thresholds, ranging from 0.5 to 0.95 with a step of 0.05. This comprehensively reflects the model’s overall performance under various localization accuracy requirements.
- AP50 and AP75: These are AP values at specific IoU thresholds. AP50 (AP at IoU = 0.50) uses a relatively lenient IoU criterion (0.5), a traditional metric from the PASCAL VOC challenge, which primarily measures the model’s ability to detect objects. In contrast, AP75 (AP at IoU = 0.75) employs a stricter localization standard, requiring a higher degree of overlap between the predicted and ground truth boxes, thereby better reflecting the model’s precise localization capability.
- APs, APm, and APl: This set of metrics is used to evaluate the model’s performance on objects of different scales, which is crucial for analyzing its scale robustness. According to the COCO definition, APs corresponds to small objects (area < 32 × 32 pixels), APm to medium objects (32 × 32 < area < 96 × 96 pixels), and APl to large objects (area > 96 × 96 pixels). These metrics allow for an analysis of whether the model has weaknesses in detecting objects of specific sizes, particularly small ones.
- Efficiency Metrics: In addition to performance, the model’s computational cost and inference speed are key to measuring its practical value.
- Parameters: These refer to the total number of learnable parameters in the model, usually measured in millions (Ms). It directly determines the model’s size, affecting storage requirements and loading times.
- GFLOPs (Giga Floating-Point Operations): This indicates the number of Giga Floating-Point Operations required for a single forward pass of the model. It is a theoretical metric for computational complexity, decoupled from specific hardware, allowing for a fair comparison of the computational demands of different models.
- FPS (Frames Per Second): This measures the number of image frames that the model can process per second on specific hardware. It is a highly practical metric that directly reflects the model’s operational speed in real-world deployment scenarios.
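Parameter count and FPS can be measured directly; below is a small sketch for CPU timing (for GPU measurements, torch.cuda.synchronize() would be needed around the timed region). The stand-in model and input size are illustrative assumptions.

```python
import time
import torch
import torch.nn as nn

def count_parameters_m(model):
    """Total learnable parameters, in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640), iters=100, warmup=10):
    """Average frames per second for a fixed input size on the current device."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):                      # warm-up runs are excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return iters / (time.perf_counter() - start)

# Example with a stand-in model (any nn.Module accepting the input works):
model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
print(f"{count_parameters_m(model):.2f} M params, {measure_fps(model):.1f} FPS")
```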
3. At a similar level of performance: RT-DETR-R50 (53.1 AP) achieves accuracy comparable to the SOTA YOLOv8l (52.9 AP) and YOLOv10l (53.2 AP) while maintaining a highly competitive inference speed of 108 FPS.
4. At a similar speed level: when comparing models in the ~100 FPS range, RT-DETR-R50’s accuracy of 53.1 AP is significantly higher than that of established models like YOLOv5l (49.0 AP).
- Architecture Simplification and Improvement: D2ETR [77] explored the possibility of using only the decoder. Recurrent DETR attempts to introduce a recursive mechanism to process temporal data to improve real-time performance. RT-DETRv2 [78] further enhances the practicality and performance of RT-DETR by optimizing the training strategy and introducing adjustments that do not increase inference cost (Bag of Freebies).
- Model Compression: Model quantization [79] reduces the bit width of model parameters and activation values to shrink model size and accelerate computation (a minimal post-training quantization sketch follows this list). AQ-DETR [80] explored low-bit quantization-aware training for DETR. Pruning [81] reduces complexity by removing redundant parameters or structural units (such as attention heads or FFN units) from the model. Pruning DETR [82] improved inference efficiency through a sparse structured pruning method.
- Lightweight Design: These methods start directly at the architectural level, aiming to design lighter DETR variants. This includes adopting sparse attention mechanisms (e.g., Deformable DETR and Lite DETR), using lightweight CNN backbone networks such as MobileNet [83], and reducing the number of layers or hidden dimensions of the Transformer encoder–decoder. For example, L-DETR [84] balances the efficiency and accuracy of object detection by combining the DETR framework with the lightweight backbone network PP-LCNet [85].
- Edge-Side Optimization: With the growing demand for edge computing, research efforts have begun to focus on combining hardware characteristics and algorithm optimization to efficiently deploy DETR models on resource-constrained edge devices. Works like SpeedDETR [86] introduced hardware-aware latency prediction models to guide Transformer architecture design, achieving efficient inference on edge GPUs while balancing accuracy and speed. This direction requires closer collaboration between algorithm designers and hardware engineers.
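As noted in the compression item above, one of the simplest quantization routes is post-training dynamic quantization of the Transformer's linear layers. The sketch below applies PyTorch's built-in API to a stand-in module; AQ-DETR's quantization-aware training is considerably more involved, so this is only an illustration of the general idea.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Stand-in for a Transformer FFN block; for a real DETR variant the same call would
# quantize all nn.Linear layers (the bulk of the Transformer's weights) to int8.
model = nn.Sequential(nn.Linear(256, 2048), nn.ReLU(), nn.Linear(2048, 256))

quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)   # Linear layers are replaced by dynamically quantized equivalents
```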
3.4. Specialized Functionality Expansion: Broadening Application Boundaries
3.4.1. Dense Prediction Tasks
- Instance Segmentation: Since DETR essentially performs set prediction, it can be naturally extended to simultaneously predict bounding boxes and pixel-level masks for objects. For example, Mask DINO [90], based on DINO, achieved leading performance in instance segmentation tasks by adding a parallel mask detection head and combining strategies such as instance-level contrastive learning.
- Panoptic Segmentation: DETR’s set prediction idea is also applicable to panoptic segmentation, a task that requires simultaneously segmenting and identifying “Things” and “Stuff” in an image. The original DETR paper has already initially verified its feasibility for this task. Subsequent research, such as Panoptic SegFormer [19], further optimized its panoptic segmentation capabilities by combining a DETR-style set prediction mechanism with a semantically guided segmentation head, improving the consistency of the model at the semantic and instance levels, and significantly increasing panoptic segmentation accuracy, especially in modeling “Stuff” regions.
3.4.2. Three-Dimensional (3D) Vision Tasks
3.4.3. Open-Vocabulary Object Detection Tasks
3.4.4. Other Frontier Vision Tasks
- Continual/Incremental Learning: This line of work studies how to prevent DETR models from forgetting previously learned categories when learning new ones, addressing the real-world challenges of expensive data annotation and continually growing category sets. Works such as Incremental-DETR [99] and the Continual Detection Transformer [100] investigate how to effectively mitigate catastrophic forgetting in DETR models.
- Weakly/Semi-Supervised Learning: These methods aim to complete detection tasks with limited annotation. For example, Semi-DETR [101] explores training paradigms that leverage large amounts of unlabeled data.
- Domain Adaptation: The goal is to improve the generalization ability of models across different environments and datasets. Works represented by Mean Teacher DETR [104] utilize consistency regularization to align predictions between source and target domains.
- Hybrid Models and Automation: To further improve performance and reduce design costs, hybrid models [107] that combine the advantages of DETR with other detectors (such as YOLO), as well as automated DETR design using techniques like Neural Architecture Search, are also emerging research directions [108].
4. Applications of DETR in Specific Domains
4.1. Autonomous Driving
- The wide variety of traffic participants, such as vehicles, pedestrians, cyclists, and traffic lights, and their complex and dynamic behavior patterns.
- Severe challenges to the robustness of perception algorithms posed by drastic changes in environmental factors such as lighting and weather.
- Frequent mutual occlusion phenomena between objects.
- High demand for accurate estimation of objects’ precise position, size, and pose in 3D space.
- The need to meet the stringent computational efficiency and low latency required for real-time vehicle decision-making [116].
4.1.1. Three-Dimensional Spatial Perception: A Vision-Based Paradigm Shift
4.1.2. Robustness Enhancement: New Avenues for Multi-Model Fusion
4.1.3. Real-Time Performance and Efficiency: A Head-to-Head with CNNs
4.2. Medical Image Analysis
4.2.1. Advantages and Comparison in Dense and Small Lesion Detection
4.2.2. Addressing Data Scarcity and Class Imbalance
4.2.3. Three-Dimensional Volumetric Data Processing and Interpretability
4.3. Remote Sensing Image Analysis
- Images are large in size and high in resolution, making direct processing computationally extremely expensive.
- Object scale varies drastically, with objects spanning multiple sizes potentially coexisting in the same scene.
- Small and dense objects are prevalent, such as dense building clusters, vehicles in parking lots, etc.
- Object orientation is arbitrary, with many objects not aligned horizontally or vertically.
- Backgrounds are complex and diverse, with rich types of ground objects, easily causing interference.
4.3.1. Handling Large-Size Images and Oriented Object Detection
4.3.2. Detecting Small and Dense Objects and Task-Specific Optimization
4.4. Other Frontier Application Explorations
4.4.1. Pedestrian Detection
4.4.2. Fine-Grained Visual Categorization
4.4.3. Video Understanding
4.4.4. Industrial Defect Detection
5. Advanced Challenges and Future Research Directions
5.1. Toward Extreme Efficiency: A Roadmap for Edge Deployment
5.2. Overcoming the Final Hurdles in Small Object Detection
5.3. Enhancing Generalization and Reliability in Open Environments
5.4. Deepening Theoretical Understanding, Interpretability, and Reliability
5.5. Exploring Synergy with Multimodal and Other Frontier Technologies
5.6. Strategic Overview of Synthesis
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
DETR | DEtection TRansformer; |
CV | Computer Vision; |
AI | Artificial intelligence; |
RPN | Region Proposal Network; |
NMS | Non-Maximum Suppression; |
NLP | Natural Language Processing; |
CNN | Convolutional Neural Network; |
ViT | Vision Transformer; |
MSA | Multi-Head Self-Attention; |
FFN | Feed-Forward Network; |
LN | Layer Normalization; |
MLP | Multi-Layer Perceptron; |
IoU | Intersection over Union; |
NLL | Negative Log-likelihood; |
SOTA | State of the Art; |
CDN | Contrastive Denoising Training; |
RAQG | Ranking-Based Adaptive Query Generation; |
SGL1 | Soft Gradient L1 Loss; |
AP | Average Precision; |
GT | Ground Truth; |
APs | Average Precision, small; |
APl | Average Precision, large; |
APm | Average Precision, medium; |
BEV | Bird’s Eye View; |
OVD | Open-Vocabulary Object Detection; |
VLMs | Vision-Language Models; |
FPN | Feature Pyramid Network; |
OOD | Oriented Object Detection; |
RADA | Rotation-Aligned Deformable Attention; |
FGVC | Fine-Grained Visual Categorization; |
TAL | Temporal Action Localization; |
PCB | Printed Circuit Board; |
CAVs | Concept Activation Vectors; |
GANs | Generative Adversarial Networks. |
References
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 6517–6525. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Hang, Y.; Fan, W.T. A Survey on Transformer-Based Object Detection: Advances and Applications. Mod. Inf. Technol. 2021, 5, 4. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Only Conference, 3–7 May 2021; pp. 1–16. [Google Scholar]
- Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; pp. 850–855. [Google Scholar]
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving Object Detection with One Line of Code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5562–5570. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An Empirical Study of Spatial Attention Mechanisms in Deep Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6687–6696. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–22 June 2019; pp. 3141–3149. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 7242–7252. [Google Scholar]
- Wanigasekara, P.; Qin, K.; Barut, E.; Yang, F.; Ruan, W.; Su, C. Semantic VL-BERT: Visual Grounding via Attribute Learning. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the 16th European Conference on Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, L.C. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Huang, Y.; Liu, H.; Shuai, H.; Cheng, W. DQ-DETR: DETR with Dynamic Query for Tiny Object Detection. In Proceedings of the 18th European Conference on Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 290–305. [Google Scholar]
- Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. In Proceedings of the International Conference on Learning Representations, Virtual Only Conference, 25–29 April 2022; pp. 1–23. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, M.L.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2239–2251. [Google Scholar] [CrossRef]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.; Shum, Y.H. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–19. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Liu, S.; Deng, W. Very deep convolutional neural network based image classification using small training sample size. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 730–734. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In Proceedings of the 13th European Conference on Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
- Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised Pre-training for Object Detection with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1601–1610. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y. Conditional DETR for Fast Training Convergence. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3631–3640. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
- Cai, Z.; Liu, S.; Wang, G.; Ge, Z.; Zhang, X.; Huang, D. Align-DETR: Improving DETR with Simple IoU-aware BCE loss. arXiv 2023, arXiv:2304.07527. [Google Scholar] [CrossRef]
- Sun, Z.; Cao, S.; Yang, Y.; Kitani, K. Rethinking Transformer-based Set Prediction for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3591–3600. [Google Scholar]
- Liu, S.; Ren, T.; Chen, J.; Zeng, Z.; Zhang, H.; Li, F. Detection Transformer with Stable Matching. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6468–6477. [Google Scholar]
- Wang, Y.; Ha, J. Improved Object Detection with Content and Position Separation in Transformer. Remote Sens. 2024, 16, 353. [Google Scholar] [CrossRef]
- Li, Y.; Miao, N.; Ma, L.; Shuang, F.; Huang, X. Transformer for object detection: Review and benchmark. Eng. Appl. Artif. Intell. 2023, 126, 107021. [Google Scholar] [CrossRef]
- Huang, J.; Li, T. Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion. In Proceedings of the 2024 ACM ICMR Workshop on Multimodal Video Retrieval, New York, NY, USA, 10–14 June 2024; pp. 39–44. [Google Scholar]
- Hou, X.; Liu, M.; Zhang, S.; Wei, P.; Chen, B.; Lan, X. Relation DETR: Exploring Explicit Position Relation Prior for Object Detection. In Proceedings of the 18th European Conference on Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 89–105. [Google Scholar]
- Hoanh, N.; Pham, T.V. Focus-Attention Approach in Optimizing DETR for Object Detection from High-Resolution Images. Knowl.-Based Syst. 2024, 296, 10. [Google Scholar] [CrossRef]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. In Proceedings of the 10th International Conference on Learning Representations, Virtual Only Conference, 25–29 April 2022; pp. 220–240. [Google Scholar]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
- Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Shen, X.Z.X. DEIM: DETR with Improved Matching for Fast Convergence. arXiv 2024, arXiv:2412.04234. [Google Scholar]
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar]
- Wang, T.; Yuan, L.; Chen, Y.; Feng, J.; Yan, S. PnP-DETR: Towards Efficient Visual Analysis with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4641–4650. [Google Scholar]
- Li, F.; Zeng, A.; Liu, S.; Zhang, H.; Li, H.; Zhang, L. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18558–18567. [Google Scholar]
- Bahaduri, B.; Talaoubrid, H.; Ming, Z.; Mokraoui, A. Indirect Attention: Ia-Detr for One Shot Object Detection. In Proceedings of the 13th International Conference on Learning Representations, Singapore Expo, Changi, Singapore, 24–28 April 2025; pp. 1–12. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
- Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
- Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Wang, Y.; Zhang, Y.; Tian, J.; Shi, Z. SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 15539–15547. [Google Scholar]
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query Design for Transformer-Based Detector. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2567–2575. [Google Scholar] [CrossRef]
- Chen, X.; Wei, F.; Zeng, G.; Wang, J. Conditional detr v2: Efficient detection transformer with box queries. arXiv 2022, arXiv:2207.08914. [Google Scholar] [CrossRef]
- Gao, F.; Leng, J.; Gan, J.; Gao, X. Ranking-based adaptive query generation for DETRs in crowded pedestrian detection. Neurocomputing 2025, 612, 128710. [Google Scholar] [CrossRef]
- Zhang, G.; Luo, Z.; Yu, Y.; Cui, K.; Lu, S. Accelerating DETR Convergence via Semantic-Aligned Matching. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 939–948. [Google Scholar]
- Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
- Pu, Y.; Liang, W.; Hao, Y.; Yuan, Y.; Yang, Y.; Zhang, C.; Hu, H.; Huang, G. Rank-DETR for high quality object detection. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; p. 708. [Google Scholar]
- Choi, H.K.; Paik, C.K.; Ko, H.W.; Park, M.C.; Kim, H.J. Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes. IEEE Access 2023, 11, 78623–78643. [Google Scholar] [CrossRef]
- Li, M.; Jia, T.; Lu, H.; Ma, B.; Wang, H.; Chen, D. CSPCL: Category Semantic Prior Contrastive Learning for Deformable DETR-Based Prohibited Item Detectors. arXiv 2025, arXiv:2501.16665. [Google Scholar] [CrossRef]
- Hadsell, R.; Chopra, S.; Lecun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the 9th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–18. [Google Scholar]
- Cao, X.; Yuan, P.; Feng, B.; Niu, K. CF-DETR: Coarse-to-Fine Transformers for End-to-End Object Detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 185–193. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
- Zhang, G.; Liu, S.; Wang, F.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar] [CrossRef]
- Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
- Lin, J.; Mao, X.; Chen, Y.; Xu, L.; He, Y.; Xue, H. D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention. In Proceedings of the 10th International Conference on Learning Representations, Virtual Only Conference, 25–29 April 2022; pp. 1–15. [Google Scholar]
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
- Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1–9. [Google Scholar]
- Wang, R.; Sun, H.; Yang, L.; Lin, S.; Liu, C.; Gao, Y.; Hu, Y.; Zhang, B. AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries. Proc. AAAI Conf. Artif. Intell. 2024, 38, 15598–15606. [Google Scholar] [CrossRef]
- Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143. [Google Scholar]
- Sun, H.; Zhang, S.; Tian, X.; Zou, Y. Pruning DETR: Efficient end-to-end object detection with sparse structured pruning. Signal Image Video Process. 2024, 18, 129–135. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Li, T.; Wang, J.; Zhang, T. L-DETR: A Light-Weight Detector for End-to-End Object Detection with Transformers. IEEE Access 2022, 10, 105685–105692. [Google Scholar] [CrossRef]
- Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A lightweight CPU convolutional neural network. arXiv 2021, arXiv:2109.15099. [Google Scholar]
- Dong, P.; Kong, Z.; Meng, X.; Zhang, P.; Tang, H.; Wang, Y.; Chou, C.H. SpeedDETR: Speed-aware Transformers for End-to-end Object Detection. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 8227–8243. [Google Scholar]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 200. [Google Scholar] [CrossRef]
- Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Simultaneous Detection and Segmentation. In Proceedings of the 13th European Conference on Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 297–312. [Google Scholar]
- Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollar, P. Panoptic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9396–9405. [Google Scholar]
- Li, F.; Zhang, H.; Xu, H.; Liu, S.; Zhang, L.; Ni, L.M. Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 3041–3050. [Google Scholar]
- Palazzi, A.; Borghi, G.; Abati, D.; Calderara, S.; Cucchiara, R. Learning to map vehicles into bird’s eye view. In Proceedings of the Image Analysis and Processing-ICIAP 2017: 19th International Conference, Catania, Italy, 11–15 September 2017; pp. 233–243. [Google Scholar]
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-view 3D Object Detection. In Proceedings of the 17th European Conference on Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 531–548. [Google Scholar]
- Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.F. Open-Vocabulary Object Detection Using Captions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14388–14397. [Google Scholar]
- Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Zang, Y.; Li, W.; Zhou, K.; Huang, C.; Loy, C.C. Open-Vocabulary DETR with Conditional Matching. In Proceedings of the 17th European Conference on Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 106–122. [Google Scholar]
- Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z. MeMOT: Multi-Object Tracking with Memory. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8080–8090. [Google Scholar]
- Wu, J.; Jiang, Y.; Bai, S.; Zhang, W.; Bai, X. Seqformer: Sequential transformer for video instance segmentation. In Proceedings of the 17th European Conference on Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 553–569. [Google Scholar]
- Dong, N.; Zhang, Y.; Ding, M.; Lee, G.H. Incremental-DETR: Incremental few-shot object detection via self-supervised learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; p. 60. [Google Scholar]
- Liu, Y.; Schiele, B.; Vedaldi, A.; Rupprecht, C. Continual Detection Transformer for Incremental Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23799–23808. [Google Scholar]
- Zhang, J.; Lin, X.; Zhang, W.; Wang, K.; Tan, X.; Han, J. Semi-DETR: Semi-Supervised Object Detection with Detection Transformers. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23809–23818. [Google Scholar]
- Pandey, T.; Pears, N.; Smith, W.A.; McDermid, J.A. E-DETR: Evidential Deep Learning for End-to-End Uncertainty Estimation in Object Detection. In Proceedings of the 13th International Conference on Learning Representations, Singapore Expo, Changi, Singapore, 24–28 April 2025; pp. 1–14. [Google Scholar]
- Kaplan, M.S.; Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018; pp. 3183–3193. [Google Scholar]
- Weng, W.; Yuan, C. Mean teacher DETR with masked feature alignment: A robust domain adaptive detection transformer framework. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; p. 657. [Google Scholar]
- Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023, 11, 696. [Google Scholar] [CrossRef]
- Lv, Z.; Dong, S.; Xia, Z.; He, J.; Zhang, J. Enhanced real-time detection transformer (RT-DETR) for robotic inspection of underwater bridge pier cracks. Autom. Constr. 2025, 170, 105921. [Google Scholar] [CrossRef]
- Ouyang, H. Deyov3: Detr with yolo for real-time object detection. arXiv 2023, arXiv:2309.11851. [Google Scholar]
- Zhou, D.; Jin, X.; Lian, X.; Yang, L.; Xue, Y.; Hou, Q. AutoSpace: Neural Architecture Search with Less Human Interference. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 327–336. [Google Scholar]
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1780–1790. [Google Scholar]
- Zhang, Y.; Chen, J.; Huang, D. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 908–917. [Google Scholar]
- Shi, F.; Gao, R.; Huang, W.; Wang, L. Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1181–1198. [Google Scholar] [CrossRef]
- Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2019, 37, 362–386. [Google Scholar] [CrossRef]
- Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
- Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796. [Google Scholar] [CrossRef]
- Arnold, E.; Al-Jarrah, O.Y.; Dianati, M.; Fallah, S.; Oxtoby, D.; Mouzakitis, A. A Survey on 3D Object Detection Methods for Autonomous Driving Applications. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3782–3795. [Google Scholar] [CrossRef]
- Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862. [Google Scholar]
- Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 180–191. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
- Mushtaq, H.; Deng, X.; Azhar, F.; Ali, M.; Sherazi, H.H.R. PLC-Fusion: Perspective-Based Hierarchical and Deep LiDAR Camera Fusion for 3D Object Detection in Autonomous Vehicles. Information 2024, 15, 739. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Liang, S.; Ning, X.; Yu, J.; Guo, K.; Lu, T.; Tang, C. Efficient Computing Platform Design for Autonomous Driving Systems. In Proceedings of the 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, 18–21 January 2021; pp. 734–741. [Google Scholar]
- Khalifa, M.; Albadawy, M. AI in diagnostic imaging: Revolutionising accuracy and efficiency. Comput. Methods Programs Biomed. Update 2024, 5, 100146. [Google Scholar] [CrossRef]
- Tajbakhsh, N.; Jeyaseelan, L.; Li, Q.; Chiang, J.N.; Wu, Z.; Ding, X. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Med. Image Anal. 2020, 63, 101693. [Google Scholar] [CrossRef]
- Tizhoosh, H.R.; Pantanowitz, L. Artificial Intelligence and Digital Pathology: Challenges and Opportunities. J. Pathol. Inform. 2018, 9, 38. [Google Scholar] [CrossRef]
- Sun, H.; Wang, J. Computational biomedical imaging: AI innovations and pitfalls. Med. Plus 2025, 2, 100081. [Google Scholar] [CrossRef]
- Singh, S.P.; Wang, L.; Gupta, S.; Goli, H.; Padmanabhan, P.; Gulyás, B. 3D Deep Learning on Medical Images: A Review. Sensors 2020, 20, 5097. [Google Scholar] [CrossRef]
- Neves, C.P.C.; Teixeira, L.F. Explainable Deep Learning Methods in Medical Image Classification: A Survey. ACM Comput. Surv. 2023, 56, 85. [Google Scholar] [CrossRef]
- Tang, J.; Chen, X.; Fan, L.; Zhu, Z.; Huang, C. LN-DETR: An efficient Transformer architecture for lung nodule detection with multi-scale feature fusion. Neurocomputing 2025, 633, 129827. [Google Scholar] [CrossRef]
- Xu, Y.; Shen, Y.; Fernandez-Granda, C.; Heacock, L.; Geras, K.J. Understanding differences in applying DETR to natural and medical images. arXiv 2024, arXiv:2405.17677. [Google Scholar] [CrossRef]
- Rani, V.; Kumar, M.; Gupta, A.; Sachdeva, M.; Mittal, A.; Kumar, K. Self-supervised learning for medical image analysis: A comprehensive review. Evol. Syst. 2024, 15, 1607–1633. [Google Scholar] [CrossRef]
- Wittmann, B.; Navarro, F.; Shit, S.; Menze, B. Focused decoding enables 3D anatomical detection by transformers. arXiv 2022, arXiv:2207.10774. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Wang, H.; Li, C.; Wu, Q.; Wang, J. An Improved DETR Based on Angle Denoising and Oriented Boxes Refinement for Remote Sensing Object Detection. Remote Sens. 2024, 16, 4420. [Google Scholar] [CrossRef]
- Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens. 2021, 13, 4779. [Google Scholar] [CrossRef]
- Ma, T.; Mao, M.; Zheng, H.; Gao, P.; Wang, X.; Han, S.; Ding, E.; Zhang, B.; Doermann, D. Oriented object detection with transformer. arXiv 2021, arXiv:2106.03146. [Google Scholar]
- Dong, C.; Jiang, S.; Sun, H.; Li, J.; Yu, Z.; Wang, J.; Wang, J. QEDetr: DETR with Query Enhancement for Fine-Grained Object Detection. Remote Sens. 2025, 17, 893. [Google Scholar] [CrossRef]
- Cao, X.; Wang, H.; Wang, X.; Hu, B. DFS-DETR: Detailed-Feature-Sensitive Detector for Small Object Detection in Aerial Images Using Transformer. Electronics 2024, 13, 3404. [Google Scholar] [CrossRef]
- Zhang, X.; Liu, Q.; Chang, H.; Sun, H. High-Resolution Network with Transformer Embedding Parallel Detection for Small Object Detection in Optical Remote Sensing Images. Remote Sens. 2023, 15, 4497. [Google Scholar] [CrossRef]
- He, X.; Liang, K.; Zhang, W.; Li, F.; Jiang, Z.; Zuo, Z.; Tan, X. DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query. Remote Sens. 2024, 16, 3516. [Google Scholar] [CrossRef]
- Kong, Y.; Shang, X.; Jia, S. Drone-DETR: Efficient Small Object Detection for Remote Sensing Image Using Enhanced RT-DETR Model. Sensors 2024, 24, 5496. [Google Scholar] [CrossRef]
- Zheng, A.; Zhang, Y.; Zhang, X.; Qi, X.; Sun, J. Progressive end-to-end object detection in crowded scenes. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 857–866. [Google Scholar]
- Huang, Y.; Yuan, G. AD-DETR: DETR with asymmetrical relation and decoupled attention in crowded scenes. Math. Biosci. Eng. 2023, 20, 14158–14179. [Google Scholar] [CrossRef]
- Paul, D.; Chowdhury, A.; Xiong, X.; Chang, F.-J.; Carlyn, D.; Stevens, S.; Provost, K.L.; Karpatne, A.; Carstens, B.; Rubenstein, D.; et al. A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 1–26. [Google Scholar]
- Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. Trackformer: Multi-object tracking with transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854. [Google Scholar]
- Wang, B.; Zhao, Y.; Yang, L.; Long, T.; Li, X. Temporal Action Localization in the Deep Learning Era: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2171–2190. [Google Scholar] [CrossRef]
- Lu, C.; Mak, M. DITA: DETR with improved queries for end-to-end temporal action detection. Neurocomputing 2024, 596, 127914. [Google Scholar] [CrossRef]
- Jin, J.; Feng, W.; Lei, Q.; Gui, G.; Wang, W. PCB defect inspection via Deformable DETR. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 646–651. [Google Scholar]
- Chazhoor, A.A.P.; Ho, E.S.L.; Gao, B.; Woo, W.L. A Review and Benchmark on State-of-the-Art Steel Defects Detection. SN Comput. Sci. 2023, 5, 114. [Google Scholar] [CrossRef]
- Su, Z.; Shao, Y.; Li, P.; Zhang, X.; Zhang, H. Improved RT-DETR Network for High-Quality Defect Detection on Digital Printing Fabric. J. Nat. Fibers 2025, 22, 2476634. [Google Scholar] [CrossRef]
- Du, X.; Zhang, X.; Tan, P. RT-DETR based Lightweight Design and Optimization of Thermal Infrared Object Detection for Resource-Constrained Environments. In Proceedings of the 2024 43rd Chinese Control Conference (CCC), Kunming, China, 28–31 July 2024; pp. 7917–7922. [Google Scholar]
- Liu, H.I.; Tseng, K.Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A DeNoising FPN With Transformer R-CNN for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
- Muandet, K.; Balduzzi, D.; Schölkopf, B. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 10–18. [Google Scholar]
- Wiegreffe, S.; Pinter, Y. Attention is not not explanation. arXiv 2019, arXiv:1908.04626. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- Ribeiro, M.T.; Singh, S.; Guestrin, C. Why Should I Trust You?: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
- Lundberg, S.M.; Lee, S. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. [Google Scholar]
- Munir, M.A.; Khan, S.H.; Khan, M.H.; Ali, M.; Khan, F.S. Cal-DETR: Calibrated detection transformer. Adv. Neural Inf. Process. Syst. 2023, 36, 71619–71631. [Google Scholar]
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1050–1059. [Google Scholar]
- Rui, K.; Hernandez, A.A.; Juanatas, R. Mask Wearing Detection Model based on Deformable Detr. In Proceedings of the 2022 IEEE 14th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), Boracay Island, Philippines, 10–14 December 2022; pp. 1–4. [Google Scholar]
Model | Year | Main Innovations | Core Problems Solved |
---|---|---|---|
DETR [22] | 2020 | Architecture: introduced transformer encoder–decoder for end-to-end object detection. Design: eliminated NMS and anchor boxes. | Simplified object detection pipeline by removing handcrafted components. |
Deformable DETR [11] | 2021 | Attention Mechanism: introduced deformable attention to focus on key sampling points. Performance: improved efficiency and small object detection. | Addressed slow convergence and limited feature resolution of the original DETR. |
Conditional DETR [38] | 2021 | Query Design: introduced conditional spatial queries for cross-attention. Training: sped up training convergence. | Solved slow training convergence of DETR. |
DAB-DETR [48] | 2022 | Query Design: used Dynamic Anchor Box as Queries. Training: improved convergence speed. | Solved slow training convergence and performance limitations of DETR. |
DN-DETR [28] | 2022 | Training Method: introduced denoising training with noisy ground truth boxes. Matching: stabilized bipartite matching. | Accelerated training convergence of DETR-like models. |
DINO [29] | 2023 | Query Design: combined dynamic anchor boxes and denoising training. Performance: achieved SOTA results. | Improved upon earlier variants’ slow convergence and performance issues. |
RT-DETR [49] | 2024 | Architecture: used efficient hybrid encoder and IoU-aware query selection. Real-time: designed for high-speed inference. | Achieved high-accuracy real-time object detection, reducing computational costs. |
DEIM [50] | 2025 | Matching Strategy: introduced dense O2O matching to increase positive samples. Loss Function: proposed matchability-aware loss (MAL). | Addressed sparse supervision and low-quality matches in DETR, improving convergence and accuracy. |
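To make the denoising training summarized above for DN-DETR [28] and DINO [29] more concrete, the following minimal sketch shows one plausible way noised copies of ground-truth boxes could be generated as auxiliary decoder queries that the model learns to reconstruct. The function name, noise scale, and jitter scheme are illustrative assumptions rather than the exact procedure of either paper.

```python
import torch

def make_noised_queries(gt_boxes: torch.Tensor,
                        box_noise_scale: float = 0.4) -> torch.Tensor:
    """Create noised copies of ground-truth boxes (cx, cy, w, h, normalized to [0, 1]).

    gt_boxes: (N, 4) tensor of normalized boxes.
    Returns a (N, 4) tensor of jittered boxes that a denoising branch
    would be trained to map back to the originals.
    """
    cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
    # Shift centers by up to +/- half the box size, scaled by the noise level.
    center_jitter = (torch.rand_like(cxcy) * 2 - 1) * 0.5 * wh * box_noise_scale
    # Rescale width/height by a random factor around 1.
    size_jitter = 1 + (torch.rand_like(wh) * 2 - 1) * box_noise_scale
    noised = torch.cat([cxcy + center_jitter, wh * size_jitter], dim=1)
    return noised.clamp(0.0, 1.0)

# Example: three ground-truth boxes produce three auxiliary "denoising" queries.
gt = torch.tensor([[0.50, 0.50, 0.20, 0.30],
                   [0.30, 0.70, 0.10, 0.10],
                   [0.80, 0.20, 0.25, 0.15]])
print(make_noised_queries(gt))
```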
Models | Backbone | Epochs | AP | AP50 | AP75 | APs | APm | APl | Params (M) | GFLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
DETR [22] | ResNet-50 | 500 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41 | 86 | 28 |
DETR-DC5 [22] | ResNet-50 | 500 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41 | 187 | 12 |
Deformable DETR [11] | ResNet-50 | 50 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 | 40 | 173 | 19 |
Sparse DETR [27] | ResNet-50 | 50 | 46.0 | 65.9 | 49.7 | 29.1 | 49.1 | 60.6 | 41 | 121 | 23.2 |
Sparse DETR [27] | Swin Transformer | 50 | 49.3 | 69.5 | 53.3 | 32.0 | 52.7 | 64.9 | 41 | 144 | 17.2 |
Conditional DETR [38] | ResNet-50 | 108 | 43.0 | 64.0 | 45.7 | 22.7 | 46.7 | 61.5 | 44 | 90 | 17.8 |
Conditional DETR [38] | ResNet-101 | 108 | 44.5 | 65.6 | 47.5 | 23.6 | 48.4 | 63.6 | 63 | 156 | - |
DAB-DETR [48] | ResNet-50-DC5 | 50 | 45.7 | 66.2 | 49.0 | 26.1 | 49.4 | 63.1 | 44 | 216 | 17.0 |
DAB-DETR [48] | ResNet-101-DC5 | 50 | 46.6 | 67.0 | 50.2 | 28.1 | 50.5 | 64.1 | 63 | 296 | - |
DN-DETR [28] | ResNet-50 | 12 | 43.4 | 61.9 | 47.2 | 24.8 | 46.8 | 59.4 | 48 | 195 | 13 |
DN-DETR [28] | ResNet-50 | 50 | 49.5 | 67.6 | 53.8 | 31.3 | 52.6 | 65.4 | 47 | 195 | 13 |
CF-DETR [71] | ResNet-50 + TEF | 36 | 47.8 | 66.5 | 52.4 | 31.2 | 50.6 | 62.8 | 41 | 173 | 16 |
CF-DETR [71] | ResNet-101 + TEF | 36 | 49.0 | 68.1 | 53.4 | 31.4 | 52.2 | 64.3 | 60 | 253 | 14 |
RT-DETR [49] | ResNet-50 | 72 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0 | 42 | 136 | 108 |
RT-DETR [49] | ResNet-101 | 72 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 | 76 | 259 | 74 |
YOLOv3 [25] | DarkNet | 300 | 37.0 | 58.9 | 39.3 | 20.5 | 41.2 | 49.0 | 62.0 | 70.7 | 51 |
YOLOv4 [72] | CSPNet | 300 | 43.5 | 65.7 | 47.3 | 26.7 | 46.7 | 53.3 | ~64 | 140 | 62 |
YOLOv5l [73] | CSPNet | 300 | 49.0 | 67.3 | 53.3 | 29.0 | 53.6 | 64.7 | 46.5 | 109 | ~99 |
YOLOX-L [74] | CSPNet | 300 | 50.0 | 68.5 | 54.5 | 29.8 | 54.5 | 64.4 | 54.2 | 155.6 | ~69 |
PP-YOLOE [75] | CSPNet | ~300 | 51.4 | 68.9 | 55.6 | 31.4 | 55.3 | 66.1 | 52.2 | 110.1 | 78.1 |
YOLOv8l [76] | Advanced CSPNet | ~300 | 52.9 | 69.8 | 57.5 | 35.3 | 58.3 | 69.8 | 43.7 | 165.2 | 110.4 |
YOLOV10l [7] | Enhanced CSPNet | 100 | 53.2 | 70.1 | 58.1 | 35.8 | 58.5 | 69.4 | 24.4 | 120.3 | 137.4 |
Model Variant | Key Sensitive Hyperparameters | Description and Significance |
---|---|---|
DETR [22] | Loss Weights (λ_class, λ_L1, λ_GIoU) | Description: These weights balance the classification loss, L1 box loss, and GIoU loss (their interaction is illustrated in the sketch following this table). Significance: Their configuration is critical as it directly impacts the balance between classification accuracy and localization precision. Incorrect balancing is a primary reason for slow convergence. |
Deformable DETR [11] | Number of Sampling Points (K) | Description: The number of key sampling points attended to by each query in the deformable attention mechanism. Significance: A core parameter that trades off computational efficiency and performance. A smaller K leads to faster speed but may lose fine-grained details, while a larger K improves accuracy (especially for small objects) at a higher computational cost. |
Conditional DETR [38] | Spatial Query Transformation | Description: Parameters of the FFN that generate the conditional spatial query from the 2D reference point. Significance: Controls the degree of spatial conditioning. This decoupling of content and spatial queries is key to accelerating convergence. Tuning this helps the model learn localization and recognition tasks more efficiently. |
DAB-DETR [48] | BBox Update Step Size (or learning rate) | Description: The step size for iteratively refining the 4D anchor box parameters (x, y, w, h) in each decoder layer. Significance: Directly controls the convergence of the box regression process. An appropriate step size ensures stable and progressive refinement of box predictions, which is the core mechanism of DAB-DETR. |
DN-DETR [28] | Denoising Loss Weight | Description: The weight of the auxiliary denoising task, which reconstructs ground-truth boxes from noised versions. Significance: This parameter controls the strength of the denoising supervision. A proper value is crucial for stabilizing the bipartite matching process and accelerating convergence, which is the core innovation of this model. |
DINO [29] | Contrastive Denoising Noise Scale | Description: The magnitude of noise applied to create positive and negative samples for Contrastive Denoising Training (CDN). Significance: The noise scale defines the difficulty of the contrastive task. It must be tuned to compel the model to learn a precise boundary between true objects and near-negatives, directly improving localization accuracy and reducing duplicates. |
RT-DETR [49] | Number of Initial Queries (K) | Description: The number of Top-K queries selected by the Uncertainty-Minimal Query Selection module to be fed into the decoder. Significance: This is a key parameter for balancing inference speed and accuracy. A smaller K significantly reduces the computational load in the decoder, enabling real-time performance, but may risk overlooking some objects in dense scenes. |
DEIM [50] | Matchability-Aware Loss (MAL) Parameters | Description: Hyperparameters within the Matchability-Aware Loss function, which modulates the loss based on the quality of the match. Significance: These parameters directly influence how the model prioritizes high-quality matches during training. Fine-tuning them is essential for improving positive sample density and match quality, which addresses the core issue of sparse supervision in DETR. |
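As a rough illustration of how the DETR loss weights in the first row of the table above combine the three loss terms, the sketch below computes a weighted sum over already-matched prediction/target pairs. The default weights (1, 5, 2) follow commonly reported settings, boxes are assumed to already be in normalized (x0, y0, x1, y1) format, and the Hungarian matching step is omitted; treat this as a simplified sketch rather than the reference implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def matched_pair_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                      w_cls=1.0, w_l1=5.0, w_giou=2.0):
    """Weighted DETR-style loss for already-matched query/target pairs.

    pred_logits: (N, num_classes + 1) raw class scores (last index = "no object").
    pred_boxes:  (N, 4) predicted boxes, normalized (x0, y0, x1, y1).
    tgt_labels:  (N,) ground-truth class indices.
    tgt_boxes:   (N, 4) ground-truth boxes in the same format.
    """
    cls_loss = F.cross_entropy(pred_logits, tgt_labels)
    l1_loss = F.l1_loss(pred_boxes, tgt_boxes)
    # generalized_box_iou returns an (N, N) matrix; the diagonal holds the
    # GIoU of each prediction with its own matched target.
    giou_loss = (1.0 - torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))).mean()
    return w_cls * cls_loss + w_l1 * l1_loss + w_giou * giou_loss

# Toy example: two matched pairs, three object classes plus background.
logits = torch.randn(2, 4)
boxes = torch.tensor([[0.10, 0.10, 0.40, 0.50],
                      [0.55, 0.20, 0.90, 0.80]])
labels = torch.tensor([0, 2])
print(matched_pair_loss(logits, boxes, labels, boxes + 0.02))
```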
Application Domains | Key Domain Challenges | DETR Adaptations and Improvement | Representative Works/Models |
---|---|---|---|
Autonomous Driving | Real-time, Robustness, 3D Perception, Multi-modal Fusion | 3D DETR, Fusion Strategies, Efficiency, Handling Occlusion/Density | DETR3D [117], BEV-Former [91], BEV-Fusion [118] |
Medical Image Analysis | Small Lesions, Data Scarcity, 3D Data, Interpretability | Small Object DETR, 3D DETR, Efficient Learning (Self/Semi-supervised), Interpretability/Reliability | LN-DETR [129], E-DETR [102] |
Remote Sensing Image Analysis | Large Images, Dense/Small/Oriented Objects, Complex Background | Large Image Handling, Small/Dense Object Queries/Features, Oriented Bounding Box (OBB) Handling | QEDetr [137], DETR-ORD [140] |
Pedestrian Detection | Crowded Scenes, Occlusion, Small Objects | Matching/Query Strategies for Crowds, Small Object Improvements | Recurrent DETR [67], AD-DETR [143] |
Fine-grained Visual Categorization | Subtle Differences, Local Region Localization | Attention for Key Region Localization, Feature Extraction for Classification | Interpretable Transformer [144] |
Video Understanding | Temporal Info, Object Identity, Action Localization | Processing Frame Sequences, Spatiotemporal Attention, Temporal Queries, Cross-frame Association | Trackformer [145], DITA [147] |
Industrial Defect Detection | Irregular/Tiny Defects, Low Contrast, Complex Background | Feature/Matching Strategies for Defects, Global Context, End-to-End Detection | PCB Defect [148], Steel Defect [149], Textile Defect [150] |
Phase | Projected Milestone Dates | Potential Objectives and Research Topics | Suggested Evaluation Metrics (Targets) |
---|---|---|---|
Phase 1: Foundational Optimization | 2025–2026 | Advanced Quantization: further developing robust 8-bit/4-bit Post-Training Quantization (PTQ) and quantization-aware training (QAT) schemes for DETR. Efficient Architecture Search: utilizing hardware-aware NAS to optimize backbones and decoder layers for latency on mobile CPUs/GPUs. | Model Size: <20 MB; Latency (ARM CPU): <100 ms/frame. COCO AP: maintain > 45 AP. |
Phase 2: Aggressive Compression | 2026–2028 | Extreme Quantization: exploring the feasibility of binary/ternary networks (BNN/TNN) for the most computationally intensive modules of DETR. Structural Pruning: designing algorithms for structured pruning of attention heads and FFN layers with minimal accuracy loss. | Model Size: <10 MB; Latency (ARM CPU): <50 ms/frame. COCO AP: maintain > 40 AP. |
Phase 3: Algorithm–Hardware Co-design | 2028+ | Integer-Only Attention: investigating novel, hardware-friendly attention mechanisms that may avoid softmax and floating-point operations. Co-Design with Accelerators: fostering collaboration with hardware engineers to design specialized NPUs or instruction sets for DETR-specific operations. | Model Size: <5 MB; Latency (ARM CPU): <10 ms/frame. COCO AP: achieve > 40 AP with minimal power draw. |
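To give a flavor of the Phase 1 quantization work sketched in the table above, the snippet below applies PyTorch's built-in dynamic post-training quantization to the linear layers of a DETR-style model and compares serialized sizes. The hub entry point, the int8 dynamic scheme (rather than static PTQ, 4-bit schemes, or QAT), and the size metric are illustrative assumptions; real edge deployments would additionally need calibration data and hardware-specific backends.

```python
import io
import torch

# Reference DETR model from the public hub entry point (assumes network access).
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Dynamic PTQ: nn.Linear weights (the bulk of the transformer encoder/decoder
# parameters) are stored as int8 and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_size_mb(m: torch.nn.Module) -> float:
    """Approximate model size as the byte length of its serialized state dict."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"FP32 model:        {serialized_size_mb(model):.1f} MB")
print(f"Int8 dynamic PTQ:  {serialized_size_mb(quantized):.1f} MB")
```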
Research Direction | Perceived Priority | Estimated Difficulty | Anticipated Data Requirements | Rationale |
---|---|---|---|---|
Extreme Efficiency | High | High | While algorithm development can start on standard benchmarks (e.g., COCO), extensive hardware-specific validation is necessary. | This direction is critical for unlocking widespread, real-world deployment on edge devices, a major current bottleneck. |
Small Object Detection | High | Medium | Progress may be constrained by the limitations of existing datasets. New benchmarks with higher resolution and denser, smaller objects could be required. | This addresses a persistent performance gap in DETR-like models, limiting their applicability in domains like aerial imagery and medical analysis. |
Generalization and Reliability | Medium | High | Research requires new evaluation protocols beyond AP (e.g., metrics for Out-of-Distribution robustness, calibration, fairness). Large-scale, diverse, and unlabeled data is beneficial. | This is crucial for building trust in safety-critical applications, though ensuring robust performance in open-world settings remains a formidable challenge. |
Interpretability and Theory | Low | Very High | Foundational work can be performed on existing datasets, but a deeper understanding likely necessitates new analytical tools and theoretical frameworks. | While perhaps less urgent for immediate performance gains, this is important for long-term trust, debugging, and scientific advancement. |
Synergy with Frontier Tech | Low | Very High | Progress likely depends on the availability of specialized, often multi-modal or simulation-based datasets (e.g., visual question answering, robotic interaction data). | Represents the long-term potential of DETR as a general perception module, but is highly exploratory and may require fundamental breakthroughs in multiple fields. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).