ETQ-Matcher: Efficient Quadtree-Attention-Guided Transformer for Detector-Free Aerial–Ground Image Matching
Abstract
1. Introduction
- (1) We propose an efficient quadtree-attention-guided transformer (ETQ-Matcher) to address the significant occlusions, scale differences, illumination changes, and repeated textures between aerial and ground images, which pose substantial difficulties for feature matching.
- (2) We devise a feature extraction network, the multi-layer transformer with channel attention (MTCA), which combines a multi-layer transformer encoder with a channel-attention decoder in an encoder–decoder structure. This design strengthens feature extraction and fusion, thereby improving subsequent matching performance (a minimal sketch of the channel-attention idea follows this list).
- (3) To refine image information, we develop quadtree-attention feature fusion (QAFF), which integrates self-attention and cross-attention: self-attention captures the global context and correlations within each image, and cross-attention is then computed only in relevant regions, minimizing the interference of irrelevant regions on feature expression (a simplified sketch of this coarse-to-fine attention also follows the list).
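To make contribution (2) concrete, the sketch below illustrates the squeeze-and-excitation style of channel attention that a channel-attention decoder builds on: global pooling summarizes each channel, and a small bottleneck MLP produces per-channel reweighting. This is our minimal illustration under stated assumptions (the `reduction=16` bottleneck and the exact placement inside MTCA are not specified here), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: pool each channel to a scalar,
    pass it through a bottleneck MLP, and rescale channels by the
    resulting sigmoid gates. Hyperparameters are illustrative."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the transformer encoder
        b, c, _, _ = x.shape
        weights = self.mlp(x.mean(dim=(2, 3)))   # (B, C) per-channel gates
        return x * weights.view(b, c, 1, 1)      # channel-wise rescaling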
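Contribution (3)'s coarse-to-fine attention can be pictured with a simplified two-level sketch: cross-attention scores are first computed on a pooled (coarse) grid, and full attention is then evaluated only inside each query's top-scoring coarse cells. This is our simplified illustration of the quadtree-attention idea; the published operator uses multiple pyramid levels with score propagation, and `topk` and the 2x2 pooling factor here are assumptions.

```python
import torch
import torch.nn.functional as F

def quadtree_cross_attention(q_feat, k_feat, v_feat, H, W, topk=8):
    """Two-level sketch of quadtree cross-attention.

    q_feat: (Nq, D) query tokens; k_feat, v_feat: (H*W, D) key/value
    tokens laid out on an H x W grid (H, W even). Each query attends
    only to the fine tokens inside its top-k coarse 2x2 cells.
    """
    Nq, D = q_feat.shape
    scale = D ** -0.5

    # Level-1 keys: average-pool the key tokens over 2x2 grid cells.
    k_grid = k_feat.t().reshape(1, D, H, W)
    k_coarse = F.avg_pool2d(k_grid, 2).reshape(D, -1).t()        # (M, D)

    # Index map from each coarse cell to its 4 fine children.
    ids = torch.arange(H * W, device=k_feat.device).reshape(H, W)
    children = (ids.reshape(H // 2, 2, W // 2, 2)
                   .permute(0, 2, 1, 3).reshape(-1, 4))          # (M, 4)

    # Level 1: rank coarse cells per query, keep the top-k.
    top_cells = (q_feat @ k_coarse.t() * scale).topk(topk, dim=-1).indices

    # Level 2: full attention restricted to the selected cells' children.
    cand = children[top_cells].reshape(Nq, -1)                   # (Nq, 4*topk)
    k_cand, v_cand = k_feat[cand], v_feat[cand]                  # (Nq, 4*topk, D)
    attn = torch.softmax(
        torch.einsum('nd,nkd->nk', q_feat, k_cand) * scale, dim=-1)
    return torch.einsum('nk,nkd->nd', attn, v_cand)              # (Nq, D)
```

Restricting the level-2 softmax to `4 * topk` candidate tokens is what gives the operator its sub-quadratic cost relative to full cross-attention over all `H * W` keys.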
2. Methods
2.1. Overview
2.2. Multi-Layer Transformer with Channel Attention Extractor
2.3. Quadtree Attention Feature Fusion
2.4. Coarse Matching
2.5. Fine Matching
2.6. Loss Function
2.6.1. Coarse Loss Function
2.6.2. Fine Loss Function
3. Experiment
3.1. Datasets
3.2. Model Settings
3.3. Methods and Evaluating Metrics
3.4. Matching Results
3.4.1. Matching Results Visual Analysis
3.4.2. Matching Results Quantitative Analysis
3.4.3. Generalization Performance Visual Analysis
3.4.4. Generalization Performance Quantitative Analysis
3.5. Ablation Analysis
3.6. Discussion
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Li, X.; Hu, T.; Li, Z.; Long, M.; Ye, Z.; Huang, J.; Yalikun, Y.; Liu, S.; Liu, Y.; Wang, D.; et al. MRRM: Advanced biomarker alignment in multi-staining pathology images via multi-scale ring rotation-invariant matching. IEEE J. Biomed. Health Inform. 2024, 29, 1189–1198.
- Li, X.; Li, Z.; Hu, T.; Long, M.; Ma, X.; Huang, J.; Liu, Y.; Yalikun, Y.; Liu, S.; Wang, D.; et al. MSGM: An advanced deep multi-size guiding matching network for whole slide histopathology images addressing staining variation and low visibility challenges. IEEE J. Biomed. Health Inform. 2024, 28, 6019–6030.
- Yang, W.; Xu, C.; Mei, L.; Yao, Y.; Liu, C. LPSO: Multi-source image matching considering the description of local phase sharpness orientation. IEEE Photonics J. 2022, 14, 7811109.
- Liang, C.; Dong, Y.; Zhao, C.; Sun, Z. A coarse-to-fine feature match network using transformers for remote sensing image registration. Remote Sens. 2023, 15, 3243.
- Tessema, T.; Mortimer, D.; Gupta, S.K.; Mallast, U.; Uzor, S.; Tosti, F. Urban green infrastructure monitoring using remote-sensing techniques. In Proceedings of the Earth Resources and Environmental Remote Sensing/GIS Applications XV, SPIE, Edinburgh, UK, 16–20 September 2024; Volume 13197, pp. 232–239.
- Chen, M.; Fang, T.; Zhu, Q.; Ge, X.; Zhang, Z.; Zhang, X. Feature-point matching for aerial and ground images by exploiting line segment-based local-region constraints. Photogramm. Eng. Remote Sens. 2021, 87, 767–780.
- Gagliardi, V.; Tessema, T.; Tosti, F.; Benedetto, A. Remote sensing monitoring of infrastructure in complex and coastal areas. In Proceedings of the Earth Resources and Environmental Remote Sensing/GIS Applications XV, SPIE, Edinburgh, UK, 16–20 September 2024; Volume 13197, pp. 49–56.
- Kyriou, A.; Nikolakopoulos, K.; Koukouvelas, I. Synergistic use of UAV and TLS data for precise rockfall monitoring over a hanging monastery. In Proceedings of the Earth Resources and Environmental Remote Sensing/GIS Applications XIII, SPIE, Berlin, Germany, 5–7 September 2022; Volume 12268, pp. 34–45.
- Mei, L.; Ye, Z.; Xu, C.; Wang, H.; Wang, Y.; Lei, C.; Yang, W.; Li, Y. SCD-SAM: Adapting segment anything model for semantic change detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626713.
- Zhang, Y.; Ma, G.; Wu, J. Air-ground multi-source image matching based on high-precision reference image. Remote Sens. 2022, 14, 588.
- Nikolakopoulos, K.; Koukouvelas, I.; Kyriou, A.; Katsonopoulou, D.; Kormann, M. Accuracy assessment of different remote sensing technologies over the cultural heritage site of Helike, Achaea, Greece. In Proceedings of the Earth Resources and Environmental Remote Sensing/GIS Applications XV, SPIE, Edinburgh, UK, 16–20 September 2024; Volume 13197, pp. 278–289.
- Xu, S.; Chen, S.; Xu, R.; Wang, C.; Lu, P.; Guo, L. Local feature matching using deep learning: A survey. Inf. Fusion 2024, 107, 102344.
- Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI; Springer: Cham, Switzerland, 2016; pp. 467–483.
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 224–236.
- Tian, Y.; Yu, X.; Fan, B.; Wu, F.; Heijnen, H.; Balntas, V. SOSNet: Second order similarity regularization for local descriptor learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 11016–11025.
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 8092–8101.
- Xue, F.; Budvytis, I.; Cipolla, R. SFD2: Semantic-guided feature detection and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5206–5216.
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947.
- Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. DGCNN: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543.
- Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17627–17638.
- Luo, T.; Yang, L.; Zhang, H.; Qu, C.; Wang, X.; Cui, Y.; Wong, W.F.; Goh, R.S.M. NC-Net: Efficient neuromorphic computing using aggregated subnets on a crossbar-based architecture with nonvolatile memory. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 2957–2969.
- Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-resolution correspondence networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17346–17357.
- Zhou, Q.; Sattler, T.; Leal-Taixé, L. Patch2Pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4669–4678.
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931.
- Wang, Q.; Zhang, J.; Yang, K.; Peng, K.; Stiefelhagen, R. MatchFormer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 2746–2762.
- Zhao, J.; Jiao, L.; Wang, C.; Liu, X.; Liu, F.; Li, L.; Yang, S. GeoFormer: A geometric representation transformer for change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4410617.
- Huang, D.; Chen, Y.; Liu, Y.; Liu, J.; Xu, S.; Wu, W.; Ding, Y.; Tang, F.; Wang, C. Adaptive assignment for geometry aware local feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5425–5434.
- Wang, Y.; He, X.; Peng, S.; Tan, D.; Zhou, X. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21666–21675.
- Lu, X.; Du, S. JamMa: Ultra-lightweight local feature matching with joint Mamba. arXiv 2025, arXiv:2503.03437.
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667.
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768.
- Tang, S.; Zhang, J.; Zhu, S.; Tan, P. Quadtree attention for vision transformers. arXiv 2022, arXiv:2201.02767.
- Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
- Fang, T.; Chen, M.; Li, W.; Ge, X.; Hu, H.; Zhu, Q.; Xu, B.; Ouyang, W. A novel depth information-guided multi-view 3D curve reconstruction method. Photogramm. Rec. 2025, 40, e70003.
- Fang, T.; Chen, M.; Hu, H.; Li, W.; Ge, X.; Zhu, Q.; Xu, B. 3-D line segment reconstruction with depth maps for photogrammetric mesh refinement in man-made environments. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4508421.
- Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning local features with policy gradient. Adv. Neural Inf. Process. Syst. 2020, 33, 14254–14265.
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2564–2571.
- Xu, C.; Xu, J.; Huang, T.; Zhang, H.; Mei, L.; Zhang, X.; Duan, Y.; Yang, W. Progressive matching method of aerial-ground remote sensing image via multi-scale context feature coding. Int. J. Remote Sens. 2023, 44, 5876–5895.
- Derpanis, K.G. Overview of the RANSAC algorithm. Image Rochester NY 2010, 4, 2–3.
Matching results on 12 aerial–ground image pairs (pairs 1–3: repeated textures; 4–6: occlusions; 7–9: illumination changes; 10–12: scale changes). √ = successful match; × = failure; NCM = number of correct matches; RMSE in pixels; cells left blank were not reported.

| Methods | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Success Rate | NCM | RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ORB | √ | × | × | × | × | × | × | × | × | × | × | × | 1/12 | 6 | |
| D2-Net | √ | √ | × | √ | √ | × | √ | √ | × | × | × | × | 6/12 | 18 | |
| SuperGlue | × | √ | √ | × | √ | √ | × | √ | √ | √ | × | × | 7/12 | 85 | 1.42 |
| LoFTR | √ | √ | × | √ | √ | √ | × | × | √ | × | √ | √ | 8/12 | 74 | |
| Quad-LoFTR | √ | √ | √ | √ | √ | √ | × | × | √ | × | × | √ | 8/12 | 82 | |
| MatchFormer | √ | √ | √ | √ | √ | × | √ | × | √ | √ | × | × | 8/12 | 160 | |
| GeoFormer | √ | √ | √ | √ | √ | √ | √ | × | √ | √ | × | √ | 10/12 | 137 | |
| EF-LoFTR | √ | √ | √ | √ | √ | √ | × | √ | √ | × | × | √ | 9/12 | 114 | |
| Ours | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | 12/12 | 159 | 0.69 |
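For context, NCM and RMSE in the table above are standard matching metrics: NCM counts putative matches whose reprojection error under the ground-truth transformation falls below a pixel threshold, and RMSE measures the residual error of those correct matches. A minimal sketch, assuming a ground-truth homography and a hypothetical 3-pixel inlier threshold (the exact threshold is not stated in this section):

```python
import numpy as np

def ncm_rmse(pts_src, pts_dst, H_gt, thresh=3.0):
    """Count correct matches (NCM) and their RMSE in pixels.

    pts_src, pts_dst: (N, 2) matched keypoints; H_gt: 3x3 ground-truth
    homography mapping the source image onto the destination image.
    """
    ones = np.ones((len(pts_src), 1))
    proj = np.hstack([pts_src, ones]) @ H_gt.T     # project source points
    proj = proj[:, :2] / proj[:, 2:3]              # dehomogenize
    err = np.linalg.norm(proj - pts_dst, axis=1)   # reprojection error (px)
    inliers = err < thresh
    ncm = int(inliers.sum())
    rmse = float(np.sqrt(np.mean(err[inliers] ** 2))) if ncm else float("nan")
    return ncm, rmse
```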
Relative pose estimation accuracy (AUC at 5°, 10°, and 20° error thresholds).

| Category | Method | AUC@5° | AUC@10° | AUC@20° |
|---|---|---|---|---|
| Sparse | SP + NN | 31.7 | 46.8 | 60.1 |
| Sparse | SP + SG | 49.7 | 67.1 | 80.6 |
| Semi-Dense | DRC-Net | 27.0 | 42.9 | 58.3 |
| Semi-Dense | LoFTR | 48.9 | 63.3 | 75.8 |
| Semi-Dense | Quadtree | 49.9 | 65.1 | 78.6 |
| Semi-Dense | Ours | 50.7 | 67.2 | 79.6 |
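The AUC figures above follow the common relative-pose protocol: an angular pose error is computed per image pair, and the normalized area under the cumulative error curve is reported up to each threshold. A sketch in the style of the widely used SuperGlue-style evaluation (our reconstruction, not the authors' code):

```python
import numpy as np

def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve (errors in degrees),
    normalized by each threshold and reported as a percentage."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = int(np.searchsorted(errors, t))     # errors below threshold t
        e = np.concatenate((errors[:last], [t]))
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        # trapezoidal integration of the recall-vs-error curve
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
        aucs.append(100.0 * area / t)
    return aucs

# e.g. pose_auc([2.1, 7.4, 0.8, 25.0, 4.9]) -> [AUC@5°, AUC@10°, AUC@20°]
```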
Generalization results on 12 image pairs (pairs 1–3: intensity variations; 4–6: temporal variations; 7–9: rotation; 10–12: translation). √ = successful match; × = failure; NCM = number of correct matches; RMSE in pixels; cells left blank were not reported.

| Methods | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Success Rate | NCM | RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D2-Net | √ | √ | √ | √ | √ | √ | √ | √ | × | √ | × | √ | 10/12 | 137 | |
| SuperGlue | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | × | √ | 11/12 | 156 | |
| LoFTR | √ | √ | √ | × | √ | × | √ | √ | √ | √ | √ | √ | 10/12 | 365 | |
| Quad-LoFTR | √ | √ | √ | × | √ | × | √ | √ | √ | √ | √ | × | 9/12 | 380 | |
| MatchFormer | × | × | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | 10/12 | 418 | |
| GeoFormer | √ | √ | √ | √ | √ | × | √ | √ | √ | √ | √ | √ | 11/12 | 446 | |
| EF-LoFTR | √ | √ | × | √ | √ | √ | √ | √ | √ | √ | √ | √ | 11/12 | 498 | 0.61 |
| Ours | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | 12/12 | 466 | 0.43 |
Ablation results (NCM = number of correct matches).

| Method | NCM | RMSE (pixels) |
|---|---|---|
| ResNet-FPN | 341 | 1.97 |
| ResNet-FPN + QAFF | 357 | 1.69 |
| ResNet-FPN + MTCA | 352 | 1.62 |
| ResNet + QAFF + MTCA | 377 | 1.31 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xu, C.; Wang, B.; Ye, Z.; Mei, L. ETQ-Matcher: Efficient Quadtree-Attention-Guided Transformer for Detector-Free Aerial–Ground Image Matching. Remote Sens. 2025, 17, 1300. https://doi.org/10.3390/rs17071300