An Appearance-Semantic Descriptor with Coarse-to-Fine Matching for Robust VPR
Abstract
1. Introduction
- This paper proposes an algorithm that tightly couples image appearance and semantic information to generate new appearance-semantic descriptors. This fusion enhances the expressive power of the descriptors, leading to improved accuracy and robustness in place recognition.
- This paper proposes a coarse-to-fine image matching strategy. The semantic contours are first used to construct the SemLook global descriptor for an initial screening; the appearance-semantic descriptors, namely the SemLook local descriptors, are then used to obtain more accurate place recognition results. This strategy further improves the accuracy and robustness of place recognition.
- Our proposed algorithm is compared with six state-of-the-art VPR algorithms on three public datasets featuring appearance and viewpoint changes. The results demonstrate that the SemLook-based VPR approach achieves competitive performance in terms of robustness, accuracy, and computational cost.
2. Related Work
3. Algorithms
3.1. SemLook Global Descriptor
3.2. Local Descriptors in SemLook
3.2.1. Original Forest Architecture
Algorithm 1: Semantic Object Contour Extraction and Forest Descriptor Encoding
Input: semanticSegmentationResult. Output: Forest descriptor.
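The full pseudocode of Algorithm 1 is not reproduced here. As an illustration of the contour-extraction step, the following minimal Python sketch assumes the semantic segmentation result is a per-pixel label map and uses OpenCV's border-following contour extractor (Suzuki and Abe, 1985); the function name, the `min_area` threshold, and the simple per-object record are illustrative choices, not the authors' exact Forest encoding.

```python
import cv2
import numpy as np

def extract_object_contours(label_map, min_area=200):
    """Extract one contour per connected semantic object from a per-pixel label map.

    Illustrative sketch: the pre-processing in the paper's Algorithm 1
    (class filtering, denoising, hole filling) is not reproduced here.
    """
    objects = []
    for category in np.unique(label_map):
        mask = (label_map == category).astype(np.uint8)
        # Border-following contour extraction (Suzuki and Abe, 1985); OpenCV >= 4.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            area = cv2.contourArea(contour)
            if area < min_area:          # drop tiny, unstable fragments
                continue
            m = cv2.moments(contour)
            center = (m["m10"] / m["m00"], m["m01"] / m["m00"])
            objects.append({"category": int(category),
                            "contour": contour,
                            "area": area,
                            "center": center})
    return objects
```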
3.2.2. SuperPoint-VLAD
Algorithm 2: Compute SuperPoint-VLAD for Contours
Input: contours, keypoints1, descriptors1. Output: representation for each contour.
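Algorithm 2 is likewise only summarized above. A minimal sketch of a plausible implementation follows: SuperPoint keypoints falling inside a contour are aggregated into a VLAD vector over a pre-trained visual-word codebook. The codebook, the residual aggregation, and the single L2 normalization are assumptions of this sketch rather than the paper's exact procedure.

```python
import cv2
import numpy as np

def vlad_for_contours(contours, keypoints, descriptors, codebook):
    """Aggregate SuperPoint descriptors into one VLAD vector per contour.

    keypoints   : (N, 2) array of SuperPoint keypoint locations (x, y).
    descriptors : (N, D) array of the corresponding SuperPoint descriptors.
    codebook    : (K, D) visual-word centroids (e.g., from k-means); the
                  codebook used in the paper is not specified here.
    """
    reps = []
    for contour in contours:
        # Keep only keypoints that fall inside (or on) this contour.
        inside = [i for i, (x, y) in enumerate(keypoints)
                  if cv2.pointPolygonTest(contour, (float(x), float(y)), False) >= 0]
        vlad = np.zeros_like(codebook, dtype=np.float64)
        for i in inside:
            d = descriptors[i]
            k = np.argmin(np.linalg.norm(codebook - d, axis=1))  # nearest visual word
            vlad[k] += d - codebook[k]                           # accumulate residual
        vlad = vlad.flatten()
        norm = np.linalg.norm(vlad)
        reps.append(vlad / norm if norm > 0 else vlad)           # L2 normalisation
    return reps
```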
3.2.3. SemLook Local Descriptor Construction
1. Extract the following information from the contours:
2. For each object, we select the three most distinctive neighboring objects from its surrounding environment based on geometric relationships. Specifically, these neighboring objects are usually the ones closest to the target object in terms of geometric distance; when multiple objects lie at similar distances, we prioritize those with larger areas. We then integrate the semantic categories and appearance features of these neighboring objects into the local descriptor of the target object, capturing both the spatial relationships and the visual features between objects. In this way, we enhance the discriminative ability of the descriptor and improve its robustness to scene changes.
3. We use the extracted information to encode each semantic object, obtaining a descriptor for each object, which is called a "tree".
4. All the "trees" are then assembled into a collection that serves as the descriptor for the entire image; this collection is called the SemLook local descriptor and is represented as G, as shown in Figure 4. (A code sketch of this construction is given after Algorithm 3 below.)
Algorithm 3: Construct SemLook Local Descriptor
Input: contours, SuperPoint-VLAD. Output: SemLook local descriptor G.
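To make the construction concrete, here is a minimal sketch of how the per-object "trees" and the image-level collection G might be assembled from the outputs of the two sketches above. The dictionary layout, the tie-breaking rule, and the fixed three-neighbor count follow the description in the text, but the exact encoding used by the authors is not reproduced.

```python
import numpy as np

def build_semlook_local_descriptor(objects, vlad_reps, num_neighbors=3):
    """Assemble one "tree" per semantic object and collect them into G.

    objects   : output of extract_object_contours() above (category, area, center).
    vlad_reps : one SuperPoint-VLAD vector per object, in the same order.
    Neighbour selection (nearest by centroid distance, near-ties broken by
    larger area) follows the text; the tie handling is an assumption here.
    """
    forest = []
    for i, obj in enumerate(objects):
        # Rank the other objects by centroid distance, preferring larger areas on ties.
        others = [j for j in range(len(objects)) if j != i]
        others.sort(key=lambda j: (np.hypot(objects[j]["center"][0] - obj["center"][0],
                                            objects[j]["center"][1] - obj["center"][1]),
                                   -objects[j]["area"]))
        neighbors = [{"category": objects[j]["category"],
                      "appearance": vlad_reps[j]} for j in others[:num_neighbors]]
        forest.append({"category": obj["category"],
                       "area": obj["area"],
                       "center": obj["center"],
                       "appearance": vlad_reps[i],
                       "neighbors": neighbors})
    return forest  # the SemLook local descriptor G for the whole image
```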
SemLook Global Descriptor Matching
SemLook Local Descriptor Matching
1. Compare the semantic categories C of the two trees. If the semantic categories are the same, calculate the similarity of their intrinsic information, which includes the area similarity, the center-location similarity, and the appearance similarity. The formulas for these similarities are as follows:
2. The similarity score S between two "trees" can then be calculated using Formula (14). (A minimal sketch of this matching rule is given after this list.)
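A minimal sketch of the tree-matching rule described above follows. The specific forms of the area, center-location, and appearance similarities, as well as the equal weights combining them, are assumptions of this sketch; the paper's Formula (14) may define and weight the terms differently.

```python
import numpy as np

def tree_similarity(t1, t2, image_diag, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Similarity between two "trees" following the matching rule in the text.

    The term definitions and the equal weights are plausible stand-ins for
    Formula (14), not the authors' exact formulation.
    """
    if t1["category"] != t2["category"]:
        return 0.0                                    # different classes never match
    # Area similarity: ratio of the smaller to the larger area.
    s_area = min(t1["area"], t2["area"]) / max(t1["area"], t2["area"])
    # Center-location similarity: centroid distance normalised by the image diagonal.
    dist = np.hypot(t1["center"][0] - t2["center"][0],
                    t1["center"][1] - t2["center"][1])
    s_center = max(0.0, 1.0 - dist / image_diag)
    # Appearance similarity: cosine similarity of the SuperPoint-VLAD vectors.
    a1, a2 = t1["appearance"], t2["appearance"]
    s_app = float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2) + 1e-12))
    w_a, w_c, w_p = weights
    return w_a * s_area + w_c * s_center + w_p * s_app
```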
Experiments
Dataset
Evaluation Indicators
- AUC (area under the curve): The AUC metric measures the overall performance of a model by calculating the area enclosed by the precision–recall curve. A higher AUC value, closer to 1.0, indicates a more practical VPR algorithm with higher accuracy and robustness.
- Recall@100%Precision: This metric represents the maximum recall achieved at 100% precision. It indicates the highest recall we can achieve while maintaining a precision of 100%. A value closer to 1.0 indicates better performance of the VPR algorithm under high precision.
- Precision@100%Recall: This metric represents the precision achieved at 100% recall, i.e., the precision we can achieve while retrieving every true match. A value closer to 1.0 indicates better performance of the VPR algorithm. (A sketch computing all three indicators from a precision–recall curve is given below.)
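All three indicators can be computed by sweeping a threshold over the match scores. A minimal sketch under the assumption of binary ground truth per query follows; the VPR-Bench reference implementation may differ in details such as curve interpolation.

```python
import numpy as np

def vpr_metrics(scores, ground_truth):
    """Compute AUC, Recall@100%Precision and Precision@100%Recall from match
    scores and binary ground truth by sweeping a threshold over the scores."""
    order = np.argsort(-np.asarray(scores))          # sort matches by descending confidence
    gt = np.asarray(ground_truth)[order]
    tp = np.cumsum(gt)                               # true positives at each threshold
    fp = np.cumsum(1 - gt)                           # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / gt.sum()
    # Area under the precision-recall curve (trapezoidal rule).
    auc = float(np.trapz(precision, recall))
    # Highest recall reachable while precision is still exactly 1.0.
    r_at_100p = float(recall[precision >= 1.0].max()) if np.any(precision >= 1.0) else 0.0
    # Precision at the point where recall first reaches 1.0.
    p_at_100r = float(precision[recall >= 1.0][0]) if np.any(recall >= 1.0) else 0.0
    return auc, r_at_100p, p_at_100r
```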
Experimental Setup
- Using only Forest image descriptors for matching;
- Combining SuperPoint-VLAD with Forest image descriptors to incorporate appearance information and construct SemLook local descriptors, followed by matching;
- Using SemLook global descriptors for initial frame selection and then Forest image descriptors for matching;
- Using SemLook global descriptors for initial frame selection and then SemLook local descriptors for matching.
4. Results and Discussion
4.1. Analyzing VPR Performance
4.2. Computational Cost Analysis
4.3. Impact of Different Image Encoding Techniques on Performance
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011. [Google Scholar] [CrossRef]
- Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1437–1451. [Google Scholar] [CrossRef] [PubMed]
- Yu, X.; Chaturvedi, S.; Feng, C.; Taguchi, Y.; Lee, T.-Y.; Fernandes, C.; Ramalingam, S. VLASE: Vehicle localization by aggregating semantic edges. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3196–3203. [Google Scholar]
- Benbihi, A.; Arravechia, S.; Geist, M.; Pradalier, C. Image-based place recognition on bucolic environment across seasons from semantic edge description. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3032–3038. [Google Scholar]
- Gawel, A.; Del Don, C.; Siegwart, R.; Nieto, J.; Cadena, C. X-View: Graph-Based Semantic Multi-View Localization. IEEE Robot. Autom. Lett. 2018, 3, 1687–1694. [Google Scholar] [CrossRef]
- Hou, P.; Chen, J.; Nie, J.; Liu, Y.; Zhao, J. Forest: A Lightweight Semantic Image Descriptor for Robust Visual Place Recognition. IEEE Robot. Autom. Lett. 2022, 7, 12531–12538. [Google Scholar] [CrossRef]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. arXiv 2017. [Google Scholar] [CrossRef]
- Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311. [Google Scholar]
- Paul, R.; Newman, P. FAB-MAP 3D: Topological mapping with spatial and visual appearance. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010. [Google Scholar] [CrossRef]
- Gálvez-López, D.; Tardos, J.D. Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
- Garcia-Fidalgo, E.; Ortiz, A. iBoW-lCD: An appearance-based loop closure detection approach using incremental bags of binary words. IEEE Robot. Automat. Lett. 2018, 3, 3051–3057. [Google Scholar] [CrossRef]
- Zaffar, M.; Ehsan, S.; Milford, M.; McDonald-Maier, K. CoHOG: A light-weight, compute-efficient, and training-free visual place recognition technique for changing environments. IEEE Robot. Automat. Lett. 2020, 5, 1835–1842. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–25 June 2005. [Google Scholar] [CrossRef]
- Chen, Z.; Lam, O.; Jacobson, A.; Milford, M. Convolutional neural network-based place recognition. In Proceedings of the 16th Australasian Conference on Robotics and Automation, Parkville, Australia, 2–4 December 2014; pp. 1–8. [Google Scholar]
- Hou, Y.; Zhang, H.; Zhou, S. Convolutional neural network-based image representation for visual loop closure detection. In Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 2238–2245. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
- Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14141–14152. [Google Scholar] [CrossRef]
- Chen, Z.; Jacobson, A.; Sünderhauf, N.; Upcroft, B.; Liu, L.; Shen, C.; Reid, I.; Milford, M. Deep learning features at scale for visual place recognition. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3223–3230. [Google Scholar]
- Dai, X. HybridNet: A fast vehicle detection system for autonomous driving. Signal Process. Image Commun. 2019, 70, 79–88. [Google Scholar] [CrossRef]
- Khaliq, A.; Ehsan, S.; Chen, Z.; Milford, M.; McDonald-Maier, K. A holistic visual place recognition approach using lightweight CNNsfor significant viewpoint and appearance changes. IEEE Trans. Robot. 2020, 36, 561–569. [Google Scholar] [CrossRef]
- Garg, S.; Suenderhauf, N.; Milford, M. Lost? appearance-invariant place recognition for opposite viewpoints using visual semantics. Robot. Sci. Syst. 2018, XIV, 1–10. [Google Scholar]
- Guo, X.; Hu, J.; Chen, J.; Deng, F.; Lam, T.L. Semantic histogram based graph matching for real-time multi-robot global localization in large scale environment. IEEE Robot. Autom. Lett. 2021, 6, 8349–8356. [Google Scholar] [CrossRef]
- Shih, F.Y.; Wu, Y.-T. Fast Euclidean distance transformation in two scans using a 3 × 3 neighborhood. Comput. Vis. Image Underst. 2004, 93, 195–205. [Google Scholar] [CrossRef]
- Suzuki, S.; Abe, K. Topological structural analysis of digitized binary images by border following. Comput. Vis. Graph. Image Process. 1985, 30, 32–46. [Google Scholar] [CrossRef]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
- Sattler, T.; Maddern, W.; Toft, C.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8601–8610. [Google Scholar]
- Zaffar, M.; Garg, S.; Milford, M.; Kooij, J.; Flynn, D.; McDonald-Maier, K.; Ehsan, S. VPR-Bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. Int. J. Comput. Vis. 2021, 129, 2136–2174. [Google Scholar] [CrossRef]
- Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 year, 1000 km: The oxford robotcar dataset. Int. J. Robot. Res. 2017, 36, 3–15. [Google Scholar] [CrossRef]
- Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Larsson, M.M.; Stenborg, E.; Hammarstrand, L.; Pollefeys, M.; Sattler, T.; Kahl, F. A cross-season correspondence dataset for robust semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9532–9542. [Google Scholar]
Each cell reports the three evaluation indicators as AUC / Recall@100%Precision / Precision@100%Recall.

| Datasets | Ours | Forest | Patch-NetVLAD | AlexNet | RegionVLAD | HOG | CoHOG |
|---|---|---|---|---|---|---|---|
| Extended-CMU Season (Slice6) | 1.0/1.0/1.0 | 0.99/0.97/0.58 | 1.0/1.0/1.0 | 0.99/0.99/0.90 | 0.99/0.98/0.80 | 0.99/0.98/0.94 | 1.0/1.0/1.0 |
| Extended-CMU Season (Slice7) | 1.0/1.0/1.0 | 0.93/0.86/0.36 | 1.0/1.0/1.0 | 0.99/0.97/0.59 | 0.99/0.98/0.47 | 0.98/0.91/0.34 | 0.99/0.98/0.74 |
| Extended-CMU Season (Slice8) | 0.99/0.91/0.78 | 0.85/0.68/0.17 | 0.99/0.97/0.99 | 0.98/0.88/0.62 | 0.95/0.70/0.41 | 0.91/0.45/0.45 | 0.97/0.84/0.61 |
| RobotCar Seasons v2 (Sun, Winter) | 0.93/0.88/0.21 | 0.88/0.85/0.06 | 1.0/1.0/1.0 | 0.99/0.93/0.88 | 0.99/0.97/0.87 | 0.96/0.82/0.72 | 0.84/0.68/0.25 |
| SYNTHIA (Fog, Rainnight) | 0.99/0.99/0.98 | 0.99/0.99/0.84 | 0.99/0.99/0.95 | 0.90/0.72/0.23 | 0.77/0.48/0.05 | 0.99/0.94/0.59 | 0.90/0.90/0.01 |
| SYNTHIA (Fog, Sunset) | 0.99/0.99/0.83 | 0.99/0.98/0.83 | 1.0/1.0/1.0 | 0.89/0.79/0.06 | 0.47/0.40/0.02 | 0.99/0.96/0.52 | 0.90/0.91/0.01 |
Each cell reports the per-frame time (in ms) on each of the three datasets.

| Module (ms/Frame) | SemLook | Edge (Coarse) | Forest + AP (Fine) | Forest | RegionVLAD | AlexNet | Patch-NetVLAD | HOG | CoHOG |
|---|---|---|---|---|---|---|---|---|---|
| Encoding | 68.41/54.46/40.99 | 9.50/3.64/3.07 | 58.91/50.82/37.92 | 40.3/32.6/25.9 | 1384.6/1135.9/918.4 | 1625.7/681.4/517.8 | 526.1/415.3/375.7 | 11.5/4.84/3.67 | 270.9/113.7/86.3 |
| Matching | 0.16/0.09/0.04 | 0.12/0.07/0.03 | 0.67/0.48/0.31 | 0.58/0.34/0.21 | 0.19/0.085/0.07 | 771.1/379.6/211.8 | 108.3/49.7/41.6 | 0.14/0.07/0.04 | 3.23/1.72/0.96 |
| Total | 68.57/54.55/41.03 | 9.62/3.71/3.10 | 59.58/51.30/38.23 | 40.8/32.9/26.1 | 1384.7/1136.9/918.5 | 2396.8/1061.0/729.6 | 964.8/465.8/417.5 | 11.6/4.91/3.71 | 274.1/115.5/87.3 |