Spatiotemporal Information, Near-Field Perception, and Service for Tourists by Distributed Camera and BeiDou Positioning System in Mountainous Scenic Areas
Abstract
1. Introduction
2. Experimental Data
2.1. The Experimental Area
2.2. Data Sources
- Public dataset: The public datasets used to train the tourist detection model are primarily CrowdHuman [32], MOT17 [33], and CityPersons [34]. The CrowdHuman dataset, released by Megvii Technology, is designed for pedestrian detection. It contains 24,370 images, mainly sourced from Google searches, with a total of 470,000 human instances and an average of approximately 22.6 individuals per image, featuring various levels of occlusion. Each human instance is annotated with a head bounding box, a visible-region bounding box, and a full-body bounding box. The MOT17 dataset is a comprehensive collection for multiple object tracking that builds upon its predecessor, MOT16, introduced by Milan et al. [33]; it serves as a benchmark for multi-object tracking and has driven the development of more sophisticated and precise tracking systems. It covers a variety of indoor and outdoor scenes in video format, with every pedestrian thoroughly annotated. The CityPersons dataset, introduced by Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), comprises 5050 images with an average of about seven pedestrians per image; annotations cover the visible region and the full body of each pedestrian.
- Supplementary dataset: The regional supplementary training dataset consists of over 1000 manually collected pedestrian images from the Tianmeng scenic area, containing more than 3500 target samples at a resolution of 1920 × 1080 pixels. All images are annotated with full-body pedestrian bounding boxes and are used to train the pedestrian detection model. To enhance detection capability, training begins with pre-training on the public datasets, followed by fine-tuning on the supplementary dataset (a minimal sketch of this pretrain-then-finetune loop is given below). This approach balances the model's generalization ability with its accuracy in the specific scenario.
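To make the training strategy concrete, here is a minimal PyTorch-style sketch of the pretrain-then-finetune loop. Everything in it is an illustrative assumption: the tiny stand-in network, the checkpoint name `pretrained_public.pth`, the placeholder tensors, and the hyperparameters. The actual detector is YOLOX-x, trained with its own toolchain and loss.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the detector; the real model is YOLOX-x and
# returns a composite detection loss in training mode.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5),
)

# 1) Pre-training on CrowdHuman/MOT17/CityPersons would yield a checkpoint
#    such as "pretrained_public.pth" (assumed name); strict=False tolerates
#    layers that differ between the pre-training and fine-tuning heads.
# model.load_state_dict(torch.load("pretrained_public.pth"), strict=False)

# 2) Fine-tune on the Tianmeng supplementary images with a small learning
#    rate, adapting to the local scenes while preserving generalization.
images = torch.randn(16, 3, 64, 64)      # placeholders for real image batches
targets = torch.randn(16, 5)             # placeholders for real annotations
loader = DataLoader(TensorDataset(images, targets), batch_size=8, shuffle=True)
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9,
                      weight_decay=5e-4)
loss_fn = nn.MSELoss()                   # stands in for the detection loss

model.train()
for epoch in range(5):
    for x, y in loader:
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```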
3. Related Work
3.1. Tourist Target Accurate Detection
3.2. Dynamic Target Automatic Tracking
3.3. Geospatial Positioning of Video Targets
- Camera calibration: This involves calibrating the camera to obtain its intrinsic and extrinsic parameters. Intrinsic parameters, such as focal length, principal point coordinates, and distortion coefficients, help define the camera’s imaging model. Extrinsic parameters, including the rotation and translation matrices, determine the camera’s position and orientation relative to the monitored scene.
- Pixel coordinate extraction and distortion correction: The pixel coordinates of the detected target are corrected to remove the lens distortion introduced by the camera during imaging.
- Spatial positioning: The monitoring area is treated as relatively flat or is divided into several planes. Target positioning is then simplified through a plane-constrained approach, which leverages the geometric properties of planes to streamline the calculation (see the sketch after this list).
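Under these assumptions, the three steps can be illustrated with OpenCV. The sketch below is not the paper's implementation: the intrinsic matrix, distortion coefficients, and control-point coordinates are made-up illustrative values, and a homography estimated from four surveyed BeiDou control points stands in for the full extrinsic model of the ground plane.

```python
import cv2
import numpy as np

# Intrinsics and distortion from camera calibration (e.g., cv2.calibrateCamera);
# the numbers here are illustrative, not from the paper.
K = np.array([[1500.0,    0.0, 960.0],
              [   0.0, 1500.0, 540.0],
              [   0.0,    0.0,   1.0]])
dist = np.array([-0.31, 0.12, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3

# Four or more surveyed control points: their pixel positions and their
# BeiDou-derived projected plane coordinates (meters), assuming the ground
# in view is locally flat.
pix = np.array([[420, 880], [1510, 860], [1320, 410], [610, 430]], np.float32)
geo = np.array([[596350.1, 3927810.2], [596362.8, 3927809.7],
                [596361.5, 3927824.9], [596351.0, 3927825.5]], np.float32)
H, _ = cv2.findHomography(pix, geo)             # image plane -> ground plane

def locate(u, v):
    """Map a detected target's foot point (u, v) to plane coordinates."""
    # Undistort first; P=K keeps the result in pixel coordinates.
    p = cv2.undistortPoints(np.array([[[u, v]]], np.float32), K, dist, P=K)
    xy = cv2.perspectiveTransform(p, H)          # apply the plane constraint
    return xy[0, 0]                              # (easting, northing) in meters

print(locate(960.0, 700.0))
```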
4. Results Analysis
4.1. Accurate Detection of Tourists in Multi-Scene Scenic Spots
4.2. Tourist Target Dynamic Tracking and Precise Positioning
5. Discussion
5.1. Near-Field Passive Quasi-Real-Time Perception System
- Improved YOLOX model for dynamic detection: The YOLOX model has been enhanced to detect tourist targets more effectively in complex mountainous scenes. Integrating the Convolutional Block Attention Module (CBAM) into the YOLOX network has been shown to improve detection accuracy by 3.29% to 4.11%, depending on where it is inserted. In this study, the deepest network variant, YOLOX-x, was employed to maximize feature extraction capability, with the CBAM module placed after the Spatial Pyramid Pooling (SPP) module to enhance feature relevance (see the CBAM sketch after this list). This configuration facilitates effective multi-scale feature extraction, improving the detection of tourists of varying sizes within the camera's field of view while compensating for the complex background. The adaptively spatial feature fusion (ASFF) method further optimizes the contribution of feature maps at different levels, boosting precision and recall for small targets without imposing significant computational overhead. These enhancements have markedly improved both detection speed and accuracy, ensuring high precision and recall even in challenging scenarios involving small targets and dense crowds. As a result, the improved YOLOX is well suited to real-time target detection in mountainous scenic areas. The model's generality and applicability can be extended further by modifying the network or incorporating other novel self-attention modules.
- Introduction of the BYTE tracking algorithm for dynamic target tracking: Dynamic multi-object tracking is a crucial step in achieving precise perception of tourists in mountainous scenic areas. In the tracking process, data association methods are essential: they compute the similarity between trajectories and detection boxes and match them accordingly. Location, motion, and appearance serve as the key cues for associative matching, enabling continuous tracking and effective handling of occlusions. The BYTE tracking algorithm, known for its flexibility and compatibility with various association methods, was incorporated into the system, significantly improving tracking accuracy; its two-stage association is sketched after this list. BYTE maintains continuous tracks even when targets are partially occluded, demonstrating its robustness for dynamic detection and tracking. By integrating trackers and employing the BYTE data association method, the system improves detection accuracy and broadens the application scenarios, with a focus on multi-frame video target tracking. Additionally, a local-area detection method was introduced to reduce false positives caused by overlapping detection areas, mitigating occlusion issues and improving detection accuracy in complex scenes.
- Construction of a near-field sensing network: A near-field sensing network for tourist information was developed in the Tianmeng Mountain area, built on spatial location correlation and the unified spatiotemporal reference of the BeiDou system. The network integrates multiple near-field cameras, originally used for decentralized single-point monitoring, into a unified perception system. This transformation of independent video monitoring devices into an interconnected network enables the provision of near-field dynamic information and scene-state awareness. By pulling real-time video streams and combining them with the object detection and positioning methods, the network integrates detection, tracking, and positioning, allowing real-time dynamic updating and management of precise spatiotemporal information and directly supporting safety management and the development of smart scenic areas.
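For concreteness, the following is a minimal PyTorch sketch of a standard CBAM block (channel attention followed by spatial attention, after Woo et al.) and of how it might be wired onto the SPP output. The tensor shape and the wiring are illustrative assumptions, not the exact configuration of the network used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to global average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., ECCV 2018)."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

# Illustrative wiring: refine a (hypothetical) SPP output of YOLOX-x.
spp_out = torch.randn(1, 1024, 20, 20)
refined = CBAM(1024)(spp_out)
```

The two-stage data association at the heart of BYTE can likewise be compressed into a short sketch: high-score detections are matched to existing tracks first, and the leftover tracks are then matched against low-score detections, which in crowded scenes are often occluded pedestrians rather than background. The thresholds and the plain IoU cost below are simplifying assumptions; the full ByteTrack implementation adds Kalman-filter motion prediction and track lifecycle management.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match(tracks, dets, thresh):
    """Hungarian matching on IoU; returns matched pairs and unmatched indices."""
    if not tracks or not dets:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = np.array([[1.0 - iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - thresh]
    ut = [i for i in range(len(tracks)) if i not in {r for r, _ in pairs}]
    ud = [j for j in range(len(dets)) if j not in {c for _, c in pairs}]
    return pairs, ut, ud

def byte_associate(track_boxes, det_boxes, det_scores, hi=0.6, lo=0.1):
    """Two-stage BYTE association over one frame's detections."""
    high = [b for b, s in zip(det_boxes, det_scores) if s >= hi]
    low = [b for b, s in zip(det_boxes, det_scores) if lo <= s < hi]
    # Stage 1: match tracks to confident detections.
    pairs_high, unmatched, _ = match(track_boxes, high, thresh=0.3)
    # Stage 2: remaining tracks try the low-score boxes, recovering
    # partially occluded targets that stage 1 missed.
    leftovers = [track_boxes[i] for i in unmatched]
    pairs_low, _, _ = match(leftovers, low, thresh=0.3)
    return pairs_high, pairs_low
```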
5.2. Technical Difficulties and Error Analysis
5.3. Scalability and Limitations
6. Conclusions
- The introduction of the CBAM and ASFF modules into the YOLOX network effectively improves the recognition accuracy of small tourist targets in long-distance outdoor video frames of mountainous scenic spots.
- The BYTE dynamic tracking algorithm enables continuous tracking of moving tourist targets in complex scenes, addressing target occlusion in mountainous scenic areas and enhancing detection accuracy.
- By utilizing the spatial position correlation of the BeiDou system, this study achieves comprehensive management and collaborative perception of multiple distributed cameras in mountainous scenic areas within the unified spatiotemporal reference of the BeiDou system. Integrating multiple near-field cameras, previously used for scattered single-point monitoring, into a unified near-field perception system enables passive and accurate perception of tourists' spatiotemporal information in mountainous scenic areas from multi-channel distributed video.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, D.; Deng, L.; Cai, Z. Statistical analysis of tourist flow in tourist spots based on big data platform and DA-HKRVM algorithms. Pers. Ubiquit. Comput. 2020, 24, 87–101.
2. Liu, J.; Du, J.; Sun, Z.; Jia, Y. Tourism emergency data mining and intelligent prediction based on networking autonomic system. In Proceedings of the 2010 International Conference on Networking, Sensing and Control (ICNSC), Chicago, IL, USA, 10–12 April 2010; pp. 238–242.
3. Ervina, E.; Wulung, S.R.P.; Octaviany, V. Tourist perception of visitor management strategy in North Bandung Protected Area. J. Bus. Hosp. Tour. 2020, 6, 303.
4. Qin, S.; Man, J.; Wang, X.; Li, C.; Dong, H.; Ge, X. Applying big data analytics to monitor tourist flow for the scenic area operation management. Discret. Dyn. Nat. Soc. 2019, 2019, 8239047.
5. Gstaettner, A.M.; Rodger, K.; Lee, D. Managing the safety of nature? Park visitor perceptions on risk and risk management. J. Ecotour. 2022, 21, 246–265.
6. Zhou, J. Design of intelligent scenic area guide system based on visual communication. In Proceedings of the 2020 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Vientiane, Laos, 11–12 January 2020; pp. 498–501.
7. Shen, H.; Lin, D.; Yang, X.; He, S. Vision-based multiobject tracking through UAV swarm. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
8. Hmidani, O.; Alaoui, E.M.I. A comprehensive survey of the R-CNN family for object detection. In Proceedings of the 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), Virtual, 12–14 December 2022; pp. 1–6.
9. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
10. Afyouni, I.; Al Aghbari, Z.; Razack, R.A. Multi-feature, multi-modal, and multi-source social event detection: A comprehensive survey. Inf. Fusion 2022, 79, 279–308.
11. Kalake, L.; Wan, W.; Hou, L. Analysis based on recent deep learning approaches applied in real-time multi-object tracking: A review. IEEE Access 2021, 9, 32650–32671.
12. Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6674–6683.
13. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
15. Cheng, B.; Wei, Y.; Shi, H.; Feris, R.; Xiong, J.; Huang, T. Revisiting RCNN: On awakening the classification power of Faster RCNN. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11219, pp. 473–490.
16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
17. Liu, B.; Zhao, W.; Sun, Q. Study of object detection based on Faster R-CNN. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 6233–6236.
18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
19. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. arXiv 2017, arXiv:1612.02295.
20. Mahrishi, M.; Morwal, S.; Muzaffar, A.W.; Bhatia, S.; Dadheech, P.; Rahmani, M.K.I. Video index point detection and extraction framework using custom YoloV4 Darknet object detection model. IEEE Access 2021, 9, 143378–143391.
21. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
22. Bathija, A.; Sharma, G. Visual object detection and tracking using YOLO and SORT. Int. J. Eng. Res. 2022, 8, 11.
23. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
24. Parico, A.I.B.; Ahamed, T. Real time pear fruit detection and counting using YOLOv4 models and deep SORT. Sensors 2021, 21, 4803.
25. Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
26. Wu, H.; Du, C.; Ji, Z.; Gao, M.; He, Z. SORT-YM: An algorithm of multi-object tracking with YOLOv4-tiny and motion prediction. Electronics 2021, 10, 2319.
27. Pang, J.; Qiu, L.; Li, X.; Chen, H.; Li, Q.; Darrell, T.; Yu, F. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 164–173.
28. Fischer, T.; Huang, T.E.; Pang, J.; Qiu, L.; Chen, H.; Darrell, T.; Yu, F. QDTrack: Quasi-dense similarity learning for appearance-only multiple object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15380–15393.
29. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
30. Ghaffarian, S.; Valente, J.; Van Der Voort, M.; Tekinerdogan, B. Effect of attention mechanism in deep learning-based remote sensing image processing: A systematic literature review. Remote Sens. 2021, 13, 2965.
31. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; Volume 13682, pp. 1–21.
32. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A benchmark for detecting humans in a crowd. arXiv 2018, arXiv:1805.00123.
33. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
34. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4457–4465.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19.
36. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
37. Juarez-Salazar, R.; Zheng, J.; Diaz-Ramirez, V.H. Distorted pinhole camera modeling and calibration. Appl. Opt. 2020, 59, 11310–11318.
38. Han, S.; Miao, S.; Hao, X.; Chen, R. A spatial localization method for dynamic objects in surveillance video. Bull. Surv. Mapp. 2022, 8, 87.
39. Jang, B.; Kim, H.; Kim, J.-W. Survey of landmark-based indoor positioning technologies. Inf. Fusion 2023, 89, 166–188.
40. Jain, D.K.; Zhao, X.; González-Almagro, G.; Gan, C.; Kotecha, K. Multimodal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes. Inf. Fusion 2023, 95, 401–414.
41. Yang, G.; Zhu, D. Survey on algorithms of people counting in dense crowd and crowd density estimation. Multimed. Tools Appl. 2023, 82, 13637–13648.
42. Yang, R.; Li, W.; Shang, X.; Zhu, D.; Man, X. KPE-YOLOv5: An improved small target detection algorithm based on YOLOv5. Electronics 2023, 12, 817.
43. Specker, A.; Beyerer, J. ReidTrack: Reid-only multi-target multi-camera tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 5442–5452.
44. Han, S.; Miao, S.; Hao, X.; Chen, R. A review of the development of fusion technology of surveillance videos and geographic information. Bull. Surv. Mapp. 2022, 5, 1–6.
45. Zi, Y.; Xie, F.; Song, X.; Jiang, Z.; Wang, P. Thin cloud removal for remote sensing images using a physical-model-based CycleGAN with unpaired data. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
46. Qin, X.; Wang, Z.; Bai, Y.; Xie, H.; Jia, H.; Li, C. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915.
Table: Effect of the CBAM insertion position (✓ = CBAM inserted at that stage) on detection accuracy and inference time.

| Methods | Head | Neck | Backbone | AP0.5 (%) | Inference Time (ms) |
|---|---|---|---|---|---|
| YOLOX-x | × | × | × | 85.60 | 17.30 |
| CBAM+YOLOX | × | × | ✓ | 87.71 | 19.42 |
| CBAM+YOLOX | × | ✓ | × | 87.62 | 20.16 |
| CBAM+YOLOX | ✓ | × | × | 87.16 | 18.92 |
| CBAM+YOLOX | ✓ | ✓ | × | 86.12 | 20.26 |
| CBAM+YOLOX | ✓ | × | ✓ | 86.56 | 20.34 |
| CBAM+YOLOX | × | ✓ | ✓ | 86.85 | 20.98 |
| CBAM+YOLOX | ✓ | ✓ | ✓ | 86.79 | 21.39 |
Table: Ablation of the CBAM and ASFF modules.

| Methods | AP0.5 (%) | Precision (%) | Recall (%) | Inference Time (ms) |
|---|---|---|---|---|
| YOLOX-x | 85.60 | 88.65 | 92.53 | 17.30 |
| CBAM+YOLOX | 87.71 | 90.22 | 93.12 | 19.42 |
| CBAM+ASFF+YOLOX | 88.05 | 90.56 | 93.25 | 25.62 |
Table: Comparison with recent YOLO detectors.

| Method | Precision (%) | Recall (%) | AP0.5 (%) |
|---|---|---|---|
| YOLOv7 | 88.26 | 91.03 | 85.56 |
| YOLOv8 | 88.54 | 91.25 | 85.12 |
| YOLOv9 | 88.97 | 92.51 | 85.71 |
| Ours | 90.56 | 93.25 | 88.05 |
Table: Detection performance in the four test scenes.

| Dataset | Precision (%) | Recall (%) | AP0.5 (%) | FPS |
|---|---|---|---|---|
| Tourist Center (a) | 90.69 | 93.57 | 87.63 | 23.1 |
| Xiaodiao Area (b) | 90.41 | 92.12 | 86.55 | 24.5 |
| Cable Car Area (c) | 91.56 | 93.24 | 88.12 | 27.6 |
| Yu Huang Palace (d) | 90.01 | 92.04 | 85.91 | 26.2 |
Table: Positioning error statistics for the test points. All coordinates are projected plane coordinates (easting E, northing N) in meters.

| Actual E (m) | Actual N (m) | Ave. E (m) | Ave. N (m) | Mean Error E (m) | Mean Error N (m) | Max. Error (m) | Variance (m²) | RMSE (m) |
|---|---|---|---|---|---|---|---|---|
| 596362.2370 | 3927815.6560 | 596362.7752 | 3927815.3728 | 0.5382 | −0.2832 | 0.7392 | 0.041 | 0.6117 |
| 596362.1732 | 3927817.3080 | 596362.1396 | 3927817.8300 | −0.0334 | 0.5253 | 0.8208 | 0.0121 | 0.5392 |
| 596354.8362 | 3927809.1040 | 596355.3667 | 3927808.8624 | 0.5305 | −0.2416 | 1.1217 | 0.0262 | 0.6065 |
| 596356.4685 | 3927810.1020 | 596357.8639 | 3927810.3748 | 1.3953 | 0.2728 | 1.1424 | 0.0399 | 1.1483 |
| 596349.4518 | 3927811.7780 | 596349.4524 | 3927811.8632 | 0.0006 | 0.0252 | 0.1359 | 0.0001 | 0.0945 |
| 596356.7993 | 3927811.4090 | 596356.9778 | 3927812.1811 | 0.1785 | 0.7721 | 1.1288 | 0.0010 | 0.8018 |
| 596355.1397 | 3927810.4240 | 596354.6866 | 3927810.5372 | −0.4531 | 0.1132 | 1.0073 | 0.0263 | 0.5015 |
| 596357.3409 | 3927817.6080 | 596357.6832 | 3927817.3966 | 0.3423 | −0.2114 | 0.4424 | 0.0001 | 0.4072 |
| 596357.2758 | 3927816.0650 | 596357.4817 | 3927815.1813 | 0.2059 | −0.8837 | 1.0440 | 0.0037 | 0.9111 |
| 596349.0143 | 3927725.8510 | 596349.3317 | 3927725.1624 | 0.3174 | −0.6886 | 0.8397 | 0.0006 | 0.7641 |