A Clustering Visualization Method for Density Partitioning of Trajectory Big Data Based on Multi-Level Time Encoding
Abstract
1. Introduction
2. Study Area and Data Source
2.1. Study Area
2.2. Data Source
3. Methods
3.1. Model for Storing Trajectory Data
3.1.1. Multi-Level Time Encoding Data Storage Model
3.1.2. Optimization of the Data Storage Model
3.2. Visualization Method Based on Density Partitioning Clustering Algorithm
3.2.1. Construction of Density Partitioning Clustering Algorithm Model
- First, the road vector data are encoded sequentially so that each road segment keeps a single, consistent code. The taxi trajectory points are then overlaid on the road data, and each trajectory point that falls on a road segment is assigned that segment's code. Each of the two trajectory points being compared, X and Y, therefore carries the code of the road it lies on. Equation (4) uses a presence factor R to indicate whether the two trajectory points X and Y are located on the same road.
- When the road code of X is the same as that of Y, the factor R takes the value 0, indicating that the two trajectory points lie on the same road. Otherwise, R takes the value 1, indicating that the two trajectory points do not lie on the same road.
- When R equals 0, the distance between the two data points is computed with the Euclidean distance formula. When R equals 1, the distance between the two data points must be computed along a path that bypasses the buildings, which gives the actual distance.
- When points X and Y do not lie on the same road, the calculation depends on whether the line segment XY intersects any obstacle. If there is no intersection, the Euclidean distance is used. If there is an intersection, the visible points of the buildings must be computed to find the shortest path that replaces the line segment XY. The visible points are determined as follows: consider the buildings intersected by the line segment XY and take any vertex of those buildings. If the two vertices adjacent to that vertex do not lie on opposite sides of the line through the trajectory point and the vertex, the vertex is an edge visible point of the trajectory point.
- Applying step (4) to both trajectory points yields a set of edge visible points for X and a set of edge visible points for Y. When the two sets share one or more points, X and Y can be connected through such a shared point. The actual distance bypassing the building is then the length of the path from X through the shared point to Y.
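- If the two sets share no point, the visible points of each point in the visible point set of X are computed in turn. When one of these coincides with a point in the visible point set of Y, that point is recorded. The actual distance bypassing the building is then the length of the path from X through the intermediate visible point and the recorded point to Y.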
- With the DBSCAN distance calculation improved as described above, the data are grouped into clusters. The K-means clustering algorithm is then applied with the parameter k = 1 to determine the clusters' centroid coordinates and their corresponding attribute values. The flowchart of the density-based partitioning clustering algorithm is shown in Figure 3, where ①, ②, and ③ denote three distinct routes between two data points, each of which bypasses the obstruction formed by the buildings.
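
The distance rule in the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: building footprints are assumed to be convex polygons in general position, a vertex counts as visible when the straight segment to it neither properly crosses a building edge nor passes through a building interior (a simplification that stands in for the tangent test on adjacent vertices), and all function names are hypothetical. The last helper relies on the fact that K-means with k = 1 converges to the arithmetic mean of the cluster.

```python
import math
from itertools import product

def presence_factor(road_x, road_y):
    """R = 0 when both points carry the same road code, otherwise 1."""
    return 0 if road_x == road_y else 1

def cross(o, a, b):
    """z-component of (a - o) x (b - o); the sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def edges(poly):
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]

def properly_cross(p, q, a, b):
    """True when segments pq and ab cross at a point interior to both."""
    return (cross(p, q, a) * cross(p, q, b) < 0
            and cross(a, b, p) * cross(a, b, q) < 0)

def strictly_inside(pt, poly):
    """Strict interior test; valid for convex polygons only (assumption)."""
    signs = [cross(a, b, pt) for a, b in edges(poly)]
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)

def blocked(p, q, buildings):
    """True when segment pq cuts through at least one building."""
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return any(
        strictly_inside(mid, poly)
        or any(properly_cross(p, q, a, b) for a, b in edges(poly))
        for poly in buildings
    )

def obstacle_distance(x, y, road_x, road_y, buildings):
    """Euclidean distance on the same road or with a clear line of sight,
    otherwise the shortest detour through one or two visible vertices."""
    if presence_factor(road_x, road_y) == 0 or not blocked(x, y, buildings):
        return math.dist(x, y)
    # Candidate vertices come from the buildings that segment xy intersects.
    blocking = [poly for poly in buildings if blocked(x, y, [poly])]
    vertices = [v for poly in blocking for v in poly]
    vis_x = [v for v in vertices if not blocked(x, v, buildings)]
    vis_y = [v for v in vertices if not blocked(y, v, buildings)]
    shared = [v for v in vis_x if v in vis_y]
    if shared:  # one-hop detour through a shared visible point
        return min(math.dist(x, v) + math.dist(v, y) for v in shared)
    two_hop = [  # two-hop detour when no visible point is shared
        math.dist(x, a) + math.dist(a, b) + math.dist(b, y)
        for a, b in product(vis_x, vis_y)
        if not blocked(a, b, buildings)
    ]
    return min(two_hop, default=math.inf)

def centroid(points):
    """K-means with k = 1 reduces to the mean of the cluster's points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Illustrative geometry only: a unit-square building between two points.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
detour = obstacle_distance((-1, 0.5), (2, 0.5), 1, 2, [square])
print(detour)
```

A distance function of this kind could be used to fill a pairwise distance matrix that is then passed to a DBSCAN implementation accepting precomputed distances. Restricting candidates to the vertices of intersected buildings keeps the search small, at the cost of missing detours that need more than two intermediate points.
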
3.2.2. Heat Map Visualization Method
4. Experimental Validation and Result Analysis
4.1. Experimental Environment
4.2. Determination of Clustering Parameters and Evaluation of Clustering Results
4.3. Comparative Analysis of Retrieval Speed
4.4. Comparative Analysis of Heat Map Rendering Speed
5. Conclusions and Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Data Type | Data Format | Data Volume |
---|---|---|
Taxi trajectory point data | csv | 32 GB |
Road network data | shp | 20.5 MB |
Building contour data | shp | 17.9 MB |
Boundary data of each district | shp | 7.51 MB |
Field Name | Data Sample | Description |
---|---|---|
TIME | 31 May 2019 | Date |
POSITION_TIME | 23:31:00 | Point of time |
LNG | 118.023224 | Longitude |
LAT | 24.49147 | Latitude |
CAR_NO | 300bf55568114df822bed19e86e821e8 | License plate number |
Time Level | Coding Rule | Coding Result |
---|---|---|
Minute | Minutes from 0:00 on 1 January 1970 | 27,237,600 |
Hour | Hours from 0:00 on 1 January 1970 | 453,960 |
Day | Days from 1 January 1970 | 18,915 |
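
The coding rules in this table amount to integer division of the elapsed time since 1 January 1970. The following sketch shows that arithmetic; the function name `time_codes` and the choice to treat timestamps as UTC are assumptions made for illustration, since the table does not state a time zone.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def time_codes(ts):
    """Return minute-, hour-, and day-level codes for a timestamp."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    seconds = int((ts - EPOCH).total_seconds())
    return {
        "minute": seconds // 60,
        "hour": seconds // 3600,
        "day": seconds // 86400,
    }

# Under the UTC assumption, midnight on 15 October 2021 reproduces the
# three coding results listed in the table.
print(time_codes(datetime(2021, 10, 15)))
# {'minute': 27237600, 'hour': 453960, 'day': 18915}
```

Because each coarser code is the finer code divided by 60 or 24, a query at the hour or day level can be answered from a single code rather than a range of minute-level codes.
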
Row Key | TIMESTAMP | TAXIDATA: TIME_Code | TAXIDATA: LAT | TAXIDATA: LNG | TAXIDATA: CAR_NO |
---|---|---|---|---|---|
1 | T1 | TIME_Code1 | LAT1 | LNG1 | CAR_NO1 |
2 | T2 | TIME_Code1 | LAT2 | LNG2 | CAR_NO2 |
… | … | … | … | … | … |
… | … | … | … | … | … |
N | Tn | TIME_Coden | LATn | LNGn | CAR_NOn |
Row Key | TIMESTAMP | Column Family: LAT | Column Family: LNG | Column Family: PROPERTIES |
---|---|---|---|---|
Min/Hour/Day/TS1 | T1 | {LAT1,LAT2…LATn} | {LNG1,LNG2…LNGn} | … |
Min/Hour/Day/TS2 | T2 | {LAT1,LAT2…LATn} | {LNG1,LNG2…LNGn} | … |
Min/Hour/Day/TS3 | … | … | … | … |
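
The layout in this table, one row per time code with the coordinates of all matching points collected into sets, can be illustrated with a small in-memory grouping. This is only a sketch of the idea and does not call HBase: the function name `build_rows`, the use of the integer time code as the row key, and the placement of vehicle identifiers under PROPERTIES are all assumptions, and the sample records are made up.

```python
from collections import defaultdict

def build_rows(points, seconds_per_level=60):
    """Merge (epoch_seconds, lat, lng, car_no) records that share a time code."""
    rows = defaultdict(lambda: {"LAT": [], "LNG": [], "PROPERTIES": []})
    for epoch_seconds, lat, lng, car_no in points:
        # 60 gives minute-level codes, 3600 hour-level, 86400 day-level.
        time_code = epoch_seconds // seconds_per_level
        row = rows[time_code]
        row["LAT"].append(lat)
        row["LNG"].append(lng)
        row["PROPERTIES"].append(car_no)
    return dict(rows)

# Illustrative records only; coordinates and vehicle IDs are invented.
points = [
    (1559345460, 24.4915, 118.0232, "car_a"),
    (1559345475, 24.4921, 118.0240, "car_b"),
    (1559345520, 24.4930, 118.0251, "car_a"),
]
rows = build_rows(points)
for code, row in sorted(rows.items()):
    print(code, row["LAT"], row["LNG"])
```

Grouping in this way means that all points for a given minute, hour, or day can be fetched with a single row lookup instead of a scan over one row per point.
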
Framework | Version |
---|---|
Hadoop | 2.7.6 |
JDK | JDK1.8 |
HBase | 2.1.9 |
Zookeeper | 3.4.14 |
Eps | Minpts | Traditional DBSCAN Algorithm | Improved Hybrid Clustering Algorithm |
---|---|---|---|
0.0029 | 50 | 0.624551212 | 0.815335636 |
0.0029 | 51 | 0.641254035 | 0.824145512 |
0.0029 | 52 | 0.691455214 | 0.822155423 |
0.0029 | 53 | 0.742684135 | 0.832145142 |
0.0029 | 54 | 0.712442351 | 0.841223244 |
0.0029 | 55 | 0.713623125 | 0.845512341 |
0.0029 | 56 | 0.754215315 | 0.845215214 |
0.0029 | 57 | 0.792145221 | 0.861542341 |
0.0029 | 59 | 0.792532131 | 0.854136422 |
… | … | … | … |
Eps | Minpts | Silhouette Coefficient |
---|---|---|
0.0016 | 62 | 0.814642156 |
0.0017 | 64 | 0.754129534 |
0.0018 | 63 | 0.765243894 |
0.0019 | 62 | 0.814536452 |
0.0020 | 62 | 0.824153594 |
0.0021 | 63 | 0.834658912 |
0.0022 | 63 | 0.845245821 |
0.0023 | 64 | 0.865147238 |
0.0024 | 64 | 0.846185723 |
0.0025 | 64 | 0.754124534 |
0.0026 | 67 | 0.812210354 |
0.0027 | 66 | 0.854221521 |
0.0028 | 67 | 0.842545722 |
0.0029 | 69 | 0.898514235 |
0.0030 | 69 | 0.894125612 |
0.0031 | 69 | 0.874456124 |
0.0032 | 69 | 0.845625317 |
0.0033 | 65 | 0.671452362 |
0.0034 | 66 | 0.685422435 |
0.0035 | 66 | 0.714524585 |
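
Read as a parameter sweep, this table supports a simple selection rule: keep the (Eps, MinPts) pair with the highest silhouette coefficient. The sketch below applies that rule to the rows exactly as listed; treating the maximum silhouette coefficient as the selection criterion is an assumption about how the table is meant to be used.

```python
# (Eps, MinPts, silhouette coefficient) rows copied from the table above.
sweep = [
    (0.0016, 62, 0.814642156), (0.0017, 64, 0.754129534),
    (0.0018, 63, 0.765243894), (0.0019, 62, 0.814536452),
    (0.0020, 62, 0.824153594), (0.0021, 63, 0.834658912),
    (0.0022, 63, 0.845245821), (0.0023, 64, 0.865147238),
    (0.0024, 64, 0.846185723), (0.0025, 64, 0.754124534),
    (0.0026, 67, 0.812210354), (0.0027, 66, 0.854221521),
    (0.0028, 67, 0.842545722), (0.0029, 69, 0.898514235),
    (0.0030, 69, 0.894125612), (0.0031, 69, 0.874456124),
    (0.0032, 69, 0.845625317), (0.0033, 65, 0.671452362),
    (0.0034, 66, 0.685422435), (0.0035, 66, 0.714524585),
]

# Pick the row whose silhouette coefficient is largest.
best_eps, best_min_pts, best_score = max(sweep, key=lambda row: row[2])
print(best_eps, best_min_pts, best_score)
# 0.0029 69 0.898514235
```
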
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wei, B.; Zhang, J.; Hu, C.; Wen, Z. A Clustering Visualization Method for Density Partitioning of Trajectory Big Data Based on Multi-Level Time Encoding. Appl. Sci. 2023, 13, 10714. https://doi.org/10.3390/app131910714