Article

A Clustering Visualization Method for Density Partitioning of Trajectory Big Data Based on Multi-Level Time Encoding

School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 102612, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 10714; https://doi.org/10.3390/app131910714
Submission received: 25 August 2023 / Revised: 25 September 2023 / Accepted: 25 September 2023 / Published: 26 September 2023
(This article belongs to the Section Earth Sciences)

Abstract

The proliferation of the Internet and the widespread adoption of mobile devices have given rise to an immense volume of real-time trajectory big data. However, a single computer and conventional databases with limited scalability struggle to manage this data effectively, and during visual rendering, issues such as page stuttering and subpar visual outcomes often arise. This paper, founded on a distributed architecture, introduces a multi-level time encoding method using “minutes”, “hours”, and “days” as fundamental units, achieving a storage model for trajectory data at multiple time scales. Furthermore, building upon an improved DBSCAN clustering algorithm and integrating it with the K-means clustering algorithm, a novel density-based partitioning clustering algorithm has been introduced, which incorporates road coefficients to circumvent architectural obstacles, successfully resolving page stuttering issues and significantly enhancing the quality of visualization. The results indicate the following: (1) when data is extracted using the units of “minutes”, “hours”, and “days”, the retrieval efficiency of this model is 6.206 times, 12.475 times, and 18.634 times higher, respectively, compared to the retrieval efficiency of the original storage model. As the volume of retrieved data increases, the retrieval efficiency of the proposed storage model becomes increasingly superior to that of the original storage model. Under identical experimental conditions, this model’s retrieval efficiency also outperforms the space–time-coded storage model; (2) under a consistent rendering level, the clustered trajectory data, when compared to the unclustered raw data, has shown a 40% improvement in the loading speed of generating heat maps, with no page stuttering. Furthermore, the heat kernel phenomenon in the heat map is also eliminated while the visualization rendering speed is enhanced.

1. Introduction

With the continuous development of smart cities and the widespread use of mobile networks [1] and terminal devices, people have been able to conveniently access and record trajectory data of mobile objects such as taxis, buses, and shared bicycles that are used in their daily urban lives [2]. A substantial amount of trajectory point data can be collected in real-time through GPS terminals [3]. The volume of this data has already exceeded terabytes (TB) and petabytes (PB) and has even reached the scale of zettabytes (ZB). Consequently, exploring a novel technology that can effectively handle storage, management, and computation of transportation trajectory big data has become an urgent problem to address in GIS applications [4]. Current trajectory big data exhibits characteristics of Volume, Velocity, Variety, Value, and Veracity [5]. However, it also possesses certain traits and limitations, such as sorting based on time, inconsistent data time intervals, and low data quality [6]. Each trajectory point contains location information and is sorted based on time, providing a detailed depiction of the spatiotemporal dynamics of mobile objects and encompassing a wealth of undiscovered spatiotemporal regularity information [7]. Visualization techniques are indispensable to fully explore the potential geographic spatial information embedded within massive trajectory data. They enable individuals to analyze, express, and communicate a vast amount of complex and unobservable spatial information more efficiently [8]. However, conventional relational database storage cannot provide sufficient technical support for visualizing massive taxi trajectory data. Furthermore, issues may arise when dealing with large-scale data for visualization, such as long mapping times, low interactivity, and cluttered visual structures, which can hinder users’ comprehension of critical information [9]. Therefore, researching and implementing a solution suitable for the storage and rapid visualization of massive trajectory data is of paramount importance.
The design of a high-performance data-storage model is a prerequisite for achieving rapid visualization of massive trajectory data. In recent years, with the rapid development of distributed storage and computing capabilities, high-quality platforms have emerged for storing, managing, and retrieving spatiotemporal data. Due to its comprehensive ecosystem and its characteristics of low cost and high scalability, Hadoop has gained widespread recognition and adoption by developers both domestically and internationally [10]. HBase, known for its scalability, fault tolerance, and high-speed concurrency, has been extensively utilized for storing spatiotemporal data, laying the foundation for spatiotemporal data querying, processing, and big data analysis [11]. Within the realm of contemporary commercial spatial data memory management products, GeoMesa [12] integrates various prevalent distributed computing database technologies such as HBase, Accumulo, Google Bigtable, and Cassandra to facilitate spatial searches involving points, lines, surfaces, or polygons. Additionally, GeoMesa harnesses Kafka for real-time stream management of spatial data. Louai Alarabi et al. [13], building upon the Hadoop architecture, proposed the innovative ST-Hadoop framework that supports the storage and management of temporal data. They constructed multiple hierarchical time index structures based on different temporal dimensions, providing a novel approach to temporal retrieval using spatial index structures, thereby mitigating superfluous data retrieval during querying. Yi Bao et al. [14], rooted in the HBase database, engineered and materialized a prototype system for segmented storage of trajectory data. Compared to the GeoMesa system, this segmented architecture offers superior query speed and memory utilization performance. Ke Wang et al. [15], abstracting geographic space as meta-semantic entities, stored them as minimal storage units within HBase. They proposed an efficient organizational and storage model that effectively enhances data query speed by capitalizing on the HBase database’s robust scalability and real-time read features. Yaobin HE et al. [16] proposed the MR-DBSCAN algorithm, which partitions data based on computation cost to accelerate query performance. Additionally, MapReduce is employed to execute parallel local clustering for each partition. Jinsong Xu et al. [17] presented a storage and sharing algorithm for massive data in a distributed heterogeneous environment leveraging HBase. This algorithm effectively addresses the challenge of high storage occupancy and significantly reduces sharing latency. Shoji Nishimura et al. [18] introduced a multi-dimensional spatial data storage framework called MD-HBase, built upon HBase. This framework employs quad-trees or K-d trees for spatial partitioning and Z-encoding, storing the encoded outcomes as row keys within HBase. Zhixin Yao et al. [19] devised a trajectory big data model integrating data partitioning and spatiotemporal multi-angle hierarchical organization. This model addresses the storage skewness and hot write issues in the distributed HBase database, thereby boosting the write rate of massive data.
The research mentioned above, based on the Hadoop distributed storage and computing platform, has all, to some extent, addressed the challenges of storing and managing massive data. The time information of trajectory data is a crucial characteristic of space–time data. However, these studies have yet to leverage the time information of trajectory data to address the issues. Hence, to address this gap, this research introduces a multi-level time encoding storage model based on the time information of trajectory data. This model is designed for the flexible and efficient storage and management of massive trajectory data. Introducing this model can resolve data storage issues arising from the lack of uniformity in time formats and enhance the computational efficiency for multi-scale time. Furthermore, compared to the studies mentioned earlier, this research has also optimized the storage structure of the time encoding storage model based on real-world usage scenarios. This optimization aims to enhance data retrieval efficiency within the same time frame.
With the widespread adoption of positioning technologies like GPS in various aspects of daily life, the acquisition of information from traffic big data has become remarkably facile [3]. Grasping the underlying regularities within this voluminous data and comprehending them intuitively have emerged as current research focal points and challenges. Consequently, the field of data visualization has flourished. Data visualization is an interdisciplinary realm encompassing human–computer interaction, computer graphics, image science, statistical analysis, and geographic information. It synthesizes various knowledge domains and skills, encompassing data processing, algorithm design, software development, and human–computer interaction. It employs visual forms such as images, charts, and animations to showcase relationships and trends within data, thereby enhancing the efficiency of data reading and comprehension. As for data types, contemporary visualization research increasingly delves into areas such as multidimensional data, time-series data, network data, and hierarchical data [20,21,22,23]. Given the colossal scale of traffic trajectory big data, directly visualizing massive datasets could lead to visual clutter, hindering the extraction of key insights. For instance, when generating heatmaps, excessive data volume might result in color gradients that are ill-suited to the dataset, causing issues of indistinct contrast and even generating heat kernel phenomena [24] due to high local point densities. Moreover, querying sizable data could engender extended rendering times, dampening information interactivity. Hence, undertaking appropriate data processing can enhance the support for data visualization. Currently, employing clustering algorithms for data analysis and processing has emerged as a pivotal technical field. J. Tang et al. employed the DBSCAN algorithm to cluster pick-up and drop-off locations, establishing a maximum entropy model to evaluate urban traffic distribution [25]. Zihe Huang et al. [26] introduced a density-based spatial clustering algorithm named DBSCAN+, which efficiently segments and extracts the highest-density clusters from large-scale data. This addresses the issues of slow clustering speed and the inability to identify suitable cluster centers for taxi trajectory data under traditional density algorithms. However, obstacles like buildings, lakes, and mountains are commonly present in natural geographic spaces, impacting cluster results and preventing an accurate reflection of data characteristics. Therefore, traditional spatial clustering algorithms are unsuitable for real-world spatial scenarios with obstacle constraints [27]. With the continuous evolution of information technology, pursuing more rational and practical clustering outcomes has led to a burgeoning focus on spatial clustering algorithms that navigate around obstacles, which has become a prominent area of research [28]. The COD-CLARANS algorithm proposed by Anthony K. H. Tung et al. [29] is an early example of clustering algorithms grounded in obstacle constraints. It builds upon the partition-based CLARANS [30] clustering algorithm and introduces obstacle distance, employing pruning strategies to achieve clustering with obstructed data. However, the overall efficiency of this algorithm is relatively low. Estivill-Castro et al. [31] introduced the AUTOCLUST+ algorithm based on graph-theoretical Delaunay triangulation. O. R. Zaiane et al. [32], building upon the foundation of the traditional density-based clustering algorithm DBSCAN [33], introduced an obstacle model and proposed a novel algorithm named DBCLuC, which transforms obstacle models into a collection of polygons. The obstacle-constrained clustering algorithm based on intelligent optimization [34,35] is formed by integrating the optimization models from intelligent optimization algorithms with the clustering process. Tengfei Yang et al. [36] introduced the QKSCO algorithm for obstacle-constrained spatial clustering, which integrates QPSO and K-medoids. The precision of clustering outcomes is elevated by applying QPSO’s rapid global convergence to the separation process of global clustering.
The studies mentioned above have thoroughly considered the impact of real-world spatial obstacles on data clustering results. They have also introduced various improvements and adaptations to traditional clustering algorithms, which have, to some extent, enhanced the accuracy of clustering outcomes, offering fresh perspectives for subsequent research in related clustering domains. However, the generality of the clustering algorithms proposed in these studies is low, and they are only partially suitable for the transportation industry. Moreover, studies that effectively present the significant effects of clustering through visualization are scarce. Therefore, this study introduces a density-based partitioning clustering algorithm incorporating road coefficients to circumvent architectural obstacles. This algorithm is designed for handling massive trajectory data and can effectively enhance the precision of clustering results by filtering out a substantial amount of invalid data. It demonstrates strong applicability in urban transportation trajectory data. Additionally, this study introduces a heatmap visualization method that combines clustering algorithms. This method not only intuitively reflects the natural distribution of data after avoiding buildings but also highlights the rapid visualization effect of the data.
This paper proposes a trajectory data storage model based on the HBase distributed database as its foundation. This model constructs integer encoding based on fundamental units of “minutes”, “hours”, and “days”, achieving the storage of trajectory data across multi-level time scales. Furthermore, this paper presents a density-based partitioning clustering algorithm incorporating road coefficients to circumvent architectural obstacles. This innovative approach not only preserves the intrinsic features of the original trajectory positions but also prevents unnecessary data transmission, thereby ensuring the authenticity and precision of clustering outcomes. Moreover, it also facilitates the rapid loading of heatmaps for abundant trajectory data points and enhances the rendering efficiency of the heatmaps.
The rest of the paper is organized in the following way. The second part briefly introduces the study area and data sources; the third part describes our research methods; the fourth part shows the experimental validation and result analysis, followed by conclusions and outlook.

2. Study Area and Data Source

2.1. Study Area

Xiamen, located in the southeastern coastal region of Fujian Province in East China, features an undulating terrain primarily characterized by hilly topography. It is one of China’s five major Special Economic Zones. As the leading city in the economic development of Fujian Province, Xiamen has attracted a significant convergence of people, goods, and commercial activities. It boasts beautiful natural surroundings, a pleasant climate, a rich cultural atmosphere, and a diverse and distinctive culture. It is acclaimed as one of the best places to live and the best tourist city in China. In theory, the characteristics of high population density and urban scale result in a significant demand for taxi services, leading to a substantial volume of taxi operations and travel activities, thus generating a wealth of taxi trajectory data. The complexity of the transportation network and the unique terrain also contribute to the diversity of taxi trajectory data in Xiamen. In practice, urban management and traffic planning in Xiamen have improved the convenience and efficiency of urban transportation by optimizing road layouts and enhancing transportation infrastructure, which has benefited taxis’ operation and travel activities. Furthermore, technological application advancements and data collection techniques have made acquiring taxi trajectory data more efficient. Therefore, selecting Xiamen as the study area holds significant theoretical and practical value.
Xiamen is divided into a total of six administrative districts. Siming District is Xiamen’s historical and cultural nucleus, serving as the epicenter of its political, commercial, and cultural activities. Huli District, located at the city’s heart, boasts a reputation for its bustling commercial districts and convenient transportation network. The proximity of these two districts has fostered the generation of exceedingly bountiful travel data. Jimei District holds the mantle of Xiamen’s educational and research hub, while Haicang District, nestled in the southeastern part of the city, boasts abundant marine resources and natural splendor, making it an area with relatively dense travel activities. Despite their expansive territorial reach, Tong’an District and Xiang’an District experience disparate identities. The former thrives on agriculture, while the latter is a burgeoning developmental precinct. Consequently, these two districts register the lowest density of personnel travel. The kernel density analysis map of travel data in the study area, as shown in Figure 1, illustrates that the travel data of people in Xiamen City is extensively distributed among the six administrative districts. The areas of high and low travel density accurately reflect the functional attributes of each section.

2.2. Data Source

This experiment utilizes Xiamen City’s taxi trajectory and relevant geographic vector data, as shown in Table 1. The geographic vector data for Xiamen City encompasses its road network data, building outlines, and the boundaries of each district (http://www.rivermap.cn/ (accessed on 20 October 2022)). The geographic vector data is primarily utilized for data preprocessing and distance calculation in clustering algorithms. The selected dataset spans from 20 May 2020 to 7 June 2020. The data was sourced from the 2020 Digital China Innovation Competition (https://data.xm.gov.cn/opendata-competition/index.html#/ (accessed on 18 October 2022)). The spatial dataset range is from 117.908136° E to 118.337183° E longitude and 24.425163° N to 24.817838° N latitude. During this period, the travel data of Xiamen’s residents are densely distributed across the administrative districts. Approximately two million trajectory data points are generated daily, totaling around 1.6 GB, showcasing strong representativeness and research value. The data structure of taxi trajectory points in Xiamen City is illustrated in Table 2. During data collection for each trajectory, issues arise due to the instability of the sampling equipment, leading to problems like missing data points, data redundancy, and erroneous data. Hence, the initial step involves data cleansing, employing the average road interpolation method to fill in missing values between trajectory points due to the instability of sampling equipment. Redundant data and erroneous data points outside the designated research area are removed. Subsequently, longitude and latitude coordinates are adjusted to complete the coordinate transformation process. After performing data cleansing on the Xiamen taxi trajectory data, it was observed that the cleaning efficiency reached 88.754%.
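For illustration, the following Java sketch reproduces the two simplest cleansing steps described above: removing points outside the stated spatial extent and filling a single missing fix by averaging its temporal neighbours. It is a simplified stand-in for the average road interpolation method rather than the authors’ implementation, and all class and field names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified cleansing sketch: bounding-box filter plus neighbour-average gap filling.
// A stand-in for the average road interpolation described in the text; names are illustrative.
public class TrajectoryCleaner {

    public record Fix(long timeMs, double lat, double lng) {}

    // Spatial extent of the study area given in Section 2.2.
    static final double MIN_LNG = 117.908136, MAX_LNG = 118.337183;
    static final double MIN_LAT = 24.425163,  MAX_LAT = 24.817838;

    /** Remove fixes outside the study area and fill single missing samples by averaging neighbours. */
    public static List<Fix> clean(List<Fix> track, long sampleIntervalMs) {
        List<Fix> kept = new ArrayList<>();
        for (Fix f : track) {
            if (f.lng() >= MIN_LNG && f.lng() <= MAX_LNG && f.lat() >= MIN_LAT && f.lat() <= MAX_LAT) {
                kept.add(f);
            }
        }
        List<Fix> filled = new ArrayList<>();
        for (int i = 0; i < kept.size(); i++) {
            filled.add(kept.get(i));
            if (i + 1 < kept.size()
                    && kept.get(i + 1).timeMs() - kept.get(i).timeMs() == 2 * sampleIntervalMs) {
                // Exactly one sample is missing between the two fixes: insert their average.
                Fix a = kept.get(i), b = kept.get(i + 1);
                filled.add(new Fix(a.timeMs() + sampleIntervalMs,
                        (a.lat() + b.lat()) / 2, (a.lng() + b.lng()) / 2));
            }
        }
        return filled;
    }
}
```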

3. Methods

This study presents a solution for storing and rapidly visualizing massive trajectory data. To begin with, handling such voluminous trajectory data necessitates performing data preprocessing [37] on a Hadoop cluster. Herein, HDFS is responsible for distributed data storage, while MapReduce [38] is responsible for parallel data processing and cleansing. Secondly, timestamps are used to unify the time data, which are then hierarchically partitioned and encoded at multiple time levels. Thirdly, the resulting time encoding serves as the row key for storage in HBase, facilitating swift retrieval of trajectory data within the same time frame. Finally, invalid trajectory data is filtered out utilizing the density-based partitioning clustering algorithm proposed in this paper, enabling the swift visualization of the valid trajectory data. The technical flowchart is illustrated in Figure 2.

3.1. Model for Storing Trajectory Data

In the face of the challenges posed by the vast volume of traffic trajectory data, devising an efficient storage model for storing and retrieving extensive trajectory data within big data frameworks becomes exceedingly crucial. Currently, research on space–time data primarily focuses on spatial data, while time information is often treated in a simplistic manner, such as using timestamps [39], strings [40], and time-counting methods for auxiliary processing. However, the reality is that big data contains a variety of time information that is not fully utilized. Furthermore, in many cases, achieving uniform organization of multi-scale time information and enhancing the efficiency of multi-scale time calculations pose significant challenges [41]. Therefore, this research, grounded in the architecture of MapReduce and HBase, has devised a data storage model predicated on multi-level time encoding. However, the trajectory data points of taxis within the same period are not stored continuously on the storage device. Accessing them can generate a significant amount of IO [42], thereby reducing data access efficiency [43]. Other storage methods, such as vehicle-based trajectory [44] or spatial-information-based [45] storage methods, do not guarantee adequate support for the query conditions required for this type of analysis. Therefore, this paper has optimized the time encoding storage model based on real-world usage scenarios.

3.1.1. Multi-Level Time Encoding Data Storage Model

The time data format of the gathered trajectory points in this experiment is “year/month/day hour:minute: second”, which is not highly versatile for retrieval and computation. This paper introduces a data storage model based on multi-level time encoding to uniformly organize multi-scale time information and enhance the efficiency of multi-scale time calculations. First, within the MapReduce programming model, we compose the Mapper function to parse each cleaned data record and extract time information. Secondly, we divide the time hierarchy into ‘minutes,’ ‘hours,’ and ‘days,’ create a customized encoding formula, and utilize the timestamp to convert the extracted time information into multi-level time encoding in integer form. Then, we employ the encoded time as the key and the original data as the value and output this to the Reducer function. Thirdly, in the Reducer function, we aggregate and analyze the multi-level time encoding key-value output by the Mapper. Finally, we write the output from the Reducer function to the Hadoop Distributed File System (HDFS) for shared access within the cluster. The results of multi-level time encoding are continuous integers, which can be directly added or subtracted based on the retrieval time range to improve computational efficiency. Equations (1)–(3), respectively, present the time encoding corresponding to the units of “minutes”, “hours”, and “days”.
MinuteId = [TS / MINUTE_MIL]   (1)
HourId = [TS / HOUR_MIL]   (2)
DayId = [(HourId + GMT+8) / 24]   (3)
where TS denotes a timestamp, signifying the count of milliseconds since the commencement of 1 January 1970; MINUTE_MIL signifies the milliseconds within a single minute; HOUR_MIL signifies the milliseconds within an hour; GMT denotes Greenwich Mean Time, and GMT+8 represents the time zone of China; square brackets [ ] denote rounding down to the nearest integer. Taking the example of 15 October 2021, at 8 o’clock, the encoding is illustrated in Table 3.
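For illustration, a minimal Java sketch of Equations (1)–(3) and of the Mapper step described above is given below; the constant names, the CSV field layout, and the class names are assumptions made for the example rather than details taken from the paper.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch of the multi-level time encoding in Equations (1)-(3) and of the Mapper step.
// Constant names, the CSV field layout, and class names are illustrative.
public class TimeEncoder {

    private static final long MINUTE_MIL = 60_000L;     // milliseconds in one minute
    private static final long HOUR_MIL = 3_600_000L;    // milliseconds in one hour
    private static final long GMT8 = 8L;                // UTC+8 offset (in hours) used for day boundaries

    public static long minuteId(long ts) { return ts / MINUTE_MIL; }            // Equation (1)
    public static long hourId(long ts)   { return ts / HOUR_MIL; }              // Equation (2)
    public static long dayId(long ts)    { return (hourId(ts) + GMT8) / 24; }   // Equation (3)

    /** Mapper sketch: emits (minute-level encoding, original record) pairs for the Reducer to aggregate. */
    public static class EncodeMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // assumed CSV layout with the timestamp first
            long ts = Long.parseLong(fields[0]);
            context.write(new LongWritable(minuteId(ts)), value);
        }
    }

    public static void main(String[] args) {
        long ts = 1_634_256_000_000L;   // 15 October 2021, 08:00 in UTC+8
        System.out.println(minuteId(ts) + " / " + hourId(ts) + " / " + dayId(ts));
    }
}
```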
After defining the data table structure, data is stored in the database encoded according to different time units. Each time level’s data forms an individual data table. As per the structure of HBase, the data storage model comprises the Rowkey, TimeStamp, and Column Family. Given the distinctive nature of the data in this paper, records of multiple taxis’ trajectories at the same moment are present in the same table. Consequently, as the row key serves as a unique index for the data table, it cannot be utilized for storing time encoding. However, the column family, TAXIDATA, can dynamically expand as required and thus can accommodate the storage of time encoding. The timestamp, which records the time, serves to distinguish data versions. Hence, the initial storage model of encoded taxi trajectory point data is depicted in Table 4. Here, TIME_Code denotes the time encoding of trajectory points, while LAT and LNG represent the geographical coordinates of the trajectory points.

3.1.2. Optimization of the Data Storage Model

The taxi trajectory data simultaneously encompasses the trajectory point data generated by all currently operational taxis in Xiamen City. If one intends to retrieve all trajectory point data for a specific period, indexing each eligible data row based on time as a retrieval criterion is essential. This experiment has refined its storage structure to augment retrieval efficiency. It continues to utilize encoding in “minutes”, “hours”, and “days”. This approach enhances the efficiency of data retrieval by aggregating all trajectory points generated at the same instance into a singular cell unit. Since the time encoding is unique, it serves as the row key. The improved storage model is shown in Table 5. This model is capable of accommodating around 8400 data points within each cell set when adopting “minutes” as the unit, approximately 100,000 data points when adopting “hours” as the unit, and a remarkable 1.4 million data points when utilizing “days” as the unit.
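As a concrete illustration of the improved model, the following Java sketch (using the standard HBase client API) writes the trajectory points of one minute into a single row whose key is the minute-level time encoding and whose TAXIDATA column family holds the aggregated points, and then retrieves them with a single Get. The table name, column qualifier, and point serialization are assumptions for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the optimized storage model: row key = time encoding, with all points of
// that minute aggregated into one cell of the TAXIDATA column family.
// The table name "taxi_minute" and the qualifier "points" are illustrative.
public class MinuteLevelStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("taxi_minute"))) {

            long minuteId = 27_237_600L;                               // encoding of one minute
            String points = "24.4791,118.0894;24.4802,118.0901";      // LAT,LNG pairs of that minute

            // Write: one row per minute, points concatenated into a single cell.
            Put put = new Put(Bytes.toBytes(Long.toString(minuteId)));
            put.addColumn(Bytes.toBytes("TAXIDATA"), Bytes.toBytes("points"), Bytes.toBytes(points));
            table.put(put);

            // Read: a single Get retrieves every point generated in that minute.
            Result result = table.get(new Get(Bytes.toBytes(Long.toString(minuteId))));
            byte[] cell = result.getValue(Bytes.toBytes("TAXIDATA"), Bytes.toBytes("points"));
            System.out.println(Bytes.toString(cell));
        }
    }
}
```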

3.2. Visualization Method Based on Density Partitioning Clustering Algorithm

When expressing extensive traffic trajectory data visually, the sheer data volume often leads to suboptimal visual outcomes, impeding the discovery of the underlying patterns inherent in urban transportation. At the current stage, employing clustering algorithms to handle extensive datasets has emerged as the foremost technological approach. However, the constraints imposed by real-world geographical obstacles influence the clustering outcomes, so the results fail to reflect the characteristics of the data authentically. Hence, this section introduces a density-based partitioning clustering algorithm that incorporates road coefficients to circumvent architectural obstacles, aimed at better serving the visualization of extensive traffic trajectory data.

3.2.1. Construction of Density Partitioning Clustering Algorithm Model

To enhance the speed of generating maps and the visualization quality for massive trajectory data, mainstream clustering algorithms often adopt DBSCAN [46,47]. DBSCAN is one of the most classic density-based clustering algorithms. It clusters sample points based on the spatial distribution density of the dataset, making it capable of identifying clusters of arbitrary shapes within noisy data [48,49], and it exhibits excellent clustering performance. The similarity between traffic trajectory point data is closely related to the occlusion caused by urban roads and buildings. Because the DBSCAN clustering algorithm uses Euclidean distance [50,51] as the similarity measure, it ignores the influence of other data attributes on the distance measurement results. Hence, it has some errors when reflecting the similarity between data points, leading to decreased clustering quality. This is especially noticeable when clustering urban traffic trajectory data. Furthermore, DBSCAN’s clustering results are clusters of arbitrary shapes in space, essentially collections of individual geographic points. Typically, clustering algorithms employ the mean to ascertain the central point of a cluster, yet this deviates from the actual scenario. To address these issues, this paper introduces a density partitioning clustering algorithm based on an improved DBSCAN integrated with K-means [52,53], incorporating road coefficients to circumvent architectural obstacles. This algorithm addresses the shortcomings of the traditional DBSCAN clustering algorithm when calculating the similarity distance between data points. When two data points are not on the same road, it computes the shortest actual distance between them while bypassing buildings. Subsequently, the algorithm employs the K-means clustering method to determine the cluster centroids based on the clustered results. The algorithmic process is expounded as follows:
  • Initially, road vector data is encoded sequentially, ensuring the code remains consistent for the same road segment. The taxi trajectory point data is subsequently overlaid onto the road data, and each trajectory point that falls on a road is assigned the encoding of the corresponding road vector data. The road encoding matching the trajectory point X is noted as G(X), and the road encoding matching the trajectory point Y is noted as G(Y). Equation (4) employs a presence factor R to distinguish whether the two trajectory points X and Y are situated on the same road.
    R(X, Y) = { 0, if G(X) = G(Y); 1, if G(X) ≠ G(Y) }   (4)
  • When the road encoding G(X) is the same as G(Y), the factor R assumes a value of 0, indicating that the two trajectory points lie on the same road. Conversely, should there be a disparity, factor R takes a value of 1, denoting that the two trajectory points do not share the same road.
  • When R equals 0, the computation of the distance between the two data points employs the Euclidean distance formula. Conversely, when R equals 1, calculating the distance between the two data points necessitates bypassing the buildings to acquire the actual distance.
  • In the scenario where points X and Y do not lie upon the same road, the determination hinges upon whether line segment XY intersects with obstacles. If there is no intersection, the Euclidean distance is employed. However, if an intersection exists, it becomes imperative to compute the visible points of the buildings to ascertain the minimum distance of line segment XY. The method of calculating the visible points is as follows: the line segment XY intersects with buildings O1, O2, ...; Pi represents any vertex of these buildings. If the vertices on both sides of Pi are not situated on the opposite side of the line containing XPi, then Pi becomes an edge visible point of X.
  • By following step (4), the edge visible points for both X and Y can be determined, with XPi denoting an edge visible point of the trajectory point X and YPi denoting an edge visible point of the trajectory point Y. When there are coincident points Q1, Q2, ... between XPi and YPi, the points X and Y can be connected through such a point. Consequently, the real distance bypassing the building is Ddist = min({|XQ1| + |Q1Y|}, {|XQ2| + |Q2Y|}, ...).
  • If there are no coincident points, then for each point XPi in the set of visible points XP of point X, its visible point XPij is found. If the point XPij coincides with a point in YP, that point is recorded as H. Consequently, the actual distance bypassing the building is Ddist = min({|XXPi| + |XPiH| + |HY|}, ...).
  • After applying the abovementioned improvement to the DBSCAN distance calculation, the data are grouped into clusters. Then, the K-means clustering algorithm is applied by setting the parameter k = 1 to determine the clusters’ centroid coordinates O and their corresponding attribute values. The flowchart of the density-based partitioning clustering algorithm is shown in Figure 3. In this figure, ①, ②, and ③ respectively signify three distinct routes that exist between two data points, each navigating around the obstruction presented by the buildings. A condensed code sketch of the distance computation is given after this list.
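The sketch below condenses the distance measure of steps (1)–(6) into Java: the presence factor R of Equation (4) selects plain Euclidean distance when two points share a road or when the segment between them is unobstructed, and otherwise detours through a building vertex visible from both points (the coincident-point case of step (5); the two-hop case of step (6) is omitted for brevity). All class and method names are illustrative.

```java
import java.util.List;

// Condensed sketch of the road-coefficient similarity measure (steps 1-6); names are illustrative.
// Buildings are polygons given as arrays of {x, y} vertices.
public class RoadAwareDistance {

    public record Point(double x, double y, int roadId) {}   // roadId corresponds to G(X)

    /** Presence factor R of Equation (4): 0 if the two points share a road encoding, 1 otherwise. */
    public static int roadFactor(Point a, Point b) {
        return a.roadId() == b.roadId() ? 0 : 1;
    }

    public static double euclidean(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    /** Similarity distance: Euclidean on the same road or when unobstructed, building detour otherwise. */
    public static double distance(Point a, Point b, List<double[][]> buildings) {
        if (roadFactor(a, b) == 0 || !blocked(a, b, buildings)) {
            return euclidean(a, b);
        }
        // Simplified step (5): detour through a building vertex visible from both points.
        double best = Double.POSITIVE_INFINITY;
        for (double[][] polygon : buildings) {
            for (double[] v : polygon) {
                Point q = new Point(v[0], v[1], -1);
                if (!blocked(a, q, buildings) && !blocked(q, b, buildings)) {
                    best = Math.min(best, euclidean(a, q) + euclidean(q, b));
                }
            }
        }
        return best; // infinity signals that the two-hop detour of step (6) would be required
    }

    /** True if segment a-b properly crosses any edge of any building polygon. */
    private static boolean blocked(Point a, Point b, List<double[][]> buildings) {
        for (double[][] poly : buildings) {
            for (int i = 0; i < poly.length; i++) {
                double[] p = poly[i], q = poly[(i + 1) % poly.length];
                if (intersects(a.x(), a.y(), b.x(), b.y(), p[0], p[1], q[0], q[1])) {
                    return true;
                }
            }
        }
        return false;
    }

    private static boolean intersects(double ax, double ay, double bx, double by,
                                      double cx, double cy, double dx, double dy) {
        double d1 = cross(cx, cy, dx, dy, ax, ay);
        double d2 = cross(cx, cy, dx, dy, bx, by);
        double d3 = cross(ax, ay, bx, by, cx, cy);
        double d4 = cross(ax, ay, bx, by, dx, dy);
        return d1 * d2 < 0 && d3 * d4 < 0;        // strict signs: shared endpoints do not count
    }

    private static double cross(double ox, double oy, double px, double py, double qx, double qy) {
        return (px - ox) * (qy - oy) - (py - oy) * (qx - ox);
    }
}
```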

3.2.2. Heat Map Visualization Method

Employing the clustering algorithm proposed in this paper to process massive traffic trajectory data not only avoids the unrealistic data representation and inaccurate clustering results caused by building occlusion but also avoids a large amount of invalid data transmission, thus improving both the rendering speed and the visual quality of the visualization. This study takes heatmap visualization as an illustrative case and explores data visualization rooted in density-based partitioning clustering algorithms.
The heatmap [54,55,56] is a frequently employed data visualization technique illustrating the density distribution of data. Its intuitive portrayal of urban traffic trajectory data aids individuals in swiftly comprehending data distribution, uncovering latent patterns and regularities, thereby supporting decision-making and problem-solving efforts.
While crafting a heatmap [57,58,59], the initial step involves mapping data points within the screen’s visual range. This process gives rise to the dataset Pi, as delineated by Equation (5).
Pi = (Xi, Yi, Zi),   i = 1, 2, 3, …   (5)
In Equation (5), Xi and Yi represent the horizontal and vertical coordinates of data points on the screen, respectively, while Zi denotes the attribute value of the data point. Subsequently, the heatmap’s rendering radius r is established. Within the screen space, each data point is positioned within a square grid cell with a side length of r/2. The number of rows and columns in the data grid is determined by the screen coordinates of the data points, as outlined in Equation (6).
Row = 2yi / r,   Column = 2xi / r,   i = 1, 2, 3, …   (6)
In Equation (6), xi and yi represent the data points’ horizontal and vertical coordinates, while r represents the rendering radius. Ultimately, while creating the heatmap, it is imperative to conduct K-means clustering on the set of data points within each grid cell. Suppose the data point set within a specific cell is denoted as Pn, where n is the number of data points, and the data points have coordinates Xn and Yn, along with the attribute value Zn. The coordinates (X, Y) and attribute value Z for the center of each cell after clustering can be computed using Equation (7). With (X, Y) as the center and extending outward with a radius of r, transparent gradient circles are drawn, with their grayscale values contingent upon the Z value. Once all clustering points have been plotted, a grayscale image is formed, and the heatmap is obtained by assigning colors based on heat level thresholds. In this study, we utilize density-based partitioning clustering algorithms and modify data point attribute values to adjust grayscale values. When performing the heat map visualization, this reflects the actual distribution of the data after circumventing the buildings and dramatically improves the rendering effect of the data visualization. The complete data processing and visualization process is shown in Figure 4.
X = Σ XiZi / Σ Zi,   Y = Σ YiZi / Σ Zi,   Z = Σ Zi,   where the sums run over i = 1, 2, 3, …, n   (7)
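Computationally, Equations (6) and (7) amount to a grid assignment followed by a weighted-average reduction per cell, as the following Java sketch shows; the composite cell key and the class names are illustrative choices for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Equations (6)-(7): assign screen points to r/2 grid cells,
// then collapse each cell to a weighted center (X, Y) with accumulated weight Z.
public class HeatGrid {

    public record ScreenPoint(double x, double y, double z) {}   // P_i = (X_i, Y_i, Z_i)
    public record Center(double x, double y, double z) {}

    public static Map<Long, Center> aggregate(ScreenPoint[] points, double r) {
        Map<Long, double[]> acc = new HashMap<>();               // cell key -> {sum x*z, sum y*z, sum z}
        for (ScreenPoint p : points) {
            long row = (long) Math.floor(2 * p.y() / r);          // Equation (6)
            long col = (long) Math.floor(2 * p.x() / r);
            long key = row * 1_000_000L + col;                    // simple composite cell key for the sketch
            double[] a = acc.computeIfAbsent(key, k -> new double[3]);
            a[0] += p.x() * p.z();
            a[1] += p.y() * p.z();
            a[2] += p.z();
        }
        Map<Long, Center> centers = new HashMap<>();
        for (Map.Entry<Long, double[]> e : acc.entrySet()) {
            double[] a = e.getValue();                            // Equation (7)
            centers.put(e.getKey(), new Center(a[0] / a[2], a[1] / a[2], a[2]));
        }
        return centers;                                           // one gradient circle is drawn per center
    }
}
```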

4. Experimental Validation and Result Analysis

4.1. Experimental Environment

The distributed computing cluster consists of one Controller node and three Agent nodes. Each server is configured with an Intel Core i7-9700 CPU @ 3.00 GHz (octa-core), 16 GB of RAM, and CentOS 6.9 under the Linux platform. The Hadoop, JDK, HBase, and Zookeeper frameworks are deployed individually on each node, with the framework versions specified in Table 6.

4.2. Determination of Clustering Parameters and Evaluation of Clustering Results

The experiment utilizes taxi trajectory data from 31 May 2020 as a sample to ascertain the approximate range of the Eps value. Road-network-based distances bypassing buildings are employed as the similarity distances between data points, and the D-minimum distribution is utilized to ascertain the value of Eps. Given a dataset X(x1, x2, …, xk, …, xn), the similarity distances among the points are computed sequentially, and the distance from each point to its nearest neighbor is denoted as dmin. Subsequently, the dmin values of all points are arranged to obtain D(dmin1, dmin2, …, dmink, …, dminn). The distribution curve of D is displayed in Figure 5. The similarity distances within the dataset exhibit substantial variation, mostly ranging over [0.0016, 0.0035], and approximately 99.06% of trajectory points fall below the threshold of 0.0035. Consequently, the clustering Eps value for this experiment is defined within the range [0.0016, 0.0035].
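The D-minimum construction can be reproduced with a simple nearest-neighbour pass, sketched below in Java; plain Euclidean distance is used here for brevity, whereas the experiment uses the road-network-based distance bypassing buildings. Class and method names are illustrative.

```java
import java.util.Arrays;

// Sketch of the D-min distribution used to bound Eps: each point's nearest-neighbour
// distance is collected, sorted, and a high percentile is read off as the upper bound.
public class EpsEstimator {

    public static double[] dMin(double[][] pts) {               // pts[i] = {x, y}
        double[] d = new double[pts.length];
        for (int i = 0; i < pts.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < pts.length; j++) {
                if (i == j) continue;
                best = Math.min(best, Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]));
            }
            d[i] = best;
        }
        Arrays.sort(d);                                          // D(dmin1, ..., dminn)
        return d;
    }

    /** Distance below which the given fraction of points falls, e.g. coverage = 0.99. */
    public static double upperBound(double[] sortedDMin, double coverage) {
        int idx = (int) Math.floor(coverage * (sortedDMin.length - 1));
        return sortedDMin[idx];
    }
}
```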
Within the specified Eps range, the taxi trajectory data for 31 May 2020 undergoes testing to select the optimal minimum inclusion points (MinPts), using the silhouette coefficient as the evaluation criterion. Initially, the latitude and longitude of the test data are matched with the road network vector data. Next, the parameter range for the neighborhood radius Eps is set to [0.0016, 0.0035], with radius values input at intervals of 0.0001, and an integer in the range [50, 70] is chosen as the MinPts parameter; the silhouette coefficient serves as the criterion for assessing clustering quality. The clustering comparison outcomes of selected data, obtained over numerous experiments, are presented in Table 7.
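The parameter sweep described above can be expressed as a small grid search, sketched below; cluster(...) and silhouette(...) are placeholders for the improved DBSCAN step and a standard silhouette-coefficient implementation, not functions defined in the paper.

```java
// Sketch of the parameter sweep: Eps over [0.0016, 0.0035] in 0.0001 steps,
// MinPts over [50, 70], scored by the silhouette coefficient.
public class ParameterSearch {

    interface Clusterer { int[] cluster(double eps, int minPts); }     // returns a cluster label per point
    interface Scorer    { double silhouette(int[] labels); }

    /** Returns {best Eps, best MinPts, best silhouette score}. */
    public static double[] bestParameters(Clusterer clusterer, Scorer scorer) {
        double bestEps = 0, bestScore = Double.NEGATIVE_INFINITY;
        int bestMinPts = 0;
        for (double eps = 0.0016; eps <= 0.0035 + 1e-9; eps += 0.0001) {
            for (int minPts = 50; minPts <= 70; minPts++) {
                double score = scorer.silhouette(clusterer.cluster(eps, minPts));
                if (score > bestScore) {
                    bestScore = score;
                    bestEps = eps;
                    bestMinPts = minPts;
                }
            }
        }
        return new double[]{bestEps, bestMinPts, bestScore};   // the paper reports 0.0029 and 69 as optimal
    }
}
```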
As observed in the table, the silhouette coefficient of the improved hybrid clustering algorithm is significantly superior to that of the conventional algorithm. Hence, the density partitioning clustering algorithm proposed in this paper surpasses the traditional DBSCAN clustering algorithm based on Euclidean distance in terms of clustering accuracy. By testing the MinPts values within the Eps range, the optimal MinPts value corresponding to each Eps is selected using the silhouette coefficient. From Table 8, it can be deduced that, with Eps set to 0.0029 and MinPts set to 69, the silhouette coefficient attains its maximum, indicating the best clustering performance.

4.3. Comparative Analysis of Retrieval Speed

This paper introduces a data storage model based on multi-level time encoding. Based on the stored time coding in the column family, the initial storage model allows retrieval of all taxi trajectory point data within the same period. However, the time coding cannot serve as the row key of the data table to establish a unique index for more efficient retrieval of all taxi trajectory point data within the same period. Consequently, the original storage model presents limitations in this particular usage scenario. Thus, this experiment refines the storage structure of the original data storage model in harmony with practical application to enhance the retrieval efficiency of the storage model within this specific scenario.
To ascertain the enhanced retrieval efficiency of the improved data storage model, this paper conducts retrieval speed tests on both the enhanced and original models using different periods (minutes, hours, days) as the unit. Each test is repeated 100 times to calculate the average retrieval speed for each storage model. The results are depicted in the bar chart shown in Figure 6, indicating that the improved model outperforms the original model in retrieval speed. Furthermore, from the line graph in Figure 6, the following observations can be made: when extracting data in minutes, the retrieval speed of the improved storage model is 6.206 times faster than that of the original storage model; when extracting data in hours, the retrieval speed of the improved storage model is 12.475 times faster than that of the original storage model; and when extracting data in days, the retrieval speed of the improved storage model is 18.634 times faster than the original storage model. Hence, as the volume of retrieved data continues to increase, the retrieval efficiency of this storage model becomes increasingly superior to that of the original storage model.
Similarly, this paper conducted retrieval speed tests on the improved and space–time-coded storage models [19]. The experimental results are shown in Figure 7. When extracting data in minutes, the retrieval speed of the improved storage model is 1.151 times faster than that of the space–time-coded storage model; when extracting data in hours, the retrieval speed of the improved storage model is 1.585 times faster than that of the space–time-coded storage model; and when extracting data in days, the retrieval speed of the improved storage model is 1.962 times faster than the space–time-coded storage model. The results prove that, when retrieving data based on time information as a query condition, the improved storage model exhibits higher retrieval efficiency than the space–time-coded storage model. This is because utilizing time encoding allows for the rapid localization and extraction of data within a specified time range. In contrast, space–time coding considers spatial and time information, requiring a more complex calculation process during queries, which, to some extent, reduces retrieval efficiency. Therefore, time encoding is more direct and efficient in such a usage scenario.

4.4. Comparative Analysis of Heat Map Rendering Speed

The present experiment employs the optimal clustering parameters with an Eps value of 0.0029 and a MinPts value of 69. Utilizing a density-based partitioning clustering algorithm that incorporates road coefficients to circumvent architectural obstacles, clustering is performed on the taxi trajectory data from Xiamen City. Subsequently, a heatmap is generated based on the outcomes of this clustering analysis. We employ the refined data storage model to extract trajectory point data before and after clustering for visualization. After engaging in a sequence of 10 trials, the rendering efficiency of heatmaps is assessed across consistent zoom levels. The outcomes of these evaluations are illustrated in Figure 8. The comparative results show that the visualization rendering time for post-clustered trajectory data noticeably decreases. Compared with the non-clustered trajectory point data, the loading speed improves by approximately 40%. This improved loading speed eradicates conspicuous instances of map stuttering when panning and zooming. Consequently, it is discernible that the density-based partitioning clustering algorithm, which incorporates road coefficients to circumvent architectural obstacles, as proposed in this study, efficiently enhances the heatmap rendering rate of trajectory data by reducing data transmission volume. This innovation successfully addresses the previous challenges associated with prolonged map rendering times and diminished interactivity when visualizing extensive datasets.
The heatmaps generated from taxi trajectory point data before and after clustering are illustrated in Figure 9 and Figure 10, respectively. As is discernible from the figures, the heatmap generated prior to data clustering exhibits a rather disordered visual presentation, making it arduous to extract meaningful insights from the visualization results, whereas the heatmap of Figure 10, produced with the clustering algorithm improved in this paper, achieves a heightened visual refinement. This enhancement effectively circumvents the impact of architectural obstructions on the veracity of the clustered results derived from taxi trajectory data, and extricating the heat points from architectural impediments yields a more precise representation of the authentic data. The visual outcomes after clustering not only preserve the positional attributes of data points but also effectively address the heat kernel phenomenon caused by the abundance of local data points within the original data during visualization. This substantially enhances the heatmap’s visualization effectiveness, rendering a clear visual effect that is conducive to fully unearthing the latent geographical spatial information embedded within the extensive reservoir of trajectory data.

5. Conclusions and Outlook

Based on the trajectory data of taxis in Xiamen City, this paper constructs a Hadoop-distributed storage and computing platform. Leveraging the MapReduce programming model and HBase database, the research has performed the cleansing and distributed storage of taxi trajectory point data. This paper introduces a multi-level time encoding method and devises a trajectory big data storage model based on various time scales. This innovation addresses storage challenges arising from the non-uniformity of time formats and enhances the computational efficiency of multi-scale time. Furthermore, we have also refined the storage structure of the original storage model based on real-world usage scenarios to enhance data retrieval efficiency. Experimental results demonstrate that as the volume of retrieved data continues to increase, the retrieval efficiency of this storage model becomes increasingly superior to that of the original storage model. Moreover, it also outperforms the retrieval efficiency of the space–time-coded storage model. The introduction of this model has significantly improved the retrieval efficiency of the database and provided critical technology for information mining and visual presentation of trajectory big data. This model not only rectifies the shortcomings in utilizing time information within existing research but also flexibly optimizes data storage structures and holds potential applications in fields such as trajectory data analysis, IoT applications, data warehousing, and big data analytics. It is expected to enhance storage efficiency, accelerate data analysis, and drive innovation in related domains.
This paper also introduces a novel density-based partitioning clustering algorithm that incorporates road coefficients to circumvent architectural obstacles, yielding superior clustering results compared to traditional methods. By avoiding the impact of architectural obstacles on the authenticity of trajectory data clustering outcomes, this algorithm circumvents the transmission of numerous redundant data. (1) It notably enhances the rendering speed of taxi trajectory data heatmaps, consequently alleviating stuttering during map panning and zooming, which addresses prolonged map generation times and subpar interactivity stemming from the issue of excessive data volume. (2) Furthermore, it has the added capability to more authentically portray the data clustering effects within the actual geographical space. The outcomes post-clustering not only retain the positional attributes of data points but also address the heat kernel phenomenon arising from localized high data volume during visualization, substantially elevating the overall visualization impact. This algorithm fills a gap in existing research, whose clustering methods lacked generality for the transportation industry, and combines visual analysis methods to present clustering results intuitively. It can be crucial in urban transportation planning and management, urban vehicle scheduling and management, GIS applications, and more. It enables city decision-makers and planners to extract valuable geographic spatial information from a large volume of valid trajectory data, providing decisive support for urban decision-making and analysis.
However, this paper still has its limitations. Firstly, the approach employed in this study involves storing vast taxi trajectory data using the units of “minutes”, “hours”, and “days”. As such, the same data instances are stored three times, resulting in unnecessary storage consumption. Future research endeavors could explore optimizing storage space by adopting a unified storage unit. Secondly, while effective, the clustering algorithm proposed in this study exhibits relative complexity and demands significant computational resources. Subsequent investigations could focus on refining the algorithm based on the unique characteristics of data points for improved efficiency.

Author Contributions

Conceptualization, B.W. and J.Z.; methodology, B.W.; validation, B.W. and J.Z.; formal analysis, B.W.; investigation, J.Z. and C.H.; resources, J.Z., C.H. and Z.W.; data curation, C.H.; writing—original draft preparation, B.W.; writing—review and editing, J.Z.; visualization, B.W.; supervision, C.H. and Z.W.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant No. 42371416), and the Beijing University of Civil Engineering and Architecture 2023 Doctoral Postgraduate Research Ability Improvement Program (grant No. DG2023017 and DG2023018).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Luo, Q.; Shu, H.; Xu, Y.; Liu, W. Analysis of urban residents’ commuting activities based on mobile trajectory data support. J. Wuhan Univ. Inf. Sci. Ed. 2021, 46, 718–725. [Google Scholar] [CrossRef]
  2. Liang, S. Research on the Method and Application of MapReduce in Mobile Track Big Data Mining. Recent Adv. Electr. Electron. Eng. (Former. Recent Pat. Electr. Electron. Eng.) 2021, 14, 20–28. [Google Scholar] [CrossRef]
  3. Zheng, Y.; Chen, Y.; Li, Q.; Xie, X.; Ma, W.Y. Understanding transportation modes based on GPS data for web applications. ACM Trans. Web (TWEB) 2010, 4, 1–36. [Google Scholar] [CrossRef]
  4. Zhang, H.; Zhang, J.; Guo, X.; Lu, J.; Lu, H. Cloud storage and heatmap generation method for trajectory big data. Bull. Surv. Mapp. 2021, 146–149. [Google Scholar] [CrossRef]
  5. Bala, P. Introduction of Big Data with Analytics of Big Data; IGI Global: Hershey, PA, USA, 2021. [Google Scholar]
  6. Li, D.; Yao, Y.; Shao, Z. Big data in smart city. J. Wuhan Univ. (Inf. Sci. Ed.) 2014, 39, 631–640. [Google Scholar] [CrossRef]
  7. Gupta, P.; Mittal, P.K.; Gopal, G. Big Data: Problems, Challenges and Techniques. 2015. Available online: https://www.researchgate.net/publication/321134019_Big_Data_Problems_Challenges_and_Techniques (accessed on 18 October 2022).
  8. Jiang, S.; Li, C.; Wang, L.; Hu, Y.; Wang, C. LatentMap: Effective auto-encoding of density maps for spatiotemporal data visualizations. Graph. Vis. Comput. 2021, 4, 200019. [Google Scholar] [CrossRef]
  9. Zhang, H. Research on Trajectory Big Data Model and Visualization Method Based on Hadoop. Master’s Thesis, Beijing Architecture University, Beijing, China, 2021. [Google Scholar] [CrossRef]
  10. Jeyaraj, R.; Pugalendhi, G.; Paul, A. Hadoop Framework. In Big Data with Hadoop MapReduce; Apple Academic Press: New York, NY, USA, 2020. [Google Scholar]
  11. Xu, H. Research on mass monitoring data Retrieval Technology based on HBase. J. Phys. Conf. Ser. 2021, 1871, 012133. [Google Scholar] [CrossRef]
  12. Hughes, J.N.; Annex, A.; Eichelberger, C.N.; Fox, A.; Hulbert, A.; Ronquest, M. GeoMesa: A distributed architecture for spatio-temporal fusion. In Proceedings of the SPIE Defense + Security, Baltimore, MD, USA, 20–24 April 2015; Volume 94730F. [Google Scholar] [CrossRef]
  13. Alarabi, L.; Mokbel, M.F. A demonstration of st-hadoop: A mapreduce framework for big spatio-temporal data. Proc. VLDB Endow. 2017, 10, 1961–1964. [Google Scholar] [CrossRef]
  14. Bao, Y.; Huang, Z.; Gong, X.; Zhang, Y.; Yin, G.; Wang, H. Optimizing segmented trajectory data storage with HBase for improved spatio-temporal query efficiency. Int. J. Digit. Earth 2023, 16, 1124–1143. [Google Scholar] [CrossRef]
  15. Wang, K.; Liu, G.; Zhai, M.; Wang, Z.; Zhou, C. Building an efficient storage model of spatial-temporal information based on HBase. J. Spat. Sci. 2019, 64, 301–317. [Google Scholar] [CrossRef]
  16. He, Y.; Tan, H.; Luo, W.; Feng, S.; Fan, J. MR-DBSCAN: A scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci. 2014, 8, 83–99. [Google Scholar] [CrossRef]
  17. Xu, J.; Smith, T.J. Massive data storage and sharing algorithm in distributed heterogeneous environment. J. Intell. Fuzzy Syst. 2018, 35, 4017–4026. [Google Scholar] [CrossRef]
  18. Nishimura, S.; Das, S.; Agrawal, D.; El Abbadi, A. MD-HBase: Design and implementation of an elastic data infrastructure for cloud-scale location services. Distrib. Parallel Databases 2013, 31, 289–319. [Google Scholar] [CrossRef]
  19. Yao, Z.; Zhang, J.; Li, T.; Ding, Y. A trajectory big data storage model incorporating partitioning and spatio-temporal multidimensional hierarchical organization. ISPRS Int. J. Geo-Inf. 2022, 11, 621. [Google Scholar] [CrossRef]
  20. Dou, H.; Xu, B.; Shen, F.; Zhao, J. V-SOINN: A Topology Preserving Visualization Method for Multidimensional Data. Neurocomputing 2021, 449, 280–289. [Google Scholar] [CrossRef]
  21. Eadie, A.; Vásquez, I.C.; Liang, X.; Wang, X.; Souders, C.L., II; El Chehouri, J.; Hoskote, R.; Feswick, A.F.; Cowie, A.M.; Loughery, J.R.; et al. Transcriptome network data in larval zebrafish (Danio rerio) following exposure to the phenylpyrazole fipronil. Data Brief 2020, 33, 106413. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, Q.; Farahat, A.; Gupta, C.; Zheng, S. Deep Time Series Models for Scarce Data. Neurocomputing 2021, 456, 504–518. [Google Scholar] [CrossRef]
  23. Paspatis, I.; Tsohou, A.; Kokolakis, S. AppAware: A policy visualization model for mobile applications. Inf. Comput. Secur. 2020, 28, 116–132. [Google Scholar] [CrossRef]
  24. Keim, D.; Qu, H.; Ma, K.L. Big-Data Visualization. IEEE Comput. Graph. Appl. 2013, 33, 20–21. [Google Scholar] [CrossRef]
  25. Tang, J.; Liu, F.; Wang, Y.; Wang, H. Uncovering urban human mobility from large scale taxi GPS data. Phys. A Stat. Mech. Its Appl. 2015, 438, 140–153. [Google Scholar] [CrossRef]
  26. Huang, Z.; Gao, S.; Cai, C.; Zheng, H.; Pan, Z.; Li, W. A rapid density method for taxi passengers hot spot recognition and visualization based on DBSCAN+. Sci. Rep. 2021, 11, 9420. [Google Scholar] [CrossRef] [PubMed]
  27. Yu, D. A review of spatial clustering algorithms based on obstacle constraints. Comput. Syst. Appl. 2015, 24, 9–13. [Google Scholar]
  28. Wan, J.; Cui, M.; He, Y.; Li, S. Voronoi diagram-based clustering algorithm for uncertain data in obstacle space. Comput. Res. Dev. 2019, 56, 977–991. [Google Scholar]
  29. Tung, A.K.H.; Hou, J.; Han, J. Spatial clustering in the presence of obstacles. In Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2–6 April 2001; pp. 359–367. [Google Scholar]
  30. Ng, R.T. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th VLDB Conference, Santiago de Chile, Chile, 12–15 September 1994. [Google Scholar]
  31. Estivill-Castro, V.; Lee, I. Autoclust+: Automatic clustering of point-data sets in the presence of obstacles. In TSDM 2000: Temporal, Spatial, and Spatio-Temporal Data Mining, Proceedings of the International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining, Lyon, France, 12 September 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 133–146. [Google Scholar]
  32. Zaiane, O.R.; Lee, C.H. Clustering spatial data when facing physical constraints. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 737–740. [Google Scholar]
  33. Ester, M. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  34. Zhang, X.; Wang, J.; Wu, F.; Fan, Z.; Li, X. A Novel Spatial Clustering with Obstacles Constraints Based on Genetic Algorithms and K-Medoids. In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications, Jian, China, 16–18 October 2006; pp. 605–610. [Google Scholar] [CrossRef]
  35. Zhang, X.; Wu, J.; Si, H.; Yang, T.; Liu, Y. Spatial Clustering with Obstacles Constraints Using Ant Colony and Particle Swarm Optimization. In PAKDD 2007: Emerging Technologies in Knowledge Discovery and Data Mining, Proceedings of the International Conference on Emerging Technologies in Knowledge Discovery & Data Mining, Nanjing, China, 22–25 May 2007; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar] [CrossRef]
  36. Yang, T.; Zhang, X.; Liu, Y. A new algorithm for spatial clustering with obstacles by combining QPSO and K-Medoids. Electron. Des. Eng. 2011, 19, 74–77, 80. [Google Scholar] [CrossRef]
  37. Lv, J.; Zhang, Y. Research on the preprocessing technology of massive cab trajectory data under the support of Hadoop. Urban Surv. 2016, 4, 46–49. [Google Scholar] [CrossRef]
  38. Fu, Y.; Wu, Y.; Zhang, J.; Zheng, K.; Zhao, C.; Zheng, K.; Fang, F. MapReduce-based parallel partitioning algorithm for spatial data. Surv. Mapp. Bull. 2017, 11, 96–100. [Google Scholar] [CrossRef]
  39. Fairbanks, K.D. An analysis of Ext4 for digital forensics. Digit. Investig. 2012, 9, S118–S130. [Google Scholar] [CrossRef]
  40. Gilmore, W.J. MySQL Storage Engines and Datatypes. In Beginning PHP and MySQL: From Novice to Professional; Apress: New York, NY, USA, 2008; pp. 693–729. [Google Scholar]
  41. Tong, X.; Wang, R.; Wang, L.; Lai, G.; Ding, L. An effective multi-scale time period dissection method with integer coding calculation. J. Surv. Mapp. 2016, 45, 66–76. [Google Scholar]
  42. Zhang, J.; Liu, X.; Gang, W. Cache optimization for compressed databases in multiple storage environments. Comput. Appl. 2018, 38, 1404–1409, 1435. [Google Scholar]
  43. Zheng, H.; He, H.; Liu, G.; Zhao, B.; Ji, G.; Yu, Z. Research on storage method of spatio-temporal trajectory data. J. Nanjing Norm. Univ. (Nat. Sci. Ed.) 2017, 40, 38–44. [Google Scholar]
  44. Lei, Y. Vehicle Trajectory Data Management and Analysis Based on HBase. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2017. [Google Scholar]
  45. Chen, J.; Chu, L.; Xia, D. A MapReduce-based method for storing and querying vector spatial data. Comput. Digit. Eng. 2017, 45, 712–715, 719. [Google Scholar]
  46. Wu, Y. A review of clustering algorithms. Comput. Sci. 2015, 42, 491–499, 524. [Google Scholar]
  47. Han, L.Z.; Qian, X.Z.; Luo, J. DBSCAN multi-density clustering algorithm based on region partitioning. Comput. Appl. Res. 2018, 35, 1668–1671, 1685. [Google Scholar]
  48. Tian, C.; Yang, W.; Yang, D.; Wang, Y.; Sun, S. Based on K-Means and DBSCAN clustering algorithm according to the background of student behavior analysis and research based on comprehensive university data. Sci. Technol. Innov. 2020, 3, 86–88. [Google Scholar] [CrossRef]
  49. Wang, G.; Lin, G.Y. Improved adaptive parametric DBSCAN clustering algorithm. Comput. Eng. Appl. 2020, 56, 45–51. [Google Scholar]
  50. Yu, Z.-H.; Hao, H.-L.; Zhang, B.-C. Research on nondestructive detection of sprouted potato based on Euclidean distance. Agric. Mech. Res. 2015, 37, 174–177. [Google Scholar] [CrossRef]
  51. Wang, T.; Liu, W.; Liu, C. Optimization algorithm for black holes based on Euclidean distance. J. Shenyang Univ. Technol. 2016, 38, 201–205. [Google Scholar]
  52. Shen, Y.; Zhang, T.; Xu, J. Analysis of bus operating hours based on K-means clustering algorithm. Transp. Syst. Eng. Inf. 2014, 14, 87–93. [Google Scholar] [CrossRef]
  53. Guo, Y.; Zhang, X.; Liu, L.; Ding, L.; Niu, X. K-means clustering algorithm for optimizing initial clustering centers. Comput. Eng. Appl. 2020, 56, 172–178. [Google Scholar]
  54. Zhang, F.; Yuan, Z.; Xiao, F. Spark-based heatmap visualization method for big data. J. Comput. Aided Des. Graph. 2016, 28, 1881–1886. [Google Scholar]
  55. Luo, A.; Cai, D.; Li, Y.; Wang, Y. A real-time mapping method of thematic heat maps for mobile terminals. Surv. Mapp. Sci. 2016, 41, 179–183. [Google Scholar] [CrossRef]
  56. Zhang, L.; Yang, J.; Wang, G.; Zhang, L. A thermal map generation method with structural constraints for indoor spaces. J. Surv. Mapp. Sci. Technol. 2018, 35, 533–539. [Google Scholar]
  57. Yang, W.; Liu, J.; Wang, Y. Heatmap-based calculation method for spatial distribution of geographic objects. Surv. Mapp. Bull. 2012, 2012, 391–393, 398. [Google Scholar]
  58. Zhao, T.; Hua, Y.; Li, L.; Li, L.; Yang, F. A research on visual representation of geotagged data based on Heat Map. Surv. Mapp. Eng. 2016, 25, 28–32. [Google Scholar] [CrossRef]
  59. Yang, Z.; Li, L.; Yang, F. A heat map generation algorithm for millions of data. Surv. Mapp. Sci. 2018, 43, 85–89. [Google Scholar] [CrossRef]
Figure 1. Study area and spatial distribution map of taxi trajectory data.
Figure 2. Technological roadmap.
Figure 3. Flow chart of the density partitioning clustering algorithm.
Figure 4. Complete workflow of heat map visualization.
Figure 5. D-minimum distribution.
Figure 6. Comparison of retrieval speed between the original storage model and the improved storage model.
Figure 7. Comparison of retrieval speed between the space–time-coded storage model and the improved storage model.
Figure 8. Comparison of heat map rendering rates before and after clustering.
Figure 9. Visualization of the heat map before clustering.
Figure 10. Visualization of the heat map after clustering.
Table 1. Data basics.

Data Type                        Data Format    Data Volume
Taxi trajectory point data       csv            32 GB
Road network data                shp            20.5 MB
Building contour data            shp            17.9 MB
Boundary data of each district   shp            7.51 MB
Table 2. Structure and example of taxi trajectory data.

Field Name       Data Sample                         Description
TIME             31 May 2019                         Date
POSITION_TIME    23:31:00                            Time of day
LNG              118.023224                          Longitude
LAT              24.49147                            Latitude
CAR_NO           300bf55568114df822bed19e86e821e8    License plate number
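To make the field layout in Table 2 concrete, the following is a minimal sketch of loading raw trajectory points with pandas and assembling a single timestamp from the date and time-of-day fields. The file name, the date format string, and the dtype choices are illustrative assumptions; the paper does not specify its ingestion code.

```python
import pandas as pd

# Minimal sketch: load one day of raw taxi trajectory points.
# "taxi_20190531.csv" is a hypothetical file name; column names follow Table 2.
df = pd.read_csv(
    "taxi_20190531.csv",
    usecols=["TIME", "POSITION_TIME", "LNG", "LAT", "CAR_NO"],
    dtype={"LNG": float, "LAT": float, "CAR_NO": str},
)

# Combine the date and time-of-day fields into one timestamp for later encoding.
# The format string assumes dates like "31 May 2019"; real files may differ.
df["TIMESTAMP"] = pd.to_datetime(
    df["TIME"] + " " + df["POSITION_TIME"], format="%d %B %Y %H:%M:%S"
)
print(df.head())
```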
Table 3. Encoding example.

Time Level    Coding Rule                               Coding Result
Minute        Minutes from 0:00 on 1 January 1970       27,237,600
Hour          Hours from 0:00 on 1 January 1970         453,960
Day           Days from 1 January 1970                  18,915
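The three code levels in Table 3 are plain integer divisions of Unix time. The sketch below is an illustration of that arithmetic, not the paper's implementation; it also checks that the hour and day codes follow from the minute code by integer division, which matches the values in Table 3.

```python
from datetime import datetime, timezone

def time_codes(ts: datetime) -> dict:
    """Encode a timestamp at minute, hour, and day granularity as the number of
    whole units elapsed since 1970-01-01 00:00 UTC."""
    seconds = int(ts.replace(tzinfo=timezone.utc).timestamp())
    return {
        "minute": seconds // 60,
        "hour": seconds // 3600,
        "day": seconds // 86400,
    }

# Consistency check against Table 3: the three levels are related by integer division.
minute_code = 27_237_600
assert minute_code // 60 == 453_960          # hour code
assert minute_code // (60 * 24) == 18_915    # day code
```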
Table 4. Original storage model of taxi trajectory data at different time levels.

Row Key    TIMESTAMP    TAXIDATA
                        TIME_Code     LAT     LNG     CAR_NO
1          T1           TIME_Code1    LAT1    LNG1    CAR_NO1
2          T2           TIME_Code1    LAT2    LNG2    CAR_NO2
N          Tn           TIME_Coden    LATn    LNGn    CAR_NOn
Table 5. Optimized data storage model.

Row Key             TIMESTAMP    Column Family
                                 LAT                       LNG                       PROPERTIES
Min/Hour/Day/TS1    T1           {LAT1, LAT2, …, LATn}     {LNG1, LNG2, …, LNGn}
Min/Hour/Day/TS2    T2           {LAT1, LAT2, …, LATn}     {LNG1, LNG2, …, LNGn}
Min/Hour/Day/TS3
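As a concrete illustration of the optimized model in Table 5, the sketch below writes one aggregated row per time code, in contrast to the one-point-per-row layout of Table 4. It assumes an HBase Thrift gateway reachable through the happybase Python client; the host, table name, column qualifiers, and the "level/time code" reading of the row key are illustrative guesses rather than the paper's exact schema.

```python
import happybase

# Minimal sketch of writing one aggregated row in the optimized model:
# all points sharing a time code go under a single row key of the form
# "<level>/<time code>". Host, table, and qualifier names are illustrative.
connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
table = connection.table("trajectory")                  # hypothetical table name

def put_aggregated_row(level: str, time_code: int, lats, lngs):
    row_key = f"{level}/{time_code}".encode()           # e.g. b"Hour/453960"
    table.put(row_key, {
        b"LAT:points": ",".join(map(str, lats)).encode(),
        b"LNG:points": ",".join(map(str, lngs)).encode(),
    })

put_aggregated_row("Hour", 453960,
                   [24.49147, 24.49201], [118.023224, 118.024011])
```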
Table 6. Environment configuration of the experiment.

Framework    Version
Hadoop       2.7.6
JDK          1.8
HBase        2.1.9
Zookeeper    3.4.14
Table 7. Comparison of silhouette coefficients of different clustering methods.

Eps       Minpts    Traditional DBSCAN Algorithm    Improved Hybrid Clustering Algorithm
0.0029    50        0.624551212                     0.815335636
0.0029    51        0.641254035                     0.824145512
0.0029    52        0.691455214                     0.822155423
0.0029    53        0.742684135                     0.832145142
0.0029    54        0.712442351                     0.841223244
0.0029    55        0.713623125                     0.845512341
0.0029    56        0.754215315                     0.845215214
0.0029    57        0.792145221                     0.861542341
0.0029    59        0.792532131                     0.854136422
…         …         …                               …
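For context on Table 7, the sketch below shows one simplified way to combine DBSCAN with K-means, refining oversized density clusters, and to score the result with the silhouette coefficient. It is not the paper's improved hybrid algorithm, which additionally incorporates road coefficients and obstacle avoidance; the cluster-size threshold used here is an arbitrary illustrative parameter.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

def dbscan_then_kmeans(points, eps, minpts, max_cluster_size=2000):
    """Simplified illustration (not the paper's algorithm): run DBSCAN first,
    then split any oversized density cluster into sub-clusters with K-means."""
    labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(points)
    refined = labels.copy()
    next_label = labels.max() + 1
    for cluster_id in np.unique(labels):
        if cluster_id == -1:
            continue  # leave noise points untouched
        idx = np.where(labels == cluster_id)[0]
        if len(idx) > max_cluster_size:
            k = int(np.ceil(len(idx) / max_cluster_size))
            sub = KMeans(n_clusters=k, n_init=10).fit_predict(points[idx])
            refined[idx] = next_label + sub
            next_label += k
    return refined

def silhouette_without_noise(points, labels):
    """Silhouette coefficient over the non-noise points, as compared in Table 7."""
    mask = labels != -1
    if len(np.unique(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(points[mask], labels[mask])
```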
Table 8. Comparison of silhouette coefficients for different values of clustering parameters.

Eps       Minpts    Silhouette Coefficient
0.0016    62        0.814642156
0.0017    64        0.754129534
0.0018    63        0.765243894
0.0019    62        0.814536452
0.0020    62        0.824153594
0.0021    63        0.834658912
0.0022    63        0.845245821
0.0023    64        0.865147238
0.0024    64        0.846185723
0.0025    64        0.754124534
0.0026    67        0.812210354
0.0027    66        0.854221521
0.0028    67        0.842545722
0.0029    69        0.898514235
0.0030    69        0.894125612
0.0031    69        0.874456124
0.0032    69        0.845625317
0.0033    65        0.671452362
0.0034    66        0.685422435
0.0035    66        0.714524585
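The parameter sweep behind Table 8 can be outlined as a simple grid search over Eps and Minpts. The sketch below uses scikit-learn's standard DBSCAN and silhouette_score, excluding noise points before scoring; the parameter ranges are taken from Table 8, but the exact evaluation protocol used in the paper may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def grid_search_dbscan(points, eps_values, minpts_values):
    """Sweep (Eps, Minpts) combinations and report the silhouette coefficient
    of each DBSCAN result, skipping degenerate clusterings."""
    results = []
    for eps in eps_values:
        for minpts in minpts_values:
            labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(points)
            mask = labels != -1                      # drop noise points
            if len(np.unique(labels[mask])) < 2:
                continue                             # silhouette needs >= 2 clusters
            results.append((eps, minpts,
                            silhouette_score(points[mask], labels[mask])))
    return sorted(results, key=lambda r: r[2], reverse=True)

# Parameter ranges mirroring Table 8 (Eps in degrees, Minpts as point counts).
eps_grid = np.round(np.arange(0.0016, 0.0036, 0.0001), 4)
minpts_grid = range(62, 70)
# best_settings = grid_search_dbscan(points, eps_grid, minpts_grid)[:5]
```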