1. Introduction
Yunnan Province, with its complex geological and topographical diversity, is notably prone to geologic hazards such as landslides, mudslides, and avalanches. These recurrent geohazards pose significant threats to human safety, social stability, and sustainable economic progress. As of 2020, the province had recorded 23,267 geohazards, posing risks to approximately 3,780,400 people and causing an estimated CNY 79.673 billion in property damage. Particularly during the “14th Five-Year Plan” period, accelerated urbanization and infrastructure development have magnified the impact of anthropogenic activities on the geological environment. Moreover, climate anomalies and frequent earthquakes compound the prevailing problem of geohazards [1]. The effective monitoring, analysis, and prevention of geologic disasters require an efficient spatio-temporal index. Such an index is crucial for real-time monitoring and early warning of geologic disasters, since it allows quick analysis, an understanding of how geological events evolve, and the implementation of the emergency measures needed to protect lives and property. It also aids in managing and querying data on geological phenomena, changes, and trends, thereby offering valuable support for decision-making processes.
Spatial indexes (e.g., quadtrees [2], KD-trees [3], R-trees [4], and grid indexes [5]) are the key to efficient retrieval and storage of spatial data [6,7,8]. In recent years, researchers have proposed a large number of spatial indexing techniques and methods. Among them, the predominant dynamic spatial indexing structure in current use is the R-tree, originally proposed by Guttman, along with its numerous variants [9,10,11,12,13,14,15,16]. These include the VoR-tree, proposed by Mehdi Sharifzadeh, which integrates Voronoi diagrams into the R-tree to enable efficient nearest-neighbor querying; the LAZY R-tree, proposed by Y. Yang, which adds a delayed splitting method to the R-tree construction process to improve indexing efficiency; and the Grid-R-tree, contributed by Poonam Goyal, which merges the R-tree with a grid to serve the querying requirements of diverse data mining algorithms.
The R-tree is a fully dynamic index structure derived from the B-tree. However, because the R-tree represents both the spatial objects in the index and the nodes at each level by their MBRs (Minimum Bounding Rectangles), rectangles can easily overlap, which forces a query to follow multiple paths [17]. In addition, the space utilization of the R-tree’s leaf nodes is low, so the space within the nodes cannot be fully used [18]. For this reason, Kamel et al. [19] proposed the Hilbert-R tree, which uses the Hilbert curve to encode and order spatial objects before computing the MBRs. This reduces the overlap rate and improves the query efficiency for spatial data, but performance degrades when the spatial data are not uniformly distributed.
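The effect of MBR overlap on query cost can be illustrated with a minimal sketch. It is not taken from any of the cited implementations; the names `MBR`, `mbr_of`, `intersects`, and `nodes_to_visit` are hypothetical, and a flat list of leaf rectangles stands in for a real R-tree level.

```python
from typing import List, NamedTuple, Sequence, Tuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def mbr_of(points: Sequence[Tuple[float, float]]) -> MBR:
    """Smallest axis-aligned rectangle that encloses all points."""
    xs, ys = zip(*points)
    return MBR(min(xs), min(ys), max(xs), max(ys))


def intersects(a: MBR, b: MBR) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


def nodes_to_visit(node_mbrs: Sequence[MBR], query: MBR) -> List[int]:
    """Indices of the nodes whose MBR intersects the query window."""
    return [i for i, m in enumerate(node_mbrs) if intersects(m, query)]


# Two leaf nodes whose rectangles overlap in the region [3, 4] x [3, 4].
leaf_a = mbr_of([(0, 0), (4, 4)])
leaf_b = mbr_of([(3, 3), (8, 8)])

# A small query window inside the overlap must descend into both leaves.
visited = nodes_to_visit([leaf_a, leaf_b], MBR(3.2, 3.2, 3.8, 3.8))
```

In this toy case a query that falls in the overlapping region has to inspect both leaves, which is the multi-path behavior that overlap-reducing constructions aim to limit.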
To address the R-tree’s limited ability to handle unevenly distributed data, various scholars have begun to combine tree-based spatial indexing techniques with clustering methods. Among them, Liu et al. [20] proposed a K-means-based technique for generating a static R-tree. By exploiting the characteristics of clustering, they increased the data similarity within nodes and reduced the similarity between nodes, thus diminishing the overlap of Minimum Bounding Rectangles (MBRs). Wang et al. [21] proposed constructing an R-tree based on the K-medoids algorithm, which compensates for the K-means algorithm’s susceptibility to noise points in spatial data and promotes data compactness. Jiang et al. [22] proposed a Hilbert-R tree structure based on Gaussian Mixture Model (GMM) clustering. By using GMM to preprocess the spatial data, they achieved high intra-cluster similarity and low inter-cluster similarity, ensuring that neighboring data points reside in the same leaf node while reducing the MBR overlap rate. To handle voluminous geological data, Yu-Hang Zhang [23] integrated a deep clustering algorithm into the construction of a Hilbert-R tree, creating an efficient data indexing structure. Huan Cheng [24] sought to speed up the storage of unevenly distributed data and the construction of indexes for large datasets. To achieve this, he enhanced the K-means clustering algorithm, producing the CUK, and coupled it with a stacked long short-term memory (LSTM) model, thereby improving the Hilbert-R tree.
Geohazard monitoring typically deals with unevenly distributed spatial data. Although a Hilbert-R tree built with the help of clustering algorithms indexes such data quickly, it faces several challenges in real-time monitoring and early warning applications. First, geohazard data are dynamic and must be updated in real time; the Hilbert-R tree handles static data well but offers limited support for real-time updates of large volumes of data. Second, its indexing ability is largely confined to the spatial dimension, so it cannot satisfy multidimensional queries, particularly those involving the temporal dimension. Third, the sheer volume of geohazard monitoring data calls for efficient processing and storage of large-scale data. To address these problems, this paper proposes the BCHR-index, a spatio-temporal indexing model based on the stream clustering algorithm CluStream, with the following contributions:
- (1) To overcome the limitation of traditional spatial indexing, which excludes the temporal dimension, we add a B+ tree that indexes the temporal dimension, thereby enabling multidimensional spatio-temporal queries;
- (2) We use the micro-clusters generated by the CluStream algorithm in the stream processing stage. In combination with the B+ tree, we construct in-memory indexes that meet the needs of real-time geohazard data stream monitoring and enable a rapid response during the warning process;
- (3) We use a Hilbert-R tree enhanced with the CluStream data stream clustering algorithm to preprocess multidimensional spatial data. This strategy reduces the areas of node MBRs and their similarity, thus avoiding excessive overlap between MBRs and unnecessary multi-path retrieval during queries;
- (4) Using HBase, the open-source column-oriented database within the Hadoop big data processing framework, we achieve efficient storage of geohazard data.
The remainder of the paper is structured as follows: Section 2 introduces the overall model architecture and the structure of the BCHR tree. Section 3 explains the implementation of the Hilbert-R tree, based on the CluStream clustering algorithm and the multidimensional range query algorithm. In Section 4, we conduct relevant experiments on the model. Section 5 concludes the study, discussing the limitations and suggesting new directions for future work.
2. Model Overview
Figure 1 shows the model architecture realized in this paper, the BCHR-index, which contains three main parts: the client, the index layer, and the storage layer. The client is mainly responsible for initiating requests and accepting responses and is not directly involved in data storage or processing. It continuously outputs real-time streaming data, which are sent to both the index layer and the storage layer for processing, and it sends query requests to the index layer. Before transmitting data, the client converts the time information of each record into a Unix timestamp and the spatial coordinates into a Hilbert code to facilitate the subsequent construction of the index.
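The client-side transformation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this paper: the function names (`hilbert_code`, `to_cell`, `encode_record`), the grid order, and the bounding box are hypothetical, and the Hilbert mapping is the standard iterative cell-to-index conversion on a 2^order x 2^order grid.

```python
from datetime import datetime, timezone


def hilbert_code(order: int, x: int, y: int) -> int:
    """Hilbert index of grid cell (x, y) on a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def to_cell(lon, lat, bounds, order):
    """Quantize a coordinate into grid cell indices (bounds are assumed)."""
    min_lon, min_lat, max_lon, max_lat = bounds
    n = 1 << order
    cx = int((lon - min_lon) / (max_lon - min_lon) * n)
    cy = int((lat - min_lat) / (max_lat - min_lat) * n)
    # Clamp so that points on the upper boundary fall into the last cell.
    return min(n - 1, max(0, cx)), min(n - 1, max(0, cy))


def encode_record(lon, lat, when, bounds, order=16):
    """Return (Hilbert code, Unix timestamp) for one monitoring record.

    `when` is expected to be a timezone-aware datetime.
    """
    cx, cy = to_cell(lon, lat, bounds, order)
    return hilbert_code(order, cx, cy), int(when.timestamp())


# Hypothetical example: a 4 x 4 grid over the unit-free box [0, 4] x [0, 4].
code, ts = encode_record(
    3.5, 0.5,
    datetime(1970, 1, 1, 0, 1, tzinfo=timezone.utc),
    bounds=(0.0, 0.0, 4.0, 4.0),
    order=2,
)
```

A higher `order` gives finer spatial resolution at the cost of longer codes; the value that suits a given monitoring network is a tuning decision that this sketch does not settle.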
The storage layer is responsible for storing the voluminous geohazard data in the HBase database. We store historical geohazard data in the underlying HDFS and load data for high-incidence geohazard points into the Block Cache. For the continuous influx of real-time streaming data from the client, the client-side caching function of HBase is used to buffer the data in memory and write them to HBase in batches at fixed intervals. This approach improves the response speed for incoming real-time geohazard streaming data and the efficiency of data writing, while reducing the frequency of index updates. HDFS keeps multiple copies of each piece of data on different nodes, so that even if one node fails, the data can be recovered quickly from copies on other nodes, which ensures business continuity and preserves the integrity of the geohazard data. Moreover, as the data volume grows, the existing application architecture does not need significant changes: adding more server nodes to the HBase cluster expands the system’s storage and processing capacity, which allows it to cope with the storage, access, and read/write requirements of large volumes of geohazard monitoring data.
The indexing layer, the BCHR tree, is mainly responsible for executing the query operations requested by the client. It works separately from HBase and consists of two parts: an index of the current data and an index of the historical data.
Figure 2 shows the principle and framework of the indexing layer. Its four sub-structures play different roles, described as follows:
- (1) The Hilbert-R tree serves as the principal component for facilitating spatio-temporal queries, executing spatial dimension queries based on the spatial coordinates of the objects under consideration.
- (2) The CluStream algorithm processes spatial objects in the leaf nodes of a Hilbert-R tree. This technique streamlines the clustering of spatial datasets and minimizes node overlap, as well as dead space.
- (3) The B+ tree is employed for indexing the time dimension of the BCHR tree, enabling the filtering of temporal information during spatio-temporal queries.
- (4) The Rowkey of HBase is designed for data querying, and is a composite of the Hilbert code and Unix timestamp in this study. Utilizing the query results from both the B+ tree and the Hilbert-R tree, the Rowkey can directly pinpoint the location of data in HBase and identify the data needed to meet the query parameters.
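One possible byte layout for such a composite Rowkey is sketched below. The layout is an assumption made for illustration (this paper does not specify field widths or byte order): both fields are packed as fixed-width, big-endian unsigned 64-bit integers, Hilbert code first, so that the byte-wise lexicographic order in which HBase sorts row keys agrees with the numeric order of (Hilbert code, timestamp). The helper names are hypothetical, and no HBase client API is called.

```python
import struct

# Assumed layout: 8-byte Hilbert code followed by 8-byte Unix timestamp,
# both big-endian and unsigned, giving a 16-byte row key.
ROWKEY_FORMAT = ">QQ"


def make_rowkey(hilbert_code: int, unix_ts: int) -> bytes:
    """Pack (Hilbert code, Unix timestamp) into a fixed-width row key."""
    return struct.pack(ROWKEY_FORMAT, hilbert_code, unix_ts)


def parse_rowkey(rowkey: bytes):
    """Recover (Hilbert code, Unix timestamp) from a row key."""
    return struct.unpack(ROWKEY_FORMAT, rowkey)


def scan_bounds(hilbert_code: int, t_start: int, t_end: int):
    """Start (inclusive) and stop (exclusive) keys for one Hilbert cell
    over the closed time interval [t_start, t_end]."""
    return make_rowkey(hilbert_code, t_start), make_rowkey(hilbert_code, t_end + 1)


key = make_rowkey(15, 1_700_000_000)
start_key, stop_key = scan_bounds(15, 1_700_000_000, 1_700_003_600)
```

Placing the Hilbert code first keeps records from the same cell contiguous in the key space; placing the timestamp first would instead favor time-ordered scans, so the field order is a design choice that depends on the dominant query pattern.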
In dealing with geohazard data, the traditional R-tree is ill equipped to support multidimensional spatio-temporal queries or to respond quickly to real-time geohazard monitoring data. To address this, this paper defines a time threshold, T1. If the timestamp of incoming data is less than T1, the data are kept in memory and regarded as ‘current data’; conversely, when the timestamp exceeds T1, the data are written to HBase and treated as ‘historical data’. For current data, the indexing layer retrieves the data cached in memory, whereas for historical data, it queries the data stored in HBase. This strategy notably reduces the maintenance overhead of the Hilbert-R tree. As shown in Figure 3, during query processing, the temporal dimension is first filtered using the B+ tree, which narrows the query range to the relevant data region. The spatial dimension is then queried via the Hilbert-R tree. The final result combines the B+ tree and Hilbert-R tree queries and yields the Rowkey of each matching spatial object, which locates its information stored in HBase.
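The two-stage filtering order described above (time first, then space) can be sketched with simple stand-ins. This is only an illustration under stated assumptions: a sorted array searched with `bisect` replaces the B+ tree, a linear rectangle test replaces the Hilbert-R tree, and the names `TemporalIndex`, `spatial_filter`, and `spatio_temporal_query` are hypothetical.

```python
from bisect import bisect_left, bisect_right


class TemporalIndex:
    """Sorted-array stand-in for the B+ tree over Unix timestamps."""

    def __init__(self, records):
        # Each record is (timestamp, x, y, rowkey).
        self._records = sorted(records, key=lambda r: r[0])
        self._keys = [r[0] for r in self._records]

    def range(self, t_start, t_end):
        """Records with t_start <= timestamp <= t_end."""
        lo = bisect_left(self._keys, t_start)
        hi = bisect_right(self._keys, t_end)
        return self._records[lo:hi]


def spatial_filter(candidates, window):
    """Linear stand-in for the spatial search: keep points in the window."""
    x_min, y_min, x_max, y_max = window
    return [r for r in candidates
            if x_min <= r[1] <= x_max and y_min <= r[2] <= y_max]


def spatio_temporal_query(index, t_start, t_end, window):
    """Filter by time, then by space, and return the matching row keys."""
    candidates = index.range(t_start, t_end)
    return [r[3] for r in spatial_filter(candidates, window)]


# Hypothetical records: (timestamp, x, y, rowkey).
index = TemporalIndex([
    (400, 1.2, 1.2, "d"),
    (100, 1.0, 1.0, "a"),
    (300, 1.5, 1.5, "c"),
    (200, 5.0, 5.0, "b"),
])
result = spatio_temporal_query(index, 100, 300, (0.0, 0.0, 2.0, 2.0))
```

Filtering on time first shrinks the candidate set before the spatial test runs, which mirrors the order shown in Figure 3; the returned row keys would then be used to fetch the full records from storage.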
5. Conclusions and Outlook
Confronted with the real-time generation of massive geological disaster data, there is a pressing need for an efficient real-time stream data processing framework that satisfies the rapid-response demands of real-time monitoring and early warning of geological disasters. As one of the most widely used spatial index structures, the R-tree performs well on static data, yet it struggles with streaming data and does not flexibly support temporal indexing. Consequently, this study proposes a spatio-temporal index model based on a data stream clustering algorithm, the BCHR-index, to meet the requirement for multidimensional spatio-temporal queries of geological disaster data. The BCHR-index model exploits the properties of the stream clustering algorithm CluStream and employs a real-time/offline two-tier processing framework paired with a B+ tree to construct the BCHR tree, partitioning data processing into real-time and offline stages. Because the data volume of the real-time stream is small, the micro-clusters generated by CluStream can be used to construct indices in real time, enabling nearly instantaneous responses to geological streaming data. The offline phase builds a Hilbert-R tree from spatial data processed with the clustering algorithm, using the cluster centers as leaf nodes. This maintains the continuity and integrity of the spatial distribution of geological disasters and improves spatial query efficiency during monitoring. Even when dealing with unevenly distributed geological disaster data, the model achieves millisecond-level response times. Given the sheer volume of geological disaster monitoring data, the model uses HBase for storage, which provides a degree of fault tolerance and scalability. However, the model can still be improved. In future work, (1) we plan to further enhance the real-time indexing: despite the millisecond-level responses of the index presented in this study, each real-time query requires the index to be rebuilt, which adds some time overhead. (2) We will explore the best way to select the K value. The CluStream algorithm uses the K-means method to generate clusters, and when using CluStream to process the leaf nodes of the BCHR tree, this paper sets K to 1% of the dataset size; however, the most suitable K value differs across geohazard types and occurrence areas, so the selection of K needs to be more flexible. (3) Finally, future investigations will look to enhance the robustness of our system, ensuring rapid and accurate responses in emergencies to protect people’s lives and property.